Syntax & parser¶
The syntax package is the front of the pipeline: a scanner turns the
pattern text into tokens, and a parser builds an abstract syntax tree in
Onigmo's grammar. The AST is what the compiler lowers to
bytecode, so the parser's job is to capture every construct Ruby's regexps can
express — faithfully, including the cases RE2 omits.
Scanner¶
The scanner reads the pattern as bytes under a declared encoding (UTF-8 or
ASCII-8BIT) and emits tokens for literals, metacharacters, escape sequences,
group openers, quantifiers, and class brackets. Encoding matters here: a
character class or a \p{…} property is interpreted in terms of the pattern's
encoding, which the Encoding-keyed cursor in the
compile package selects.
The Onigmo grammar surface¶
The parser accepts the Onigmo construct set, which is considerably larger than RE2's. The main groups of syntax it must recognize:
Escapes and anchors¶
- Standard escapes and shorthand classes:
\d \w \sand negations. - Ruby anchors and escapes:
\A(start of string),\z(end of string),\Z(end, before a trailing newline),\G(match-start anchor),\h/\H(hex digit / non),\R(any line break).
Groups¶
- Named capture:
(?<name>…). - Non-capturing:
(?:…). - Atomic:
(?>…)— match the subpattern, then discard its internal backtrack points. - Inline flag scopes:
(?i:…),(?m)…, etc.
Quantifiers¶
- Greedy:
* + ? {m,n}— match as much as possible, give back on failure. - Lazy (non-greedy):
*? +? ?? {m,n}?— match as little as possible, take more on failure. - Possessive:
*+ ++ ?+— match greedily and never give back. (A trailing+on a{m,n}brace is not possessive but a stacked greedy repeat(a{m,n})+, matching Onigmo.)
Character classes and properties¶
- Bracketed classes
[...]with ranges and negation[^...]. - POSIX classes inside brackets:
[[:alpha:]],[[:digit:]], … (the 14 standard classes, positive and negated). - Unicode properties:
\p{L},\p{Nd},\P{…}— resolved by thecharsetpackage. A deliberate slice of the general categories (L N P S Z C), theLu Ll Lt Lm Lo Ndsubcategories, and the Onigmo POSIX-style aliases (Alpha Alnum Digit Space Upper Lower Word) is supported; script and block names (\p{Han}, …) and the one-letter\pLform are out of scope (see the Roadmap).
Flags¶
- Pattern-level options (case-insensitive, multiline, extended) and their inline scoped forms, threaded through the AST so the compiler can specialize matching.
Output¶
The parser produces a typed AST: concatenation, alternation, repetition (with greediness and possessiveness), captures (indexed and named), assertions (lookaround, anchors), and character-set nodes. Anything malformed is reported as a parse error rather than silently accepted — fuzzing the parser for crashes is part of the test plan (see Contributing).