Roadmap (phases)¶
go-ruby-regexp/regexp is grown test-first, one capability at a time, each phase
differential-tested against Onigmo/MRI rather than built in isolation. There are
six phases, 0 through 5. The standalone engine roadmap (Phases 0–4) is
complete; Phase 5 (the Ruby surface) is downstream in the go-embedded-ruby
adapter that consumes this engine, not part of this module.
| Phase | Name | Goal | Status |
|---|---|---|---|
| 0 | Scanner + parser + VM | Scanner/parser for the common subset (literals, classes, . * + ? {m,n}, groups, alternation, anchors ^ $ \A \z), compiler, and a backtracking VM. Exit: anchored/greedy matching with captures matches MRI on a starter corpus. |
Done |
| 1 | Groups & quantifier modes | Named groups (?<name>…), backreferences \1 / \k<name>, and every quantifier mode — greedy, lazy *? +? ?? {m,n}?, possessive *+ ++ ?+, and atomic groups (?>…). |
Done |
| 2 | Lookaround & calls | Lookahead (?=…) (?!…), fixed/bounded-width lookbehind (?<=…) (?<!…), the \G anchor, and recursive subexpression calls \g<…> (\g<name> / \g<n> / \g<±n> / \g<0>). |
Done |
| 3 | Unicode & encodings | \p{…} Unicode properties, POSIX bracket classes [[:alpha:]], \h / \H, \R, rune-level /i case folding, inline flags (?imx), and UTF-8 / ASCII-8BIT multi-encoding with multibyte class members [é] / [à-ï]. |
Done |
| 4 | ReDoS hardening & optimizer | (pc, sp) memoization, a deterministic step budget, a recursion-depth cap, and a wall-clock WithTimeout; a transparent start-position / required-interior-literal prefilter (up to ~210× faster); a lazy-NFA + cached-DFA fast path that beats C Onigmo on literal/alternation/structured scans and pulls the inner loops to ~1.6–5× of C (email ≈ RE2); a benchmark suite. |
Done |
| 5 | Ruby surface | The full Ruby Regexp/MatchData surface via the go-embedded-ruby adapter, and the replacement DSL (\1, \k<>, \&, blocks). Downstream — lives in the adapter, not this module. |
Downstream |
Documented out-of-scope boundaries¶
These are deliberate divergences from MRI/Onigmo, recorded so the engine's surface is unambiguous:
- Full/special case folding — only simple (1:1) folding is done; multi-char
expansions (
ß→ss) and locale rules (Turkish dotless-ı/İ) are out of scope, and backreference folding is ASCII-only. \p{…}is a deliberate slice of categories/aliases (L N P S Z C, theLu Ll Lt Lm Lo Ndsubcategories, andAlpha Alnum Digit Space Upper Lower Word); script and block names (\p{Han}, …) and the one-letter\pLform are not accepted.- Encodings beyond UTF-8 / ASCII-8BIT (UTF-16/32, EUC-JP, Shift_JIS, …) are out of scope; a caller transcodes legacy/wide text to UTF-8 at the boundary.
- Match offsets are byte offsets, whereas MRI reports character offsets — the two agree on matched text but not on the numeric span on multi-byte input.
- Invalid UTF-8 decodes leniently (replacement rune, width 1) rather than raising as MRI does.
- Variable-width lookbehind is rejected (a constant or bounded per-alternative width is required), matching Onigmo.
See the Architecture overview for the pipeline these phases build out, and ReDoS hardening for the Phase 4 threat model.