Skip to content

Roadmap (phases)

go-ruby-regexp/regexp is grown test-first, one capability at a time, each phase differential-tested against Onigmo/MRI rather than built in isolation. There are six phases, 0 through 5. The standalone engine roadmap (Phases 0–4) is complete; Phase 5 (the Ruby surface) is downstream in the go-embedded-ruby adapter that consumes this engine, not part of this module.

Phase Name Goal Status
0 Scanner + parser + VM Scanner/parser for the common subset (literals, classes, . * + ? {m,n}, groups, alternation, anchors ^ $ \A \z), compiler, and a backtracking VM. Exit: anchored/greedy matching with captures matches MRI on a starter corpus. Done
1 Groups & quantifier modes Named groups (?<name>…), backreferences \1 / \k<name>, and every quantifier mode — greedy, lazy *? +? ?? {m,n}?, possessive *+ ++ ?+, and atomic groups (?>…). Done
2 Lookaround & calls Lookahead (?=…) (?!…), fixed/bounded-width lookbehind (?<=…) (?<!…), the \G anchor, and recursive subexpression calls \g<…> (\g<name> / \g<n> / \g<±n> / \g<0>). Done
3 Unicode & encodings \p{…} Unicode properties, POSIX bracket classes [[:alpha:]], \h / \H, \R, rune-level /i case folding, inline flags (?imx), and UTF-8 / ASCII-8BIT multi-encoding with multibyte class members [é] / [à-ï]. Done
4 ReDoS hardening & optimizer (pc, sp) memoization, a deterministic step budget, a recursion-depth cap, and a wall-clock WithTimeout; a transparent start-position / required-interior-literal prefilter (up to ~210× faster); a lazy-NFA + cached-DFA fast path that beats C Onigmo on literal/alternation/structured scans and pulls the inner loops to ~1.6–5× of C (email ≈ RE2); a benchmark suite. Done
5 Ruby surface The full Ruby Regexp/MatchData surface via the go-embedded-ruby adapter, and the replacement DSL (\1, \k<>, \&, blocks). Downstream — lives in the adapter, not this module. Downstream

Documented out-of-scope boundaries

These are deliberate divergences from MRI/Onigmo, recorded so the engine's surface is unambiguous:

  • Full/special case folding — only simple (1:1) folding is done; multi-char expansions (ßss) and locale rules (Turkish dotless-ı/İ) are out of scope, and backreference folding is ASCII-only.
  • \p{…} is a deliberate slice of categories/aliases (L N P S Z C, the Lu Ll Lt Lm Lo Nd subcategories, and Alpha Alnum Digit Space Upper Lower Word); script and block names (\p{Han}, …) and the one-letter \pL form are not accepted.
  • Encodings beyond UTF-8 / ASCII-8BIT (UTF-16/32, EUC-JP, Shift_JIS, …) are out of scope; a caller transcodes legacy/wide text to UTF-8 at the boundary.
  • Match offsets are byte offsets, whereas MRI reports character offsets — the two agree on matched text but not on the numeric span on multi-byte input.
  • Invalid UTF-8 decodes leniently (replacement rune, width 1) rather than raising as MRI does.
  • Variable-width lookbehind is rejected (a constant or bounded per-alternative width is required), matching Onigmo.

See the Architecture overview for the pipeline these phases build out, and ReDoS hardening for the Phase 4 threat model.