go-ruby-regexp

Onigmo in pure Go: Ruby's regexp engine, with the features RE2 leaves out, no cgo.

pure Go · zero cgo · backtracking VM · backreferences · lookahead / lookbehind · possessive / atomic · subexpression calls · \p{…} · POSIX classes · UTF-8 / ASCII-8BIT · ReDoS-guarded · optimizer prefilters · Ruby-compatible
Documentation (MkDocs Material + mike) · License: BSD-3-Clause · Go 1.26.4+ · Engine roadmap: Phases 0–4 complete

go-ruby-regexp is a pure-Go (no cgo) reimplementation of Onigmo, the regular-expression engine used by Ruby. Go's standard regexp package follows the RE2 design: it runs in linear time, but it has no backreferences or lookaround and its match semantics differ, so a byte-compatible Ruby regexp needs a backtracking engine. go-ruby-regexp is that engine: a faithful backtracking VM, hardened against catastrophic backtracking with memoization and a deterministic time/step budget. It is standalone and reusable, and it is the regexp backend for go-embedded-ruby. The engine roadmap (Phases 0–4) is complete. It covers backreferences, lookaround, possessive/atomic quantifiers, recursive subexpression calls, \p{…} and POSIX classes, rune-level /i folding, UTF-8 / ASCII-8BIT encodings, and a transparent optimizer prefilter. The engine is differential-tested against MRI, has 100% test coverage, and CI is green across 6 architectures.
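To make the gap concrete, the following sketch asks Go's standard regexp package to compile a few patterns that use the features listed above. It uses only the standard library and does not touch the go-ruby-regexp API; the helper name stdlibAccepts is made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// stdlibAccepts reports whether Go's standard regexp package (RE2 syntax)
// can compile the given pattern. The helper name is illustrative only.
func stdlibAccepts(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(a+)b`,       // plain capture group: valid RE2 syntax
		`(a+)\1`,      // backreference
		`foo(?=bar)`,  // lookahead
		`(?<=foo)bar`, // lookbehind
		`(?>a+)b`,     // atomic group
		`a++b`,        // possessive quantifier
	}
	for _, p := range patterns {
		fmt.Printf("%-14q accepted by regexp: %v\n", p, stdlibAccepts(p))
	}
}
```

Only the first pattern compiles; the rest are rejected at compile time, which is why a separate backtracking engine is needed for Ruby compatibility.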

Phase 0: Scanner, parser & VM (ready)

An Onigmo syntax scanner and parser that produce an AST, a bytecode compiler, and a backtracking VM. This phase covers literals, classes, . * + ? {m,n}, groups, alternation, and anchors, with captures.
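As a rough picture of what a bytecode backtracking VM does, here is a toy version with three instructions and a hand-compiled program for a+b. It is a sketch of the general technique only: the instruction set, type names, and function names are assumptions made for this example and are not taken from the go-ruby-regexp source.

```go
package main

import "fmt"

// opcode is the instruction kind of this toy VM (illustrative names).
type opcode int

const (
	opChar  opcode = iota // match one byte, then advance
	opSplit               // try branch x first, fall back to branch y
	opMatch               // report success
)

type inst struct {
	op   opcode
	c    byte // operand for opChar
	x, y int  // branch targets for opSplit
}

// run executes prog on input from program counter pc and string position sp.
// A failed first branch of opSplit "backtracks" by returning to the caller,
// which then continues with the second branch at the same position.
func run(prog []inst, input string, pc, sp int) bool {
	for {
		switch in := prog[pc]; in.op {
		case opChar:
			if sp >= len(input) || input[sp] != in.c {
				return false
			}
			pc, sp = pc+1, sp+1
		case opSplit:
			if run(prog, input, in.x, sp) {
				return true
			}
			pc = in.y
		case opMatch:
			return true
		}
	}
}

// aPlusB is a hand-compiled program for the pattern a+b.
var aPlusB = []inst{
	{op: opChar, c: 'a'},      // 0: a
	{op: opSplit, x: 0, y: 2}, // 1: greedy loop: prefer another a
	{op: opChar, c: 'b'},      // 2: b
	{op: opMatch},             // 3: done
}

func main() {
	for _, s := range []string{"aaab", "ab", "b", "aac"} {
		fmt.Printf("%q matches a+b at position 0: %v\n", s, run(aPlusB, s, 0, 0))
	}
}
```

The toy only answers whether the pattern matches at position 0; captures, anchors, and classes would each add instructions and per-thread state.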

Phase 1: Groups & quantifier modes (ready)

Named groups (?<name>…), backreferences \1 / \k<name>, and every quantifier mode: greedy, lazy *? +? ??, possessive *+ ++ ?+, and atomic groups (?>…).
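The difference between the greedy and lazy modes can be shown with Go's standard regexp package, which supports both. Possessive quantifiers and atomic groups are not accepted by the standard library, so they are not demonstrated here. The helper name firstMatch is made up for this sketch, and nothing below uses the go-ruby-regexp API.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch returns the leftmost match of pattern in s,
// using Go's standard regexp package.
func firstMatch(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

func main() {
	const s = "<a><b>"
	// Greedy: .+ takes as much as it can while still allowing a match.
	fmt.Printf("%q\n", firstMatch(`<.+>`, s)) // "<a><b>"
	// Lazy: .+? takes as little as it can.
	fmt.Printf("%q\n", firstMatch(`<.+?>`, s)) // "<a>"
}
```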

Phase 2: Lookaround & calls (ready)

Lookahead (?=…) (?!…), fixed/bounded-width lookbehind (?<=…) (?<!…), the \G anchor, and recursive subexpression calls \g<name> / \g<0>.

Phase 3: Unicode & encodings (ready)

Unicode properties \p{…}, POSIX bracket classes [[:alpha:]], \h / \H, \R, rune-level /i case folding, inline flags (?imx), and UTF-8 / ASCII-8BIT multi-encoding with multibyte class members [é] / [à-ï].

Phase 4: ReDoS hardening & optimizer (ready)

(pc, sp) memoization, a step budget, a recursion cap, and a wall-clock WithTimeout; a start-position / interior-literal prefilter (up to ~210× faster); a lazy-NFA + cached-DFA fast path that beats C Onigmo on literal/alternation/structured scans and brings inner loops to within ~1.6–5× of C (the email benchmark is roughly on par with RE2); and a benchmark suite.
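The two core guards, (pc, sp) memoization and a step budget, can be sketched on a toy backtracking VM. This is an illustration of the general technique under simplifying assumptions, not the engine's implementation: every type and function name below is hypothetical, the toy has no captures or backreferences (which is what makes this simple form of memoization sound), and the recursion cap and WithTimeout are not modeled.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

type opcode int

const (
	opChar  opcode = iota // match one byte
	opSplit               // try x, then y
	opMatch               // success
)

type inst struct {
	op   opcode
	c    byte
	x, y int
}

var errBudget = errors.New("step budget exceeded")

type vm struct {
	prog   []inst
	input  string
	memo   []bool // visited (pc, sp) states; nil disables memoization
	steps  int
	budget int
}

func (m *vm) run(pc, sp int) (bool, error) {
	for {
		if m.steps++; m.steps > m.budget {
			return false, errBudget // deterministic cut-off
		}
		if m.memo != nil {
			key := pc*(len(m.input)+1) + sp
			if m.memo[key] {
				return false, nil // state already explored without success
			}
			m.memo[key] = true
		}
		switch in := m.prog[pc]; in.op {
		case opChar:
			if sp >= len(m.input) || m.input[sp] != in.c {
				return false, nil
			}
			pc, sp = pc+1, sp+1
		case opSplit:
			if ok, err := m.run(in.x, sp); ok || err != nil {
				return ok, err
			}
			pc = in.y
		case opMatch:
			return true, nil
		}
	}
}

// match runs prog at position 0 and also returns the number of steps used.
func match(prog []inst, input string, budget int, memoize bool) (bool, int, error) {
	m := &vm{prog: prog, input: input, budget: budget}
	if memoize {
		m.memo = make([]bool, len(prog)*(len(input)+1))
	}
	ok, err := m.run(0, 0)
	return ok, m.steps, err
}

// nested is a hand-compiled program for (a+)+b, a classic ReDoS pattern.
var nested = []inst{
	{op: opChar, c: 'a'},      // 0: a
	{op: opSplit, x: 0, y: 2}, // 1: inner + loop
	{op: opSplit, x: 0, y: 3}, // 2: outer + loop
	{op: opChar, c: 'b'},      // 3: b
	{op: opMatch},             // 4: done
}

// hostile has no trailing b, so every way of splitting the a's must fail.
var hostile = strings.Repeat("a", 30)

func main() {
	ok, steps, err := match(nested, hostile, 100000, false)
	fmt.Println("without memo:", ok, steps, err)
	ok, steps, err = match(nested, hostile, 100000, true)
	fmt.Println("with memo:   ", ok, steps, err)
}
```

Without memoization the search space doubles with each input byte, so the run is stopped by the budget. With memoization each (pc, sp) state is explored at most once, so the work is bounded by the program length times the input length and the run finishes with a definite "no match".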

Phase 5: Ruby surface (planned)

Downstream, not part of this engine module: the full Ruby Regexp / MatchData surface and replacement DSL live in the go-embedded-ruby adapter that consumes this engine.

go-ruby-regexp is a faithful backtracking VM in pure Go with cgo disabled, so it cross-compiles and embeds anywhere. It implements the Onigmo features RE2 omits (backreferences, lookaround, possessive quantifiers, atomic groups, named groups, subexpression calls) with Ruby's leftmost-first semantics, and it is hardened against ReDoS with memoization and a deterministic budget (as in Ruby ≥ 3.2). It is validated differentially against Onigmo/MRI. It is a standalone, reusable module and the regexp backend for the sibling org github.com/go-embedded-ruby.
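Leftmost-first means that, among matches starting at the leftmost possible position, the one found by trying alternatives in the order written wins, even if a later alternative would match more text. Go's standard library happens to expose both this rule and the POSIX leftmost-longest rule, which makes the distinction easy to see. The sketch below uses only the standard regexp package; the helper name firstVsLongest is made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstVsLongest matches pattern against s under both disambiguation rules:
// leftmost-first (regexp.MustCompile) and leftmost-longest (regexp.MustCompilePOSIX).
func firstVsLongest(pattern, s string) (first, longest string) {
	first = regexp.MustCompile(pattern).FindString(s)
	longest = regexp.MustCompilePOSIX(pattern).FindString(s)
	return first, longest
}

func main() {
	first, longest := firstVsLongest(`a|ab`, "ab")
	fmt.Printf("leftmost-first:   %q\n", first)   // "a"
	fmt.Printf("leftmost-longest: %q\n", longest) // "ab"
}
```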