Architecture overview

go-ruby-regexp/regexp is a compiler and a virtual machine. A pattern is parsed into an abstract syntax tree (AST), the AST is lowered to a bytecode program, and a backtracking VM executes that program against the input to produce MatchData. The model is Onigmo's, a backtracking matcher; see Why a backtracking engine for the reasoning. The backtracker is the source of truth for every feature and for submatch extraction. A lazy-NFA + cached-DFA fast path (RE2-style) is layered over it for is-match queries on patterns free of captures, backreferences, and lookaround. On that subset, matching now runs within ~1.6–5× of C Onigmo and beats it outright on literal, alternation, and structured scans (see Performance).

The pipeline

pattern (string, encoding, flags)
   │  scanner / parser  → AST (Onigmo syntax)
   │  compiler          → bytecode program (opcodes for the backtracking VM)
   │  optimizer         → anchors, first-byte sets, literal prefixes, atomic cuts
program  ──►  VM (backtracking, memoized, budgeted)        ──►  MatchData
        └──►  lazy-NFA + cached-DFA fast path (is-match,    ──┘
              capture/backref/lookaround-free subset)

The two engines agree byte-for-byte. On the subset it supports, the DFA fast path answers whether a match exists (and its bounds); the backtracking VM extracts the actual submatches and handles every feature outside that subset.
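The division of labor between the two engines can be pictured as a small dispatch decision. The sketch below is illustrative only: the `Features` type, the `choose` function, and the `Engine` names are hypothetical and are not taken from the engine's source. It assumes the only inputs to the decision are whether submatches are requested and whether the pattern uses captures, backreferences, or lookaround.

```go
package main

import "fmt"

// Features records which constructs a compiled pattern uses.
// Hypothetical type for illustration; not the engine's real metadata.
type Features struct {
	Captures   bool
	Backrefs   bool
	Lookaround bool
}

// Engine names the matcher selected for a request (illustrative labels).
type Engine string

const (
	FastPath    Engine = "lazy-NFA + cached-DFA"
	Backtracker Engine = "backtracking VM"
)

// choose picks the fast path only for an is-match query on a pattern
// free of captures, backreferences, and lookaround. Everything else
// goes to the backtracker, which remains the source of truth.
func choose(f Features, wantSubmatches bool) Engine {
	if wantSubmatches || f.Captures || f.Backrefs || f.Lookaround {
		return Backtracker
	}
	return FastPath
}

func main() {
	fmt.Println(choose(Features{}, false))               // plain is-match query
	fmt.Println(choose(Features{Backrefs: true}, false)) // backreference present
	fmt.Println(choose(Features{}, true))                // submatches requested
}
```

Because both engines must agree on every input, a dispatcher of this kind affects only speed, never the result.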

Each stage has a single responsibility:

  • scanner / parser turns the pattern text into an AST in Onigmo's grammar.
  • compiler lowers the AST into a bytecode program plus capture/group metadata.
  • optimizer annotates the program — anchors, first-byte sets, literal prefixes, atomic cuts — to skip impossible work.
  • VM executes the program with a backtrack stack, memoization, and a step budget, producing MatchData.
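To make the stages concrete, here is a deliberately tiny sketch of the same shape: a toy AST is lowered to bytecode, and a backtracking loop runs it with an explicit backtrack stack and a step budget. Every name here (`Lit`, `Plus`, `Cat`, `Inst`, `Compile`, `Run`, `ErrBudget`) is invented for illustration. The real engine's opcodes, metadata, memoization, and optimizer are not shown, and matching is anchored at offset 0 for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy AST: a literal byte, one-or-more repetition, and concatenation.
type Node interface{}
type Lit byte
type Plus struct{ Sub Node }
type Cat []Node

type Op int

const (
	OpChar  Op = iota // consume one byte equal to C
	OpSplit           // try X first; if that fails, resume at Y
	OpMatch           // report success
)

type Inst struct {
	Op   Op
	C    byte
	X, Y int
}

// lower appends the bytecode for n to prog.
func lower(n Node, prog []Inst) []Inst {
	switch n := n.(type) {
	case Lit:
		prog = append(prog, Inst{Op: OpChar, C: byte(n)})
	case Cat:
		for _, sub := range n {
			prog = lower(sub, prog)
		}
	case Plus:
		start := len(prog)
		prog = lower(n.Sub, prog)
		// Greedy loop: prefer jumping back to start, else fall through.
		prog = append(prog, Inst{Op: OpSplit, X: start, Y: len(prog) + 1})
	}
	return prog
}

// Compile lowers an AST and terminates the program with OpMatch.
func Compile(n Node) []Inst { return append(lower(n, nil), Inst{Op: OpMatch}) }

var ErrBudget = errors.New("step budget exhausted")

type frame struct{ pc, pos int }

// Run reports whether prog matches input starting at offset 0,
// giving up with ErrBudget after budget instructions.
func Run(prog []Inst, input string, budget int) (bool, error) {
	stack := []frame{{0, 0}}
	for len(stack) > 0 {
		f := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		pc, pos := f.pc, f.pos
	thread:
		for {
			budget--
			if budget < 0 {
				return false, ErrBudget
			}
			in := prog[pc]
			switch in.Op {
			case OpChar:
				if pos >= len(input) || input[pos] != in.C {
					break thread // dead end: backtrack to the last saved frame
				}
				pc, pos = pc+1, pos+1
			case OpSplit:
				stack = append(stack, frame{in.Y, pos}) // save the alternative
				pc = in.X
			case OpMatch:
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	prog := Compile(Cat{Plus{Lit('a')}, Lit('b')}) // the pattern a+b
	fmt.Println(Run(prog, "aaab", 1000))
	fmt.Println(Run(prog, "aac", 1000))
	fmt.Println(Run(prog, "aaab", 3)) // budget too small to finish
}
```

The budget check sits at the top of the instruction loop, so runaway backtracking ends with an error instead of consuming unbounded time.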

Packages

The engine (github.com/go-ruby-regexp/regexp) is organized as a chain of small packages mirroring the pipeline, plus the public API:

| Package | Responsibility |
| --- | --- |
| internal/syntax | scanner + parser → AST; Onigmo grammar and escapes |
| internal/ast | the typed AST node set the parser produces and the compiler consumes |
| internal/compile | AST → VM program (instructions + capture/group metadata), and the Encoding-keyed cursor (UTF-8 / ASCII-8BIT) |
| internal/vm | backtracking matcher: thread state, backtrack stack, memo, step/recursion budget, wall-clock timeout, the start-position / interior-literal prefilters, and the lazy-NFA + cached-DFA fast path (dfa.go / dfa_run.go) for the matchable is-match subset |
| internal/charset | \p{…} Unicode property classification |
| regexp.go | public API: Compile / CompileEnc, Match / MatchString, WithTimeout, Encoding, and MatchData (spans by index and name) |
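The start-position prefilter listed for internal/vm can be pictured with a minimal sketch. It assumes the optimizer has already established that every match must begin with a known literal; the `candidateStarts` helper is hypothetical and only shows the idea of skipping offsets where no match can start, not how the engine implements it.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateStarts returns the byte offsets where a match could begin,
// given that every match must start with the literal prefix.
// Hypothetical helper for illustration only.
func candidateStarts(input, prefix string) []int {
	var starts []int
	for from := 0; from <= len(input); {
		i := strings.Index(input[from:], prefix)
		if i < 0 {
			break // no further occurrence: the VM never needs to run again
		}
		starts = append(starts, from+i)
		from += i + 1 // resume the scan just past this candidate
	}
	return starts
}

func main() {
	// Only two offsets are worth handing to the matcher.
	fmt.Println(candidateStarts("xxfooyfoo", "foo"))
}
```

The matcher then runs only at the returned offsets instead of at every position in the input.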

The detail pages cover the load-bearing pieces: Syntax & parser, the Backtracking VM, and ReDoS hardening.

Relationship to go-embedded-ruby

The engine is standalone: it has no dependency on the Ruby runtime, and any Go program can import github.com/go-ruby-regexp/regexp directly. The dependency runs one way only — go-embedded-ruby uses this engine as its regexp backend. A thin adapter in go-embedded-ruby/ruby/internal/regexp maps Ruby's Regexp and MatchData objects onto the engine's API, so byte offsets, named captures, and replacement semantics line up with what Ruby exposes.
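One thing an adapter of this kind has to reconcile is units: Ruby's MatchData#begin and MatchData#end report character offsets, while MatchData#byteoffset reports byte offsets. The sketch below assumes the engine hands back byte spans and shows one way to derive character indices from them. The `charIndex` helper is hypothetical and is not taken from go-embedded-ruby.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charIndex converts a byte offset into a UTF-8 string to a character
// (code point) index by counting the runes that precede it.
// Hypothetical helper; byteOff must fall on a character boundary.
func charIndex(s string, byteOff int) int {
	return utf8.RuneCountInString(s[:byteOff])
}

func main() {
	s := "héllo wörld"
	// Assume the engine reported the byte span [7, 13) for "wörld".
	start, end := 7, 13
	fmt.Println(s[start:end])
	fmt.Println(charIndex(s, start), charIndex(s, end))
}
```

Counting runes from the start of the string is linear in the offset; an adapter that converts many spans over the same input might cache the mapping instead.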