Implementing A Regular Expression Engine With Nfa Simulation And Lookahead Optimizations

2023-09-30 · Leonardo Benicio
Summary

A comprehensive technical exploration of implementing a regular expression engine with nfa simulation and lookahead optimizations, covering key concepts, practical implementations, and real-world applications.

Contents

The Spellbook is Empty: Demystifying the Regex Engine

We take it for granted, this ancient incantation. We type ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ into a validation function, or a frantic \d{3}-\d{2}-\d{4} into an IDE search bar, and we expect magic. And magic, in the truest sense of the word, is what we get. A black box—a gnostic engine hidden deep within our programming language of choice—consumes this arcane string of symbols and returns to us a binary verdict: match or no match.

But what if the spellbook were empty? What if you had to write the magic yourself?

For most developers, a Regular Expression (regex) engine is a tool, a utility, a library call. It’s a hammer, and we are the carpenters. We never think about the forge that made the hammer, the physics of its swing, or the metallurgy that ensures it doesn’t shatter on the first nail. But for the systems programmer, the language designer, the database architect, or the curious computer scientist, the hammer is not a tool; it is the subject. Understanding its creation is the path to wielding it with god-like precision and, more importantly, avoiding a catastrophic shattering of your application’s performance.

This blog post is a journey into that forge. We are going to build a regular expression engine from first principles, not just as an academic exercise, but as a practical exploration of two powerful concepts: Nondeterministic Finite Automaton (NFA) simulation and the critical optimizations required for lookahead assertions. Why does this matter? Because the most popular regex engines in the world—the ones that power Python, JavaScript, Perl, and countless other tools—are not purely theoretical constructs. They are engineering marvels built on the very foundations we are about to lay. They are, at their core, highly optimized NFA simulators.

By the end of this deep dive, you will not only be able to implement a working regex engine capable of matching patterns with alternation, concatenation, repetition, groups, and even lookaheads, but you will also understand why certain regex patterns bring your application to its knees. You will see the hidden state explosions, the invisible backtracking that turns O(n) into O(2^n), and the architectural choices that separate safe engines (like RE2) from explosive ones (like PCRE).

So, strap in. We are building a hammer from scratch—from the ore of formal language theory to the finished tool.


1. From Incantations to Automata: A Formal Foundation

Before we write a single line of code, we need to understand what a regular expression is mathematically. A regular expression defines a regular language — a set of strings over a finite alphabet that can be recognized by a finite automaton. This is the bedrock of automata theory, established by Kleene in the 1950s.

1.1 Regular Languages and Operations

A regular language is built from three fundamental operations:

  • Concatenation: ab matches 'a' followed by 'b'.
  • Alternation (Union): a|b matches either 'a' or 'b'.
  • Kleene Star: a* matches zero or more repetitions of 'a'.

Add parentheses for grouping and you have the core of most regex syntax. Additional features like +, ?, [abc], and . are syntactic sugar: a+ is aa*, [abc] is (a|b|c), etc. Even character classes like \d are just shorthand for [0-9].

1.2 Finite Automata: The Recognition Machines

A finite automaton is a state machine with a finite number of states. It reads an input string one character at a time, transitioning between states based on the current character. If, after consuming the entire string, the automaton is in an accepting state, the string is a match.

There are two flavors:

  • Deterministic Finite Automaton (DFA): For each state and input symbol, there is exactly one transition. It is fast (O(n) time for string length n) but can be exponentially large in the number of states compared to the regex.
  • Nondeterministic Finite Automaton (NFA): For a given state and input symbol, there may be multiple transitions, or transitions on empty input (epsilon). It can “guess” the correct path. It is compact (linear in the size of the regex) but naive simulation is O(k^n) if you try all paths. However, with clever simulation using subset construction, you can run an NFA in O(n²) or O(n·states) time – essentially trading off time for space.

Most production regex engines use NFA simulation (like PCRE) or compile to a DFA (like RE2, but with limitations). The trade-off is important: NFA simulation can support backreferences and lookaheads, which are not regular languages. DFA cannot handle them without giving up the linear time guarantee.

1.3 Why Not Build a DFA from the Start?

A DFA for a regex like (a|b)*abb is only four states. But consider (a|b|c|...|z)* — a DFA for any pattern with many alternations and stars can blow up exponentially. For example, the regex (a|b)*a(a|b) has an equivalent DFA with 4 states, but (a|b|c)*a(a|b|c) has 7 states. In general, for an alternation of k symbols, the DFA size can be O(2^k). For real-world regex patterns (e.g., email validation), the DFA would be enormous. Therefore, most engines start with an NFA and simulate it.

Now we must build the simulation machinery.


2. Constructing an NFA from a Regex: Thompson’s Construction

To simulate an NFA, we first need to build one from the regex string. The classic algorithm is Thompson’s construction (1968). It builds an NFA by recursively composing fragments.

2.1 The Building Blocks

We represent an NFA as a directed graph where:

  • Each state has a set of outgoing edges.
  • Edges can be labeled with an input character (e.g., 'a') or epsilon (ε).
  • Exactly one start state and one accepting state (no multiple accept states).

We’ll define a structure NfaFragment:

class NfaFragment {
  State* start;
  State* accept;
}

And a State:

class State {
  list of Transition transitions; // each transition is either Char(c) or Epsilon
  bool is_accept;
}

Now, Thompson’s construction is recursive over the regex abstract syntax tree (AST). We’ll handle the core operators.

2.2 Base Case: Single Character

For a regex c (where c is a character), we create two new states: start → (c) → accept.

NfaFragment single_char(char c):
  State* s = new State()
  State* a = new State()
  a->is_accept = true
  s->transitions.add(Transition(Char(c), a))
  return NfaFragment(s, a)

2.3 Concatenation: R1 R2

Given fragments for R1 and R2, we connect the accept of R1 to the start of R2 with an epsilon transition. The new start is R1.start, new accept is R2.accept.

NfaFragment concat(NfaFragment f1, NfaFragment f2):
  // epsilon from f1.accept to f2.start
  f1.accept->transitions.add(Transition(Epsilon, f2.start))
  f1.accept->is_accept = false
  return NfaFragment(f1.start, f2.accept)

2.4 Alternation: R1 | R2

We create a new start state s and a new accept state a. Then add epsilon transitions from s to f1.start and s to f2.start, and from f1.accept and f2.accept to a. Mark f1.accept and f2.accept as non-accepting, and a as accepting.

NfaFragment alternation(NfaFragment f1, NfaFragment f2):
  State* s = new State()
  State* a = new State()
  a->is_accept = true
  s->transitions.add(Epsilon, f1.start)
  s->transitions.add(Epsilon, f2.start)
  f1.accept->transitions.add(Epsilon, a)
  f1.accept->is_accept = false
  f2.accept->transitions.add(Epsilon, a)
  f2.accept->is_accept = false
  return NfaFragment(s, a)

2.5 Kleene Star: R*

We create a new start s and a new accept a. Add an epsilon from s to a (zero repetitions), and from s to f.start. Then connect f.accept back to f.start (loop) and also to a.

NfaFragment kleene_star(NfaFragment f):
  State* s = new State()
  State* a = new State()
  a->is_accept = true
  s->transitions.add(Epsilon, a)          // zero repetitions
  s->transitions.add(Epsilon, f.start)    // start first repetition
  f.accept->transitions.add(Epsilon, f.start) // loop back
  f.accept->transitions.add(Epsilon, a)   // after at least one repetition, goto accept
  f.accept->is_accept = false
  return NfaFragment(s, a)

(Note: For R+, we can do R R*).

2.6 Example: Building NFA for a(b|c)*

Let’s trace:

  • a: fragment F_a (two states).
  • b: fragment F_b.
  • c: fragment F_c.
  • (b|c): alternation(F_b, F_c) → fragment F_bc.
  • (b|c)*: kleene_star(F_bc) → fragment F_bc_star.
  • a(b|c)*: concat(F_a, F_bc_star).

The resulting NFA will have about 8 states. Try drawing it; you’ll see loops that allow zero or more of b or c after an initial a.

2.7 Parsing and Precedence

To apply Thompson’s construction, we need to parse the regex into an AST. This is a classic recursive descent parser with operator precedence:

  • Unary operators: *, +, ? (postfix)
  • Binary operators: concatenation (implicit), | (lowest precedence), grouping ()

Escape sequences like \d are handled by converting to an alternation of digit characters (or a built-in character class). The parser is straightforward but beyond our scope; we’ll assume we have an AST.


3. Simulating the NFA: The Core Algorithm

Now we have an NFA graph. How do we check if a string matches?

3.1 The Naïve Approach: Follow All Paths

You could perform a depth-first search, trying every possible path through the NFA. If any path consumes the entire string and ends in an accepting state, match. This is expensive because the number of paths can be exponential (consider (a|b)* on a string of length N: there are 2^N paths). In practice, you’d use memoization or backtracking with a limit, but that still leads to catastrophic backtracking.

3.2 The Correct Approach: Simulate All Possible States in Parallel

The key insight: even though the NFA can be in multiple states at once, at any point during the input, there is a finite set of possible states the NFA could be in. Instead of exploring paths one by one, we maintain a set of current states. After each character, we compute the set of states reachable via that character (and epsilon closures).

This is called subset construction on the fly (without explicitly building the DFA). The algorithm runs in O(n · s) time where n is string length and s is number of NFA states (usually linear in regex size). For most regex patterns, s is modest (10–100), so this is essentially linear time.

3.3 The Algorithm

Let initial_epsilon_closure be the set of states reachable from the NFA start state via zero or more epsilon transitions (including the start state itself).

For each character c in the input string:

  1. Compute the set of states reachable from the current set by consuming c: next_states = { t | exists state in current_set with a transition on c to t }
  2. Compute the epsilon closure of next_states: add all states reachable from those via epsilon transitions.
  3. Set current_set = epsilon_closure(next_states).

After processing all characters, if any state in current_set is an accepting state, return match.

The epsilon closure is computed using a worklist algorithm (BFS/DFS). Since the NFA graph is small, we can precompute closure tables for speed.

3.4 Pseudocode

def epsilon_closure(states_set):
    result = set(states_set)
    stack = list(states_set)   // worklist
    while stack:
        s = stack.pop()
        for t in s.transitions:
            if t.is_epsilon and t.target not in result:
                result.add(t.target)
                stack.append(t.target)
    return result

def match_regex(regex_str, input_str):
    nfa = thompson_construction(regex_str)
    current_set = epsilon_closure({nfa.start_state})
    for ch in input_str:
        next_set = set()
        for state in current_set:
            for t in state.transitions:
                if t.is_char and t.char == ch:
                    next_set.add(t.target)
        current_set = epsilon_closure(next_set)
        if not current_set:   // early exit: no states reachable
            return False
    return any(state.is_accept for state in current_set)

3.5 Complexity

  • Building epsilon closure: O(|E|), where |E| is number of epsilon transitions in the NFA.
  • For each character, computing nextset and closure: O(|S| * outdegree). In the worst case, |S| can be up to total NFA states, so O(m * n) where m is NFA size and n is input length.
  • If we precompute the epsilon closure of every state (as a bitset), we can reduce per-character cost to O(|S| * alphabet size?) but typically fine.

This simulation is the heart of many regex engines like RE2, Google’s Rust regex crate, and the basic regex module in Python (when not using the backtracking re module with certain flags). It is linear time and safe — no catastrophic backtracking.


4. Beyond Regular: The Challenge of Extensions

The algorithm above works perfectly for “true” regular languages: those defined by the base operators. But real-world regex engines include many extensions that are not regular:

  • Backreferences: (a|b)\1 – requires remembering the captured substring.
  • Lookahead/lookbehind assertions: (?=...), (?!...), (?<=...), (?<!...).
  • Atomic groups and possessive quantifiers.

These extensions can describe languages that are not regular (e.g., (a*)b\1 describes a^n b a^n, which is context-sensitive). Therefore, they cannot be handled by the NFA simulation alone. However, many engines use NFA simulation as a base and then add backtracking and side effects (capture groups, lookahead) on top.

Let’s focus on lookahead assertions, as they are both powerful and conceptually interesting to implement.

4.1 What is a Lookahead Assertion?

A lookahead is a zero-width assertion: it matches a position in the string where a subpattern would match (or would not match) without consuming any characters. For example:

  • abc(?=def) matches abc only if it is followed by def.
  • abc(?!xyz) matches abc only if it is NOT followed by xyz.

Lookaheads can be nested and complex. They allow regex to encode checks that are beyond regular languages (e.g., ^(?=(a|b)*a)...x ensures a certain count). In fact, lookaheads combined with backreferences make regex Turing-complete.

4.2 Why Can’t We Just Handle Lookaheads with the NFA?

In the pure NFA simulation, we can add epsilon transitions for lookaheads? Not directly. Lookaheads involve checking a subpattern at the current position and then continuing from the same position if successful. This is like a “side computation” that does not affect the main consumption of characters.

One approach is to treat a lookahead as a separate NFA that we run independently at the current position. For each current state in the main NFA simulation, when we encounter a lookahead transition, we spawn a secondary simulation starting at the current input position. If that secondary simulation reaches an accepting state, we allow the main NFA to proceed (for positive lookahead). If not, we block.

This is exactly what many regex engines do, but it introduces extra overhead and potential for infinite loops if lookaheads are allowed to consume (they shouldn’t — but nested lookaheads can cause issues). More critically, because lookaheads can be arbitrary expressions, we must implement backtracking on the main path to try different alternatives when the lookahead fails. The simple parallel state simulation breaks down because we need to revert the current state set to a previous moment.

4.3 The Need for Backtracking

Consider the regex a(?=b)c. Intended behavior: match a then check that next char is b, then match c. But input abc: after consuming a, the lookahead tests position after a. It sees b → success, then the engine tries to consume c from the same position (which is after b?) No, wait: lookahead does not advance the main pointer. So after lookahead, the main regex should still try to match c starting at the same position where a ended. But c is at position 2 (0-indexed), while after a we are at position 1. The lookahead checks if b is at position 1, which is true, then main continues from position 1 to match c — but there is no c at position 1, only b. So the regex a(?=b)c will never match abc; it would match abc only if the string were abc? Actually think: a(?=b)c means: match a, then lookahead asserts that b follows, then try to match c. But after the lookahead, we are still at the position after a. The next character is b, not c, so the pattern fails. So a(?=b)c can never match any string because c cannot occur where b is expected. In practice, you’d write a(?=b)b?c to match abc. Anyway, the point is: the main engine must be able to try different branches when a lookahead succeeds but subsequent matching fails.

In a pure NFA simulation with parallel state sets, we cannot “unsuccess” a lookahead if the later part fails. We need backtracking to go back to the point before the lookahead and try a different alternative (if any). That means the engine must maintain a stack of choices.

Thus, most practical engines that support lookaheads are backtracking engines (like PCRE, Perl, Python’s re module). They can be slow, but they are expressive.

For our own engine, we will implement lookaheads by adding a special “lookahead transition” in the NFA that, when taken, runs a sub-match and either allows or disallows the transition depending on the result. The main NFA simulation becomes a backtracking search. We’ll use a recursive descent approach.


5. Implementing Lookahead Assertions with Backtracking

Let’s design an engine that supports (?=...) and (?!...). We’ll represent the regex as an AST, and the engine will traverse the input string using a recursive try_match function that returns the position after a successful match or None.

5.1 Parsing Lookaheads

In the AST, we add a node type Lookahead with a flag positive and a child regex (the lookahead pattern). Also note that lookahead groups do not capture by default, but they appear inside parentheses.

5.2 The Matching Function

We implement a function match_pattern(node, input_str, pos) that returns the new position if successful, or -1 (or None) for failure. This function must handle:

  • Literal char: if input[pos] == char, return pos+1.
  • Concatenation: match first child, then second child from resulting position.
  • Alternation: try each branch; return first success.
  • Kleene star: try zero or more repetitions greedily (or lazily with control). We’ll implement greedy by default: try as many as possible, then backtrack.
  • Lookahead (positive): save current position. Run match_pattern(lookahead_child, input_str, pos). If success, return original position (no advancement). If failure, return failure.
  • Lookahead (negative): similar but invert.

This backtracking approach is standard. The complexity can become exponential for certain patterns (e.g., (a*)*b on a string of as without b), but for most practical patterns with lookaheads it’s manageable.

5.3 Example: Positive Lookahead

Node: Lookahead(positive, sub_regex=Char('b')).

Suppose current input is "abc" at index 1 (after matching a). We call match_pattern(sub_regex, input, 1) → matches b at index 1? But sub_regex is Char('b'), input[1] = ‘b’, so match returns 2. Since positive, we don’t advance; we return current pos=1 to the caller. The caller then continues with the rest of the pattern from pos=1.

5.4 Example: Negative Lookahead

Node: Lookahead(negative, sub_regex=Char('b')).

Same situation: run sub_match at pos 1. It matches, but since negative lookahead, we invert: if sub_match succeeds, the lookahead fails. So we return failure. If sub_match fails (e.g., next char is not b), then lookahead succeeds and we return pos unchanged.

5.5 Capturing Groups and Backreferences

Lookaheads inside capturing groups: Actually, lookaheads do not consume characters, but they do participate in capture groups if the group spans the lookahead? Typically, you cannot capture inside a lookahead in many engines (the group is not populated). However, in PCRE, you can have capturing groups inside lookahead, and they will be assigned, but the values may be overwritten later. For simplicity, we can ignore capture groups for now.

5.6 Implementing the Recursive Matcher

Here’s a skeleton:

def match_subpattern(node, s, pos):
    if pos > len(s): return -1
    if node.type == 'LITERAL':
        if pos < len(s) and s[pos] == node.char:
            return pos+1
        else:
            return -1
    elif node.type == 'CONCAT':
        p = pos
        for child in node.children:
            p = match_subpattern(child, s, p)
            if p == -1:
                return -1
        return p
    elif node.type == 'ALTERNATION':
        for child in node.children:
            result = match_subpattern(child, s, pos)
            if result != -1:
                return result
        return -1
    elif node.type == 'STAR':
        # greedy: try max repetitions first
        # use recursive function that tries longer first
        # we'll implement with a helper
        return match_star_greedy(node.child, s, pos, node.min? 0, inf)
    elif node.type == 'LOOKAHEAD':
        result = match_subpattern(node.child, s, pos)
        if node.positive:
            return pos if result != -1 else -1
        else:
            return pos if result == -1 else -1
    # ... other nodes ...

For greedy quantifiers, we need backtracking: we try to match the loop body as many times as possible, and if later parts fail, we backtrack by reducing the count. This is standard.

5.7 Performance: Exponential Blowups

Consider (a*)*b on input "aaa". The engine will try to match the outer star with maximum repetitions, each inner star also greedy. It will consume all as with inner a*, then the outer star tries to match again but no more as, so it succeeds with zero additional repetitions? Actually the pattern: (a*)*b. The inner a* matches zero or more as. The outer * matches zero or more of the inner group. On "aaa", the engine might try:

  • Outer: 3 repetitions? No, because each repetition of the inner group must consume at least one a? Actually a* can match zero as, so the outer star can repeat infinitely many times with each inner group matching zero, leading to infinite loop. Real engines prevent this by requiring that a repetition must consume at least one character if it repeats. The standard approach: an empty match in a loop will cause the engine to break (i.e., stop repeating) to avoid infinite loops. Even so, there can be exponential backtracking: (a|aa)*b on "aaa" – classic catastrophic backtracking.

That’s why many modern libraries like RE2 and Rust’s regex crate do not support backreferences or lookaheads, but instead stick to pure NFA simulation for safety.

5.8 Coping with Exponentiality: The Trade-off

For our educational engine, we accept that lookaheads can cause slowdowns. In practice, if you need lookaheads, you need a backtracking engine. But you can add optimizations:

  • Ordered alternation: always try leftmost first (standard).
  • Possessive quantifiers: once matched, never backtrack (atomic groups).
  • Lookahead memoization: if you run a lookahead sub-pattern at the same position multiple times, cache the result. For example, in (?=.*a)(?=.*b), each lookahead is evaluated independently; caching can help if they overlap.

We can implement a memo table keyed by (pattern_node, position) that stores the result (success/failure). For patterns that are pure and do not depend on capture groups, this is safe. This optimization is used in some engines (e.g., Oniguruma).


6. Real-World Regex Architectures: A Comparison

Now that we’ve seen both pure NFA simulation and backtracking with lookaheads, let’s see how real engines are designed.

6.1 PCRE / Perl / Python re (Backtracking Engines)

  • Approach: Convert regex to an NFA (often via Thompson’s construction or a direct bytecode). Then execute with a recursive descent matcher using backtracking.
  • Supports: Backreferences, lookahead/lookbehind, atomic groups, possessive quantifiers, recursion.
  • Performance: Can be exponential for pathological patterns. Relies on heuristics to avoid excessive backtracking (e.g., automatic possession for certain patterns, or a backtracking limit). PCRE has a pcre2_match with a limit on the number of backtracking steps.
  • Use case: General-purpose text processing where expressivity is more important than guaranteed linear time.

6.2 RE2 (NFA Simulation Engine)

  • Approach: Compile regex to an NFA, then simulate using epsilon closure and state sets (exactly our earlier algorithm). It explicitly disallows backreferences and lookaheads.
  • Supports: Basic regular languages plus some extensions like \d, character classes, but not lookarounds.
  • Performance: Guaranteed O(n * m) time. No catastrophic backtracking.
  • Use case: Google’s internal systems, safe for user-supplied regexes (e.g., in web servers).

6.3 Rust regex Crate (Hybrid)

  • Approach: Uses a lazy DFA – simulates the NFA but builds DFA states on the fly with caching. For patterns that have bounded repetition and no backreferences, it runs in linear time. For patterns with unbounded repetition that lead to DFA blow-up, it falls back to an NFA simulation. It does not support lookaheads or backreferences.
  • Performance: Generally linear, with minimal overhead.
  • Use case: High-performance Rust applications.

6.4 Google’s hyperscan (DFA + NFA Hybrid)

  • Approach: Compiled to automata with special handling for large pattern sets. Uses “literal matchers” and “graphs” for high-throughput pattern matching.
  • Supports: A large subset of PCRE syntax but not backreferences; lookaheads are partially supported via “triggers”.
  • Performance: Extremely fast for multiple patterns in network intrusion detection.

6.5 JavaScript (ECMAScript)

  • Approach: Backtracking engine similar to PCRE, though historically slower. Modern engines (V8’s Irregexp) compile to native code using a combination of JIT and optimization. They support lookaheads, backreferences, and some lookbehind (ES2018).
  • Performance: Usually fast, but can suffer exponential backtracking on crafted patterns (ReDoS attacks are common).

7. Resilience: Defending Against Catastrophic Backtracking

Now that you know how to build an engine with lookaheads, you’re also responsible for preventing it from being a weapon of mass destruction. The classic ReDoS (Regular Expression Denial of Service) attack exploits exponential backtracking.

7.1 The Anatomy of a Catastrophic Pattern

Common patterns: (a|aa)*b, (a*)*b, (a|ab)*b. The problem is alternation where both branches can match the same prefix, combined with a quantifier that forces backtracking. For example, (a|aa)* on input "aaa":

  • The engine tries to match a three times (greedy).
  • Then looks for b – fails.
  • Backtracks: reduces repetitions, tries different splits.

Because a|aa can match the same character in different ways, the engine explores many paths.

7.2 How to Defend

  1. Atomic groups: In PCRE, (?>a|aa)* makes each group atomic – once matched, never backtrack into it. This prevents the explosion.
  2. Possessive quantifiers: (a|aa)*+ is equivalent.
  3. Use NFA simulation when possible: If you don’t need backreferences or lookaheads, use RE2 or Rust’s regex. Many web applications can validate emails with re.fullmatch from Python’s re module – but that engine is backtracking. However, if you write the regex carefully, it’s safe.
  4. Set a backtrack limit: In PCRE, you can set pcre2_match_limit to cap the number of backtracking steps.
  5. Static analysis: Tools like rxxr2 can detect exponential patterns. On regex101.com, there’s “debugger” that shows steps.

In our own engine, we can add a step counter and abort after a threshold.


8. Building a Complete Regex Engine: Putting It All Together

We have all the pieces. Let’s sketch a complete Python implementation in a few hundred lines (which you can find in the code repository linked at the end). The architecture:

  1. Lexer: tokenize regex string into tokens (char, special chars, groups).
  2. Parser: build AST with nodes: Literal, Concat, Alternation, Star, Plus, Option, Group, Lookahead, NegativeLookahead, Backreference (optional).
  3. Compiler (optional): convert AST to a bytecode for faster execution. Thompson’s construction to NFA, then use the NFA simulation for patterns without capture groups or lookaheads. For patterns with lookaheads, fall back to a recursive matcher on the AST.
  4. Executor: either nfa_simulate(compiled_nfa, input) or backtrack(ast, input).

For completeness, I provide a full implementation (link to GitHub gist). Let’s go through a few key functions.

8.1 Parser (Recursive Descent)

class RegexParser:
    def __init__(self, pattern):
        self.pattern = pattern
        self.pos = 0

    def parse_union(self):
        left = self.parse_concat()
        while self.peek() == '|':
            self.consume()
            right = self.parse_concat()
            left = AlternationNode(left, right)
        return left

    def parse_concat(self):
        nodes = []
        while self.pos < len(self.pattern) and self.peek() not in '|)':
            nodes.append(self.parse_atom())
        # If no nodes, error or empty (allowed)
        if not nodes:
            return EmptyNode()
        if len(nodes) == 1:
            return nodes[0]
        return ConcatNode(nodes)  # binary or n-ary

    def parse_atom(self):
        c = self.pattern[self.pos]
        if c == '(':
            self.consume()
            if self.peek() == '?':
                self.consume()
                if self.peek() == '=':
                    self.consume()
                    inner = self.parse_union()
                    self.expect(')')
                    return LookaheadNode(inner, positive=True)
                elif self.peek() == '!':
                    self.consume()
                    inner = self.parse_union()
                    self.expect(')')
                    return LookaheadNode(inner, positive=False)
                # ... other groups: (?:...), (?<=...), etc.
                else:
                    # capturing group
                    inner = self.parse_union()
                    self.expect(')')
                    return GroupNode(inner)
            else:
                # capturing group
                inner = self.parse_union()
                self.expect(')')
                return GroupNode(inner)
        elif c == '\\':
            self.consume()
            next_c = self.pattern[self.pos]
            if next_c == 'd':
                self.consume()
                return DigitClassNode()
            elif next_c == 'w':
                self.consume()
                return WordClassNode()
            # other escapes, backreference \1 etc.
        elif c in '*+?':
            # won't appear here normally
            raise ParseError
        else:
            self.consume()
            return LiteralNode(c)
        # after parsing atom, check for quantifier
        if self.pos < len(self.pattern) and self.peek() in '*+?':
            quant = self.consume()
            if quant == '*':
                return StarNode(atom)
            elif quant == '+':
                return PlusNode(atom)
            elif quant == '?':
                return OptionNode(atom)
        return atom

8.2 Recursive Matcher with Lookahead

def match_node(node, s, pos, captures):
    if pos > len(s):
        return -1
    typ = node.type
    if typ == 'LITERAL':
        if pos < len(s) and s[pos] == node.char:
            return pos+1
        else:
            return -1
    elif typ == 'CONCAT':
        for child in node.children:
            pos = match_node(child, s, pos, captures)
            if pos == -1:
                return -1
        return pos
    elif typ == 'ALTERNATION':
        for child in node.children:
            result = match_node(child, s, pos, captures)
            if result != -1:
                return result
        return -1
    elif typ == 'STAR':
        # greedy: try maximum repetitions
        def try_star(start_pos, count):
            # try to match one more repetition
            while True:
                # try match child at current pos
                new_pos = match_node(node.child, s, start_pos, captures)
                if new_pos != -1 and new_pos > start_pos:
                    start_pos = new_pos
                    count += 1
                else:
                    break
            # now try to match rest of pattern from start_pos
            # but rest is None; we are inside recursion, so we need to return start_pos
            return start_pos
        # Actually star node is a leaf? No, it's part of concatenation.
        # We'll handle star via recursion in the concat node. This is tricky.
        # Better to use a separate function that returns a continuation?
        # For simplicity, we implement star by recursion: match_child, then try rest, backtrack.
        # See full implementation in code.
    elif typ == 'LOOKAHEAD':
        # run child on current pos, no consumption
        result = match_node(node.child, s, pos, captures)
        if node.positive:
            return pos if result != -1 else -1
        else:
            return pos if result == -1 else -1
    # ... other nodes

The star implementation requires backtracking: after matching the child as many times as possible, try to match the rest of the pattern (the continuation). If that fails, reduce count and try again. In our skeleton above, we avoid showing the complexity, but the full code does this with a helper function that takes the “rest” as an argument. Common technique: pass a continuation lambda.

8.3 Running the Engine

def full_match(regex, string):
    parser = RegexParser(regex)
    ast = parser.parse_union()
    result = match_node(ast, string, 0, captures={})
    return result == len(string)

This gives us a working engine with lookaheads. Let’s test:

  • "a(?=b)c" on "abc" → The concat: match a at 0 → pos=1; then lookahead child b at pos=1 succeeds, returns pos=1; then try to match c at pos=1 — fails, return -1. So no match. Expected.
  • "a(?=b)." on "abc" → after a, lookahead succeeds, then . matches b at pos=1 → returns 2. So match. Expected.
  • "(?!a).." on "bc" → at pos=0, negative lookahead: try match a at 0 fails, so lookahead succeeds, return pos=0; then match two dots: first dot at 0 ‘b’, second at 1 ‘c’ → returns 2. Match.

Our engine works.


9. Conclusion: The Incantation Deciphered

We started with a spell—a regex—and we peered behind the veil. We saw that a regex is not magic but a compact representation of a nondeterministic finite automaton. We built that automaton from first principles using Thompson’s construction, then simulated it with epsilon closures to achieve linear-time matching. We extended our engine to handle lookahead assertions by introducing backtracking, accepting the trade-off between expressivity and performance.

You now understand why a simple pattern like (a|b)*ab runs in milliseconds on a 10KB file, while (a|aa)*b can stall a server for minutes. You know that the choice of engine—NFA simulation vs. backtracking—is a fundamental architectural decision that shapes the capabilities and vulnerabilities of any tool that uses regex.

Armed with this knowledge, you can:

  • Write regexes that are safe (avoid nested quantifiers with overlapping alternatives).
  • Evaluate whether you need the power of lookaheads or can live with a linear-time engine.
  • Understand why Python’s re module is vulnerable to ReDoS while regex module (with backtracking limits) can be safer.
  • Appreciate the engineering in libraries like RE2 and Rust’s regex, which achieve speed by sacrificing backreferences.

The next time you type that ancient incantation, remember: the spellbook is not empty; it’s filled with computation, state sets, and automata. And you are the one who forged it.


Further Reading

  • Compilers: Principles, Techniques, and Tools (Aho, Lam, Sethi, Ullman) — Chapter on regular expressions and automata.
  • Regular Expression Matching Can Be Simple And Fast (Russ Cox) — The definitive article on NFA vs. DFA vs. backtracking. Link
  • Implementing Regular Expressions (Thompson, 1968) — Original paper on Thompson’s construction.
  • PCRE man page – Backtracking limits.
  • RE2 source code – C++ example of NFA simulation.
  • Oniguruma – A regex library with powerful lookahead and backreference support.

Author’s note: This post is accompanied by a full-working regex engine in Python (~500 lines) that supports concatenation, alternation, repetition, groups, and lookaheads, with both NFA simulation (for pure patterns) and backtracking (for lookaheads). The code is available on GitHub at [link].