regular expressions
regex algorithm
pattern matching
computational theory
string manipulation

Algorithm for regular expressions - combinations on or

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

In regular expressions, the | operator means alternation: match the pattern on the left or the pattern on the right. Algorithmically, alternation is usually handled by parsing the expression with the correct precedence and then building an automaton with a branch that can follow either path.

What Alternation Means

The regex:

text
cat|dog

matches either cat or dog.

The important detail is that alternation composes with concatenation and grouping. For example:

  • 'ab|cd means ab or cd'
  • 'a(b|c)d means abd or acd'

That is why a good regex algorithm begins with parsing, not with ad hoc string splitting on the | character.

Parse with Precedence First

A regex parser usually treats precedence like this:

  1. grouping with parentheses
  2. repetition such as *, +, ?
  3. concatenation
  4. alternation with |

So the regex engine or parser first builds a syntax tree. Alternation becomes a node with two children.

For a|bc, the tree is not (a|b)c. It is a or bc because concatenation binds more tightly than alternation.

Thompson NFA Construction

A classic algorithm for regex alternation is Thompson’s construction. In that approach, each subexpression becomes an NFA fragment with a start and accept state.

For alternation A|B, you create:

  • a new start state
  • epsilon transitions to the start of A and B
  • a new accept state
  • epsilon transitions from the accept states of A and B to the new accept state

Conceptually:

text
      epsilon -> A -> epsilon
start                        accept
      epsilon -> B -> epsilon

That is the standard automata-theory answer to “how do I implement OR in regex?”

Small Python Example of the Idea

Here is a very small parser-style illustration that splits top-level alternation only when it is not inside parentheses:

python
1def split_top_level_or(pattern: str):
2    parts = []
3    depth = 0
4    start = 0
5
6    for i, ch in enumerate(pattern):
7        if ch == '(':
8            depth += 1
9        elif ch == ')':
10            depth -= 1
11        elif ch == '|' and depth == 0:
12            parts.append(pattern[start:i])
13            start = i + 1
14
15    parts.append(pattern[start:])
16    return parts
17
18print(split_top_level_or("a|(bc|de)|f"))

This is not a full regex engine, but it demonstrates a key parsing rule: alternation must be recognized only at the correct nesting level.

Why Backtracking Engines Still Need Structure

Many practical regex engines are implemented as backtracking matchers rather than full DFA compilers, but they still need the same structural understanding.

When the engine sees alternation, it tries one branch and, if needed, backtracks to try the other. The operational style differs from Thompson NFA simulation, but the abstract meaning of alternation is the same.

So whether you build an NFA, a DFA, or a backtracking engine, | always means “fork into alternative subpatterns.”

Combining Many OR Branches

If you have many literal alternatives, there are optimizations beyond naïvely trying them one by one.

Examples include:

  • trie-like compilation for shared prefixes
  • DFA state merging
  • prefix factoring such as rewriting cat|car into ca(t|r) conceptually

These optimizations matter for performance, but they preserve the same logical structure.

Common Pitfalls

The most common mistake is splitting on every | without respecting parentheses. That breaks grouped patterns immediately.

Another issue is ignoring operator precedence. Alternation usually has lower precedence than concatenation, so parse order matters.

People also assume regex alternation is just string list matching. Once quantifiers, groups, and nesting appear, you need a real parser or a proven regex engine.

Finally, do not confuse the theoretical algorithm with one specific runtime implementation. Thompson NFA, DFA compilation, and backtracking engines all represent alternation differently in execution while preserving the same semantics.

Summary

  • Regex | means alternation: match the left branch or the right branch.
  • Correct handling starts with parsing precedence and grouping properly.
  • Thompson’s construction models alternation as a branch with epsilon transitions.
  • Backtracking engines implement the same idea by trying alternative paths.
  • Any serious regex algorithm must treat top-level and nested | differently.

Course illustration
Course illustration

All Rights Reserved.