Algorithm for regular expressions - combinations on or
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
In regular expressions, the | operator means alternation: match the pattern on the left or the pattern on the right. Algorithmically, alternation is usually handled by parsing the expression with the correct precedence and then building an automaton with a branch that can follow either path.
What Alternation Means
The regex:
matches either cat or dog.
The important detail is that alternation composes with concatenation and grouping. For example:
- '
ab|cdmeansaborcd' - '
a(b|c)dmeansabdoracd'
That is why a good regex algorithm begins with parsing, not with ad hoc string splitting on the | character.
Parse with Precedence First
A regex parser usually treats precedence like this:
- grouping with parentheses
- repetition such as
*,+,? - concatenation
- alternation with
|
So the regex engine or parser first builds a syntax tree. Alternation becomes a node with two children.
For a|bc, the tree is not (a|b)c. It is a or bc because concatenation binds more tightly than alternation.
Thompson NFA Construction
A classic algorithm for regex alternation is Thompson’s construction. In that approach, each subexpression becomes an NFA fragment with a start and accept state.
For alternation A|B, you create:
- a new start state
- epsilon transitions to the start of
AandB - a new accept state
- epsilon transitions from the accept states of
AandBto the new accept state
Conceptually:
That is the standard automata-theory answer to “how do I implement OR in regex?”
Small Python Example of the Idea
Here is a very small parser-style illustration that splits top-level alternation only when it is not inside parentheses:
This is not a full regex engine, but it demonstrates a key parsing rule: alternation must be recognized only at the correct nesting level.
Why Backtracking Engines Still Need Structure
Many practical regex engines are implemented as backtracking matchers rather than full DFA compilers, but they still need the same structural understanding.
When the engine sees alternation, it tries one branch and, if needed, backtracks to try the other. The operational style differs from Thompson NFA simulation, but the abstract meaning of alternation is the same.
So whether you build an NFA, a DFA, or a backtracking engine, | always means “fork into alternative subpatterns.”
Combining Many OR Branches
If you have many literal alternatives, there are optimizations beyond naïvely trying them one by one.
Examples include:
- trie-like compilation for shared prefixes
- DFA state merging
- prefix factoring such as rewriting
cat|carintoca(t|r)conceptually
These optimizations matter for performance, but they preserve the same logical structure.
Common Pitfalls
The most common mistake is splitting on every | without respecting parentheses. That breaks grouped patterns immediately.
Another issue is ignoring operator precedence. Alternation usually has lower precedence than concatenation, so parse order matters.
People also assume regex alternation is just string list matching. Once quantifiers, groups, and nesting appear, you need a real parser or a proven regex engine.
Finally, do not confuse the theoretical algorithm with one specific runtime implementation. Thompson NFA, DFA compilation, and backtracking engines all represent alternation differently in execution while preserving the same semantics.
Summary
- Regex
|means alternation: match the left branch or the right branch. - Correct handling starts with parsing precedence and grouping properly.
- Thompson’s construction models alternation as a branch with epsilon transitions.
- Backtracking engines implement the same idea by trying alternative paths.
- Any serious regex algorithm must treat top-level and nested
|differently.

