Cancelling a long running regex match?
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
Regular expressions (regex) are a powerful tool for pattern matching within text. However, when dealing with large inputs and complex patterns, regex can become computationally expensive, resulting in matches that take seconds, minutes, or even longer. This article explains why regex matches can run long, how to cancel them safely, and how to prevent the problem in the first place.
Why Regex Matches Run Long
Regex engines process patterns by exploring possible matches through the input string. Three factors cause them to take excessive time:
Catastrophic Backtracking
Most regex engines use a backtracking NFA (nondeterministic finite automaton) approach. When a pattern contains nested quantifiers, the engine may explore an exponential number of paths before concluding there is no match.
The classic example is the pattern (a+)+ applied to the string "aaaaaaaaaab". The engine tries to match groups of a characters in every possible combination before finally failing at the b. For an input of length , this can produce backtracking steps.
Large Input Size
Even a well-written pattern can be slow on enormous inputs. A pattern with per-character work applied to a multi-gigabyte log file will take a long time regardless of backtracking behavior.
Unbounded Repetition
Patterns with .* or .+ without anchors can scan the entire input before failing, and when combined with alternation (|), they can do so multiple times.
How to Cancel a Long-Running Match
1. Timeouts
Many regex libraries provide a built-in timeout mechanism. This is the most reliable approach.
Python (using a thread with a timeout):
C# / .NET (built-in timeout support):
JavaScript (no built-in timeout, use a Worker):
2. Thread or Process Isolation
Run the regex match in a separate thread or process. If the match exceeds the time budget, terminate the thread or process. This is the most portable approach across languages.
3. Incremental Matching
Some engines support matching in chunks. Instead of feeding the entire input at once, process it in fixed-size blocks (for example, 64 KB at a time). This gives you natural checkpoints where you can decide to continue or abort.
Preventing Long Matches
Cancellation is a safety net. The better approach is to write patterns that cannot blow up.
Use Non-Greedy Quantifiers
Replace * and + with *? and +? when the greedy version is not needed. Non-greedy quantifiers match as little as possible first, which can reduce unnecessary scanning.
Anchor Your Patterns
Use ^ (start) and $ (end) anchors to limit the search scope. Without anchors, the engine tries the pattern at every position in the input.
Avoid Nested Quantifiers
Patterns like (a+)+, (a*)*, or (a|b)*c with overlapping alternatives are the primary cause of catastrophic backtracking. Refactor these into equivalent non-nested forms. For example, (a+)+ can be replaced with a+.
Use Atomic Groups or Possessive Quantifiers
If your engine supports them, atomic groups (?>...) and possessive quantifiers a++ prevent backtracking into a sub-expression once it has matched. This eliminates exponential blowup entirely for the wrapped portion.
Use Character Classes
Replace broad wildcards with specific character classes. Instead of .* to match "anything until a comma", use [^,]*. The character class fails immediately on a comma instead of scanning the entire input and backtracking.
Consider a DFA-Based Engine
Some regex libraries (like RE2 in Go, or the re2 Python binding) use a DFA-based engine that guarantees matching time for any pattern. The tradeoff is that they do not support backreferences or lookahead, but for many use cases those features are not needed.
Decision Table
| Situation | Recommended Approach |
| Untrusted user-supplied patterns | Timeout + DFA engine (RE2) |
| Known complex pattern, large input | Thread isolation + incremental matching |
| Pattern you control, small input | Refactor pattern to avoid backtracking |
| Performance-critical pipeline | DFA engine with guarantee |
Common Pitfalls
- Relying on
signal.alarmin Python, which only works on Unix and only in the main thread. Use threading or multiprocessing for portability. - Setting a timeout but not handling the timeout exception, causing your application to crash instead of recovering gracefully.
- Assuming that "short pattern" means "fast match." A short pattern like
(a+)+can take exponential time on certain inputs. - Using
re.DOTALL(making.match newlines) without realizing that.*now scans the entire multi-line input.
Summary
Long-running regex matches are primarily caused by catastrophic backtracking from nested quantifiers. The safest defense is a timeout mechanism. Writing patterns that avoid backtracking (using anchors, character classes, atomic groups, and non-greedy quantifiers) prevents the problem at its source. For maximum safety, especially with untrusted patterns, use a DFA-based engine like RE2 that guarantees linear-time matching.

