Boyer Moore Algorithm
String Searching
Algorithm Explanation
Computer Science
Pattern Matching

Boyer Moore Algorithm Understanding and Example?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

The Boyer-Moore algorithm is a classic string-search algorithm designed to find a pattern inside a larger text with fewer comparisons than a naive left-to-right scan. Its practical strength comes from starting comparisons at the end of the pattern and using mismatch information to skip ahead aggressively.

Why Boyer-Moore is fast in practice

A naive search checks each text position one by one and often re-compares characters that clearly cannot lead to a match. Boyer-Moore tries to avoid that waste.

The key idea is simple: align the pattern against the text, compare from right to left, and if a mismatch occurs, move the pattern by more than one character whenever the mismatch proves that certain alignments are impossible.

In real text-search workloads, this often leads to sublinear behavior because the algorithm skips large sections of the input.

The two main heuristics

Classic Boyer-Moore combines two shift rules.

The bad-character rule says that if a mismatched text character appears somewhere else in the pattern, shift the pattern so that occurrence lines up with the mismatched text character. If the character does not appear in the pattern, the pattern can jump past it entirely.

The good-suffix rule says that if a suffix of the pattern matched successfully before the mismatch, shift the pattern so that another occurrence of that suffix, or a compatible prefix, lines up with the text.

A full production implementation often uses both rules together and takes the larger shift.

A runnable simplified implementation

The example below implements the bad-character part, which is enough to understand the search strategy clearly.

python
1def build_bad_char_table(pattern: str) -> dict[str, int]:
2    table = {}
3    for index, char in enumerate(pattern):
4        table[char] = index
5    return table
6
7
8def boyer_moore_search(text: str, pattern: str) -> list[int]:
9    if not pattern:
10        return []
11
12    bad_char = build_bad_char_table(pattern)
13    matches = []
14    shift = 0
15
16    while shift <= len(text) - len(pattern):
17        j = len(pattern) - 1
18
19        while j >= 0 and pattern[j] == text[shift + j]:
20            j -= 1
21
22        if j < 0:
23            matches.append(shift)
24            shift += len(pattern)
25        else:
26            mismatched = text[shift + j]
27            last_seen = bad_char.get(mismatched, -1)
28            shift += max(1, j - last_seen)
29
30    return matches
31
32
33text = "ABAAABCDABCDA"
34pattern = "ABCD"
35print(boyer_moore_search(text, pattern))

This example builds a table containing the rightmost occurrence of each pattern character. When a mismatch occurs, the search jumps based on that table instead of shifting by exactly one position.

Suppose the pattern is ABCD and the text segment under the pattern ends with a mismatch against X. If X does not appear anywhere in the pattern, the current alignment cannot possibly match, so the pattern can jump completely past that character.

If the mismatch character does appear in the pattern, the algorithm aligns the rightmost occurrence with the mismatched text position. That is the core reason Boyer-Moore skips so much work.

The right-to-left comparison order is not arbitrary. It makes the mismatch information more valuable because you learn about the far end of the pattern first, where bigger shifts are often possible.

Complexity and tradeoffs

The preprocessing cost is proportional to the pattern size. Search performance is excellent in practice on long texts and reasonably sized patterns, especially when the alphabet is not tiny and mismatches are common.

The exact worst-case complexity depends on the variant and the implementation details. What matters operationally is that Boyer-Moore is usually much faster than naive matching on typical text-search problems.

That said, it is not always the best choice. For very small patterns or heavily repetitive alphabets, the advantage may shrink. Algorithms such as KMP or modern library search implementations can be better fits depending on workload.

Common Pitfalls

A common mistake is assuming a simplified bad-character-only implementation is the full Boyer-Moore algorithm. The classic algorithm uses both bad-character and good-suffix shifts.

Another issue is forgetting overlapping matches. The simplified example above jumps by the full pattern length after a match, which is fine for many tasks but skips overlaps. If overlaps matter, use a smaller post-match shift.

It is also easy to quote asymptotic complexity too casually. Boyer-Moore is famous because it is fast in practice, but exact worst-case statements depend on which variant you implement.

Summary

  • Boyer-Moore searches from right to left and uses mismatches to skip ahead.
  • Its two classic heuristics are the bad-character rule and the good-suffix rule.
  • Even a simplified bad-character implementation can be much faster than naive searching.
  • The algorithm is especially effective on large texts with frequent mismatches.
  • Be clear about which variant you are implementing, especially when discussing complexity and overlapping matches.

Course illustration
Course illustration

All Rights Reserved.