Count number of occurrences of a substring in a string

Introduction

Counting substring occurrences seems simple until details like overlap, case sensitivity, and Unicode normalization appear. Different methods give different answers depending on those rules, so choosing the right approach matters. In Python, you can solve most cases with built-in methods, regular expressions, or a controlled scan loop.

Core Sections

Non-overlapping count with `str.count`

For standard non-overlapping counts, use the built-in method. It is concise and implemented efficiently.

python

text = "banana"
print(text.count("an"))  # 2
print(text.count("ana")) # 1

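str.count scans left to right and resumes after the end of each match, so it never counts overlapping occurrences: "ana" appears at positions 1 and 3 of "banana", but only the first is counted. That behavior is correct for many text analytics tasks.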
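str.count also accepts optional start and end indices, which can be handy when only part of the string should be searched. A small sketch, using the same "banana" example (the variable names are illustrative only):

```python
text = "banana"

# Count "an" only from index 2 onward, i.e. within "nana".
from_index_2 = text.count("an", 2)

# Count "a" only within the slice text[0:3], i.e. within "ban".
in_first_three = text.count("a", 0, 3)

print(from_index_2)    # 1
print(in_first_three)  # 1
```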

Overlapping count with regex lookahead

If overlapping occurrences are required, use a lookahead pattern.

python

import re

text = "banana"
pattern = re.compile(r"(?=(ana))")
count = len(pattern.findall(text))
print(count)  # 2, matches at positions 1 and 3

Lookahead checks each position without consuming characters, which enables overlap counting.
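When the substring comes from user input rather than a hard-coded literal, it may contain regex metacharacters such as "." or "*". One way to handle that, sketched here with a hypothetical helper named count_overlapping, is to pass the substring through re.escape before building the lookahead:

```python
import re

def count_overlapping(text: str, sub: str) -> int:
    """Count overlapping literal occurrences of sub (hypothetical helper)."""
    if not sub:
        # Assumed rule: an empty substring counts as zero matches.
        return 0
    # re.escape makes characters like "." match literally.
    pattern = re.compile(f"(?={re.escape(sub)})")
    return sum(1 for _ in pattern.finditer(text))

print(count_overlapping("banana", "ana"))    # 2
print(count_overlapping("axa, a.a", "a.a"))  # 1
```

Without the escape, the "." in "a.a" would match any character, so "axa" would be counted as well.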

Controlled scan loop for full transparency

A manual scan gives explicit control over overlap behavior and can be easier to debug in domain-specific logic.

python

def count_substring(text: str, sub: str, overlap: bool = False) -> int:
    if sub == "":
        return 0
    step = 1 if overlap else len(sub)
    i = 0
    c = 0
    while i <= len(text) - len(sub):
        if text[i:i+len(sub)] == sub:
            c += 1
            i += step
        else:
            i += 1
    return c

print(count_substring("aaaa", "aa", overlap=False))  # 2
print(count_substring("aaaa", "aa", overlap=True))   # 3

This approach is useful when the overlap rule has to be explicit in the code and covered by tests, because every step of the scan is visible and easy to assert on.

Case-insensitive and normalized matching

For user text, normalize case and optionally Unicode representation before counting.

python

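import unicodedata

text = "Café cafe CAFE"
sub = "cafe"
# NFKD decomposes "é" into "e" plus a combining accent; casefold removes case differences.
norm_text = unicodedata.normalize("NFKD", text).casefold()
norm_sub = unicodedata.normalize("NFKD", sub).casefold()
print(norm_text.count(norm_sub))  # 3, "Café" matches because its "e" is now a separate character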

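Normalization reduces surprises in multilingual data, but the chosen form changes the result. NFKD splits "é" into "e" plus a combining accent, so "Café" matches "cafe" here and the count is 3; with NFC the accented word would not match and the count would be 2.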
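The reason normalization matters at all is that the same visible character can be stored in more than one way. The sketch below, with illustrative variable names, shows a composed and a decomposed "é" that look identical but only match each other after both are brought to the same form (NFC here):

```python
import unicodedata

composed = "caf\u00e9"      # "é" stored as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent
text = f"{composed} {decomposed}"

# Without normalization only the composed spelling is found.
raw_count = text.count(composed)

# After NFC normalization both spellings use the single code point.
nfc_text = unicodedata.normalize("NFC", text)
nfc_count = nfc_text.count(unicodedata.normalize("NFC", composed))

print(raw_count)  # 1
print(nfc_count)  # 2
```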

Performance guidance

For one-off checks in moderate text, str.count is usually best. For complex matching rules, regex can be expressive but sometimes slower. For very large streams, process chunk by chunk and account for boundary overlap between chunks. Always benchmark with representative input, because algorithm choice that looks fastest on tiny strings can regress on real workloads.
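A quick way to compare approaches on your own data is the standard timeit module. The following is only a sketch: the sample text, the repeat count, and the function names are placeholders, and the timings it prints will vary by machine and input.

```python
import re
import timeit

# Placeholder input; substitute text that resembles your real workload.
text = "banana " * 10_000
sub = "ana"
pattern = re.compile(re.escape(sub))

def with_str_count() -> int:
    return text.count(sub)

def with_regex() -> int:
    # Non-overlapping regex matches, comparable to str.count.
    return len(pattern.findall(text))

# Confirm both approaches agree before comparing their speed.
assert with_str_count() == with_regex()

for fn in (with_str_count, with_regex):
    seconds = timeit.timeit(fn, number=100)
    print(f"{fn.__name__}: {seconds:.4f}s for 100 runs")
```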

Streaming approach for very large files

When input does not fit comfortably in memory, process text in chunks and keep a trailing window to preserve boundary matches.

python

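def count_in_stream(file_obj, sub: str, chunk_size: int = 4096) -> int:
    """Count non-overlapping occurrences of sub; file_obj must be in text mode."""
    if not sub:
        return 0

    carry = ""
    total = 0
    keep = len(sub) - 1

    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        text = carry + chunk
        # Scan with find() so the end of the last match is known.
        pos = 0
        while (hit := text.find(sub, pos)) != -1:
            total += 1
            pos = hit + len(sub)
        # Carry at most len(sub) - 1 characters, and never characters that a
        # match already consumed; otherwise "aaa" read as "aa" + "a" would
        # count "aa" twice.
        carry = text[max(pos, len(text) - keep):]

    return total

This method keeps memory bounded and gives the same result as str.count on the whole text, including for matches that straddle a chunk boundary. Carrying the last len(sub) - 1 characters unconditionally is not enough: when the substring can overlap itself, characters already used by a match would be reused in the next chunk and inflate the count. For overlap counting in streams, use a custom scan function instead of raw count.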
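For the overlapping case in a stream, one possible custom scan is sketched below. The function name count_overlapping_in_stream is hypothetical, and the sketch assumes a text-mode file object. Because every overlapping match is wanted, the last len(sub) - 1 characters can always be carried over: a match can never fit entirely inside the carry, so nothing is counted twice.

```python
import io

def count_overlapping_in_stream(file_obj, sub: str, chunk_size: int = 4096) -> int:
    """Count overlapping occurrences of sub in a text stream (sketch)."""
    if not sub:
        # Assumed rule: an empty substring counts as zero matches.
        return 0

    keep = len(sub) - 1
    carry = ""
    total = 0

    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        text = carry + chunk
        # Advance one character after each hit so overlaps are found.
        pos = text.find(sub)
        while pos != -1:
            total += 1
            pos = text.find(sub, pos + 1)
        # The carry is shorter than sub, so it cannot hold a full match.
        carry = text[-keep:] if keep > 0 else ""

    return total

# Tiny chunks force matches to straddle chunk boundaries.
print(count_overlapping_in_stream(io.StringIO("aaaa"), "aa", chunk_size=2))     # 3
print(count_overlapping_in_stream(io.StringIO("banana"), "ana", chunk_size=3))  # 2
```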

Test cases that prevent subtle regressions

Maintain a compact table of expected counts for overlap and non-overlap rules. Include empty input, repeated characters, punctuation, and Unicode examples. These cases prevent accidental behavior changes when optimizing logic or replacing regex patterns.
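Such a table might look like the sketch below. The helper count_occurrences is a hypothetical stand-in for whatever counting function your code base uses, and the rule that an empty substring returns 0 is an assumption that should match your own contract.

```python
def count_occurrences(text: str, sub: str, overlap: bool = False) -> int:
    """Hypothetical function under test, built on str.find."""
    if not sub:
        return 0  # assumed rule for empty substrings
    step = 1 if overlap else len(sub)
    count = 0
    pos = text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + step)
    return count

# (text, substring, overlap, expected count)
CASES = [
    ("", "a", False, 0),              # empty input
    ("abc", "", False, 0),            # empty substring
    ("aaaa", "aa", False, 2),         # repeated characters, non-overlapping
    ("aaaa", "aa", True, 3),          # repeated characters, overlapping
    ("banana", "ana", False, 1),
    ("banana", "ana", True, 2),
    ("a.b.c", ".", False, 2),         # punctuation
    ("naïve naïve", "ï", False, 2),   # non-ASCII text
]

for text, sub, overlap, expected in CASES:
    actual = count_occurrences(text, sub, overlap)
    assert actual == expected, (text, sub, overlap, actual, expected)
print("all cases pass")
```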

For search-heavy services, document whether matching is literal or pattern-based and whether normalization runs before counting. Clear contracts prevent data teams and backend teams from reporting different metrics from the same text source.

Common Pitfalls

Assuming str.count includes overlapping matches.
Ignoring case-folding requirements in user-generated text.
Forgetting Unicode normalization for visually similar characters.
Treating empty substring rules inconsistently across code paths.
Benchmarking with tiny examples and extrapolating to production workloads.
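The empty-substring pitfall is easy to demonstrate. In the sketch below, the built-in method and a lookahead regex both report one empty match per position, while a guarded helper (guarded_count, a hypothetical name) returns 0. Neither answer is wrong, but mixing them across code paths produces inconsistent metrics.

```python
import re

text = "abc"

# str.count treats "" as matching at every gap: len(text) + 1 positions.
builtin = text.count("")

# An empty lookahead also matches at every position, including the end.
lookahead = len(re.findall(r"(?=())", text))

def guarded_count(text: str, sub: str) -> int:
    # Assumed rule: an empty substring never counts as a match.
    if not sub:
        return 0
    return text.count(sub)

print(builtin, lookahead, guarded_count(text, ""))  # 4 4 0
```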

Summary

Use str.count for straightforward non-overlapping substring counts.
Use regex lookahead or a manual scan for overlapping counts.
Normalize case and Unicode when text sources are heterogeneous.
Make overlap behavior explicit in function design and tests.
Choose the method that matches your correctness rules before optimizing.
