Count number of occurrences of a substring in a string

Introduction

Counting substring occurrences seems simple until details like overlap, case sensitivity, and Unicode normalization appear. Different methods give different answers depending on those rules, so choosing the right approach matters. In Python, you can solve most cases with built-in methods, regular expressions, or a controlled scan loop.

Core Sections

Non-overlapping count with str.count

For standard non-overlapping counts, use the built-in method. It is concise and implemented efficiently.

python
text = "banana"
print(text.count("an"))  # 2
print(text.count("ana")) # 1

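str.count does not count overlapping matches: it scans left to right and resumes after the end of each match, so "ana" is counted once in "banana" even though it also starts at index 3. That behavior is correct for many text analytics tasks.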

Overlapping count with regex lookahead

If overlapping occurrences are required, use a lookahead pattern.

python
import re

text = "banana"
pattern = re.compile(r"(?=(ana))")
count = len(pattern.findall(text))
print(count)  # 2, matches at positions 1 and 3

Lookahead checks each position without consuming characters, which enables overlap counting.
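
The pattern above hard-codes the substring. A minimal sketch of a reusable helper follows, assuming the substring should be matched literally (hence re.escape). The name count_overlapping and the choice to return 0 for an empty substring are illustrative assumptions, not part of the original example.

```python
import re


def count_overlapping(text: str, sub: str) -> int:
    """Count occurrences of sub in text, allowing overlaps."""
    if not sub:
        return 0  # assumption: treat an empty substring as zero matches
    # re.escape makes metacharacters such as "." or "+" match literally.
    pattern = re.compile(f"(?={re.escape(sub)})")
    # Each zero-width match marks one start position of sub.
    return sum(1 for _ in pattern.finditer(text))


print(count_overlapping("banana", "ana"))  # 2
print(count_overlapping("a.a.a", "a.a"))   # 2
```

Using finditer instead of findall avoids building a list of captured strings when only the count is needed.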

Controlled scan loop for full transparency

A manual scan gives explicit control over overlap behavior and can be easier to debug in domain-specific logic.

python
def count_substring(text: str, sub: str, overlap: bool = False) -> int:
    if sub == "":
        return 0  # explicit rule: an empty substring never matches
    # After a match, advance one character (overlap) or past the match.
    step = 1 if overlap else len(sub)
    i = 0
    c = 0
    while i <= len(text) - len(sub):
        if text[i:i+len(sub)] == sub:
            c += 1
            i += step
        else:
            i += 1
    return c

print(count_substring("aaaa", "aa", overlap=False))  # 2
print(count_substring("aaaa", "aa", overlap=True))   # 3

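This approach is useful when the matching rules must be explicit and covered by tests. For example, this function returns 0 for an empty substring, whereas str.count("") returns len(text) + 1.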

Case-insensitive and normalized matching

For user text, normalize case and optionally Unicode representation before counting.

python
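import unicodedata

text = "Café cafe CAFE"
sub = "cafe"
# Apply the same normalization and case folding to both sides.
norm_text = unicodedata.normalize("NFKD", text).casefold()
norm_sub = unicodedata.normalize("NFKD", sub).casefold()
print(norm_text.count(norm_sub))  # 3

Normalization reduces surprises in multilingual data. The count here is 3 because NFKD decomposes "é" into "e" plus a combining accent, so the plain "cafe" also matches the base letters of "Café". If accented and unaccented spellings must stay distinct, normalize to a composed form such as NFC instead.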
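
To see why the normalization step matters, the sketch below compares two visually identical spellings of "café": one uses the precomposed character U+00E9, the other uses "e" followed by the combining accent U+0301. The variable names and the nfc helper are illustrative.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent
text = f"{composed} {decomposed}"

# Without normalization only the identically encoded spelling matches.
print(text.count(composed))  # 1


def nfc(s: str) -> str:
    """Normalize to the composed form (NFC)."""
    return unicodedata.normalize("NFC", s)


# After normalizing both sides to the same form, both spellings match.
print(nfc(text).count(nfc(decomposed)))  # 2
```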

Performance guidance

For one-off checks on moderately sized text, str.count is usually the best choice. Regex is more expressive for complex matching rules but is often slower for plain literal counts. For very large streams, process the input chunk by chunk and account for matches that straddle chunk boundaries. Always benchmark with representative input, because an approach that looks fastest on tiny strings can regress on real workloads.
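
A rough way to compare approaches is the standard-library timeit module. The sketch below uses synthetic input, which is an assumption for illustration only; swap in a sample of real data before drawing conclusions. Note that the two functions answer different questions (non-overlapping versus overlapping counts), so the timing only shows the relative cost of each technique.

```python
import re
import timeit

# Synthetic input; substitute a sample of your real data.
text = "banana " * 20_000
sub = "ana"
lookahead = re.compile(f"(?={re.escape(sub)})")


def with_count() -> int:
    # Non-overlapping count using the built-in method.
    return text.count(sub)


def with_regex() -> int:
    # Overlapping count using a zero-width lookahead.
    return sum(1 for _ in lookahead.finditer(text))


runs = 10
for fn in (with_count, with_regex):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```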

Streaming approach for very large files

When input does not fit comfortably in memory, process text in chunks and keep a trailing window to preserve boundary matches.

python
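def count_in_stream(file_obj, sub: str, chunk_size: int = 4096) -> int:
    if not sub:
        return 0

    carry = ""
    total = 0
    keep = len(sub) - 1

    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        text = carry + chunk
        # Scan with find() so we know where the last match ended.
        pos = 0
        while True:
            hit = text.find(sub, pos)
            if hit == -1:
                break
            total += 1
            pos = hit + len(sub)
        # Carry at most len(sub) - 1 characters, and never characters
        # already consumed by a match, to avoid double counting.
        carry = text[max(pos, len(text) - keep):]

    return total

This method keeps memory bounded while preserving correctness at chunk boundaries. The carry holds at most len(sub) - 1 trailing characters, which is enough to catch a match that straddles two chunks. It also skips characters already consumed by a match: carrying them over would count "aa" twice in "aaa" when the chunk boundary falls after the second character. The file object must be opened in text mode, because carry and sub are str. For overlap counting in streams, advance by one character after each hit instead of jumping past the match.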
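
As a sketch of the overlapping variant mentioned above, the function below advances one character after each hit. The name count_overlapping_in_stream is hypothetical, and io.StringIO stands in for a real text-mode file object.

```python
import io


def count_overlapping_in_stream(file_obj, sub: str, chunk_size: int = 4096) -> int:
    if not sub:
        return 0  # assumption: empty substring counts as zero matches

    carry = ""
    total = 0
    keep = len(sub) - 1

    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        text = carry + chunk
        # Advance one character after each hit so overlaps are counted.
        pos = text.find(sub)
        while pos != -1:
            total += 1
            pos = text.find(sub, pos + 1)
        # The carry is shorter than sub, so it cannot hold a complete
        # match that was already counted.
        carry = text[-keep:] if keep > 0 else ""

    return total


print(count_overlapping_in_stream(io.StringIO("aaaa"), "aa", chunk_size=3))  # 3
```

Because every counted match must include at least one character from the newest chunk, a match is never counted twice, whatever the chunk size.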

Test cases that prevent subtle regressions

Maintain a compact table of expected counts for overlap and non-overlap rules. Include empty input, repeated characters, punctuation, and Unicode examples. These cases prevent accidental behavior changes when optimizing logic or replacing regex patterns.
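
One possible shape for such a table is sketched below. The count_substring function here is a compact stand-in for whatever implementation is under test, and the expected value for an empty substring reflects the assumed rule that it counts as zero matches.

```python
def count_substring(text: str, sub: str, overlap: bool = False) -> int:
    """Stand-in for the function under test; an empty sub counts as 0."""
    if not sub:
        return 0
    step = 1 if overlap else len(sub)
    count = 0
    pos = text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + step)
    return count


# (text, sub, overlap, expected)
CASES = [
    ("", "a", False, 0),                    # empty input
    ("abc", "", False, 0),                  # empty substring (documented choice)
    ("aaaa", "aa", False, 2),               # repeated characters, non-overlapping
    ("aaaa", "aa", True, 3),                # repeated characters, overlapping
    ("banana", "ana", False, 1),
    ("banana", "ana", True, 2),
    ("a.b.a.b", ".", False, 3),             # punctuation is matched literally
    ("na\u00efve na\u00efve", "\u00ef", False, 2),  # non-ASCII text
]

for text, sub, overlap, expected in CASES:
    got = count_substring(text, sub, overlap)
    assert got == expected, (text, sub, overlap, got, expected)
print(f"{len(CASES)} cases passed")
```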

For search-heavy services, document whether matching is literal or pattern-based and whether normalization runs before counting. Clear contracts prevent data teams and backend teams from reporting different metrics from the same text source.
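
One way to make that contract visible in code is to pass the rules around as a single object. The sketch below is only an illustration: CountRules, prepare, and count_literal are hypothetical names, matching is literal only, and an empty substring is assumed to count as zero.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CountRules:
    """Hypothetical contract object; field names are illustrative."""
    overlap: bool = False
    casefold: bool = False
    normalize_form: Optional[str] = None  # e.g. "NFC" or "NFKD"


def prepare(s: str, rules: CountRules) -> str:
    # Normalization runs before case folding, on text and substring alike.
    if rules.normalize_form:
        s = unicodedata.normalize(rules.normalize_form, s)
    return s.casefold() if rules.casefold else s


def count_literal(text: str, sub: str, rules: CountRules) -> int:
    text, sub = prepare(text, rules), prepare(sub, rules)
    if not sub:
        return 0
    step = 1 if rules.overlap else len(sub)
    count, pos = 0, text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + step)
    return count


print(count_literal("Banana BANANA", "ana", CountRules()))  # 1
print(count_literal("Banana BANANA", "ana", CountRules(overlap=True, casefold=True)))  # 4
```

Two teams that log the same CountRules value alongside their metrics can tell at a glance whether their counts are comparable.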

Common Pitfalls

  • Assuming str.count includes overlapping matches.
  • Ignoring case-folding requirements in user-generated text.
  • Forgetting Unicode normalization for visually similar characters.
  • Treating empty substring rules inconsistently across code paths.
  • Benchmarking with tiny examples and extrapolating to production workloads.

Summary

  • Use str.count for straightforward non-overlapping substring counts.
  • Use regex lookahead or a manual scan for overlapping counts.
  • Normalize case and Unicode when text sources are heterogeneous.
  • Make overlap behavior explicit in function design and tests.
  • Choose the method that matches your correctness rules before optimizing.
