Count number of occurrences of a substring in a string
Introduction
Counting substring occurrences seems simple until details like overlap, case sensitivity, and Unicode normalization appear. Different methods give different answers depending on those rules, so choosing the right approach matters. In Python, you can solve most cases with built-in methods, regular expressions, or a controlled scan loop.
Core Sections
Non-overlapping count with str.count
For standard non-overlapping counts, use the built-in method. It is concise and implemented efficiently.
str.count does not count overlapping matches. That behavior is correct for many text analytics tasks.
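As a minimal illustration of the non-overlapping behavior described above, the sample strings below are made up for this sketch:

```python
text = "banana bandana"

# str.count scans left to right and resumes searching after the end of each match.
print(text.count("an"))      # 4
print("aaaa".count("aa"))    # 2, not 3: only the matches at index 0 and 2 are counted

# Optional start/end arguments restrict the search to a slice of the string.
print(text.count("an", 7))   # 2 (only the matches inside "bandana")
```

The second call is the one that tends to surprise people: the match starting at index 1 is skipped because it overlaps the match at index 0.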
Overlapping count with regex lookahead
If overlapping occurrences are required, use a lookahead pattern.
Lookahead checks each position without consuming characters, which enables overlap counting.
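One way to sketch the lookahead technique is shown here. The function name `count_overlapping` is a hypothetical helper, and the sketch assumes the substring should be matched literally, so it is passed through `re.escape`:

```python
import re

def count_overlapping(text: str, sub: str) -> int:
    """Count occurrences of a literal substring, allowing overlaps."""
    if not sub:
        # Assumption for this sketch: an empty substring is rejected rather than counted.
        raise ValueError("sub must be non-empty")
    # (?=...) is a zero-width assertion, so the regex engine advances one
    # character at a time and every starting position is tested.
    pattern = re.compile(f"(?={re.escape(sub)})")
    return sum(1 for _ in pattern.finditer(text))

print(count_overlapping("aaaa", "aa"))      # 3
print(count_overlapping("abababa", "aba"))  # 3
```

Without `re.escape`, a substring such as `"."` or `"a+"` would be interpreted as a pattern rather than as literal text.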
Controlled scan loop for full transparency
A manual scan gives explicit control over overlap behavior and can be easier to debug in domain-specific logic.
This approach is useful when the matching rules must be spelled out in code and covered by tests.
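A manual scan might look like the following sketch, built on `str.find`. The name `count_occurrences` and the `overlapping` flag are illustrative choices, not a standard API:

```python
def count_occurrences(text: str, sub: str, overlapping: bool = False) -> int:
    """Count literal occurrences of sub in text with an explicit overlap rule."""
    if not sub:
        # Assumption for this sketch: empty substrings are treated as an error.
        raise ValueError("sub must be non-empty")
    # After a match, either move one character (overlaps allowed)
    # or jump past the whole match (overlaps disallowed).
    step = 1 if overlapping else len(sub)
    count = 0
    pos = text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + step)
    return count

print(count_occurrences("aaaa", "aa"))                    # 2
print(count_occurrences("aaaa", "aa", overlapping=True))  # 3
```

Because the overlap rule is a single `step` value, it is easy to log, test, or replace with a domain-specific rule.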
Case-insensitive and normalized matching
For user text, normalize case and optionally Unicode representation before counting.
Normalization reduces surprises in multilingual data.
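The sketch below shows one possible normalization step before counting. The helper names are hypothetical, and the choice of `casefold` followed by NFC is an assumption; some applications may prefer NFKC or a stricter caseless-matching procedure:

```python
import unicodedata

def normalize(s: str) -> str:
    # casefold() is a more aggressive lower() intended for caseless comparison;
    # NFC then composes base characters and combining marks into single code points.
    return unicodedata.normalize("NFC", s.casefold())

def count_normalized(text: str, sub: str) -> int:
    """Non-overlapping count after applying the same normalization to both inputs."""
    return normalize(text).count(normalize(sub))

composed = "Caf\u00e9 CAF\u00c9"   # "é" and "É" as single code points
decomposed = "cafe\u0301"          # "e" followed by a combining acute accent

print(composed.count(decomposed))              # 0: raw code points differ
print(count_normalized(composed, decomposed))  # 2
```

The key point is that the text and the substring go through the same function; normalizing only one side reintroduces the mismatch.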
Performance guidance
For one-off checks in moderate text, str.count is usually best. For complex matching rules, regex can be expressive but sometimes slower. For very large streams, process chunk by chunk and account for boundary overlap between chunks. Always benchmark with representative input, because an approach that looks fastest on tiny strings can turn out slower on real workloads.
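A rough way to compare approaches is the standard `timeit` module. The input below is synthetic and only a stand-in; real measurements should use text that resembles production data, and the timings will vary by machine and Python version:

```python
import re
import timeit

# Synthetic input for illustration only.
text = "abcab" * 100_000
sub = "abcab"

def with_count() -> int:
    return text.count(sub)

def with_regex() -> int:
    # Non-overlapping regex count, comparable to str.count.
    return len(re.findall(re.escape(sub), text))

for fn in (with_count, with_regex):
    # Take the best of several repeats to reduce noise from other processes.
    seconds = min(timeit.repeat(fn, number=5, repeat=3))
    print(f"{fn.__name__}: {seconds:.4f}s for 5 calls")
```

Checking that both functions return the same count before comparing their speed avoids benchmarking two different behaviors.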
Streaming approach for very large files
When input does not fit comfortably in memory, process text in chunks and keep a trailing window to preserve boundary matches.
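This method keeps memory bounded while still catching matches that straddle chunk boundaries. The retained window needs at most one character fewer than the substring length, and it must exclude text already consumed by a counted match; otherwise a substring that overlaps itself, such as "aa", can be counted twice. For overlap counting in streams, use a custom scan function instead of raw count.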
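The following is a sketch of chunked counting under a few assumptions: chunks arrive as `str` objects, the names `read_chunks` and `count_in_stream` are hypothetical, and the default chunk size is arbitrary. It uses a scan loop rather than `str.count` so that the retained tail never includes text already consumed by a match:

```python
from typing import Iterable, Iterator

def read_chunks(path: str, size: int = 1 << 20, encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded text chunks; text mode avoids splitting multi-byte characters."""
    with open(path, encoding=encoding) as f:
        yield from iter(lambda: f.read(size), "")

def count_in_stream(chunks: Iterable[str], sub: str, overlapping: bool = False) -> int:
    if not sub:
        raise ValueError("sub must be non-empty")
    step = 1 if overlapping else len(sub)
    keep = len(sub) - 1  # longest tail that could begin a match finished by the next chunk
    buffer = ""
    count = 0
    for chunk in chunks:
        buffer += chunk
        pos = 0
        while True:
            found = buffer.find(sub, pos)
            if found == -1:
                break
            count += 1
            pos = found + step
        # Drop everything already scanned, but never keep text before pos,
        # so a counted match cannot contribute to a second count.
        buffer = buffer[max(pos, len(buffer) - keep):]
    return count

print(count_in_stream(["aa", "a"], "aa"))                    # 1, same as "aaa".count("aa")
print(count_in_stream(["aa", "a"], "aa", overlapping=True))  # 2
print(count_in_stream(["ab", "cab", "c"], "abc"))            # 2
```

For a file, the call would look like `count_in_stream(read_chunks("big.log"), "ERROR")`, where the path is a placeholder.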
Test cases that prevent subtle regressions
Maintain a compact table of expected counts for overlap and non-overlap rules. Include empty input, repeated characters, punctuation, and Unicode examples. These cases prevent accidental behavior changes when optimizing logic or replacing regex patterns.
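Such a table could be sketched as plain data plus a loop. The two counting functions here are stand-ins for whatever implementations a project actually uses, and the cases are examples rather than a complete suite:

```python
import re

def count_non_overlapping(text: str, sub: str) -> int:
    return text.count(sub)

def count_overlapping(text: str, sub: str) -> int:
    return len(re.findall(f"(?={re.escape(sub)})", text))

# (text, substring, expected non-overlapping, expected overlapping)
CASES = [
    ("", "a", 0, 0),               # empty input
    ("aaaa", "aa", 2, 3),          # repeated characters
    ("a.b.c", ".", 2, 2),          # punctuation that is special in regex
    ("ha-ha-ha", "ha-ha", 1, 2),   # self-overlapping phrase
    ("na\u00efve na\u00efve", "\u00ef", 2, 2),  # non-ASCII character
]

def run_cases() -> None:
    for text, sub, expected_plain, expected_overlap in CASES:
        assert count_non_overlapping(text, sub) == expected_plain, (text, sub)
        assert count_overlapping(text, sub) == expected_overlap, (text, sub)

run_cases()

# An empty substring is a separate decision: str.count returns len(text) + 1.
assert "abc".count("") == 4
```

Keeping the expectations in one table makes it obvious when a refactor changes overlap behavior, because a single row fails with its inputs attached.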
For search-heavy services, document whether matching is literal or pattern-based and whether normalization runs before counting. Clear contracts prevent data teams and backend teams from reporting different metrics from the same text source.
Common Pitfalls
- Assuming str.count includes overlapping matches.
- Ignoring case-folding requirements in user-generated text.
- Forgetting Unicode normalization for visually similar characters.
- Treating empty substring rules inconsistently across code paths.
- Benchmarking with tiny examples and extrapolating to production workloads.
Summary
- Use str.count for straightforward non-overlapping substring counts.
- Use regex lookahead or a manual scan for overlapping counts.
- Normalize case and Unicode when text sources are heterogeneous.
- Make overlap behavior explicit in function design and tests.
- Choose the method that matches your correctness rules before optimizing.

