Get the index of the nth occurrence of a string?
Introduction
Finding the index of the nth occurrence of a substring is a common text-processing task in parsing, log analysis, and validation code. The main challenge is handling edge cases such as overlapping matches and missing occurrences. A reliable solution should define those rules up front and keep complexity clear.
Core Sections
Define What nth occurrence Means
Before writing code, decide whether matches can overlap. For example, in "aaaa", the substring "aa" appears at indices 0, 1, and 2 if overlap is allowed, but only at 0 and 2 in non-overlapping mode. Implementations that leave this choice implicit can return different indices for the same input, depending on how far the search loop advances after each match.
Non-overlapping Search with str.find
A practical baseline is to call str.find repeatedly, moving the start index past each match.
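The repeated-find loop can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name find_nth, the 1-based n, and the choice to return -1 for a missing occurrence are assumptions made for the example.

```python
def find_nth(haystack: str, needle: str, n: int) -> int:
    """Return the index of the n-th (1-based) non-overlapping occurrence.

    Returns -1 if there are fewer than n occurrences (an assumed convention).
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    if not needle:
        raise ValueError("needle must not be empty")

    start = 0
    index = -1
    for _ in range(n):
        index = haystack.find(needle, start)
        if index == -1:
            return -1
        # Skip the whole match so the next search cannot overlap it.
        start = index + len(needle)
    return index


print(find_nth("aaaa", "aa", 2))   # 2
print(find_nth("a-b-c", "-", 2))   # 3
print(find_nth("a-b-c", "-", 3))   # -1
```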
This method is easy to read and efficient for normal workloads.
Overlapping Search Variant
If overlap is required, advance by one character instead of len(needle).
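A sketch of the overlapping variant, under the same assumptions as before (hypothetical function name, 1-based n, -1 when not found). The only change from the non-overlapping loop is the step size after each match.

```python
def find_nth_overlapping(haystack: str, needle: str, n: int) -> int:
    """Return the index of the n-th (1-based) occurrence, allowing overlap.

    Returns -1 if there are fewer than n occurrences (an assumed convention).
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    if not needle:
        raise ValueError("needle must not be empty")

    start = 0
    index = -1
    for _ in range(n):
        index = haystack.find(needle, start)
        if index == -1:
            return -1
        # Advance by one character so the next match may overlap this one.
        start = index + 1
    return index


print(find_nth_overlapping("aaaa", "aa", 3))  # 2
print(find_nth_overlapping("aaaa", "aa", 4))  # -1
```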
Explicit overlap behavior prevents ambiguity in downstream logic.
Regex Option for Pattern-based Matches
If matching rules are more complex than plain substring search, use regular expressions. Regex also supports lookahead-based overlapping matches.
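One way to sketch the regex approach uses re.finditer, which yields matches lazily, together with itertools.islice to skip straight to the target match. The function name find_nth_regex and the overlapping flag are assumptions for this example; wrapping the pattern in a zero-width lookahead is what allows overlapping matches.

```python
import re
from itertools import islice


def find_nth_regex(text: str, pattern: str, n: int, overlapping: bool = False) -> int:
    """Return the start index of the n-th (1-based) regex match, or -1."""
    if n < 1:
        raise ValueError("n must be >= 1")

    # A lookahead consumes no characters, so the scan moves one position
    # at a time and can report matches that overlap each other.
    source = f"(?=(?:{pattern}))" if overlapping else pattern
    regex = re.compile(source)

    # islice pulls matches one by one and stops at the n-th,
    # so the full list of matches is never built.
    match = next(islice(regex.finditer(text), n - 1, None), None)
    return match.start() if match else -1


print(find_nth_regex("x1 y22 z333", r"\d+", 3))           # 8
print(find_nth_regex("aaaa", "aa", 3, overlapping=True))  # 2
```

For a literal substring, pass re.escape(needle) as the pattern so that characters such as "." or "*" are not interpreted as regex syntax.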
For very large text, avoid materializing all matches when you only need one target occurrence.
Streaming and Large-file Considerations
When processing large files, reading all content into memory may be expensive. Stream line by line, track the cumulative offset, and stop as soon as the nth occurrence is found. Note that a line-by-line scan cannot find a match that spans a line break.
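A minimal sketch of the streaming idea, assuming non-overlapping matches and a needle that contains no newline. The function name find_nth_in_lines is hypothetical. It accepts any iterable of lines, such as an open text file, and returns a character offset from the start of the stream, which may differ from the byte offset on disk.

```python
from typing import Iterable


def find_nth_in_lines(lines: Iterable[str], needle: str, n: int) -> int:
    """Return the character offset of the n-th (1-based) occurrence, or -1.

    Assumes the needle never spans a line boundary.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    if not needle:
        raise ValueError("needle must not be empty")

    offset = 0      # characters consumed before the current line
    remaining = n   # occurrences still to find
    for line in lines:
        start = 0
        while True:
            index = line.find(needle, start)
            if index == -1:
                break
            remaining -= 1
            if remaining == 0:
                # Stop immediately; later lines are never read.
                return offset + index
            start = index + len(needle)
        offset += len(line)
    return -1


sample = ["ab\n", "cab\n", "ab"]
print(find_nth_in_lines(sample, "ab", 3))  # 7
```

With a real file, the call would look like find_nth_in_lines(open(path, encoding="utf-8"), needle, n), ideally inside a with block so the file is closed.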
This approach scales better for long logs and batch pipelines.
Testing Strategy
Write tests for missing results, first and last occurrences, overlap behavior, and invalid inputs. Keep tests explicit so behavior stays stable during refactoring.
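The test cases listed above might look like the following unittest sketch. The helper nth_index and its overlapping flag are hypothetical stand-ins for whatever function is under test; it is included only so the example is self-contained.

```python
import unittest


def nth_index(haystack: str, needle: str, n: int, overlapping: bool = False) -> int:
    """Hypothetical function under test: n-th (1-based) occurrence or -1."""
    if n < 1:
        raise ValueError("n must be >= 1")
    if not needle:
        raise ValueError("needle must not be empty")
    step = 1 if overlapping else len(needle)
    start = 0
    index = -1
    for _ in range(n):
        index = haystack.find(needle, start)
        if index == -1:
            return -1
        start = index + step
    return index


class NthIndexTests(unittest.TestCase):
    def test_missing_occurrence_returns_minus_one(self):
        self.assertEqual(nth_index("abc", "x", 1), -1)
        self.assertEqual(nth_index("a-b", "-", 2), -1)

    def test_first_and_last_occurrence(self):
        self.assertEqual(nth_index("a-b-c", "-", 1), 1)
        self.assertEqual(nth_index("a-b-c", "-", 2), 3)

    def test_overlap_behavior_is_explicit(self):
        self.assertEqual(nth_index("aaaa", "aa", 2), 2)
        self.assertEqual(nth_index("aaaa", "aa", 2, overlapping=True), 1)
        self.assertEqual(nth_index("aaaa", "aa", 3), -1)

    def test_invalid_inputs_raise(self):
        with self.assertRaises(ValueError):
            nth_index("abc", "a", 0)
        with self.assertRaises(ValueError):
            nth_index("abc", "", 1)
```

Saved in a module, these tests can be run with python -m unittest.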
API Design for Reusable Text Utilities
When this logic is reused across services, package it as a utility with explicit options such as overlapping mode and not-found behavior. Returning -1 is common, but some codebases prefer None or raised exceptions for missing occurrences. Pick one style and document it clearly.
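As an illustration of such a utility, the sketch below exposes both choices as keyword-only options. The name index_of_nth and the parameters overlapping and strict are assumptions for this example, not an established API. Here the default not-found result is -1, and strict=True raises ValueError instead, mirroring the difference between str.find and str.index.

```python
def index_of_nth(
    haystack: str,
    needle: str,
    n: int,
    *,
    overlapping: bool = False,
    strict: bool = False,
) -> int:
    """Return the index of the n-th (1-based) occurrence of needle.

    overlapping: if True, matches may share characters.
    strict: if True, raise ValueError when the occurrence is missing;
            otherwise return -1.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    if not needle:
        raise ValueError("needle must not be empty")

    step = 1 if overlapping else len(needle)
    start = 0
    index = -1
    for _ in range(n):
        index = haystack.find(needle, start)
        if index == -1:
            if strict:
                raise ValueError(f"fewer than {n} occurrences of {needle!r}")
            return -1
        start = index + step
    return index


print(index_of_nth("aaaa", "aa", 3, overlapping=True))  # 2
print(index_of_nth("aaaa", "aa", 3))                    # -1
```

Keyword-only options keep call sites self-describing: index_of_nth(text, sep, 2, overlapping=True) states its overlap rule where it is used.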
This utility form makes call sites easier to read and keeps edge-case behavior centralized.
Common Pitfalls
- Not defining overlap behavior and returning inconsistent indexes.
- Silently accepting n = 0 or a negative n instead of validating input.
- Forgetting to guard against empty substring searches, where str.find reports a match at every position.
- Building regex-only solutions for simple exact matches and adding unnecessary complexity.
- Materializing all matches when only one target occurrence is needed.
Summary
- Clarify overlap rules before implementation.
- Use repeated find for straightforward non-overlapping searches.
- Use one-step advancement for overlapping match logic.
- Use regex when pattern matching is genuinely needed.
- Add edge-case tests to keep indexing behavior predictable.

