string algorithms
substring search
algorithm design
computational linguistics
string analysis

Algorithm to find the most common substrings in a string

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Finding the most common substrings is a frequency-analysis problem, but algorithm choice depends on whether substring length is fixed or variable. A brute-force approach can explode in cost for long strings, so practical solutions use sliding windows, counters, and selective length policies. The best implementation starts with a clear definition of what counts as a valid substring and whether overlaps are allowed.

Define the Exact Problem First

Before coding, decide these rules:

  • fixed length only or range of lengths
  • overlapping matches allowed or disallowed
  • case-sensitive or normalized text
  • include punctuation or strip it

Without these rules, two correct implementations can produce different outputs.

For most text analytics tasks, overlapping matches are allowed and comparison is case-sensitive unless explicitly normalized.

Fixed-Length Sliding Window with Counter

For one length k, sliding-window counting is simple and efficient.

python
1from collections import Counter
2from typing import List, Tuple
3
4
5def top_k_substrings_fixed(s: str, k: int, top_n: int = 5) -> List[Tuple[str, int]]:
6    if k <= 0:
7        raise ValueError("k must be positive")
8    if k > len(s):
9        return []
10
11    counts = Counter(s[i : i + k] for i in range(len(s) - k + 1))
12    return counts.most_common(top_n)
13
14
15print(top_k_substrings_fixed("banana", 2, 3))

This runs in O of n windows for fixed k and is often enough for log token or DNA k-mer style analysis.

Multiple Lengths in One Pass per Length

If you need several lengths, repeat fixed-window counting for each length in selected range.

python
1from collections import Counter
2from typing import Dict, Tuple
3
4
5def top_k_substrings_multi(s: str, min_len: int, max_len: int, top_n: int = 3) -> Dict[int, list[Tuple[str, int]]]:
6    if min_len <= 0 or max_len < min_len:
7        raise ValueError("invalid length range")
8
9    out = {}
10    for k in range(min_len, max_len + 1):
11        if k > len(s):
12            out[k] = []
13            continue
14        counts = Counter(s[i : i + k] for i in range(len(s) - k + 1))
15        out[k] = counts.most_common(top_n)
16    return out
17
18
19print(top_k_substrings_multi("abracadabra", 2, 4, top_n=2))

This approach is clear and maintainable, especially when input strings are moderate in size.

Handling Large Inputs Efficiently

For very large strings, substring materialization can allocate many temporary strings. Practical optimizations include:

  • restricting candidate lengths
  • processing text in chunks when full-memory scan is not required
  • using rolling hash or suffix structures for advanced workloads

For many production pipelines, the biggest gain comes from reducing length range, not from exotic data structures.

If the problem grows beyond moderate text sizes or requires many substring lengths, suffix arrays or suffix automata become worth considering. They are more complex to implement, but they can avoid repeated rescans when the workload justifies the added algorithmic machinery.

Normalization Strategy

Normalization changes frequency results significantly. Decide it explicitly.

python
1import re
2
3
4def normalize_text(s: str) -> str:
5    s = s.lower()
6    s = re.sub(r"\s+", " ", s.strip())
7    return s
8
9
10text = "Data data, DATA!"
11print(normalize_text(text))

If punctuation and case are not meaningful, normalize before counting.

Overlap Versus Non-Overlap Counting

The previous examples count overlapping substrings. For non-overlapping counts, increment index by k after a match window rather than one character.

python
1def non_overlapping_counts(s: str, k: int):
2    out = {}
3    i = 0
4    while i + k <= len(s):
5        sub = s[i : i + k]
6        out[sub] = out.get(sub, 0) + 1
7        i += k
8    return out
9
10
11print(non_overlapping_counts("aaaaaa", 2))

Use non-overlap only when domain semantics require chunked segmentation.

Returning Ties and Deterministic Ordering

Counter.most_common can produce tie orders based on insertion details. If deterministic lexical order is needed for equal counts, post-sort explicitly.

python
1from collections import Counter
2
3
4def top_with_stable_ties(s: str, k: int):
5    counts = Counter(s[i : i + k] for i in range(len(s) - k + 1))
6    return sorted(counts.items(), key=lambda p: (-p[1], p[0]))

Deterministic outputs matter for tests and reproducible reports.

If you publish the results to users or downstream systems, document the tie rule explicitly. "Most common" sounds simple, but different tie handling can produce different top lists from the same input.

Validation and Testing

Useful tests:

  • empty input
  • k greater than string length
  • repeated characters such as aaaaa
  • mixed case normalization checks
  • punctuation handling

Testing edge conditions prevents misinterpretation of frequency outputs.

Common Pitfalls

A common pitfall is generating all possible lengths by default and causing unnecessary runtime and memory cost. Another is forgetting to define overlap semantics, which changes counts dramatically on repeated patterns. Teams also often skip normalization policy and compare results between pipelines that process case and punctuation differently. Finally, relying on unstable tie ordering can produce flaky tests.

Summary

  • Most-common-substring detection is mainly a counting problem with clear policy choices.
  • Sliding-window counting is efficient for fixed lengths.
  • Multi-length analysis should use constrained ranges to stay practical.
  • Normalize text intentionally when case or punctuation should be ignored.
  • Define overlap and tie behavior explicitly for consistent results.

Course illustration
Course illustration

All Rights Reserved.