String Comparison
Fuzzy Matching
Tolerance Thresholds
Algorithmic Accuracy
Pattern Recognition

Comparing strings with tolerance

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Comparing strings with tolerance means accepting near matches instead of requiring exact equality. That is useful when input may contain typos, inconsistent casing, spacing differences, or OCR-style noise, but the correct approach depends on what kind of "difference" you are willing to tolerate.

Start by Defining Tolerance

Tolerance can mean several different things:

  • ignore case
  • ignore extra whitespace
  • allow small spelling mistakes
  • allow reordered words
  • allow only a fixed number of edits

Those are not the same problem. Before picking an algorithm, decide whether you want normalization, edit distance, token comparison, or all three.

Normalize First

Many "fuzzy match" problems disappear once the strings are normalized.

python
1def normalize(text: str) -> str:
2    return " ".join(text.casefold().split())
3
4
5left = "  New   York "
6right = "new york"
7
8print(normalize(left) == normalize(right))

This handles case and spacing differences without the cost of a full fuzzy algorithm.

Use Levenshtein Distance for Typo Tolerance

If you want to allow a small number of insertions, deletions, or substitutions, Levenshtein distance is a strong baseline.

python
1def levenshtein(a: str, b: str) -> int:
2    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
3
4    for i in range(len(a) + 1):
5        dp[i][0] = i
6    for j in range(len(b) + 1):
7        dp[0][j] = j
8
9    for i in range(1, len(a) + 1):
10        for j in range(1, len(b) + 1):
11            cost = 0 if a[i - 1] == b[j - 1] else 1
12            dp[i][j] = min(
13                dp[i - 1][j] + 1,
14                dp[i][j - 1] + 1,
15                dp[i - 1][j - 1] + cost,
16            )
17
18    return dp[-1][-1]
19
20
21print(levenshtein("kitten", "sitting"))
22print(levenshtein("color", "colour"))

The smaller the distance, the closer the strings are.

Turn Distance into a Threshold

Distance alone is not enough. You still need a policy for deciding what counts as "close enough."

python
1def roughly_equal(a: str, b: str, max_distance: int = 1) -> bool:
2    a_norm = normalize(a)
3    b_norm = normalize(b)
4    return levenshtein(a_norm, b_norm) <= max_distance
5
6
7print(roughly_equal("Jon", "John", max_distance=1))
8print(roughly_equal("Jon", "Johnson", max_distance=1))

A fixed threshold works well for short strings. For longer strings, a ratio-based rule is often better because one edit means something very different in a three-character word than in a fifty-character product title.

Use Similarity Ratios for Variable-Length Text

Python's standard library includes difflib, which provides a quick similarity ratio.

python
1from difflib import SequenceMatcher
2
3
4def similarity(a: str, b: str) -> float:
5    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
6
7
8print(similarity("International Business Machines", "international business machine"))
9print(similarity("cat", "car"))

This is often easier to tune than raw edit distance when strings vary widely in length.

Token-Based Comparison for Reordered Text

If word order is noisy, compare sets or sorted tokens instead of raw character sequences.

python
1def token_key(text: str) -> str:
2    tokens = normalize(text).split()
3    return " ".join(sorted(tokens))
4
5
6print(token_key("york new city"))
7print(token_key("new york city"))
8print(token_key("york new city") == token_key("new york city"))

This is useful for names, tags, and titles where order is less important than content.

Match the Algorithm to the Domain

A good rule is:

  • use normalization for formatting noise
  • use edit distance for typos
  • use token methods for order-insensitive text
  • combine them if the data is messy

Do not jump straight to a heavy fuzzy library if lower-cost normalization already solves the actual problem.

Common Pitfalls

The biggest mistake is treating "tolerance" as one universal setting. Different domains need different definitions of acceptable difference.

Another issue is using the same edit-distance threshold for short and long strings. A distance of two may be huge for a short code and trivial for a long sentence.

A third problem is skipping normalization and then blaming the fuzzy algorithm for differences caused only by casing or whitespace.

Summary

  • Define what kind of difference you want to tolerate before choosing an algorithm.
  • Normalize strings first to remove trivial formatting differences.
  • Use Levenshtein distance when typo tolerance is the main requirement.
  • Use ratio or token-based comparison when string lengths or word order vary.
  • Tune thresholds against real data, not just synthetic examples.

Course illustration
Course illustration

All Rights Reserved.