Algorithms for fuzzy matching strings

fuzzy matching

string matching

algorithms

text processing

computational linguistics

Algorithms for fuzzy matching strings

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Fuzzy string matching is about finding strings that are similar enough, not necessarily identical. The right algorithm depends on what kind of mistakes you expect: character typos, swapped letters, missing accents, reordered tokens, or pronunciation-based spelling differences.

Edit Distance for Typo-Like Errors

When the main problem is small character edits, edit-distance algorithms are a strong starting point. Levenshtein distance counts insertions, deletions, and substitutions.

python

1def levenshtein(a: str, b: str) -> int:
2    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
3    for i in range(len(a) + 1):
4        dp[i][0] = i
5    for j in range(len(b) + 1):
6        dp[0][j] = j
7
8    for i in range(1, len(a) + 1):
9        for j in range(1, len(b) + 1):
10            cost = 0 if a[i - 1] == b[j - 1] else 1
11            dp[i][j] = min(
12                dp[i - 1][j] + 1,
13                dp[i][j - 1] + 1,
14                dp[i - 1][j - 1] + cost,
15            )
16    return dp[-1][-1]
17
18print(levenshtein('kitten', 'sitting'))

This works well for spelling mistakes and short text fields. If adjacent transpositions such as form versus from matter, Damerau-Levenshtein can be a better fit.

Token-Based Matching for Multi-Word Fields

For names, titles, or addresses, token order is often less important than token presence. Token-based metrics handle that better than pure character distance.

python

1def jaccard(a: str, b: str) -> float:
2    left = set(a.lower().split())
3    right = set(b.lower().split())
4    if not left and not right:
5        return 1.0
6    return len(left & right) / len(left | right)
7
8print(jaccard('new york city', 'york new city'))

This is robust to word reordering, but it does not notice small spelling mistakes inside each token. That is why many practical systems combine token-based and character-based signals.

N-Gram Similarity for Noisy Text

Character n-grams are a good middle ground when strings may have partial overlap, spacing noise, or missing characters.

python

1def trigrams(s: str):
2    s = f'  {s.lower()} '
3    return {s[i:i+3] for i in range(len(s) - 2)}
4
5
6def dice(a: str, b: str) -> float:
7    A, B = trigrams(a), trigrams(b)
8    if not A and not B:
9        return 1.0
10    if not A or not B:
11        return 0.0
12    return 2 * len(A & B) / (len(A) + len(B))
13
14print(dice('algorithm', 'algoritm'))

N-grams are especially useful in large-scale candidate generation because they work well with inverted indexes and other fast retrieval structures.

Phonetic Algorithms for Sound-Alike Matching

When spelling varies but pronunciation remains similar, phonetic encodings such as Soundex or Metaphone can improve recall. These are common in person-name matching and deduplication systems.

Phonetic methods are rarely enough on their own. They are usually used as one feature among several rather than as the final decision rule.

Choose the Algorithm by Error Pattern

A useful mental model is:

use edit distance for typo-like character errors
use token similarity for reordered multi-word fields
use n-grams for noisy partial overlap and scalable candidate search
use phonetic encodings when pronunciation matters

The best production systems often combine several of these and then tune thresholds on labeled examples.

Scaling Matters More Than the Metric Alone

On a large dataset, the real challenge is often candidate generation rather than scoring. Running an expensive fuzzy metric against every possible pair is usually too slow.

Practical systems therefore normalize first, block into candidate groups, retrieve a short list quickly, and only then compute more expensive similarity scores.

Common Pitfalls

Using one fuzzy metric for every field type ignores how different data errors really behave.

Skipping normalization such as lowercasing, accent folding, or punctuation cleanup wastes score quality before the algorithm even starts.

Running expensive edit distance across all pairs without blocking does not scale.

Summary

Fuzzy matching is not one algorithm but a family of strategies for different error patterns.
Edit distance, token similarity, n-grams, and phonetic encodings solve different problems.
The right choice depends on the data and on what kinds of mistakes you expect.
At scale, candidate generation and threshold tuning matter as much as the scoring metric itself.