string similarity
algorithms
text comparison
computational linguistics
pattern matching

What string similarity algorithms are there?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

String similarity algorithms are essential in diverse fields like data mining, information retrieval, and natural language processing. They are used to compare two text strings to determine how similar they are. Applications range from spell-checking and plagiarism detection to database record linkage and genome analysis. This article delves into various string similarity algorithms, explores their technical foundations, and provides examples for better understanding.

Categories of String Similarity Algorithms

String similarity measures can be broadly categorized into:

  1. Edit-Based Similarity: Measures the minimum number of edits required to transform one string into another.
  2. Token-Based Similarity: Considers substrings or tokens within the strings.
  3. Character-Based Similarity: Based on overlapping substrings.
  4. Vector-Based Similarity: Uses mathematical constructs like vectors to measure similarity.

Edit-Based Similarity Algorithms

1. Levenshtein Distance

Levenshtein Distance measures how many single-character edits (insertions, deletions, or substitutions) are needed to change one string into another.

Example:

Let's compute Levenshtein Distance between "kitten" and "sitting":

• kitten → sitten (substitute 'k' with 's') • sitten → sittin (substitute 'e' with 'i') • sittin → sitting (insert 'g')

The Levenshtein Distance is 3.

Formula:

For two strings aa and bb, leva,b(i,j)lev_{a,b}(i, j) is defined as the distance between the first ii characters of aa and the first jj characters of bb:

lev_a,b(i,j)={max(i,j)if min(i,j)=0,min{lev_a,b(i1,j)+1,lev_a,b(i,j1)+1,lev_a,b(i1,j1)+cost(a_i,b_j)otherwiselev\_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if }\min(i, j) = 0,\\ \min \begin{cases} lev\_{a,b}(i-1, j) + 1,\\ lev\_{a,b}(i, j-1) + 1,\\ lev\_{a,b}(i-1, j-1) + cost(a\_i, b\_j) \end{cases} & \text{otherwise} \end{cases}

where cost(ai,bj)cost(a_i, b_j) is 0 when ai=bja_i = b_j, otherwise 1.

2. Damerau-Levenshtein Distance

An extension of Levenshtein Distance that also accounts for transpositions (swaps of two adjacent characters).

Example:

Consider the strings "ca" and "ac". The Damerau-Levenshtein Distance between these strings is 1, as only one transposition is required.

Token-Based Similarity Algorithms

3. Jaccard Similarity

Jaccard Similarity is the size of the intersection of token sets divided by the size of the union of token sets.

Example:

For sets $A = \{1, 2, 3\}$ and $B = \{2, 3, 4\}$, the Jaccard Similarity is:

J(A,B)=ABAB=24=0.5J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{2}{4} = 0.5

4. Cosine Similarity

Cosine similarity measures the cosine of the angle between two n-dimensional vectors of an inner product space. It's essentially a measurement of orientation, not magnitude.

Example:

For vectors A=(1,1,0)A = (1, 1, 0) and B=(0,1,1)B = (0, 1, 1), the cosine similarity is:

cos(θ)=ABAB=122=0.5\cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|} = \frac{1}{\sqrt{2} \cdot \sqrt{2}} = 0.5

Character-Based Similarity Algorithms

5. N-gram

N-gram similarity breaks strings into n-grams and measures how many n-grams two strings share. For instance, with n=2, "night" becomes ("ni", "ig", "gh", "ht").

Example:

For n=2, "hello" becomes {"he", "el", "ll", "lo"}. Comparing with "yellow" gives {"ye", "el", "ll", "lo", "ow"}. This method typically counts matching n-grams.

Vector-Based Similarity Algorithms

6. Euclidean Distance

Euclidean distance measures the "straight line" distance between two points in n-dimensional space, applicable in cases where strings are represented as vectors.

Example:

Given vectors P=(1,2)P = (1, 2) and Q=(4,6)Q = (4, 6), the Euclidean distance is:

d(P,Q)=(41)2+(62)2=9+16=5d(P, Q) = \sqrt{(4-1)^2 + (6-2)^2} = \sqrt{9 + 16} = 5


Summary Table

AlgorithmTypeDescriptionExample
Levenshtein DistanceEdit-BasedCounts single-character edits"kitten" vs "sitting" => 3 edits needed
Damerau-Levenshtein DistanceEdit-BasedIncludes transpositions in edit calculations"ca" vs "ac" => 1 edit needed
Jaccard SimilarityToken-BasedIntersection over Union of token setsSets {1,2,3} & {2,3,4} => 0.5
Cosine SimilarityToken-BasedUses vector space and angle measurementVectors (1,1,0) & (0,1,1) => Similarity = 0.5
N-gramCharacter-BasedBreaks the string into overlapping n-length substringsN=2, "hello" => {"he", "el", "ll", "lo"}
Euclidean DistanceVector-BasedMeasures the straight-line distance in an n-dimensional spacePoints (1,2) & (4,6) => Distance = 5

Conclusion

Choosing the right string similarity algorithm depends on the specific use case and the nature of the data. Edit-based algorithms are excellent for small text differences, while token-based algorithms work well with structured data. Character-based approaches like n-grams are useful for text that has many variations and where partial matches are significant. Vector-based methods can handle data with known geometric interpretations. Each algorithm has its strengths and trade-offs, making it essential to understand the context before selecting an approach for text processing tasks.


Course illustration
Course illustration

All Rights Reserved.