What string similarity algorithms are there?
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
String similarity algorithms are essential in diverse fields like data mining, information retrieval, and natural language processing. They are used to compare two text strings to determine how similar they are. Applications range from spell-checking and plagiarism detection to database record linkage and genome analysis. This article delves into various string similarity algorithms, explores their technical foundations, and provides examples for better understanding.
Categories of String Similarity Algorithms
String similarity measures can be broadly categorized into:
- Edit-Based Similarity: Measures the minimum number of edits required to transform one string into another.
- Token-Based Similarity: Considers substrings or tokens within the strings.
- Character-Based Similarity: Based on overlapping substrings.
- Vector-Based Similarity: Uses mathematical constructs like vectors to measure similarity.
Edit-Based Similarity Algorithms
1. Levenshtein Distance
Levenshtein Distance measures how many single-character edits (insertions, deletions, or substitutions) are needed to change one string into another.
Example:
Let's compute Levenshtein Distance between "kitten" and "sitting":
• kitten → sitten (substitute 'k' with 's') • sitten → sittin (substitute 'e' with 'i') • sittin → sitting (insert 'g')
The Levenshtein Distance is 3.
Formula:
For two strings and , is defined as the distance between the first characters of and the first characters of :
where is 0 when , otherwise 1.
2. Damerau-Levenshtein Distance
An extension of Levenshtein Distance that also accounts for transpositions (swaps of two adjacent characters).
Example:
Consider the strings "ca" and "ac". The Damerau-Levenshtein Distance between these strings is 1, as only one transposition is required.
Token-Based Similarity Algorithms
3. Jaccard Similarity
Jaccard Similarity is the size of the intersection of token sets divided by the size of the union of token sets.
Example:
For sets $A = \{1, 2, 3\}$ and $B = \{2, 3, 4\}$, the Jaccard Similarity is:
4. Cosine Similarity
Cosine similarity measures the cosine of the angle between two n-dimensional vectors of an inner product space. It's essentially a measurement of orientation, not magnitude.
Example:
For vectors and , the cosine similarity is:
Character-Based Similarity Algorithms
5. N-gram
N-gram similarity breaks strings into n-grams and measures how many n-grams two strings share. For instance, with n=2, "night" becomes ("ni", "ig", "gh", "ht").
Example:
For n=2, "hello" becomes {"he", "el", "ll", "lo"}. Comparing with "yellow" gives {"ye", "el", "ll", "lo", "ow"}. This method typically counts matching n-grams.
Vector-Based Similarity Algorithms
6. Euclidean Distance
Euclidean distance measures the "straight line" distance between two points in n-dimensional space, applicable in cases where strings are represented as vectors.
Example:
Given vectors and , the Euclidean distance is:
Summary Table
| Algorithm | Type | Description | Example |
| Levenshtein Distance | Edit-Based | Counts single-character edits | "kitten" vs "sitting" => 3 edits needed |
| Damerau-Levenshtein Distance | Edit-Based | Includes transpositions in edit calculations | "ca" vs "ac" => 1 edit needed |
| Jaccard Similarity | Token-Based | Intersection over Union of token sets | Sets {1,2,3} & {2,3,4} => 0.5 |
| Cosine Similarity | Token-Based | Uses vector space and angle measurement | Vectors (1,1,0) & (0,1,1) => Similarity = 0.5 |
| N-gram | Character-Based | Breaks the string into overlapping n-length substrings | N=2, "hello" => {"he", "el", "ll", "lo"} |
| Euclidean Distance | Vector-Based | Measures the straight-line distance in an n-dimensional space | Points (1,2) & (4,6) => Distance = 5 |
Conclusion
Choosing the right string similarity algorithm depends on the specific use case and the nature of the data. Edit-based algorithms are excellent for small text differences, while token-based algorithms work well with structured data. Character-based approaches like n-grams are useful for text that has many variations and where partial matches are significant. Vector-based methods can handle data with known geometric interpretations. Each algorithm has its strengths and trade-offs, making it essential to understand the context before selecting an approach for text processing tasks.

