Computing similarity between two lists
Introduction
Computing the similarity between two lists is a fundamental task in data science, machine learning, and information retrieval. It powers recommendation systems, search engines, plagiarism detection, and clustering algorithms. This article covers four widely used similarity and distance measures: Jaccard Index, Cosine Similarity, Hamming Distance, and Euclidean Distance, each with formulas, worked examples, and guidance on when to use them.
Jaccard Index
The Jaccard Index measures similarity between two sets by comparing the size of their intersection to the size of their union:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The result ranges from $0$ (completely disjoint) to $1$ (identical sets). For lists, convert them to sets first, which discards duplicates and order.
Example
For lists A = [1, 2, 2, 3] and B = [3, 2, 1, 4]:
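- Convert to sets: $A = \{1, 2, 3\}$, $B = \{1, 2, 3, 4\}$
- $A \cap B = \{1, 2, 3\}$, so $|A \cap B| = 3$
- $A \cup B = \{1, 2, 3, 4\}$, so $|A \cup B| = 4$
- Jaccard Index = $|A \cap B| / |A \cup B| = 3/4 = 0.75$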
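The worked example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `jaccard_index` is illustrative, and returning 1.0 when both inputs are empty is a convention assumed here, not something the definition settles.

```python
def jaccard_index(a, b):
    """Jaccard index of two lists, treating each list as a set."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both inputs are empty. Returning 1.0 is an assumed convention;
        # the ratio itself is undefined (0 / 0) in this case.
        return 1.0
    return len(set_a & set_b) / len(union)


print(jaccard_index([1, 2, 2, 3], [3, 2, 1, 4]))  # 0.75
```

Because the lists are converted to sets, the repeated 2 in A and the ordering of B have no effect on the result.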
When to Use
The Jaccard Index is best for comparing sets of categorical items where order and frequency do not matter. Think "what fraction of items do these two collections share?"
Cosine Similarity
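Cosine similarity measures the cosine of the angle between two vectors. It captures directional similarity regardless of magnitude:
$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
The result ranges from $0$ (orthogonal, no similarity) to $1$ (same direction) for non-negative vectors. It can be negative for vectors with negative components.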
Example
For binary vectors A = [1, 1, 0, 1] and B = [1, 0, 1, 1]:
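- Dot product: $A \cdot B = 1 \cdot 1 + 1 \cdot 0 + 0 \cdot 1 + 1 \cdot 1 = 2$
- Magnitude of A: $\|A\| = \sqrt{1^2 + 1^2 + 0^2 + 1^2} = \sqrt{3}$
- Magnitude of B: $\|B\| = \sqrt{1^2 + 0^2 + 1^2 + 1^2} = \sqrt{3}$
- Cosine similarity = $\frac{2}{\sqrt{3} \cdot \sqrt{3}} = \frac{2}{3} \approx 0.667$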
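The same calculation can be sketched in plain Python. The function name `cosine_similarity` and the choice to raise an error for zero vectors are assumptions made for this illustration; libraries handle the zero-vector case in different ways.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined when either vector has zero magnitude.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(round(cosine_similarity([1, 1, 0, 1], [1, 0, 1, 1]), 3))  # 0.667
```

Scaling either vector by a positive constant leaves the result unchanged, which is the sense in which magnitude is ignored.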
When to Use
Cosine similarity excels when comparing vectors where magnitude is irrelevant but direction matters. This is common in text analysis (comparing TF-IDF vectors of documents) and recommendation systems.
Hamming Distance
The Hamming distance counts the number of positions at which two equal-length sequences differ:
$$d_H(A, B) = \sum_{i=1}^{n} \mathbb{1}[A_i \neq B_i]$$
where $\mathbb{1}[\cdot]$ is the indicator function. The result ranges from $0$ (identical) to $n$ (completely different).
Example
For lists A = [1, 0, 1, 1] and B = [1, 1, 0, 1]:
- Position 1: $1 = 1$ (match)
- Position 2: $0 \neq 1$ (differ)
- Position 3: $1 \neq 0$ (differ)
- Position 4: $1 = 1$ (match)
- Hamming distance = $0 + 1 + 1 + 0 = 2$
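A position-by-position comparison like this is short to write in Python. The sketch below uses an illustrative function name, `hamming_distance`, and assumes that unequal lengths should raise an error rather than be padded or truncated.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # Each comparison yields True (1) or False (0), so the sum is the count.
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```

The same function works on strings, since it only relies on element-wise inequality.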
When to Use
Hamming distance requires equal-length sequences and is ideal for comparing binary strings, error-detecting codes, or fixed-length feature vectors.
Euclidean Distance
Euclidean distance is the straight-line distance between two points in n-dimensional space:
$$d(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$
The result ranges from $0$ (identical points) to infinity.
Example
For numerical lists A = [1, 2, 3] and B = [4, 5, 6]:
- Euclidean distance = $\sqrt{(1-4)^2 + (2-5)^2 + (3-6)^2} = \sqrt{9 + 9 + 9} = \sqrt{27} \approx 5.196$
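The formula translates directly into Python. This is a minimal sketch with an illustrative function name, `euclidean_distance`; rejecting unequal lengths is an assumption made here.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(round(euclidean_distance([1, 2, 3], [4, 5, 6]), 3))  # 5.196
```

On Python 3.8 and later, the standard library's `math.dist(p, q)` computes the same quantity.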
When to Use
Euclidean distance works well when the absolute positions of data points matter and the dimensions are on comparable scales. It is the default distance metric for k-nearest neighbors and many clustering algorithms.
Choosing the Right Measure
| Attribute | Jaccard Index | Cosine Similarity | Hamming Distance | Euclidean Distance |
|---|---|---|---|---|
| Data type | Sets | Vectors | Equal-length sequences | Numerical vectors |
| Handles duplicates | No (uses sets) | Yes | No | Yes |
| Length requirement | No | Yes | Yes | Yes |
| Output range | [0, 1] | [-1, 1] ([0, 1] for non-negative vectors) | [0, n] | [0, ∞) |
| Sensitive to magnitude | No | No | N/A | Yes |
Common Pitfalls
- Using Jaccard on lists where duplicate counts matter. Jaccard discards duplicates by converting to sets, so [1, 1, 1] and [1] are treated as identical.
- Using Euclidean distance on features with very different scales (e.g., age in years vs. salary in dollars). Normalize or standardize features first.
- Comparing lists of different lengths with Hamming or Euclidean distance. Both require equal-length inputs.
- Forgetting that cosine similarity for non-negative vectors (like TF-IDF) ranges from 0 to 1, but for general vectors it ranges from -1 to 1.
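To illustrate the scaling pitfall, the sketch below standardizes each feature to zero mean and unit variance (z-scores) before measuring Euclidean distance. The data values and the helper name `standardize_columns` are hypothetical, and z-scoring is only one of several possible normalization choices.

```python
import math
import statistics


def standardize_columns(rows):
    """Z-score each column of a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column has zero spread; map it to 0.0 (assumed convention).
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


# Hypothetical records: (age in years, salary in dollars)
people = [[25, 40_000], [30, 42_000], [55, 41_000]]
scaled = standardize_columns(people)

raw_01 = math.dist(people[0], people[1])
raw_02 = math.dist(people[0], people[2])
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])

print(raw_01 > raw_02)        # salary dominates the raw distances
print(scaled_01 < scaled_02)  # the age gap counts once features are rescaled
```

With these made-up numbers, the raw distances are driven almost entirely by salary, so the 55-year-old appears closer to the 25-year-old than the 30-year-old does. After standardization that ordering reverses.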
Conclusion
Choosing the right similarity measure depends on the data type and the question you are answering. Jaccard works for set overlap, cosine similarity captures directional alignment, Hamming distance counts positional differences, and Euclidean distance measures geometric separation. Understanding each measure's assumptions ensures you select the right tool for your specific task.

