Computing similarity between two lists

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Computing the similarity between two lists is a fundamental task in data science, machine learning, and information retrieval. It powers recommendation systems, search engines, plagiarism detection, and clustering algorithms. This article covers four widely used similarity and distance measures: Jaccard Index, Cosine Similarity, Hamming Distance, and Euclidean Distance, each with formulas, worked examples, and guidance on when to use them.

Jaccard Index

The Jaccard Index measures similarity between two sets by comparing the size of their intersection to the size of their union:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

The result ranges from $0$ (completely disjoint) to $1$ (identical sets). For lists, convert them to sets first, which discards duplicates and order.

Example

For lists A = [1, 2, 2, 3] and B = [3, 2, 1, 4]:

Convert to sets: $\text{Set}(A) = {1, 2, 3}$ , $\text{Set}(B) = {1, 2, 3, 4}$
$A \cap B = {1, 2, 3}$ , so $|A \cap B| = 3$
$A \cup B = {1, 2, 3, 4}$ , so $|A \cup B| = 4$
Jaccard Index = $\frac{3}{4} = 0.75$

When to Use

The Jaccard Index is best for comparing sets of categorical items where order and frequency do not matter. Think "what fraction of items do these two collections share?"

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors. It captures directional similarity regardless of magnitude:

$\text{cos}(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \cdot \|\mathbf{B}\|}$

The result ranges from $0$ (orthogonal, no similarity) to $1$ (same direction) for non-negative vectors. It can be negative for vectors with negative components.

Example

For binary vectors A = [1, 1, 0, 1] and B = [1, 0, 1, 1]:

Dot product: $\mathbf{A} \cdot \mathbf{B} = (1)(1) + (1)(0) + (0)(1) + (1)(1) = 2$
Magnitude of A: $\|\mathbf{A}\| = \sqrt{1^2 + 1^2 + 0^2 + 1^2} = \sqrt{3}$
Magnitude of B: $\|\mathbf{B}\| = \sqrt{1^2 + 0^2 + 1^2 + 1^2} = \sqrt{3}$
Cosine similarity = $\frac{2}{3} \approx 0.667$

When to Use

Cosine similarity excels when comparing vectors where magnitude is irrelevant but direction matters. This is common in text analysis (comparing TF-IDF vectors of documents) and recommendation systems.

Hamming Distance

The Hamming distance counts the number of positions at which two equal-length sequences differ:

$d_H(A, B) = \sum_{i=1}^{n} \mathbb{1}(A_i \neq B_i)$

where $\mathbb{1}$ is the indicator function. The result ranges from $0$ (identical) to $n$ (completely different).

Example

For lists A = [1, 0, 1, 1] and B = [1, 1, 0, 1]:

Position 1: $1 = 1$ (match)
Position 2: $0 \neq 1$ (differ)
Position 3: $1 \neq 0$ (differ)
Position 4: $1 = 1$ (match)
Hamming distance = $2$

When to Use

Hamming distance requires equal-length sequences and is ideal for comparing binary strings, error-detecting codes, or fixed-length feature vectors.

Euclidean Distance

Euclidean distance is the straight-line distance between two points in n-dimensional space:

$d_E(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$

The result ranges from $0$ (identical points) to infinity.

Example

For numerical lists A = [1, 2, 3] and B = [4, 5, 6]:

$(1-4)^2 + (2-5)^2 + (3-6)^2 = 9 + 9 + 9 = 27$
Euclidean distance = $\sqrt{27} \approx 5.196$

When to Use

Euclidean distance works well when the absolute positions of data points matter and the dimensions are on comparable scales. It is the default distance metric for k-nearest neighbors and many clustering algorithms.

Choosing the Right Measure

Attribute	Jaccard Index	Cosine Similarity	Hamming Distance	Euclidean Distance
Data type	Sets	Vectors	Equal-length sequences	Numerical vectors
Handles duplicates	No (uses sets)	Yes	No	Yes
Length requirement	No	Yes	Yes	Yes
Output range	$[0, 1]$	$[-1, 1]$	$[0, n]$	$[0, \infty)$
Sensitive to magnitude	No	No	N/A	Yes

Common Pitfalls

Using Jaccard on lists where duplicate counts matter. Jaccard discards duplicates by converting to sets, so [1, 1, 1] and [1] are treated as identical.
Using Euclidean distance on features with very different scales (e.g., age in years vs. salary in dollars). Normalize or standardize features first.
Comparing lists of different lengths with Hamming or Euclidean distance. Both require equal-length inputs.
Forgetting that cosine similarity for non-negative vectors (like TF-IDF) ranges from 0 to 1, but for general vectors it ranges from -1 to 1.

Conclusion

Choosing the right similarity measure depends on the data type and the question you are answering. Jaccard works for set overlap, cosine similarity captures directional alignment, Hamming distance counts positional differences, and Euclidean distance measures geometric separation. Understanding each measure's assumptions ensures you select the right tool for your specific task.