locality-sensitive hashing
min-hash
LSH algorithm
similarity measurement
data science techniques

Implementation of locality-sensitive hashing with min-hash

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

MinHash and locality-sensitive hashing are used together to find similar sets without comparing every pair directly. MinHash compresses a set into a short signature that approximates Jaccard similarity, and LSH groups similar signatures into buckets so only promising candidate pairs need detailed comparison.

Start With Jaccard Similarity

For two sets A and B, Jaccard similarity is:

  • intersection size divided by union size

If two documents share many shingles, their Jaccard similarity is high. The problem is that direct set comparison across many documents is expensive.

MinHash helps by turning each set into a compact signature. The probability that two MinHash values match under a random permutation equals their Jaccard similarity.

Build MinHash Signatures

In practice, people simulate random permutations with a family of hash functions. Each hash function maps elements to integers, and the signature stores the minimum hash value seen for that set.

python
1import hashlib
2
3
4def stable_hash(value: str, seed: int) -> int:
5    data = f"{seed}:{value}".encode("utf-8")
6    return int(hashlib.md5(data).hexdigest(), 16)
7
8
9def minhash_signature(items: set[str], num_hashes: int = 50) -> list[int]:
10    signature = []
11    for seed in range(num_hashes):
12        signature.append(min(stable_hash(item, seed) for item in items))
13    return signature
14
15
16doc1 = {"cat", "sat", "mat"}
17doc2 = {"cat", "sat", "rug"}
18
19print(minhash_signature(doc1, 5))
20print(minhash_signature(doc2, 5))

More matching positions in the signatures imply a higher estimated Jaccard similarity.

Turn Signatures Into LSH Buckets

LSH splits a signature into bands. Documents that match exactly within at least one band are treated as candidate pairs.

python
1from collections import defaultdict
2
3
4def lsh_buckets(signatures: dict[str, list[int]], rows_per_band: int):
5    buckets = defaultdict(list)
6    signature_length = len(next(iter(signatures.values())))
7    bands = signature_length // rows_per_band
8
9    for doc_id, signature in signatures.items():
10        for band in range(bands):
11            start = band * rows_per_band
12            end = start + rows_per_band
13            band_key = (band, tuple(signature[start:end]))
14            buckets[band_key].append(doc_id)
15
16    return buckets
17
18
19signatures = {
20    "doc1": minhash_signature(doc1, 20),
21    "doc2": minhash_signature(doc2, 20),
22    "doc3": minhash_signature({"apple", "banana", "pear"}, 20),
23}
24
25buckets = lsh_buckets(signatures, rows_per_band=5)
26for key, docs in buckets.items():
27    if len(docs) > 1:
28        print(key, docs)

The bucket step does not prove similarity. It narrows the search to likely matches.

Typical Workflow

A practical document-similarity pipeline usually looks like this:

  • convert each document to a set of shingles
  • compute a MinHash signature for each set
  • split signatures into bands for LSH
  • collect candidate pairs from shared buckets
  • compute exact similarity only for those candidates

That is what makes the method scalable.

Parameter Tradeoffs

More hash functions improve the MinHash estimate but increase compute and storage cost. Band size changes the recall and precision tradeoff.

  • more rows per band means fewer false positives but more false negatives
  • fewer rows per band means more candidates, including weaker matches

There is no universal best setting. It depends on the similarity threshold you care about.

Common Pitfalls

A common mistake is skipping shingling and hashing whole documents as single tokens. MinHash is about set similarity, so tokenization or shingling quality matters a lot.

Another mistake is assuming LSH returns exact nearest neighbors. It returns candidates with high probability, not a proof of optimal similarity.

It is also easy to use Python's built-in hash() for persisted signatures. That hash is not stable across processes by default, so a deterministic hash function is safer for repeatable results.

Summary

  • MinHash estimates Jaccard similarity between sets.
  • LSH uses signature bands to find candidate pairs efficiently.
  • Good tokenization or shingling is essential before hashing.
  • Banding parameters control the precision and recall tradeoff.
  • LSH is a filtering step, so exact similarity is usually computed afterward on candidates.

Course illustration
Course illustration

All Rights Reserved.