Implementation of locality-sensitive hashing with min-hash
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
MinHash and locality-sensitive hashing are used together to find similar sets without comparing every pair directly. MinHash compresses a set into a short signature that approximates Jaccard similarity, and LSH groups similar signatures into buckets so only promising candidate pairs need detailed comparison.
Start With Jaccard Similarity
For two sets A and B, Jaccard similarity is:
- intersection size divided by union size
If two documents share many shingles, their Jaccard similarity is high. The problem is that direct set comparison across many documents is expensive.
MinHash helps by turning each set into a compact signature. The probability that two MinHash values match under a random permutation equals their Jaccard similarity.
Build MinHash Signatures
In practice, people simulate random permutations with a family of hash functions. Each hash function maps elements to integers, and the signature stores the minimum hash value seen for that set.
More matching positions in the signatures imply a higher estimated Jaccard similarity.
Turn Signatures Into LSH Buckets
LSH splits a signature into bands. Documents that match exactly within at least one band are treated as candidate pairs.
The bucket step does not prove similarity. It narrows the search to likely matches.
Typical Workflow
A practical document-similarity pipeline usually looks like this:
- convert each document to a set of shingles
- compute a MinHash signature for each set
- split signatures into bands for LSH
- collect candidate pairs from shared buckets
- compute exact similarity only for those candidates
That is what makes the method scalable.
Parameter Tradeoffs
More hash functions improve the MinHash estimate but increase compute and storage cost. Band size changes the recall and precision tradeoff.
- more rows per band means fewer false positives but more false negatives
- fewer rows per band means more candidates, including weaker matches
There is no universal best setting. It depends on the similarity threshold you care about.
Common Pitfalls
A common mistake is skipping shingling and hashing whole documents as single tokens. MinHash is about set similarity, so tokenization or shingling quality matters a lot.
Another mistake is assuming LSH returns exact nearest neighbors. It returns candidates with high probability, not a proof of optimal similarity.
It is also easy to use Python's built-in hash() for persisted signatures. That hash is not stable across processes by default, so a deterministic hash function is safer for repeatable results.
Summary
- MinHash estimates Jaccard similarity between sets.
- LSH uses signature bands to find candidate pairs efficiently.
- Good tokenization or shingling is essential before hashing.
- Banding parameters control the precision and recall tradeoff.
- LSH is a filtering step, so exact similarity is usually computed afterward on candidates.

