Algorithm to find articles with similar text
Introduction
Finding articles with similar text is not a single algorithmic problem. The right approach depends on whether you care about exact overlap, near-duplicate detection, topic similarity, or deeper semantic similarity, where different wording can still mean the same thing.
Start with the Similarity Goal
Before choosing a model, decide what “similar” means for your use case.
Examples:
- duplicate or near-duplicate pages
- articles about the same topic with similar vocabulary
- semantically related articles with different wording
- recommendation systems where broad topical similarity is enough
That definition determines whether simple lexical methods are enough or whether you need embedding-based retrieval.
A Strong Baseline: TF-IDF Plus Cosine Similarity
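For many article corpora, TF-IDF with cosine similarity is the simplest baseline that still performs well. TF-IDF represents each article as a sparse vector in which a term's weight grows with its frequency in that article and shrinks with the number of articles that contain it. Cosine similarity then compares the direction of two vectors, so article length matters less than which terms dominate.
This works well when overlap in important terms is a good proxy for similarity.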
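The baseline can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the tokenizer, the smoothed IDF formula, the function names, and the three sample articles are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Deliberately simple tokenizer: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up sample articles for illustration only.
articles = [
    "The central bank raised interest rates to fight inflation.",
    "Interest rates were raised by the central bank as inflation climbed.",
    "The home team won the championship after a dramatic final match.",
]
vectors = tfidf_vectors(articles)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(f"related: {sim_related:.3f}  unrelated: {sim_unrelated:.3f}")
```

The two finance articles share most of their distinctive terms and score well above the sports article, which overlaps only on a common word. In practice a library implementation would likely replace the hand-rolled weighting.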
Near-Duplicate Detection at Scale
If the main problem is near-duplicate detection across a large corpus, techniques such as shingling, MinHash, and locality-sensitive hashing are often more appropriate.
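The idea is to turn each article into a set of token shingles (overlapping runs of k consecutive tokens), estimate the Jaccard similarity between those sets cheaply from compact MinHash signatures, and use locality-sensitive hashing to avoid comparing every document with every other document.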
This is especially useful when the corpus is huge and you care more about duplicate or heavily overlapping content than about subtle semantic meaning.
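A rough sketch of that pipeline, using only the standard library, might look like the following. The shingle size, the number of hash functions, the band count, the seeded BLAKE2 hashing scheme, and the sample texts are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

def shingles(text, k=3):
    """Set of overlapping k-token shingles for one article."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def _hash(seed, shingle):
    # One seeded 64-bit hash per "permutation" (an assumption of this sketch).
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(shingle_set, num_perm=128):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)

def lsh_candidates(signatures, bands=32):
    """Return pairs of document indexes that share at least one identical band.

    Assumes the signature length is divisible by the number of bands.
    """
    rows = len(signatures[0]) // bands
    buckets = defaultdict(set)
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Made-up sample texts: the second is the first with two extra words.
base = ("The city council approved a new budget on Tuesday that increases "
        "funding for public transit and road repairs across the region")
docs = [
    base,
    base + " next year",
    "Researchers discovered a previously unknown species of deep sea fish "
    "near a volcanic vent",
]
signatures = [minhash_signature(shingles(d)) for d in docs]
near_dup_estimate = estimate_jaccard(signatures[0], signatures[1])
unrelated_estimate = estimate_jaccard(signatures[0], signatures[2])
candidate_pairs = lsh_candidates(signatures)
print(near_dup_estimate, unrelated_estimate, candidate_pairs)
```

Only pairs that collide in at least one band are compared further, which is what keeps the number of comparisons far below the full pairwise count. Fewer rows per band makes the filter more permissive, and more rows makes it stricter.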
Semantic Similarity with Embeddings
When articles can be similar without sharing many literal words, embedding models are often better.
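A sentence or document embedding maps each article to a dense vector. Similarity is then computed in vector space, typically with cosine similarity.
Because the vectors encode meaning rather than exact vocabulary, this approach captures paraphrase-like similarity far better than lexical methods do.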
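The vector-space step can be illustrated without committing to a particular model. In the sketch below, the four-dimensional vectors are hand-written placeholders standing in for the output of an embedding model (real embeddings usually have hundreds of dimensions), and the article ids and function names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a dense vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (doc_id, cosine) pairs, most similar first."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]

# Placeholder vectors, NOT real model output. Imagine the first two articles
# describe the same event in different words and the third is unrelated.
toy_embeddings = {
    "rate-hike": [0.9, 0.1, 0.0, 0.2],
    "borrowing-costs": [0.8, 0.2, 0.1, 0.3],
    "cup-final": [0.0, 0.1, 0.9, 0.1],
}
query = toy_embeddings["rate-hike"]
others = {k: v for k, v in toy_embeddings.items() if k != "rate-hike"}
ranked = rank_by_cosine(query, others, top_k=2)
print(ranked)
```

With a real model, only the source of the vectors changes; the ranking logic stays the same.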
Indexing Matters for Large Corpora
Even a good similarity model becomes impractical if you compare every article against every other article naively. For production retrieval, pair the representation with an index.
Typical options include:
- approximate nearest neighbor search for embeddings
- inverted indexes for lexical retrieval
- MinHash LSH for near-duplicate detection
The representation and the retrieval method should be chosen together, not separately.
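As one example of pairing a lexical representation with an index, the sketch below builds a small inverted index and uses it to generate candidates before any expensive scoring. The `min_shared` and `max_df` parameters, the function names, and the sample documents are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):
            index[term].add(doc_id)
    return index

def candidates(index, query_text, min_shared=2, max_df=None):
    """Ids of documents sharing at least min_shared distinct terms with the query.

    Terms whose posting list is longer than max_df are skipped, since very
    common terms add many candidates but little signal.
    """
    shared = Counter()
    for term in set(tokenize(query_text)):
        postings = index.get(term, ())
        if max_df is not None and len(postings) > max_df:
            continue
        shared.update(postings)
    return {doc_id for doc_id, count in shared.items() if count >= min_shared}

# Made-up sample documents.
docs = {
    "a": "Central bank raises interest rates",
    "b": "Bank lifts interest rates again",
    "c": "Local team wins cup final",
}
index = build_inverted_index(docs)
found = candidates(index, "Interest rates decision by the central bank")
print(sorted(found))
```

Only the returned candidates would then be scored with the full similarity function, so the cost depends on posting-list sizes rather than on the size of the corpus.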
Preprocessing Still Matters
Regardless of the algorithm, preprocessing improves signal quality.
Common steps include:
- lowercasing
- tokenization
- stopword handling
- stemming or lemmatization when appropriate
- removing boilerplate such as nav text or repeated footer content
If the corpus contains a lot of repeated template text, cleaning that out may improve similarity more than changing the model.
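A minimal sketch of two of these steps, boilerplate removal and basic token cleanup, is shown below. It assumes that boilerplate appears as whole lines repeated across many pages; the `max_fraction` threshold, the tiny stopword list, and the sample pages are illustrative assumptions. Stemming and lemmatization are left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "on", "is"}

def strip_boilerplate(docs, max_fraction=0.5):
    """Drop lines that appear in more than max_fraction of the documents."""
    line_df = Counter()
    for text in docs:
        # Count each distinct non-empty line once per document.
        line_df.update({line.strip() for line in text.splitlines() if line.strip()})
    limit = max_fraction * len(docs)
    cleaned = []
    for text in docs:
        kept = [
            line.strip()
            for line in text.splitlines()
            if line.strip() and line_df[line.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned

def preprocess(text):
    """Lowercase, tokenize, and remove stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Made-up pages sharing the same navigation and footer lines.
pages = [
    "Home | About | Contact\nCentral bank raises rates\nCopyright Example News",
    "Home | About | Contact\nLocal team wins the cup\nCopyright Example News",
    "Home | About | Contact\nNew species of fish found\nCopyright Example News",
]
cleaned = strip_boilerplate(pages)
tokens = [preprocess(text) for text in cleaned]
print(tokens)
```

After cleaning, the three pages no longer share any lines, so the navigation and footer text cannot inflate their similarity scores.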
Common Pitfalls
The most common mistake is picking a sophisticated model before defining what kind of similarity matters.
Another issue is using raw keyword overlap when the task actually needs semantic understanding, or using heavy semantic models when the real task is just duplicate detection.
People also forget about scalability. Pairwise comparison across a large article collection becomes expensive quickly without indexing or candidate generation.
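To make the cost concrete, the number of unordered pairs among n articles is n(n - 1)/2. The short calculation below uses arbitrary example corpus sizes.

```python
import math

def pairwise_comparisons(n_docs):
    # Number of unordered document pairs: n * (n - 1) / 2.
    return math.comb(n_docs, 2)

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} articles -> {pairwise_comparisons(n):>15,} pairs")
```

A thousand articles need roughly half a million comparisons, while a million articles need roughly half a trillion.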
Finally, do not ignore boilerplate. Header and footer text can make unrelated articles look artificially similar if preprocessing is weak.
Summary
- Choose the algorithm based on the kind of similarity you actually care about.
- TF-IDF plus cosine similarity is a strong lexical baseline.
- MinHash and LSH are good for near-duplicate detection at scale.
- Embedding-based methods are better for semantic similarity.
- Representation quality, preprocessing, and indexing matter as much as the similarity formula itself.

