Algorithm to find articles with similar text

Introduction

Finding articles with similar text is not one single algorithmic problem. The right approach depends on whether you care about exact overlap, near-duplicate detection, topic similarity, or deeper semantic similarity where different wording can still mean the same thing.

Start with the Similarity Goal

Before choosing a model, decide what “similar” means for your use case.

Examples:

duplicate or near-duplicate pages
articles about the same topic with similar vocabulary
semantically related articles with different wording
broadly related articles for a recommendation system, where topical similarity is enough

That definition determines whether simple lexical methods are enough or whether you need embedding-based retrieval.

A Strong Baseline: TF-IDF Plus Cosine Similarity

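For many article corpora, TF-IDF with cosine similarity is the simplest good baseline. TF-IDF weights each term by how often it appears in an article and how rare it is across the corpus, and cosine similarity compares the resulting vectors independently of article length.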

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "Python memory profiling and leak detection techniques.",
    "How to diagnose memory leaks in .NET applications.",
    "Garbage collection and memory profiling in Python.",
]

# Build one TF-IDF vector per article, ignoring common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(articles)

# scores[i, j] is the cosine similarity between article i and article j.
scores = cosine_similarity(X)
print(scores)
```

This works well when overlap in important terms is a good proxy for similarity.
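
In practice you usually want the closest matches for one article rather than the full score matrix. The following is a minimal sketch of that lookup, assuming the same scikit-learn setup as above; the helper name `top_k_similar` and its parameters are illustrative, not part of any library.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def top_k_similar(articles, query_index, k=2):
    """Return (index, score) pairs for the k articles closest to one article."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(articles)
    # Compare only the query row against the corpus instead of all pairs.
    scores = cosine_similarity(X[query_index], X).ravel()
    scores[query_index] = -1.0  # exclude the query article itself
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]


articles = [
    "Python memory profiling and leak detection techniques.",
    "How to diagnose memory leaks in .NET applications.",
    "Garbage collection and memory profiling in Python.",
]
print(top_k_similar(articles, query_index=0, k=2))
```

For the first article, the third one ranks highest because it shares "python", "memory", and "profiling", while the second shares only "memory". Note that "leak" and "leaks" do not match without stemming, which is one reason preprocessing choices affect lexical scores.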

Near-Duplicate Detection at Scale

If the main problem is near-duplicate detection across a large corpus, techniques such as shingling, MinHash, and locality-sensitive hashing are often more appropriate.

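The idea is to turn each article into a set of token shingles (overlapping runs of k consecutive tokens), estimate the Jaccard similarity between those sets cheaply from compact MinHash signatures, and use locality-sensitive hashing to avoid comparing every document with every other document.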

This is especially useful when the corpus is huge and you care more about duplicate or heavily overlapping content than about subtle semantic meaning.
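
The pipeline described above can be sketched with the standard library alone. This is an illustrative toy, not a production implementation: the function names, the 3-word shingle size, the 128 seeded SHA-1 hashes standing in for random permutations, and the 32-band split are all assumptions chosen for readability. A real system would more likely use a dedicated MinHash library.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in a text."""
    tokens = text.lower().split()
    if len(tokens) <= k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    return len(a & b) / len(a | b)


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingle_set, num_hashes=128):
    """Keep the minimum hash value per seed; the list is the signature."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Bucket documents by signature band; pairs sharing any bucket are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river shore",
    "c": "stock markets closed higher after the central bank held interest rates",
}
shingle_sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
signatures = {doc_id: minhash_signature(s) for doc_id, s in shingle_sets.items()}

print("exact a-b:", jaccard(shingle_sets["a"], shingle_sets["b"]))
print("estimated a-b:", estimate_jaccard(signatures["a"], signatures["b"]))
print("candidate pairs:", lsh_candidate_pairs(signatures))
```

Documents "a" and "b" differ in one word, so their exact Jaccard similarity is 10/12 and the MinHash estimate should land close to it. Only pairs that collide in at least one band are ever compared in full, which is where the savings over all-pairs comparison come from. More bands with fewer rows each catch lower-similarity pairs at the cost of more false candidates.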

Semantic Similarity with Embeddings

When articles can be similar without sharing many literal words, embedding models are often better.

A sentence or document embedding maps each article to a dense vector. Similarity is then computed in vector space.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# The model weights are downloaded the first time this runs.
model = SentenceTransformer("all-MiniLM-L6-v2")
articles = [
    "How to fix memory leaks in C# applications.",
    "Diagnosing retained objects in .NET programs.",
    "Best hiking trails near the ocean.",
]

# encode() returns one dense vector per article.
embeddings = model.encode(articles)
print(cosine_similarity(embeddings))
```

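This kind of approach is much better than lexical matching at capturing paraphrase-like similarity. In the example above, the first two articles share no content words yet describe the same problem, which is exactly the case a TF-IDF baseline would miss.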

Indexing Matters for Large Corpora

Even a good similarity model becomes impractical if you compare every article against every other article naively. For production retrieval, pair the representation with an index.

Typical options include:

approximate nearest neighbor search for embeddings
inverted indexes for lexical retrieval
MinHash LSH for near-duplicate detection

The representation and the retrieval method should be chosen together, not separately.
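
As one illustration of candidate generation, here is a minimal sketch of an inverted index for lexical retrieval, using only the standard library. The function names, the tiny stopword list, and the `min_shared` threshold are assumptions made for the example. The index returns a short list of candidates, which a scorer such as TF-IDF cosine similarity would then rank.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def candidates(index, query_text, min_shared=2):
    """Return ids of documents sharing at least min_shared tokens with the query."""
    shared = Counter()
    for token in tokenize(query_text):
        for doc_id in index.get(token, ()):
            shared[doc_id] += 1
    return {doc_id for doc_id, count in shared.items() if count >= min_shared}


docs = {
    "a": "Python memory profiling and leak detection techniques.",
    "b": "How to diagnose memory leaks in .NET applications.",
    "c": "Best hiking trails near the ocean.",
}
index = build_inverted_index(docs)
print(candidates(index, "memory profiling in Python"))  # {'a'}
```

Only documents that appear in the posting lists of the query's tokens are touched, so the cost depends on how common those tokens are rather than on the size of the whole corpus.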

Preprocessing Still Matters

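Regardless of the algorithm, preprocessing affects signal quality, though which steps help depends on the representation. Stopword removal and stemming mainly benefit lexical methods, while embedding models are usually given natural, unstemmed text.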

Common steps include:

lowercasing
tokenization
stopword handling
stemming or lemmatization when appropriate
removing boilerplate such as nav text or repeated footer content

If the corpus contains a lot of repeated template text, cleaning that out may improve similarity more than changing the model.
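
One simple way to remove template text is to drop any line that appears in a large share of documents. The sketch below assumes boilerplate shows up as identical whole lines; the function name and the 50% threshold are illustrative choices, and real pages often need HTML-aware extraction instead.

```python
from collections import Counter


def strip_repeated_lines(docs, max_doc_fraction=0.5):
    """Drop lines that occur in more than max_doc_fraction of the documents."""
    line_doc_counts = Counter()
    for text in docs:
        # Count each distinct line once per document.
        unique_lines = {line.strip() for line in text.splitlines() if line.strip()}
        line_doc_counts.update(unique_lines)

    threshold = max_doc_fraction * len(docs)
    cleaned = []
    for text in docs:
        kept = [
            line
            for line in text.splitlines()
            if line.strip() and line_doc_counts[line.strip()] <= threshold
        ]
        cleaned.append("\n".join(kept))
    return cleaned


docs = [
    "Home | About | Contact\nTF-IDF is a lexical baseline.\nCopyright Example Corp",
    "Home | About | Contact\nMinHash estimates Jaccard similarity.\nCopyright Example Corp",
    "Home | About | Contact\nEmbeddings capture paraphrases.\nCopyright Example Corp",
]
for text in strip_repeated_lines(docs):
    print(text)
```

After cleaning, only the one distinctive line of each document remains, so the shared header and footer no longer contribute to any similarity score.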

Common Pitfalls

The most common mistake is picking a sophisticated model before defining what kind of similarity matters.

Another issue is using raw keyword overlap when the task actually needs semantic understanding, or using heavy semantic models when the real task is just duplicate detection.

People also forget about scalability. Pairwise comparison across a large article collection becomes expensive quickly without indexing or candidate generation.
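
To make the cost concrete, the number of unordered pairs among n articles is n(n-1)/2. A short calculation shows how quickly that grows; the corpus sizes below are arbitrary examples.

```python
import math


def pair_count(n):
    """Number of unordered article pairs in a corpus of n articles."""
    return math.comb(n, 2)


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} articles -> {pair_count(n):,} pairs")
```

A corpus of one million articles has roughly 5 × 10^11 pairs, which is why candidate generation comes before exact scoring.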

Finally, do not ignore boilerplate. Header and footer text can make unrelated articles look artificially similar if preprocessing is weak.

Summary

Choose the algorithm based on the kind of similarity you actually care about.
TF-IDF plus cosine similarity is a strong lexical baseline.
MinHash and LSH are good for near-duplicate detection at scale.
Embedding-based methods are better for semantic similarity.
Representation quality, preprocessing, and indexing matter as much as the similarity formula itself.