python
document-similarity
algorithms
programming
text-analysis

Algorithm to detect similar documents in python script

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Detecting similar documents in Python depends on both algorithm choice and text preprocessing quality. Exact string matching is rarely enough because similar content can differ by punctuation, word order, or minor edits. A good solution defines similarity metric, threshold policy, and scalability strategy up front.

Choose a Similarity Approach

Common options:

  • TF-IDF with cosine similarity for lexical overlap.
  • N-gram Jaccard similarity for approximate token overlap.
  • Embedding models for semantic similarity.

For many practical scripts, TF-IDF cosine is a strong baseline with predictable behavior.

TF-IDF Cosine Baseline

Use scikit-learn to vectorize corpus and compute pairwise similarity matrix.

python
1from sklearn.feature_extraction.text import TfidfVectorizer
2from sklearn.metrics.pairwise import cosine_similarity
3
4corpus = [
5    "payment failed because card expired",
6    "card expired and payment was declined",
7    "weather forecast for toronto"
8]
9
10vectorizer = TfidfVectorizer(stop_words="english")
11X = vectorizer.fit_transform(corpus)
12sim = cosine_similarity(X)
13
14print(sim.round(3))

This matrix gives similarity score between every pair.

Extract Similar Pairs with Threshold

Convert matrix to pair list using a threshold.

python
1def similar_pairs(sim_matrix, threshold=0.35):
2    out = []
3    n = sim_matrix.shape[0]
4    for i in range(n):
5        for j in range(i + 1, n):
6            score = float(sim_matrix[i, j])
7            if score >= threshold:
8                out.append((i, j, score))
9    return out
10
11print(similar_pairs(sim, 0.3))

Threshold should be tuned on labeled examples, not guessed once.

Preprocessing Has Large Impact

Quality often depends more on preprocessing than algorithm.

Typical steps:

  • Lowercasing.
  • Stop-word filtering.
  • Optional stemming or lemmatization.
  • Removing boilerplate text.

If documents include templates, remove common headers and footers before vectorization to avoid inflated similarity.

Alternative: MinHash for Large Collections

For very large corpora, full pairwise cosine is expensive. MinHash and locality-sensitive hashing can reduce comparisons by finding likely candidates first.

This is useful when you need near-duplicate detection at scale.

Semantic Similarity Option

When lexical overlap is low but meaning is similar, embedding-based models can outperform TF-IDF. Use sentence embeddings and cosine similarity on vectors.

This increases compute cost but improves paraphrase detection.

Evaluation Strategy

Build a small labeled set with examples of:

  • true duplicates.
  • near duplicates.
  • unrelated documents.

Then measure precision and recall across thresholds. Without evaluation, similarity pipelines can produce many false positives or miss important matches.

Production Considerations

For operational use:

  • Cache vectorizer and model objects.
  • Version preprocessing rules.
  • Log top-matching pairs for review.
  • Recompute index incrementally for new documents.

If script runs periodically, store previous signatures to avoid reprocessing unchanged content.

Near-Duplicate Detection with Shingling

For plagiarism and near-duplicate workflows, document-level token overlap can be improved with character or word shingles. This captures local phrase continuity better than bag-of-words alone.

python
1def shingles(text, k=3):
2    words = text.lower().split()
3    return {" ".join(words[i:i+k]) for i in range(len(words)-k+1)}
4
5a = shingles("payment failed because card expired")
6b = shingles("card expired and payment was declined")
7score = len(a & b) / max(1, len(a | b))
8print(score)

This simple metric can complement TF-IDF when local word order matters.

Pipeline Design Tip

In production, use a two-stage pipeline: cheap candidate retrieval first, then expensive similarity scoring. This keeps throughput high while preserving detection quality.## Common Pitfalls

  • Using raw text without normalization and expecting stable scores.
  • Choosing threshold without labeled validation data.
  • Running full pairwise comparison on large corpora without candidate filtering.
  • Treating lexical similarity as semantic equivalence in all domains.
  • Ignoring multilingual content requirements in one-language preprocessing.

Summary

  • Document similarity requires clear metric and threshold design.
  • TF-IDF cosine is a strong baseline for many scripts.
  • Preprocessing quality strongly influences similarity outcomes.
  • Evaluate thresholds on labeled examples before production.
  • Use scalable candidate filtering for large document sets.

Course illustration
Course illustration

All Rights Reserved.