Algorithm to detect similar documents in python script
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
Detecting similar documents in Python depends on both algorithm choice and text preprocessing quality. Exact string matching is rarely enough because similar content can differ by punctuation, word order, or minor edits. A good solution defines similarity metric, threshold policy, and scalability strategy up front.
Choose a Similarity Approach
Common options:
- TF-IDF with cosine similarity for lexical overlap.
- N-gram Jaccard similarity for approximate token overlap.
- Embedding models for semantic similarity.
For many practical scripts, TF-IDF cosine is a strong baseline with predictable behavior.
TF-IDF Cosine Baseline
Use scikit-learn to vectorize corpus and compute pairwise similarity matrix.
This matrix gives similarity score between every pair.
Extract Similar Pairs with Threshold
Convert matrix to pair list using a threshold.
Threshold should be tuned on labeled examples, not guessed once.
Preprocessing Has Large Impact
Quality often depends more on preprocessing than algorithm.
Typical steps:
- Lowercasing.
- Stop-word filtering.
- Optional stemming or lemmatization.
- Removing boilerplate text.
If documents include templates, remove common headers and footers before vectorization to avoid inflated similarity.
Alternative: MinHash for Large Collections
For very large corpora, full pairwise cosine is expensive. MinHash and locality-sensitive hashing can reduce comparisons by finding likely candidates first.
This is useful when you need near-duplicate detection at scale.
Semantic Similarity Option
When lexical overlap is low but meaning is similar, embedding-based models can outperform TF-IDF. Use sentence embeddings and cosine similarity on vectors.
This increases compute cost but improves paraphrase detection.
Evaluation Strategy
Build a small labeled set with examples of:
- true duplicates.
- near duplicates.
- unrelated documents.
Then measure precision and recall across thresholds. Without evaluation, similarity pipelines can produce many false positives or miss important matches.
Production Considerations
For operational use:
- Cache vectorizer and model objects.
- Version preprocessing rules.
- Log top-matching pairs for review.
- Recompute index incrementally for new documents.
If script runs periodically, store previous signatures to avoid reprocessing unchanged content.
Near-Duplicate Detection with Shingling
For plagiarism and near-duplicate workflows, document-level token overlap can be improved with character or word shingles. This captures local phrase continuity better than bag-of-words alone.
This simple metric can complement TF-IDF when local word order matters.
Pipeline Design Tip
In production, use a two-stage pipeline: cheap candidate retrieval first, then expensive similarity scoring. This keeps throughput high while preserving detection quality.## Common Pitfalls
- Using raw text without normalization and expecting stable scores.
- Choosing threshold without labeled validation data.
- Running full pairwise comparison on large corpora without candidate filtering.
- Treating lexical similarity as semantic equivalence in all domains.
- Ignoring multilingual content requirements in one-language preprocessing.
Summary
- Document similarity requires clear metric and threshold design.
- TF-IDF cosine is a strong baseline for many scripts.
- Preprocessing quality strongly influences similarity outcomes.
- Evaluate thresholds on labeled examples before production.
- Use scalable candidate filtering for large document sets.

