Better text documents clustering than tf/idf and cosine similarity?
Introduction
TF-IDF with cosine similarity is the classic baseline for document clustering, but it has known limitations: it ignores word order, misses semantic relationships (synonyms, paraphrases), and produces high-dimensional sparse vectors. Modern alternatives use dense embeddings from transformer models, topic models, or hybrid approaches that capture semantic meaning far better than bag-of-words representations.
Limitations of TF-IDF + Cosine Similarity
- No semantic understanding: "car" and "automobile" are treated as completely different terms
- High dimensionality: The vocabulary can exceed 100K terms, so each document vector has 100K+ dimensions, almost all of which are zero
- Context blindness: "bank" (river) and "bank" (finance) get the same representation
- Sensitive to vocabulary overlap: Documents about the same topic using different terminology score low similarity
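To make the first and last limitations concrete, here is a minimal sketch that computes TF-IDF vectors and cosine similarity in plain Python. The whitespace tokenizer, the smoothed IDF variant, and the three toy documents are assumptions chosen for illustration; a real pipeline would normally use a library vectorizer.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed IDF (assumed variant).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents: the first two describe the same thing with different words.
docs = [
    "car repair shop",
    "automobile maintenance garage",
    "car repair garage",
]
vecs = tfidf_vectors(docs)
sim_synonym = cosine(vecs[0], vecs[1])  # no shared terms
sim_overlap = cosine(vecs[0], vecs[2])  # shares "car" and "repair"
print(sim_synonym, sim_overlap)
```

The first pair scores exactly zero because the two documents share no terms, even though a human reader would consider them near-duplicates. The second pair scores higher only because of literal word overlap.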
Alternative 1: Sentence Transformers (Best for Most Cases)
Pre-trained transformer models produce dense embeddings that capture semantic meaning:
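A minimal sketch of this approach, assuming the `sentence-transformers` package is installed and the `all-MiniLM-L6-v2` model can be downloaded on first use. The helper names `embed_documents` and `cosine_similarity_matrix` are hypothetical, introduced here only for illustration.

```python
import numpy as np


def embed_documents(docs, model_name="all-MiniLM-L6-v2"):
    """Encode documents into L2-normalized dense vectors."""
    # Imported lazily so the similarity helper below can be used
    # (and tested) without the model library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(docs, normalize_embeddings=True)


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for an (n_docs, dim) array."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return X @ X.T
```

Passing the output of `embed_documents(["car repair shop", "automobile maintenance garage"])` to `cosine_similarity_matrix` should yield a clearly positive off-diagonal value, where TF-IDF gives zero for the same pair. The exact number depends on the model.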
Why It's Better
| Feature | TF-IDF | Sentence Transformers |
| --- | --- | --- |
| Dimensions | 10K-100K (sparse) | 384-768 (dense) |
| Semantic similarity | No | Yes |
| Synonyms | Missed | Captured |
| Context-aware | No | Yes |
| Pre-training data | None | Over 1 billion sentence pairs (for all-MiniLM-L6-v2) |
Alternative 2: BERTopic
BERTopic combines transformer embeddings with topic modeling:
BERTopic pipeline:
- Embed documents with sentence transformers
- Reduce dimensions with UMAP
- Cluster with HDBSCAN
- Extract topic representations with c-TF-IDF
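The pipeline above can be sketched as follows. `fit_bertopic` assumes the `bertopic` package is installed and simply uses its default settings. `class_tfidf` is a hypothetical helper that illustrates the idea behind step 4: it treats each cluster as one large document and weights terms by `tf * log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is the term's total frequency. The exact normalization used inside BERTopic may differ, so treat it as an approximation.

```python
import math
from collections import Counter, defaultdict


def fit_bertopic(docs):
    """Run BERTopic with default embedding, UMAP, and HDBSCAN settings."""
    from bertopic import BERTopic  # lazy import: optional heavy dependency

    topic_model = BERTopic()
    topics, _probs = topic_model.fit_transform(docs)
    return topic_model, topics


def class_tfidf(docs, labels):
    """Class-based TF-IDF sketch: {label: {term: weight}}."""
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        # All documents in a cluster are pooled into one bag of words.
        class_counts[label].update(doc.lower().split())

    total_words = sum(sum(c.values()) for c in class_counts.values())
    avg_words = total_words / len(class_counts)

    term_freq = Counter()  # frequency of each term across all clusters
    for counts in class_counts.values():
        term_freq.update(counts)

    weights = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        weights[label] = {
            term: (count / n_words) * math.log(1 + avg_words / term_freq[term])
            for term, count in counts.items()
        }
    return weights


# Toy example with hand-assigned cluster labels.
toy_docs = ["the cat sat", "the cat purred", "the stock fell", "the stock rose"]
toy_labels = [0, 0, 1, 1]
toy_weights = class_tfidf(toy_docs, toy_labels)
top_terms = {label: max(w, key=w.get) for label, w in toy_weights.items()}
print(top_terms)
```

In the toy example, "the" is frequent in both clusters and is down-weighted, so the top term per cluster is the one that distinguishes it. With the real library, `topic_model.get_topic_info()` returns a summary table of the discovered topics, and topic `-1` collects documents that HDBSCAN treated as outliers. BERTopic typically needs at least a few hundred documents to produce stable topics.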
Alternative 3: Doc2Vec
An extension of Word2Vec that learns document-level embeddings:
Doc2Vec is lighter than transformers but requires training on your corpus. Best for large, domain-specific collections.
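A minimal training sketch, assuming gensim 4.x (where document vectors live in `model.dv`). The `tokenize` and `train_doc2vec` helpers and the hyperparameter values are illustrative assumptions, not tuned recommendations.

```python
import re


def tokenize(text):
    """Lowercase and keep alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_doc2vec(docs, vector_size=50, epochs=40, seed=0):
    """Train Doc2Vec on `docs` and return (model, document_vectors)."""
    # Lazy imports: gensim is only needed when training is actually run.
    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged = [
        TaggedDocument(words=tokenize(doc), tags=[i])
        for i, doc in enumerate(docs)
    ]
    model = Doc2Vec(
        vector_size=vector_size,
        min_count=1,  # keep rare words; raise this for large corpora
        epochs=epochs,
        seed=seed,
        workers=1,    # a single worker keeps runs more reproducible
    )
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)
    vectors = np.vstack([model.dv[i] for i in range(len(docs))])
    return model, vectors
```

The returned `vectors` array can be passed to any clustering algorithm. For documents that were not part of training, `model.infer_vector(tokenize(new_doc))` produces a vector in the same space.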
Alternative 4: LDA (Latent Dirichlet Allocation)
A probabilistic topic model that discovers latent topics:
LDA works well for discovering interpretable topics but does not capture fine-grained semantic similarity.
Alternative 5: Word2Vec + Document Averaging
Average word embeddings to get document vectors:
Simple but loses word order information. Sentence Transformers are generally superior.
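A minimal sketch of the averaging step. `average_embedding` is a hypothetical helper, and the 2-dimensional vectors below are invented purely for illustration. In practice `word_vectors` would be pre-trained vectors, for example a gensim `KeyedVectors` object, which supports the same `in` and `[]` operations.

```python
import numpy as np


def average_embedding(tokens, word_vectors, dim):
    """Mean of the vectors for in-vocabulary tokens; zeros if none are known."""
    known = [
        np.asarray(word_vectors[token], dtype=float)
        for token in tokens
        if token in word_vectors
    ]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy vectors, made up for this example (not real Word2Vec output).
toy_vectors = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "river": [0.0, 1.0],
}

doc_a = average_embedding("the car".split(), toy_vectors, dim=2)
doc_b = average_embedding("an automobile".split(), toy_vectors, dim=2)

# Averaging ignores order: both token orders give the same document vector.
forward = average_embedding(["car", "river"], toy_vectors, dim=2)
backward = average_embedding(["river", "car"], toy_vectors, dim=2)
```

Out-of-vocabulary words such as "the" and "an" are skipped here, and `forward` equals `backward`, which is the loss of word order mentioned above.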
Choosing the Right Method
| Method | Best For | Semantic? | Speed | Setup |
| --- | --- | --- | --- | --- |
| TF-IDF + Cosine | Baseline, keyword-focused | No | Fast | Minimal |
| Sentence Transformers | General-purpose, best quality | Yes | Medium | pip install sentence-transformers |
| BERTopic | Topic discovery + clustering | Yes | Medium | pip install bertopic |
| Doc2Vec | Large domain-specific corpora | Partial | Fast (after training) | Requires training |
| LDA | Interpretable topic modeling | Partial | Fast | Requires tuning |
| Word2Vec averaging | Quick semantic improvement | Partial | Fast | Pre-trained vectors needed |
Clustering Algorithms to Pair With
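The clustering step is independent of how the document vectors were produced. The sketch below shows three common options from scikit-learn applied to an embedding matrix. The helper names are hypothetical, and the HDBSCAN variant assumes scikit-learn 1.3 or newer. Vectors are L2-normalized first so that Euclidean distance ranks pairs the same way cosine similarity does.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import normalize


def cluster_kmeans(embeddings, n_clusters, seed=0):
    """K-Means: fast, but the number of clusters must be chosen up front."""
    X = normalize(np.asarray(embeddings, dtype=float))
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(X)


def cluster_agglomerative(embeddings, n_clusters):
    """Hierarchical clustering with scikit-learn's default Ward linkage."""
    X = normalize(np.asarray(embeddings, dtype=float))
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)


def cluster_hdbscan(embeddings, min_cluster_size=5):
    """Density-based: infers the number of clusters; label -1 means noise."""
    from sklearn.cluster import HDBSCAN  # assumes scikit-learn >= 1.3

    X = normalize(np.asarray(embeddings, dtype=float))
    return HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(X)
```

K-Means and agglomerative clustering assign every document to a cluster, while HDBSCAN can leave outliers unassigned, which is often preferable for noisy real-world collections.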
Common Pitfalls
- Embedding model choice matters: all-MiniLM-L6-v2 is a good default. For domain-specific text (medical, legal), fine-tune a model or use a domain-specific one like PubMedBERT.
- Dimensionality reduction before clustering: For high-dimensional embeddings, apply UMAP or PCA before clustering. K-Means struggles in high dimensions due to the curse of dimensionality.
- Preprocessing still matters: Even with transformers, removing duplicates, very short documents, and boilerplate text improves clustering quality.
- Evaluation is hard: Clustering quality is often subjective. Use silhouette score for cluster separation, topic coherence for topic models, or manual inspection of sample clusters.
- Computational cost: Encoding with Sentence Transformers is much slower than computing TF-IDF, which becomes a bottleneck for very large corpora (millions of documents). Use batch encoding and GPU acceleration.
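The dimensionality-reduction and evaluation points can be combined in one small routine. The sketch below reduces the embeddings with PCA, runs K-Means for several candidate cluster counts, and keeps the count with the highest silhouette score. The helper name `best_k_by_silhouette` and the default of 5 components are assumptions for illustration; UMAP could replace PCA in the same position.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(embeddings, candidate_ks, n_components=5, seed=0):
    """Reduce with PCA, cluster for each k, and return (best_k, scores)."""
    X = np.asarray(embeddings, dtype=float)
    # PCA cannot produce more components than samples or features.
    n_components = min(n_components, X.shape[0], X.shape[1])
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(X)

    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(reduced)
        # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
        scores[k] = silhouette_score(reduced, labels)

    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Silhouette score is only a rough guide for text clusters, so it is worth reading a sample of documents from each cluster before settling on a value of k.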
Summary
- Best overall: Sentence Transformers (all-MiniLM-L6-v2) produce dense semantic embeddings that outperform TF-IDF for clustering
- Best for topic discovery: BERTopic combines embeddings + UMAP + HDBSCAN for automatic topic clustering
- Lightweight alternative: Doc2Vec or Word2Vec averaging for faster processing
- Still useful: TF-IDF works well for keyword-focused tasks and as a baseline
- Pair modern embeddings with HDBSCAN or K-Means for the clustering step

