Better text documents clustering than tf/idf and cosine similarity?

Introduction

TF-IDF with cosine similarity is the classic baseline for document clustering, but it has known limitations: it ignores word order, misses semantic relationships (synonyms, paraphrases), and produces high-dimensional sparse vectors. Modern alternatives use dense embeddings from transformer models, topic models, or hybrid approaches that capture semantic meaning far better than bag-of-words representations.

Limitations of TF-IDF + Cosine Similarity

No semantic understanding: "car" and "automobile" are treated as completely different terms
High dimensionality: Vocabularies can exceed 100K terms, so each document vector has 100K+ dimensions, almost all of them zero
Context blindness: "bank" (river) and "bank" (finance) get the same representation
Sensitive to vocabulary overlap: Documents about the same topic using different terminology score low similarity
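
The vocabulary-overlap problem is easy to reproduce. The sketch below assumes scikit-learn is installed and uses three made-up documents. The first two describe the same thing with disjoint vocabularies, so TF-IDF gives them no similarity at all, while the first and third share two words and score well above zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents (illustrative only): 0 and 1 are paraphrases with no shared words,
# 0 and 2 share the words "car" and "engine".
docs = [
    "car engine repair",
    "automobile motor maintenance",
    "car engine tuning",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse matrix, one row per document
sim = cosine_similarity(tfidf)                 # dense (3, 3) similarity matrix

print(f"paraphrase pair:   {sim[0, 1]:.2f}")  # no shared terms
print(f"shared-words pair: {sim[0, 2]:.2f}")  # positive because of "car" and "engine"
```

An embedding-based method would be expected to rank the paraphrase pair as the closer one.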

Alternative 1: Sentence Transformers (Best for Most Cases)

Pre-trained transformer models produce dense embeddings that capture semantic meaning:

python

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast, good quality

documents = [
    "The stock market crashed today",
    "Financial markets saw a major decline",
    "A new species of butterfly was discovered",
    "Researchers found an unknown insect in the Amazon",
    "Tesla's share price dropped significantly",
]

# Generate embeddings (384 dimensions vs 100K+ for TF-IDF)
embeddings = model.encode(documents)

# Cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)
# Expected grouping: finance docs (0, 1, 4) together, biology docs (2, 3) together.
# Cluster IDs are arbitrary, so the output may be [0 0 1 1 0] or [1 1 0 0 1].
print(labels)
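
One detail worth knowing when moving from cosine similarity to K-Means: scikit-learn's `KMeans` minimizes Euclidean distance, not cosine distance. For unit-length vectors the two are linked by ||a - b||^2 = 2 - 2 cos(a, b), so L2-normalizing the embeddings first makes K-Means rank neighbors the same way cosine similarity does. The sketch below checks that identity on random stand-in vectors (not real model output).

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Stand-in for model.encode(documents): 5 random 384-dimensional vectors.
embeddings = rng.normal(size=(5, 384))

unit = normalize(embeddings)  # L2-normalize each row to length 1

a, b = unit[0], unit[1]
cosine = float(a @ b)
squared_euclidean = float(np.sum((a - b) ** 2))

# For unit vectors: squared Euclidean distance == 2 - 2 * cosine similarity.
print(squared_euclidean, 2 - 2 * cosine)
```

Some models already return normalized vectors. With sentence-transformers, passing `normalize_embeddings=True` to `model.encode` requests it explicitly.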

Why It's Better

Feature	TF-IDF	Sentence Transformers
Dimensions	10K-100K (sparse)	384-768 (dense)
Semantic similarity	No	Yes
Synonyms	Missed	Captured
Context-aware	No	Yes
Pre-training data	None	Billions of sentences

Alternative 2: BERTopic

BERTopic combines transformer embeddings with topic modeling:

python

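from bertopic import BERTopic

documents = [...]  # Your document list; needs far more than a handful of documents to form topics

# BERTopic handles embedding, dimensionality reduction, and clustering
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(documents)

# View topics (returns a pandas DataFrame with one row per topic)
print(topic_model.get_topic_info())

# Visualize (returns a Plotly figure; call .show() outside a notebook)
fig = topic_model.visualize_topics()
fig.show()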

BERTopic pipeline:

Embed documents with sentence transformers
Reduce dimensions with UMAP
Cluster with HDBSCAN
Extract topic representations with c-TF-IDF
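
The last step is the least familiar one. As a rough sketch of the idea (not BERTopic's actual implementation, which works on sparse matrices and differs in normalization details), class-based TF-IDF joins all documents of a cluster into one pseudo-document and scores each term t in cluster c as tf(t, c) * log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the term's frequency across all clusters. The function name `c_tf_idf` and the toy data are made up for illustration.

```python
import math
from collections import Counter

def c_tf_idf(docs, labels):
    """Score terms per cluster as tf(t, c) * log(1 + A / f(t))."""
    # Merge all documents of a cluster into one bag of words.
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # f(t): frequency of each term across all clusters.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(class_counts)

    return {
        label: {t: tf * math.log(1 + avg_words / total[t]) for t, tf in counts.items()}
        for label, counts in class_counts.items()
    }

docs = [
    "stock market crash",
    "market decline stock",
    "butterfly species discovered",
    "insect species amazon",
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignments

scores = c_tf_idf(docs, labels)
top_terms = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_terms)
```

Terms that are frequent inside one cluster and rare elsewhere get the highest scores, which is what makes them usable as topic labels.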

Alternative 3: Doc2Vec

An extension of Word2Vec that learns document-level embeddings:

python

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# documents: a list of strings, as in the Sentence Transformers example

# Prepare tagged documents (one integer tag per document)
tagged_docs = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(documents)]

# Train Doc2Vec
model = Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=1, epochs=100)

# Get document vectors (model.dv is the gensim 4.x name for the document vectors)
vectors = [model.dv[i] for i in range(len(documents))]

# Cluster (n_clusters must not exceed the number of documents)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(vectors)

Doc2Vec is lighter than transformers but requires training on your corpus. Best for large, domain-specific collections.

Alternative 4: LDA (Latent Dirichlet Allocation)

A probabilistic topic model that discovers latent topics:

python

from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 drops terms that appear in only one document; lower it for tiny corpora
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=5, random_state=42)
topic_distributions = lda.fit_transform(doc_term_matrix)

# Each document is now a distribution over 5 topics
# Cluster based on topic distributions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(topic_distributions)

LDA works well for discovering interpretable topics but does not capture fine-grained semantic similarity.

Alternative 5: Word2Vec + Document Averaging

Average word embeddings to get document vectors:

python

import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Load pre-trained word vectors (the file must be downloaded separately)
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def doc_vector(doc, model):
    """Average the vectors of all in-vocabulary words; zeros if none are known."""
    words = doc.lower().split()
    vectors = [model[w] for w in words if w in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

doc_vectors = np.array([doc_vector(doc, wv) for doc in documents])

# Cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(doc_vectors)

Simple but loses word order information. Sentence Transformers are generally superior.
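
The loss of word order can be shown without downloading any vectors. The sketch below uses three hypothetical 3-dimensional word vectors, made up for illustration, and the same averaging logic as above. Two sentences with opposite meanings end up with identical document vectors.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for illustration.
toy_vectors = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man": np.array([0.0, 0.0, 1.0]),
}

def doc_vector(doc, vectors, dim=3):
    """Average the vectors of all known words in the document."""
    found = [vectors[w] for w in doc.lower().split() if w in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

v1 = doc_vector("dog bites man", toy_vectors)
v2 = doc_vector("man bites dog", toy_vectors)

# Averaging is order-independent, so both sentences map to the same point.
print(np.allclose(v1, v2))  # True
```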

Choosing the Right Method

Method	Best For	Semantic?	Speed	Setup
TF-IDF + Cosine	Baseline, keyword-focused	No	Fast	Minimal
Sentence Transformers	General-purpose, best quality	Yes	Medium	`pip install sentence-transformers`
BERTopic	Topic discovery + clustering	Yes	Medium	`pip install bertopic`
Doc2Vec	Large domain-specific corpora	Partial	Fast (after training)	Requires training
LDA	Interpretable topic modeling	Partial	Fast	Requires tuning
Word2Vec averaging	Quick semantic improvement	Partial	Fast	Pre-trained vectors needed

Clustering Algorithms to Pair With

python

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
import hdbscan  # separate package: pip install hdbscan

# embeddings: document vectors from any of the methods above

# K-Means: when you know the number of clusters
labels = KMeans(n_clusters=5, n_init=10).fit_predict(embeddings)

# HDBSCAN: automatic cluster count, handles noise (noise points get label -1)
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embeddings)

# Agglomerative: hierarchical, good for dendrograms
labels = AgglomerativeClustering(n_clusters=5).fit_predict(embeddings)

# DBSCAN: density-based, handles noise; eps depends on the embedding scale and needs tuning
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(embeddings)
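
The density-based algorithms return the label -1 for points they treat as noise, so the number of clusters has to be counted with that label excluded. A minimal sketch with scikit-learn's `DBSCAN` on hand-made 2-D points (two tight groups and one outlier, chosen for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],  # dense group B
    [10.0, -10.0],                                   # isolated outlier
])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

# Label -1 marks noise, so it does not count as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(labels)               # [ 0  0  0  0  1  1  1  1 -1]
print(n_clusters, n_noise)  # 2 1
```

With real document embeddings, a large share of noise points usually means `eps` or `min_cluster_size` needs adjusting rather than that the documents are unclusterable.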

Common Pitfalls

Embedding model choice matters: all-MiniLM-L6-v2 is a good default. For domain-specific text (medical, legal), fine-tune a model or use a domain-specific one like PubMedBERT.
Dimensionality reduction before clustering: For high-dimensional embeddings, apply UMAP or PCA before clustering. K-Means struggles in high dimensions due to the curse of dimensionality.
Preprocessing still matters: Even with transformers, removing duplicates, very short documents, and boilerplate text improves clustering quality.
Evaluation is hard: Clustering quality is often subjective. Use silhouette score for cluster separation, topic coherence for topic models such as LDA or BERTopic, and manual inspection of sample clusters.
Computational cost: Encoding with Sentence Transformers is much slower than computing TF-IDF, and the gap becomes a practical problem for very large corpora (millions of documents). Use batch encoding and GPU acceleration.
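
The dimensionality-reduction and evaluation points can be combined into one small experiment. The sketch below assumes scikit-learn and uses synthetic blobs as a stand-in for real document embeddings. It reduces the data with PCA, runs K-Means for several cluster counts, and keeps the count with the highest silhouette score. The values 10 components and k from 2 to 6 are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic stand-in for document embeddings: 300 points, 50 dimensions, 3 groups.
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# Reduce dimensionality before clustering.
reduced = PCA(n_components=10, random_state=0).fit_transform(X)

# Try several cluster counts and record the silhouette score of each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
    scores[k] = silhouette_score(reduced, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Silhouette score only measures geometric separation, so it is worth reading a sample of documents from each cluster before trusting the chosen k.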

Summary

Best overall: Sentence Transformers (all-MiniLM-L6-v2 is a good default) produce dense semantic embeddings that usually cluster better than TF-IDF vectors, especially when documents on the same topic use different wording
Best for topic discovery: BERTopic combines embeddings + UMAP + HDBSCAN for automatic topic clustering
Lightweight alternative: Doc2Vec or Word2Vec averaging for faster processing
Still useful: TF-IDF works well for keyword-focused tasks and as a baseline
Pair modern embeddings with HDBSCAN or K-Means for the clustering step