
Better Text Document Clustering than TF-IDF and Cosine Similarity?


Introduction

TF-IDF with cosine similarity is the classic baseline for document clustering, but it has known limitations: it ignores word order, misses semantic relationships (synonyms, paraphrases), and produces high-dimensional sparse vectors. Modern alternatives use dense embeddings from transformer models, topic models, or hybrid approaches that capture semantic meaning far better than bag-of-words representations.

Limitations of TF-IDF + Cosine Similarity

  • No semantic understanding: "car" and "automobile" are treated as completely different terms
  • High dimensionality: Each vector has one dimension per vocabulary term, which can exceed 100K, and most entries are zero
  • Context blindness: "bank" (river) and "bank" (finance) get the same representation
  • Sensitive to vocabulary overlap: Documents about the same topic using different terminology score low similarity
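
The vocabulary-overlap problem can be seen in a few lines. The sketch below is a simplified illustration, not a full TF-IDF implementation: it uses raw term counts and whitespace tokenization to stay dependency-free, and the helper name bow_cosine and the example sentences are made up here. IDF weighting would rescale the shared terms but cannot create overlap where none exists.

```python
import math
from collections import Counter

def bow_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw term-count vectors of two documents."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

same_topic = bow_cosine("my car broke down", "her automobile stopped working")
shared_words = bow_cosine("my car broke down", "my car is new")

print(same_topic)    # 0.0: no shared terms, despite the shared meaning
print(shared_words)  # 0.5: two of four terms overlap
```

The two sentences about a broken-down vehicle score lower than two sentences that merely share the words "my car", which is the behavior the alternatives below try to avoid.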

Alternative 1: Sentence Transformers (Best for Most Cases)

Pre-trained transformer models produce dense embeddings that capture semantic meaning:

python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Load a pre-trained model (small and fast, with good general-purpose quality)
model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "The stock market crashed today",
    "Financial markets saw a major decline",
    "A new species of butterfly was discovered",
    "Researchers found an unknown insect in the Amazon",
    "Tesla's share price dropped significantly",
]

# Generate dense embeddings: one 384-dimensional vector per document
embeddings = model.encode(documents)

# Cluster the embeddings into two groups
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Expect the finance documents (0, 1, 4) to share one label and the
# biology documents (2, 3) the other; the label ids themselves are arbitrary
print(labels)
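
Because K-Means label ids are arbitrary, a raw label array is hard to read. A small helper that groups documents by label makes the result easier to inspect. This is a minimal sketch: the function name group_by_label and the example labels are hypothetical stand-ins, not real model output.

```python
from collections import defaultdict

def group_by_label(documents, labels):
    """Map each cluster label to the documents assigned to it."""
    clusters = defaultdict(list)
    for doc, label in zip(documents, labels):
        clusters[int(label)].append(doc)
    return dict(clusters)

# Hypothetical labels standing in for the output of kmeans.fit_predict(...)
example_docs = ["stock crash", "market decline", "new butterfly"]
example_labels = [1, 1, 0]

for label, docs in sorted(group_by_label(example_docs, example_labels).items()):
    print(label, docs)
```

The int(label) conversion lets the same helper accept the NumPy integer labels that scikit-learn returns.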

Why It's Better

| Feature | TF-IDF | Sentence Transformers |
|---|---|---|
| Dimensions | 10K-100K (sparse) | 384-768 (dense) |
| Semantic similarity | No | Yes |
| Synonyms | Missed | Captured |
| Context-aware | No | Yes |
| Pre-training data | None | Billions of sentences |

Alternative 2: BERTopic

BERTopic combines transformer embeddings with topic modeling:

python
from bertopic import BERTopic

documents = [...]  # Your document list (use a reasonably large corpus, not a handful of sentences)

# BERTopic handles embedding, dimensionality reduction, and clustering
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(documents)

# View topics (returns a pandas DataFrame; topic -1 collects outliers)
print(topic_model.get_topic_info())

# Visualize (returns a Plotly figure)
fig = topic_model.visualize_topics()
fig.show()

BERTopic pipeline:

  1. Embed documents with sentence transformers
  2. Reduce dimensions with UMAP
  3. Cluster with HDBSCAN
  4. Extract topic representations with c-TF-IDF
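
Step 4 is the least familiar part of the pipeline, so here is a rough sketch of the idea behind c-TF-IDF: all documents in a cluster are treated as one long document, and each term is weighted by its frequency within the cluster times log(1 + A / f_t), where A is the average word count per cluster and f_t is the term's frequency across all clusters. This is a simplified illustration under those assumptions (whitespace tokenization, a made-up function name class_tfidf), not BERTopic's actual implementation.

```python
import math
from collections import Counter

def class_tfidf(docs, labels):
    """Simplified c-TF-IDF: score terms per cluster rather than per document."""
    # Merge each cluster's documents into a single bag of words.
    class_counts = {}
    for doc, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(doc.lower().split())

    # f_t: frequency of each term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            term: (tf / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
    return scores

docs = [
    "stock market crash",
    "stock prices fall",
    "butterfly species discovered",
    "rare butterfly found",
]
labels = [0, 0, 1, 1]

scores = class_tfidf(docs, labels)
top_terms = {label: max(s, key=s.get) for label, s in scores.items()}
print(top_terms)  # {0: 'stock', 1: 'butterfly'}
```

The highest-scoring terms per cluster serve as that cluster's topic description.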

Alternative 3: Doc2Vec

An extension of Word2Vec that learns document-level embeddings:

python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# `documents` is a list of strings, as in the earlier examples

# Prepare tagged documents: tokens plus a unique integer tag per document
tagged_docs = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(documents)]

# Train Doc2Vec on the corpus
model = Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=1, epochs=100)

# Get the learned document vectors, looked up by tag
vectors = np.array([model.dv[i] for i in range(len(documents))])

# Cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(vectors)

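Doc2Vec is lighter than transformer models but must be trained on your own corpus, so it works best for large, domain-specific collections. Documents that were not in the training set can be embedded afterwards with model.infer_vector(tokens).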

Alternative 4: LDA (Latent Dirichlet Allocation)

A probabilistic topic model that discovers latent topics:

python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# `documents` is a list of strings; min_df=2 drops terms that appear in
# fewer than two documents, so this needs a corpus of realistic size
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=5, random_state=42)
topic_distributions = lda.fit_transform(doc_term_matrix)

# Each document is now a distribution over 5 topics
# Cluster based on topic distributions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(topic_distributions)

LDA works well for discovering interpretable topics but does not capture fine-grained semantic similarity.

Alternative 5: Word2Vec + Document Averaging

Average word embeddings to get document vectors:

python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Load pre-trained word vectors (the file must be downloaded separately)
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def doc_vector(doc, model):
    """Average the vectors of all in-vocabulary words; zeros if none are found."""
    words = doc.lower().split()
    vectors = [model[w] for w in words if w in model]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# `documents` is a list of strings, as in the earlier examples
doc_vectors = np.array([doc_vector(doc, wv) for doc in documents])

# Cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(doc_vectors)

This approach is simple, but averaging discards word order, so sentences built from the same words get the same vector. Sentence Transformers are generally superior.
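
The word-order limitation is easy to demonstrate without downloading any vectors. The sketch below uses three made-up 3-dimensional word vectors (toy_vectors is purely hypothetical) in place of real Word2Vec embeddings; the effect is the same with real ones.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for illustration
toy_vectors = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man": np.array([0.0, 0.0, 1.0]),
}

def average_vector(doc, vectors):
    """Mean of the word vectors for the in-vocabulary words of doc."""
    found = [vectors[w] for w in doc.lower().split() if w in vectors]
    return np.mean(found, axis=0)

a = average_vector("dog bites man", toy_vectors)
b = average_vector("man bites dog", toy_vectors)

print(np.allclose(a, b))  # True: both sentences map to the same vector
```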

Choosing the Right Method

| Method | Best For | Semantic? | Speed | Setup |
|---|---|---|---|---|
| TF-IDF + Cosine | Baseline, keyword-focused | No | Fast | Minimal |
| Sentence Transformers | General-purpose, best quality | Yes | Medium | pip install sentence-transformers |
| BERTopic | Topic discovery + clustering | Yes | Medium | pip install bertopic |
| Doc2Vec | Large domain-specific corpora | Partial | Fast (after training) | Requires training |
| LDA | Interpretable topic modeling | Partial | Fast | Requires tuning |
| Word2Vec averaging | Quick semantic improvement | Partial | Fast | Pre-trained vectors needed |

Clustering Algorithms to Pair With

python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
import hdbscan  # pip install hdbscan

# `embeddings` is the document-vector array produced by one of the methods above

# K-Means: use when you know the number of clusters
labels = KMeans(n_clusters=5).fit_predict(embeddings)

# HDBSCAN: infers the cluster count and labels noise points as -1
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embeddings)

# Agglomerative: hierarchical, good for dendrograms
labels = AgglomerativeClustering(n_clusters=5).fit_predict(embeddings)

# DBSCAN: density-based, labels noise points as -1; tune eps to the scale of your embeddings
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(embeddings)
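
Density-based methods such as DBSCAN and HDBSCAN mark points that belong to no cluster with the label -1, so that label has to be excluded when counting clusters. The sketch below shows this with DBSCAN on a few synthetic 2-D points chosen for illustration; real embeddings will need different eps and min_samples values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic 2-D points: two tight groups plus one isolated outlier
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [10.0, -10.0],
])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

n_clusters = len(set(labels) - {-1})  # label -1 marks noise, not a cluster
n_noise = int(np.sum(labels == -1))

print(n_clusters, n_noise)  # 2 1
```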

Common Pitfalls

  • Embedding model choice matters: all-MiniLM-L6-v2 is a good default. For domain-specific text (medical, legal), fine-tune a model or use a domain-specific one like PubMedBERT.
  • Dimensionality reduction before clustering: For high-dimensional embeddings, apply UMAP or PCA before clustering. K-Means struggles in high dimensions due to the curse of dimensionality.
  • Preprocessing still matters: Even with transformers, removing duplicates, very short documents, and boilerplate text improves clustering quality.
  • Evaluation is hard: Clustering quality is often subjective. Use silhouette score, coherence score, or manual inspection of sample clusters.
  • Computational cost: Encoding documents with Sentence Transformers is far slower than computing TF-IDF, which matters most for very large corpora (millions of documents). Use batch encoding and GPU acceleration.
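
Two of these points, dimensionality reduction and evaluation, can be combined in one short sketch. The data here is synthetic (make_blobs stands in for real document embeddings), and the choices of 10 PCA components and 3 clusters are arbitrary assumptions for the example rather than recommended settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic stand-in for document embeddings: 200 points, 50 dimensions
X, _ = make_blobs(n_samples=200, n_features=50, centers=3, random_state=42)

# Reduce to 10 dimensions before clustering
reduced = PCA(n_components=10, random_state=42).fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(reduced)

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters
score = silhouette_score(reduced, labels)
print(f"silhouette: {score:.3f}")
```

On real text embeddings the score is usually much lower than on clean synthetic blobs, so it is most useful for comparing settings against each other and works best alongside manual inspection of sample clusters.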

Summary

  • Best overall: Sentence Transformers (all-MiniLM-L6-v2 is a good default) produce dense semantic embeddings that generally outperform TF-IDF for clustering
  • Best for topic discovery: BERTopic combines embeddings + UMAP + HDBSCAN for automatic topic clustering
  • Lightweight alternative: Doc2Vec or Word2Vec averaging for faster processing
  • Still useful: TF-IDF works well for keyword-focused tasks and as a baseline
  • Pair modern embeddings with HDBSCAN or K-Means for the clustering step

