scikit-learn clustering text documents using DBSCAN

Introduction

DBSCAN can cluster text documents, but it behaves differently from the more common K-means approach because it groups points by density and labels sparse points as noise. In text problems, that means the success of DBSCAN depends heavily on how you vectorize the documents and how you choose the distance metric and eps threshold.

Vectorize the Documents First

DBSCAN works on numeric feature vectors, so text must first be converted into a vector space. A common starting point is TF-IDF.

python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "cats chase mice in the house",
    "dogs bark loudly at night",
    "the kitten plays with yarn",
    "puppies need food and water",
    "machine learning models classify text",
    "document clustering groups related text",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

print(X.shape)

The result is a sparse matrix where each row represents one document.

Apply DBSCAN

For text, cosine distance is often more meaningful than Euclidean distance because document direction matters more than absolute magnitude.

python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(
    eps=0.7,
    min_samples=2,
    metric="cosine"
)

labels = dbscan.fit_predict(X)
print(labels)

The labels work like this:

  • non-negative integers are cluster IDs
  • -1 means noise

That noise label is one of DBSCAN's major differences from algorithms that force every document into some cluster.
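
A label array like this is usually summarized before it is interpreted. The sketch below counts clusters and noise points; the `labels` array is a hypothetical example, not the output of any code in this article.

```python
import numpy as np

# Hypothetical label array of the kind fit_predict returns.
labels = np.array([0, 0, -1, 1, 1, -1])

# Cluster IDs are the distinct labels, excluding the noise marker -1.
cluster_ids = sorted(set(labels.tolist()) - {-1})
n_clusters = len(cluster_ids)

# Noise points are the entries labeled -1.
n_noise = int(np.sum(labels == -1))

# Number of documents assigned to each cluster.
cluster_sizes = {c: int(np.sum(labels == c)) for c in cluster_ids}

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("sizes:", cluster_sizes)
```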

Why DBSCAN Is Tricky for Text

Text vectors are usually high-dimensional and sparse. In that kind of space:

  • many documents are far apart
  • density estimates become unstable
  • eps can be difficult to tune

So DBSCAN can work, but it is more sensitive than people expect. You often need to experiment with:

  • stop-word handling
  • n-grams
  • dimensionality reduction
  • eps
  • min_samples
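
One common heuristic for picking a starting eps is to look at how far each document is from its nearest neighbors under the chosen metric. The sketch below is one way to do that with scikit-learn's NearestNeighbors; the toy `docs` list and the choice of `min_samples = 2` are assumptions for illustration, and the heuristic gives a starting range rather than a definitive value.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus, assumed for illustration.
docs = [
    "apple banana fruit salad",
    "banana orange fruit smoothie",
    "python machine learning pipeline",
    "deep learning neural networks",
    "fresh fruit market apples",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

min_samples = 2

# DBSCAN counts a point itself toward min_samples. Querying the training
# data returns each point as its own nearest neighbor first, so asking for
# min_samples neighbors yields the distance that decides core status.
nn = NearestNeighbors(n_neighbors=min_samples, metric="cosine").fit(X)
distances, _ = nn.kneighbors(X)

# Sorted distance to the last requested neighbor, one value per document.
k_distances = np.sort(distances[:, -1])
print(np.round(k_distances, 3))
```

Values of eps below the smallest entry leave every document as noise, while values above the largest entry make every document a core point. Candidate values usually lie somewhere in between, often near a visible jump in the sorted distances.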

A Small End-to-End Example

python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = [
    "apple banana fruit salad",
    "banana orange fruit smoothie",
    "python machine learning pipeline",
    "deep learning neural networks",
    "fresh fruit market apples",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

model = DBSCAN(eps=0.8, min_samples=2, metric="cosine")
labels = model.fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, "->", doc)

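With these settings, documents that share several terms, such as the two banana and fruit documents, are the most likely to cluster together. Documents that share only one term, such as the two learning documents, may form their own cluster or become noise, depending on the exact parameters.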

When Dimensionality Reduction Helps

Sometimes it helps to reduce the TF-IDF space before clustering. Truncated SVD is a common choice for sparse text data:

python
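from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD

# n_components must not exceed the number of TF-IDF features, so 50 assumes
# a corpus with a larger vocabulary than the toy examples above.
svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X)

# No metric is passed here, so DBSCAN falls back to its default (Euclidean).
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X_reduced)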

Reducing the space can make distances more stable, though it also changes the geometry of the problem. It is a tuning tool, not a guaranteed improvement.
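
A variant that is sometimes used is to re-normalize the reduced vectors so that cosine distance stays meaningful after the projection. The sketch below chains TruncatedSVD and Normalizer in a pipeline. The toy corpus, the component count, and the eps value are all assumptions chosen so the example runs on five documents, not recommended settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy corpus, assumed for illustration.
docs = [
    "apple banana fruit salad",
    "banana orange fruit smoothie",
    "python machine learning pipeline",
    "deep learning neural networks",
    "fresh fruit market apples",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Keep the component count below both matrix dimensions; a real corpus
# would typically allow a much larger value such as 50 or 100.
n_components = min(50, min(X.shape) - 1)

# Project to a dense low-dimensional space, then rescale rows to unit length.
reducer = make_pipeline(
    TruncatedSVD(n_components=n_components, random_state=42),
    Normalizer(copy=False),
)
X_reduced = reducer.fit_transform(X)

# eps=0.5 is an arbitrary starting point and needs tuning per corpus.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(X_reduced)

print(X_reduced.shape)
print(labels)
```

On unit-length rows, squared Euclidean distance equals twice the cosine distance, so either metric can be used after normalization as long as eps is scaled accordingly.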

Common Pitfalls

The most common mistake is using DBSCAN with its default parameters (eps=0.5, min_samples=5, Euclidean metric) and expecting it to work well on high-dimensional text immediately. The eps value in particular is highly problem-specific.

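Another issue is ignoring the distance metric. Euclidean distance on unnormalized sparse vectors is sensitive to document length, which makes it a poor default for document similarity. Even on L2-normalized TF-IDF rows (the TfidfVectorizer default), an eps tuned for cosine distance does not carry over to Euclidean distance.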
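
The effect of the metric is easiest to see on unnormalized vectors. The sketch below uses three small hand-made term-count vectors (hypothetical, not produced by TfidfVectorizer) to compare the two metrics when one document is simply a longer copy of another.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical raw term-count vectors over a three-word vocabulary.
short_doc = np.array([[1.0, 2.0, 0.0]])
long_doc = 10 * short_doc               # same word proportions, ten times longer
other_doc = np.array([[0.0, 1.0, 3.0]])  # different word proportions

# Cosine distance ignores length: the longer copy is at distance ~0.
cos_same = cosine_distances(short_doc, long_doc)[0, 0]
cos_other = cosine_distances(short_doc, other_doc)[0, 0]

# Euclidean distance puts the longer copy farther away than the
# document with different content.
euc_same = euclidean_distances(short_doc, long_doc)[0, 0]
euc_other = euclidean_distances(short_doc, other_doc)[0, 0]

print("cosine:   ", round(cos_same, 3), round(cos_other, 3))
print("euclidean:", round(euc_same, 3), round(euc_other, 3))
```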

A third pitfall is misreading noise points. A label of -1 does not necessarily mean the document is bad; it means DBSCAN did not find enough nearby neighbors under the current settings.
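
One way to check whether noise labels reflect the data or just the settings is to rerun DBSCAN over a few eps values and watch the noise count. The sketch below does this on a toy corpus; the corpus and the three eps values are assumptions for illustration.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, assumed for illustration.
docs = [
    "apple banana fruit salad",
    "banana orange fruit smoothie",
    "python machine learning pipeline",
    "deep learning neural networks",
    "fresh fruit market apples",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Count noise points for increasing eps with min_samples held fixed.
eps_values = [0.5, 0.7, 0.9]
noise_counts = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=2, metric="cosine").fit_predict(X)
    noise_counts.append(int((labels == -1).sum()))

for eps, count in zip(eps_values, noise_counts):
    print(f"eps={eps}: {count} noise points out of {len(docs)}")
```

With min_samples fixed, a larger eps can only keep or reduce the number of noise points, so a document that stays at -1 across a wide range of eps is a stronger outlier candidate than one that flips quickly.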

Finally, if your real goal is "partition all documents into a small fixed number of topics," K-means or topic modeling may be a better fit than DBSCAN.
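
For comparison, here is a minimal K-means sketch on the same kind of TF-IDF input. The toy corpus and the choice of two clusters are assumptions for illustration; the point is only that every document receives a cluster label and none is marked as noise.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, assumed for illustration.
docs = [
    "apple banana fruit salad",
    "banana orange fruit smoothie",
    "python machine learning pipeline",
    "deep learning neural networks",
    "fresh fruit market apples",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# The number of clusters is fixed up front, unlike with DBSCAN.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, "->", doc)
```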

Summary

  • To use DBSCAN on text, first convert documents into vectors, often with TF-IDF.
  • Cosine distance is commonly more appropriate than Euclidean distance for text similarity.
  • DBSCAN can find clusters and mark outliers as noise.
  • Parameter tuning is critical because text vectors are sparse and high-dimensional.
  • Consider dimensionality reduction or alternative clustering methods if DBSCAN is unstable.
