Image Clustering by Similarity in Python

Introduction

Image clustering means grouping images so that visually similar items end up together without manually labeling every file. In practice, the quality of the clusters depends much more on the features you extract from each image than on the clustering algorithm itself.

Extract Useful Image Features First

Clustering raw pixels rarely works well because small shifts in lighting, crop, or background can dominate the distance calculation. A better approach is to convert each image into a compact feature vector using a pretrained vision model.
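To make the raw-pixel problem concrete, here is a minimal sketch on synthetic data. The 8x8 brightness ramp and the +60 lighting shift are made-up stand-ins for real photos, and `pixel_distance` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Synthetic 8x8 grayscale "image": a left-to-right ramp (0, 10, ..., 70).
base = np.tile(np.arange(8, dtype=np.float64) * 10.0, (8, 1))

# Same content under brighter lighting: every pixel shifted by +60.
brighter = base + 60.0

# Different structure at the same overall brightness: the ramp mirrored.
mirrored = base[:, ::-1]


def pixel_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two images treated as flat pixel vectors."""
    return float(np.linalg.norm(a.ravel() - b.ravel()))


d_brighter = pixel_distance(base, brighter)
d_mirrored = pixel_distance(base, mirrored)

print(f"same content, brighter lighting: {d_brighter:.1f}")
print(f"mirrored content, same lighting: {d_mirrored:.1f}")
```

In this toy case the lighting shift alone puts the image farther from itself than from a structurally different image, which is the kind of behavior that makes raw-pixel distances unreliable for clustering.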

The example below uses torchvision with a pretrained resnet18, removes the final fully connected (classifier) layer, and keeps the 512-dimensional output of the global average pooling layer as the feature embedding.

python

from pathlib import Path

import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from torchvision import models, transforms

device = torch.device("cpu")

# Load ImageNet-pretrained weights and drop the final fully connected layer,
# so the network outputs the pooled feature vector instead of class scores.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1]).to(device)
feature_extractor.eval()  # inference mode: batch norm uses its stored statistics

# Resize to the input size the model expects and normalize with the
# ImageNet channel means and standard deviations.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])


def extract_feature(image_path: Path) -> np.ndarray:
    """Return one embedding vector for the image at image_path."""
    image = Image.open(image_path).convert("RGB")
    tensor = preprocess(image).unsqueeze(0).to(device)  # add a batch dimension

    with torch.no_grad():
        features = feature_extractor(tensor)  # shape: (1, 512, 1, 1)

    return features.squeeze().cpu().numpy()  # shape: (512,)

Each image becomes a 512-dimensional vector that captures higher-level visual structure better than a color histogram alone.

Cluster the Embeddings

Once you have one feature vector per image, cluster them with a standard algorithm such as KMeans.

python

image_dir = Path("images")
image_paths = sorted(image_dir.glob("*.jpg"))

# Stack the per-image vectors into an (n_images, n_features) matrix.
embeddings = np.vstack([extract_feature(path) for path in image_paths])
embeddings = normalize(embeddings)  # scale each row to unit L2 length

# n_init="auto" requires scikit-learn 1.2 or newer.
kmeans = KMeans(n_clusters=4, random_state=42, n_init="auto")
labels = kmeans.fit_predict(embeddings)

for path, label in zip(image_paths, labels):
    print(label, path.name)

Normalizing embeddings is often helpful because clustering then focuses more on direction in feature space than on raw vector magnitude.
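The effect of normalization can be checked numerically. The sketch below uses random vectors as stand-ins for real embeddings (an assumption made purely for illustration) and confirms that, once rows have unit L2 length, squared Euclidean distance equals 2 - 2 * cosine similarity, so the Euclidean distances KMeans works with depend only on the angle between vectors.

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Random stand-ins for embeddings; the first row is scaled up to mimic
# an image whose raw feature vector has an unusually large magnitude.
raw = rng.normal(size=(5, 512))
raw[0] *= 10.0

unit = normalize(raw)  # default norm="l2", applied to each row
row_norms = np.linalg.norm(unit, axis=1)

# Pairwise cosine similarity and pairwise squared Euclidean distance.
cosine = unit @ unit.T
diffs = unit[:, None, :] - unit[None, :, :]
squared_distance = (diffs ** 2).sum(axis=-1)

print(np.allclose(row_norms, 1.0))
print(np.allclose(squared_distance, 2.0 - 2.0 * cosine))
```

Without normalization, the scaled-up first row would sit far from every other point regardless of its direction.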

Organize the Results

Printing cluster IDs is useful for debugging, but real workflows usually group files by cluster so you can inspect them.

python

from collections import defaultdict

clusters = defaultdict(list)
for path, label in zip(image_paths, labels):
    clusters[int(label)].append(path.name)

for label, names in clusters.items():
    print(f"Cluster {label}")
    for name in names:
        print("  ", name)

At this stage you can review whether the model grouped images by subject, color palette, composition, or some other visual cue.
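For visual review it can help to copy each image into a folder named after its cluster and browse the folders in a file manager. The following is one possible sketch. The function name `copy_into_cluster_dirs` and the `cluster_<id>` folder naming are assumptions for illustration, not part of the original workflow.

```python
import shutil
from pathlib import Path


def copy_into_cluster_dirs(image_paths, labels, output_dir):
    """Copy each image into output_dir/cluster_<label>/ (hypothetical layout).

    Files that share a name within one cluster overwrite each other, so
    this assumes file names are unique across the input set.
    """
    output_dir = Path(output_dir)
    for path, label in zip(image_paths, labels):
        target_dir = output_dir / f"cluster_{int(label)}"
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target_dir / Path(path).name)
```

With the variables from the earlier snippets, a call could look like `copy_into_cluster_dirs(image_paths, labels, "clustered")`. Copying rather than moving leaves the source folder untouched, so you can rerun with a different number of clusters.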

Choosing the Number of Clusters

KMeans needs a value for n_clusters, and that number strongly affects the result. In exploratory work, try several values and inspect the outputs. You can also compute metrics such as silhouette score, but human inspection still matters because "good" visual grouping is often application-specific.
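One way to try several values is to loop over candidates and record the silhouette score for each. The sketch below runs on synthetic blobs from `make_blobs` as a stand-in for real embeddings. The helper name `silhouette_by_k` and the candidate range 2 to 6 are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_by_k(embeddings, candidate_ks, random_state=42):
    """Fit KMeans once per candidate k and return {k: silhouette score}."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(embeddings)
        scores[k] = float(silhouette_score(embeddings, labels))
    return scores


# Synthetic stand-in data; replace with the real embedding matrix.
stand_in, _ = make_blobs(
    n_samples=120, centers=3, n_features=16, cluster_std=0.5, random_state=0
)

scores = silhouette_by_k(stand_in, range(2, 7))
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("highest silhouette at k =", best_k)
```

Silhouette scores range from -1 to 1, with higher values indicating tighter, better-separated clusters. Treat the top-scoring k as a candidate to inspect, not as a final answer.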

If you do not know the number of groups in advance, algorithms that do not need it up front, such as DBSCAN or hierarchical clustering with a distance threshold, may be a better fit. The important part is that the embedding quality usually matters more than the choice between reasonable clustering algorithms.
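As a rough illustration of the DBSCAN option, the sketch below clusters synthetic, L2-normalized stand-in vectors using cosine distance. The `eps` and `min_samples` values are arbitrary assumptions and would need tuning on real embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import normalize

# Synthetic stand-in data; replace with the real embedding matrix.
points, _ = make_blobs(
    n_samples=90, centers=3, n_features=16, cluster_std=0.5, random_state=0
)
stand_in = normalize(points)

# eps is the maximum cosine distance between neighbors; both values
# here are illustrative guesses, not recommended defaults.
dbscan = DBSCAN(eps=0.05, min_samples=5, metric="cosine")
labels = dbscan.fit_predict(stand_in)

# DBSCAN assigns the label -1 to points it considers noise.
cluster_ids = set(labels.tolist()) - {-1}
n_noise = int(np.sum(labels == -1))

print("clusters found:", len(cluster_ids))
print("noise points:", n_noise)
```

Unlike KMeans, DBSCAN decides the number of clusters from the data density and can leave outliers unassigned, which is often useful for image sets that contain one-off pictures.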

Common Pitfalls

Clustering raw pixels often groups by lighting or background noise instead of real semantic similarity.
Using a pretrained model without resizing and normalizing images correctly degrades the embeddings.
Choosing n_clusters arbitrarily can produce clusters that are technically valid but not useful.
Large image sets can make feature extraction the slow step, so cache embeddings if you plan to experiment repeatedly.
Similarity is task-dependent. A model pretrained on general objects may not cluster medical images or product thumbnails the way you expect.
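The caching advice above can be sketched with NumPy's `.npy` format. The helper name `load_or_compute_embeddings` and its `extract_fn` parameter are hypothetical. In the article's setup, `extract_fn` would be the `extract_feature` function defined earlier.

```python
from pathlib import Path

import numpy as np


def load_or_compute_embeddings(cache_path, image_paths, extract_fn):
    """Load embeddings from cache_path if it exists, else compute and save.

    cache_path should end in ".npy", because np.save appends that suffix
    when it is missing. The cache is not invalidated when the image list
    changes, so delete the file after adding or removing images.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return np.load(cache_path)

    embeddings = np.vstack([extract_fn(path) for path in image_paths])
    np.save(cache_path, embeddings)
    return embeddings
```

A call such as `load_or_compute_embeddings("embeddings.npy", image_paths, extract_feature)` would run the model only on the first invocation. Later experiments with different clustering settings reuse the saved matrix.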

Summary

For image clustering, start by extracting meaningful embeddings rather than comparing raw pixels.
A pretrained CNN plus KMeans is a solid baseline in Python.
Normalize embeddings, inspect cluster outputs, and experiment with the number of clusters.
Cache features and review the results visually, because similarity quality is ultimately defined by your use case.