Image Clustering by Similarity in Python
Introduction
Image clustering means grouping images so that visually similar items end up together without manually labeling every file. In practice, the quality of the clusters depends much more on the features you extract from each image than on the clustering algorithm itself.
Extract Useful Image Features First
Clustering raw pixels rarely works well because small shifts in lighting, crop, or background can dominate the distance calculation. A better approach is to convert each image into a compact feature vector using a pretrained vision model.
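To make the problem concrete, here is a small NumPy sketch on synthetic data. The striped image and the one-pixel shift are illustrative assumptions, not taken from a real dataset. It compares the pixel distance from a striped pattern to a shifted copy of itself with the distance to a flat gray image that shares none of its structure.

```python
import numpy as np

# Synthetic 8x8 grayscale image: alternating black (0) and white (1) columns.
stripes = np.tile([0.0, 1.0], (8, 4))

# The same pattern shifted right by one pixel, as a slightly different crop might be.
shifted = np.roll(stripes, 1, axis=1)

# A flat mid-gray image with no stripes at all.
gray = np.full_like(stripes, 0.5)


def pixel_distance(a, b):
    """Euclidean distance between two images treated as flat pixel vectors."""
    return float(np.linalg.norm(a.ravel() - b.ravel()))


d_shifted = pixel_distance(stripes, shifted)  # every pixel flips by 1.0
d_gray = pixel_distance(stripes, gray)        # every pixel is off by 0.5
print(d_shifted, d_gray)
```

In this constructed case the shifted copy is twice as far from the original as the gray image, even though a person would call the two striped images nearly identical.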
A common choice is torchvision's pretrained resnet18 with the classifier head removed, so the model returns a 512-dimensional feature embedding instead of class scores.
Each image becomes a numeric representation that captures higher-level visual structure better than a color histogram alone.
Cluster the Embeddings
Once you have one feature vector per image, cluster them with a standard algorithm such as KMeans.
Normalizing embeddings is often helpful because clustering then focuses more on direction in feature space than on raw vector magnitude.
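A minimal sketch of these two steps, assuming scikit-learn is installed and the embeddings are an (n_images, n_features) array. The function name cluster_embeddings and the seed parameter are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def cluster_embeddings(embeddings, n_clusters, seed=0):
    """L2-normalize each embedding, then assign a KMeans cluster ID per row."""
    unit = normalize(np.asarray(embeddings))  # each row now has length 1
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(unit)
```

The returned array has one integer label per image, in the same order as the input rows. Fixing random_state keeps the assignment reproducible between runs.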
Organize the Results
Printing cluster IDs is useful for debugging, but real workflows usually group files by cluster so you can inspect them.
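One way to do that grouping, sketched with the standard library only. The helper names and the cluster_N folder layout are assumptions for illustration.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def group_by_cluster(paths, labels):
    """Map each cluster ID to the list of image paths assigned to it."""
    groups = defaultdict(list)
    for path, label in zip(paths, labels):
        groups[int(label)].append(Path(path))
    return dict(groups)


def copy_into_folders(groups, out_dir):
    """Copy every file into out_dir/cluster_<id>/ so clusters can be browsed."""
    out_dir = Path(out_dir)
    for label, files in groups.items():
        target = out_dir / f"cluster_{label}"
        target.mkdir(parents=True, exist_ok=True)
        for src in files:
            # Files with the same name from different folders would overwrite each other.
            shutil.copy2(src, target / src.name)
```

Copying rather than moving leaves the original files untouched, which makes it easy to rerun with a different number of clusters.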
At this stage you can review whether the model grouped images by subject, color palette, composition, or some other visual cue.
Choosing the Number of Clusters
KMeans requires a value for n_clusters, and that choice strongly affects the result. In exploratory work, try several values and inspect the outputs. You can also compute metrics such as the silhouette score, which is higher when clusters are compact and well separated, but human inspection still matters because "good" visual grouping is often application-specific.
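As a rough sketch of that search, the function below fits KMeans for several candidate values and records the silhouette score for each. The name silhouette_by_k is illustrative, and each candidate value must be at least 2 and smaller than the number of images.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize


def silhouette_by_k(embeddings, k_values, seed=0):
    """Return {k: silhouette score} for each candidate number of clusters."""
    unit = normalize(np.asarray(embeddings))
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(unit)
        scores[k] = float(silhouette_score(unit, labels))
    return scores
```

Treat the highest-scoring value as a starting point for inspection, not as a final answer.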
If you do not know the number of groups in advance, other algorithms such as DBSCAN or hierarchical clustering may be a better fit. The important part is that the embedding quality usually matters more than the choice between reasonable clustering algorithms.
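A minimal DBSCAN sketch using scikit-learn with cosine distance. The eps and min_samples defaults here are placeholder assumptions that need tuning for real embeddings. DBSCAN labels points that fit no dense region as -1, so those are excluded from the cluster count.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize


def cluster_without_k(embeddings, eps=0.05, min_samples=3):
    """Cluster by density; return (labels, number of clusters found)."""
    unit = normalize(np.asarray(embeddings))
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(unit)
    # Label -1 marks noise points, which do not form a cluster.
    n_clusters = len(set(labels.tolist()) - {-1})
    return labels, n_clusters
```

If most images come back as -1, eps is probably too small for the embedding space. If everything lands in one cluster, it is probably too large.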
Common Pitfalls
- Clustering raw pixels often groups by lighting or background noise instead of real semantic similarity.
- Using a pretrained model without resizing and normalizing images correctly degrades the embeddings.
- Choosing n_clusters arbitrarily can produce clusters that are technically valid but not useful.
- Large image sets can make feature extraction the slow step, so cache embeddings if you plan to experiment repeatedly.
- Similarity is task-dependent. A model pretrained on general objects may not cluster medical images or product thumbnails the way you expect.
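The caching advice above can be sketched with NumPy's .npy format. The helper name load_or_compute is hypothetical, and the cache path is assumed to end in .npy because np.save appends that suffix otherwise.

```python
from pathlib import Path

import numpy as np


def load_or_compute(cache_path, compute_fn):
    """Load embeddings from cache_path if present, else compute and save them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return np.load(cache_path)
    embeddings = np.asarray(compute_fn())  # the slow feature-extraction step
    np.save(cache_path, embeddings)
    return embeddings
```

This simple cache does not notice when the image set changes, so delete the file whenever you add or remove images or switch models.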
Summary
- For image clustering, start by extracting meaningful embeddings rather than comparing raw pixels.
- A pretrained CNN plus KMeans is a solid baseline in Python.
- Normalize embeddings, inspect cluster outputs, and experiment with the number of clusters.
- Cache features and review the results visually, because similarity quality is ultimately defined by your use case.

