Scikit K-means clustering performance measure
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction to K-means Clustering
K-means clustering is an unsupervised machine learning algorithm commonly used for partitioning a dataset into distinct clusters or groups. The primary aim of K-means clustering is to divide the dataset into clusters such that the variance within each cluster is minimized, while the variance between different clusters is maximized.
How K-means Works
- Initialization: The algorithm starts by selecting initial centroids randomly from the dataset.
- Assignment: Each data point in the dataset is assigned to the nearest centroid based on the Euclidean distance, forming clusters.
- Update: The centroid of each cluster is updated by calculating the mean of all data points within the cluster.
- Iteration: Steps 2 and 3 are repeated until convergence, which occurs when the centroids no longer move significantly.
Performance Measures for K-means Clustering
Evaluating the performance of a K-means clustering algorithm involves several metrics that quantify how well the clusters have been formed. Here are some common performance measures:
1. Sum of Squared Errors (SSE)
SSE is the most common metric for measuring the performance of K-means clustering. It is defined as the sum of the squared differences between each data point and its corresponding cluster centroid.
• Pros: Directly related to the variance within the clusters. • Cons: Sensitive to the number of clusters and can decrease with increasing .
2. Silhouette Score
The silhouette score measures how similar a data point is to its own cluster compared to other clusters. It ranges from -1 to 1, where a higher score indicates better-defined clusters.
• Here, is the average distance of point to all other points in its cluster, and is the average distance to the nearest cluster.
3. Davies-Bouldin Index
The Davies-Bouldin index is a measure of how well the clusters are separated. It is based on the ratio of the within-cluster scatter and between-cluster separation.
• A lower Davies-Bouldin index indicates better clustering.
4. Calinski-Harabasz Index
This index, also known as the Variance Ratio Criterion, calculates the ratio of the sum of between-cluster dispersion and within-cluster dispersion.
• A higher Calinski-Harabasz index indicates better cluster separation.
Factors Affecting K-means Performance
While using the above metrics, several factors can influence the performance of K-means clustering:
• Choice of : Selecting the correct number of clusters is crucial. Methods such as the Elbow Method, Silhouette Analysis, or Cross-validation can be used to identify the optimal .
• Initialization of Centroids: Different initializations can lead to different results. The K-means++ algorithm provides a smarter way of choosing initial centroids to improve performance.
• Distance Metric: While Euclidean distance is commonly used, other distance metrics (e.g., Manhattan, Cosine) may be more suitable depending on the dataset.
• Feature Scaling: The performance of K-means can be significantly affected by the scale of the features. Therefore, it is essential to standardize or normalize the dataset before clustering.
Example
Let's consider a simple example using Scikit-learn to apply K-means clustering on the Iris dataset:

