Scikit-learn
K-means
clustering
performance measurement
machine learning

Scikit K-means clustering performance measure

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction to K-means Clustering

K-means clustering is an unsupervised machine learning algorithm commonly used for partitioning a dataset into distinct clusters or groups. The primary aim of K-means clustering is to divide the dataset into KK clusters such that the variance within each cluster is minimized, while the variance between different clusters is maximized.

How K-means Works

  1. Initialization: The algorithm starts by selecting KK initial centroids randomly from the dataset.
  2. Assignment: Each data point in the dataset is assigned to the nearest centroid based on the Euclidean distance, forming KK clusters.
  3. Update: The centroid of each cluster is updated by calculating the mean of all data points within the cluster.
  4. Iteration: Steps 2 and 3 are repeated until convergence, which occurs when the centroids no longer move significantly.

Performance Measures for K-means Clustering

Evaluating the performance of a K-means clustering algorithm involves several metrics that quantify how well the clusters have been formed. Here are some common performance measures:

1. Sum of Squared Errors (SSE)

SSE is the most common metric for measuring the performance of K-means clustering. It is defined as the sum of the squared differences between each data point and its corresponding cluster centroid.

SSE=_i=1K_xC_ixμ_i2SSE = \sum\_{i=1}^{K} \sum\_{x \in C\_i} || x - \mu\_i ||^2

Pros: Directly related to the variance within the clusters. • Cons: Sensitive to the number of clusters and can decrease with increasing KK.

2. Silhouette Score

The silhouette score measures how similar a data point is to its own cluster compared to other clusters. It ranges from -1 to 1, where a higher score indicates better-defined clusters.

s(i)=b(i)a(i)max(a(i),b(i))s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

• Here, a(i)a(i) is the average distance of point ii to all other points in its cluster, and b(i)b(i) is the average distance to the nearest cluster.

3. Davies-Bouldin Index

The Davies-Bouldin index is a measure of how well the clusters are separated. It is based on the ratio of the within-cluster scatter and between-cluster separation.

DB=1K_i=1Kmax_ji(s(C_i)+s(C_j)d(C_i,C_j))DB = \frac{1}{K} \sum\_{i=1}^{K} \max\_{j \neq i} \left( \frac{s(C\_i) + s(C\_j)}{d(C\_i, C\_j)} \right)

• A lower Davies-Bouldin index indicates better clustering.

4. Calinski-Harabasz Index

This index, also known as the Variance Ratio Criterion, calculates the ratio of the sum of between-cluster dispersion and within-cluster dispersion.

CH=trace(B_K)trace(W_K)nKK1CH = \frac{\text{trace}(B\_K)}{\text{trace}(W\_K)} \cdot \frac{n-K}{K-1}

• A higher Calinski-Harabasz index indicates better cluster separation.

Factors Affecting K-means Performance

While using the above metrics, several factors can influence the performance of K-means clustering:

Choice of KK: Selecting the correct number of clusters is crucial. Methods such as the Elbow Method, Silhouette Analysis, or Cross-validation can be used to identify the optimal KK.

Initialization of Centroids: Different initializations can lead to different results. The K-means++ algorithm provides a smarter way of choosing initial centroids to improve performance.

Distance Metric: While Euclidean distance is commonly used, other distance metrics (e.g., Manhattan, Cosine) may be more suitable depending on the dataset.

Feature Scaling: The performance of K-means can be significantly affected by the scale of the features. Therefore, it is essential to standardize or normalize the dataset before clustering.

Example

Let's consider a simple example using Scikit-learn to apply K-means clustering on the Iris dataset:


Course illustration
Course illustration