Scikit K-means clustering performance measure

Scikit-learn

K-means

clustering

performance measurement

machine learning

Scikit K-means clustering performance measure

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction to K-means Clustering

K-means clustering is an unsupervised machine learning algorithm commonly used for partitioning a dataset into distinct clusters or groups. The primary aim of K-means clustering is to divide the dataset into $K$ clusters such that the variance within each cluster is minimized, while the variance between different clusters is maximized.

How K-means Works

Initialization: The algorithm starts by selecting $K$ initial centroids randomly from the dataset.
Assignment: Each data point in the dataset is assigned to the nearest centroid based on the Euclidean distance, forming $K$ clusters.
Update: The centroid of each cluster is updated by calculating the mean of all data points within the cluster.
Iteration: Steps 2 and 3 are repeated until convergence, which occurs when the centroids no longer move significantly.

Performance Measures for K-means Clustering

Evaluating the performance of a K-means clustering algorithm involves several metrics that quantify how well the clusters have been formed. Here are some common performance measures:

1. Sum of Squared Errors (SSE)

SSE is the most common metric for measuring the performance of K-means clustering. It is defined as the sum of the squared differences between each data point and its corresponding cluster centroid.

$SSE = \sum\_{i=1}^{K} \sum\_{x \in C\_i} || x - \mu\_i ||^2$

• Pros: Directly related to the variance within the clusters. • Cons: Sensitive to the number of clusters and can decrease with increasing $K$ .

2. Silhouette Score

The silhouette score measures how similar a data point is to its own cluster compared to other clusters. It ranges from -1 to 1, where a higher score indicates better-defined clusters.

$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$

• Here, $a(i)$ is the average distance of point $i$ to all other points in its cluster, and $b(i)$ is the average distance to the nearest cluster.

3. Davies-Bouldin Index

The Davies-Bouldin index is a measure of how well the clusters are separated. It is based on the ratio of the within-cluster scatter and between-cluster separation.

$DB = \frac{1}{K} \sum\_{i=1}^{K} \max\_{j \neq i} \left( \frac{s(C\_i) + s(C\_j)}{d(C\_i, C\_j)} \right)$

• A lower Davies-Bouldin index indicates better clustering.

4. Calinski-Harabasz Index

This index, also known as the Variance Ratio Criterion, calculates the ratio of the sum of between-cluster dispersion and within-cluster dispersion.

$CH = \frac{\text{trace}(B\_K)}{\text{trace}(W\_K)} \cdot \frac{n-K}{K-1}$

• A higher Calinski-Harabasz index indicates better cluster separation.

Factors Affecting K-means Performance

While using the above metrics, several factors can influence the performance of K-means clustering:

• Choice of $K$ : Selecting the correct number of clusters is crucial. Methods such as the Elbow Method, Silhouette Analysis, or Cross-validation can be used to identify the optimal $K$ .

• Initialization of Centroids: Different initializations can lead to different results. The K-means++ algorithm provides a smarter way of choosing initial centroids to improve performance.

• Distance Metric: While Euclidean distance is commonly used, other distance metrics (e.g., Manhattan, Cosine) may be more suitable depending on the dataset.

• Feature Scaling: The performance of K-means can be significantly affected by the scale of the features. Therefore, it is essential to standardize or normalize the dataset before clustering.

Example

Let's consider a simple example using Scikit-learn to apply K-means clustering on the Iris dataset: