Scikit Learn - K-Means - Elbow - criterion

K-Means

Scikit Learn

Elbow Method

Clustering

Machine Learning

Scikit Learn - K-Means - Elbow - criterion

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

In cluster analysis, the K-Means algorithm is one of the most popular methods due to its simplicity and efficiency. Scikit-learn, a widely-used machine learning library in Python, offers a robust implementation of K-Means clustering which supports various optimization and initialization techniques. A vital part of using K-Means effectively is selecting the right number of clusters, denoted as K. The "Elbow Method" is a heuristic used to determine the most appropriate number of clusters in a data set.

K-Means Clustering

K-Means is a partitioning method that divides a dataset into K distinct, non-overlapping subgroups known as clusters. The algorithm works as follows:

Initialization: Choose K initial centroids, which can be selected randomly or using methods like K-Means++.
Assignment: Assign each data point to the nearest centroid.
Update: Calculate the mean of all data points assigned to each centroid and update the centroid's position.
Repeat the assign-update steps until convergence, which occurs when there is no change in the assignment of data points or the centroids' positions.

The goal of K-Means is to minimize the within-cluster sum of squares (WCSS), also known as inertia. It measures the compactness of clusters, where a lower value means more compact clusters.

The Elbow Method

The Elbow Method is a graphical approach for determining the optimal number of clusters by observing the WCSS for various values of K. The steps for employing the Elbow Method are:

**Compute the K-Means clustering for different values of K (e.g., from 1 to 10) and calculate the WCSS for each *K*.
*Plot the WCSS values against the number of clusters K.
Identify the "elbow" point in the plot, which is the point where the rate of decrease sharply changes.

The “elbow” resembles the bending of a human arm. Before the elbow point, adding another cluster results in a significant reduction in WCSS, but after the elbow, the reduction is minimal. This suggests that the optimal K is at the elbow point.

Technical Explanation

Calculating WCSS

The WCSS can be mathematically expressed as:

$WCSS = \sum_{i=1}^{K} \sum_{x \in C_i} (x - \mu_i)^2$ where $x$ is a data point, $C_i$ is the i-th cluster, and $\mu_i$ is the mean of the i-th cluster. It calculates the sum of squared distances between data points and their corresponding cluster centroids for all clusters.

Limitations of the Elbow Method

Subjective: Identifying the elbow point can be subjective, especially when the graph does not have a clear bend.
Multi-Dimensionality: In high dimensional data, the visual elbow may not be apparent.
Oversimplification: The method assumes centroid distance is the sole metric for cluster quality, possibly ignoring the shape or connectivity of the clusters.

Example of K-Means with the Elbow Criterion in Scikit-Learn

Here's a Python example using Scikit-learn to implement K-Means and the Elbow Method:

python

1import numpy as np
2import matplotlib.pyplot as plt
3from sklearn.cluster import KMeans
4from sklearn.datasets import make_blobs
5
6# Generate synthetic data
7X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
8
9# Elbow Method
10wcss = []
11for i in range(1, 11):
12    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
13    kmeans.fit(X)
14    wcss.append(kmeans.inertia_)
15
16# Plot the Elbow Graph
17plt.plot(range(1, 11), wcss)
18plt.title('Elbow Method')
19plt.xlabel('Number of clusters')
20plt.ylabel('WCSS')
21plt.show()

Summary Table

Here’s a summary table of K-Means and the Elbow Method:

Aspect	Description
Initialization	Randomly choose initial centroids or use K-Means++
Distance Measure	Euclidean distance is typically used
Goal	Minimize WCSS for compact clusters
Elbow Method	Determine optimal K by finding the "elbow" in WCSS plot
Convergence	When centroids no longer move or assignments do not change
Computational Complexity	$O(n \times k \times i \times d)$ where $n$ =samples, $k$ =clusters, $i$ =iterations, $d$ =dimensions
Advantages	Simple, scales to a large number of samples
Disadvantages	Requires pre-specified $K$ , sensitive to initial centroid positions, assumes spherical clusters