K-Means Algorithm

Introduction

K-Means is one of the most popular unsupervised learning algorithms used for clustering. Clustering involves grouping a set of objects in such a way that objects in the same group (referred to as a cluster) are more similar to each other than to those in other groups. The K-Means algorithm seeks to partition a given dataset into k clusters, where each cluster is represented by its mean value, known as the cluster centroid.

Technical Explanation

Algorithm Overview

The K-Means algorithm follows a straightforward iterative process, outlined in the steps below:

Initialization:
  • Choose the number of clusters k.
  • Randomly select k centroids from the data points as initial cluster centers.
Assignment Step:
  • Assign each data point to the nearest centroid using a suitable distance metric, usually Euclidean distance.
  • For each point $x_i$ and centroid $μ_j$, calculate the distance $d(x_i, μ_j)$.
  • Assign $x_i$ to the closest centroid.
Update Step:
  • Recalculate the centroids as the mean of the points assigned to each cluster.
  • $μ_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$, where $C_j$ is the set of points in cluster j.
Convergence:
  • Repeat the Assignment and Update steps until the centroids no longer change significantly or a pre-defined number of iterations is reached.
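
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name `kmeans`, the `max_iter`, `tol`, and `seed` parameters, and the decision to leave an empty cluster's centroid where it was are all choices made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Initialization: pick k data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)

    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]

        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])

        # Convergence: stop once no centroid moves by more than `tol`.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break

    return centroids, labels
```

Calling `kmeans(data, k=2)` returns the final centroids together with one cluster index per input point. Because the starting centroids are random, different `seed` values can lead to different partitions on less clearly separated data.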

Distance Metric

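The outcome of K-Means depends on the choice of distance metric. Standard K-Means uses Euclidean distance, and the mean-based update step is only guaranteed to reduce the clustering cost for squared Euclidean distance, so other metrics generally call for a different update rule. The Euclidean distance between two data points $x = (x_1, x_2, ..., x_n)$ and $y = (y_1, y_2, ..., y_n)$ is computed as follows: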

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
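
The formula translates directly into code. The helper below is a small sketch; the name `euclidean_distance` is chosen for this example, and on Python 3.8 or later the standard library's `math.dist` computes the same quantity.

```python
import math


def euclidean_distance(x, y):
    """Return the Euclidean distance between two equal-length coordinate sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Square the per-coordinate differences, sum them, and take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# The points (1, 2) and (5, 2) differ only in their first coordinate.
print(euclidean_distance((1, 2), (5, 2)))  # 4.0
```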

Example

Consider a dataset with the following coordinates:

Point ID	Coordinate (x, y)
A	(1, 2)
B	(1, 4)
C	(3, 4)
D	(5, 2)
E	(5, 4)
F	(6, 4)

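Suppose we want to cluster these into k = 2 clusters, with A and D as the initial centroids. The steps would proceed as follows:

• Iteration 1:
  • Assign points to nearest centroids: A and B are closest to centroid A, while D, E, and F are closest to centroid D. Point C is equidistant from both ($\sqrt{8} \approx 2.83$), so the tie has to be broken by convention. Assigning it to the first centroid gives the clusters {A, B, C} and {D, E, F}.
  • Recalculate new centroids: $μ_1 = (5/3, 10/3) \approx (1.67, 3.33)$ and $μ_2 = (16/3, 10/3) \approx (5.33, 3.33)$.
• Iteration 2: Every point is still closest to the centroid of its current cluster, so no assignment changes and the algorithm stops.

In general, the clusters stabilize when no points switch clusters or the centroids stop changing significantly.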
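
The example can be traced step by step with a short script. This is a sketch under two assumptions that the text does not specify: ties between centroids go to the lower-indexed centroid, and no cluster ever becomes empty (true for this data). The helper names `assign` and `update` are made up for the illustration.

```python
import math

points = {"A": (1, 2), "B": (1, 4), "C": (3, 4), "D": (5, 2), "E": (5, 4), "F": (6, 4)}
centroids = [points["A"], points["D"]]  # initial centroids from the example


def assign(points, centroids):
    """Map each point ID to the index of its nearest centroid (ties go to the lower index)."""
    return {
        pid: min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for pid, p in points.items()
    }


def update(points, labels, k):
    """Return the mean of the points assigned to each cluster (assumes none is empty)."""
    new_centroids = []
    for j in range(k):
        members = [points[pid] for pid, lab in labels.items() if lab == j]
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


labels = None
for iteration in range(1, 11):
    new_labels = assign(points, centroids)
    if new_labels == labels:
        break  # no point switched clusters, so the result is stable
    labels = new_labels
    centroids = update(points, labels, k=2)
    print(iteration, labels, centroids)
```

With this tie-breaking rule the script settles on the clusters {A, B, C} and {D, E, F}, with centroids at roughly (1.67, 3.33) and (5.33, 3.33). A different tie-breaking rule for point C could produce a different first iteration.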

Key Considerations

• **Choosing k**: The number of clusters k is user-defined and can significantly influence the results. The Elbow Method is frequently used to select k: plot the within-cluster sum of squares (equivalently, the variance left unexplained) against the number of clusters, and pick the k beyond which adding more clusters yields only small improvements.

• Scalability: Each iteration takes time linear in the number of samples, clusters, and dimensions, which makes K-Means practical for large datasets.

• Limitations:
  • K-Means assumes clusters are spherical and equal in size.
  • It is sensitive to outliers and noise.
  • It requires pre-specification of k, which may not be trivial for certain datasets.
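
The quantity usually plotted for the Elbow Method is the within-cluster sum of squares (WCSS). The sketch below computes it for a given partition, using the six points from the example above. The function name `wcss` and the two-cluster split are assumptions chosen for illustration; in practice the partitions would come from running K-Means once for each candidate k.

```python
def wcss(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        # Centroid of this cluster: coordinate-wise mean of its points.
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(
            (xi - ci) ** 2 for p in members for xi, ci in zip(p, centroid)
        )
    return total


A, B, C, D, E, F = (1, 2), (1, 4), (3, 4), (5, 2), (5, 4), (6, 4)

one_cluster = wcss([[A, B, C, D, E, F]])      # k = 1: everything in one group
two_clusters = wcss([[A, B, C], [D, E, F]])   # k = 2: hypothetical split
print(round(one_cluster, 2), round(two_clusters, 2))  # 28.83 8.67
```

The sharp drop from k = 1 to k = 2 is the kind of change the Elbow Method looks for. WCSS never increases as k grows, so the useful signal is where the curve flattens, not where it reaches its minimum.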

Summary Table

Feature	Description
Type	Unsupervised learning (Clustering)
Complexity	Typically $O(n \cdot k \cdot i \cdot d)$, where n = samples, k = clusters, i = iterations, d = dimensions
Distance Metric	Commonly Euclidean Distance
Initial Choice	Random initial centroids
Output	A partition of data points into `k` clusters
Strengths	Simple, fast, and efficient for large datasets
Weaknesses	Sensitive to outliers and noise; assumes equal-sized spherical clusters
Scalability	Suitable for large datasets due to efficient updates

Variants and Extensions

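• K-Means++ Initialization: An enhancement over random initialization that picks the first centroid at random and each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. Spreading the initial centroids apart in this way tends to improve both convergence speed and the quality of the final clustering.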

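• Hierarchical K-Means: Combines hierarchical clustering methods with K-Means, typically by applying K-Means recursively to build a tree of clusters, which reduces the dependence on a single fixed choice of k.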

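• MiniBatch K-Means: Suitable for large-scale datasets. It updates the centroids from small random batches of the data instead of the full dataset at each iteration, trading a small loss in cluster quality for much lower computation time.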
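
As an illustration of the first variant, the seeding idea behind K-Means++ can be sketched in a few lines of plain Python. This is a simplified sketch, not a library's implementation: the function name `kmeans_pp_init` and the `seed` parameter are assumptions, and the code assumes the data contains at least k distinct points.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids using K-Means++ style seeding."""
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Next centroid: sampled with probability proportional to that weight,
        # so distant points are favoured and already-chosen points get weight 0.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids
```

The returned list can replace the purely random starting centroids in an ordinary K-Means loop. The rest of the algorithm is unchanged.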

Conclusion

K-Means is an efficient and widely used clustering algorithm, valued for its simplicity and scalability. Though it has limitations, its variants and careful parameter selection often lead to successful data partitioning in practical applications.