K-Means Algorithm
Introduction
K-Means is one of the most popular unsupervised learning algorithms used for clustering. Clustering involves grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups. The K-Means algorithm seeks to partition a given dataset into k clusters, where each cluster is represented by the mean of its points, known as the cluster centroid.
Technical Explanation
Algorithm Overview
The K-Means algorithm operates through a straightforward iterative process. Here is an outline of the steps involved in the K-Means algorithm:
- Initialization:
  • Choose the number of clusters k.
  • Randomly select k data points as the initial cluster centers (centroids).
- Assignment Step:
  • Assign each data point to the nearest centroid using a suitable distance metric, usually Euclidean distance.
  • For each point $x_i$ and centroid $\mu_j$, calculate the distance $\lVert x_i - \mu_j \rVert_2$.
  • Assign $x_i$ to the closest centroid.
- Update Step:
  • Recalculate each centroid as the mean of the points assigned to its cluster:
    $$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$
    where $C_j$ is the set of points in cluster $j$.
- Convergence:
  • Repeat the Assignment and Update steps until the centroids no longer change significantly or a pre-defined number of iterations is reached.
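The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, its parameters (`max_iter`, `tol`, `seed`), the toy data, and the choice to keep the old centroid when a cluster ends up empty are all assumptions made for the example. It needs Python 3.8+ for `math.dist`.

```python
import math
import random

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Plain K-Means on a list of equal-length tuples.

    Returns (centroids, labels). Names and defaults are illustrative.
    """
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: mean of the points assigned to each cluster.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Convergence: stop once no centroid moved by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels

# Toy data (made up for this sketch): two well-separated groups of three points.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(toy, k=2)
print(centroids, labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.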
Distance Metric
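The performance and outcome of K-Means can depend on the choice of distance metric. Standard K-Means uses Euclidean distance, computed as follows between two data points $p = (p_1, \dots, p_d)$ and $q = (q_1, \dots, q_d)$:

$$\lVert p - q \rVert_2 = \sqrt{\sum_{m=1}^{d} (p_m - q_m)^2}$$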
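As a small illustration, Euclidean distance can be written directly from its definition (the square root of the summed squared coordinate differences). The helper name `euclidean` is hypothetical, and the two points used are A and D from the example that follows.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A = (1, 2) and D = (5, 2) differ only in x, by 4 units.
print(euclidean((1, 2), (5, 2)))
```

On Python 3.8 and later, the standard library's `math.dist(p, q)` computes the same quantity.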
Example
Consider a dataset with the following coordinates:
| Point ID | Coordinate (x, y) |
| --- | --- |
| A | (1, 2) |
| B | (1, 4) |
| C | (3, 4) |
| D | (5, 2) |
| E | (5, 4) |
| F | (6, 4) |
Suppose we want to cluster these into k=2 clusters. Initial centroids could be A and D. The steps would proceed as follows:
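• Iteration 1:
  • Assign points to nearest centroids: A and B are closest to the centroid at A; D, E, and F are closest to the centroid at D. C is equidistant from both (distance √8 ≈ 2.83), so the tie must be broken; here it goes to the first centroid, giving clusters {A, B, C} and {D, E, F}.
  • Recalculate new centroids: (5/3, 10/3) ≈ (1.67, 3.33) and (16/3, 10/3) ≈ (5.33, 3.33).
• Iteration 2:
  • Assign points to nearest centroids: no point changes cluster, so the centroids stay the same and the algorithm stops.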
By repeatedly applying the steps, clusters stabilize when no points switch clusters or centroids stop changing significantly.
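The example can be traced with a short script. This is a sketch under stated assumptions: the helper names `assign` and `update` are invented for illustration, ties are broken toward the lower centroid index (this matters for point C, which is equidistant from A and D), and the loop cap of 10 iterations is arbitrary.

```python
import math

points = {"A": (1, 2), "B": (1, 4), "C": (3, 4),
          "D": (5, 2), "E": (5, 4), "F": (6, 4)}
centroids = [points["A"], points["D"]]  # initial centroids from the example

def assign(points, centroids):
    """Map each point ID to the index of its nearest centroid.

    Assumption: ties go to the lower centroid index.
    """
    return {
        name: min(range(len(centroids)),
                  key=lambda j: math.dist(p, centroids[j]))
        for name, p in points.items()
    }

def update(points, labels, k):
    """Mean of the points in each cluster; assumes no cluster is empty."""
    new = []
    for j in range(k):
        members = [points[n] for n, lab in labels.items() if lab == j]
        new.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new

labels = None
for iteration in range(1, 11):
    new_labels = assign(points, centroids)
    if new_labels == labels:  # no point switched clusters: converged
        break
    labels = new_labels
    centroids = update(points, labels, k=2)
    print(iteration, labels, centroids)
```

With this tie-breaking rule the run settles after one update on the clusters {A, B, C} and {D, E, F}, with centroids near (1.67, 3.33) and (5.33, 3.33). A different tie-breaking rule for C can lead to a different final partition.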
Key Considerations
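• **Choosing k**: The number of clusters k is user-defined and can significantly influence the results. The Elbow Method is frequently used to select k: plot the within-cluster sum of squares (or, equivalently, the fraction of variance explained) against the number of clusters and pick the value of k where the curve bends and further increases bring little improvement.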
• Scalability: The cost of each iteration grows linearly with the number of samples, clusters, and dimensions, which makes K-Means practical for large datasets.
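• Limitations:
  • K-Means works best when clusters are roughly spherical and similar in size; it can split or merge clusters that violate this assumption.
  • It is sensitive to outliers and noise, because a centroid is a mean and a single extreme point can pull it away from the bulk of its cluster.
  • It requires k to be specified in advance, which may not be trivial for certain datasets.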
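To make the Elbow Method from the first bullet concrete, the sketch below runs K-Means for several values of k and records the within-cluster sum of squares (WCSS) for each. Everything here is illustrative: the names `kmeans`, `wcss`, and `elbow_curve`, the use of 10 random restarts per k, and the choice of WCSS as the plotted quantity are assumptions, and the data are the six points from the example above.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Compact K-Means; returns the final centroids."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # An empty cluster keeps its old centroid (an assumption of this sketch).
        new = [
            tuple(sum(c) / len(m) for c in zip(*m)) if m else centroids[j]
            for j, m in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids

def wcss(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

def elbow_curve(points, k_values, restarts=10, seed=0):
    """Best WCSS over several random restarts, for each k."""
    rng = random.Random(seed)
    return {
        k: min(wcss(points, kmeans(points, k, rng)) for _ in range(restarts))
        for k in k_values
    }

data = [(1, 2), (1, 4), (3, 4), (5, 2), (5, 4), (6, 4)]
curve = elbow_curve(data, range(1, 5))
for k, value in curve.items():
    print(k, round(value, 3))
```

Plotting `curve` against k and looking for the point where the drop flattens gives the "elbow". On a dataset this small the bend is only suggestive, so the result is a starting point for choosing k rather than a definitive answer.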
Table of Summary
| Feature | Description |
| --- | --- |
| Type | Unsupervised learning (Clustering) |
| Complexity | Usually $O(n \cdot k \cdot i \cdot d)$, where n = samples, k = clusters, i = iterations, d = dimensions |
| Distance Metric | Commonly Euclidean Distance |
| Initial Choice | Random initial centroids |
| Output | A partition of data points into k clusters |
| Strengths | Simple, fast, and efficient for large datasets |
| Weaknesses | Sensitive to outliers; assumes roughly spherical clusters of similar size |
| Scalability | Suitable for large datasets due to efficient updates |
Variants and Extensions
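• K-Means++ Initialization: An enhancement over purely random initialization. Each new initial centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, so the starting centroids tend to be far apart from each other, which typically improves convergence speed.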
• Hierarchical K-Means: Combines hierarchical clustering methods with K-Means to avoid some of its drawbacks.
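• MiniBatch K-Means: Updates the centroids from small random subsets (mini-batches) of the data at each step instead of the full dataset, reducing computation time on large-scale datasets, usually at the cost of slightly lower clustering quality.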
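As an illustration of the first variant, K-Means++ seeding can be sketched as follows. The function name `kmeans_pp_init` and its `seed` parameter are hypothetical, and the sketch assumes the data contain at least k distinct points (otherwise all sampling weights become zero and `random.choices` raises an error).

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k starting centroids by K-Means++ style squared-distance sampling."""
    rng = random.Random(seed)
    # The first centroid is chosen uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Draw the next centroid with probability proportional to d2;
        # already-chosen points have weight 0 and cannot be drawn again.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

data = [(1, 2), (1, 4), (3, 4), (5, 2), (5, 4), (6, 4)]
init = kmeans_pp_init(data, k=3)
print(init)
```

The returned points would replace the purely random selection in the Initialization step; the Assignment and Update steps are unchanged.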
Conclusion
K-Means is an efficient and widely used clustering algorithm, characterized by its simplicity and scalability. Though it has limitations, various adaptations and careful parameter selection often lead to successful data partitioning in practical applications.

