
K-Means Algorithm


Introduction

K-Means is one of the most popular unsupervised learning algorithms used for clustering. Clustering involves grouping a set of objects in such a way that objects in the same group (referred to as a cluster) are more similar to each other than to those in other groups. The K-Means algorithm seeks to partition a given dataset into k clusters, where each cluster is represented by its mean value, known as the cluster centroid.

Technical Explanation

Algorithm Overview

The K-Means algorithm is a straightforward iterative process. Its steps are:

  1. Initialization:
     • Choose the number of clusters k.
     • Randomly select k centroids from the data points as initial cluster centers.
  2. Assignment Step:
     • For each point x_i and centroid μ_j, calculate the distance d(x_i, μ_j) using a suitable distance metric, usually Euclidean distance.
     • Assign x_i to the closest centroid.
  3. Update Step:
     • Recalculate the centroids as the mean of the points assigned to each cluster: $\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$, where C_j is the set of points in cluster j.
  4. Convergence:
     • Repeat the Assignment and Update steps until the centroids no longer change significantly or a pre-defined number of iterations is reached.
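The four steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names `kmeans` and `nearest_index`, the parameters `max_iters`, `tol`, and `seed`, and the rule that an empty cluster keeps its previous centroid are all assumptions made for the example.

```python
import math
import random


def nearest_index(p, centroids):
    """Index of the centroid closest to point p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))


def kmeans(points, k, max_iters=100, tol=1e-9, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # 1. Initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)

    for _ in range(max_iters):
        # 2. Assignment step: nearest centroid for every point.
        labels = [nearest_index(p, centroids) for p in points]

        # 3. Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
            else:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )

        # 4. Convergence: stop once no centroid moves more than `tol`.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break

    # Final labels are computed against the centroids that are returned.
    labels = [nearest_index(p, centroids) for p in points]
    return centroids, labels


data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Because the starting centroids are random, different seeds can lead to different final clusters, which is why practical implementations usually run several restarts.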

Distance Metric

The performance and outcome of K-Means can depend on the choice of distance metric. The standard K-Means uses Euclidean distance, computed as follows between two data points $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
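As a small illustration, the formula translates directly into code. The function name `euclidean_distance` is an assumption made for this sketch; the standard library's `math.dist` (Python 3.8+) computes the same quantity.

```python
import math


def euclidean_distance(x, y):
    """Return sqrt(sum_i (x_i - y_i)^2) for two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Differences are (3, 4), so the distance is sqrt(9 + 16).
print(euclidean_distance((1, 2), (4, 6)))  # 5.0
# The built-in equivalent gives the same result.
print(math.dist((1, 2), (4, 6)))  # 5.0
```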

Example

Consider a dataset with the following coordinates:

| Point ID | Coordinate (x, y) |
|----------|-------------------|
| A        | (1, 2)            |
| B        | (1, 4)            |
| C        | (3, 4)            |
| D        | (5, 2)            |
| E        | (5, 4)            |
| F        | (6, 4)            |

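Suppose we want to cluster these into k = 2 clusters. Initial centroids could be A and D. The steps would proceed as follows:

Iteration 1:
• Assign points to nearest centroids: A and B are closest to A(1, 2), while D, E, and F are closest to D(5, 2). C is equidistant from both (distance √8 ≈ 2.83 to each), so the tie has to be broken by convention; assigning it to the first centroid gives clusters {A, B, C} and {D, E, F}.
• Recalculate new centroids: (5/3, 10/3) ≈ (1.67, 3.33) and (16/3, 10/3) ≈ (5.33, 3.33).

Iteration 2:
• Reassigning the points to the new centroids changes no memberships, so the algorithm has converged.

In general, the steps are repeated until the clusters stabilize, that is, until no point switches cluster or the centroids stop changing significantly.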
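The example can be traced with a short script. This is a sketch under two assumptions that the text does not fix: the helper names `assign` and `update` are made up for illustration, and a point that is equidistant from both centroids (as C is from A and D) goes to the lower-indexed centroid.

```python
import math

points = {"A": (1, 2), "B": (1, 4), "C": (3, 4),
          "D": (5, 2), "E": (5, 4), "F": (6, 4)}
centroids = [points["A"], points["D"]]  # initial centroids from the text


def assign(points, centroids):
    """Map each point ID to the index of its nearest centroid.

    Assumption: ties go to the lower-indexed centroid (min keeps the first).
    """
    return {
        name: min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for name, p in points.items()
    }


def update(points, labels, k):
    """Return the mean of the points assigned to each of the k clusters.

    Assumes every cluster has at least one member, which holds for this data.
    """
    new_centroids = []
    for j in range(k):
        members = [points[n] for n, lab in labels.items() if lab == j]
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


for iteration in (1, 2):
    labels = assign(points, centroids)
    centroids = update(points, labels, k=2)
    rounded = [tuple(round(c, 2) for c in mu) for mu in centroids]
    print(iteration, labels, rounded)
```

Both iterations report the clusters {A, B, C} and {D, E, F} with centroids of roughly (1.67, 3.33) and (5.33, 3.33), so nothing changes in the second pass. Breaking the tie for C the other way would lead to a different, equally stable partition, which illustrates how sensitive K-Means can be to small choices.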

Key Considerations

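• Choosing k: The number of clusters k is user-defined and can significantly influence the results. The Elbow Method is frequently used to select k: plot the within-cluster sum of squared distances (or, equivalently, the fraction of variance explained) against the number of clusters and choose the point where the curve bends and further clusters bring little improvement.

• Scalability: Each iteration takes time linear in the number of samples, clusters, and dimensions, which makes K-Means suitable for large datasets.

• Limitations:
  • K-Means assumes clusters are roughly spherical and similar in size.
  • It is sensitive to outliers and noise, because a mean is pulled toward extreme values.
  • It requires k to be specified in advance, which may not be trivial for certain datasets.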
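To make the Elbow Method concrete, the sketch below computes the within-cluster sum of squares (WCSS) for several values of k on the six example points. The function name `kmeans_wcss`, the use of ten random restarts, and the rule that an empty cluster keeps its centroid are assumptions for illustration only.

```python
import math
import random


def kmeans_wcss(points, k, rng, max_iters=100):
    """Run one K-Means fit and return its within-cluster sum of squares."""
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: mean of each cluster.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members
            else centroids[j]  # assumption: empty clusters keep their centroid
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Sum of squared distances from each point to its cluster's centroid.
    return sum(
        math.dist(p, centroids[j]) ** 2
        for j, members in enumerate(clusters)
        for p in members
    )


points = [(1, 2), (1, 4), (3, 4), (5, 2), (5, 4), (6, 4)]
rng = random.Random(0)
wcss_by_k = {}
for k in range(1, 7):
    # Keep the best of several random restarts to soften bad initial picks.
    wcss_by_k[k] = min(kmeans_wcss(points, k, rng) for _ in range(10))
    print(k, round(wcss_by_k[k], 2))
```

Plotting these values against k and looking for the point where the drop flattens out gives the "elbow". On a dataset this small the curve is only illustrative; WCSS necessarily reaches zero once k equals the number of points.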

Summary Table

| Feature         | Description |
|-----------------|-------------|
| Type            | Unsupervised learning (clustering) |
| Complexity      | Usually $O(n \cdot k \cdot i \cdot d)$, where n = samples, k = clusters, i = iterations, d = dimensions |
| Distance Metric | Commonly Euclidean distance |
| Initial Choice  | Random initial centroids |
| Output          | A partition of the data points into k clusters |
| Strengths       | Simple, fast, and efficient for large datasets |
| Weaknesses      | Sensitive to outliers; assumes spherical, equal-sized clusters |
| Scalability     | Suitable for large datasets due to efficient updates |

Variants and Extensions

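K-Means++ Initialization: An enhancement over random initialization that spreads the initial centroids out. After the first centroid is chosen at random, each further centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, which tends to improve both convergence speed and the quality of the final clusters.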
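A seeding routine in the style of K-Means++ can be sketched as follows. The function name `kmeans_pp_init` is hypothetical, and the sketch assumes the data contains at least k distinct points (otherwise all sampling weights become zero and `random.choices` raises an error).

```python
import math
import random


def kmeans_pp_init(points, k, rng):
    """Pick k initial centroids using K-Means++ style seeding."""
    # First centroid: chosen uniformly at random from the data.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to d2.
        # Far-away points are favoured; already-chosen points have weight 0.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


points = [(1, 2), (1, 4), (3, 4), (5, 2), (5, 4), (6, 4)]
print(kmeans_pp_init(points, k=2, rng=random.Random(0)))
```

The returned centroids would then replace the purely random picks in the Initialization step; the Assignment and Update steps stay unchanged.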

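Hierarchical K-Means: Combines hierarchical clustering with K-Means, for example by applying K-Means recursively to split clusters into sub-clusters, to avoid some of the drawbacks of a single flat partition.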

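MiniBatch K-Means: Updates the centroids from small random subsets of the data (mini-batches) instead of the full dataset at every step. This reduces computation time on large-scale datasets, usually at the cost of slightly less accurate clusters.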
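The idea can be sketched with a per-centroid learning rate that shrinks as a centroid absorbs more points, which is one common formulation of mini-batch updates. The function name `minibatch_kmeans` and the parameters `batch_size`, `n_steps`, and `seed` are assumptions for this illustration, not the API of any particular library.

```python
import math
import random


def minibatch_kmeans(points, k, batch_size=3, n_steps=50, seed=0):
    """Approximate K-Means centroids using small random batches."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # how many points each centroid has absorbed so far

    for _ in range(n_steps):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, against the current centroids.
        nearest = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in batch
        ]
        # Then nudge each centroid toward the points assigned to it.
        for p, j in zip(batch, nearest):
            counts[j] += 1
            lr = 1.0 / counts[j]  # step size shrinks as the centroid settles
            centroids[j] = [(1 - lr) * c + lr * x for c, x in zip(centroids[j], p)]

    return [tuple(c) for c in centroids]


points = [(1, 2), (1, 4), (3, 4), (5, 2), (5, 4), (6, 4)]
print(minibatch_kmeans(points, k=2))
```

Each step touches only `batch_size` points, so the cost per step no longer grows with the size of the dataset. The result is an approximation and can vary from run to run.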

Conclusion

K-Means is an efficient and widely used algorithm for clustering, characterized by its simplicity and scalability. Though it has limitations, various adaptations and careful parameter selection often lead to successful data partitioning in practical applications.

