KMeans clustering in PySpark
Introduction
KMeans in PySpark is the distributed version of a familiar clustering algorithm: assign points to the nearest centroid, recompute centroids, and repeat until convergence. In Spark, the main practical work is preparing a feature vector column and choosing a sensible value for k.
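The assign/recompute loop described above can be sketched on a single machine with plain Python. This is only an illustration of the iteration, not Spark's implementation; the function name `kmeans`, the toy points, and the hand-picked starting centroids are all assumptions made for the example.

```python
import math

def kmeans(points, centroids, max_iter=20):
    """Plain single-machine k-means iteration on tuples of floats."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids, labels)
```

Spark performs the same two steps, but spreads the assignment and the per-cluster sums across partitions of the DataFrame.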
Prepare the Feature Vector
Spark ML algorithms expect a single vector column, usually called features. If your data starts as normal numeric columns, create that vector first with VectorAssembler.
Without this step, KMeans has nothing to cluster because Spark ML does not train directly from separate numeric columns.
Fit a KMeans Model
Once the features column exists, training is straightforward: configure a KMeans estimator and call fit on the DataFrame.
The important parameters are:
- k for the number of clusters
- seed for reproducibility
- featuresCol for the input vector
- predictionCol for the assigned cluster id
After fitting, call clusterCenters() on the model to inspect the learned centroids.
That gives you the numeric center of each cluster, one array per cluster with one value per feature.
Assign Points to Clusters
Call transform on the fitted model to label each row.
This adds a cluster id for every point. The ids themselves are arbitrary. Cluster 0 is not inherently "better" than cluster 1; it is just one of the learned groups.
Evaluate the Clustering
KMeans always returns some clustering, even if it is not meaningful. You need a way to judge whether the result is useful.
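Spark includes ClusteringEvaluator, which can compute a silhouette score from a DataFrame that holds both the features and the predicted cluster ids.
The score ranges from -1 to 1. A higher score generally indicates better-separated clusters, though the score is only one signal. Domain knowledge still matters.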
Choosing k
The algorithm requires k up front, which is both its strength and its weakness. If you pick too few clusters, distinct groups get merged. If you pick too many, you get noisy, fragile partitions.
A practical pattern is to fit a model for several values of k and compare a metric such as the silhouette score across them.
That will not "prove" the correct answer, but it gives you a sensible starting point.
Scale Features When Needed
KMeans uses distance, so feature scale matters. If one feature ranges from 0 to 1 and another ranges from 0 to 1,000,000, the larger-scale feature dominates the clustering.
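The effect is easy to see with plain Euclidean distances. The three points below are hypothetical, and the rescaling simply divides each feature by its stated range, which is a simplification for illustration rather than what any particular Spark scaler does.

```python
import math

# Hypothetical rows: (ratio in [0, 1], revenue in [0, 1_000_000]).
a = (0.10, 500_000.0)
b = (0.90, 500_100.0)  # very different ratio, almost identical revenue
c = (0.10, 510_000.0)  # identical ratio, revenue differs by 10,000

# Raw distances: the revenue axis swamps the ratio axis.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

def rescale(p):
    """Divide each feature by its assumed range so both lie in [0, 1]."""
    return (p[0] / 1.0, p[1] / 1_000_000.0)

# Rescaled distances: both features now contribute comparably.
scaled_ab = math.dist(rescale(a), rescale(b))
scaled_ac = math.dist(rescale(a), rescale(c))

print(raw_ab < raw_ac)        # on raw values, b looks closer to a than c does
print(scaled_ab > scaled_ac)  # after rescaling, c is the closer neighbor
```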
For mixed-scale data, use a scaler before training. Spark provides StandardScaler and other transformers for that purpose.
Common Pitfalls
The most common mistake is forgetting the features vector column. Raw numeric columns are not enough for pyspark.ml.clustering.KMeans.
Another issue is picking k arbitrarily and treating the result as truth. KMeans always finds clusters, even in data that does not naturally cluster well.
Feature scale is another major trap. Unscaled features can produce clusters that mostly reflect units of measurement rather than real structure.
Finally, KMeans is sensitive to initialization and outliers. A fixed seed helps reproducibility, but it does not remove the algorithm's assumptions about roughly spherical, distance-based clusters.
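The sensitivity to outliers follows from centroids being plain means. A tiny one-dimensional illustration with made-up values:

```python
from statistics import mean

# Hypothetical one-dimensional cluster and the same cluster plus one outlier.
cluster = [10.0, 11.0, 9.0, 10.0]
with_outlier = cluster + [100.0]

centroid = mean(cluster)       # 10.0
shifted = mean(with_outlier)   # 28.0, far from every original member

print(centroid, shifted)
```

A single extreme point drags the centroid away from the bulk of the cluster, which in turn can change which points are assigned to it.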
Summary
- Build a features vector with VectorAssembler before fitting KMeans.
- Train with KMeans(...).fit(...) and label rows with transform(...).
- Inspect cluster centers and evaluate the result instead of trusting cluster ids blindly.
- Try multiple values of k and compare them with a metric such as silhouette score.
- Scale features when dimensions have very different numeric ranges.

