KMeans
Clustering
PySpark
Machine Learning
Data Science

KMeans clustering in PySpark

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

KMeans in PySpark is the distributed version of a familiar clustering algorithm: assign points to the nearest centroid, recompute centroids, and repeat until convergence. In Spark, the main practical work is preparing a feature vector column and choosing a sensible value for k.

Prepare the Feature Vector

Spark ML algorithms expect a single vector column, usually called features. If your data starts as normal numeric columns, create that vector first with VectorAssembler.

python
1from pyspark.sql import SparkSession
2from pyspark.ml.feature import VectorAssembler
3
4spark = SparkSession.builder.getOrCreate()
5
6data = [
7    (0.0, 0.0),
8    (0.1, 0.2),
9    (8.0, 8.0),
10    (8.2, 7.9),
11    (4.0, 4.1),
12]
13
14df = spark.createDataFrame(data, ["x", "y"])
15
16assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
17training_df = assembler.transform(df)
18
19training_df.select("x", "y", "features").show(truncate=False)

Without this step, KMeans has nothing to cluster because Spark ML does not train directly from separate numeric columns.

Fit a KMeans Model

Once the features column exists, training is straightforward:

python
1from pyspark.ml.clustering import KMeans
2
3kmeans = KMeans(
4    k=3,
5    seed=42,
6    featuresCol="features",
7    predictionCol="cluster",
8)
9
10model = kmeans.fit(training_df)

The important parameters are:

  • 'k for the number of clusters'
  • 'seed for reproducibility'
  • 'featuresCol for the input vector'
  • 'predictionCol for the assigned cluster id'

After fitting, inspect the learned centroids:

python
for index, center in enumerate(model.clusterCenters()):
    print(index, center)

That gives you the numeric center of each cluster.

Assign Points to Clusters

Use transform to label each row:

python
predictions = model.transform(training_df)
predictions.select("x", "y", "cluster").show()

This adds a cluster id for every point. The ids themselves are arbitrary. Cluster 0 is not inherently "better" than cluster 1; it is just one of the learned groups.

Evaluate the Clustering

KMeans always returns some clustering, even if it is not meaningful. You need a way to judge whether the result is useful.

Spark includes ClusteringEvaluator, which can compute a silhouette score:

python
1from pyspark.ml.evaluation import ClusteringEvaluator
2
3evaluator = ClusteringEvaluator(
4    predictionCol="cluster",
5    featuresCol="features",
6)
7
8score = evaluator.evaluate(predictions)
9print("silhouette score:", score)

A higher score generally indicates better-separated clusters, though the score is only one signal. Domain knowledge still matters.

Choosing k

The algorithm requires k up front, which is both its strength and its weakness. If you pick too few clusters, distinct groups get merged. If you pick too many, you get noisy, fragile partitions.

A practical pattern is to try several values:

python
1for k in range(2, 6):
2    model = KMeans(k=k, seed=42, featuresCol="features").fit(training_df)
3    predictions = model.transform(training_df)
4    score = evaluator.evaluate(predictions)
5    print(k, score)

That will not "prove" the correct answer, but it gives you a sensible starting point.

Scale Features When Needed

KMeans uses distance, so feature scale matters. If one feature ranges from 0 to 1 and another ranges from 0 to 1,000,000, the larger-scale feature dominates the clustering.

For mixed-scale data, use a scaler before training. Spark provides StandardScaler and other transformers for that purpose.

Common Pitfalls

The most common mistake is forgetting the features vector column. Raw numeric columns are not enough for pyspark.ml.clustering.KMeans.

Another issue is picking k arbitrarily and treating the result as truth. KMeans always finds clusters, even in data that does not naturally cluster well.

Feature scale is another major trap. Unscaled features can produce clusters that mostly reflect units of measurement rather than real structure.

Finally, KMeans is sensitive to initialization and outliers. A fixed seed helps reproducibility, but it does not remove the algorithm's assumptions about roughly spherical, distance-based clusters.

Summary

  • Build a features vector with VectorAssembler before fitting KMeans.
  • Train with KMeans(...).fit(...) and label rows with transform(...).
  • Inspect cluster centers and evaluate the result instead of trusting cluster ids blindly.
  • Try multiple values of k and compare them with a metric such as silhouette score.
  • Scale features when dimensions have very different numeric ranges.

Course illustration
Course illustration

All Rights Reserved.