Apache Spark ALS recommendations approach

Introduction

Spark's ALS algorithm is a standard choice for collaborative filtering when you have large user-item interaction data and want scalable recommendations. ALS stands for Alternating Least Squares, a matrix-factorization approach that learns latent factors for users and items.

The goal is not to hand-code business rules for every product pair. Instead, ALS infers hidden preference patterns from observed interactions such as ratings, clicks, purchases, or watch history.

What ALS Learns

Imagine a sparse matrix where rows are users, columns are items, and values are ratings or interaction strengths. Most entries are missing because any one user has seen only a tiny fraction of all items.

ALS factorizes that large sparse matrix into two smaller dense matrices:

  • user factors
  • item factors

A predicted preference score is then the dot product of one user vector and one item vector. Spark alternates between solving for users while holding items fixed, and solving for items while holding users fixed.

That alternating structure is what gives the algorithm its name.
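To make the alternation concrete, here is a minimal NumPy sketch of the idea on a toy 3x3 matrix. It is not Spark's implementation: the function name `als_sketch`, the dense array plus boolean mask, and the plain ridge penalty are all assumptions made for illustration (Spark distributes the work and scales regularization differently).

```python
import numpy as np

def als_sketch(R, observed, rank=2, reg=0.1, iters=10, seed=0):
    """Toy ALS: R is a dense array, observed is a boolean mask of known entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    ridge = reg * np.eye(rank)
    losses = []
    for _ in range(iters):
        # Hold item factors fixed: each user is an independent ridge regression.
        for u in range(n_users):
            Vo = V[observed[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, observed[u]])
        # Hold user factors fixed: solve for each item the same way.
        for i in range(n_items):
            Uo = U[observed[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[observed[:, i], i])
        # Regularized squared error over observed entries only.
        err = (R - U @ V.T)[observed]
        losses.append(float(err @ err + reg * ((U ** 2).sum() + (V ** 2).sum())))
    return U, V, losses

# Rows are users, columns are items; 0.0 marks a missing entry in this toy setup.
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 4.0, 5.0]])
observed = R > 0
U, V, losses = als_sketch(R, observed)
predicted = U @ V.T  # every cell is the dot product of a user and an item vector
```

Because each half-step solves its least-squares problem exactly, the recorded loss never increases from one sweep to the next, which is a useful sanity check when experimenting with a sketch like this.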

Explicit Versus Implicit Feedback

Spark ALS supports two important cases.

Explicit Feedback

This is the classic rating scenario, such as 1 to 5 stars. The input value directly expresses preference strength.

Implicit Feedback

This is more common in production systems. Examples include:

  • clicks
  • purchases
  • watch time
  • listens

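In implicit mode, the observed value is not a rating. It is a confidence signal that a user may prefer an item. Spark changes the objective accordingly: it fits a binary preference (interacted or not) and weights each observation by a confidence derived from the interaction strength.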
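The mapping from a raw interaction count to a preference and a confidence can be sketched in a few lines. This follows the formulation from Hu, Koren, and Volinsky that Spark's documentation cites for implicit feedback; the helper name `to_preference_and_confidence` is hypothetical, and the exact weighting inside Spark may differ in detail.

```python
def to_preference_and_confidence(strength, alpha=1.0):
    """Map a raw interaction strength to (binary preference, confidence weight)."""
    # Any positive interaction counts as "the user prefers this item".
    preference = 1.0 if strength > 0 else 0.0
    # Stronger interactions get more weight; unobserved pairs keep a baseline of 1.
    confidence = 1.0 + alpha * abs(strength)
    return preference, confidence

three_clicks = to_preference_and_confidence(3.0, alpha=40.0)
no_clicks = to_preference_and_confidence(0.0, alpha=40.0)
```

With a larger alpha, observed interactions dominate the loss more strongly relative to the many unobserved pairs, which is why alpha usually needs tuning per data source.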

A Basic PySpark Example

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-demo").getOrCreate()

# (userId, itemId, rating) triples; most user-item pairs are unobserved.
data = [
    (1, 101, 5.0),
    (1, 102, 3.0),
    (2, 101, 4.0),
    (2, 103, 1.0),
    (3, 102, 4.0),
    (3, 103, 5.0),
]

ratings = spark.createDataFrame(data, ["userId", "itemId", "rating"])

als = ALS(
    userCol="userId",
    itemCol="itemId",
    ratingCol="rating",
    rank=10,                   # number of latent factors
    maxIter=10,                # number of alternating sweeps
    regParam=0.1,              # regularization strength
    coldStartStrategy="drop",  # drop rows whose prediction would be NaN
    nonnegative=True,          # constrain factors to be non-negative
)

model = als.fit(ratings)

# Top 2 items for every user seen during training.
recommendations = model.recommendForAllUsers(2)
recommendations.show(truncate=False)
```

This example uses explicit ratings. For implicit feedback, you would set implicitPrefs=True and tune alpha, which scales how much confidence each observed interaction carries, based on the data source.

Key Parameters That Matter

Several parameters strongly affect recommendation quality:

  • rank controls the number of latent factors
  • regParam controls regularization and overfitting risk
  • maxIter sets the number of alternating optimization steps
  • implicitPrefs switches the learning objective

You should tune these with a validation set rather than guessing. For explicit feedback, metrics such as RMSE can help. For real recommender systems, ranking metrics are often more meaningful than pure rating error.
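As a rough illustration of a ranking metric, precision at k measures how many of the top k recommended items a user actually interacted with in held-out data. The function below is a plain-Python sketch with hypothetical inputs; in a Spark pipeline you would more likely compute this with a built-in evaluator over the whole validation set.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical example: a ranked list from the model and held-out positives.
ranked_items = [101, 103, 102]
held_out = {101, 102}
p_at_2 = precision_at_k(ranked_items, held_out, k=2)
p_at_3 = precision_at_k(ranked_items, held_out, k=3)
```

Unlike RMSE, this score only rewards putting relevant items near the top of the list, which is closer to what users see in a top-k recommendation slot.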

Practical Limits of ALS

ALS works well when collaborative signals dominate, but it has known weaknesses:

  • cold start for new users or new items
  • no use of item text or metadata by itself
  • popularity bias in sparse data
  • recommendation quality depends heavily on interaction quality

If you need to incorporate catalog metadata, content features, or session behavior, ALS is often only one component in a larger recommendation system.

Common Pitfalls

  • Treating clicks as explicit ratings instead of using implicit-feedback mode.
  • Ignoring cold-start behavior for unseen users or items.
  • Evaluating only RMSE when the real goal is top-k ranking quality.
  • Forgetting to set coldStartStrategy="drop" during evaluation. The default, "nan", yields NaN predictions for users or items unseen in training, and a single NaN turns an averaged metric such as RMSE into NaN.
  • Expecting ALS alone to solve metadata, freshness, or business-rule constraints.
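The NaN pitfall is easy to reproduce without Spark. The sketch below uses made-up (prediction, actual) pairs to show how one NaN prediction poisons RMSE, and how filtering those rows, which is roughly what coldStartStrategy="drop" does for you, restores a usable number.

```python
import math

def rmse(pairs):
    """Root mean squared error over (prediction, actual) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

# Hypothetical scored rows; the NaN stands in for a user unseen during training.
scored = [(4.5, 5.0), (float("nan"), 3.0), (2.0, 1.0)]

raw_rmse = rmse(scored)  # NaN propagates through the sum

# Keep only rows with a real prediction before computing the metric.
kept = [(p, a) for p, a in scored if not math.isnan(p)]
clean_rmse = rmse(kept)
```

Dropping rows makes the metric computable, but it also hides how many users or items the model could not score, so it is worth tracking that count separately.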

Summary

  • Spark ALS is a scalable collaborative-filtering method based on matrix factorization.
  • It learns latent user and item factors from sparse interaction data.
  • Spark supports both explicit ratings and implicit-feedback recommendation settings.
  • Parameter tuning and evaluation strategy matter as much as the algorithm choice.
  • ALS is strong for collaborative signals, but cold start and metadata limitations remain.
