Creating a sparse matrix with LightFM and printing predictions

Introduction

LightFM is a Python library for building hybrid recommendation systems that combine collaborative filtering with content-based features. It expects interaction data (user-item pairs) as a SciPy sparse matrix in COO or CSR format. The typical workflow is: build an interaction matrix from raw data using lightfm.data.Dataset, train the model with model.fit(), and generate predictions with model.predict(). The sparse format matters because real-world user-item matrices are typically more than 99% empty: a dense matrix for 100K users and 50K items holds 5 billion entries, which is about 40 GB at 8 bytes per entry (float64), or 20 GB as float32.
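
The memory figure above can be checked with a few lines of arithmetic. The sketch below assumes 8-byte float64 entries for the dense case. For the sparse case it assumes, purely for illustration, that 1 in 1,000 cells is non-zero and that each stored interaction costs 12 bytes in COO form (int32 row index, int32 column index, float32 value). These are assumptions, not measurements.

```python
# Back-of-the-envelope memory estimate for a user-item matrix.
n_users, n_items = 100_000, 50_000

# Dense: every cell is stored, 8 bytes each as float64.
dense_bytes = n_users * n_items * 8

# Sparse COO (assumed layout): row index (int32) + column index (int32)
# + value (float32) = 12 bytes per stored interaction.
nnz = n_users * n_items // 1000  # assumption: 0.1% of cells are non-zero
sparse_bytes = nnz * 12

print(f"dense:  {dense_bytes / 1e9:.0f} GB")   # dense:  40 GB
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")  # sparse: 60 MB
```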

Building the Interaction Matrix

python

from lightfm.data import Dataset

# Raw interaction data: (user_id, item_id, weight)
interactions = [
    ('user_1', 'item_A', 1),
    ('user_1', 'item_C', 1),
    ('user_2', 'item_B', 1),
    ('user_2', 'item_C', 1),
    ('user_3', 'item_A', 1),
    ('user_3', 'item_B', 1),
    ('user_3', 'item_D', 1),
]

# Step 1: Create a Dataset and fit it so it knows all users and items
dataset = Dataset()
dataset.fit(
    users=['user_1', 'user_2', 'user_3'],
    items=['item_A', 'item_B', 'item_C', 'item_D']
)

# Step 2: Build the interaction matrix (sparse COO format).
# The second return value holds the per-interaction weights (third tuple element).
interactions_matrix, weights = dataset.build_interactions(interactions)

print(type(interactions_matrix))  # a scipy.sparse coo_matrix (module path varies by SciPy version)
print(interactions_matrix.shape)  # (3, 4): 3 users x 4 items
print(interactions_matrix.toarray())
# [[1 0 1 0]
#  [0 1 1 0]
#  [1 1 0 1]]

Dataset.build_interactions() converts raw user-item pairs into a sparse matrix where rows are users, columns are items, and non-zero values indicate interactions.

Training the Model

python

from lightfm import LightFM

# Create and train the model
model = LightFM(
    no_components=30,      # Embedding dimension
    loss='warp',           # Weighted Approximate-Rank Pairwise (for implicit feedback)
    learning_rate=0.05,
    random_state=42
)

model.fit(
    interactions_matrix,
    epochs=20,
    num_threads=2,
    verbose=True
)

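loss='warp' is recommended for implicit feedback (clicks, views) when the goal is a good top of the recommendation list. loss='bpr' and loss='warp-kos' are also ranking losses for positive-only data. When the data contains both positive (1) and negative (-1) interactions, for example explicit ratings thresholded into likes and dislikes, use loss='logistic'. The no_components parameter controls the embedding dimensionality.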
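
LightFM's documentation describes the logistic loss as suited to data with both positive (1) and negative (-1) interactions. One way to prepare star ratings for it is to threshold them into signed labels. The sketch below assumes that ratings of 4 or more count as positive. The arrays and the threshold are illustrative, not taken from a real dataset.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical explicit ratings on a 1-5 scale, given as internal indices.
user_idx = np.array([0, 0, 1, 2, 2])
item_idx = np.array([0, 2, 1, 0, 3])
stars = np.array([5, 2, 4, 1, 3])

# Assumption: 4 stars or more is a positive signal, anything lower is negative.
labels = np.where(stars >= 4, 1, -1).astype(np.float32)

signed_interactions = coo_matrix((labels, (user_idx, item_idx)), shape=(3, 4))
print(signed_interactions.toarray())
# [[ 1.  0. -1.  0.]
#  [ 0.  1.  0.  0.]
#  [-1.  0.  0. -1.]]
```

A matrix built this way could then be passed to model.fit() on a model created with loss='logistic'.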

Generating Predictions

python

import numpy as np

# Get internal user/item ID mappings.
# mapping() returns (user ids, user features, item ids, item features).
user_id_map, _, item_id_map, _ = dataset.mapping()

# Predict scores for user_1 across all items
user_internal_id = user_id_map['user_1']
n_items = interactions_matrix.shape[1]

scores = model.predict(
    user_ids=user_internal_id,
    item_ids=np.arange(n_items)  # All items
)

# Map scores back to item names
item_names = {v: k for k, v in item_id_map.items()}
ranked_items = sorted(
    zip([item_names[i] for i in range(n_items)], scores),
    key=lambda x: -x[1]
)

print("Recommendations for user_1:")
for item, score in ranked_items:
    print(f"  {item}: {score:.4f}")
# Illustrative output only; actual scores and ordering vary by run and library version:
#   item_A: 0.8234  (already interacted)
#   item_C: 0.7612  (already interacted)
#   item_D: 0.3421  (new recommendation)
#   item_B: 0.1234  (new recommendation)

model.predict() returns a score for each user-item pair. Higher scores indicate stronger predicted preference. Filter out already-interacted items to get new recommendations.
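
Filtering out already-interacted items can be sketched as follows. The helper name top_k_unseen is hypothetical, and the hard-coded scores array stands in for the output of model.predict(). The idea is to mask the seen columns with negative infinity before ranking.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy interaction matrix (3 users x 4 items), same layout as the example above.
interactions_csr = csr_matrix(np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]))

def top_k_unseen(scores, seen_matrix, user_row, k):
    """Return indices of the k highest-scoring items the user has not interacted with."""
    seen = seen_matrix[user_row].indices          # column indices of non-zero entries
    masked = np.asarray(scores, dtype=np.float64).copy()
    masked[seen] = -np.inf                        # push seen items to the bottom
    order = np.argsort(-masked)                   # descending by score
    return [int(i) for i in order[:k] if np.isfinite(masked[i])]

# Hypothetical scores for user 0, standing in for model.predict() output.
scores = np.array([0.9, 0.1, 0.8, 0.3])
print(top_k_unseen(scores, interactions_csr, user_row=0, k=2))  # [3, 1]
```

CSR is used here because it gives cheap access to a single user's row; a COO matrix from Dataset.build_interactions() can be converted with .tocsr().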

Building Sparse Matrix Manually (Without Dataset)

python

from scipy.sparse import coo_matrix
from lightfm import LightFM
import numpy as np

# Direct construction from arrays of internal (integer) indices
user_ids = np.array([0, 0, 1, 1, 2, 2, 2])
item_ids = np.array([0, 2, 1, 2, 0, 1, 3])
ratings  = np.array([1, 1, 1, 1, 1, 1, 1], dtype=np.float32)

interactions = coo_matrix(
    (ratings, (user_ids, item_ids)),
    shape=(3, 4)
)

# Convert to CSR for efficient row slicing
interactions_csr = interactions.tocsr()

model = LightFM(loss='warp', no_components=30)
model.fit(interactions_csr, epochs=20)

# Predict
scores = model.predict(0, np.arange(4))
print(scores)  # Array of 4 scores for user 0
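
The manual route above starts from integer indices. When the raw data uses string IDs, the mapping that Dataset normally maintains has to be built by hand. A minimal sketch, assuming IDs are numbered in order of first appearance (the dictionary names are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix

raw = [
    ('user_1', 'item_A'), ('user_1', 'item_C'),
    ('user_2', 'item_B'), ('user_2', 'item_C'),
    ('user_3', 'item_A'), ('user_3', 'item_B'), ('user_3', 'item_D'),
]

# Assign each distinct ID the next free integer, in order of first appearance.
user_index, item_index = {}, {}
for u, i in raw:
    user_index.setdefault(u, len(user_index))
    item_index.setdefault(i, len(item_index))

rows = np.array([user_index[u] for u, _ in raw], dtype=np.int32)
cols = np.array([item_index[i] for _, i in raw], dtype=np.int32)
vals = np.ones(len(raw), dtype=np.float32)

interactions = coo_matrix(
    (vals, (rows, cols)),
    shape=(len(user_index), len(item_index))
)
print(item_index)  # {'item_A': 0, 'item_C': 1, 'item_B': 2, 'item_D': 3}
print(interactions.toarray())
```

Note that the column order follows first appearance, so item_C lands in column 1 here. Keep the dictionaries around: they are needed to translate predictions back to the original IDs.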

Adding User and Item Features

python

import numpy as np
from lightfm.data import Dataset
from lightfm import LightFM

dataset = Dataset()
dataset.fit(
    users=['u1', 'u2', 'u3'],
    items=['i1', 'i2', 'i3'],
    user_features=['age_young', 'age_old', 'gender_m', 'gender_f'],
    item_features=['genre_action', 'genre_comedy', 'genre_drama']
)

# Build feature matrices
user_features = dataset.build_user_features([
    ('u1', ['age_young', 'gender_m']),
    ('u2', ['age_old', 'gender_f']),
    ('u3', ['age_young', 'gender_f']),
])

item_features = dataset.build_item_features([
    ('i1', ['genre_action']),
    ('i2', ['genre_comedy', 'genre_drama']),
    ('i3', ['genre_action', 'genre_drama']),
])

(interactions_matrix, _) = dataset.build_interactions([
    ('u1', 'i1'), ('u1', 'i3'), ('u2', 'i2'), ('u3', 'i1')
])

model = LightFM(loss='warp', no_components=30)
model.fit(
    interactions_matrix,
    user_features=user_features,
    item_features=item_features,
    epochs=20
)

# Predictions must be given the same feature matrices used in training
scores = model.predict(
    0,                # internal index of user 'u1'
    np.arange(3),     # all three items
    user_features=user_features,
    item_features=item_features
)

User and item features enable the hybrid approach — the model can recommend items to users with no interaction history (cold start) based on their features.
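
Why features help with cold start can be illustrated without the library. In the LightFM model, a user's representation is built from the embeddings of that user's features, so a user with no interactions still gets a vector. The sketch below is conceptual only: random numbers stand in for learned embeddings, bias terms are omitted, equal normalised feature weights are assumed, and the function name represent is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_components = 4

# Stand-ins for learned embeddings: one vector per user feature and per item.
user_feature_names = ['age_young', 'age_old', 'gender_m', 'gender_f']
user_feature_emb = rng.normal(size=(len(user_feature_names), n_components))
item_emb = rng.normal(size=(3, n_components))   # items i1, i2, i3

def represent(features):
    """Build a user vector as the weighted sum of its feature embeddings."""
    weights = np.zeros(len(user_feature_names))
    for name in features:
        weights[user_feature_names.index(name)] = 1.0
    weights /= weights.sum()            # assumed: weights normalised to sum to 1
    return weights @ user_feature_emb

# A brand-new user with no interactions, described only by metadata.
new_user = represent(['age_old', 'gender_m'])
scores = item_emb @ new_user            # one score per item (biases omitted)
print(scores.shape)  # (3,)
```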

Evaluation

python

from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import precision_at_k, auc_score

# Split interactions into train/test.
# Note: the toy matrices above are too small for meaningful metrics;
# use a realistically sized dataset here.
train, test = random_train_test_split(interactions_matrix, test_percentage=0.2)

model = LightFM(loss='warp', no_components=30)
model.fit(train, epochs=20)

# Precision@k: fraction of top-k recommendations that are relevant
p_at_k = precision_at_k(model, test, train_interactions=train, k=5).mean()
print(f"Precision@5: {p_at_k:.4f}")

# AUC: probability that a positive item is ranked above a negative item
auc = auc_score(model, test, train_interactions=train).mean()
print(f"AUC: {auc:.4f}")
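
To make the precision@k definition concrete, here is a hand-rolled version for a single user. It is a sketch of the metric's definition, not LightFM's implementation. The function name and the scores are hypothetical.

```python
import numpy as np

def precision_at_k_single(scores, test_items, train_items, k):
    """Fraction of the top-k ranked items (training items excluded) found in the test set."""
    masked = np.asarray(scores, dtype=np.float64).copy()
    if train_items:
        masked[list(train_items)] = -np.inf   # training positives cannot be recommended
    top_k = np.argsort(-masked)[:k]           # indices of the k highest scores
    hits = sum(1 for i in top_k if int(i) in test_items)
    return hits / k

# Hypothetical scores for one user over six items.
scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6]
train_items = {0}        # seen during training
test_items = {2, 3}      # held-out positives

print(precision_at_k_single(scores, test_items, train_items, k=2))  # 0.5
```

Calling the function with an empty train_items set shows how the ranking changes when training positives are left in.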

Common Pitfalls

Using a dense matrix instead of sparse: LightFM requires scipy.sparse matrices. Passing a NumPy dense array fails or causes massive memory usage. Always use coo_matrix, csr_matrix, or Dataset.build_interactions() to create sparse formats.
Forgetting to call dataset.fit() with all users and items: If a user or item ID appears in the interactions but was not passed to dataset.fit(), build_interactions() raises a ValueError because the ID has no internal index. Include every user and item ID in the fit call, including those that appear only in the test set, or add new IDs later with dataset.fit_partial().
Confusing external and internal IDs: model.predict() uses internal integer IDs (0, 1, 2...), not your original string IDs. Use dataset.mapping() to get the mapping between external IDs and internal indices.
Not passing train_interactions to evaluation: precision_at_k and auc_score use train_interactions to exclude items already seen in training from the ranking. Without it, known training positives compete with the held-out test items for the top positions, which distorts the metrics (typically pulling them down) so they no longer reflect how well the model recommends new items.
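Using loss='warp' with explicit ratings: WARP is a ranking loss designed for implicit, positive-only feedback. For explicit ratings (1-5 stars), either threshold them into positive (1) and negative (-1) labels and use loss='logistic', or keep a ranking loss and pass the ratings as per-interaction weights through the sample_weight argument of model.fit(). loss='warp-kos' is another ranking loss for implicit feedback, not a rating loss, and it does not support sample weights. Without weights, WARP ignores the rating magnitude.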
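
The first pitfall can be caught early with a small guard that runs before training. The helper below is hypothetical (it is not part of LightFM). It rejects dense input and normalises sparse input to a float32 COO matrix.

```python
import numpy as np
from scipy import sparse

def as_interaction_matrix(data):
    """Return `data` as a float32 COO matrix, rejecting dense input early."""
    if not sparse.issparse(data):
        raise TypeError(
            f"expected a scipy.sparse matrix, got {type(data).__name__}; "
            "wrap dense data with scipy.sparse.coo_matrix first"
        )
    return data.tocoo().astype(np.float32)

dense = np.array([[1, 0], [0, 1]])
try:
    as_interaction_matrix(dense)
except TypeError as exc:
    print(f"rejected: {exc}")

checked = as_interaction_matrix(sparse.csr_matrix(dense))
print(checked.format, checked.dtype)  # coo float32
```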

Summary

Build interaction matrices using lightfm.data.Dataset for automatic ID mapping and sparse format
Use scipy.sparse.coo_matrix or csr_matrix for manual sparse matrix construction
Train with model.fit(interactions) using loss='warp' for implicit feedback
Generate predictions with model.predict(user_id, item_ids) — returns scores, not rankings
Add user and item features for hybrid recommendations that handle cold-start users
Use dataset.mapping() to convert between external IDs and internal integer indices