Content based recommender system with sklearn or numpy



Introduction

A content-based recommender suggests items by comparing item features to a user's known preferences. Unlike collaborative filtering, it does not need many other users to have rated the same items. The core pipeline is simple: build item vectors, derive a user profile from liked items, and rank unseen items by similarity to that profile.

Start with Item Features

The item representation is the foundation of the system. For text-heavy items such as articles, jobs, or products, TF-IDF is often a strong baseline.

python

from sklearn.feature_extraction.text import TfidfVectorizer

# One short text description per item
items = [
    "science fiction space adventure",
    "romantic comedy relationship drama",
    "space mission and alien contact",
    "historical biography politics",
]

# Learn the vocabulary and turn each description into a TF-IDF row
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(items)
print(X.shape)  # (number of items, vocabulary size)

Now each item is a row of X, a sparse matrix with one column per vocabulary term. Items that share terms will tend to have similar vector directions.

The same idea works for non-text data too, but then you design the features manually or combine several feature sources.
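One way to combine several feature sources is to stack them side by side into a single matrix. The sketch below assumes a made-up `formats` column (movie, series, book) and arbitrary block weights; neither comes from a real dataset, and the weights would need tuning.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

descriptions = [
    "science fiction space adventure",
    "romantic comedy relationship drama",
    "space mission and alien contact",
    "historical biography politics",
]
# Hypothetical categorical metadata, invented for this example
formats = np.array([["movie"], ["movie"], ["series"], ["book"]])

text_features = TfidfVectorizer().fit_transform(descriptions)
format_features = OneHotEncoder().fit_transform(formats)  # sparse by default

# Assumed weights: how much each feature block contributes to similarity
text_weight = 1.0
format_weight = 0.5

# Stack the blocks column-wise so each item gets one combined vector
combined = hstack(
    [text_weight * text_features, format_weight * format_features]
).tocsr()
print(combined.shape)
```

Scaling each block before stacking is a simple way to keep one feature source from dominating the cosine similarity.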

Build a User Profile from Positive Examples

A simple user profile is the average of the feature vectors for items the user liked.

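python

import numpy as np

liked_indices = [0, 2]
# Sparse .mean() returns an np.matrix, which recent scikit-learn versions reject,
# so convert the result to a plain 2-D array of shape (1, n_features)
user_profile = np.asarray(X[liked_indices].mean(axis=0)).reshape(1, -1)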

This profile is not a magical learned embedding. It is just a summary of the content the user engaged with. That makes content-based systems easy to explain and debug.

If you have ratings rather than binary likes, you can weight the vectors by rating strength instead of averaging them equally.
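A rating-weighted profile could look like the sketch below. The dense vectors, the ratings, and the 1 to 5 scale are hypothetical, and using raw ratings as weights is only one of several reasonable schemes.

```python
import numpy as np

# Hypothetical dense item features, one row per item
item_vectors = np.array([
    [1.0, 0.8, 0.0],
    [0.1, 0.0, 0.9],
    [0.9, 0.7, 0.1],
    [0.0, 0.2, 1.0],
])

# Hypothetical ratings (1 to 5) for the items the user has rated
rated_indices = [0, 2, 3]
ratings = np.array([5.0, 4.0, 1.0])

# Weighted average: highly rated items pull the profile harder
weighted_profile = np.average(item_vectors[rated_indices], axis=0, weights=ratings)
print(weighted_profile)
```

Here the low-rated item 3 still contributes a little. Centering ratings around a neutral value before weighting is an alternative if disliked items should push the profile away instead.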

Rank Unseen Items with Cosine Similarity

Once you have item vectors and a user profile, ranking is straightforward.

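python

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity expects 2-D arrays; np.asarray also guards against an np.matrix profile
scores = cosine_similarity(X, np.asarray(user_profile).reshape(1, -1)).ravel()
for idx, score in enumerate(scores):
    print(idx, score)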

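Higher cosine similarity means the item vector points in a direction closer to the user's profile. Because TF-IDF weights are non-negative, the scores here fall between 0 and 1. Cosine similarity against the profile is the usual baseline recommendation score.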

For a real system, you would filter out items the user already consumed and sort the rest by score.
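A minimal sketch of that filter-and-sort step follows. It assumes the user has consumed only item 0, so the ranking is not all ties, and the `top_k` value is arbitrary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = [
    "science fiction space adventure",
    "romantic comedy relationship drama",
    "space mission and alien contact",
    "historical biography politics",
]
X = TfidfVectorizer().fit_transform(items)

consumed = [0]  # assumption for this example: the user has seen item 0 only
user_profile = np.asarray(X[consumed].mean(axis=0)).reshape(1, -1)
scores = cosine_similarity(X, user_profile).ravel()

# Mask consumed items so they can never be recommended again
candidate_scores = scores.copy()
candidate_scores[consumed] = -np.inf

# Sort the remaining items by descending score and keep the top k
top_k = 2
ranked = np.argsort(candidate_scores)[::-1][:top_k]
for idx in ranked:
    print(idx, items[idx], round(float(candidate_scores[idx]), 3))
```

Item 2 comes first because it shares the term "space" with the consumed item. The other two items share no terms with it and score zero.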

You Can Do the Same Thing in NumPy

If you already have dense numeric features, scikit-learn is not required.

python

import numpy as np

# One row per item, one column per hand-built feature
item_vectors = np.array([
    [1.0, 0.8, 0.0],
    [0.1, 0.0, 0.9],
    [0.9, 0.7, 0.1],
    [0.0, 0.2, 1.0],
])

liked = [0, 2]
user_profile = item_vectors[liked].mean(axis=0)

# Normalize to unit length so the dot product equals cosine similarity
norm_items = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
norm_user = user_profile / np.linalg.norm(user_profile)
scores = norm_items @ norm_user
print(scores)

This is the same cosine-similarity idea, just without the scikit-learn helper.
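One caveat with hand-rolled normalization is that an all-zero item vector leads to a division by zero. The helper below is a sketch of one way to guard against that. The `cosine_scores` name and the `eps` floor are illustrative choices, not part of NumPy.

```python
import numpy as np

def cosine_scores(item_vectors, user_profile, eps=1e-12):
    """Cosine similarity of each item row to the profile, safe for zero vectors."""
    item_norms = np.linalg.norm(item_vectors, axis=1)
    user_norm = np.linalg.norm(user_profile)
    # Floor the denominator so an all-zero vector scores 0 instead of NaN
    denom = np.maximum(item_norms * user_norm, eps)
    return (item_vectors @ user_profile) / denom

item_vectors = np.array([
    [1.0, 0.8, 0.0],
    [0.0, 0.0, 0.0],  # hypothetical item with no features yet
    [0.9, 0.7, 0.1],
])
profile = item_vectors[[0, 2]].mean(axis=0)
scores = cosine_scores(item_vectors, profile)
print(scores)
```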

Feature Quality Matters More Than Algorithm Fanciness

Content-based recommenders often live or die by feature engineering. If item features do not represent meaningful similarity, the similarity math cannot rescue the system.

For example:

- weak tags produce weak profiles
- sparse metadata limits coverage
- overly narrow features make recommendations repetitive

That is why many strong content-based systems spend more effort on item representation than on the ranking formula itself.

Expect Filter Bubbles and Cold-Start Tradeoffs

Content-based recommendation solves item cold start better than collaborative filtering because new items can be recommended as soon as their features exist. But it has its own weaknesses:

- recommendations can become too similar to past preferences
- users may not get enough diversity
- feature bias can dominate the ranking

This is why production systems often blend content-based scores with collaborative or popularity-based signals.
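A simple form of blending is a weighted sum of normalized signals. The sketch below uses made-up content scores and view counts. The `alpha` weight, the log transform, and the min-max scaling are all assumptions that a real system would tune or replace.

```python
import numpy as np

def min_max(x):
    """Rescale values to [0, 1]; return zeros if all values are equal."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

# Hypothetical per-item signals for four candidate items
content_scores = np.array([0.10, 0.80, 0.40, 0.75])  # cosine similarity to the profile
popularity = np.array([5000, 120, 900, 4000])        # e.g. view counts

alpha = 0.7  # assumed weight on the content signal
blended = (
    alpha * min_max(content_scores)
    + (1 - alpha) * min_max(np.log1p(popularity))  # log dampens very popular items
)
ranking = np.argsort(blended)[::-1]
print(ranking)
```

With these numbers, item 3 ranks first: it is nearly as relevant as item 1 but far more popular.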

Common Pitfalls

- Focusing on similarity formulas before building meaningful item features.
- Forgetting to remove already-consumed items from the ranked output.
- Averaging liked items blindly when rating strength or recency should matter.
- Assuming content-based systems automatically provide diverse recommendations.
- Using sparse or low-quality metadata and expecting strong recommendations anyway.
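For the diversity pitfall, one common remedy is to re-rank the top candidates with maximal marginal relevance (MMR), which penalizes items that are too similar to ones already selected. The article does not prescribe this technique; the sketch below is an illustration with hypothetical scores, and `lam` is an assumed trade-off parameter.

```python
import numpy as np

def mmr_rerank(relevance, item_sim, k, lam=0.7):
    """Greedy MMR: pick k items, trading relevance against redundancy."""
    candidates = list(range(len(relevance)))
    selected = []
    while candidates and len(selected) < k:
        def mmr(i):
            # Redundancy is the highest similarity to anything already picked
            redundancy = max((item_sim[i, j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical relevance scores and item-item similarities
relevance = np.array([0.90, 0.88, 0.60])
item_sim = np.array([
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
])

print(mmr_rerank(relevance, item_sim, k=2, lam=0.5))  # balances both terms
print(mmr_rerank(relevance, item_sim, k=2, lam=1.0))  # pure relevance ranking
```

With `lam=0.5`, the near-duplicate item 1 is skipped in favor of the less similar item 2. With `lam=1.0`, the result is the plain relevance order.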

Summary

- A content-based recommender compares item features to a user profile derived from prior preferences.
- TF-IDF plus cosine similarity is a strong baseline for text-heavy items.
- A user profile can be built by averaging or weighting the vectors of liked items.
- NumPy and scikit-learn can both implement the core pipeline cleanly.
- In practice, feature quality and diversity control matter more than using a fancy similarity formula.