Content-Based Recommender System with scikit-learn or NumPy
Introduction
A content-based recommender suggests items by comparing item features to a user's known preferences. Unlike collaborative filtering, it does not need many other users to have rated the same items. The core pipeline is simple: build item vectors, derive a user profile from liked items, and rank unseen items by similarity to that profile.
Start with Item Features
The item representation is the foundation of the system. For text-heavy items such as articles, jobs, or products, TF-IDF is often a strong baseline.
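As a minimal sketch of this step, the snippet below fits scikit-learn's `TfidfVectorizer` on a tiny, made-up catalog. The item texts and the variable names (`items`, `item_vectors`) are illustrative assumptions, not part of any real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy catalog; a real system would load item text from its own store.
items = [
    "python machine learning tutorial",
    "deep learning with python",
    "italian pasta recipes",
    "quick weeknight pasta dinner",
]

# Fit TF-IDF on the item text. Each row of item_vectors is one item,
# each column is one vocabulary term (English stop words are dropped).
vectorizer = TfidfVectorizer(stop_words="english")
item_vectors = vectorizer.fit_transform(items)

print(item_vectors.shape)  # (number of items, vocabulary size)
```

The result is a sparse matrix, which keeps memory use low when the vocabulary is large.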
Now each item is a feature vector. Similar items will tend to have similar vector directions.
The same idea works for non-text data too, but then you design the features manually or combine several feature sources.
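One possible way to combine several feature sources is to stack a text representation next to a one-hot encoding of a categorical field. The sketch below assumes each item has a title and a single category; both lists are invented for illustration.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical item metadata: one title and one category per item.
titles = [
    "python machine learning tutorial",
    "deep learning with python",
    "italian pasta recipes",
    "quick weeknight pasta dinner",
]
categories = [["tech"], ["tech"], ["food"], ["food"]]

# Text features and categorical features are built separately...
text_vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)
category_vectors = OneHotEncoder(handle_unknown="ignore").fit_transform(categories)

# ...then concatenated column-wise so each row describes one item.
item_vectors = hstack([text_vectors, category_vectors]).tocsr()
```

In practice the blocks may need different weights so that one source does not dominate the similarity score; that scaling is left out here.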
Build a User Profile from Positive Examples
A simple user profile is the average of the feature vectors for items the user liked.
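A minimal sketch of the averaging step, assuming TF-IDF item vectors and a hypothetical list `liked_indices` holding the row positions of the items the user liked:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy catalog, rebuilt here so the example runs on its own.
items = [
    "python machine learning tutorial",
    "deep learning with python",
    "italian pasta recipes",
    "quick weeknight pasta dinner",
]
item_vectors = TfidfVectorizer(stop_words="english").fit_transform(items)

# Assumption: the user liked the two Python-related items.
liked_indices = [0, 1]

# Average the liked rows. The sparse mean is converted to a plain
# 2D NumPy array with one row so it can be passed to similarity helpers.
user_profile = np.asarray(item_vectors[liked_indices].mean(axis=0)).reshape(1, -1)
```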
This profile is not a magical learned embedding. It is just a summary of the content the user engaged with. That makes content-based systems easy to explain and debug.
If you have ratings rather than binary likes, you can weight the vectors by rating strength instead of averaging them equally.
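Rating-weighted averaging could look like the sketch below. It uses small dense feature vectors and made-up ratings; `np.average` divides by the sum of the weights, so the profile stays on the same scale as the item vectors.

```python
import numpy as np

# Hypothetical dense item features (rows are items, columns are features).
item_features = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Assumption: the user rated items 0 and 1 with 5 and 1 stars.
rated_indices = [0, 1]
ratings = np.array([5.0, 1.0])

# Weighted mean of the rated rows; higher ratings pull the profile harder.
user_profile = np.average(item_features[rated_indices], axis=0, weights=ratings)
```

Here the profile leans strongly toward the first feature because the 5-star item contributes five times the weight of the 1-star item. Whether low ratings should count as weak positives or as negatives is a design choice this sketch does not address.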
Rank Unseen Items with Cosine Similarity
Once you have item vectors and a user profile, ranking is straightforward.
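The scoring step can be sketched with scikit-learn's `cosine_similarity`. The catalog and the liked item are illustrative assumptions; the setup is repeated so the snippet runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalog.
items = [
    "python machine learning tutorial",
    "deep learning with python",
    "italian pasta recipes",
    "quick weeknight pasta dinner",
]
item_vectors = TfidfVectorizer(stop_words="english").fit_transform(items)

# Assumption: the user liked only the first item.
liked_indices = [0]
user_profile = np.asarray(item_vectors[liked_indices].mean(axis=0)).reshape(1, -1)

# One similarity score per item, flattened to a 1D array.
scores = cosine_similarity(user_profile, item_vectors).ravel()
```

In this toy data the second item shares terms with the liked item and scores above zero, while the two pasta items share none and score zero.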
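Higher cosine similarity means the item vector points in a direction closer to the user's profile, regardless of vector length. With non-negative features such as TF-IDF, scores fall between 0 and 1. This makes cosine similarity a common baseline recommendation score.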
For a real system, you would filter out items the user already consumed and sort the rest by score.
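A small sketch of that filtering and sorting step follows. The helper name `recommend` and its parameters are hypothetical, and the scores are made up.

```python
import numpy as np

def recommend(scores, consumed_indices, top_k=3):
    """Return indices of the top_k highest-scoring items not yet consumed."""
    scores = np.asarray(scores, dtype=float).copy()
    consumed = np.array(sorted(consumed_indices), dtype=int)
    scores[consumed] = -np.inf          # mask items the user has already seen
    ranked = np.argsort(-scores)        # indices in descending score order
    ranked = ranked[np.isfinite(scores[ranked])]  # drop the masked items
    return ranked[:top_k].tolist()

# Hypothetical similarity scores for four items; item 0 was already consumed.
scores = [0.9, 0.1, 0.7, 0.4]
print(recommend(scores, consumed_indices={0}, top_k=2))  # → [2, 3]
```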
You Can Do the Same Thing in NumPy
If you already have dense numeric features, scikit-learn is not required.
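A minimal NumPy-only sketch, assuming a small dense feature matrix. The function name `cosine_scores` and the `eps` guard against zero-length vectors are illustrative choices.

```python
import numpy as np

def cosine_scores(user_profile, item_features, eps=1e-12):
    """Cosine similarity between one profile vector and every item row."""
    item_norms = np.linalg.norm(item_features, axis=1)
    profile_norm = np.linalg.norm(user_profile)
    # Dot products divided by the product of norms; eps avoids division by zero.
    return item_features @ user_profile / np.maximum(item_norms * profile_norm, eps)

# Hypothetical dense item features (rows are items).
item_features = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])

# Assumption: the user liked item 0, so the profile is that row's mean.
user_profile = item_features[[0]].mean(axis=0)
scores = cosine_scores(user_profile, item_features)
```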
This is the same cosine-similarity idea, just without the scikit-learn helper.
Feature Quality Matters More Than Algorithm Fanciness
Content-based recommenders often live or die by feature engineering. If item features do not represent meaningful similarity, the similarity math cannot rescue the system.
For example:
- weak tags produce weak profiles
- sparse metadata limits coverage
- overly narrow features make recommendations repetitive
That is why many strong content-based systems spend more effort on item representation than on the ranking formula itself.
Expect Filter Bubbles and Cold-Start Tradeoffs
Content-based recommendation handles item cold start better than collaborative filtering because new items can be recommended as soon as their features exist. User cold start remains, since a user with no history has no profile to match against. The approach also has its own weaknesses:
- recommendations can become too similar to past preferences
- users may not get enough diversity
- feature bias can dominate the ranking
This is why production systems often blend content-based scores with collaborative or popularity-based signals.
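One simple way to blend signals is a weighted sum after rescaling each score to a common range. The sketch below is only an illustration: the function `blend_scores`, the min-max rescaling, and the weight `alpha` are assumptions, and real systems may use quite different blending schemes.

```python
import numpy as np

def min_max(x):
    """Rescale values to [0, 1]; return zeros if all values are equal."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def blend_scores(content_scores, popularity_scores, alpha=0.7):
    """Weighted blend of two signals; alpha is a hypothetical tuning knob."""
    return alpha * min_max(content_scores) + (1 - alpha) * min_max(popularity_scores)

# Hypothetical inputs: content similarity and raw view counts for three items.
content = [0.9, 0.5, 0.1]
popularity = [10, 1000, 500]
blended = blend_scores(content, popularity)
```

With `alpha=0.7` the content score still dominates, but the popular second item closes most of the gap to the first.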
Common Pitfalls
- Focusing on similarity formulas before building meaningful item features.
- Forgetting to remove already-consumed items from the ranked output.
- Averaging liked items blindly when rating strength or recency should matter.
- Assuming content-based systems automatically provide diverse recommendations.
- Using sparse or low-quality metadata and expecting strong recommendations anyway.
Summary
- A content-based recommender compares item features to a user profile derived from prior preferences.
- TF-IDF plus cosine similarity is a strong baseline for text-heavy items.
- A user profile can be built by averaging or weighting the vectors of liked items.
- NumPy and scikit-learn can both implement the core pipeline cleanly.
- In practice, feature quality and diversity control matter more than using a fancy similarity formula.

