Recommendation Systems
Recommendation is a retrieval and ranking problem. Given a catalog of millions of items and a single user, the system must select a handful of items that the user is most likely to engage with: click, watch, purchase, or save. This sounds simple until you consider the scale: evaluating every item for every user on every request is computationally infeasible. A catalog of 10 million items and 100 million users produces a quadrillion (10^15) possible user-item pairs. No model, regardless of sophistication, can score all of them in real time.
The foundation of every recommendation system is the user-item interaction matrix. Rows are users, columns are items, and each cell records an interaction — a rating, a click, a purchase, or nothing at all. This matrix is almost entirely empty. On Netflix, a typical user has watched a few hundred titles out of tens of thousands available. On Amazon, a user has purchased maybe a hundred products out of hundreds of millions. Sparsity rates above 99% are normal. The entire challenge of recommendation is predicting what belongs in those empty cells.
Implicit vs Explicit Feedback
Explicit feedback means the user directly tells you their preference — a 5-star rating, a thumbs up, a "not interested" click. This data is clean and interpretable but rare. Most users never rate anything. Implicit feedback comes from observed behavior — what they clicked, how long they watched, what they added to cart, what they scrolled past. Implicit signals are abundant but noisy. A user watching 30 seconds of a video might mean they loved the intro or hated it and left. Interpreting implicit feedback correctly is one of the hardest practical challenges in recommendation.
The practical difference between these feedback types shapes the entire system design. With explicit ratings, you can train models to predict a numeric score (regression). With implicit feedback, you typically train binary classifiers (will the user interact with this item or not?) and use the predicted probability as a ranking score. The training data also differs: explicit feedback contains both positive and negative judgments (users give high and low ratings), while implicit feedback is almost entirely positive-only (you observe what users did but rarely what they chose not to do), so negatives have to be inferred or sampled.
Business Objectives Mapped to ML Objectives
The business wants engagement, revenue, and diversity. The ML model needs a concrete optimization target. Engagement maps to click-through rate prediction. Revenue maps to expected value per recommendation (probability of purchase multiplied by item price). Diversity is harder — it requires the system to balance relevance with coverage across categories, which often conflicts with pure relevance optimization. A model trained purely on click-through rate will recommend the same popular items to everyone, which maximizes short-term clicks but degrades the user experience over time.
Multi-objective optimization is how production systems handle these competing goals. Rather than optimizing a single metric, the model predicts multiple outcomes — probability of click, probability of purchase, expected watch time, probability of share — and combines them with business-defined weights. The weights reflect strategic priorities: a platform focused on growth might weight engagement heavily, while a marketplace might weight purchase probability and revenue. These weights are tuned through online A/B tests where different weighting schemes compete against each other on real user traffic.
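The weighted combination described above can be sketched in a few lines. This is a minimal illustration, not a production scorer: the field names, the two example items, and the weight values are all hypothetical, and real weights would come out of the A/B tests mentioned in the text.

```python
from dataclasses import dataclass

@dataclass
class Predictions:
    """Per-item model outputs. Field names are illustrative assumptions."""
    p_click: float
    p_purchase: float
    expected_watch_minutes: float
    p_share: float

# Hypothetical business weights; a real system would tune these online.
WEIGHTS = {
    "p_click": 1.0,
    "p_purchase": 5.0,
    "expected_watch_minutes": 0.1,
    "p_share": 2.0,
}

def combined_score(pred, weights=WEIGHTS):
    """Weighted sum of the predicted outcomes for one item."""
    return sum(w * getattr(pred, name) for name, w in weights.items())

items = {
    "viral_clip": Predictions(0.30, 0.00, 1.0, 0.10),
    "product_demo": Predictions(0.10, 0.08, 3.0, 0.01),
}

# Rank items by the combined score, highest first.
ranked = sorted(items, key=lambda name: combined_score(items[name]), reverse=True)
```

With these made-up weights the purchase-oriented item outranks the item with the higher click probability, which is the kind of shift a marketplace-style weighting is meant to produce.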
Why Two Stages
A single ranking model that scores every item in the catalog for every request does not scale. If your ranking model takes 1 millisecond per item and your catalog has 10 million items, scoring the full catalog takes 10,000 seconds per request. The solution is a two-stage architecture: candidate generation and ranking.

Candidate generation is the cheap filter. It uses lightweight models (embedding lookups, approximate nearest neighbors, simple heuristics) to narrow millions of items down to hundreds or low thousands. These models sacrifice precision for speed — they miss some good items but run in milliseconds. The ranking stage then applies a complex, feature-rich model to this small candidate set to produce the final top 10 or 20 items the user sees.
Many production systems actually use multiple candidate generators in parallel — one based on collaborative filtering embeddings, another on content similarity, a third on trending items, and a fourth on geographic relevance. The union of their outputs forms the candidate set for ranking. This multi-source approach increases recall (the chance that a truly relevant item makes it past candidate generation) at minimal additional latency cost since the generators run concurrently.
The two-stage architecture exists because running a complex ranking model on millions of items per request is prohibitively expensive. Candidate generation acts as a cheap filter — using fast, approximate methods to narrow millions down to hundreds — so the expensive ranking model only scores a manageable set. This is the same principle behind database indexing: reduce the search space before applying expensive operations.
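A toy version of the two-stage flow makes the cost structure concrete. Everything here is an assumption for illustration: the candidate sources are plain dictionaries of cheap scores, and `expensive_score` stands in for a real ranking model. The `calls` list records which items the expensive function actually sees.

```python
import heapq

def generate_candidates(sources, per_source=3):
    """Union the top items from several cheap candidate sources."""
    candidates = set()
    for scored_items in sources:  # each source maps item_id -> cheap score
        top = heapq.nlargest(per_source, scored_items, key=scored_items.get)
        candidates.update(top)
    return candidates

def rank(candidates, score_fn, k=2):
    """Apply the expensive scoring function only to the candidate set."""
    return heapq.nlargest(k, candidates, key=score_fn)

# Two hypothetical candidate sources with cheap scores.
collaborative = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.05}
trending = {"d": 0.95, "e": 0.7, "a": 0.2, "f": 0.1}

calls = []  # items the expensive model was asked to score

def expensive_score(item):
    calls.append(item)
    return {"a": 0.5, "b": 0.9, "c": 0.2, "d": 0.7, "e": 0.4, "f": 0.99}[item]

candidates = generate_candidates([collaborative, trending], per_source=2)
top = rank(candidates, expensive_score, k=2)
```

Item `f` would have received the highest ranking score, but neither source retrieved it, so the ranker never sees it. That is the recall cost of candidate generation, and the reason several sources are combined.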
Evaluation Metrics
Measuring recommendation quality requires metrics that go beyond simple accuracy. Precision at K measures what fraction of the top K recommended items are relevant. Recall at K measures what fraction of all relevant items appear in the top K. NDCG (Normalized Discounted Cumulative Gain) accounts for the position of relevant items — a relevant item at position 1 counts more than one at position 10 because users are more likely to see items higher in the list. In practice, teams track online metrics (click-through rate, conversion rate, session duration) alongside offline metrics because offline evaluations do not capture the full user experience.
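The three offline metrics can be written directly from their definitions. The sketch below assumes binary relevance (an item is either relevant or not) and uses the common log2 position discount for NDCG; graded relevance would need a gain value per item instead of a set.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(item in relevant for item in recommended[:k]) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(item in relevant for item in recommended[:k]) / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: each hit is discounted by log2(position + 1)."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "z"}

p5 = precision_at_k(recommended, relevant, 5)  # 2 hits out of 5 shown
r5 = recall_at_k(recommended, relevant, 5)     # 2 hits out of 3 relevant
n5 = ndcg_at_k(recommended, relevant, 5)
```

Moving a relevant item from a late position to an early one leaves precision and recall unchanged but raises NDCG, which is why NDCG is the usual choice when list order matters.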
A critical gap between offline and online evaluation is the presentation bias problem. Offline datasets only contain interactions with items that the previous recommendation system chose to show. Items the old system never surfaced have no interaction data, so offline evaluation systematically underestimates their potential. A new model that would recommend better items might appear worse in offline evaluation simply because those items were never tested. This is why online A/B testing remains the gold standard — it measures user response to the actual recommendations, not a simulation biased by historical presentation decisions.
Real-World System Considerations
Production recommendation systems serve requests at massive scale — YouTube processes billions of recommendation requests per day. This requires careful engineering beyond the model itself: embedding indexes must fit in memory across a distributed serving fleet, feature stores must return features within single-digit milliseconds, and the entire pipeline from candidate generation through ranking and reranking must complete within a latency budget (typically 100-200 milliseconds). Teams spend as much engineering effort on serving infrastructure as on model development.
Feedback loops are a systemic risk. The model recommends items, users interact with those items, and the interaction data trains the next version of the model. If the model has a bias (say, over-recommending a particular category), users interact more with that category, which reinforces the bias in the next training cycle. Breaking these loops requires deliberate countermeasures: exploration, counterfactual evaluation, and holdout sets where a random policy (not the model) selects recommendations to provide unbiased training signal.
Collaborative filtering is built on one assumption: users who agreed in the past will agree in the future. If User A and User B both loved the same 20 movies, and User A loved a 21st movie that User B has not seen, then User B will probably like it too. No information about the movies themselves is needed — no genre, no director, no description. The signal comes entirely from patterns of co-occurrence in user behavior.
User-Based Collaborative Filtering
The simplest version finds users similar to you and recommends what they liked. Given a target user, compute a similarity score (cosine similarity or Pearson correlation) between their interaction vector and every other user's vector. Find the top K most similar users. Aggregate the items those users liked that the target user has not seen, weighted by similarity. This is intuitive but has a fatal scaling problem: comparing one user against 100 million others on every request is slow. Precomputing all pairwise similarities produces a 100-million-by-100-million matrix, which is impractical to store or update.
Cosine similarity measures the angle between two vectors, ignoring magnitude. This means a user who rated 10 movies and a user who rated 1,000 movies can still be compared meaningfully — cosine similarity focuses on the pattern of preferences, not how prolific the user is. Pearson correlation goes further by centering each user's ratings around their mean, accounting for the fact that some users rate generously (average 4.5 stars) while others rate harshly (average 2.5 stars). The choice between them affects which users are considered "similar" and can meaningfully change the recommendations.
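A small example shows how the two measures can disagree. It assumes the users being compared have rated the same four items, so their rating vectors line up position by position; the ratings themselves are made up.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(u, v):
    """Pearson correlation: cosine similarity after centering each user's ratings."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

generous = [5, 4, 5, 4]  # rates everything highly
harsh = [3, 2, 3, 2]     # same preferences, lower baseline
opposite = [4, 5, 4, 5]  # similar baseline, reversed preferences

cos_same_taste = cosine(generous, harsh)
cos_opposite_taste = cosine(generous, opposite)
pearson_same_taste = pearson(generous, harsh)
pearson_opposite_taste = pearson(generous, opposite)
```

On raw ratings, cosine similarity scores both pairs above 0.97 because all the vectors point in roughly the same positive direction. Pearson separates them cleanly: the generous and harsh raters correlate at 1, while the pair with reversed preferences correlates at -1.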

Item-Based Collaborative Filtering
Instead of finding similar users, find similar items. If a user liked Item A, find items that tend to co-occur with Item A across all users. Item similarity is more stable than user similarity — the set of items changes slowly while user preferences shift constantly. Amazon popularized this approach with "customers who bought X also bought Y." Precomputing item-item similarity is feasible because item catalogs are smaller than user bases (millions of items vs hundreds of millions of users), and item relationships change slowly enough that periodic batch recomputation works.
The item-item similarity computation is driven by users who interacted with both items. If 1,000 users bought Item A and 800 of them also bought Item C, the similarity between A and C is high. This co-purchase signal captures relationships that item features alone might miss. The often-repeated (and possibly apocryphal) retail example is that shoppers who buy diapers also buy beer, a pattern that no content-based system would discover because the two items share no features.
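One common way to turn co-occurrence counts into a similarity is cosine similarity over binary interactions: the number of users who have both items, divided by the geometric mean of each item's user count. The sketch below assumes that definition and uses an invented set of baskets; other normalizations (Jaccard, conditional probability) are equally valid choices.

```python
import math
from collections import defaultdict
from itertools import combinations

def item_similarities(baskets):
    """Cosine similarity between items over binary user-item interactions."""
    item_count = defaultdict(int)  # users who interacted with each item
    pair_count = defaultdict(int)  # users who interacted with both items
    for items in baskets.values():
        unique = sorted(set(items))
        for item in unique:
            item_count[item] += 1
        for a, b in combinations(unique, 2):
            pair_count[(a, b)] += 1
    return {
        pair: n_both / math.sqrt(item_count[pair[0]] * item_count[pair[1]])
        for pair, n_both in pair_count.items()
    }

# Hypothetical purchase histories.
baskets = {
    "u1": ["diapers", "beer"],
    "u2": ["diapers", "beer", "wipes"],
    "u3": ["diapers", "wipes"],
    "u4": ["beer"],
}

sims = item_similarities(baskets)  # keys are alphabetically ordered item pairs
```

Because the pair counts only need one pass over each user's history, this computation fits the periodic batch recomputation the text describes.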
Matrix Factorization
Both user-based and item-based approaches struggle with sparsity. The interaction matrix is 99%+ empty, so computing similarity from sparse vectors is unreliable. Matrix factorization solves this by decomposing the sparse user-item matrix into two dense, low-dimensional matrices: a user matrix and an item matrix. Each user becomes a dense vector of K latent factors (say K=50). Each item becomes a dense vector of K latent factors. The dot product of a user vector and an item vector predicts the rating or interaction score.
The math is straightforward. If R is the user-item matrix, find matrices U (users by K) and V (items by K) such that R is approximately equal to U multiplied by V-transpose. For a fully observed matrix, truncated SVD (Singular Value Decomposition) gives the best rank-K approximation in the least-squares sense, but classical SVD is not defined when entries are missing, and filling the gaps with zeros distorts the result. ALS (Alternating Least Squares) alternates between fixing U and solving for V, then fixing V and solving for U, iterating until convergence. This handles missing values naturally, because you only optimize over observed entries.
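The alternating procedure can be sketched in pure Python for a tiny example. This is a teaching sketch under stated assumptions: it adds an L2 penalty `reg` to keep each per-row solve well-posed, uses a hand-written Gaussian elimination instead of a linear algebra library, and runs on six invented ratings. A real system would use a library implementation.

```python
import random

def solve(A, b):
    """Solve the small system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def update_side(fixed, observed_by_row, k, reg):
    """Ridge regression per row, holding the other side's factors fixed."""
    updated = {}
    for row, observed in observed_by_row.items():
        A = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
        b = [0.0] * k
        for col, rating in observed:  # observed entries only
            f = fixed[col]
            for i in range(k):
                b[i] += rating * f[i]
                for j in range(k):
                    A[i][j] += f[i] * f[j]
        updated[row] = solve(A, b)
    return updated

def als(ratings, k=2, reg=0.1, iterations=20, seed=0):
    """ratings: {(user, item): value}. Assumes iterations >= 1."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for (u, i), r in ratings.items():
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    V = {i: [rng.uniform(-0.5, 0.5) for _ in range(k)] for i in by_item}
    for _ in range(iterations):
        U = update_side(V, by_user, k, reg)  # fix V, solve for U
        V = update_side(U, by_item, k, reg)  # fix U, solve for V
    return U, V

def predict(U, V, u, i):
    return sum(a * b for a, b in zip(U[u], V[i]))

def objective(ratings, U, V, reg):
    """Squared error on observed entries plus the L2 penalty on all factors."""
    fit = sum((r - predict(U, V, u, i)) ** 2 for (u, i), r in ratings.items())
    penalty = reg * sum(x * x for vec in list(U.values()) + list(V.values()) for x in vec)
    return fit + penalty

ratings = {
    ("u1", "i1"): 5, ("u1", "i2"): 4,
    ("u2", "i1"): 4, ("u2", "i3"): 1,
    ("u3", "i2"): 2, ("u3", "i3"): 5,
}
U, V = als(ratings)
unseen_prediction = predict(U, V, "u1", "i3")  # a cell that was empty in the input
```

Each half-step is an exact minimization of the regularized objective over one side's factors, so the objective cannot increase from one iteration to the next. That property is what makes ALS easy to reason about and to parallelize across rows.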
The choice of K (the number of latent dimensions) controls the trade-off between expressiveness and generalization. Too few dimensions (K=5) cannot capture the complexity of user preferences — a user who likes both action movies and French cinema needs enough dimensions to represent both interests. Too many dimensions (K=500) risk overfitting to noise in the sparse data, memorizing specific user-item interactions rather than learning generalizable patterns. In practice, K values between 50 and 200 work well for most recommendation problems, and the optimal value is found through cross-validation.
In system design interviews, interviewers test whether you understand the sparsity problem: 99%+ of the user-item matrix is empty. Matrix factorization works despite this because it compresses the sparse, high-dimensional matrix into dense, low-dimensional vectors that capture latent patterns. The key insight is that a 50-dimensional embedding can represent user taste better than a million-dimensional sparse vector.
Scaling Challenges
The interaction matrix grows as users and items increase. Adding 1 million new users adds 1 million rows. Adding 100,000 new items adds 100,000 columns. Retraining the full factorization becomes expensive. Solutions include incremental updates (fold new users into existing item factors without full retraining), distributed training (split the matrix across machines), and online learning (update factors with each new interaction). In practice, most systems retrain periodically (daily or hourly) and use the existing model for real-time serving between retrains.
Distributed matrix factorization partitions the interaction data across worker machines. Each worker updates a subset of user and item factors using its local data, then synchronizes with other workers. Parameter server architectures coordinate this by centrally storing the factor matrices while workers pull and push updates. At the scale of Spotify (600+ million users, 100+ million tracks), this distributed training is the only way to complete a full retraining cycle within a few hours.
Implicit Feedback in Collaborative Filtering
Most real-world systems run on implicit feedback — clicks, views, purchases, time spent — rather than explicit ratings. Implicit feedback requires different treatment: there are no negative signals (a user not clicking is not the same as disliking), confidence varies (watching 90% of a video is stronger than watching 10%), and the data is much denser than explicit ratings. Weighted matrix factorization assigns a confidence weight to each observation, with higher confidence for stronger signals (purchase beats view) and low confidence for missing entries.
Negative sampling is a key technique for training on implicit data. Since the vast majority of user-item pairs have no interaction, training on all of them as negatives would be computationally prohibitive and would overwhelm the positive signal. Instead, for each positive interaction, randomly sample a small number (say 4-10) of unobserved items as negatives. This fixes the ratio of negatives to positives and keeps the training set computationally tractable. The sampling can be uniform (each unobserved item has equal probability of being selected as a negative) or popularity-biased (popular items are more likely to be drawn as negatives, which helps the model learn to distinguish genuine preference from popularity).
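Both sampling schemes can be sketched with the standard library. The function name, the `item_weights` argument, and the toy catalog are assumptions for illustration. Uniform sampling draws without replacement, while the popularity-biased branch uses `random.choices`, which draws with replacement, so a popular item can appear more than once.

```python
import random

def sample_negatives(positives, catalog, num_negatives=4, item_weights=None, seed=0):
    """Return (user, item, label) rows: each positive plus sampled unobserved items."""
    rng = random.Random(seed)
    seen = {}
    for user, item in positives:
        seen.setdefault(user, set()).add(item)

    rows = []
    for user, item in positives:
        rows.append((user, item, 1))
        unobserved = [i for i in catalog if i not in seen[user]]
        n = min(num_negatives, len(unobserved))
        if item_weights is None:
            # Uniform: every unobserved item is equally likely.
            negatives = rng.sample(unobserved, n)
        else:
            # Popularity-biased: weight each unobserved item by its popularity.
            weights = [item_weights[i] for i in unobserved]
            negatives = rng.choices(unobserved, weights=weights, k=n)
        rows.extend((user, neg, 0) for neg in negatives)
    return rows

catalog = [f"i{n}" for n in range(10)]
positives = [("u1", "i0"), ("u1", "i1"), ("u2", "i2")]
popularity = {item: 10 - n for n, item in enumerate(catalog)}  # hypothetical counts

uniform_rows = sample_negatives(positives, catalog, num_negatives=4)
biased_rows = sample_negatives(positives, catalog, num_negatives=4, item_weights=popularity)
```

Filtering out items the user has already interacted with matters: without it, a genuine positive can be drawn as a negative and the labels contradict each other.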
Limitations of Collaborative Filtering
The most fundamental limitation is the cold-start problem — collaborative filtering cannot work for new users or new items that have no interaction history. A new user has an empty row in the interaction matrix, so there is nothing to compare against other users. A new item has an empty column, so no user behavior exists to drive recommendations. This limitation is what motivates content-based and hybrid approaches.
Another limitation is the popularity bias. Items with many interactions dominate the similarity calculations, so collaborative filtering tends to recommend popular items over niche ones. A user with unique tastes gets pushed toward mainstream choices because the model has seen more data for popular items. Addressing this requires explicit popularity debiasing — down-weighting popular items in the training loss or reranking results to reduce popularity concentration.
Regularization in Matrix Factorization
Without regularization, matrix factorization overfits to the observed entries — the model memorizes exact ratings rather than learning generalizable patterns. L2 regularization (adding a penalty proportional to the squared magnitude of the factor vectors) prevents this by shrinking the latent factors toward zero, encouraging the model to find simpler, more generalizable representations. The regularization strength is a hyperparameter that trades off fitting the training data against generalization to new user-item pairs. Too little regularization and the model overfits; too much and the model underfits, producing bland recommendations that ignore individual preferences.
Bias terms are another important component. Users have different baselines: some rate everything highly, others are harsh critics. Items have different baselines too: some titles receive consistently higher average ratings than others regardless of who rates them. Adding user bias and item bias terms to the factorization model captures these baseline effects, allowing the latent factors to focus on modeling the genuine interaction between user preferences and item characteristics rather than being confounded by systematic differences in rating behavior.
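Regularization and bias terms come together in the update rule. The sketch below shows one stochastic gradient step on a single observed rating for a model of the form global mean + user bias + item bias + dot product. Applying the same L2 penalty to the biases as to the factors is a common convention but an assumption here, as are the learning rate, the penalty strength, and the starting values.

```python
def predict(mu, b_user, b_item, p_user, q_item):
    """Global mean + user bias + item bias + latent-factor interaction."""
    return mu + b_user + b_item + sum(p * q for p, q in zip(p_user, q_item))

def sgd_step(rating, mu, b_user, b_item, p_user, q_item, lr=0.01, reg=0.05):
    """One SGD update for the L2-regularized squared-error loss on one rating."""
    err = rating - predict(mu, b_user, b_item, p_user, q_item)
    new_b_user = b_user + lr * (err - reg * b_user)
    new_b_item = b_item + lr * (err - reg * b_item)
    # Both factor updates use the old values of the other vector.
    new_p = [p + lr * (err * q - reg * p) for p, q in zip(p_user, q_item)]
    new_q = [q + lr * (err * p - reg * q) for p, q in zip(p_user, q_item)]
    return new_b_user, new_b_item, new_p, new_q

mu = 3.5                      # hypothetical global average rating
b_user, b_item = 0.0, 0.0
p_user, q_item = [0.1, 0.1], [0.1, 0.1]
rating = 5.0

before = predict(mu, b_user, b_item, p_user, q_item)
b_user, b_item, p_user, q_item = sgd_step(rating, mu, b_user, b_item, p_user, q_item)
after = predict(mu, b_user, b_item, p_user, q_item)
```

The `reg` terms pull every parameter back toward zero on each step, which is the shrinkage the text describes. A larger `reg` makes predictions stay closer to the global mean.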
Content-based filtering recommends items similar to what a user has liked before, using the items' own features rather than other users' behavior. If a user watched three sci-fi movies directed by Denis Villeneuve, a content-based system recommends more sci-fi films and more Villeneuve films — because the item features match, not because other users liked the same combination.
How Content-Based Filtering Works
Each item is represented as a feature vector: genre tags, text descriptions, image embeddings, price range, creator, publication date. Each user is represented as a profile built from the features of items they have interacted with. Recommendation is then a similarity search — find items whose feature vectors are closest to the user profile vector. TF-IDF on text descriptions, one-hot encoding of categorical features, and pre-trained neural network embeddings for images or text all produce usable item vectors.
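A minimal version of this pipeline, using only TF-IDF on text, fits in a short script. The sketch assumes whitespace tokenization and a simple smoothed IDF, log((1 + N) / (1 + df)), which is one of several common variants. The item descriptions are invented, and the user profile is the plain average of the liked items' vectors.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {item_id: text}. Returns {item_id: {term: weight}}, L2-normalized."""
    tokenized = {item: text.lower().split() for item, text in docs.items()}
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    n_docs = len(docs)
    vectors = {}
    for item, tokens in tokenized.items():
        counts = Counter(tokens)
        vec = {t: c * math.log((1 + n_docs) / (1 + doc_freq[t])) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors[item] = {t: w / norm for t, w in vec.items()}
    return vectors

def user_profile(vectors, liked_items):
    """Average the feature vectors of the items the user liked."""
    profile = Counter()
    for item in liked_items:
        for term, weight in vectors[item].items():
            profile[term] += weight / len(liked_items)
    return profile

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def recommend(vectors, liked_items, k=2):
    """Rank unseen items by similarity to the user profile."""
    profile = user_profile(vectors, liked_items)
    unseen = [i for i in vectors if i not in liked_items]
    return sorted(unseen, key=lambda i: dot(profile, vectors[i]), reverse=True)[:k]

docs = {
    "arrival": "sci-fi alien linguistics drama",
    "dune": "sci-fi desert epic drama",
    "blade_runner_2049": "sci-fi noir android detective",
    "bake_off": "cooking competition baking show",
}
vectors = tfidf_vectors(docs)
recs = recommend(vectors, liked_items=["arrival", "dune"], k=2)
```

Because the item vectors are unit length, ranking by dot product with the profile gives the same order as ranking by cosine similarity. The cooking show scores exactly zero, since it shares no terms with anything the user liked.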
The advantage is clear: content-based filtering works for new items immediately. A newly added product with a description and category can be recommended on day one, before anyone has interacted with it. There is no item cold-start problem. The system also explains its recommendations naturally: "Recommended because you liked similar sci-fi movies."
Modern content-based systems use pre-trained language models (BERT, sentence-transformers) to encode item text descriptions into dense embeddings that capture semantic meaning, not just keyword overlap. Two items with completely different wording but similar meaning end up with similar embeddings. Image embeddings from pre-trained vision models (ResNet, CLIP) add visual similarity signals for products, videos, and other visual content. These embedding-based representations are far more powerful than traditional TF-IDF or one-hot approaches because they capture nuanced similarity in a compact vector space.
The disadvantage is equally clear: the system can only recommend more of the same. It has no mechanism for serendipity — discovering that a user who loves sci-fi would also love a particular cooking show. It stays within the feature space of what the user already consumed.
Hybrid Approaches
Production recommendation systems are almost always hybrids that combine collaborative and content-based signals. The combination can take several forms:
Weighted hybrid — Run collaborative filtering and content-based filtering independently, then combine their scores with learned weights. Simple to implement and deploy. Netflix's early system used this approach. The weights can be global (same for all users) or personalized (users with more history get higher collaborative weight, users with less history get higher content-based weight).
Switching hybrid — Use content-based filtering when collaborative data is sparse (new users, new items) and switch to collaborative filtering once enough interaction data accumulates. This directly addresses cold start by routing around it. The switch can be gradual (blending scores based on data availability) or binary (pure content-based below a threshold, pure collaborative above it).
Feature augmentation — Use the output of one model as input to another. Collaborative filtering produces user and item embeddings, which become features in a content-based ranking model alongside item metadata. This lets the ranking model benefit from both behavioral patterns and item properties. This is the most common production approach because it naturally handles the transition from cold start (content features dominate when collaborative signals are sparse) to mature user states (collaborative embeddings dominate once interaction data is rich).
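The gradual form of the switching hybrid reduces to a blend whose weight depends on how much history the user has. The linear ramp and the `full_trust_at=50` cutoff below are hypothetical choices for illustration; a real system would tune or learn the schedule.

```python
def collaborative_weight(num_interactions, full_trust_at=50):
    """Linear ramp from 0 (no history) to 1 (enough history to rely on collaborative scores)."""
    return min(1.0, num_interactions / full_trust_at)

def hybrid_score(cf_score, content_score, num_interactions, full_trust_at=50):
    """Blend collaborative and content-based scores by data availability."""
    w = collaborative_weight(num_interactions, full_trust_at)
    return w * cf_score + (1.0 - w) * content_score

new_user = hybrid_score(cf_score=0.9, content_score=0.2, num_interactions=0)
midway_user = hybrid_score(cf_score=0.9, content_score=0.2, num_interactions=25)
mature_user = hybrid_score(cf_score=0.9, content_score=0.2, num_interactions=50)
```

Replacing the ramp with a step function (0 below the threshold, 1 at or above it) gives the binary variant of the switch.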
Deep Learning Hybrids
Modern systems use neural architectures that inherently combine both signals. Neural collaborative filtering replaces the dot product in matrix factorization with a neural network that learns a nonlinear interaction function between user and item embeddings. The network can capture complex conditional preferences that a linear dot product cannot represent — for instance, a user who likes action movies by established directors but not by newcomers. Two-tower models process user features through one neural network tower and item features through another, producing embeddings in a shared space. The user tower ingests demographics, browsing history, and collaborative signals. The item tower ingests metadata, text embeddings, and popularity features. Recommendation is an approximate nearest neighbor search in the shared embedding space.
Large platforms such as YouTube, Spotify, and Netflix are generally understood to use variants of two-tower models in their candidate generation stage. The architecture scales because item embeddings can be precomputed and indexed, and only the user tower runs at serving time to produce the query embedding.
Embedding-Based Retrieval
A key enabler for hybrid systems at scale is approximate nearest neighbor (ANN) search. Once you have user and item embeddings in a shared vector space, whether from matrix factorization, two-tower models, or content encoders, retrieval reduces to finding the nearest item vectors to the user query vector. Libraries like FAISS, ScaNN, and Annoy build index structures (for example IVF partitions, HNSW graphs, or random-projection trees) that return approximate nearest neighbors in milliseconds even for indexes containing hundreds of millions of vectors. The trade-off is recall: ANN search might miss some true nearest neighbors in exchange for speed, but in practice 95%+ recall is achievable with proper index tuning. This is acceptable because candidate generation does not need perfect recall. The ranking stage cannot recover an item that was never retrieved, but it only needs a candidate set that contains enough good items, and running several candidate generators in parallel covers for the misses of any single one.
Candidate generation narrows millions of items to hundreds. Ranking is where the real decision happens — scoring those hundreds and selecting the final items the user sees. The ranking model has a luxury that candidate generation does not: it can afford to be complex because it processes a small set.
Feature Categories
Ranking models combine four types of features:
User features capture who the user is: demographics (age, location, language), account-level statistics (days since signup, total purchases, average session length), and behavioral summaries (most-clicked categories, time-of-day usage patterns).
Item features describe the candidate: popularity metrics (total views, trending score), recency (publication date, last update), category and tags, price, seller rating, and content quality signals (completion rate, average rating).
Context features capture the moment: current time, day of week, device type, location, whether the user just searched for something specific, and position in the session (first visit vs returning within an hour).
Interaction features bridge user and item: has the user seen this item before, how many times, did they click similar items recently, what is the user's purchase rate in this category. These cross features are often the most predictive signals in the model.
Model Architectures
Gradient-boosted trees (XGBoost, LightGBM) dominate when features are mostly tabular. They handle mixed feature types (continuous, categorical, sparse), require minimal preprocessing, and train quickly. Most production ranking systems at mid-scale companies use gradient-boosted trees because they offer strong performance with manageable complexity. They are also interpretable — feature importance scores tell you which signals drive rankings, making it easier to debug unexpected results and explain decisions to stakeholders.
Deep ranking models shine when you need to combine sparse categorical features (user ID, item ID) with dense features (embeddings, numerical features). Wide and Deep (Google, 2016) uses a wide linear component for memorization of specific feature interactions and a deep neural network for generalization. DeepFM replaces the wide component with a factorization machine that automatically captures pairwise feature interactions without manual feature engineering.
In practice, feature engineering matters more than model architecture. A gradient-boosted tree with 200 well-crafted features often outperforms a deep model with 50 features. The best teams invest heavily in feature pipelines — real-time feature stores, feature freshness, and feature coverage monitoring.
Pointwise, Pairwise, and Listwise Ranking
Ranking models can be trained with different loss functions that reflect different views of the problem. Pointwise models predict the relevance of each item independently, as a classification or regression problem. This is the simplest approach but ignores the relative ordering of items. Pairwise models learn to order pairs of items correctly: given items A and B, the model learns that A should rank above B when A is more relevant. RankNet is the classic pairwise method. LambdaMART (used by Microsoft Bing) extends the pairwise idea by weighting each pair's gradient by how much swapping the two items would change NDCG, so it is often described as sitting between pairwise and listwise. Listwise models optimize the entire ranked list directly, targeting metrics like NDCG over the full result set. Listwise approaches match the evaluation metric most closely but are harder to train because the metric depends on sorting, which is not differentiable. In practice, pairwise approaches often offer a good trade-off between ranking quality and training stability.
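The difference between the pointwise and pairwise views shows up directly in the loss functions. The sketch below uses binary cross-entropy for the pointwise case and the RankNet-style logistic loss for the pairwise case; both take raw model scores (logits) as input. A listwise loss is omitted because it needs a full list and a smooth surrogate for sorting.

```python
import math

def pointwise_log_loss(score, label):
    """Binary cross-entropy on one item: score is a logit, label is 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def pairwise_logistic_loss(score_preferred, score_other):
    """RankNet-style loss: small when the preferred item is scored higher."""
    return math.log(1.0 + math.exp(-(score_preferred - score_other)))

correct_order = pairwise_logistic_loss(2.0, 0.0)    # preferred item scored higher
wrong_order = pairwise_logistic_loss(0.0, 2.0)      # preferred item scored lower
shifted = pairwise_logistic_loss(12.0, 10.0)        # same gap, different scale
```

The pairwise loss depends only on the gap between the two scores, so `correct_order` and `shifted` are equal. The pointwise loss, in contrast, cares about each score's absolute value even when the ordering is already right.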
Reranking for Business Rules
The ranking model produces a relevance-ordered list, but relevance alone is not enough. Reranking applies business constraints on top:
Diversity — If the top 10 items are all from the same category, the user sees a monotonous feed. Reranking enforces category diversity by swapping in items from underrepresented categories, even if their relevance score is slightly lower.
Freshness — Boost recently published items to ensure new content gets exposure. Without this, popular older items dominate forever because they have accumulated more engagement data.
Promoted content — Sponsored items or editorially curated content must appear at specific positions regardless of model score. The reranker inserts these at predetermined slots.
Deduplication — Remove near-duplicate items (same product from different sellers, same news story from different sources) that would waste screen real estate.
Position bias correction — Users are more likely to click items at the top of the list regardless of relevance. Without correction, a model trained on those clicks mistakes position for relevance, reinforcing a feedback loop where already-top-ranked items stay on top. The standard correction is inverse propensity weighting, applied when logged clicks are turned into training signal: a click on an item shown in a less prominent position is divided by the (lower) probability that the user examined that position, so it counts for more than a click at the top.
Reranking is typically rule-based rather than model-based. The rules are simple, interpretable, and easy to adjust without retraining a model. However, some companies are moving toward learned reranking models that directly optimize list-level objectives (maximize session engagement rather than per-item click probability), treating the final slate of items as a single unit rather than a collection of independent recommendations.
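As an example of how simple these rules can be, the diversity constraint can be implemented as a single greedy pass with a per-category cap. The function name, the cap of two items per category, and the back-fill behavior are illustrative assumptions rather than a standard algorithm.

```python
def rerank_with_category_cap(ranked_items, category_of, k=5, max_per_category=2):
    """Greedy pass over a relevance-ordered list, skipping items whose category is full."""
    selected, skipped, per_category = [], [], {}
    for item in ranked_items:
        category = category_of[item]
        if per_category.get(category, 0) < max_per_category:
            selected.append(item)
            per_category[category] = per_category.get(category, 0) + 1
        else:
            skipped.append(item)
        if len(selected) == k:
            return selected
    # Not enough diverse items: back-fill with the best skipped ones.
    return selected + skipped[: k - len(selected)]

# Relevance-ordered output of a hypothetical ranking model.
ranked = ["s1", "s2", "s3", "s4", "c1", "d1"]
category_of = {
    "s1": "sci-fi", "s2": "sci-fi", "s3": "sci-fi", "s4": "sci-fi",
    "c1": "comedy", "d1": "drama",
}

slate = rerank_with_category_cap(ranked, category_of, k=4, max_per_category=2)
```

The third and fourth sci-fi items are passed over in favor of lower-scored items from other categories, which is the trade of a little relevance for variety that the diversity rule describes.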
The Feature Store
Ranking models depend on features computed at different timescales. User demographics change rarely. Item popularity updates daily. Session context changes every minute. A feature store serves as the central system that computes, stores, and serves features to the ranking model with appropriate freshness guarantees for each feature type.
Batch features (user lifetime statistics, item aggregate metrics) are precomputed in offline pipelines and stored in a key-value store for fast lookup. Near-real-time features (items viewed in the last hour, trending scores) are computed by streaming pipelines that process event logs with seconds of latency. Real-time features (items clicked in this session, time since last interaction) are computed at serving time from the request context.
The challenge is ensuring consistency between training and serving. If the model trains on features computed from a data warehouse but serves with features from a different pipeline, subtle differences in computation logic produce training-serving skew that silently degrades ranking quality. The best teams use a unified feature definition that generates both the training data extraction and the serving computation from the same specification.
Feature coverage is another operational concern. If 5% of users are missing a critical feature (say, their most-clicked category) because the feature pipeline failed for them, the model receives a default value that may not represent their actual preference. Monitoring feature coverage — the fraction of requests where each feature has a valid, non-default value — catches these silent failures before they impact ranking quality. Production systems set alerts when coverage drops below a threshold for any feature in the top 50 by importance.
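Coverage monitoring is mostly counting. The sketch below assumes each logged request is a dictionary of feature values and that each feature has a known default that signals a missing value; the feature names, defaults, and the 0.95 threshold are hypothetical.

```python
def feature_coverage(requests, defaults):
    """Fraction of requests where each feature is present and differs from its default."""
    coverage = {}
    for name, default in defaults.items():
        valid = 0
        for features in requests:
            value = features.get(name)
            if value is not None and value != default:
                valid += 1
        coverage[name] = valid / len(requests) if requests else 0.0
    return coverage

def coverage_alerts(coverage, threshold=0.95):
    """Names of features whose coverage fell below the alert threshold."""
    return sorted(name for name, value in coverage.items() if value < threshold)

defaults = {"top_category": "unknown", "sessions_7d": -1}
requests = [
    {"top_category": "shoes", "sessions_7d": 3},
    {"top_category": "unknown", "sessions_7d": 1},  # pipeline returned the default
    {"sessions_7d": 5},                              # feature missing entirely
    {"top_category": "books", "sessions_7d": 2},
]

coverage = feature_coverage(requests, defaults)
alerts = coverage_alerts(coverage)
```

One caveat with this approach: a default value can also be a legitimate value (a user who truly has zero sessions), so defaults should be chosen as sentinels that real data cannot produce.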
Every recommendation system faces two states where its core models fail: new users who have no history, and new items that have no interactions. These are cold-start problems, and solving them determines whether users stay through their first session and whether new items ever gain traction.
Cold start is not just a technical inconvenience: it directly impacts business metrics. A new user who sees irrelevant recommendations in their first session is far more likely to churn than one who sees even moderately personalized content. First-session recommendation quality is widely treated as a strong predictor of longer-term retention. Similarly, new items that never get surfaced by the recommendation system represent wasted inventory that creators or sellers invested effort to produce, damaging marketplace health.
New User Cold Start
A user signs up and opens the app for the first time. The collaborative filtering model has no user vector. The ranking model's interaction features are all zeros. What do you show them?
Popularity fallback — Show the most popular items globally or by region. This is the safest default because popular items have the highest base rate of appeal. It is generic but not terrible.
Onboarding preferences — Ask the user to select genres, topics, or sample items they like during signup. This immediately builds a content-based profile that can drive first-session recommendations. Spotify does this with artist selection, Netflix with movie selection.
Demographic priors — Use registration data (age, location, language) to match the new user to existing user segments. A 25-year-old in Tokyo sees different defaults than a 55-year-old in Berlin, based on aggregate preferences of similar demographic groups.
Content-based bootstrap — If the user arrives via a specific link (a shared playlist, a product page, a search result), use that entry context as a seed for content-based recommendations in the same session.
Social graph bootstrap — If the platform has social features, use the new user's friends' preferences as a warm-start signal. A new user on a music platform who connects their social account can immediately receive recommendations based on what their friends listen to, providing a personalization signal without requiring any direct interaction.

The transition matters as much as the initial strategy. A good system blends cold-start signals with learned preferences as data accumulates. After 5 interactions, the user has a weak collaborative signal. After 50, the cold-start signals can be phased out. Systems that handle this transition poorly leave users stuck with generic recommendations for too long or switch too abruptly to model-driven results.
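One way to make the transition gradual is to weight the cold-start score by how little history the user has. The sketch below assumes a simple linear ramp that reaches zero at 50 interactions, matching the figure in the text; the function names and the linear shape are illustrative assumptions, and a real system might learn this weight instead.

```python
def cold_start_weight(n_interactions, full_trust_at=50):
    """Weight on the cold-start score: 1.0 with no history, 0.0 at full_trust_at interactions."""
    if full_trust_at <= 0:
        raise ValueError("full_trust_at must be positive")
    return max(0.0, 1.0 - n_interactions / full_trust_at)

def blended_score(cold_score, model_score, n_interactions, full_trust_at=50):
    """Linear blend of a cold-start score and a learned-model score."""
    w = cold_start_weight(n_interactions, full_trust_at)
    return w * cold_score + (1.0 - w) * model_score

# After 5 interactions the cold-start signal still dominates;
# after 50 the learned model is used on its own.
for n in (0, 5, 25, 50):
    print(n, cold_start_weight(n))
```

A smooth ramp like this avoids both failure modes named above: the user is never stuck on generic results past the cutoff, and the switch to model-driven results is not abrupt.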
New Item Cold Start
New items face the inverse problem — the system has no interaction data to judge their quality or relevance. Without intervention, new items never get shown, never accumulate interactions, and remain invisible forever.
Solutions include injecting new items into random user feeds with a freshness boost (guaranteed exposure), using content-based features to match new items with users whose profiles align with the item's metadata, and leveraging creator or brand reputation as a prior (a new video from a popular creator gets a higher starting score).
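These signals can be combined into a starting score for an item that has no interaction history. The sketch below is one possible combination under stated assumptions: the equal weighting of content match and creator prior, the boost size, the 24-hour half-life, and the 100-impression cutoff are all hypothetical values chosen for illustration.

```python
def new_item_score(content_match, creator_prior, impressions, age_hours,
                   boost=0.3, half_life_hours=24.0, min_impressions=100):
    """Starting score for an item with little or no interaction data.

    content_match: similarity between item metadata and the user profile, in [0, 1].
    creator_prior: reputation-based prior for the creator or brand, in [0, 1].
    """
    base = 0.5 * content_match + 0.5 * creator_prior
    if impressions >= min_impressions:
        # Enough exposure: stop boosting and let learned signals take over.
        return base
    # Freshness boost that halves every half_life_hours.
    freshness = boost * 0.5 ** (age_hours / half_life_hours)
    return base + freshness

print(new_item_score(0.6, 0.4, impressions=0, age_hours=0))
print(new_item_score(0.6, 0.4, impressions=0, age_hours=24))
print(new_item_score(0.6, 0.4, impressions=100, age_hours=1))
```

Note that a score boost only raises the chance of exposure. Guaranteed exposure, as described above, needs an explicit injection step that reserves feed slots for new items.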
The new-item cold-start problem is especially acute for marketplaces and content platforms where inventory turns over quickly. A news aggregator adds thousands of articles per hour. An e-commerce marketplace onboards hundreds of new sellers daily. If the recommendation system cannot surface new items quickly, sellers and creators leave the platform because their content never gets discovered. This makes new-item cold start a business-critical problem, not just a modeling inconvenience.
The interaction between cold start and exploration is direct: exploration is the mechanism that resolves cold start. A new item only escapes cold start when the system shows it to enough users to accumulate interaction data. Without exploration, new items depend entirely on content-based matching, which may place them in front of too few users too slowly. Allocating a portion of the exploration budget specifically to new items ensures they reach critical mass — the point where enough interactions exist for collaborative filtering to take over and provide accurate relevance predictions.
Exploration vs Exploitation
Exploitation means recommending items that the model predicts the user will like most — the safe, high-confidence choices. Exploration means deliberately showing items with uncertain predictions to gather new information. Pure exploitation creates filter bubbles: the model only reinforces what it already knows about the user, never discovering latent interests.
Epsilon-greedy — With probability 0.95, show the model's top picks (exploit). With probability 0.05, show random items (explore). Simple but wasteful — random exploration often shows completely irrelevant items.
UCB (Upper Confidence Bound) — Score each item by its predicted relevance plus an uncertainty bonus. Items with few observations have high uncertainty, so they get boosted. As they accumulate data, the bonus shrinks and the score converges to the item's estimated relevance. This balances exploration and exploitation automatically.
Thompson sampling — Maintain a probability distribution over each item's expected reward. Sample from the distribution to decide what to show. Items with high uncertainty have wide distributions, so they occasionally sample high values and get shown. This is more statistically efficient than epsilon-greedy and naturally adapts exploration intensity to the uncertainty level.
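The three strategies above can be compared side by side in a short sketch. The assumptions here are illustrative: rewards are binary clicks, the UCB bonus uses the classic UCB1 form, and Thompson sampling uses a Beta(1, 1) prior per item. The function names and the item data are hypothetical.

```python
import math
import random

def epsilon_greedy(predicted, epsilon=0.05, rng=random):
    """Pick the top-scoring item with probability 1 - epsilon, else a uniformly random item."""
    if rng.random() < epsilon:
        return rng.choice(list(predicted))
    return max(predicted, key=predicted.get)

def ucb_scores(mean_reward, impressions, total_impressions, c=1.0):
    """UCB1-style score: observed mean plus a bonus that shrinks as impressions grow."""
    scores = {}
    for item, mean in mean_reward.items():
        n = impressions[item]
        if n == 0:
            scores[item] = float("inf")  # unseen items are tried first
        else:
            scores[item] = mean + c * math.sqrt(2 * math.log(total_impressions) / n)
    return scores

def thompson_pick(clicks, impressions, rng=random):
    """Sample a click rate per item from its Beta posterior and pick the highest sample."""
    samples = {
        item: rng.betavariate(1 + clicks[item], 1 + impressions[item] - clicks[item])
        for item in clicks
    }
    return max(samples, key=samples.get)

mean_reward = {"a": 0.10, "b": 0.08, "c": 0.05}
impressions = {"a": 10000, "b": 50, "c": 0}
scores = ucb_scores(mean_reward, impressions, total_impressions=10050)
print(scores)  # "b" outranks "a" despite a lower mean, because it has far fewer impressions
```

In this example the never-shown item "c" gets an infinite score and is shown first, which is the UCB1 convention for untried arms. A production system would usually cap that bonus so untested items do not flood the feed.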

Contextual Bandits
Standard bandits treat every user the same — the same exploration rate, the same uncertainty estimates. Contextual bandits condition the arm selection on user context: demographics, recent behavior, time of day, device. This means the system explores differently for different users — a power user who has rated 1,000 items gets less exploration (the model is confident about their preferences) than a new user who has rated 10 items (the model is uncertain and needs more data).
In practice, contextual bandits are often used for specific decisions within the recommendation pipeline rather than replacing the entire system. Common applications include: choosing which candidate generation source to weight most heavily for a given user, selecting the exploration rate for a user segment, deciding how much to boost fresh content for a specific context, and personalizing the diversity-relevance trade-off in reranking. Each of these decisions is framed as an arm selection problem where the context is the user state and the reward is a downstream engagement metric.
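As a concrete illustration of the first application, the sketch below chooses a candidate generation source per user segment. It is a deliberately simple tabular version that keeps separate UCB statistics for each discrete context; systems with rich context features typically use a learned model such as LinUCB instead. The class name, the segment labels, and the simulated rewards are hypothetical.

```python
import math
from collections import defaultdict

class TabularContextualUCB:
    """UCB bandit that keeps separate statistics for each discrete context."""

    def __init__(self, arms, c=1.0):
        self.arms = list(arms)
        self.c = c
        self.pulls = defaultdict(int)      # (context, arm) -> times chosen
        self.reward = defaultdict(float)   # (context, arm) -> summed reward

    def select(self, context):
        total = sum(self.pulls[(context, a)] for a in self.arms)
        best_arm, best_score = None, -math.inf
        for arm in self.arms:
            n = self.pulls[(context, arm)]
            if n == 0:
                return arm  # try every arm once in each context
            mean = self.reward[(context, arm)] / n
            score = mean + self.c * math.sqrt(2 * math.log(total) / n)
            if score > best_score:
                best_arm, best_score = arm, score
        return best_arm

    def update(self, context, arm, reward):
        self.pulls[(context, arm)] += 1
        self.reward[(context, arm)] += reward

# Simulated environment: each segment responds to a different source.
bandit = TabularContextualUCB(["collaborative", "content_based", "popular"])
best = {"new": "popular", "power": "collaborative"}
for _ in range(200):
    for ctx in best:
        arm = bandit.select(ctx)
        bandit.update(ctx, arm, 1.0 if arm == best[ctx] else 0.0)

print({key: n for key, n in bandit.pulls.items() if n > 0})
```

After the simulation, each segment's pulls concentrate on its own best source, which is the behavior the text describes: the same arms, but a different policy per context.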
Pure exploitation creates echo chambers where the model only shows what users already like, reinforcing existing preferences and suppressing discovery. Over time, this reduces long-term engagement because users feel the experience is stale and predictable. Most production systems allocate 5-10% of impressions to exploration — a small cost that prevents the feedback loop from collapsing into a filter bubble.
Exploration in Practice
Most production systems use 90-95% exploitation with 5-10% exploration. The exploration budget is not uniform — it concentrates on user segments and item categories where the model is least certain. New users get more exploration because every interaction is high-value learning. New items get boosted exploration to ensure they accumulate enough data for the model to learn their quality. Mature user-item pairs where the model is confident get almost no exploration, preserving the quality of the experience.
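A non-uniform budget can be expressed as a proportional split. The sketch below assumes the system already has a relative uncertainty weight per segment; how those weights are estimated is out of scope here, and the segment names and numbers are hypothetical.

```python
def allocate_exploration(total_impressions, uncertainty, budget=0.05):
    """Split the exploration budget across segments in proportion to model uncertainty.

    uncertainty maps segment name -> relative weight (any non-negative scale).
    Returns the number of exploration impressions per segment, rounded down.
    """
    explore_total = round(total_impressions * budget)
    weight_sum = sum(uncertainty.values())
    if weight_sum == 0:
        return {segment: 0 for segment in uncertainty}
    return {
        segment: int(explore_total * weight / weight_sum)
        for segment, weight in uncertainty.items()
    }

allocation = allocate_exploration(
    total_impressions=1_000_000,
    uncertainty={"new_users": 6, "new_items": 3, "mature_pairs": 1},
    budget=0.05,
)
print(allocation)
```

With a 5% budget on one million impressions, 50,000 impressions go to exploration, and the mature segment receives only a tenth of them.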
Measuring Exploration Effectiveness
Exploration investments are hard to measure because the payoff is delayed. Showing an uncertain item today might not produce a click, but the data gathered might improve recommendations for the next month. Teams track exploration metrics over longer time windows: user retention at 30 days, diversity of items consumed per user, discovery rate (fraction of recommendations that introduce the user to a new category), and model improvement rate (how quickly prediction accuracy improves for explored items). A/B tests comparing exploration strategies must run for weeks, not days, to capture these long-horizon effects.
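Two of these metrics are simple to compute from logs. The sketch below assumes each recommendation and each consumed item carries a category label or item id; the function names and sample data are illustrative.

```python
def discovery_rate(recommended_categories, seen_categories):
    """Fraction of recommendations whose category the user had not consumed before."""
    if not recommended_categories:
        return 0.0
    new = sum(1 for category in recommended_categories if category not in seen_categories)
    return new / len(recommended_categories)

def mean_distinct_items(consumed_by_user):
    """Average number of distinct items consumed per user (a simple diversity measure)."""
    if not consumed_by_user:
        return 0.0
    return sum(len(set(items)) for items in consumed_by_user.values()) / len(consumed_by_user)

recs = ["jazz", "rock", "podcast", "jazz"]
seen = {"jazz", "rock"}
print(discovery_rate(recs, seen))

logs = {"u1": ["a", "b", "a"], "u2": ["c"]}
print(mean_distinct_items(logs))
```

Because the payoff is delayed, these values are most useful when compared between experiment arms over the same multi-week window rather than read as daily snapshots.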
The ultimate test is counterfactual: would the model have learned about this user's hidden preferences without exploration? Systems that never explore eventually converge to a narrow subset of the catalog, serving the same popular items to most users. Systems with well-calibrated exploration maintain broader coverage, surface long-tail items, and adapt faster when user preferences shift — all of which compound into better retention and lifetime engagement.