Offline Evaluation Metrics

Topics Covered

Classification Metrics in Depth

Precision and Recall

F1 Score

PR-AUC and ROC-AUC

Why ROC-AUC Misleads on Imbalanced Data

Threshold Selection

Calibration

Ranking and Retrieval Metrics

NDCG (Normalized Discounted Cumulative Gain)

MRR (Mean Reciprocal Rank)

MAP (Mean Average Precision)

Precision@K

Choosing the Right Ranking Metric

Regression and Forecasting Metrics

MSE (Mean Squared Error)

MAE (Mean Absolute Error)

Choosing Between MSE and MAE

MAPE (Mean Absolute Percentage Error)

R-squared

Pitfalls of Offline Evaluation

Data Leakage

Class Imbalance and Stratified Evaluation

Small Test Sets and Statistical Significance

The Offline-Online Gap

Classification Metrics in Depth

Accuracy is the first metric everyone learns and the first one that breaks. Consider a spam filter evaluated on a dataset where only 2% of emails are spam. A model that labels every single email "not spam" achieves 98% accuracy — while catching zero spam. The metric looks excellent. The model is useless. This failure is not a corner case. Class imbalance is the default in production ML: fraud detection (0.1% fraud), medical diagnosis (1% positive), content moderation (3% violations). Accuracy hides catastrophic failures on the class you actually care about.

To understand why accuracy fails and what to use instead, start with the confusion matrix — the foundation of all classification metrics. Every prediction falls into one of four categories:

True Positive (TP): Model predicts positive, and the example is actually positive. The spam filter correctly flags a spam email.
False Positive (FP): Model predicts positive, but the example is actually negative. The spam filter flags a legitimate email as spam. Users call this a false alarm.
True Negative (TN): Model predicts negative, and the example is actually negative. The spam filter correctly lets a legitimate email through.
False Negative (FN): Model predicts negative, but the example is actually positive. The spam filter lets a spam email through to the inbox.

From these four numbers, every classification metric is derived.
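As a minimal sketch of the four categories, the snippet below counts them for a hypothetical 1,000-email dataset with 2% spam and a model that never flags anything. The helper name `confusion_counts` and the data are illustrative assumptions, not part of any particular library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = positive (spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical data: 20 spam emails, 980 legitimate, model predicts "not spam" for all.
y_true = [1] * 20 + [0] * 980
y_pred = [0] * 1000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, fp, tn, fn)  # 0 0 980 20
print(accuracy)        # 0.98
```

The 98% accuracy comes entirely from the true negatives; the 20 false negatives are every spam email in the dataset.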

Precision and Recall

Precision answers: "Of the emails I flagged as spam, what fraction actually were spam?"

Precision = TP / (TP + FP)

High precision means few false alarms. When precision is 0.95, 95% of the emails the model flags are genuinely spam. The remaining 5% are legitimate emails that got wrongly flagged. Precision matters most when false positives are expensive — flagging a legitimate bank transaction as fraud blocks the customer and generates support calls.

Recall answers: "Of all actual spam emails, what fraction did I catch?"

Recall = TP / (TP + FN)

High recall means few missed positives. When recall is 0.90, the model catches 90% of spam. The remaining 10% slips through to the inbox. Recall matters most when false negatives are dangerous — missing a malicious email that contains a phishing link.

Precision and recall are in tension. You can achieve perfect recall by labeling everything as spam (you catch all spam but flag every legitimate email as well). You can achieve near-perfect precision by only flagging the most obvious spam (you almost never raise a false alarm but miss most spam). Unless the classes are perfectly separable, which is rare on real data, no threshold gives perfect precision and perfect recall at the same time, so there is always a trade-off.
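The two extremes of this trade-off can be sketched with a few lines of arithmetic. The counts below are hypothetical (100 spam emails among 1,000), and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision(tp, fp):
    # Undefined when nothing is flagged; 0.0 is a convention used in this sketch.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Undefined when there are no actual positives; 0.0 is the same convention.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Flag everything: all 100 spam caught, but 900 legitimate emails flagged too.
flag_all = (precision(tp=100, fp=900), recall(tp=100, fn=0))
# Flag only the 10 most obvious spam emails: no false alarms, 90 spam missed.
flag_few = (precision(tp=10, fp=0), recall(tp=10, fn=90))

print(flag_all)  # (0.1, 1.0)
print(flag_few)  # (1.0, 0.1)
```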

F1 Score

The F1 score is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Why the harmonic mean and not the arithmetic mean? Because the harmonic mean penalizes imbalance. If precision is 0.99 and recall is 0.01, the arithmetic mean is 0.50, which looks decent. The harmonic mean is 0.02, which correctly reflects that the model is terrible — it only catches 1% of positives. F1 gives you a single number that is only high when both precision and recall are high.
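The comparison between the two means is easy to check directly. This is a small sketch; `f1_score` here is a local helper, not a library call.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

p, r = 0.99, 0.01
arithmetic = (p + r) / 2
harmonic = f1_score(p, r)
print(round(arithmetic, 2), round(harmonic, 4))  # 0.5 0.0198
```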

PR-AUC and ROC-AUC

Most classification models output a score, often interpreted as a probability. To make a binary decision, you set a threshold: a score above 0.5 means spam, below means not spam. But 0.5 is an arbitrary default. Different thresholds produce different precision-recall trade-offs.

PR-AUC (Precision-Recall Area Under Curve) plots precision against recall at every possible threshold and computes the area under this curve. A perfect model has PR-AUC of 1.0. A random model on a dataset with 2% positives has PR-AUC of approximately 0.02. PR-AUC is the right summary metric for imbalanced datasets because it focuses entirely on the positive class.

ROC-AUC (Receiver Operating Characteristic Area Under Curve) plots the true positive rate (recall) against the false positive rate at every threshold. A perfect model has ROC-AUC of 1.0. A random model has ROC-AUC of 0.5. ROC-AUC is useful when classes are balanced, but it can be misleading for imbalanced datasets.

Key Insight

PR-AUC is more informative than ROC-AUC for imbalanced datasets. ROC-AUC can look deceptively high because it includes true negatives in the false positive rate denominator. When 99% of examples are negative, even a poor model achieves a low false positive rate. PR-AUC ignores true negatives entirely and focuses on how well the model finds the rare positive class.
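To make the contrast concrete, here is a dependency-free sketch of both summaries on a hypothetical toy dataset with 2 positives among 20 examples. ROC-AUC is computed through its rank interpretation (the probability that a random positive outscores a random negative), and PR-AUC is approximated by average precision, which is one common step-wise estimate of the area under the PR curve. The function names and data are assumptions made for illustration.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Two positives (scores 0.8 and 0.6), two high-scoring negatives, 16 easy negatives.
y_true = [0, 1, 0, 1] + [0] * 16
scores = [0.9, 0.8, 0.7, 0.6] + [0.1] * 16

print(round(roc_auc(y_true, scores), 3))   # 0.917
print(average_precision(y_true, scores))   # 0.5
```

The 16 easy negatives lift ROC-AUC above 0.9, while average precision stays at 0.5 because half of the top four results are false positives.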

Why ROC-AUC Misleads on Imbalanced Data

To understand the difference concretely, consider a disease screening dataset with 10,000 patients, 100 of whom have the disease (1% positive rate). Suppose a model flags 500 patients as positive and catches 50 of the 100 true positives. (Flagging 500 patients purely at random would catch only about 5.) Its false positive rate is 450/9,900 ≈ 4.5%, which looks low, but that is because the denominator (9,900 negatives) is so large. The true positive rate is 50/100 = 50%. On the ROC curve, this point (4.5% FPR, 50% TPR) looks reasonable and contributes to a decent ROC-AUC.

But look at precision: 50/500 = 10%. Only 1 in 10 flagged patients actually has the disease. The PR curve exposes this immediately. ROC-AUC hides the problem by diluting false positives across the massive negative population. This is why PR-AUC is the standard for imbalanced problems — it tells you how well the model performs on the class you care about without the denominator inflation that flatters ROC-AUC.
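The arithmetic for this screening example can be reproduced in a few lines, using only the numbers given above.

```python
total, positives = 10_000, 100
flagged, caught = 500, 50

negatives = total - positives        # 9,900 healthy patients
false_positives = flagged - caught   # 450 healthy patients flagged

fpr = false_positives / negatives    # x-axis of the ROC curve
tpr = caught / positives             # y-axis of the ROC curve (recall)
prec = caught / flagged              # y-axis of the PR curve

print(f"FPR={fpr:.3f} TPR={tpr:.2f} precision={prec:.2f}")
# FPR=0.045 TPR=0.50 precision=0.10
```

The same 450 false positives look small when divided by 9,900 negatives and large when divided by 500 flagged patients.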

Threshold Selection

The threshold is not a model parameter — it is a business decision. A medical screening test should use a low threshold (high recall, catch every possible case even at the cost of false alarms). A content moderation system for auto-deleting posts should use a high threshold (high precision, only delete when very confident). The optimal threshold depends on the relative cost of false positives versus false negatives in your specific application.
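One way to turn this business decision into code is to assign a cost to each error type and pick the threshold that minimizes total cost on a validation set. The sketch below assumes hypothetical cost ratios, scores, and candidate thresholds; `pick_threshold` is an illustrative helper.

```python
def pick_threshold(y_true, scores, cost_fp, cost_fn, candidates):
    """Return the candidate threshold with the lowest total misclassification cost."""
    def cost(threshold):
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
        return cost_fp * fp + cost_fn * fn
    return min(candidates, key=cost)

# Hypothetical validation labels and model scores.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]
grid = [0.3, 0.5, 0.65]

# Screening: a missed case is assumed to cost 20x a false alarm.
screening = pick_threshold(y_true, scores, cost_fp=1, cost_fn=20, candidates=grid)
# Auto-delete: a wrongly deleted post is assumed to cost 20x a missed violation.
auto_delete = pick_threshold(y_true, scores, cost_fp=20, cost_fn=1, candidates=grid)

print(screening, auto_delete)  # 0.3 0.65
```

The same model and the same scores yield different operating points once the cost ratio changes.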

Calibration

A model outputs a probability of 0.80 for a given input. Does this mean the event truly occurs 80% of the time? Calibration measures whether predicted probabilities match observed frequencies. A well-calibrated model saying 80% means that across all predictions where it outputs 0.80, the positive outcome occurs in roughly 80% of cases.

Calibration matters when probabilities drive downstream decisions. An insurance pricing model that is overconfident will underprice risky policies. A medical diagnostic model that is underconfident will send too many patients for unnecessary follow-up tests. Many models (especially neural networks) are poorly calibrated out of the box and require post-hoc calibration techniques like Platt scaling or isotonic regression.

To check calibration, use a reliability diagram: bin predictions by predicted probability (0.0-0.1, 0.1-0.2, ..., 0.9-1.0) and plot the actual positive rate in each bin. A perfectly calibrated model produces a diagonal line. Points above the diagonal mean the model is underconfident (predicts 0.3 but the true rate is 0.5). Points below mean overconfident (predicts 0.8 but the true rate is 0.6).
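The binning step behind a reliability diagram can be sketched without any plotting library. The function below returns one row per non-empty bin; the data are hypothetical and chosen so that one bin is underconfident and one is overconfident.

```python
def reliability_bins(y_true, probs, n_bins=10):
    """Return (mean predicted probability, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, label))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(label for _, label in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows

# Hypothetical predictions: ten at 0.35 (5 positive), ten at 0.85 (6 positive).
probs = [0.35] * 10 + [0.85] * 10
y_true = [1] * 5 + [0] * 5 + [1] * 6 + [0] * 4

for mean_p, rate, n in reliability_bins(y_true, probs):
    label = "underconfident" if rate > mean_p else "overconfident"
    print(f"predicted={mean_p:.2f} observed={rate:.2f} n={n} {label}")
# predicted=0.35 observed=0.50 n=10 underconfident
# predicted=0.85 observed=0.60 n=10 overconfident
```

Plotting the first two columns against each other gives the reliability diagram; the count column shows which bins have too few examples to trust.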

Ranking and Retrieval Metrics

Classification metrics treat predictions as independent binary decisions: each item is either relevant or not. But many ML systems produce ranked lists: search engines return 10 blue links, recommendation systems show a feed of content, and ad platforms rank bid candidates. For ranked outputs, the position of each result matters enormously. A search engine that returns 10 perfectly relevant results buried on page 5 is nearly as useless as one that returns no relevant results at all, because few users ever get that far.

This is why classification metrics fail for ranking. Precision and recall have no concept of position — they treat a relevant result at rank 1 the same as one at rank 50. Ranking metrics fix this by incorporating position into the score.

NDCG (Normalized Discounted Cumulative Gain)

NDCG is the workhorse metric for ranking problems. It captures two things that other metrics miss: position matters, and relevance is not binary.

Start with the intuition. A relevant result at position 1 is worth far more than the same result at position 10. Users scan search results from top to bottom, and attention drops off logarithmically. The first result gets the most attention, the second gets less, and by position 10 hardly anyone looks.

Discounted Cumulative Gain (DCG) formalizes this. For each result at position $i$ with relevance score $rel_i$:

$$DCG = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$$

The logarithmic denominator is the "discount": it reduces the contribution of results at lower positions. A relevance-3 result at position 1 contributes $3/\log_2(2) = 3.0$. The same result at position 10 contributes $3/\log_2(11) \approx 0.87$. Position 1 is worth about 3.5 times more than position 10.

Ideal DCG (IDCG) is the DCG you would get if results were sorted in perfect order — most relevant first. NDCG normalizes DCG by IDCG so the score falls between 0 and 1:

NDCG = DCG / IDCG

NDCG of 1.0 means the results are in perfect order. NDCG of 0.0 means no relevant results appear in the list. In practice, you compute NDCG at a specific cutoff — NDCG@10 for the top 10 results, NDCG@20 for the top 20.

NDCG also handles graded relevance. Instead of binary relevant/not-relevant, you can assign scores: 3 for highly relevant, 2 for somewhat relevant, 1 for marginally relevant, 0 for irrelevant. This captures the reality that not all relevant results are equally useful.
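The DCG and NDCG definitions above translate almost directly into code. One simplifying assumption in this sketch: the ideal ordering is computed from the relevance scores of the returned list only, whereas a full evaluation would build IDCG from all judged documents for the query.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top k graded relevance scores."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg(relevances, k):
    """DCG divided by the DCG of the same scores sorted best-first."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(round(3 / math.log2(2), 2))          # 3.0   relevance 3 at position 1
print(round(3 / math.log2(11), 2))         # 0.87  relevance 3 at position 10
print(ndcg([3, 2, 1, 0], k=4))             # 1.0   already in ideal order
print(round(ndcg([0, 1, 2, 3], k=4), 3))   # 0.614 same items, reversed
```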

MRR (Mean Reciprocal Rank)

MRR asks a simpler question: how far down the list is the first relevant result?

For a single query, the reciprocal rank is $1/\text{rank}$ where rank is the position of the first relevant result. If the first relevant result is at position 1, reciprocal rank is 1. Position 3 gives 1/3. Position 10 gives 1/10.

MRR averages reciprocal rank across all queries. It ranges from 0 to 1, where 1 means the first result is always relevant.

MRR is useful for navigational and factual queries: queries with a single correct answer where the user wants it as the first result. "What is the capital of France?" should return Paris at position 1. MRR penalizes answers that appear lower in the list.

The limitation of MRR is that it only cares about the first relevant result. For queries where multiple relevant results matter (product search, content recommendations), MRR ignores everything after the first hit.
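A short sketch of MRR, assuming each query is represented as a ranked list of relevance flags (a hypothetical input format chosen for illustration). A query with no relevant result contributes 0.

```python
def reciprocal_rank(relevant_flags):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

queries = [
    [True, False, False],   # first hit at rank 1 -> 1
    [False, False, True],   # first hit at rank 3 -> 1/3
    [False, True, True],    # first hit at rank 2 -> 1/2; the later hit is ignored
]
print(round(mean_reciprocal_rank(queries), 3))  # 0.611
```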

MAP (Mean Average Precision)

MAP computes precision at each position where a relevant result appears, then averages those precision values. For a query with relevant results at positions 1, 3, and 7 out of 10 results:

Precision at position 1: 1/1 = 1.0
Precision at position 3: 2/3 = 0.67
Precision at position 7: 3/7 = 0.43
Average Precision = (1.0 + 0.67 + 0.43) / 3 = 0.70

MAP averages this across all queries. It rewards models that rank relevant results higher — if the same three relevant results appeared at positions 1, 2, and 3, the average precision would be 1.0.

MAP treats all relevant results as equally important (binary relevance). This is appropriate when every relevant result matters equally, but not when some results are more relevant than others. For graded relevance, use NDCG.
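The worked example above can be checked with a small sketch. As in that example, it assumes every relevant document for the query appears somewhere in the ranked list, so the average is taken over the relevant results that were retrieved.

```python
def average_precision(relevant_flags):
    """Mean of precision values at each rank where a relevant result appears."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(queries):
    return sum(average_precision(q) for q in queries) / len(queries)

spread = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # relevant at positions 1, 3, 7
top = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]      # relevant at positions 1, 2, 3

print(round(average_precision(spread), 2))               # 0.7
print(average_precision(top))                            # 1.0
print(round(mean_average_precision([spread, top]), 2))   # 0.85
```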

Precision@K

Precision@K asks: of the top K results, what fraction is relevant? Precision@5 for a recommendation system tells you how many of the 5 items shown to the user are actually interesting.

This is the simplest ranking metric and maps directly to user experience — if a user sees 5 recommendations and 3 are relevant, Precision@5 is 0.60. It does not account for the ordering within the top K, though. Showing relevant results at positions 1, 2, 3 scores the same as positions 3, 4, 5.
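A minimal sketch showing that Precision@K ignores ordering inside the top K (the flag lists are hypothetical):

```python
def precision_at_k(relevant_flags, k):
    """Fraction of the top k results that are relevant."""
    return sum(relevant_flags[:k]) / k

print(precision_at_k([1, 1, 1, 0, 0], k=5))  # 0.6  relevant at positions 1, 2, 3
print(precision_at_k([0, 0, 1, 1, 1], k=5))  # 0.6  relevant at positions 3, 4, 5
```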

Choosing the Right Ranking Metric

The right metric depends on your product's interaction pattern:

Single-answer queries (voice assistants, factual QA): use MRR. Users want the one correct answer at position 1.
Browsing queries (product search, news feeds): use NDCG. Users scan multiple results and relevance varies in degree.
Fixed-slot recommendations (homepage widgets showing 5 items): use Precision@K. Users see exactly K items and you want to maximize how many are relevant.
Document retrieval (legal search, academic paper search): use MAP. Users need to find all relevant documents and order matters.

In practice, report multiple metrics: NDCG@10 as the primary metric, Precision@5 as a quick sanity check, and MRR to ensure the top result is relevant. No single metric tells the full story, but NDCG comes closest for most ranking problems.

Interview Tip

In interviews involving search or recommendation systems, default to NDCG as your primary offline metric. It captures two things other metrics miss: (1) position matters, since under the logarithmic discount a relevant result at rank 1 is worth several times more than the same result at rank 10, and (2) graded relevance, since a highly relevant result contributes more than a somewhat relevant one. MRR only cares about the first relevant result, and MAP treats all relevant results equally.

Regression and Forecasting Metrics

Classification and ranking metrics handle categorical or ordered outputs. But many ML systems predict continuous values: tomorrow's temperature, next week's product demand, a customer's expected lifetime value, a house price. For continuous outputs, you need metrics that measure how far off the predictions are — and "how far" can be defined in several ways, each with different implications.

Consider a demand forecasting model that predicts 105 units when actual demand is 100. Is that good? It depends on context. If you are ordering perishable goods, overestimating by 5% means waste. If you are staffing a call center, underestimating by 5% means long wait times. The choice of regression metric encodes which kinds of errors you care about most.

MSE (Mean Squared Error)

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

MSE squares each error before averaging. This has a critical consequence: large errors are penalized much more than small errors. An error of 10 contributes 100 to the sum. An error of 100 contributes 10,000 — a hundred times more, not ten times more.

This property makes MSE appropriate when large errors are disproportionately costly. In demand forecasting, predicting 200 units when actual demand is 100 is far worse than predicting 110: you are left with 100 surplus units to dispose of instead of 10. In financial risk models, underestimating risk by a large amount can be catastrophic.

The downside is that MSE is sensitive to outliers. A single extreme error can dominate the metric, making it hard to tell whether the model is generally good but occasionally terrible, or consistently mediocre.

MAE (Mean Absolute Error)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

MAE takes the absolute value of each error before averaging. An error of 10 contributes 10. An error of 100 contributes 100 — ten times more, proportional to the actual size of the error. MAE treats all errors linearly, which makes it more robust to outliers than MSE.

Use MAE when you want to know the average error magnitude without amplifying extreme cases. It is also easier to interpret than MSE — an MAE of 5 means predictions are off by 5 units on average.

Choosing Between MSE and MAE

The choice comes down to how you want to penalize errors. If predicting 200 when actual is 100 (an error of 100) is more than twice as bad as predicting 150 (an error of 50), use MSE: the quadratic penalty, which scores the larger error four times as heavily, matches the disproportionate cost. If cost grows in direct proportion to the size of the error, use MAE.

A practical consideration: MSE is differentiable everywhere, which makes it convenient for gradient-based optimization. MAE has a non-differentiable point at zero, which can cause optimization instability. This is why many neural networks use MSE (or its root, RMSE) as the training loss even when MAE is the evaluation metric. RMSE (Root Mean Squared Error) is simply the square root of MSE, which returns the metric to the same units as the target variable and makes it more interpretable while retaining MSE's sensitivity to large errors.
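To see how the three metrics react to one large error, the sketch below compares two hypothetical forecasts with the same MAE: one that is always off by 10 and one that is exact three times and off by 40 once.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

y_true = [100, 100, 100, 100]
steady = [110, 90, 110, 90]     # every prediction off by 10
spiky = [100, 100, 100, 140]    # three exact, one off by 40

print(mae(y_true, steady), mse(y_true, steady), rmse(y_true, steady))  # 10.0 100.0 10.0
print(mae(y_true, spiky), mse(y_true, spiky), rmse(y_true, spiky))     # 10.0 400.0 20.0
```

MAE cannot tell the two forecasts apart, while MSE and RMSE flag the one with the single large miss.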

MAPE (Mean Absolute Percentage Error)

$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%$$

MAPE expresses errors as percentages of actual values. This makes it scale-independent — a 10% error on a $10 item and a 10% error on a $10,000 item contribute equally. This is useful when you forecast across items of very different magnitudes.

But MAPE has a fatal flaw: it breaks when actual values are near zero. If actual demand is 1 unit and the model predicts 2, the percentage error is 100%. If actual demand is 0 and the model predicts 1, the error is undefined (division by zero). For products with intermittent or low demand, MAPE is unreliable.

MAPE also has an asymmetry problem. Underestimation is bounded (the maximum percentage error for underestimating is 100% — predicting 0 when actual is positive) but overestimation is unbounded (predicting 1,000 when actual is 1 gives 99,900% error). This asymmetry can bias model selection toward models that systematically underpredict.
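Both problems (the zero-division failure and the asymmetry) show up in a few lines. Raising an error on zero actuals is a choice made for this sketch; other implementations skip or clip those rows instead.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Undefined for zero actuals."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mape([100], [0]))                             # 100.0    worst possible underestimate
print(mape([1], [1000]))                            # 99900.0  overestimate is unbounded
print(round(mape([10, 10_000], [11, 11_000]), 2))   # 10.0     scale-independent

try:
    mape([0], [1])
except ValueError as err:
    print(err)  # MAPE is undefined when an actual value is zero
```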

R-squared

$$R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

R-squared measures the proportion of variance in the target variable that the model explains. It ranges from 0 to 1 in the typical case (though it can be negative for models worse than predicting the mean). An R-squared of 0.85 means the model explains 85% of the variance in the data.

R-squared is useful for comparing models — a model with R-squared 0.90 explains more variance than one with 0.75. But it does not tell you if the model is good enough in absolute terms. A model explaining 90% of variance in stock prices might still have errors too large to trade on.

R-squared measured on the training data also never decreases when you add features to a least-squares model, even if those features add noise rather than signal. Adjusted R-squared corrects for this by penalizing additional features that do not improve the model enough, and R-squared computed on a held-out test set does not have this problem. The fundamental limitation remains either way: R-squared measures relative improvement over a naive baseline (predicting the mean), not absolute prediction quality.
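A direct implementation of the formula makes the baseline interpretation visible: predicting the mean scores exactly 0, and a model worse than the mean goes negative. The data are hypothetical, and the sketch assumes the targets are not all identical (otherwise the denominator is zero).

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; assumes y_true is not constant."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1, 2, 3, 4, 5]
print(r_squared(y_true, [3, 3, 3, 3, 3]))                      # 0.0   predicting the mean
print(r_squared(y_true, [5, 4, 3, 2, 1]))                      # -3.0  worse than the mean
print(round(r_squared(y_true, [1.1, 1.9, 3.2, 3.8, 5.1]), 3))  # 0.989 close fit
```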

Pitfalls of Offline Evaluation

Offline metrics tell you how a model performs on historical data. They are necessary but not sufficient — a model can improve on every offline metric and still hurt the product when deployed. This section covers the most common pitfalls that make offline evaluation misleading and how to guard against them.

Data Leakage

Data leakage is the most dangerous pitfall in ML evaluation because it makes everything look perfect. Leakage happens when the training pipeline accidentally includes information that would not be available at prediction time. The model learns to use this future or forbidden information, achieves spectacular offline metrics, and then fails completely in production where that information does not exist.

The most common form is temporal leakage. Consider a model predicting tomorrow's stock price. If the feature pipeline includes tomorrow's trading volume — because the data was joined carelessly on the wrong date column — the model has trivially easy access to the target. It achieves near-perfect accuracy offline. In production, tomorrow's trading volume is not available at prediction time, so the model's performance collapses.

Target leakage is subtler. A fraud detection model includes "was this transaction reversed?" as a feature. Reversed transactions are strongly correlated with fraud because fraudulent charges get reversed. The model achieves 99% accuracy offline. In production, at the time the model makes a prediction, the reversal has not happened yet — the feature is always null. The model is useless.

Train-test split leakage happens when data that should be in the test set leaks into training. The classic example is time-series data split randomly instead of chronologically. If you randomly split daily stock prices, training data contains both January and March, and test data contains February. The model has seen the future (March) while predicting the present (February). For time-series data, always split chronologically — train on the past, test on the future.
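A chronological split is simple to write down. The sketch below assumes each row is a dict with a `day` field (a hypothetical schema) and checks that no training row is dated on or after the earliest test row.

```python
from datetime import date

def chronological_split(rows, cutoff):
    """Train on rows strictly before `cutoff`, test on rows at or after it."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test

# Hypothetical monthly price rows for January through April.
rows = [{"day": date(2024, month, 1), "price": 100 + month} for month in (1, 2, 3, 4)]
train, test = chronological_split(rows, cutoff=date(2024, 3, 1))

# Guard against temporal leakage: every training row precedes every test row.
assert max(r["day"] for r in train) < min(r["day"] for r in test)
print(len(train), len(test))  # 2 2
```

The same guard is worth keeping as an assertion in the evaluation pipeline, since a careless shuffle upstream would otherwise go unnoticed.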

Common Pitfall

Leakage is easy to miss precisely because the offline numbers look so good. A model with 99% accuracy offline that drops to 78% in production is a strong sign that it saw future or otherwise unavailable data during training. The fix is disciplined feature engineering: for every feature, ask 'would this value be available at prediction time in production?' If the answer is no, remove it.

Class Imbalance and Stratified Evaluation

Aggregate metrics hide failures on important subgroups. A content recommendation model achieves about 91% accuracy overall. But when you slice by user segment, performance on new users is 71%, far below the 95% on returning users. New users are only 15% of the population, so their poor performance barely moves the aggregate.

Stratified evaluation means computing metrics for each meaningful subgroup: geography, user segment, device type, content category, demographic group. A model that is good on average but terrible for a critical subgroup is not ready for production.

Figure: Stratified Evaluation Reveals Hidden Failures

This also connects to fairness — a model that performs well for majority groups but poorly for minority groups may be discriminatory. Stratified evaluation is the first step in detecting such bias.

The practical approach is to define your evaluation slices before training, not after. If you wait until after you see results, you will cherry-pick slices that look good and ignore slices that look bad. Define slices based on business importance (high-value customers, regulated categories, underserved markets) and demographic groups (to detect fairness issues). Report metrics for every slice in every evaluation run, and set minimum performance thresholds for critical slices that the model must meet before deployment.
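The per-slice reporting and minimum-threshold check described above could look roughly like this. The record fields (`segment`, `label`, `pred`), the slice names, and the thresholds are hypothetical placeholders.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Accuracy per value of `slice_key`; records are dicts with 'label' and 'pred'."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r[slice_key]] += 1
        correct[r[slice_key]] += int(r["label"] == r["pred"])
    return {s: correct[s] / total[s] for s in total}

def failing_slices(per_slice, minimums):
    """Slices whose metric is below the pre-registered minimum (missing slices fail)."""
    return [s for s, floor in minimums.items() if per_slice.get(s, 0.0) < floor]

# Hypothetical results: 100 returning users (95 correct), 20 new users (14 correct).
records = (
    [{"segment": "returning", "label": 1, "pred": 1}] * 95
    + [{"segment": "returning", "label": 1, "pred": 0}] * 5
    + [{"segment": "new", "label": 1, "pred": 1}] * 14
    + [{"segment": "new", "label": 1, "pred": 0}] * 6
)
overall = sum(r["label"] == r["pred"] for r in records) / len(records)
per_slice = accuracy_by_slice(records, "segment")

print(round(overall, 3))   # 0.908
print(per_slice)           # {'returning': 0.95, 'new': 0.7}
print(failing_slices(per_slice, {"returning": 0.90, "new": 0.85}))  # ['new']
```

Defining the `minimums` dictionary before training is what keeps the slice choice from being cherry-picked afterwards.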

Small Test Sets and Statistical Significance

With 100 test examples, a model achieving 85% accuracy could have true accuracy anywhere from 77% to 91% (95% confidence interval). A competing model at 88% might not actually be better — the 3-point difference could be random noise.

Confidence intervals quantify this uncertainty. The wider the interval, the less you should trust small differences. Common approaches include bootstrap resampling (resample the test set thousands of times and compute the metric on each resample) and paired statistical tests (compare two models on the same examples to control for test-set variance).

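A rough rule of thumb: even with 1,000 test examples, the 95% confidence interval around an accuracy of 85% is still about ±2 percentage points, so confirming a metric difference of 1-2 points usually takes several thousand examples (fewer if you use a paired test on the same examples). For rarer events (fraud detection at 0.1% base rate), you need far more examples to get stable estimates of precision and recall.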

Teams frequently make deployment decisions based on test sets of a few hundred examples. The result is that model "improvements" are often noise. One practical guard rail: always report confidence intervals alongside point estimates. If the confidence intervals of two models overlap substantially, do not claim one is better without a paired test on the same examples. If the result is still inconclusive, run a larger evaluation or collect more test data before making the decision.
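Both ideas from this section can be sketched with the standard library. The first function computes a Wilson score interval for a single accuracy estimate (one of several interval methods; it reproduces the 77% to 91% range quoted for 85 correct out of 100). The second is a paired bootstrap for the accuracy difference between two models scored on the same examples. The per-example correctness lists are hypothetical.

```python
import math
import random

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion such as accuracy."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """95% percentile interval for accuracy(B) - accuracy(A) on shared test examples."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples with replacement
        diffs.append(sum(correct_b[i] - correct_a[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]

low, high = wilson_interval(85, 100)
print(round(low, 2), round(high, 2))  # 0.77 0.91

# Hypothetical per-example correctness: A gets 85/100, B gets 88/100.
# B fixes 8 of A's errors but breaks 5 examples A had right.
correct_a = [1] * 85 + [0] * 15
correct_b = [1] * 80 + [0] * 5 + [1] * 8 + [0] * 7
diff_low, diff_high = paired_bootstrap_diff(correct_a, correct_b)
print(diff_low, diff_high)
```

In this toy setup the observed gain is 3 points, and the bootstrap interval for the difference is expected to include zero, which is the situation where claiming an improvement would be premature.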

The Offline-Online Gap

This is the ultimate pitfall. A model improves every offline metric — accuracy up 3%, F1 up 5%, NDCG up 2%. You deploy it. Online metrics (click-through rate, conversion, revenue) do not improve, or they get worse.

Several factors cause this gap:

Distribution shift: The test set was collected last month. User behavior has changed since then. The model was evaluated on stale data that no longer represents reality.
Feedback loops: A recommendation model changes what users see, which changes what they click, which changes the data the next model trains on. Offline evaluation cannot capture these dynamics because it uses static data.
Proxy metric mismatch: You optimized for click prediction (easy to measure offline) but the business cares about purchase rate (requires online measurement). The model learns to recommend clickbait — high clicks, low purchases.
Serving-time effects: Latency, caching, feature freshness, and serving infrastructure affect real-world performance in ways that offline evaluation cannot capture.

The offline-online gap cannot be eliminated, only managed. The practical workflow is: use offline metrics for fast iteration and filtering — they are cheap, fast, and run on historical data without risking user experience. Offline evaluation narrows a field of 20 model candidates to 3-4 worth testing online. Then use online experiments (A/B tests) for final validation — they are expensive, slow, and carry risk (bad models affect real users during the test), but they measure what actually matters.

A healthy ML team treats a persistent offline-online gap as a signal to improve their offline evaluation, not to abandon it. If offline metric improvements consistently fail to translate to online gains, the offline metric is measuring the wrong thing. Investigate which offline metric correlates best with your online metrics and use that as your primary offline evaluation criterion.
