Online Experimentation
Offline evaluation tells you how a model performs on held-out data. Online experimentation tells you how it performs on real users making real decisions with real money. These are fundamentally different questions. A model that improves AUC by 3% on your test set might decrease revenue by 1% in production because the test set does not capture user behavior shifts, latency sensitivity, or interaction effects with other systems. A/B testing exists because the gap between offline metrics and business outcomes is unpredictable, and the only reliable way to measure it is to run both systems on live traffic at the same time under identical conditions.
An A/B test splits live traffic into two groups. The control group sees the existing system (model A). The treatment group sees the candidate system (model B). Both groups experience the same product surface, same UI, same everything except the one variable you are testing. You measure a primary metric (click-through rate, conversion rate, revenue per user) on both groups and compare. If the treatment outperforms control by a margin that is unlikely to be due to random chance, you ship it.

Random assignment is the mechanism that makes causal inference possible. If you let users self-select into groups, or assign based on any observable property (device type, geography, signup date), you introduce confounders. Random assignment ensures that, in expectation, every characteristic, measured or unmeasured, is balanced across groups. The only systematic difference between groups is then the treatment itself.
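One common way to implement random assignment is to hash a stable user identifier. This is a minimal sketch, not a description of any particular platform: the function name `assign_variant`, the experiment name `ranker-v2`, and the `user-N` identifiers are all hypothetical.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hypothetical helper: the hash acts as a pseudo-random draw that depends
    only on the experiment name and the user id, never on user attributes.
    """
    # Salting with the experiment name keeps assignments in different
    # experiments independent of each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the digest as a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same group for a given experiment.
first = assign_variant("user-42", "ranker-v2")
second = assign_variant("user-42", "ranker-v2")
print(first, first == second)
```

Hashing rather than drawing a fresh random number on each request keeps a user in one group for the life of the experiment, so their experience stays consistent and their events are attributed to a single variant.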
Why Offline Metrics Are Not Enough
Consider a search ranking model. Offline, you measure NDCG on a held-out set of queries with human-labeled relevance judgments. The new model scores 0.82 NDCG versus 0.79 for the production model. There are three reasons this does not predict online success. First, the relevance labels were collected months ago and may not reflect current user intent. A query like "best phone 2026" has different ideal results now than when labels were created. Second, NDCG discounts lower positions only logarithmically, while user attention falls off much faster: users interact primarily with the top 3 results. A model that improves rank quality at positions 15 through 20 raises NDCG without changing anything most users see. Third, offline metrics ignore presentation effects: the snippet shown, the loading speed of the result page, and the visual layout all influence clicks but are absent from NDCG. These gaps mean an offline improvement is a necessary signal but never a sufficient one. The A/B test is the final arbiter.
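The position effect can be illustrated with a small NDCG calculation. This sketch assumes one common formulation (gain $2^{rel} - 1$ with a $\log_2$ position discount), which the text does not specify, and the relevance grades are made up for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 discount."""
    return sum(
        (2**rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance (0-3) for the top 20 results of one query.
baseline = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2]

# Candidate ranking: identical top 14, but the relevant document at
# position 20 moves up to position 15.
candidate = list(baseline)
candidate[14], candidate[19] = candidate[19], candidate[14]

print(f"baseline NDCG@20:  {ndcg(baseline):.4f}")
print(f"candidate NDCG@20: {ndcg(candidate):.4f}")
print("top 3 unchanged:", baseline[:3] == candidate[:3])
```

The candidate scores higher even though nothing in the first 14 positions changed, which is exactly the kind of offline gain that may never show up in user behavior.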
Statistical Significance
After running the test for some period, you observe a difference in your primary metric between control and treatment. The question: is this difference real, or could random variation have produced it? Statistical significance answers this through hypothesis testing. You start with a null hypothesis: there is no difference between control and treatment. You compute a test statistic (often a z-test or t-test for means) and derive a p-value, the probability of observing a difference this large or larger if the null hypothesis were true. If the p-value falls below your significance threshold (typically $\alpha = 0.05$), you reject the null hypothesis and conclude the difference is real.
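For a conversion-style metric, the hypothesis test described above can be sketched as a two-proportion z-test. The function name and the conversion counts below are hypothetical, and the normal approximation assumes both groups are large.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Two-sided z-test for a difference in conversion rates.

    conv_* are conversion counts and n_* are users per group.
    Returns (z statistic, p-value).
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    # Under the null hypothesis both groups share one rate, so pool them.
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    # Probability of a difference at least this extreme in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 10.0% conversion in control, 11.0% in treatment.
z, p_value = two_proportion_z_test(conv_c=1500, n_c=15000, conv_t=1650, n_t=15000)
print(f"z = {z:.2f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```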
A confidence interval provides the same information with more nuance. A 95% confidence interval for the treatment effect tells you: if you repeated this experiment many times, 95% of the intervals constructed this way would contain the true effect. If the interval excludes zero, the result is statistically significant at the 5% level.
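A confidence interval for the difference in conversion rates can be sketched in the same style. This uses the standard normal-approximation (Wald) interval with unpooled variances; the function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(conv_c, n_c, conv_t, n_t, level=0.95):
    """Normal-approximation interval for (treatment rate - control rate)."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    # Unpooled standard error: each group keeps its own variance estimate.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_t - p_c
    return diff - z_crit * se, diff + z_crit * se


# Hypothetical counts: 10.0% conversion in control, 11.0% in treatment.
low, high = diff_confidence_interval(1500, 15000, 1650, 15000)
print(f"95% CI for absolute lift: [{low:.4f}, {high:.4f}]")
print("excludes zero:", low > 0 or high < 0)
```

Reporting the interval alongside the p-value shows the range of plausible effect sizes, not only whether zero is ruled out.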
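Sample size requirements follow directly from the math. To detect a 1 percentage point absolute improvement (10% relative) in a metric with a baseline rate of 10%, at 80% power and 5% significance, you need roughly 15,000 users per group. Detecting a 1% relative improvement on the same baseline takes roughly 1.4 million users per group. Required sample size grows with the inverse square of the effect size, so halving the detectable effect roughly quadruples the users needed. This is why high-traffic products can test subtle changes quickly, while low-traffic products are limited to testing large effects.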
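The calculation can be sketched with the standard normal-approximation formula for comparing two proportions, $n = (z_{1-\alpha/2} + z_{1-\beta})^2 \, [p_1(1-p_1) + p_2(1-p_2)] / (p_2 - p_1)^2$ users per group. The function name is hypothetical, and real experimentation platforms may use slightly different variants of this formula.

```python
from math import ceil
from statistics import NormalDist


def users_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect**2)


# Baseline rate of 10%, detecting a lift to 11% (one percentage point).
n_one_point = users_per_group(0.10, 0.11)
# Same baseline, detecting a lift half as large.
n_half_point = users_per_group(0.10, 0.105)

print(n_one_point, n_half_point, round(n_half_point / n_one_point, 2))
```

With a 10% baseline, a lift to 11% needs a little under 15,000 users per group, and halving the lift roughly quadruples that figure because the effect size enters the formula squared.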
Metric selection is a design decision that directly determines what you can learn from an experiment. A good primary metric is sensitive (moves when user experience changes), attributable (the treatment plausibly affects it), and timely (observable within the experiment window). Revenue per user is important but insensitive to small changes and slow to accumulate. Clicks per session is sensitive and fast but can be gamed by clickbait. The best primary metrics sit in the middle: metrics like "qualified engagement" (clicks that lead to at least 30 seconds of reading) capture genuine user value while responding quickly enough to detect within a reasonable experiment duration.
Statistical significance tells you whether an effect exists. It does not tell you whether the effect matters. A test with 10 million users per group can detect a 0.01% improvement with high confidence, but a 0.01% improvement in click-through rate is operationally meaningless if it costs three engineering months to maintain the new model.
Practical Significance
Practical significance asks: is the observed effect large enough to justify the cost of shipping and maintaining the change? This requires defining a minimum detectable effect (MDE) before the experiment starts: the smallest improvement you would consider worth shipping, which also determines how many users the experiment needs. If your MDE is a 2% relative improvement in conversion rate, and the experiment shows a statistically significant 0.3% improvement, you do not ship. The effect is real but not worth the complexity.
The MDE should account for the full cost of the change: engineering time to build and maintain the new system, infrastructure cost differences, monitoring overhead, and the cognitive burden on the team of supporting another model variant. A 1% improvement in CTR sounds appealing in isolation, but if the new model requires a GPU-backed serving stack that costs $50,000 per month more than the existing CPU-based system, the improvement must generate more than $50,000 per month in incremental revenue to justify itself.
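The break-even arithmetic can be made explicit. This is a rough sketch under two loud assumptions: revenue scales linearly with the metric being improved, and the baseline revenue figure is invented for illustration. Only the $50,000 monthly cost comes from the example above.

```python
def break_even_relative_lift(extra_monthly_cost, baseline_monthly_revenue):
    """Smallest relative revenue lift that covers the added monthly cost.

    Assumes incremental revenue = lift * baseline revenue, which ignores
    margins, engineering time, and any non-linear response to the metric.
    """
    return extra_monthly_cost / baseline_monthly_revenue


extra_cost = 50_000            # extra serving cost per month, from the example
baseline_revenue = 4_000_000   # hypothetical monthly revenue on this surface

lift = break_even_relative_lift(extra_cost, baseline_revenue)
print(f"break-even relative lift: {lift:.2%}")
```

Under these assumed numbers a 1% lift would not pay for the new serving stack, so the practical threshold for shipping would have to sit above the break-even figure.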
Type I and Type II Errors
A Type I error (false positive) happens when you conclude a treatment works but it actually does not. Your significance threshold $\alpha$ directly controls this rate: at $\alpha = 0.05$, a treatment with no real effect still has a 5% chance of being declared a winner. A Type II error (false negative) happens when you conclude a treatment does not work but it actually does. The probability of avoiding a Type II error is called power, typically set at 80% or 90%. Low power means you frequently miss real improvements, which can be just as costly as shipping fake ones because every missed improvement is a genuine win you forgo.
These two error types trade off against each other. At a fixed sample size, lowering $\alpha$ from 0.05 to 0.01 reduces false positives but increases false negatives because you now require stronger evidence to reject the null. Increasing power from 80% to 95% reduces false negatives but requires larger sample sizes, which means longer experiments and higher opportunity cost. The standard operating point of $\alpha = 0.05$ and power $= 0.80$ is a convention, not a law. Teams with high deployment costs (infrastructure changes, model retraining pipelines) should consider stricter $\alpha$ because false positives are expensive. Teams in fast-moving markets should consider higher power because missing a real improvement has immediate competitive consequences.
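The trade-off can be seen numerically by holding the sample size fixed and varying $\alpha$. This sketch uses the usual normal approximation for the power of a two-sided two-proportion test and ignores the negligible chance of rejecting in the wrong direction. The function name, rates, and sample size are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(p_control, p_treatment, n_per_group, alpha):
    """Approximate power of a two-sided two-proportion z-test."""
    se = sqrt(
        p_control * (1 - p_control) / n_per_group
        + p_treatment * (1 - p_treatment) / n_per_group
    )
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance that the observed z statistic clears the critical value
    # when the true difference is p_treatment - p_control.
    return NormalDist().cdf(abs(p_treatment - p_control) / se - z_crit)


# Hypothetical experiment: 10% baseline, true lift to 11%, 15,000 users per group.
for alpha in (0.10, 0.05, 0.01):
    power = approximate_power(0.10, 0.11, 15_000, alpha)
    print(f"alpha = {alpha:.2f}  power = {power:.2f}  type II rate = {1 - power:.2f}")
```

With these inputs, tightening $\alpha$ from 0.05 to 0.01 drops power from roughly 0.81 to roughly 0.60, so recovering 80% power at the stricter threshold requires a larger sample.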