Experiment Tracking and Model Registry

Course: ML Systems & Infrastructure

Topics Covered

Experiment Tracking Fundamentals
    What to log
    Experiment comparison
    Organizing experiments
    Tools
    The cost of not tracking

Hyperparameter Tuning
    Grid search
    Random search
    Bayesian optimization
    Early stopping
    Search space design

Model Versioning and Registry
    What a model registry stores
    The staging lifecycle
    Model cards
    The reproducibility recipe

From Experiment to Production
    Automated validation gates
    Deployment strategies
    Approval workflows
    Rollback procedures
    Lineage tracing

Experiment Tracking Fundamentals

Your team ran 47 experiments last month. Which learning rate produced the best F1 on the March 3rd validation set? Without experiment tracking, the answer is "nobody knows." The notebook was overwritten, the terminal output scrolled away, and the person who ran it is on vacation. This is not a hypothetical -- it happens at every ML team that relies on spreadsheets, Slack messages, or memory.

The core problem is that ML development is fundamentally different from software development. In traditional software, the code is the artifact -- you commit it, review it, deploy it. In ML, the artifact is the trained model, which depends on code, data, hyperparameters, random seeds, library versions, and hardware. Change any one of these and you get a different model. A git log tells you what code changed, but it says nothing about which data split, which augmentation pipeline, or which GPU produced the model sitting in production right now.

Experiment tracking solves this by recording everything that goes into producing a model, so any result can be reproduced, compared, or debugged months later. It is the ML equivalent of version control -- but for the entire training process, not just the code.

What to log

Every experiment should capture five categories of information:

Metrics over time. Not just the final loss, but the entire training curve -- loss per epoch, validation accuracy per epoch, learning rate schedule. This lets you diagnose whether a run diverged, overfit, or plateaued. A single number like "accuracy = 0.91" hides whether the model was still improving when training stopped or whether it peaked at epoch 12 and degraded for the remaining 38 epochs.

Parameters. Every hyperparameter and configuration value: learning rate, batch size, optimizer, architecture variant, dropout rate, data augmentation settings. If you cannot reproduce the result from the logged parameters alone, you are not logging enough. A useful test: give your logged parameters to a colleague who did not run the experiment. Can they reproduce it without asking you a single question? If not, you are missing something -- probably an implicit configuration like "I used the tokenizer from the pretrained checkpoint" or "I filtered out rows where the label was null."

Artifacts. The trained model weights, evaluation plots (confusion matrix, ROC curve, calibration plot), sample predictions, and the training configuration file. These are the outputs that a reviewer inspects when deciding whether to promote a model.

Code version. The exact git commit hash, plus any uncommitted changes (as a diff patch). Two experiments with the same hyperparameters but different code versions produce different models. Without the code version, you cannot tell whether a performance improvement came from a hyperparameter change or a bug fix that was committed between runs.

Data version. A hash or version identifier for the training and validation datasets. Data changes silently -- a new batch of labels arrives, a preprocessing bug is fixed, a data source adds new columns. If you do not version data alongside code, you will eventually face the nightmare of "this model worked last week with the same code and same hyperparameters, but now it does not, and nobody knows what changed."

Key Insight

The most common cause of irreproducible ML results is not random seeds or GPU non-determinism -- it is untracked data changes. A labeling team corrects 200 annotations, a feature pipeline starts emitting nulls, or a data source changes its schema. Without data versioning, these silent changes are invisible in your experiment logs.
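
One lightweight way to make silent data changes visible is to log a content hash of the dataset with every run. The following is a minimal sketch, assuming the training data is a single local file; `fingerprint_file` and `data_changed` are hypothetical helper names, not part of any tracking tool.

```python
import hashlib
from pathlib import Path


def fingerprint_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def data_changed(path, logged_fingerprint):
    """True if the file on disk no longer matches the fingerprint logged with a run."""
    return fingerprint_file(path) != logged_fingerprint
```

The digest can be logged as an ordinary parameter next to the learning rate. If two runs share code and hyperparameters but their digests differ, the data is the first place to look.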

Experiment comparison

Logging is only useful if you can compare. The typical workflow is: run 20 experiments varying learning rate and batch size, then pull up a comparison table sorted by validation F1. A parallel coordinates plot shows which hyperparameter combinations cluster near the top. A training curve overlay reveals that Run 14 achieved the best final metric but was still improving -- it should have trained longer -- while Run 7 converged faster and might be preferable under a training budget constraint.

The comparison view also catches mistakes. If Run 22 shows suspiciously high accuracy, you can inspect its parameters and discover it was accidentally evaluated on the training set instead of the validation set. Without structured comparison, this error survives until production.

Effective comparison requires standardization. If one engineer logs validation accuracy and another logs validation F1, you cannot rank their results in the same table. Teams should agree on a standard set of metrics, evaluation datasets, and logging frequencies before starting experiments. This upfront investment pays off immediately -- the first time someone asks "which of these 30 runs should we deploy?" the answer is a sorted table, not a three-hour meeting.
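
The comparison workflow above can be sketched with plain Python records. The run names, metric values, and the `eval_split` field are invented for illustration; a real tracker would supply these from its own store.

```python
# Hypothetical run records, as a tracker might export them.
runs = [
    {"run": "run-07", "lr": 0.01, "batch_size": 64, "val_f1": 0.88, "eval_split": "validation"},
    {"run": "run-14", "lr": 0.001, "batch_size": 256, "val_f1": 0.91, "eval_split": "validation"},
    {"run": "run-22", "lr": 0.001, "batch_size": 128, "val_f1": 0.99, "eval_split": "train"},
]


def rank_runs(runs, metric="val_f1", required_split="validation"):
    """Rank comparable runs by a metric and set aside runs scored on the wrong split."""
    comparable = [r for r in runs if r["eval_split"] == required_split]
    excluded = [r for r in runs if r["eval_split"] != required_split]
    ranked = sorted(comparable, key=lambda r: r[metric], reverse=True)
    return ranked, excluded


ranked, excluded = rank_runs(runs)
for r in ranked:
    print(f'{r["run"]}  lr={r["lr"]}  bs={r["batch_size"]}  val_f1={r["val_f1"]:.2f}')
for r in excluded:
    print(f'excluded {r["run"]}: evaluated on {r["eval_split"]}')
```

Logging the evaluation split as a field is what lets the suspiciously strong run be filtered out mechanically instead of by someone noticing it.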

Organizing experiments

As experiment count grows, naming and organization become critical. A flat list of 500 runs is unusable. Most tracking tools support a hierarchical structure: projects contain experiments, experiments contain runs. A project might be "fraud-detection," with experiments for "architecture-search," "lr-tuning," and "data-augmentation-ablation." Each experiment contains the runs for that specific investigation.

Tags and metadata filters provide a second axis of organization. Tag runs with the team member who launched them, the GPU type used, the dataset variant, or the hypothesis being tested. Six months later, when someone asks "did we ever try training on the February data with a transformer backbone?" the answer is a tag filter, not a Slack search.

Tools

MLflow is the open-source default. It provides a tracking server, a model registry, and a serving layer. You log parameters and metrics with mlflow.log_param() and mlflow.log_metric(), and the UI gives you comparison tables and charts. It stores artifacts on local disk, S3, or GCS. The tradeoff is that self-hosting MLflow requires managing the backend database and artifact storage yourself.

Weights & Biases (W&B) is a managed service that adds collaborative features: team dashboards, report generation, and hyperparameter sweep orchestration. It logs system metrics (GPU utilization, memory) automatically. The tradeoff is cost and data residency -- your experiment data lives on their servers unless you use the self-hosted option.

Neptune occupies similar territory to W&B with a focus on metadata management and custom dashboards.

The choice between tools matters less than the discipline of using one consistently. A team that logs everything to MLflow will outperform a team that uses W&B for some experiments and spreadsheets for others. When evaluating tools, prioritize three criteria:

Instrumentation friction. How many lines of code does it take to add tracking to an existing training script? If the answer is more than 10, adoption will be inconsistent. Engineers will skip it when they are "just running a quick test" -- and those quick tests will produce the model that accidentally goes to production.
Query and comparison speed. Can you find all runs from last month with learning rate below 0.001 and sort them by validation F1 in under 5 seconds? If the UI is slow or the query language is limited, engineers will stop using the comparison features and fall back to spreadsheets.
Integration with downstream systems. Does the tracker integrate with your model registry, CI/CD pipeline, and serving infrastructure? If promoting a model from experiment to registry requires manual copy-paste of run IDs, someone will eventually paste the wrong one.

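```python
import mlflow
import mlflow.pytorch

# `model` and `train_one_epoch` are assumed to be defined elsewhere in the training script.
mlflow.set_experiment("fraud-detection-v2")
with mlflow.start_run(run_name="lr-sweep-0.001"):
    # Parameters: everything needed to reproduce the run.
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("batch_size", 256)
    mlflow.log_param("model_arch", "resnet50")
    mlflow.log_param("data_version", "v2.3-2024-03-01")

    # Metrics over time: one point per epoch, not just the final value.
    for epoch in range(50):
        train_loss, val_f1 = train_one_epoch(model, epoch)
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_f1", val_f1, step=epoch)

    # Artifacts: the evaluation plot (the file must already exist) and the trained model.
    mlflow.log_artifact("confusion_matrix.png")
    # A ResNet-50 is assumed to be a PyTorch module, so the PyTorch flavor is used here.
    mlflow.pytorch.log_model(model, "model")
```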

The cost of not tracking

The cost of not tracking experiments is not abstract. Consider a concrete scenario: your production fraud detection model starts generating 30% more false positives. Customers are being blocked from legitimate transactions. The model was retrained last week with "the same configuration." Without experiment tracking, here is what happens: the team spends two days checking code diffs, re-running the training pipeline, and testing hypotheses. Eventually someone discovers that the data pipeline started including a new feature column three weeks ago, and the model learned a spurious correlation from that column. Two days of engineering time, plus the customer impact during investigation.

With experiment tracking, the investigation takes 15 minutes: compare the current production model's logged data version to the previous model's data version, see that the schema changed (new column), and trace the root cause directly. The tracking infrastructure paid for itself in a single incident.
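
The 15-minute investigation amounts to diffing the metadata of two runs. Here is a minimal sketch, assuming each run's logged fields are available as a dictionary; the field names and values are invented for illustration.

```python
def diff_runs(previous, current):
    """Return {field: (old, new)} for every logged field that differs between two runs."""
    changed = {}
    for key in sorted(set(previous) | set(current)):
        old, new = previous.get(key), current.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


# Hypothetical metadata for the previous and current production models.
previous_run = {"git_commit": "a1b2c3d", "learning_rate": 0.001,
                "data_version": "v2.3", "n_features": 41}
current_run = {"git_commit": "a1b2c3d", "learning_rate": 0.001,
               "data_version": "v2.4", "n_features": 42}

print(diff_runs(previous_run, current_run))
```

Identical code and hyperparameters drop out of the diff, leaving only the data version and feature count as candidates for the regression.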

The teams that move fastest in ML are not the ones with the most GPUs or the best researchers. They are the ones with complete experiment histories that let them build on past work instead of rediscovering it. Every experiment is an asset -- but only if it is recorded.

Hyperparameter Tuning

A model's hyperparameters -- learning rate, batch size, number of layers, dropout rate, weight decay -- are the knobs you turn before training begins. Unlike model parameters (weights and biases learned during training), hyperparameters are set by the engineer and never updated by the optimizer. The right combination can mean the difference between a model that barely beats a baseline and one that sets a new state of the art on your dataset.

The challenge is that the hyperparameter space is enormous. A model with 6 tunable hyperparameters, each with 10 candidate values, has one million possible configurations. Training each one takes hours. You cannot try them all. The question is not "which configuration is the best?" -- you will almost certainly never find the global optimum. The question is: how do you find a configuration that is good enough, quickly enough, to meet your deadline and budget?

This section covers three search strategies in order of sophistication, plus two meta-strategies (early stopping and search space design) that amplify any search method you choose.

Grid search

The simplest approach: define a discrete set of values for each hyperparameter and train every combination. If you search over 3 learning rates and 4 batch sizes, you train 12 models.

```python
# Grid search: 3 x 4 = 12 total experiments.
# train, evaluate, and log_experiment are placeholders for your own functions.
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128, 256]

for lr in learning_rates:
    for bs in batch_sizes:
        model = train(lr=lr, batch_size=bs)
        score = evaluate(model)
        log_experiment(lr=lr, batch_size=bs, score=score)
```

Grid search is easy to understand and parallelize -- each combination is independent, so you can run all 12 on separate GPUs simultaneously. The problem is exponential growth. Add a third hyperparameter with 5 values and you have 60 experiments. Add a fourth with 3 values and you have 180. By the time you have 6 hyperparameters, you are looking at thousands of training runs, each costing real compute dollars. At $2/hour for a GPU instance, a 1,000-run grid search over a 6-hour training job costs $12,000 -- and most of those runs will produce mediocre results.

The deeper problem is that grid search wastes most of its budget on hyperparameters that do not matter. If learning rate dominates performance and batch size has minimal effect, a 3x4 grid spends 75% of its compute (9 of 12 runs) evaluating batch size variations that produce nearly identical results. Twelve runs yield only 3 distinct learning rate values, each measured 4 times. This is a fundamental inefficiency: the budget is allocated uniformly across dimensions, but the importance of dimensions is almost never uniform.

Random search

Bergstra and Bengio showed in 2012 that random sampling from the hyperparameter space is surprisingly competitive with grid search -- and often better. Instead of evaluating every point on a grid, you sample random combinations from a defined distribution (uniform, log-uniform, categorical).

[Figure: Grid vs. random vs. Bayesian hyperparameter search comparison]

Why does random search work? The insight is beautifully simple: most hyperparameters are not equally important. In most ML problems, one or two hyperparameters dominate performance while the rest have minimal effect. If learning rate matters and batch size does not, a 9-point grid (3x3) gives you only 3 distinct learning rate values -- the remaining 6 experiments are wasted evaluating batch size variations that produce nearly identical results. Nine random samples give you 9 distinct learning rate values, covering the important dimension more densely. Random search automatically allocates more coverage to every dimension simultaneously because each sample is unique in all dimensions.

```python
import random

# Random search: 12 experiments, better coverage.
# train, evaluate, and log_experiment are placeholders for your own functions.
for i in range(12):
    lr = 10 ** random.uniform(-4, -1)   # log-uniform: 0.0001 to 0.1
    bs = random.choice([32, 64, 128, 256])
    dropout = random.uniform(0.1, 0.5)
    model = train(lr=lr, batch_size=bs, dropout=dropout)
    score = evaluate(model)
    log_experiment(lr=lr, batch_size=bs, dropout=dropout, score=score)
```
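
The coverage argument can be checked directly by counting how many distinct learning rates each strategy tries with the same budget of 9 runs. This is a small sketch with illustrative ranges and a fixed seed; no models are trained.

```python
import itertools
import random

lr_grid = [0.001, 0.01, 0.1]
bs_grid = [32, 64, 128]

# 3x3 grid: every learning rate is paired with every batch size.
grid_points = list(itertools.product(lr_grid, bs_grid))

# 9 random samples: each draws its own learning rate.
rng = random.Random(0)
random_points = [(10 ** rng.uniform(-3, -1), rng.choice(bs_grid)) for _ in range(9)]

grid_distinct_lr = len({lr for lr, _ in grid_points})
random_distinct_lr = len({lr for lr, _ in random_points})
print("distinct learning rates, grid:  ", grid_distinct_lr)
print("distinct learning rates, random:", random_distinct_lr)
```

If batch size turns out not to matter, the grid has effectively run 3 experiments, while the random sample has run 9.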

The log-uniform distribution for learning rate is critical. Learning rates typically span orders of magnitude (0.0001 to 0.1), so uniform sampling would oversample the high end: about 90% of uniform draws from that range land between 0.01 and 0.1. Log-uniform gives equal probability to each order of magnitude. To implement log-uniform sampling, sample the exponent uniformly and exponentiate: lr = 10 ** uniform(-4, -1) yields values between $10^{-4}$ and $10^{-1}$ whose logarithm is uniformly distributed.
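
To see the difference between the two sampling schemes, the sketch below draws from each and counts the share of samples per decade. The sample size and seed are arbitrary choices, and `decade_fractions` is a hypothetical helper.

```python
import math
import random


def decade_fractions(samples, low_exp=-4, high_exp=-1):
    """Fraction of samples in each decade [10^k, 10^(k+1)) for k in low_exp..high_exp-1."""
    counts = {k: 0 for k in range(low_exp, high_exp)}
    for value in samples:
        # Clamp so that boundary values fall into the first or last decade.
        k = min(max(math.floor(math.log10(value)), low_exp), high_exp - 1)
        counts[k] += 1
    return {k: c / len(samples) for k, c in counts.items()}


rng = random.Random(0)
n = 20_000
uniform_samples = [rng.uniform(1e-4, 1e-1) for _ in range(n)]
log_uniform_samples = [10 ** rng.uniform(-4, -1) for _ in range(n)]

uniform_fracs = decade_fractions(uniform_samples)
log_uniform_fracs = decade_fractions(log_uniform_samples)
print("uniform:    ", uniform_fracs)
print("log-uniform:", log_uniform_fracs)
```

Under uniform sampling the decade from 0.01 to 0.1 receives the overwhelming majority of draws, whereas log-uniform sampling gives each of the three decades roughly a third.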

Interview Tip

When starting a new tuning project, run random search first to identify which hyperparameters matter. Look at the results sorted by performance and check which parameters vary among the top 10 runs. If all top runs have learning rate near 0.001 but batch size varies widely, learning rate is sensitive and batch size is not. Then narrow your search to the sensitive parameters.

Bayesian optimization

Random search treats each experiment as independent -- it does not learn from previous results. Bayesian optimization builds a probabilistic model (called a surrogate model) of the relationship between hyperparameters and performance, then uses that model to decide which configuration to try next.

The process works in a loop:

Run a few initial experiments (random or grid) to seed the surrogate model.
Fit the surrogate model (typically a Gaussian Process or Tree-structured Parzen Estimator) to the observed results.
Use an acquisition function (Expected Improvement, Upper Confidence Bound) to select the next hyperparameter configuration that balances exploration (trying uncertain regions) and exploitation (trying regions near known good results).
Train the model with the selected configuration, observe the result, and update the surrogate model.
Repeat until the budget is exhausted.

Bayesian optimization typically finds better configurations in fewer trials than random search because it directs the search toward promising regions. The tradeoff is sequential dependency: each trial depends on the results of previous trials, so parallelization is harder. There are batch variants (q-Expected Improvement) that suggest multiple configurations at once, but they sacrifice some of the efficiency gain.
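
The acquisition step is easier to picture with numbers. The sketch below computes Expected Improvement (for maximization) under the assumption that the surrogate returns a Gaussian prediction for each candidate. The two candidates and their predicted means and standard deviations are invented for illustration.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Expected Improvement over best_so_far for a Gaussian prediction (maximization)."""
    if std <= 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * normal_cdf(z) + std * normal_pdf(z)


best_f1 = 0.90  # best validation F1 observed so far (hypothetical)

# Candidate A: close to known good results, predicted slightly better, very certain.
ei_exploit = expected_improvement(mean=0.905, std=0.001, best_so_far=best_f1)
# Candidate B: unexplored region, predicted slightly worse, very uncertain.
ei_explore = expected_improvement(mean=0.890, std=0.050, best_so_far=best_f1)

print(f"EI near known optimum:   {ei_exploit:.4f}")
print(f"EI in uncertain region:  {ei_explore:.4f}")
```

With these numbers the uncertain candidate scores higher even though its predicted mean is lower, which is the exploration side of the tradeoff. As observations accumulate there and the uncertainty shrinks, the balance shifts back toward exploitation.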

Tools like Optuna, Ray Tune, and Ax implement Bayesian optimization with different surrogate models and parallelization strategies.

Gaussian Processes (GPs) fit a smooth surrogate function and provide uncertainty estimates at every point. They work well when the hyperparameter space is small (under 10 dimensions) and continuous, but they scale poorly -- fitting a GP to N observations takes O(N^3) time, so after a few hundred trials the surrogate model itself becomes a bottleneck.

Tree-structured Parzen Estimators (TPE), used by Optuna, model the probability of good outcomes and bad outcomes separately, then choose configurations that are likely under the "good" distribution and unlikely under the "bad" distribution. TPE handles categorical and conditional hyperparameters naturally (e.g., "if optimizer is Adam, tune beta1; if optimizer is SGD, tune momentum") and scales better than GPs to higher dimensions.

The practical advice: start with Optuna's TPE sampler for most problems. Switch to a GP-based method (Ax, BoTorch) when you have expensive evaluations (each trial takes days), a small continuous search space, and need maximum sample efficiency.

Early stopping

Not every training run deserves to finish. If a run's validation loss at epoch 5 is 3x worse than the best run at epoch 5, it is almost certainly not going to catch up by epoch 50. Early-stopping schedulers such as successive halving and Hyperband exploit this: they terminate unpromising runs early and reallocate compute to more promising ones.

Successive halving, the building block of Hyperband, formalizes the idea: start many configurations with a small budget (few epochs), evaluate them, keep the top fraction, multiply the survivors' budget, and repeat. With a keep-one-in-four rule, you evaluate 100 configurations at 1 epoch, keep the top 25 at 4 epochs, keep the top 6 at 16 epochs, and run only the best 2 to completion. Assuming a full run is 50 epochs, that is roughly 300 to 400 epochs in total -- less than the cost of 10 full training runs -- yet you have explored 100 configurations. Hyperband runs several such brackets with different starting budgets, which hedges against configurations that start slowly but finish strongly.
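
The budget arithmetic for the schedule above (100, then 25, then 6, then 2 configurations) can be worked through explicitly. This sketch assumes a full training run is 50 epochs, taken from the epoch-50 example earlier; the original does not state the full-run length for this schedule.

```python
FULL_RUN_EPOCHS = 50  # assumption: length of a complete training run

# (number of configurations, epochs each has trained by the end of the rung)
rungs = [(100, 1), (25, 4), (6, 16), (2, FULL_RUN_EPOCHS)]


def epochs_if_restarted(rungs):
    """Total epochs if every rung retrains its survivors from scratch."""
    return sum(n * epochs for n, epochs in rungs)


def epochs_if_resumed(rungs):
    """Total epochs if survivors resume from the checkpoint of the previous rung."""
    total, previous_epochs = 0, 0
    for n, epochs in rungs:
        total += n * (epochs - previous_epochs)
        previous_epochs = epochs
    return total


restarted = epochs_if_restarted(rungs)
resumed = epochs_if_resumed(rungs)
print(f"restart each rung: {restarted} epochs = {restarted / FULL_RUN_EPOCHS:.1f} full runs")
print(f"resume from ckpt:  {resumed} epochs = {resumed / FULL_RUN_EPOCHS:.1f} full runs")
```

Either way, 100 configurations are screened for less than the cost of 10 full runs, and checkpointing survivors reduces the bill further.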

In practice, early stopping should be combined with your search strategy, not treated as a separate concern. Tools like Optuna integrate Hyperband-style pruning directly into their optimization loop: a Bayesian optimizer suggests configurations, and a pruner terminates underperforming trials before they consume their full budget. This combination -- intelligent selection of what to try next, plus aggressive early termination of bad configurations -- is among the most compute-efficient approaches in common use.

Search space design

The search space itself is a design decision. Common mistakes include searching too broadly (learning rate from 1e-8 to 1.0 when the reasonable range is 1e-5 to 1e-2), using linear scale for parameters that vary by orders of magnitude, and including irrelevant hyperparameters that add dimensions without improving results.

Start with published baselines for your model architecture. ResNets typically train well with learning rates around 0.1 with SGD or 0.001 with Adam. Use those as center points and search within one order of magnitude in each direction. Narrow the space after each round of experiments based on where the top results cluster.

Another common mistake is tuning too many hyperparameters simultaneously. If you have 8 possible knobs, do not search over all 8 at once. Start with the 2-3 that typically matter most for your architecture (learning rate and weight decay for transformers, learning rate and momentum for CNNs). Fix the rest at published defaults. Once you find good values for the critical parameters, do a second pass over the remaining ones. This staged approach covers the most impactful dimensions first and avoids wasting budget on combinations of irrelevant parameters.

Finally, document your search space decisions. Six months from now, when someone retrains this model on new data, they will ask "what hyperparameter ranges should I search?" If the answer is "check the experiment tracker and look at which ranges worked last time," that is a good system. If the answer is "ask the person who tuned it originally," that is a fragile system that breaks the first time someone leaves the team.

Model Versioning and Registry

Experiment tracking tells you which run produced a good model. A model registry tells you which model is in production, who approved it, and what happened to the three versions before it. These are different concerns: tracking is for the experimentation phase, the registry is for the deployment lifecycle.

Think of the model registry as the equivalent of a container registry (Docker Hub, ECR) but for ML models. Just as you would never deploy a Docker image by copying files from a developer's laptop, you should never deploy a model by copying weights from a Jupyter notebook. The registry provides a single source of truth for what is deployable.

Without a registry, model deployment degenerates into tribal knowledge. "The good model is in Sarah's /home/sarah/models/v3_final_FINAL/ directory." When Sarah goes on vacation or the machine is decommissioned, the model is lost. Even worse, without centralized versioning, two teams might deploy different model versions to different serving clusters without realizing it, creating inconsistent behavior across your product.

What a model registry stores

Each entry in the registry represents a versioned model with:

Model artifacts. The serialized weights, architecture definition, and any preprocessing components (tokenizer, feature scaler, label encoder) needed to run inference. A model without its preprocessing pipeline is useless -- if the training pipeline normalized inputs to zero mean and unit variance, the serving pipeline must apply the same normalization with the same statistics. This is such a common source of bugs that many teams adopt a "model bundle" pattern: serialize the model and all preprocessing steps into a single artifact (MLflow's log_model() with a custom PythonModel wrapper, or a TorchServe archive). The bundle guarantees that deployment cannot accidentally separate the model from its preprocessing.

Metadata. Training run ID (linking back to the experiment tracker), training dataset version, framework version (PyTorch 2.1.0), performance metrics on standardized evaluation sets, model size, inference latency on reference hardware, and the author who registered the model. Metadata is what makes the registry searchable: "show me all fraud detection models registered in the last 30 days with F1 above 0.92 and p99 latency under 100ms."

Lineage. The complete chain from raw data to deployed model: which data pipeline produced the training set, which training script ran, which experiment run produced the weights, and which CI/CD pipeline built the serving container. Lineage answers "why is this model making bad predictions?" by letting you trace backward from the production model to its training data. Without lineage, a model in the registry is an opaque binary blob -- you know what it does (from evaluation metrics) but not why it does it (from training provenance).
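
A registry entry and the metadata query described above might look like the following. This is a minimal in-memory sketch; `ModelVersion`, `find_candidates`, and every value in the sample entries are hypothetical and do not reflect the schema of any particular registry product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    run_id: str            # link back to the experiment tracker
    data_version: str      # training dataset version
    framework: str
    val_f1: float          # metric on the standardized evaluation set
    p99_latency_ms: float  # measured on reference hardware
    registered_on: date


def find_candidates(entries, name, min_f1, max_p99_ms, since):
    """Return versions of a model that meet the metric, latency, and recency filters."""
    return [
        e for e in entries
        if e.name == name
        and e.val_f1 > min_f1
        and e.p99_latency_ms < max_p99_ms
        and e.registered_on >= since
    ]


entries = [
    ModelVersion("fraud-detection", 7, "run-a", "v2.3", "PyTorch 2.1.0", 0.93, 85.0, date(2024, 6, 20)),
    ModelVersion("fraud-detection", 6, "run-b", "v2.2", "PyTorch 2.1.0", 0.94, 140.0, date(2024, 6, 10)),
    ModelVersion("fraud-detection", 5, "run-c", "v2.1", "PyTorch 2.1.0", 0.91, 70.0, date(2024, 3, 1)),
]

as_of = date(2024, 7, 1)  # fixed date so the example is deterministic
matches = find_candidates(entries, "fraud-detection", min_f1=0.92,
                          max_p99_ms=100.0, since=as_of - timedelta(days=30))
print([m.version for m in matches])
```

Because each entry carries `run_id` and `data_version`, a match can be traced back to its experiment run and training data without asking anyone.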

The staging lifecycle

Models move through stages, each with different access controls and quality gates:

Development -- the model was registered from an experiment run. It exists in the registry but has not been evaluated against production standards. Any team member can register a Development model.

Staging -- the model passed automated validation (accuracy thresholds, latency benchmarks, bias checks) and is a candidate for production. Promoting to Staging typically requires review by a second team member.

Production -- the model is serving live traffic. Only one version (or a small set for A/B testing) should be in Production at any time. Promotion to Production usually requires sign-off from both ML engineers and product stakeholders.

Archived -- a previously Production model that has been replaced. Archived models are retained for rollback capability and audit trails. Never delete archived models within the retention window -- you will need them when the new production model regresses and you need to revert within minutes.

Each stage transition should be an auditable event with a timestamp, the user who initiated it, and the reason. "Promoted to Staging because validation F1 improved by 2.3% over current production model, passing all automated gates" is a useful transition note. "Promoted to Staging" is not. These notes become the institutional memory of your model deployment history -- six months later, they answer "why did we switch models in July?"

```text
Development --> Staging --> Production --> Archived
    |              |            |
    |              |            +--> Rollback target
    |              +--> Rejected (back to Development)
    +--> Abandoned (never promoted)
```
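
The lifecycle can be enforced as a small state machine that rejects skipped stages and empty transition notes. The sketch below is one reading of the diagram: it treats rollback as Archived back to Production, and the 20-character minimum for the note is an arbitrary illustrative rule. `RegisteredModel` and `ALLOWED_TRANSITIONS` are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# One interpretation of the lifecycle diagram; adjust to your own process.
ALLOWED_TRANSITIONS = {
    "Development": {"Staging", "Abandoned"},
    "Staging": {"Production", "Development"},  # Development = rejected
    "Production": {"Archived"},
    "Archived": {"Production"},                # rollback
    "Abandoned": set(),
}


@dataclass
class RegisteredModel:
    name: str
    version: int
    stage: str = "Development"
    history: list = field(default_factory=list)

    def transition(self, new_stage, user, reason):
        """Move to new_stage, recording who did it, when, and why."""
        if new_stage not in ALLOWED_TRANSITIONS[self.stage]:
            raise ValueError(f"{self.stage} -> {new_stage} is not an allowed transition")
        if len(reason.strip()) < 20:  # illustrative threshold, not a standard
            raise ValueError("transition note must explain why, not just what")
        self.history.append({
            "from": self.stage,
            "to": new_stage,
            "user": user,
            "reason": reason,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        self.stage = new_stage
```

For example, `RegisteredModel("fraud-detection", 7).transition("Production", ...)` raises because Staging was skipped, and a note such as "Promoted" is rejected as too short to be useful six months later.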

Model cards

A model card is a structured document that accompanies each registered model. It describes what the model does, how it was trained, and where it should and should not be used. Model cards are not optional paperwork -- they are the primary communication channel between the ML team and the downstream consumers of the model.

A model card should include:

Intended use case and out-of-scope uses. "This model classifies customer support tickets into 12 categories. It was not trained on or evaluated for legal document classification."
Training data description. Sources, size, date range, known biases or gaps. "Trained on 500K tickets from January-June 2024. Under-represents tickets from the APAC region (3% of training data vs. 15% of production traffic)."
Evaluation metrics broken down by category, language, or demographic subgroup where applicable.
Known limitations. "Accuracy drops below 70% on inputs shorter than 10 tokens. Misclassifies billing inquiries as technical support 8% of the time."
Ethical considerations and the recommended update cadence.

Interview Tip

When writing a model card, optimize for the reader who will be debugging this model at 2 AM during an incident. They need to know: what data was this trained on, what are the known failure modes, and what is the rollback procedure. Skip the marketing language and focus on actionable operational information.

The reproducibility recipe

A registered model should be reproducible from its metadata alone. The reproducibility recipe has four ingredients:

Frozen dependencies. A requirements.txt or conda.yaml with pinned versions for every library. "torch>=2.0" is not reproducible. "torch==2.1.0" built against CUDA 11.8, alongside "numpy==1.24.3", is reproducible. Better yet, link to the exact Docker image used for training.

Data snapshot. A pointer to the immutable version of the training data -- a DVC hash, an S3 path with versioning enabled, or a dataset registry entry. The data must be retrievable months later, not just "whatever was in the /data directory at the time." Tools like DVC (Data Version Control) extend Git's versioning model to large files and datasets: you commit a small metadata file to Git, and DVC stores the actual data in remote storage (S3, GCS, Azure Blob). The combination of a Git commit hash and a DVC pointer uniquely identifies both the code and the data used for any training run.

Random seed. The seed used for weight initialization, data shuffling, and any stochastic components. Combined with deterministic CUDA operations (when available), this makes the training run bit-for-bit reproducible on the same hardware and software stack. Note that full determinism across different GPU architectures is not always achievable -- some CUDA kernels use non-deterministic algorithms by default for performance. Set torch.use_deterministic_algorithms(True) if exact reproducibility is required, but expect slower training (10-20% is a common rule of thumb; the actual cost depends on which operations the model uses), and note that PyTorch raises an error for operations that have no deterministic implementation.

Code version. The git commit hash plus a record of any uncommitted changes. If the model was trained from a dirty working directory, the diff of uncommitted changes must be stored alongside the commit hash.

Together, these four elements let anyone on the team recreate the exact model months or years later -- essential for debugging, auditing, and regulatory compliance.
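
The four ingredients can be captured in a single record at training time. The sketch below uses only the standard library; `build_recipe` and `pinned_dependencies` are hypothetical helpers, and the commit hash and data fingerprint are passed in as arguments on the assumption that they are obtained elsewhere (for example from Git and DVC).

```python
import json
import platform
import random
from importlib import metadata


def pinned_dependencies():
    """List every installed distribution as an exact 'name==version' pin."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:
            pins.add(f"{name}=={version}")
    return sorted(pins)


def build_recipe(seed, git_commit, data_fingerprint):
    """Seed Python's RNG and return a JSON-serializable reproducibility record."""
    # Only Python's own RNG is seeded here; NumPy and the training framework
    # need their own seeding calls.
    random.seed(seed)
    return {
        "seed": seed,
        "git_commit": git_commit,
        "data_fingerprint": data_fingerprint,
        "python": platform.python_version(),
        "dependencies": pinned_dependencies(),
    }


recipe = build_recipe(seed=42, git_commit="a1b2c3d", data_fingerprint="sha256:placeholder")
print(json.dumps({k: v for k, v in recipe.items() if k != "dependencies"}, indent=2))
```

Storing this record as an artifact of the run means the registry entry can point at one file instead of relying on four separate conventions.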

A common objection is that perfect reproducibility is too expensive: storing data snapshots takes space, pinning dependencies creates maintenance burden, and deterministic CUDA operations are slower. This is true, but the alternative is worse. You do not need perfect reproducibility for every experiment -- you need it for every model that enters production. Make the reproducibility recipe a requirement for promoting a model to Staging, not for starting a training run. This concentrates the effort where it matters and lets researchers iterate freely during exploration.

From Experiment to Production

The gap between "this model looks good in my notebook" and "this model is serving 10,000 requests per second in production" is where most ML projects fail. A model that achieves 95% accuracy on a held-out test set might have 300ms inference latency (breaking the 100ms SLA), exhibit bias against a demographic group (creating legal liability), or crash on edge-case inputs that never appeared in training data. The path from experiment to production must include systematic validation gates that catch these problems before they reach users.

Automated validation gates

Before a model can move from Staging to Production, it must pass a suite of automated checks. These are not suggestions -- they are hard gates that block deployment if any check fails.

Accuracy threshold. The model must meet or exceed a minimum performance metric on a standardized evaluation set. This set is versioned and fixed -- it does not change between model versions, so comparisons are valid. The threshold is usually set relative to the current production model: "new model must score no more than 1 percentage point below the current model on the standard eval set." A new model that is 5% better on a different eval set but 2% worse on the standard one does not pass. This prevents "metric shopping" -- the temptation to switch evaluation sets until you find one where the new model looks better.

Latency budget. Measure inference latency at the 50th, 95th, and 99th percentiles on reference hardware matching the production environment. A model that meets accuracy thresholds but has p99 latency of 500ms when the SLA requires 200ms is not production-ready. This gate catches architecture changes (bigger model, more layers) that improve accuracy at the cost of speed.

Bias checks. Evaluate model performance across demographic subgroups, geographic regions, or input categories. A hiring model that has 90% overall accuracy but 60% accuracy for a protected group fails the bias gate even though its aggregate metrics look strong. These checks are not just ethical -- they are increasingly required by regulation (EU AI Act, NYC Local Law 144).

Bias checks should be defined once and applied consistently to every model version. The common mistake is running bias checks manually and inconsistently -- checking one model against gender subgroups but forgetting to check the next model against age brackets. Encode the full set of subgroup evaluations into the automated pipeline so every model version gets the same scrutiny.

Input validation. Feed the model adversarial, malformed, and edge-case inputs: empty strings, extremely long inputs, special characters, inputs in unexpected languages or formats. The model should return graceful predictions or structured error responses, never crash or hang.

Resource consumption. Verify that the model's memory footprint fits within the serving infrastructure's limits. A model that requires 8 GB of GPU memory cannot deploy to instances with 4 GB GPUs. This gate is often overlooked because development machines have more resources than production instances. Check peak memory during inference (not just model loading), as some architectures allocate temporary buffers during forward passes that push memory usage above the static model size.
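The sketch below shows the shape of a peak-memory check using the standard library's tracemalloc. This is an assumption-heavy illustration: tracemalloc only sees allocations made through Python's allocators, so GPU memory and buffers allocated natively by an ML framework would have to be read from that framework's own counters instead. The 10% headroom is also an arbitrary illustrative value.

```python
import tracemalloc


def peak_python_memory_bytes(fn, *args):
    """Run fn(*args) and return the peak Python-level memory traced during the call."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak)
    finally:
        tracemalloc.stop()
    return peak


def passes_memory_gate(peak_bytes, limit_bytes, headroom=0.10):
    """True if peak usage leaves at least `headroom` of the limit unused."""
    return peak_bytes <= limit_bytes * (1 - headroom)
```

Measuring around the inference call, rather than after model loading, is what captures the temporary buffers that push usage above the static model size.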

These gates should run in an automated CI/CD pipeline triggered by every promotion request. Manual gates slow down iteration and are skipped under deadline pressure -- exactly when they are most needed.

Deployment strategies

Once a model passes all validation gates, you do not flip a switch and send 100% of traffic to it. Production deployment is gradual and reversible.

Shadow deployment. Route production traffic to both the current model and the new model, but only serve responses from the current model. Log the new model's predictions for offline comparison. This reveals performance differences on real production data -- which may differ from the evaluation set -- without any user impact. Run shadow deployment for days or weeks depending on traffic volume and distribution shifts. The key advantage is that user-facing risk is close to zero: since the new model's predictions are never shown to users, failures such as crashes, degraded accuracy, or increased latency show up in logs rather than in the user experience. This holds as long as the shadow model runs on isolated capacity, so its extra load cannot slow down the current model.
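The request path in shadow mode can be sketched as follows. The models are plain callables and the log is a list, both hypothetical stand-ins; a production system would typically invoke the shadow model asynchronously so it adds no latency to the served response.

```python
def handle_request(features, current_model, shadow_model, shadow_log):
    """Serve the current model's response; record the shadow model's output."""
    response = current_model(features)
    try:
        shadow_prediction = shadow_model(features)
        shadow_log.append(
            {"input": features, "current": response, "shadow": shadow_prediction}
        )
    except Exception as exc:
        # A shadow failure is logged for later analysis and never reaches the user.
        shadow_log.append(
            {"input": features, "current": response, "shadow_error": repr(exc)}
        )
    return response
```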

Canary rollout. Route a small percentage of traffic (1-5%) to the new model and monitor key metrics: prediction latency, error rate, business KPIs (click-through rate, conversion rate). If metrics hold, gradually increase traffic (5% to 10% to 25% to 50% to 100%) over hours or days. If any metric degrades beyond a configurable threshold, automatically route all traffic back to the old model. The canary percentage and promotion schedule should be codified in a deployment configuration, not decided ad-hoc by the engineer on call.
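A codified promotion schedule might be sketched like this. It assumes every monitored metric is "lower is better" (latency, error rate); metrics where higher is better, such as conversion rate, would need the comparison reversed. The step list, metric names, and 5% tolerance are illustrative.

```python
CANARY_STEPS = [1, 5, 10, 25, 50, 100]  # percent of traffic on the new model


def next_canary_percent(current_percent, metrics, baseline,
                        max_relative_degradation=0.05, steps=CANARY_STEPS):
    """Return the next traffic percentage, or 0 to route everything back."""
    for name, baseline_value in baseline.items():
        # Lower-is-better metrics: degrade means exceeding baseline by the tolerance.
        if metrics[name] > baseline_value * (1 + max_relative_degradation):
            return 0
    later_steps = [step for step in steps if step > current_percent]
    # Already at the final step: stay there.
    return later_steps[0] if later_steps else current_percent
```

Keeping the steps and tolerance in configuration means the on-call engineer executes the schedule rather than inventing it.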

A/B testing. When you need statistically rigorous comparison, split traffic 50/50 between the old and new model and measure business outcomes. This requires enough traffic to achieve statistical significance, which may take days for low-traffic endpoints. A/B testing is heavier than canary rollouts but provides causal evidence that the new model improves business metrics, not just ML metrics.
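For a binary business outcome such as conversion, one common way to check significance is a two-proportion z-test, sketched below with only the standard library. The choice of test is an assumption for illustration; the text does not prescribe one, and other outcome types call for different tests.

```python
import math


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Return (z, two_sided_p_value) for the difference in conversion rates B - A."""
    rate_a = conversions_a / n_a
    rate_b = conversions_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With few requests per arm the standard error stays large, which is why low-traffic endpoints may need days to reach significance.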

The typical progression is shadow first, then canary, then (optionally) A/B test. Shadow mode answers "does the new model behave reasonably?" Canary mode answers "does the new model work in production without breaking anything?" A/B testing answers "does the new model measurably improve outcomes?" Not every model update needs all three stages -- a minor retraining on fresh data might skip A/B testing and go straight from shadow to canary to full rollout. A major architecture change should use all three.

Approval workflows

Not every deployment should be fully automated. High-stakes models (credit scoring, medical diagnosis, content moderation) benefit from human approval gates between Staging and Production. The workflow typically requires sign-off from an ML engineer (model quality), a product manager (business impact), and in regulated industries, a compliance officer (regulatory requirements).

The approval process should be lightweight -- a checklist backed by the automated validation results, not a multi-day committee review. If the automated gates are well-designed, human review focuses on judgment calls that automation cannot make: "Is this model's performance improvement worth the increased latency?" or "Does this model's behavior on edge cases align with our product values?"

The worst pattern is an approval workflow that requires a committee meeting. If deploying a model requires scheduling a meeting with five stakeholders, the team will deploy less frequently, which means larger changes per deployment, which means higher risk per deployment. Keep the approval asynchronous -- a reviewer examines the automated gate results, the model card, and sample predictions, then approves or rejects with written feedback. Target a 24-hour turnaround, not a weekly meeting cycle.

Key Insight

The most dangerous deployment failure is not a model that crashes -- it is a model that serves confidently wrong predictions. A crashing model triggers alerts immediately. A subtly degraded model can serve bad predictions for weeks before anyone notices, because the predictions look plausible and no error is thrown. This is why shadow deployment and canary rollouts with business metric monitoring are essential -- they catch degradation that accuracy metrics alone miss.

Rollback procedures

Every production model deployment must have a documented, tested rollback procedure. "Tested" means someone has actually executed a rollback in a staging environment, not just written the steps in a runbook.

The rollback procedure should answer five questions. Where is the previous model version stored? (The registry, in the Archived stage.) How long does it take to swap? (Seconds if using a model serving platform with version switching; minutes if redeploying a container.) Who has permission to trigger a rollback? (On-call engineers, without needing manager approval during an incident.) What monitoring confirms the rollback succeeded? (Prediction latency returns to baseline, error rate drops, business metrics stabilize.) And finally, what is the blast radius if the rollback itself fails? (Have a fallback plan -- even if it is "disable the feature entirely and return a default response.")
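The "seconds to swap" case relies on the serving layer keeping a pointer to the active version plus a record of what it replaced. A toy sketch of that idea, with a hypothetical ModelRouter class that is not any particular serving platform's API:

```python
class ModelRouter:
    """Tracks the active model version and the versions it replaced."""

    def __init__(self, initial_version):
        self.active = initial_version
        self.history = []  # previously active versions, most recent last

    def deploy(self, version):
        self.history.append(self.active)
        self.active = version

    def rollback(self):
        """Reactivate the most recently replaced version and return it."""
        if not self.history:
            # Nothing to roll back to: this is where the fallback plan applies.
            raise RuntimeError("no previous version to roll back to")
        self.active = self.history.pop()
        return self.active
```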

A common mistake is treating rollback as a rare emergency procedure. In mature ML systems, rollback is routine. Models are deployed and rolled back during shadow testing, canary evaluation, and A/B tests. The rollback path should be exercised weekly, not tested for the first time during a 2 AM incident.

Lineage tracing

When a production model makes a bad prediction, you need to trace backward: which model version produced it, which training run created that model, which dataset was used, which preprocessing pipeline transformed the data, and which raw data sources fed the pipeline. This chain -- from prediction to raw data -- is lineage.

Lineage tracing is not just for debugging. It serves three distinct purposes:

Debugging. When predictions go wrong, trace backward from the serving log to the training data to find the root cause.
Compliance. Regulatory frameworks (GDPR provisions on automated decision-making, often summarized as a "right to explanation"; financial model audit requirements; the EU AI Act) may require demonstrating exactly how a prediction was produced, including the training data that influenced it.
Impact analysis. When a data source is discovered to contain errors, lineage tells you which models were trained on that data and need to be retrained. Without lineage, you must assume every model is potentially affected and retrain everything -- an expensive and unnecessary response.

Build lineage by connecting identifiers across systems: the serving log records model_version_id, the registry maps that to experiment_run_id, the experiment tracker maps that to data_version and code_commit, and the data pipeline maps data_version to the specific raw data files. Each system only needs to record one link in the chain; the full trace is assembled on demand by following the links.
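The link-following described above can be sketched with each system modeled as a dictionary. All identifiers, file paths, and store contents here are made up for illustration; in practice each lookup would be a query against the corresponding system.

```python
# Hypothetical contents of the four systems, each holding one link in the chain.
SERVING_LOG = {"req-123": {"model_version_id": "mv-7"}}
REGISTRY = {"mv-7": {"experiment_run_id": "run-42"}}
TRACKER = {"run-42": {"data_version": "data-v3", "code_commit": "abc1234"}}
DATA_PIPELINE = {"data-v3": {"raw_files": ["raw/part-0.parquet", "raw/part-1.parquet"]}}


def _follow(store, key, system_name):
    """Look up one link; fail loudly with the name of the system that broke the chain."""
    if key not in store:
        raise LookupError(f"lineage broken: {key!r} not found in {system_name}")
    return store[key]


def trace_lineage(request_id, serving_log, registry, tracker, data_pipeline):
    """Assemble the full trace from a served request back to raw data files."""
    model_version_id = _follow(serving_log, request_id, "serving log")["model_version_id"]
    run_id = _follow(registry, model_version_id, "registry")["experiment_run_id"]
    run = _follow(tracker, run_id, "experiment tracker")
    raw_files = _follow(data_pipeline, run["data_version"], "data pipeline")["raw_files"]
    return {
        "model_version_id": model_version_id,
        "experiment_run_id": run_id,
        "data_version": run["data_version"],
        "code_commit": run["code_commit"],
        "raw_files": raw_files,
    }
```

Calling trace_lineage("req-123", SERVING_LOG, REGISTRY, TRACKER, DATA_PIPELINE) walks all four links; a missing entry in any store raises an error naming the system where the chain broke.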

In practice, lineage tracing requires discipline at every stage of the pipeline. The data team must tag each dataset with a version. The training script must record which dataset version it consumed. The registry must store which training run produced each model. And the serving infrastructure must log which model version handled each request. If any link is missing, the chain breaks and you cannot trace end-to-end. The most common gap is between the data pipeline and the training script -- data engineers and ML engineers often work in separate systems, and connecting their identifiers requires explicit integration.

When lineage tracing works, debugging becomes systematic instead of heroic. "The model gave a bad prediction for user X at time T" becomes a tractable query: look up the model version from the serving log, find its training run in the registry, check the training data version, and examine whether the data distribution at training time matches the input that caused the bad prediction. Without lineage, this same investigation requires interviewing three teams and searching five systems manually.
