Model Monitoring and Drift Detection

Topics Covered

Why Models Degrade

Three Categories of Model Degradation

Why Monitoring Is Existential

Data Drift Detection

Population Stability Index (PSI)

Kolmogorov-Smirnov Test

Wasserstein Distance (Earth Mover's Distance)

Per-Feature vs Aggregate Monitoring

Windowed Comparison Strategy

Concept Drift Detection

Types of Concept Drift

Detection Without Ground Truth Labels

The Label Delay Problem

Performance Monitoring

Model-Specific Metrics

Infrastructure Metrics

SLOs for ML Models

Connecting Model Metrics to Business Outcomes

Retraining Strategies

Scheduled Retraining

Triggered Retraining

Continuous Training

Validation Gates

A/B Testing Retrained Models

Warm-Starting vs Training from Scratch

Data Windowing

Why Models Degrade

A machine learning model is a frozen snapshot of the world at the moment it was trained. Every statistical relationship it learned, every pattern it memorized, every decision boundary it drew reflects the data that existed during training. The world, however, does not freeze. Customer behavior shifts, market conditions change, upstream data pipelines get refactored, and entirely new categories of events appear that the model has never seen. The model keeps predicting as if nothing changed, and its predictions quietly become wrong.

This is not a theoretical concern. COVID-19 broke demand forecasting models across industries almost overnight. Customer behavior shifted so dramatically that models trained on 2019 data were predicting a world that no longer existed. Grocery delivery demand spiked 300 percent, airline booking models kept predicting normal seasonal travel, and fraud detection models flagged legitimate pandemic-era purchasing patterns as suspicious. Companies with no monitoring in place did not realize their models were failing until business metrics cratered weeks later.

Three Categories of Model Degradation

Data drift happens when the statistical distribution of input features changes over time. A credit scoring model trained on applications from 2020 sees a very different applicant pool in 2024. Income distributions shift, new employment categories appear, and the mix of applicant demographics changes. The model itself has not changed, but the data it receives no longer looks like the data it was trained on.
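One way to put a number on this kind of shift is the Population Stability Index listed in the topics above. The following is a minimal sketch, not a production implementation: it assumes a single numeric feature, uses quantile bins taken from the training sample, and runs on synthetic income data invented for illustration. The function name `psi` and its parameters are hypothetical.

```python
import bisect
import math
import random

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of one numeric feature (reference vs. current sample)."""
    # Bin edges are quantiles of the reference sample, so each bin holds
    # roughly 1/n_bins of the training-time data.
    ordered = sorted(reference)
    edges = [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor at eps so an empty bin does not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))

# Synthetic data, for illustration only.
rng = random.Random(0)
train_income = [rng.gauss(50_000, 10_000) for _ in range(5_000)]    # training-time applicants
stable_income = [rng.gauss(50_000, 10_000) for _ in range(5_000)]   # same population, later
shifted_income = [rng.gauss(60_000, 10_000) for _ in range(5_000)]  # applicant pool has moved

psi_stable = psi(train_income, stable_income)
psi_shifted = psi(train_income, shifted_income)
print(f"stable: {psi_stable:.3f}  shifted: {psi_shifted:.3f}")
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. These cutoffs are conventions rather than guarantees, and in practice they are usually tuned per feature.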

Concept drift is more subtle and more dangerous: the relationship between inputs and the correct output changes. A fraud detection model might have learned that transactions over $5,000 from new accounts are suspicious. But fraudsters adapt and start making many small transactions instead. The input features (transaction amount, account age) still fall within the training distribution, but the correct label for those inputs has changed. The model is confidently wrong.
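The fraud example can be turned into a small simulation. The sketch below is deliberately simplified: the learned rule, the 30-day "new account" cutoff, the $500 "small transaction" cutoff, and the deterministic labels are all assumptions made for illustration. The same transactions are scored in both periods, so the input distribution is identical by construction and only the labeling changes.

```python
import random

def model_flags(amount, account_age_days):
    """Frozen rule the hypothetical model learned at training time."""
    return amount > 5000 and account_age_days < 30

def label_before(amount, account_age_days):
    """Ground truth at training time: large transactions from new accounts."""
    return amount > 5000 and account_age_days < 30

def label_after(amount, account_age_days):
    """Ground truth after fraudsters adapt: small transactions from new accounts."""
    return amount < 500 and account_age_days < 30

def recall(transactions, label_fn):
    """Share of truly fraudulent transactions that the frozen model flags."""
    frauds = [t for t in transactions if label_fn(*t)]
    caught = [t for t in frauds if model_flags(*t)]
    return len(caught) / len(frauds)

# Synthetic (amount, account_age_days) pairs, reused for both periods.
rng = random.Random(0)
transactions = [(rng.uniform(10, 10_000), rng.randint(1, 365)) for _ in range(20_000)]

recall_before = recall(transactions, label_before)
recall_after = recall(transactions, label_after)
print(f"recall before drift: {recall_before:.2f}, after drift: {recall_after:.2f}")
```

Because the features have not moved, a check on the input distribution alone would report nothing here. Only a comparison against true labels, or a proxy for them, reveals the drop.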

Upstream data pipeline changes are among the most common causes of sudden model failures in production. An upstream team renames a column, changes a unit from dollars to cents, or modifies how a categorical feature is encoded. The model receives syntactically valid data that is semantically different from what it expects. A feature that used to represent "days since last purchase" now represents "hours since last purchase," and the model has no way to know.
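A simple range check against training-time statistics can catch a unit change like the days-to-hours example. The sketch below assumes the feature's training values are available when the check is set up. The helper names, the quantile bounds, and the 5 percent alert threshold are hypothetical choices, not recommended defaults.

```python
import random

def fit_range(training_values, lower_q=0.001, upper_q=0.999):
    """Record the range that covers nearly all training-time values."""
    ordered = sorted(training_values)
    lo = ordered[int(lower_q * (len(ordered) - 1))]
    hi = ordered[int(upper_q * (len(ordered) - 1))]
    return lo, hi

def out_of_range_fraction(batch, lo, hi):
    """Fraction of incoming values that fall outside the recorded range."""
    return sum(1 for v in batch if v < lo or v > hi) / len(batch)

ALERT_THRESHOLD = 0.05  # assumed tolerance; tune per feature

# Synthetic "days since last purchase" values, for illustration only.
rng = random.Random(0)
train_days = [rng.uniform(0, 365) for _ in range(10_000)]
lo, hi = fit_range(train_days)

healthy_batch = [rng.uniform(0, 365) for _ in range(2_000)]
broken_batch = [days * 24 for days in healthy_batch]  # upstream now sends hours

healthy_frac = out_of_range_fraction(healthy_batch, lo, hi)
broken_frac = out_of_range_fraction(broken_batch, lo, hi)

for name, frac in [("healthy", healthy_frac), ("broken", broken_frac)]:
    status = "ALERT" if frac > ALERT_THRESHOLD else "ok"
    print(f"{name}: {frac:.1%} out of range -> {status}")
```

A check like this catches changes that push values outside the expected range. It will not catch a change that stays inside the range, such as a re-encoded categorical feature that reuses the same codes with different meanings.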

Why Monitoring Is Existential

Without monitoring, model degradation is invisible. Unlike a crashed server that triggers an alert or a failed database query that returns an error, a degraded model still returns predictions. It returns 200 OK responses with confidently wrong answers. No exception is thrown, no error log is written. The model silently serves bad predictions, and the damage compounds until someone notices business metrics declining and traces the problem back to the model, often weeks or months later.

Monitoring transforms model degradation from an invisible, slow-burning problem into a detectable, actionable signal. It is the difference between discovering a model failure in minutes and discovering it in months.

Key Insight

In interviews, candidates often focus on model accuracy at training time and neglect post-deployment monitoring entirely. Demonstrating that you understand models degrade and that monitoring is a first-class production concern, not an afterthought, signals real-world ML experience.