The ML Development Lifecycle

Topics Covered

Problem Framing

From Business Goal to ML Task

Do You Even Need ML?

Defining Success Metrics

Data Collection and Preparation

Data Collection

Data Cleaning

Feature Engineering

Training and Validation

Train, Validation, and Test Splits

Hyperparameter Tuning

Model Versioning and Reproducibility

Deployment and Iteration

Deployment Strategies

The Hidden Technical Debt in ML Systems

Monitoring and Retraining Triggers

Problem Framing

The most expensive mistake in ML is solving the wrong problem. Before writing a single line of model code, you must convert a vague business goal ("reduce fraud," "improve recommendations," "automate support") into a precise ML task specification with measurable success criteria, latency constraints, and data requirements.

From Business Goal to ML Task

A business goal like "reduce fraud" is not an ML task. An ML task is: "classify each transaction as fraudulent or legitimate within 100ms, achieving at least 95% recall at 80% precision, using features available at transaction time." Every word in that specification matters:

Task type — classification (fraudulent vs legitimate), not regression, not ranking. This determines the model architecture, loss function, and evaluation metrics.

Latency constraint — 100ms means the model must run inline during transaction processing, not as an overnight batch job. This constrains the model complexity and serving architecture.

Success criteria — 95% recall at 80% precision quantifies the acceptable trade-off between missed fraud and false alarms, tied to business cost. Without this, "better" is undefined.

Feature availability — "features available at transaction time" means no future information. You cannot use whether a chargeback was filed (that happens weeks later) as an input feature during live scoring.
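
As a rough illustration, the specification above can be written down as a small data structure that candidate models are checked against. This is a minimal sketch: the `TaskSpec` class and `meets_spec` function are hypothetical names introduced here, and the thresholds simply restate the fraud example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container for an ML task specification."""
    task_type: str          # e.g. "binary_classification"
    max_latency_ms: float   # serving latency budget per request
    min_recall: float       # fraction of fraud that must be caught
    min_precision: float    # fraction of fraud alerts that must be correct


def meets_spec(spec: TaskSpec, precision: float, recall: float,
               latency_ms: float) -> bool:
    """Return True only if every constraint in the spec is satisfied."""
    return (
        precision >= spec.min_precision
        and recall >= spec.min_recall
        and latency_ms <= spec.max_latency_ms
    )


# The fraud example: 100ms budget, 95% recall at 80% precision.
fraud_spec = TaskSpec("binary_classification", 100.0, 0.95, 0.80)

# An accurate but slow model fails on latency alone.
slow_ok = meets_spec(fraud_spec, precision=0.90, recall=0.97, latency_ms=10_000)
# A slightly weaker but fast model passes.
fast_ok = meets_spec(fraud_spec, precision=0.82, recall=0.95, latency_ms=40)
print(slow_ok, fast_ok)  # prints: False True
```

Writing the spec down this way makes "better" testable: a candidate either satisfies every constraint or it does not.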

Key Insight

Most ML projects fail not because the model is bad, but because the problem was poorly framed. A perfectly accurate model that runs in 10 seconds is useless for a 100ms latency requirement. A model optimized for accuracy instead of recall lets fraud through. Get the framing right before touching data.

Do You Even Need ML?

Not every problem needs ML. Rule-based systems are simpler, more interpretable, and easier to debug. ML adds value when:

The patterns are too complex for rules — 500 hand-written fraud rules catch known patterns. ML catches unknown patterns by learning from millions of examples.

The patterns change over time — fraud tactics evolve. A rule-based system requires manual updates. An ML model can be retrained on new data automatically.

The data is unstructured — rules cannot classify images, parse natural language, or interpret audio. ML excels at unstructured data.

If a simple heuristic solves 90% of the problem, start there. Layer ML on top for the remaining 10%. The hybrid approach (rules for known patterns, ML for novel ones) is extremely common in production — fraud detection, content moderation, and spam filtering all use rule-ML hybrids.

Defining Success Metrics

Every ML project needs two types of metrics:

Model metrics measure the model's technical performance: precision, recall, F1, AUC, NDCG, latency percentiles. These are what data scientists track during development.

Business metrics measure real-world impact: fraud losses reduced, customer support tickets deflected, revenue from better recommendations, user engagement changes. These are what matter to stakeholders.

The relationship between model metrics and business metrics is not always linear. A 5% improvement in model accuracy might produce a 20% reduction in fraud losses, or no reduction at all if the accuracy gain is on easy cases the rule-based system already catches. Define both, track both, and validate through A/B testing that improving model metrics actually improves business metrics.
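
To make the two kinds of metric concrete, the sketch below computes precision, recall, and F1 from confusion counts and, separately, a simple cost figure. The function names and the dollar costs per error are assumptions made up for illustration, not values from any real system.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Model metrics computed from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def error_cost(fp: int, fn: int, cost_per_false_alarm: float,
               cost_per_missed_fraud: float) -> float:
    """A toy business metric: total cost of false alarms plus missed fraud."""
    return fp * cost_per_false_alarm + fn * cost_per_missed_fraud


# Illustrative counts: 95 frauds caught, 20 false alarms, 5 frauds missed.
precision, recall, f1 = precision_recall_f1(tp=95, fp=20, fn=5)

# Assumed costs: $5 to review a false alarm, $500 per missed fraud.
cost = error_cost(fp=20, fn=5, cost_per_false_alarm=5.0,
                  cost_per_missed_fraud=500.0)
print(round(precision, 3), recall, cost)
```

With these assumed costs, the five missed frauds account for most of the total, which is why the fraud example weights recall so heavily.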

Data Collection and Preparation

Data quality determines the ceiling of model quality. No amount of model sophistication can overcome bad data. A simpler model on clean, well-labeled data will usually outperform a complex model on noisy, biased data. This is why ML teams are often said to spend 60-80% of their time on data work, and why data infrastructure is among the highest-leverage investments in an ML platform.

Data Collection

Data comes from three sources:

Organic data is generated by your product naturally — user clicks, purchases, searches, page views, session logs. This is the most abundant source and requires no additional cost to collect, but it comes with selection bias (you only see data from users who use your product) and implicit labels (you observe behavior, not intent).

Labeled data requires human annotation — a person examines each example and assigns a label. This is expensive ($0.01-$10 per label depending on task complexity), slow (weeks to months for large datasets), and quality-dependent on annotator expertise and guidelines. But it is the gold standard for supervised learning.

Synthetic data is generated programmatically — augmenting real examples (rotating images, paraphrasing text) or creating entirely new examples using rules, simulations, or generative models. Synthetic data can be unlimited in quantity but may not capture the full complexity of real-world distributions.

Data Cleaning

Real-world data is messy. Cleaning is not optional — it is the highest-return activity in any ML project:

Missing values — a feature missing for 5% of examples is manageable (impute with median or a learned value). A feature missing for 50% is either useless or needs a separate data source.

Duplicates — exact and near-duplicates inflate metrics by leaking training examples into the test set. Deduplication is especially critical for text data scraped from the web.

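Label noise — human annotators disagree, make mistakes, and vary in quality. If 10% of the evaluation labels are wrong, even a perfect model scores only about 90% against them, so measured accuracy is capped near that level regardless of model sophistication. Measuring inter-annotator agreement (for example, Cohen's kappa between two annotators) estimates how noisy the labels are.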

Distribution shifts — data collected in January may not represent July traffic. Seasonal products, trending events, and user behavior changes make older data less representative. Using a time-based split (train on older data, test on recent data) reveals how well the model handles temporal shifts.
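
The inter-annotator agreement mentioned under label noise can be computed directly. Below is a minimal sketch of Cohen's kappa for two annotators using only the standard library; the function name and the example labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same class independently.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class
    return (observed - expected) / (1 - expected)


# Two annotators labelling the same ten transactions (1 = fraud).
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Here the annotators agree on 8 of 10 items, but chance alone would produce 52% agreement, so kappa is noticeably lower than the raw 80%.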

Common Pitfall

The most dangerous data bug is one you cannot see in aggregate statistics. A model trained on US-centric data performs well on average metrics but fails on international users. A model trained on weekday data degrades on weekends. Always slice evaluation metrics by important dimensions — geography, time, user segment, device type — to catch hidden failures.
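
A sliced evaluation can be as simple as grouping predictions by a dimension before computing the metric. The sketch below assumes each record is a dict with hypothetical `label` and `prediction` keys plus the slicing column; the data is invented to show how a healthy aggregate can hide a failing segment.

```python
from collections import defaultdict


def accuracy_by_slice(records: list[dict], slice_key: str) -> dict:
    """Compute accuracy separately for each value of slice_key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        key = record[slice_key]
        totals[key] += 1
        correct[key] += int(record["label"] == record["prediction"])
    return {key: correct[key] / totals[key] for key in totals}


# Invented data: eight US examples, all correct; two BR examples, one wrong.
records = [{"country": "US", "label": 1, "prediction": 1} for _ in range(8)]
records += [
    {"country": "BR", "label": 1, "prediction": 1},
    {"country": "BR", "label": 1, "prediction": 0},
]

overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_country = accuracy_by_slice(records, "country")
print(overall, per_country)
```

The overall accuracy is 90%, while the smaller segment sits at 50%, which the aggregate number alone would not reveal.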

Feature Engineering

Raw data is rarely suitable as direct model input. Feature engineering transforms raw data into representations that make patterns easier for models to learn:

Normalization — scaling numeric features to comparable ranges (e.g., 0-1 or mean 0, std 1) so that features with large absolute values (salary in dollars) do not dominate features with small values (age in years).

Encoding — converting categorical variables (country, device type, product category) into numeric representations. One-hot encoding creates a binary column per category. Embeddings learn dense vector representations for high-cardinality categories.

Aggregation — computing statistics over time windows. "Number of purchases in the last 7 days" is more informative than a raw list of purchase timestamps. These windowed aggregations are the backbone of feature engineering for user behavior models.

Interaction features — combining features to capture relationships. "Price per square foot" is more informative than price and square footage separately for a house price model.
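
The four transformations above can be sketched with the standard library alone. The function names are hypothetical, and a real pipeline would typically use a feature store or a library such as pandas; the point here is only to show what each transformation computes.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def standardize(values: list[float]) -> list[float]:
    """Normalization: rescale to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(value: str, categories: list[str]) -> list[int]:
    """Encoding: one binary column per known category."""
    return [1 if value == c else 0 for c in categories]


def purchases_in_window(timestamps: list[datetime], as_of: datetime,
                        days: int = 7) -> int:
    """Aggregation: count events in the window ending at as_of.

    Events after as_of are excluded, so no future information leaks in.
    """
    start = as_of - timedelta(days=days)
    return sum(start < t <= as_of for t in timestamps)


def price_per_sqft(price: float, square_feet: float) -> float:
    """Interaction feature: combine two raw features into one ratio."""
    return price / square_feet


scaled = standardize([2.0, 4.0, 6.0])
device = one_hot("ios", ["android", "ios", "web"])
purchases = [datetime(2024, 1, d) for d in (1, 4, 9, 12)]
recent = purchases_in_window(purchases, as_of=datetime(2024, 1, 10))
print(scaled, device, recent, price_per_sqft(300_000, 1_500))
```

Note that the window function takes an explicit `as_of` time, which keeps the feature consistent with the "available at prediction time" constraint.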

Training and Validation

Training is where the model learns from data. But a model that trains well is not necessarily a model that generalizes well. The training process must be structured to detect overfitting early, find the best configuration efficiently, and produce a model you can reproduce and audit months later.

Train, Validation, and Test Splits

The three-way split serves distinct purposes:

Training set (70-80%) — the data the model learns from. It sees this data repeatedly across many epochs, adjusting its parameters to minimize prediction error.

Validation set (10-15%) — used during training to detect overfitting and tune hyperparameters. The model never trains on this data, but you evaluate performance on it after each epoch and make decisions based on the results (when to stop training, which hyperparameters work best). Because you make decisions based on validation performance, there is a risk of indirectly overfitting to the validation set through hyperparameter selection.

Test set (10-15%) — used once at the end to get an unbiased performance estimate. The test set is the "final exam": it approximates how the model will perform on truly unseen data. Using it multiple times or peeking at it during development biases this estimate, because decisions start to adapt to the test data.

For small datasets (under 10,000 examples), use k-fold cross-validation: split data into k folds, train on k-1 folds, test on the held-out fold, rotate k times, average the results. This gives a more reliable estimate than a single split by using all data for both training and evaluation.
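
The k-fold procedure can be sketched as an index generator. This is a simplified illustration with a hypothetical function name; libraries such as scikit-learn provide production-ready equivalents, including stratified variants.

```python
import random


def k_fold_indices(n_examples: int, k: int, seed: int = 0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


splits = list(k_fold_indices(n_examples=10, k=5))
for train, held_out in splits:
    # In a real run: fit on `train`, score on `held_out`, then average scores.
    print(len(train), sorted(held_out))
```

Every example lands in exactly one held-out fold, so each one is used for evaluation once and for training k-1 times.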

Hyperparameter Tuning

Hyperparameters are the settings you choose before training begins — learning rate, batch size, number of layers, regularization strength, dropout rate. The right combination can mean the difference between a model that achieves 85% accuracy and one that achieves 92%.

Grid search evaluates every combination of predefined values. If you try 5 learning rates and 4 batch sizes, you run 20 training jobs. Exhaustive but expensive — cost grows exponentially with the number of hyperparameters.

Random search samples hyperparameter combinations randomly. Surprisingly, this often outperforms grid search because it explores more distinct values per hyperparameter. With a budget of 20 runs, random search samples 20 different learning rates versus grid search's 5.

Bayesian optimization (available in tools such as Optuna and Ray Tune) uses results from previous runs to choose the next configuration to try. If low learning rates have consistently underperformed, it focuses on higher rates. This typically finds good configurations in fewer runs than random search, which matters when each run costs hundreds or thousands of dollars in GPU time.
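
The difference between grid and random search is easy to see by counting distinct values. The sketch below builds the 5 x 4 grid from the example and then draws the same number of random configurations, sampling the learning rate on a log scale. The specific grid values and the sampling range are illustrative assumptions, and no model is trained here.

```python
import itertools
import math
import random

learning_rates = [1e-5, 3e-5, 1e-4, 1e-3, 1e-2]  # illustrative grid values
batch_sizes = [32, 64, 128, 256]

# Grid search: every combination, 5 * 4 = 20 training jobs.
grid = list(itertools.product(learning_rates, batch_sizes))


def sample_random_config(rng: random.Random) -> tuple[float, int]:
    """Draw one configuration; learning rate is log-uniform in [1e-5, 1e-2]."""
    log_lr = rng.uniform(math.log10(1e-5), math.log10(1e-2))
    return 10 ** log_lr, rng.choice(batch_sizes)


# Random search with the same budget of 20 runs.
rng = random.Random(0)
random_configs = [sample_random_config(rng) for _ in range(len(grid))]

distinct_grid_lrs = len({lr for lr, _ in grid})
distinct_random_lrs = len({lr for lr, _ in random_configs})
print(len(grid), distinct_grid_lrs, distinct_random_lrs)
```

With the same budget, the grid only ever tests five learning rates, while the random draws cover twenty different ones. That is the intuition behind random search often doing better when one hyperparameter matters much more than the others.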

Interview Tip

In practice, the most impactful hyperparameters are learning rate (try 1e-5 to 1e-2 on a log scale), batch size (32, 64, 128, 256), and regularization strength (weight decay from 0 to 0.1). These three account for most of the tuning gains. Do not waste compute tuning dozens of hyperparameters — focus on the big three and use default values for everything else as a starting point.

Model Versioning and Reproducibility

Every trained model should be fully reproducible — meaning someone can rerun your exact training configuration and get the same (or very similar) result. This requires tracking:

Code version — the exact git commit of training code. Different code versions may preprocess data differently, use different model architectures, or have different bugs.

Data version — which dataset (and which version of that dataset) was used. If the data changes, the model changes. Tools like DVC (Data Version Control) track dataset versions alongside code.

Hyperparameters — every configuration value used during training. Experiment tracking tools (MLflow, Weights & Biases) log these automatically.

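Random seeds — neural network training involves randomness (weight initialization, data shuffling, dropout). Fixing seeds is necessary for exact reproduction, though some GPU operations are nondeterministic by default, so results may still differ slightly unless the framework's deterministic modes are also enabled.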

Environment — Python version, library versions, GPU type, CUDA version. A model trained with PyTorch 2.0 may produce different results with PyTorch 2.1 due to numerical differences in underlying operations.

Experiment tracking tools automate this by logging every training run with its full configuration and results. When a model regresses in production six months later, you can trace back to the exact experiment that produced it and compare against the new training run to identify what changed.
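
For intuition about what such tools record, here is a minimal sketch of a run manifest built with the standard library. The function names, the field names, and the placeholder commit and dataset identifiers are all hypothetical; real trackers such as MLflow or Weights & Biases capture far more.

```python
import hashlib
import json
import platform
import random


def build_run_manifest(git_commit: str, data_version: str,
                       hyperparameters: dict, seed: int) -> dict:
    """Collect what is needed to rerun a training job, plus a fingerprint."""
    manifest = {
        "git_commit": git_commit,
        "data_version": data_version,
        "hyperparameters": hyperparameters,
        "seed": seed,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }
    # Hash a canonical JSON form so identical configs get identical IDs.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["manifest_sha256"] = hashlib.sha256(canonical).hexdigest()
    return manifest


def seeded_draws(seed: int, n: int) -> list[float]:
    """Same seed, same pseudo-random sequence (for CPU-side randomness)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Placeholder identifiers, for illustration only.
params = {"learning_rate": 1e-3, "batch_size": 64, "weight_decay": 0.01}
run_a = build_run_manifest("abc1234", "fraud-data-v3", params, seed=42)
run_b = build_run_manifest("abc1234", "fraud-data-v3", params, seed=43)
print(run_a["manifest_sha256"] == run_b["manifest_sha256"])  # prints: False
```

Because the fingerprint covers code, data, hyperparameters, seed, and environment, two runs with the same fingerprint were configured identically, and any difference in configuration shows up as a different hash.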

Deployment and Iteration

Deploying a model to production is not the end of the ML lifecycle; it is the beginning of operations. The model will degrade over time as the world changes. Deployment must be safe (catch bad models before they reach all users), monitored (detect degradation before it impacts business), and iterative (retrain and redeploy regularly).

[Figure: ML lifecycle wheel showing continuous cycle of data, train, evaluate, deploy, monitor, retrain]

Deployment Strategies

A new model should never go from training directly to serving 100% of traffic. Deployment is a gradual process with checkpoints at each stage:

Shadow mode — the new model runs alongside the production model, processing the same requests and producing predictions, but only the production model's predictions are served to users. The new model's predictions are logged and compared to the production model. This validates that the new model behaves reasonably on real traffic without any user impact. Duration: 1-7 days.

Canary deployment — a small percentage of traffic (1-5%) is routed to the new model. Real users see the new model's predictions, but the blast radius is limited. If metrics degrade, the canary is killed and 100% of traffic reverts to the old model. Duration: 1-3 days per traffic tier.

A/B testing — traffic is split 50/50 between old and new models. This provides statistical power to measure the business impact of the new model. Duration: 1-4 weeks depending on traffic volume and the minimum detectable effect you need to measure.

Full rollout — after validating through shadow, canary, and A/B testing, the new model serves 100% of traffic. The old model is kept available as a rollback target for at least one retraining cycle.
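
The canary and A/B stages both need a way to send a stable fraction of traffic to the new model. One common approach, sketched below under assumed names, is to hash a user identifier into a bucket in [0, 1) and compare it to the rollout fraction. The salt string and function names are hypothetical.

```python
import hashlib


def bucket(user_id: str, salt: str = "model-rollout") -> float:
    """Map a user ID deterministically to a number in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def choose_model(user_id: str, new_model_fraction: float) -> str:
    """Route a user to the new model if their bucket falls under the fraction."""
    return "new" if bucket(user_id) < new_model_fraction else "old"


users = [f"user-{i}" for i in range(10_000)]

# Canary at 5%, then an A/B test at 50%.
canary_users = {u for u in users if choose_model(u, 0.05) == "new"}
ab_users = {u for u in users if choose_model(u, 0.50) == "new"}
print(len(canary_users) / len(users), len(ab_users) / len(users))
```

Hashing keeps each user on the same model across requests, and raising the fraction only adds users to the new model, so canary users stay in the treatment group as the rollout widens. Rolling back is a matter of setting the fraction to zero.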

[Figure: Deployment progression from shadow mode through canary and A/B test to full rollout]

The Hidden Technical Debt in ML Systems

Google's influential 2015 paper, "Hidden Technical Debt in Machine Learning Systems" (Sculley et al.), describes why ML systems accumulate technical debt faster than traditional software:

[Figure: ML code as a small fraction of the total system, surrounded by much larger infrastructure components]

The ML code is a small fraction of the total system. The model itself might be 5% of the codebase. The other 95% is data pipelines, feature engineering, serving infrastructure, monitoring, configuration, and glue code connecting everything together.

Data dependencies are harder to track than code dependencies. When a team changes how they compute a feature, every downstream model that uses that feature is affected — but there is no compiler error or test failure to alert you. Silent data dependency changes are the ML equivalent of undocumented API changes.

Feedback loops create instability. A recommendation model influences which items users see, which changes user behavior, which changes the training data, which changes the model. These feedback loops can amplify biases, create filter bubbles, or cause oscillating behavior where the model alternates between two states.

[Figure: ML system growing from clean architecture to tangled web of dependencies over time]

Monitoring and Retraining Triggers

A model in production must be monitored continuously:

Prediction distribution monitoring — if the model suddenly starts predicting "fraud" for 30% of transactions instead of the usual 0.5%, something is wrong — either the input data changed or the model is broken.

Feature drift — if the distribution of input features shifts significantly from what the model was trained on, predictions become unreliable. Statistical tests (KS test, Population Stability Index) compare current feature distributions against training distributions.

Business metric monitoring — the ultimate check, and usually the slowest, because ground truth arrives late. If confirmed fraud losses rise, or the fraud model's precision drops from 85% to 60% (measured against confirmed fraud labels that arrive weeks later), retraining is needed.
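
As one example of a drift statistic, the Population Stability Index compares binned feature distributions from training and live traffic. The sketch below is a simplified version with equal-width bins derived from the training data; the function name, bin count, and the small floor that avoids division by zero are implementation assumptions.

```python
import math


def population_stability_index(expected: list[float], actual: list[float],
                               n_bins: int = 10) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    lo, hi = min(expected), max(expected)
    if lo == hi:
        raise ValueError("expected values must not all be identical")
    width = (hi - lo) / n_bins

    def bin_fractions(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        floor = 1e-6  # avoid log(0) and division by zero for empty bins
        return [max(c / len(values), floor) for c in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))


# Invented data: training values spread over [0, 1), live values shifted up.
training_values = [i / 100 for i in range(100)]
live_values = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(training_values, training_values)
psi_shifted = population_stability_index(training_values, live_values)
print(psi_same, round(psi_shifted, 2))
```

A frequently cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are conventions and should be calibrated per feature.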

Retraining can be scheduled (retrain every week with the latest data regardless of performance), triggered (retrain when drift metrics exceed thresholds), or continuous (update the model incrementally with each new batch of data). Most production systems use scheduled retraining with triggered emergency retraining as a safety net.
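
The combination of scheduled and triggered retraining can be expressed as a small decision function. This is a sketch only: the function name, the weekly schedule, and the drift threshold are illustrative assumptions, and the drift score could come from any statistic such as PSI.

```python
from datetime import datetime, timedelta
from typing import Optional


def retraining_reason(now: datetime, last_trained: datetime,
                      drift_score: float,
                      schedule: timedelta = timedelta(days=7),
                      drift_threshold: float = 0.25) -> Optional[str]:
    """Return why to retrain now, or None if no retraining is due."""
    if drift_score > drift_threshold:
        return "triggered"   # emergency retrain: drift exceeded the threshold
    if now - last_trained >= schedule:
        return "scheduled"   # routine retrain: the schedule interval elapsed
    return None


last_trained = datetime(2024, 3, 1)

# Two days in, drift is low: nothing to do.
print(retraining_reason(datetime(2024, 3, 3), last_trained, drift_score=0.05))
# Two days in, drift spikes: emergency retrain.
print(retraining_reason(datetime(2024, 3, 3), last_trained, drift_score=0.40))
# A week in, drift is low: routine retrain.
print(retraining_reason(datetime(2024, 3, 8), last_trained, drift_score=0.05))
```

The drift check runs first so that a sudden shift is never delayed until the next scheduled run.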