Designing an ML System End to End

Topics Covered

Requirements Gathering
    The Requirements Checklist
    Connecting Requirements to Architecture
    Prioritizing Requirements
    Common Requirements Traps
Architecture Selection
    The Baseline Principle
    Online vs Batch Architecture
    Model Complexity Decisions
    Avoiding Over-Engineering
    Worked Example: Recommendation System Architecture
Data Strategy
    Audit Existing Data Sources
    Labels and Labeling Strategy
    Feature Engineering
    Data Pipelines and Quality
    Worked Example: Data Strategy for Fraud Detection
Deployment Planning
    Shadow Deployment
    A/B Testing
    Gradual Rollout
    Rollback Plan
    Deployment Checklist
    Common Deployment Failures
Operating and Iterating
    Monitoring Strategy
    Retraining Strategy
    Iteration Roadmap
    Measuring What Matters
    Cost Management
The System Design Interview Framework
Tying It All Together

Every ML system design starts with the same question: what prediction does the business actually need? "We want to use machine learning" is not an answer; it is a solution looking for a problem. The right starting point is a business outcome expressed as a prediction task. "Reduce fraud losses by 30%" becomes "classify each transaction as fraudulent or legitimate within 50ms." "Increase engagement" becomes "rank content by predicted click probability for each user." The prediction task defines everything downstream: what data you need, how complex the model must be, how fast inference must run, and how you measure success.

This translation from business goal to prediction task is where many ML projects fail. Teams jump to model architecture before defining what "good" looks like. A recommendation system with 95% precision but 200ms latency might be useless if the product requires sub-50ms responses. A fraud detector with 99.9% accuracy might be terrible if it catches only 40% of actual fraud (low recall on the minority class). Requirements must be specific, measurable, and validated with stakeholders before anyone writes a single line of code.
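The fraud example above is easy to check with arithmetic. The sketch below uses hypothetical counts (1,000,000 transactions with a 0.1% fraud rate, chosen only so the numbers match the 99.9% accuracy and 40% recall mentioned in the text) to show how both figures can hold at once.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical traffic: 1,000,000 transactions, 1,000 of them fraudulent.
# The detector catches 400 frauds, misses 600, and raises 400 false alarms.
metrics = confusion_metrics(tp=400, fp=400, fn=600, tn=998_600)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the detector is wrong on only 1,000 of 1,000,000 transactions, yet 600 of the 1,000 frauds get through. That gap is why the requirement should name recall (and precision) on the fraud class rather than overall accuracy.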

The Requirements Checklist

What prediction? Define the exact output: classification (fraud or not), regression (predicted revenue), ranking (ordered list of items), or generation (text, image). This determines model type, evaluation metrics, and serving pattern.

Who consumes it? A downstream microservice needs a gRPC endpoint with structured output. A mobile app needs sub-100ms latency. A batch analytics pipeline needs predictions for millions of items overnight. The consumer shapes the serving architecture.

What latency? Real-time serving (under 100ms) requires a dedicated model server, optimized inference, and possibly model distillation. Near-real-time (seconds) allows more complex models with batching. Batch (hours) relaxes the per-request latency constraint and lets you use much larger models.

What accuracy? Define the metric and the target. For imbalanced problems like fraud detection, overall accuracy is misleading; use precision and recall instead. For ranking, use NDCG or MAP. Set a minimum threshold that justifies the infrastructure cost.

What scale? Ten predictions per second needs a single container. Ten thousand per second needs autoscaling, load balancing, and careful resource planning. Ten million per second needs a custom serving infrastructure. Scale determines cost, architecture complexity, and team size.
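One way to keep the checklist honest is to write the answers down as a structured record. The following is a minimal sketch, not a standard schema: the `Requirements` fields, the `serving_pattern` helper, and the 60-second cutoff between near-real-time and batch are all assumptions made for illustration; only the 100ms real-time boundary comes from the checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Answers to the five checklist questions (field names are hypothetical)."""
    prediction_type: str      # classification, regression, ranking, generation
    consumer: str             # who calls the model
    latency_budget_ms: float  # end-to-end budget per prediction
    metric: str               # evaluation metric agreed with stakeholders
    metric_target: float      # minimum acceptable value of that metric
    peak_qps: float           # expected peak predictions per second


def serving_pattern(req: Requirements) -> str:
    """Map the latency budget to a serving tier.

    Under 100ms is real-time, as in the checklist. The 60-second
    boundary between near-real-time and batch is an assumed cutoff.
    """
    if req.latency_budget_ms < 100:
        return "real-time"
    if req.latency_budget_ms <= 60_000:
        return "near-real-time"
    return "batch"


# Hypothetical requirement sets for two different consumers.
fraud = Requirements("classification", "payments microservice", 50, "recall", 0.80, 10_000)
nightly = Requirements("ranking", "analytics pipeline", 8 * 3_600_000, "NDCG", 0.70, 10)

print(serving_pattern(fraud))    # real-time
print(serving_pattern(nightly))  # batch
```

Freezing the dataclass is a small design choice: once stakeholders sign off, the record should change only through an explicit new version, not by mutation in code.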

Interview Tip

In system design interviews, spend the first 5 minutes on requirements before touching architecture. Interviewers evaluate whether you ask the right questions. Clarifying latency, scale, and accuracy targets before designing shows you understand that the same business problem can have radically different solutions depending on constraints.

Connecting Requirements to Architecture

Requirements directly constrain architecture. A 10ms latency requirement eliminates any architecture that calls an external model API, because the network round-trip alone typically exceeds the budget. A 10-billion-item catalog eliminates architectures that score the full catalog on the fly for every request; you need precomputed candidates. A 99.99% uptime requirement (roughly 53 minutes of downtime per year) demands redundant serving with automatic failover.
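The latency argument above amounts to adding up a budget. The sketch below compares two candidate request paths against a 10ms budget; every component timing is a hypothetical placeholder rather than a measurement, and `fits_budget` is an illustrative helper.

```python
def fits_budget(components_ms: dict[str, float], budget_ms: float) -> tuple[bool, float]:
    """Sum per-component latencies and compare the total to the budget."""
    total = sum(components_ms.values())
    return total <= budget_ms, total


# Hypothetical timings in milliseconds; replace with measured p99 values.
external_api = {
    "feature_lookup": 2.0,
    "network_round_trip": 30.0,
    "remote_inference": 15.0,
}
in_process = {
    "feature_lookup": 2.0,
    "local_inference": 5.0,
}

for name, path in [("external_api", external_api), ("in_process", in_process)]:
    ok, total = fits_budget(path, budget_ms=10.0)
    print(f"{name}: {total:.1f}ms total, fits 10ms budget: {ok}")
```

Under these assumed numbers the external call is ruled out by the network hop before model quality is even considered, which is the point of checking constraints first.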

Prioritizing Requirements

Not all requirements carry equal weight. Latency and scale are hard constraints: a system that misses them is unusable regardless of model quality. Accuracy targets are softer: a model that achieves 88% when the target is 90% might still ship if the business impact is acceptable. Cost constraints are negotiable: a system that costs twice the budget but delivers three times the value might get approved.

Prioritize requirements in this order:

(1) hard constraints that make the system usable or unusable (latency, scale, uptime)
(2) quality targets that determine business value (accuracy, precision, recall)
(3) operational requirements that affect long-term sustainability (retraining frequency, monitoring, team size)
(4) cost targets that affect financial viability
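The ordering above can be read as a triage rule: a failed hard constraint ends the discussion, while a missed softer target starts one. The sketch below encodes that reading. The `Check` type, the `triage` function, and the example checks are hypothetical, and the real decision for tiers 2 to 4 belongs to stakeholders, not to code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    tier: int     # 1 = hard constraint, 2 = quality, 3 = operational, 4 = cost
    passed: bool


def triage(checks: list[Check]) -> str:
    """Reject on any failed tier-1 check; otherwise flag failures for review."""
    failed = sorted((c for c in checks if not c.passed), key=lambda c: c.tier)
    if not failed:
        return "ship"
    if failed[0].tier == 1:
        return "reject: " + failed[0].name
    return "review: " + ", ".join(c.name for c in failed)


# Hypothetical candidate that meets latency but reaches 88% against a 90% target.
candidate_a = [
    Check("p99 latency <= 50ms", tier=1, passed=True),
    Check("accuracy >= 0.90", tier=2, passed=False),
    Check("monthly cost <= budget", tier=4, passed=True),
]
# Hypothetical candidate with better accuracy that misses the latency constraint.
candidate_b = [
    Check("p99 latency <= 50ms", tier=1, passed=False),
    Check("accuracy >= 0.90", tier=2, passed=True),
]

print(triage(candidate_a))  # review: accuracy >= 0.90
print(triage(candidate_b))  # reject: p99 latency <= 50ms
```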

Common Requirements Traps

Several requirements patterns trip up even experienced teams:

Accuracy without context. "We need 95% accuracy" is meaningless without knowing the class distribution, the cost of false positives vs false negatives, and what baseline the model replaces. A spam classifier that lets through 5% of spam might be acceptable. A cancer screening model that misses 5% of cancers might be deadly. Anchor accuracy targets to business impact, not abstract percentages.

Latency at the wrong percentile. A 100ms average latency sounds fast, but if p99 is 2 seconds, 1% of your requests experience unacceptable delays, and those delays often correlate with complex, high-value queries. Specify latency at p95 and p99, not just the average or median.

Scale for peak, not average. A system that handles 1,000 QPS on average might face 10,000 QPS during a product launch or seasonal spike. Design for peak load with autoscaling, or accept that performance will degrade during spikes and define the degradation strategy (queue requests, return cached predictions, fall back to rules).

Ignoring cold start. Recommendation systems need historical data to work. What happens for new users with no history? New items with no interactions? The cold-start strategy (popular items, content-based recommendations, asking users for preferences) is a requirement, not an afterthought. Define the cold-start behavior during requirements gathering, because it affects architecture choices (whether you need content-based features alongside collaborative filtering) and data collection (whether you need to ask new users for explicit preferences during onboarding).

The underlying mistake is designing the "ideal" system without checking whether it fits the constraints. A deep learning model with attention layers might produce the best recommendations, but if your team has no GPU infrastructure and the budget is $500/month, a gradient boosted tree served on CPU is the right architecture.
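The percentile and peak-load traps above are both quick to quantify. The sketch below uses synthetic latencies (985 fast requests and 15 slow ones, invented so the mean lands near 100ms) and a nearest-rank percentile; the `percentile` and `replicas_needed` helpers and the 250 QPS per-replica capacity are assumptions for illustration.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def replicas_needed(qps: float, per_replica_qps: float) -> int:
    """Replicas required to serve the given load, rounded up."""
    return math.ceil(qps / per_replica_qps)


# Synthetic data: most requests take 70ms, 1.5% take 2 seconds.
latencies_ms = [70.0] * 985 + [2000.0] * 15

print(f"mean: {mean(latencies_ms):.2f}ms")
print(f"p50:  {percentile(latencies_ms, 50):.0f}ms")
print(f"p95:  {percentile(latencies_ms, 95):.0f}ms")
print(f"p99:  {percentile(latencies_ms, 99):.0f}ms")

# Assumed capacity of 250 QPS per replica, sized for average vs peak load.
print(replicas_needed(1_000, 250), "replicas at average load")
print(replicas_needed(10_000, 250), "replicas at peak load")
```

In this synthetic sample the mean, median, and even p95 all look healthy while p99 sits at 2 seconds, which is why the requirement should name the tail percentile explicitly.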