ML Systems & Infrastructure
Machine Learning in a Nutshell
Machine learning is pattern recognition from data. Instead of writing explicit rules ("if transaction amount > $10,000, flag as fraud"), you give the system thousands of examples of fraudulent and legitimate transactions and let it learn the patterns that distinguish them. The model discovers rules that a human programmer could never enumerate — combinations of amount, time of day, merchant category, device fingerprint, and velocity that together signal fraud.
Every ML system has two distinct phases. Training is where the model learns: it processes historical data, makes predictions, compares those predictions against known correct answers, and adjusts its internal parameters to reduce errors. Inference is where the model works: it receives new, unseen inputs and produces predictions based on what it learned during training. Training is computationally expensive and happens periodically. Inference is cheap per-prediction and happens continuously.

Why does this distinction matter for infrastructure? Training demands GPUs, large datasets, and hours (or days) of compute. Inference demands low latency, high availability, and the ability to handle thousands of requests per second. The infrastructure for each phase looks completely different — training runs on GPU clusters with batch schedulers, while inference runs on model servers behind load balancers. Every infrastructure decision you will encounter in this course maps to one of these two phases.
The model itself is a mathematical function with adjustable parameters. During training, an algorithm called gradient descent iteratively tweaks those parameters to minimize a loss function — a measure of how wrong the model's predictions are. When the loss is low, the model's predictions are close to the correct answers. When the loss is high, the model needs more training or a different approach.
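As a rough illustration of that loop, the sketch below fits a straight line with plain gradient descent on a mean squared error loss. The toy data, the learning rate, and the function names are assumptions made for this example, not part of any particular library.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b over the dataset."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by repeatedly stepping against the loss gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the MSE loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step in the direction that lowers the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

initial_loss = mse_loss(0.0, 0.0, xs, ys)
w, b = gradient_descent(xs, ys)
final_loss = mse_loss(w, b, xs, ys)
print(f"w={w:.3f}, b={b:.3f}, loss {initial_loss:.3f} -> {final_loss:.6f}")
```

Real training loops use the same idea, but compute gradients automatically and on mini-batches rather than the full dataset.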
ML is not magic — it is curve fitting at scale. A model learns a function that maps inputs to outputs by adjusting parameters to minimize prediction error. Understanding this demystifies every ML concept that follows: overfitting is when the function is too complex for the data, regularization is penalizing complexity, and evaluation metrics measure how well the function generalizes to inputs it has never seen.
Think of it this way: a traditional program takes data and rules as input and produces answers. A machine learning program takes data and answers as input and produces rules. Those learned rules are the model — and the quality of those rules depends entirely on the quality and quantity of the data you feed in.
The fundamental split in ML is whether you have labeled data. Supervised learning means every training example comes with the correct answer attached. "This email is spam." "This transaction is fraudulent." "This house sold for $450,000." The model learns to predict those labels from the input features. Unsupervised learning means you have data without labels. The model finds structure on its own — clusters, patterns, anomalies — without being told what to look for.
Supervised Learning
Supervised learning solves two types of problems:
Classification assigns discrete categories to inputs. Is this email spam or not spam? Is this image a cat, dog, or bird? Is this patient's tumor malignant or benign? The model outputs a probability distribution across categories, and you pick the highest-probability category (or apply a threshold for binary decisions).
Regression predicts continuous numeric values. What will this house sell for? How many minutes will this delivery take? What will tomorrow's temperature be? The model outputs a single number — the predicted value.

The distinction matters because the evaluation metrics, loss functions, and model architectures differ. Classification uses cross-entropy loss and measures accuracy, precision, and recall. Regression uses mean squared error and measures MAE or RMSE. Choosing the wrong problem type — treating a regression problem as classification or vice versa — is a common and expensive mistake.
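To make the difference concrete, here is a minimal sketch of the losses and metrics named above, written with the standard library only. The labels, probabilities, and house prices are made-up values for illustration.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)


# Classification: confident correct predictions get a lower loss than confident wrong ones.
labels = [1, 0, 1]
bce_good = binary_cross_entropy(labels, [0.9, 0.1, 0.8])
bce_bad = binary_cross_entropy(labels, [0.1, 0.9, 0.2])

# Regression: errors are measured in the units of the target (here, dollars).
prices = [450_000, 300_000]
predicted = [440_000, 320_000]
mae = mean_absolute_error(prices, predicted)
rmse = math.sqrt(mean_squared_error(prices, predicted))
print(f"BCE good={bce_good:.3f} bad={bce_bad:.3f}, MAE={mae:.0f}, RMSE={rmse:.0f}")
```

RMSE is larger than MAE here because squaring weights the larger of the two errors more heavily.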
Unsupervised Learning
Unsupervised learning finds structure in unlabeled data:
Clustering groups similar data points together. K-means partitions customers into segments based on purchasing behavior; you choose the number of clusters, k, up front. DBSCAN finds dense regions of data points that form natural groups and infers the number of clusters from the data. Either way, the algorithm might reveal that your customers form 5 distinct behavioral clusters without being told what those clusters mean; interpreting them is still your job.
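A minimal sketch of k-means (Lloyd's algorithm) is shown below, assuming 2D points and a naive initialization that takes the first k points as starting centroids. Production libraries use smarter initialization and convergence checks. Note that k is passed in by the caller.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = [points[i] for i in range(k)]  # naive init (assumption for this sketch)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    assignments = [
        min(range(k), key=lambda i: squared_distance(p, centroids[i])) for p in points
    ]
    return centroids, assignments


# Hypothetical customers described by two features, e.g. (orders per month, average basket size).
points = [(1.0, 2.0), (8.0, 9.0), (1.5, 1.8), (9.0, 8.5), (1.2, 2.2), (8.5, 9.5)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```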
Dimensionality reduction compresses high-dimensional data into fewer dimensions while preserving structure. PCA (Principal Component Analysis) projects 100 features down to 10 that capture most of the variance. t-SNE and UMAP create 2D visualizations of high-dimensional data, revealing clusters and relationships invisible in raw feature space.
Anomaly detection identifies data points that do not fit the learned distribution. An autoencoder trained on normal network traffic learns what "normal" looks like. When it encounters unusual traffic patterns, its reconstruction error spikes — flagging the anomaly without ever seeing labeled examples of attacks.
In system design interviews, supervised learning appears in classification tasks (fraud detection, spam filtering, content moderation) and ranking tasks (search, recommendations). Unsupervised learning appears in anomaly detection (monitoring, security) and clustering (user segmentation, data exploration). Knowing which type applies to a given problem is the first design decision you should make.
Semi-supervised and Self-supervised Learning
In practice, labeled data is expensive and unlabeled data is abundant. Semi-supervised learning uses a small amount of labeled data to guide learning from a large amount of unlabeled data. Self-supervised learning creates labels from the data itself: for example, masking words in a sentence and training the model to predict them (the approach behind BERT), or predicting the next word from the words before it (the approach behind GPT). This is how large language models are pretrained: the training signal comes from the structure of the text itself, requiring no human labeling.
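The sketch below shows how training pairs can be derived from raw text with no human labels, for both a masked-word objective and a next-word objective. The whitespace tokenization, the `[MASK]` placeholder string, and the 15% masking rate are simplifying assumptions for illustration.

```python
import random

MASK = "[MASK]"


def make_masked_example(tokens, mask_prob=0.15, rng=None):
    """Hide some tokens; the hidden originals become the labels."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # no loss is computed at this position
    return inputs, labels


def make_next_token_examples(tokens):
    """Each prefix of the sentence is an input; the following token is its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = "the cat sat on the mat".split()
masked_inputs, masked_labels = make_masked_example(sentence, mask_prob=0.3)
next_token_pairs = make_next_token_examples("the cat sat".split())
print(masked_inputs, masked_labels)
print(next_token_pairs)
```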
Not every problem needs a neural network. The right model depends on your data size, feature types, latency requirements, and interpretability needs. Three families cover the vast majority of production ML systems.
Linear Models
Linear regression and logistic regression are the workhorses of ML. A linear model learns a weighted sum of input features — each feature gets a coefficient that represents its contribution to the prediction. Logistic regression adds a sigmoid function to produce probabilities for classification.
Why use linear models when more powerful alternatives exist? Interpretability: each coefficient directly tells you how much a feature influences the prediction. A fraud model with a large positive coefficient on "transaction_amount" and a large negative coefficient on "account_age" tells you exactly what the model learned. Speed: inference is a single dot product between the weights and the features, completing in microseconds. Robustness: with few parameters and, where needed, regularization, linear models are less prone to overfitting than more flexible models. They are the right first model for any new problem because they set a baseline that more complex models must beat.
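A minimal sketch of logistic regression inference follows: a weighted sum of features passed through a sigmoid. The weights, bias, and standardized feature values are invented for illustration and mirror the "transaction_amount" and "account_age" example above.

```python
import math


def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients for [transaction_amount, account_age], both standardized.
weights = [0.8, -0.5]  # larger amounts raise the score, older accounts lower it
bias = -1.0

risky = predict_proba([2.0, -1.0], weights, bias)  # large amount, new account
safe = predict_proba([-1.0, 2.0], weights, bias)   # small amount, old account
print(f"risky={risky:.3f}, safe={safe:.3f}")
```

The sign and size of each weight can be read directly, which is the interpretability benefit described above.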
Tree-Based Models
Decision trees, random forests, and gradient-boosted trees (XGBoost, LightGBM, CatBoost) dominate tabular data problems. A decision tree splits data along feature thresholds — "if income > $80K and credit score > 700, approve the loan." Random forests train hundreds of trees on random subsets of data and average their predictions, reducing variance. Gradient boosting trains trees sequentially, where each new tree corrects the errors of the previous ones.
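The sequential error-correcting idea behind gradient boosting can be sketched with one-split trees (stumps) on a single feature and a squared-error loss. This is a simplified illustration under those assumptions, not how XGBoost, LightGBM, or CatBoost are implemented; the data and the learning rate are made up.

```python
def fit_stump(xs, residuals):
    """Find the threshold on x that best splits the residuals into two constant groups."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # start from the mean prediction
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Each new stump is trained on what the current ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump_predict(stump, x) for p, x in zip(predictions, xs)
        ]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]  # a step function plus small noise

model = fit_boosted_stumps(xs, ys)
base_mse = mse(ys, [model[0]] * len(ys))
boosted_mse = mse(ys, [boosted_predict(model, x) for x in xs])
print(f"mean-only MSE={base_mse:.4f}, boosted MSE={boosted_mse:.4f}")
```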
Tree-based models win on tabular data for three reasons. First, they handle mixed feature types with little preprocessing: splits on thresholds need no feature scaling, and libraries such as LightGBM and CatBoost can handle categorical features and missing values natively. Second, they capture non-linear relationships and feature interactions automatically. Third, gradient-boosted trees are a frequent component of winning Kaggle solutions on tabular data and often match or outperform neural networks on tabular benchmarks.
A common mistake is jumping straight to deep learning for a tabular data problem. On structured data with well-engineered features, gradient-boosted trees (XGBoost, LightGBM) frequently match or outperform neural networks while training in minutes instead of hours and requiring no GPU. Reserve neural networks for unstructured data (images, text, audio, video), where they genuinely excel.
Neural Networks
Neural networks excel on unstructured data. CNNs (convolutional neural networks) dominate image recognition. Transformers dominate text, code, and increasingly everything else. The power of neural networks comes from representation learning — they automatically discover the right features from raw data, eliminating the need for manual feature engineering.
A CNN learns to detect edges in early layers, textures in middle layers, and objects in later layers — all from raw pixels, with no human telling it what edges or textures are. A transformer learns contextual word representations that capture meaning, syntax, and semantics — all from predicting the next word in billions of sentences.
The trade-off is cost. Neural networks require GPUs for training (hours to weeks), large datasets (thousands to millions of examples), and careful hyperparameter tuning. Inference is also more expensive — a transformer inference call takes milliseconds versus microseconds for a linear model. For infrastructure engineers, this means neural network models drive the need for GPU clusters, model serving infrastructure, and inference optimization techniques that simpler models do not require.
A model's quality is only as meaningful as the metric you measure it with. Choosing the wrong metric leads to optimizing for the wrong thing — and a model that looks great on paper but fails in production.
Accuracy and Its Limitations
Accuracy is the percentage of correct predictions. If a model correctly classifies 950 out of 1,000 emails, accuracy is 95%. Simple, intuitive — and dangerously misleading for imbalanced datasets.
Consider a fraud detection model. If 0.1% of transactions are fraudulent, a model that predicts "not fraud" for every single transaction achieves 99.9% accuracy. It catches zero fraud. Accuracy is useless when one class dominates because it rewards the model for ignoring the rare class entirely.
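The arithmetic can be checked directly. The sketch below assumes a hypothetical batch of 10,000 transactions with a 0.1% fraud rate and a "model" that always predicts "not fraud".

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# 10 fraudulent transactions (label 1) out of 10,000, i.e. a 0.1% fraud rate.
y_true = [1] * 10 + [0] * 9_990
y_pred = [0] * 10_000  # always predict "not fraud"

acc = accuracy(y_true, y_pred)
fraud_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
print(f"accuracy={acc:.1%}, fraud caught={fraud_caught}")
```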
Precision, Recall, and F1
For imbalanced problems, you need metrics that focus on the class you care about:
Precision answers: "Of all the transactions I flagged as fraud, what percentage actually were fraud?" High precision means few false alarms. If precision is 90%, then 9 out of 10 flagged transactions are real fraud.
Recall answers: "Of all the actual fraud transactions, what percentage did I catch?" High recall means few missed frauds. If recall is 80%, the model catches 4 out of 5 fraudulent transactions.
F1 score is the harmonic mean of precision and recall — a single number that balances both. It penalizes models that sacrifice one for the other.
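These three metrics follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN). The counts in this sketch are invented so that they reproduce the 90% precision and 80% recall used in the examples above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 80 transactions flagged, 72 of them real fraud, 18 frauds missed.
precision, recall, f1 = precision_recall_f1(tp=72, fp=8, fn=18)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a simple average would.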

Precision and recall trade off against each other through the classification threshold. Lowering the threshold (flagging more transactions) increases recall but decreases precision — you catch more fraud but also flag more legitimate transactions. Raising the threshold increases precision but decreases recall — fewer false alarms but more missed fraud. The right threshold depends on the business cost of each error type.
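The trade-off can be seen by sweeping the threshold over a small set of scored examples. The scores and labels here are made up for illustration, and treating precision as 1.0 when nothing is flagged is a convention chosen for this sketch.

```python
def precision_recall_at(threshold, scores, labels):
    """Flag every example whose score is at or above the threshold, then score the flags."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # nothing flagged: no false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Model scores (predicted fraud probability) and true labels for ten transactions.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

results = {t: precision_recall_at(t, scores, labels) for t in (0.3, 0.5, 0.8)}
for t, (p, r) in results.items():
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.8 lifts precision while recall falls, which is the pattern described above.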
Choosing the Right Metric
The correct metric depends on the cost of different errors:
High recall matters when missing a positive is expensive. Cancer screening — missing a malignant tumor is far worse than a false alarm that leads to additional testing. Fraud detection — missing fraud costs money directly.
High precision matters when false positives are expensive. Email spam filtering — marking a legitimate email as spam could cause a user to miss an important message. Content recommendation — showing irrelevant content degrades user trust.
When an interviewer asks about model evaluation, always connect the metric to business impact. Do not just say 'we use F1.' Say 'we optimize for recall at 95% precision because each missed fraud costs us $500 and each false alarm costs us $2 in manual review time. This means missing a fraud is 250x more expensive than a false alarm, so we bias toward catching more fraud even at the cost of more false alarms.'
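One way to turn that reasoning into a threshold choice is to total the cost of each error type at several candidate thresholds. This sketch assumes a false alarm costs 2 units and a missed fraud costs 250 times as much, matching the ratio quoted above; the scores and labels are hypothetical.

```python
FALSE_ALARM_COST = 2                        # manual review of a legitimate transaction
MISSED_FRAUD_COST = 250 * FALSE_ALARM_COST  # the 250x ratio from the text


def total_cost(threshold, scores, labels):
    """Business cost of the errors made when flagging scores at or above the threshold."""
    false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    missed_frauds = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return missed_frauds * MISSED_FRAUD_COST + false_alarms * FALSE_ALARM_COST


scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

costs = {t: total_cost(t, scores, labels) for t in (0.3, 0.5, 0.8)}
best_threshold = min(costs, key=costs.get)
print(costs, "best:", best_threshold)
```

With such a lopsided cost ratio, the cheapest option on this toy data is the lowest threshold, which accepts extra false alarms to avoid missing any fraud.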
Overfitting and the Train/Test Split
A model evaluated only on its training data will always look good — it has seen those exact examples before. Overfitting is when a model learns the noise and quirks of the training data instead of the underlying patterns, causing it to perform well on training data but poorly on new data.

The solution is to hold out data the model never sees during training. The standard split is training set (70-80%, used to learn), validation set (10-15%, used to tune hyperparameters and detect overfitting during training), and test set (10-15%, used once at the end for final evaluation). The test set must remain untouched until final evaluation — if you peek at it during development, you are effectively training on it and your final numbers are unreliable.
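A minimal sketch of such a split, using only the standard library, is shown below. The 70/15/15 proportions, the fixed seed, and the function name are assumptions for this example. A plain random shuffle also assumes the examples are independent; time-ordered data usually needs a chronological split instead.

```python
import random


def train_val_test_split(examples, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


examples = list(range(1000))  # stand-in for 1,000 labeled examples
train, val, test = train_val_test_split(examples)
print(len(train), len(val), len(test))
```

Splitting once, before any modeling, makes it harder to leak test examples into training or tuning by accident.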