Deep Learning Building Blocks
A neural network is a stack of simple mathematical operations that, when composed, can learn complex patterns. Each layer transforms its input into a slightly more abstract representation. By the time data passes through several layers, raw inputs (pixels, words, numbers) have been converted into representations that make the final prediction easy.
Neurons and Layers
A single neuron does three things: (1) multiply each input by a learned weight, (2) sum the weighted inputs plus a bias term, (3) pass the result through an activation function. That is it — weighted sum plus nonlinearity. The power comes from stacking thousands of neurons into layers, and stacking layers into deep networks.
A layer is a group of neurons that all process the same input simultaneously. The input layer receives raw data. Hidden layers transform data through successive abstractions. The output layer produces the final prediction — a class probability, a numeric value, or a sequence of tokens.
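As a rough illustration of the three steps above, here is a minimal pure-Python sketch of one neuron and a layer built from several of them. The names `neuron` and `dense_layer` and the example weights are invented for this sketch; real frameworks perform the same arithmetic as a single matrix multiplication.

```python
def relu(x):
    # Activation: pass positive values through, clamp negatives to zero.
    return max(0.0, x)

def neuron(inputs, weights, bias, activation=relu):
    # (1) multiply each input by its weight, (2) sum and add the bias
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # (3) pass the result through the activation function
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation=relu):
    # A layer is several neurons that all read the same input.
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

x = [1.0, 2.0]
hidden = dense_layer(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0])
print(hidden)
```

The first neuron's weighted sum is negative, so ReLU clamps it to zero; the second neuron's sum is positive and passes through unchanged.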

Activation Functions
Without activation functions, a neural network is just a chain of matrix multiplications — which collapses into a single linear transformation no matter how many layers you stack. Activation functions introduce nonlinearity, allowing the network to learn curved decision boundaries and complex patterns.
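The collapse claim can be checked numerically. The sketch below, with arbitrary example matrices and biases omitted for brevity, applies two linear layers in sequence and compares the result with a single layer whose weight matrix is the product of the two.

```python
def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Multiply matrix A by matrix B (both lists of rows).
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0]]   # first "layer", no activation
W2 = [[3.0, 1.0], [2.0, 0.5]]    # second "layer", no activation
x = [1.0, 4.0]

two_layers = matvec(W2, matvec(W1, x))
one_layer = matvec(matmul(W2, W1), x)
print(two_layers, one_layer)
```

Both paths give the same output, so without a nonlinearity the extra layer adds no expressive power.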
ReLU (Rectified Linear Unit) is the default activation: output the input if positive, zero otherwise. It is computationally cheap and, because its gradient is exactly 1 for positive inputs, it largely mitigates the vanishing gradient problem that plagued earlier activations like sigmoid and tanh. Most hidden layers in modern networks use ReLU or its variants (Leaky ReLU, GELU).
Softmax converts raw output scores into a probability distribution across classes — all outputs sum to 1. Used in the final layer for classification tasks.
Sigmoid squishes output to the 0-1 range. Used in the final layer for binary classification (is this spam or not?).
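For reference, the activations described above fit in a few lines of plain Python. This is a sketch for scalars and small lists, not a replacement for the vectorized versions in PyTorch or TensorFlow; the 0.01 slope for Leaky ReLU is a common default chosen here as an assumption.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Like ReLU, but lets a small fraction of negative inputs through.
    return x if x > 0 else slope * x

def sigmoid(x):
    # Two branches so that exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def softmax(scores):
    # Subtracting the max does not change the result but avoids overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The softmax outputs preserve the ranking of the raw scores and sum to 1 up to floating-point rounding.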
The depth in 'deep learning' is what gives neural networks their power. Each layer learns to detect increasingly abstract features: edges become textures, textures become object parts, object parts become whole objects. A shallow network with one hidden layer can theoretically approximate any continuous function to arbitrary accuracy, but it may need an impractically large number of neurons to do so. Depth makes the representation efficient.
Forward Pass and Backpropagation
The forward pass pushes data through the network from input to output, producing a prediction. The loss function measures how wrong that prediction is. Backpropagation then computes how much each weight contributed to the error, working backward from the output layer to the input layer. Each weight gets a gradient: a signal indicating how the loss changes as that weight increases. Gradient descent then updates all weights simultaneously by a small step in the direction opposite the gradient, which is the direction that reduces the loss.
This cycle — forward pass, compute loss, backpropagation, update weights — repeats millions of times during training. Each cycle uses a batch of training examples (typically 32-512 samples), and one complete pass through the entire training dataset is called an epoch. Training runs for tens to hundreds of epochs until the loss converges.
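The cycle can be sketched end to end on a toy problem. The example below fits a one-weight, one-bias model to points on the line y = 2x + 1, with the gradients of mean squared error written out by hand. The function name `train`, the data, and the hyperparameters are all assumptions chosen to keep the example small; a framework would compute the gradients automatically.

```python
def train(xs, ys, lr=0.05, epochs=500, batch_size=2):
    w, b = 0.0, 0.0
    for _ in range(epochs):                          # one epoch = one pass over the data
        for start in range(0, len(xs), batch_size):  # one batch at a time
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            n = len(bx)
            # Forward pass: compute predictions.
            preds = [w * x + b for x in bx]
            # Loss is mean((pred - y)^2); these are its gradients w.r.t. w and b.
            grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, by, bx)) / n
            grad_b = sum(2 * (p - y) for p, y in zip(preds, by)) / n
            # Update: step against the gradient to reduce the loss.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 2x + 1
w, b = train(xs, ys)
print(round(w, 3), round(b, 3))
```

With these settings the learned weight and bias approach 2 and 1, the values that generated the data.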
You do not need to implement backpropagation — frameworks like PyTorch and TensorFlow handle it automatically through automatic differentiation. But understanding the concept matters because it explains why training is computationally expensive (every weight needs a gradient), why GPUs help (gradient computation parallelizes across neurons and examples), and why certain architectures train better than others (gradient flow determines whether deep networks learn effectively).
Different data types have different structural properties. Images have spatial structure — nearby pixels are related. Sequences have temporal structure — the meaning of a word depends on surrounding words. Specialized architectures exploit these structures to learn more efficiently than generic fully-connected networks.
Convolutional Neural Networks (CNNs)
A CNN exploits the spatial structure of images. Instead of connecting every pixel to every neuron (which would require billions of parameters for a 1080p image), a CNN slides small filters (typically 3x3 or 5x5 pixel windows) across the image. Each filter detects a specific local pattern — an edge, a corner, a texture.

The key ideas behind CNNs:
Parameter sharing: the same filter slides across the entire image. A vertical-edge detector works the same way in the top-left corner as in the bottom-right. This dramatically reduces parameters compared to a fully-connected network.
Translation invariance: because the same filters slide across the whole image, a pattern produces the same response wherever it appears, so a cat is recognized as a cat regardless of its position. Strictly speaking, convolution is translation-equivariant (shifting the input shifts the feature map), and pooling adds approximate invariance. The model does not need separate examples of "cat in top-left" and "cat in center."
Hierarchical features: early layers detect edges (simple patterns), middle layers combine edges into textures and shapes (medium complexity), and deep layers recognize objects (high-level concepts). This hierarchy emerges automatically from training; no one programs "detect edges in layer 1."
Pooling: max pooling and average pooling reduce spatial dimensions by summarizing small regions (e.g., taking the maximum value in each 2x2 block). This makes the representation increasingly compact and invariant to small translations and distortions.
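To make filter sliding and pooling concrete, here is a small pure-Python sketch. It applies a hand-written 3x3 vertical-edge filter to a 6x6 image (dark left half, bright right half) and then max-pools the result in 2x2 blocks. The filter values and the image are illustrative assumptions; in a trained CNN the filter weights are learned. As in most deep learning frameworks, the "convolution" here is technically cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel):
    # Slide the kernel over every position where it fits (no padding, stride 1).
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(feature_map, size=2):
    # Keep the maximum of each non-overlapping size x size block.
    return [[max(feature_map[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]        # edge between columns 2 and 3
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]  # same 9 weights reused everywhere

feature_map = conv2d(image, vertical_edge)
pooled = max_pool2d(feature_map)
print(feature_map[0], pooled)
```

The feature map responds only in the columns whose window straddles the edge, and pooling halves each spatial dimension while keeping the strongest response. The same nine weights are reused at every position, which is the parameter sharing described above.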
CNNs dominate image recognition, object detection, video analysis, and medical imaging. They also work well for any data with spatial structure — spectrograms (audio), molecular structures (drug discovery), and game boards (Go, chess).
Recurrent Neural Networks (RNNs) and LSTMs
RNNs process sequential data by maintaining a hidden state that carries information from previous time steps. At each step, the network takes the current input and the previous hidden state, producing an output and an updated hidden state. This creates a chain of processing where the output at step 10 can depend on input from step 1.
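A minimal sketch of this recurrence, using a single scalar hidden state so the arithmetic stays visible. The weights `w_x` and `w_h` are arbitrary assumptions; a real RNN uses weight matrices and vector states.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    # New hidden state mixes the current input with the previous state.
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0
    states = []
    for x in inputs:          # strictly sequential: step t needs the state from step t-1
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

with_signal = run_rnn([1.0] + [0.0] * 9)   # input only at step 1
without_signal = run_rnn([0.0] * 10)
print(with_signal[0], with_signal[-1], without_signal[-1])
```

The input at step 1 still influences the state at step 10, but its trace shrinks at every step, which hints at why long-range information is hard for a basic RNN to retain.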
The problem with basic RNNs is the vanishing gradient issue: during backpropagation through many time steps, gradients shrink exponentially, making it very hard for the network to learn long-range dependencies. A sentence like "The cat, which had been sitting on the mat for several hours while the dog slept nearby, finally stood up" requires connecting "cat" to "stood up" across many intervening words.
LSTMs (Long Short-Term Memory) solve this with a gating mechanism. Three gates control information flow: the forget gate decides what to discard from memory, the input gate decides what new information to store, and the output gate decides what to expose as the current hidden state. This allows LSTMs to maintain relevant information across hundreds of time steps.
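The gating logic can be sketched with scalar states. The code below follows the standard LSTM update (cell state = forget gate × old cell state + input gate × candidate), but the parameter layout (`params` mapping a gate name to `(w_x, w_h, bias)`) and the specific values are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step. params maps gate name -> (w_x, w_h, bias)."""
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre_activation("forget"))          # how much old memory to keep
    write = sigmoid(pre_activation("input"))            # how much new information to store
    expose = sigmoid(pre_activation("output"))          # how much memory to reveal
    candidate = math.tanh(pre_activation("candidate"))  # proposed new content

    c = forget * c_prev + write * candidate   # updated cell (memory) state
    h = expose * math.tanh(c)                 # updated hidden state
    return h, c

# Biases chosen so the forget gate stays near 1 and the input gate near 0.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)
print(c)
```

With the forget gate held open, the cell state survives 100 steps almost unchanged, in contrast to the fading hidden state of a basic RNN.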
RNNs and LSTMs are largely historical for most NLP tasks — transformers have replaced them because transformers process all positions in parallel rather than sequentially. However, RNNs remain relevant for understanding why transformers were invented (to solve the sequential bottleneck) and still appear in certain time-series and edge-device applications where their constant memory footprint matters.
The critical limitation of RNNs is sequential processing — step N cannot begin until step N-1 finishes. This makes RNNs impossible to parallelize across sequence positions, which is devastating for training speed. Processing a 1,000-word document requires 1,000 sequential steps, while a transformer processes all 1,000 words simultaneously.
The transformer architecture, introduced in the 2017 paper "Attention Is All You Need," is the foundation of virtually every modern large language model (GPT, Claude, Gemini), most modern search ranking systems, and an increasing number of vision and multimodal models. Understanding attention is understanding how modern AI works.
The Self-Attention Mechanism
The core idea: when processing a word in a sentence, the model should attend to other relevant words regardless of their distance. In "The bank approved the loan because the applicant had excellent credit," the word "bank" should attend strongly to "loan" and "approved" to understand that this is a financial institution, not a riverbank.

Self-attention works in three steps:
Step 1: Generate queries, keys, and values. Each token is projected through three learned matrices to produce a query vector (what am I looking for?), a key vector (what do I contain?), and a value vector (what information do I carry?).
Step 2: Compute attention scores. The dot product of each query with every key gives a score indicating relevance; a high score means the key's token is highly relevant to the query's token. The scores are divided by the square root of the key dimension (to keep them in a range where softmax behaves well) and then normalized via softmax to produce attention weights that sum to 1.
Step 3: Weighted sum of values. Each token's output is the weighted sum of all value vectors, where the weights come from the attention scores. A token with high attention to another token incorporates more of that token's information into its representation.
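The three steps translate almost line for line into code. The sketch below is a single-head, pure-Python version with tiny hand-picked matrices; the identity projections stand in for learned weights and are purely an assumption for readability. The division by the square root of the key dimension follows the scaled dot-product attention defined in "Attention Is All You Need."

```python
import math

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    # Step 1: project every token into query, key, and value vectors.
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    # Step 2: score every query against every key, scale, then softmax each row.
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Step 3: each output is the attention-weighted sum of the value vectors.
    return matmul(weights, V), weights

identity = [[1.0, 0.0], [0.0, 1.0]]          # stand-in for learned projections
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]     # three tokens, two dimensions each
output, weights = self_attention(X, identity, identity, identity)
print(weights[0])
```

Tokens 0 and 2 have identical vectors, so token 0 attends to them equally and more strongly than to token 1. Multi-head attention repeats this with separate projections per head and concatenates the outputs.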
Multi-head attention runs this process multiple times in parallel (typically 8-96 heads), each with different learned projections. Different heads learn to attend to different types of relationships — one head might focus on syntactic structure (subject-verb), another on semantic similarity, another on positional proximity. The outputs from all heads are concatenated and projected back to the model dimension.
The Transformer Block
A transformer is a stack of identical blocks, each containing:
- Multi-head self-attention — each token attends to all other tokens
- Feed-forward network — two linear layers with a nonlinearity (GELU or ReLU) applied independently to each token's representation
- Layer normalization — stabilizes training by normalizing activations
- Residual connections — add the input of each sub-layer to its output, allowing gradients to flow directly through the network and enabling training of very deep models (100+ layers)
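Frontier-scale models stack on the order of 100 of these blocks, with thousands of attention heads in total and billions of parameters (GPT-3, for example, uses 96 layers with 96 attention heads each, for 175 billion parameters). The architecture is simple; the power comes from scale.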
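How the four components fit together can be sketched as follows. This is a simplified post-norm block (normalize after each residual addition, as in the original paper), with layer normalization's learned scale and shift omitted. The `uniform_attention` function is a hypothetical stand-in for real multi-head self-attention, and the weight matrices are arbitrary.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalize one token's vector to zero mean and (roughly) unit variance.
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def feed_forward(v, W1, W2):
    # Two linear layers with a ReLU in between, applied to a single token.
    hidden = [max(0.0, sum(w * a for w, a in zip(row, v))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

def transformer_block(tokens, attention, W1, W2):
    attended = attention(tokens)
    # Residual connection around attention, then layer norm.
    tokens = [layer_norm([a + b for a, b in zip(t, att)])
              for t, att in zip(tokens, attended)]
    # Residual connection around the feed-forward network, then layer norm.
    return [layer_norm([a + b for a, b in zip(t, feed_forward(t, W1, W2))])
            for t in tokens]

def uniform_attention(tokens):
    # Hypothetical stand-in: every token attends equally to all tokens.
    n = len(tokens)
    mean = [sum(col) / n for col in zip(*tokens)]
    return [mean[:] for _ in tokens]

tokens = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # model dim 2 -> hidden dim 3
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]     # hidden dim 3 -> model dim 2
out = transformer_block(tokens, uniform_attention, W1, W2)
print(out)
```

Because the output has the same shape as the input, blocks can be stacked by feeding one block's output into the next.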
In system design interviews involving LLMs (chatbots, search ranking, content generation), you do not need to explain attention math. But knowing that attention is O(n^2) in sequence length explains why context windows have limits, why long-document processing is expensive, and why techniques like sliding window attention and sparse attention exist — these are infrastructure-relevant constraints that affect system design decisions.
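The quadratic growth is easy to quantify with back-of-the-envelope arithmetic. The helper names below are hypothetical, and the 32 heads and 2 bytes per score (FP16) are assumptions for illustration; the estimate covers one layer's score matrices if they are stored in full.

```python
def attention_matrix_entries(seq_len, num_heads=1):
    """Attention scores per layer: one per (query, key) pair per head."""
    return num_heads * seq_len * seq_len

def attention_matrix_gib(seq_len, num_heads, bytes_per_score=2):
    """Memory for one layer's score matrices if fully materialized."""
    return attention_matrix_entries(seq_len, num_heads) * bytes_per_score / 2**30

for n in (1_000, 8_000, 64_000):
    entries = attention_matrix_entries(n)
    gib = attention_matrix_gib(n, num_heads=32)
    print(n, entries, round(gib, 2))
```

Doubling the sequence length quadruples the number of scores, which is the pressure that sliding window and sparse attention try to relieve.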
Why Transformers Won
Transformers solved three problems simultaneously:
Parallelism — unlike RNNs, transformers process all positions simultaneously. Training on a 1,000-token sequence uses 1,000 parallel attention computations instead of 1,000 sequential RNN steps. This enables training on internet-scale datasets in weeks rather than years.
Long-range dependencies — attention connects every position to every other position directly. A word at position 1 can attend to a word at position 1,000 with equal ease. RNNs must propagate information through 999 sequential steps, losing signal along the way.
Scalability — transformers follow clean scaling laws: more data + more parameters + more compute = better performance. This predictable scaling enabled the strategy behind modern LLMs — throw more resources at the same architecture and performance improves reliably.
Training a neural network is not a single decision — it is a series of interrelated choices that determine whether the model learns effectively. Understanding these dynamics explains why training a large model takes weeks on hundreds of GPUs and costs millions of dollars.
Why Training Needs GPUs
Neural network training is dominated by matrix multiplication. A forward pass multiplies input matrices by weight matrices at every layer. Backpropagation multiplies gradient matrices by weight matrices in reverse. Each of these operations involves thousands of multiply-and-add operations that are independent — element (i,j) of the output matrix does not depend on element (i,k).
CPUs execute these operations sequentially or with limited parallelism (8-64 cores). GPUs run thousands of operations simultaneously across thousands of CUDA cores. As an illustrative figure, a matrix multiplication that takes 10 seconds on a CPU might complete in 0.1 seconds on a GPU. For training runs that require trillions of operations, a speedup of this order (100x) is the difference between weeks and years.

Modern GPU training hardware (NVIDIA A100, H100) provides additional optimizations: tensor cores that compute entire small matrix multiplications in one clock cycle, high-bandwidth memory (HBM) that feeds data to cores fast enough to keep them busy, and NVLink interconnects for multi-GPU communication during distributed training.
Epochs, Batches, and Learning Rate
Epoch — one complete pass through the entire training dataset. Training runs for 10 to 300+ epochs depending on dataset size and model complexity.
Batch size — the number of training examples processed together before updating weights. Larger batches give more accurate gradient estimates but require more memory. Typical values range from 32 to 4,096. For LLM training, effective batch sizes can reach millions of tokens through gradient accumulation.
Learning rate — the step size for weight updates. Too high and the model overshoots optima, oscillating wildly. Too low and training converges painfully slowly or gets stuck in poor local minima. Learning rate schedules (warmup + decay) start with a low rate, increase to a peak, then gradually decrease — this combination provides fast early training with precise final convergence.
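A warmup-plus-decay schedule can be written as a small function of the step number. The sketch below uses linear warmup followed by cosine decay, which is one common choice among several; the function name `lr_at_step` and all default values are assumptions for illustration.

```python
import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=1000, total_steps=10000, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup: ramp from a small rate up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay: glide from the peak down to min_lr by the final step.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

for step in (0, 500, 1000, 5500, 10000):
    print(step, lr_at_step(step))
```

The rate climbs during warmup, peaks when warmup ends, and then falls smoothly toward the minimum.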
These three knobs interact. Doubling the batch size halves the variance of the gradient estimate, which often allows doubling the learning rate. This linear scaling rule is used to scale training to hundreds of GPUs: each GPU processes a batch shard, gradients are averaged, and the learning rate scales in proportion to the total batch size.
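A small sketch of both ideas: averaging per-shard gradients reproduces the full-batch gradient, and the linear scaling rule adjusts the learning rate with batch size. The toy model (y ≈ w·x with squared error), the data, and the helper names are assumptions; the sharding assumes the batch divides evenly among workers.

```python
def mean_gradient(examples, w):
    # Gradient of mean squared error for the model y ≈ w * x.
    return sum(2 * (w * x - y) * x for x, y in examples) / len(examples)

def data_parallel_gradient(examples, w, num_workers):
    # Each worker computes the gradient on its own equal-sized shard;
    # the per-worker results are then averaged.
    shard_size = len(examples) // num_workers
    shards = [examples[k * shard_size:(k + 1) * shard_size]
              for k in range(num_workers)]
    return sum(mean_gradient(s, w) for s in shards) / num_workers

def scaled_learning_rate(base_lr, base_batch, new_batch):
    # Linear scaling rule: learning rate grows in proportion to batch size.
    return base_lr * new_batch / base_batch

batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
full = mean_gradient(batch, w=0.5)
sharded = data_parallel_gradient(batch, w=0.5, num_workers=4)
print(full, sharded, scaled_learning_rate(0.1, 256, 512))
```

With equal shard sizes the averaged gradient matches the single-machine gradient up to floating-point rounding, which is why data-parallel training behaves like training with one large batch.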
The Cost of Scale
Training GPT-4 class models reportedly costs $50-100 million in compute alone. This cost comes from three factors multiplied together:
Parameters: more parameters require more memory (to store weights and gradients) and more compute (larger matrix multiplications). A 70B parameter model requires at least 140 GB of GPU memory just for weights in FP16 (70 billion parameters at 2 bytes each).
Data: modern LLMs train on trillions of tokens. More data means more forward and backward passes, more compute, more time.
Compute: the relationship follows scaling laws. Performance improves predictably with more compute, but with diminishing returns: loss falls roughly as a power law in compute, so each doubling buys a similar, small improvement, and each marginal gain costs more than the last.
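The weight-memory figure is simple arithmetic: parameter count times bytes per parameter. The helper below is a hypothetical convenience function; it counts weights only and ignores gradients, optimizer state, and activations, all of which add to the real training footprint.

```python
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="fp16"):
    # Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes).
    return num_params * BYTES_PER_VALUE[dtype] / 1e9

print(weight_memory_gb(70_000_000_000, "fp16"))
print(weight_memory_gb(70_000_000_000, "fp32"))
```

Storing the same model in FP32 doubles the requirement, which is one reason lower-precision formats are attractive.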
For infrastructure engineers, this means training systems must maximize GPU utilization. An idle GPU wastes $2-3 per hour. A 1,000-GPU training cluster running at 50% utilization instead of 90% leaves the equivalent of 400 GPUs idle, which wastes roughly $19,000-29,000 per day. Efficient data loading, gradient communication, and fault tolerance are critical infrastructure problems.
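The cost of a utilization gap can be estimated with one multiplication. The function name and parameters below are hypothetical; the estimate treats the gap between target and actual utilization as fully idle GPU-hours billed at a flat hourly rate.

```python
def daily_cost_of_utilization_gap(num_gpus, actual, target, cost_per_gpu_hour):
    # GPU-hours per day lost to the gap, times the hourly price.
    wasted_gpu_hours = num_gpus * (target - actual) * 24
    return wasted_gpu_hours * cost_per_gpu_hour

low = daily_cost_of_utilization_gap(1000, actual=0.5, target=0.9, cost_per_gpu_hour=2.0)
high = daily_cost_of_utilization_gap(1000, actual=0.5, target=0.9, cost_per_gpu_hour=3.0)
print(round(low), round(high))
```

Plugging in a cluster's own size, utilization, and hourly rate gives a quick sense of how much an infrastructure fix is worth per day.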