Optimization
Training a neural network is an optimization problem: minimize a loss function by adjusting millions of parameters. The algorithm that does this, the optimizer, determines whether training succeeds, how fast it converges, and whether the result generalizes. Understanding the evolution from plain gradient descent to Adam is not just historical context; it explains why each component of Adam exists and what problem it solves.
Vanilla Gradient Descent
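The simplest update rule: take a step in the direction of steepest descent of the loss, which is the negative gradient. Writing $\theta$ for the parameters, $\eta$ for the learning rate, and $L$ for the loss:
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$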
Vanilla (batch) gradient descent computes the gradient over the entire training set before updating. On a dataset of 1 million examples, that means a single weight update requires processing every example, far too slow for practical training.
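A minimal sketch of full-batch gradient descent, assuming a linear model with mean squared error as the loss. The function name, the synthetic data, and the values of `lr` and `steps` are illustrative choices, not taken from the text. Note that every single update touches all rows of `X`, which is the cost described above.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, steps=200):
    """Minimize the mean squared error of a linear model with full-batch updates."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        residual = X @ theta - y              # uses every training example
        grad = (2.0 / n) * (X.T @ residual)   # exact gradient of the MSE loss
        theta = theta - lr * grad             # step along the negative gradient
    return theta

# Hypothetical synthetic data: 1000 examples, 3 features, no label noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta

theta_hat = batch_gradient_descent(X, y)
```

On this small, well-conditioned problem `theta_hat` should recover `true_theta` closely; the point of the sketch is the per-step cost, which grows linearly with the number of examples.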
[Interactive figure: gradient descent with SGD on a convex bowl]
Stochastic Gradient Descent (SGD)
SGD approximates the true gradient using a single random example (or a small mini-batch) $B_t$:
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L_{B_t}(\theta_t)$$
Here $\theta$ denotes the parameters, $\eta$ the learning rate, and $L_{B_t}$ the loss averaged over the mini-batch only. The gradient estimate is noisy: each mini-batch gives a different estimate of the true gradient. But this noise is actually useful: it makes the optimizer less likely to settle into sharp local minima and acts as an implicit regularizer. The key hyperparameter is the learning rate $\eta$: too high and the updates overshoot the minimum; too low and training takes forever.
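The mini-batch update can be sketched as follows, again assuming a linear model with mean squared error. The function name, batch size, learning rate, and synthetic data are hypothetical choices for illustration. Each step estimates the gradient from `batch_size` examples instead of the full dataset.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=32, epochs=20, seed=0):
    """Mini-batch SGD on a mean squared error loss for a linear model."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            # Noisy gradient estimate from this mini-batch only.
            grad = (2.0 / len(idx)) * (X[idx].T @ residual)
            theta = theta - lr * grad
    return theta

# Hypothetical synthetic data with a little label noise.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(2000, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.1 * data_rng.normal(size=2000)

theta_sgd = minibatch_sgd(X, y)
```

Because the learning rate is held constant, the result keeps jittering slightly around the least-squares solution rather than converging exactly; that residual jitter is the gradient noise discussed above.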
The fundamental problem with SGD is that it uses the same learning rate for every parameter. In practice, some directions of parameter space are very steep (loss changes rapidly) while others are very flat (loss changes slowly). SGD either takes too large a step in the steep directions (overshooting) or too small a step in the flat directions (slow progress). This is the "narrow valley" problem.
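A toy illustration of the narrow valley, assuming a two-parameter quadratic loss whose curvature is 100 in one direction and 1 in the other. The curvature values and learning rates are made up for the demonstration, and plain (noise-free) gradient steps are used so the effect of the shared learning rate is isolated.

```python
import numpy as np

# Loss = 0.5 * (100 * a**2 + 1 * b**2): steep along a, flat along b.
curvature = np.array([100.0, 1.0])

def run_gd(lr, steps=100):
    """Plain gradient steps with one learning rate shared by both parameters."""
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = curvature * theta
        theta = theta - lr * grad
    return theta

# Just below the stability limit of the steep direction (2 / 100 = 0.02):
# the steep coordinate converges, the flat one has barely moved.
safe = run_gd(lr=0.019)

# Just above the limit: the steep coordinate blows up.
unsafe = run_gd(lr=0.021)
```

With `lr=0.019` the flat coordinate is still roughly 0.15 after 100 steps, while raising the learning rate only slightly makes the steep coordinate diverge. No single value serves both directions well.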
SGD's noise is not a bug. It is a feature. Stochastic gradient noise helps escape sharp local minima and saddle points. Models trained with small batches (noisier gradients) often generalize better than those trained with large batches (cleaner gradients). The reason: sharp minima in the loss landscape generalize poorly, and SGD noise provides a natural barrier against converging to them. This is why both learning rate warmup at the start of training (when stability matters most) and keeping some gradient noise in later training phases are justified.
Momentum
Momentum addresses SGD's tendency to oscillate. Instead of updating directly with the current gradient $g_t$, maintain a running velocity vector $v_t$ that accumulates gradients:
$$v_t = \beta \, v_{t-1} + (1 - \beta) \, g_t, \qquad \theta_{t+1} = \theta_t - \eta \, v_t$$
With the momentum coefficient $\beta$ close to 1 (0.9 is a common choice), the velocity is an exponential moving average of past gradients. The optimizer builds up speed in consistent directions (the "valley floor") while damping oscillations across the valley (where gradients alternate sign).
Intuitively: think of a ball rolling down a surface. Without momentum, the ball takes each step independently. It stops at every plateau. With momentum, the ball accumulates velocity and can roll through small plateaus and past noisy gradient directions.
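A small sketch comparing plain gradient steps with momentum on an assumed toy quadratic (curvature 100 in one direction, 1 in the other). The velocity here is written as an exponential moving average of gradients, matching the description above; the function names and constants are illustrative. The momentum learning rate is chosen so that `lr * (1 - beta)` equals the plain learning rate, giving each gradient the same immediate weight in both runs.

```python
import numpy as np

# Toy quadratic loss: steep in the first coordinate, flat in the second.
curvature = np.array([100.0, 1.0])

def grad(theta):
    return curvature * theta

def plain_gd(lr, steps=100):
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def momentum_gd(lr, beta=0.9, steps=100):
    theta = np.array([1.0, 1.0])
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        # Exponential moving average of past gradients.
        velocity = beta * velocity + (1.0 - beta) * grad(theta)
        theta = theta - lr * velocity
    return theta

plain = plain_gd(lr=0.019)
with_momentum = momentum_gd(lr=0.19)   # 0.19 * (1 - 0.9) = 0.019
```

In this setup the accumulated velocity carries the flat coordinate much closer to the minimum than plain gradient steps do in the same number of iterations, and the overall distance to the minimum is smaller. The steep coordinate is not improved here, so the sketch shows a trade-off rather than a uniform win.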
RMSProp
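Momentum damps oscillation but still uses the same learning rate for all parameters. RMSProp addresses this by maintaining a per-parameter running average $s_t$ of squared gradients and dividing each parameter's update by the square root of that average, which serves as a running estimate of its gradient magnitude:
$$s_t = \rho \, s_{t-1} + (1 - \rho) \, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon} \, g_t$$
Here $g_t$ is the current gradient, the square and square root are applied elementwise, $\rho$ is the decay rate of the running average, and $\epsilon$ is a small constant that prevents division by zero. Parameters with consistently large gradients get smaller effective learning rates (divided by a large $\sqrt{s_t}$). Parameters with small gradients get larger effective learning rates (divided by a small $\sqrt{s_t}$). This gives each parameter its own adaptive learning rate: the optimizer automatically calibrates to the geometry of each parameter's loss surface.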
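A minimal RMSProp sketch under stated assumptions: the function name, the `decay` and `eps` parameters, their default values, and the toy quadratic loss are illustrative, not taken from the text. Two parameters start at the same value but see gradients that differ by a factor of 100.

```python
import numpy as np

def rmsprop(grad_fn, theta0, lr=0.01, decay=0.9, eps=1e-8, steps=500):
    """RMSProp: scale each parameter's step by a running RMS of its own gradients."""
    theta = np.array(theta0, dtype=float)
    sq_avg = np.zeros_like(theta)   # per-parameter average of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        sq_avg = decay * sq_avg + (1.0 - decay) * g**2
        theta = theta - lr * g / (np.sqrt(sq_avg) + eps)
    return theta

# Toy quadratic loss: gradients are 100x larger in the first coordinate.
curvature = np.array([100.0, 1.0])

def quadratic_grad(theta):
    return curvature * theta

# After one step both coordinates have moved by nearly the same amount,
# even though their raw gradients differ by a factor of 100.
first_step = rmsprop(quadratic_grad, [1.0, 1.0], steps=1)

theta_rms = rmsprop(quadratic_grad, [1.0, 1.0])
```

Because each gradient is divided by its own running magnitude, the steep and flat coordinates make progress at a similar pace, which is the per-parameter calibration described above. With a constant `lr` the iterates settle into a small neighborhood of the minimum rather than converging exactly.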