Optimization

Topics Covered

Gradient Descent Variants

Vanilla Gradient Descent

Stochastic Gradient Descent (SGD)

Momentum

RMSProp

Adam Optimizer

The Adam Update Rule

Why Each Component Exists

What Adam Does to the Loss Landscape

AdamW: Weight Decay Done Right

Learning Rate Schedules

The Three Phases of Training

Linear Warmup

Cosine Decay

Constant Learning Rate

Learning Rate for Fine-Tuning

Loss Landscapes and Convergence

What a Loss Landscape Looks Like

Local Minima: Less Problematic Than Expected

Sharp vs Flat Minima and Generalization

Saddle Points: The Real Problem

Why Transformers Need Warmup

Gradient Clipping

Training a neural network is an optimization problem: minimize a loss function by adjusting millions of parameters. The algorithm that does this, the optimizer, determines whether training succeeds, how fast it converges, and whether the result generalizes. Understanding the evolution from plain gradient descent to Adam is not just historical context; it explains why each component of Adam exists and what problem it solves.

Vanilla Gradient Descent

The simplest update rule: take a step in the direction of steepest descent of the loss.

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L$$

Vanilla (batch) gradient descent computes the gradient over the entire training set before updating. On a dataset of 1 million examples, that means a single weight update requires processing every example, far too slow for practical training.

[Figure: Gradient descent on a convex bowl, plotted over x and y from -2 to 2. Each step of vanilla gradient descent moves toward the minimum along the steepest-descent direction.]
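As a minimal sketch of the update rule above, the following runs gradient descent on an assumed bowl, L(x, y) = x^2 + y^2. The function names, starting point, and learning rate are illustrative choices, not taken from the figure.

```python
import numpy as np

def loss(theta):
    # Assumed example surface: a convex bowl L(x, y) = x^2 + y^2
    return float(np.sum(theta ** 2))

def grad(theta):
    # Analytic gradient of the bowl
    return 2.0 * theta

def gradient_descent(theta0, lr=0.1, num_steps=50):
    theta = np.asarray(theta0, dtype=float)
    history = [loss(theta)]
    for _ in range(num_steps):
        theta = theta - lr * grad(theta)   # theta <- theta - eta * grad L
        history.append(loss(theta))
    return theta, history

theta_final, history = gradient_descent([2.0, -1.5])
print(history[0], history[-1])
```

On this surface every step shrinks the parameters by the same factor (1 - 2 * lr), so the loss decreases monotonically toward zero.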

Stochastic Gradient Descent (SGD)

SGD approximates the true gradient using a single random example (or a small mini-batch):

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L_{\text{batch}}$$

The gradient estimate is noisy: each mini-batch gives a different estimate of the true gradient. But this noise is actually useful. It discourages the optimizer from settling into sharp local minima and acts as an implicit regularizer. The key hyperparameter is the learning rate $\eta$: too high and the updates overshoot the minimum; too low and training takes forever.
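A small sketch of mini-batch SGD, assuming a synthetic linear regression problem. The data, `batch_gradient`, and all hyperparameters here are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y = X @ w_true + small noise
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

def batch_gradient(w, Xb, yb):
    # Gradient of 0.5 * mean((Xb @ w - yb)^2) over one mini-batch
    return Xb.T @ (Xb @ w - yb) / len(yb)

def sgd(w0, lr=0.1, batch_size=32, num_epochs=20):
    w = w0.copy()
    for _ in range(num_epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - lr * batch_gradient(w, X[idx], y[idx])
    return w

w_sgd = sgd(np.zeros(d))
print(w_sgd)
```

Each update touches only 32 examples, so one pass over the data yields about 31 parameter updates instead of the single update that full-batch gradient descent would make.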

The fundamental problem with SGD is that it uses the same learning rate for every parameter. In practice, some directions of parameter space are very steep (loss changes rapidly) while others are very flat (loss changes slowly). SGD either takes too large a step in the steep directions (overshooting) or too small a step in the flat directions (slow progress). This is the "narrow valley" problem.
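The narrow valley problem can be made concrete with a sketch on an assumed quadratic whose curvature is 100 in one direction and 1 in the other. For simplicity it uses exact gradients. The curvature values and learning rates are illustrative assumptions.

```python
import numpy as np

# Assumed curvatures: steep x direction, flat y direction
curvature = np.array([100.0, 1.0])

def run_gd(lr, num_steps=100):
    theta = np.array([1.0, 1.0])
    for _ in range(num_steps):
        # Gradient of 0.5 * sum(curvature * theta^2) is curvature * theta
        theta = theta - lr * curvature * theta
    return theta

stable = run_gd(lr=0.019)    # just under the stability limit 2 / 100
unstable = run_gd(lr=0.021)  # just over it
print(stable, unstable)
```

The steep direction caps the learning rate at 2 / 100. At that cap the flat coordinate has only decayed to about 0.15 after 100 steps, and a slightly larger learning rate makes the steep coordinate diverge.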

Key Insight

SGD's noise is not a bug; it is a feature. Stochastic gradient noise helps the optimizer escape sharp local minima and saddle points. Models trained with small batches (noisier gradients) often generalize better than those trained with large batches (cleaner gradients). The reason: sharp minima in the loss landscape generalize poorly, and SGD noise provides a natural barrier against converging to them. This is why two practices that look contradictory are both justified: learning rate warmup at the start of training, when stability matters most, and tolerating some gradient noise in later phases.

Momentum

Momentum addresses SGD's tendency to oscillate. Instead of updating directly with the current gradient, maintain a running velocity vector that accumulates gradients:

$$v \leftarrow \beta v + (1 - \beta)\, \nabla_\theta L$$
$$\theta \leftarrow \theta - \eta \, v$$

With $\beta = 0.9$, the velocity is an exponential moving average of past gradients. The optimizer builds up speed in consistent directions (the "valley floor") while damping oscillations across the valley (where gradients alternate sign).

Intuitively: think of a ball rolling down a surface. Without momentum, the ball takes each step independently. It stops at every plateau. With momentum, the ball accumulates velocity and can roll through small plateaus and past noisy gradient directions.

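```python
import numpy as np

# Illustrative setup (assumed): quadratic loss whose gradient equals theta
def compute_gradient(theta):
    return theta
theta, lr, num_steps = np.array([2.0, -1.5]), 0.1, 100

# Momentum in NumPy
v = np.zeros_like(theta)
for step in range(num_steps):
    g = compute_gradient(theta)
    v = 0.9 * v + 0.1 * g        # accumulate velocity (EMA of gradients, beta = 0.9)
    theta = theta - lr * v       # step along the velocity, not the raw gradient
```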
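The claim that the velocity is an exponential moving average can be checked numerically. Unrolling the recursion from v = 0 gives a weighted sum in which a gradient that is k steps old carries weight (1 - beta) * beta^k. The sketch below compares the two forms on a random stand-in gradient sequence (assumed, for illustration only).

```python
import numpy as np

beta = 0.9
rng = np.random.default_rng(0)
grads = rng.normal(size=50)   # stand-in scalar gradients (assumed)

# Recursive form, as in the momentum update
v = 0.0
for g in grads:
    v = beta * v + (1 - beta) * g

# Unrolled form: weight (1 - beta) * beta^age for a gradient `age` steps old
ages = np.arange(len(grads))[::-1]          # the newest gradient has age 0
weights = (1 - beta) * beta ** ages
v_unrolled = float(np.sum(weights * grads))

# Share of the total weight carried by the 10 most recent gradients
recent_share = weights[-10:].sum() / weights.sum()
print(v, v_unrolled, recent_share)
```

With beta = 0.9, roughly two thirds of the weight sits on the last 10 gradients, which matches the usual rule of thumb that the average looks back about 1 / (1 - beta) steps.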

RMSProp

Momentum solves oscillation but still uses the same learning rate for all parameters. RMSProp addresses this by maintaining a per-parameter running average of squared gradients and dividing each parameter's update by the square root of that average, which serves as a running estimate of that parameter's gradient magnitude:

$$s \leftarrow \beta s + (1 - \beta)\, (\nabla_\theta L)^2$$
$$\theta \leftarrow \theta - \dfrac{\eta \, \nabla_\theta L}{\sqrt{s} + \epsilon}$$

Parameters with consistently large gradients get smaller effective learning rates (divided by a large $\sqrt{s}$). Parameters with small gradients get larger effective learning rates (divided by a small $\sqrt{s}$). The result is a separate adaptive learning rate for each parameter: the optimizer automatically calibrates to the geometry of each parameter's loss surface.

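```python
import numpy as np

# Illustrative setup (assumed): quadratic loss whose gradient equals theta
def compute_gradient(theta):
    return theta
theta, lr, num_steps = np.array([2.0, -1.5]), 0.01, 100

# RMSProp in NumPy
s = np.zeros_like(theta)
for step in range(num_steps):
    g = compute_gradient(theta)
    s = 0.999 * s + 0.001 * (g ** 2)   # running avg of squared gradients
    theta = theta - lr * g / (np.sqrt(s) + 1e-8)   # per-parameter scaled step
```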
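To illustrate the per-parameter scaling, the sketch below feeds RMSProp a constant gradient at three very different scales and reports the resulting step sizes. The helper `rmsprop_step_sizes` is hypothetical, and the decay of 0.9 and the gradient scales are assumed values for this example.

```python
import numpy as np

def rmsprop_step_sizes(grad_scale, lr=0.01, beta=0.9, eps=1e-8, num_steps=200):
    # Hold the gradient fixed at each scale and let the running average settle
    g = np.asarray(grad_scale, dtype=float)
    s = np.zeros_like(g)
    for _ in range(num_steps):
        s = beta * s + (1 - beta) * g ** 2
    # Magnitude of the RMSProp update for each parameter
    return lr * np.abs(g) / (np.sqrt(s) + eps)

grad_scales = np.array([100.0, 1.0, 0.01])
rmsprop_steps = rmsprop_step_sizes(grad_scales)
sgd_steps = 0.01 * grad_scales   # plain SGD with the same learning rate
print(rmsprop_steps, sgd_steps)
```

Under these assumptions the RMSProp steps all come out close to the learning rate itself, while the plain SGD steps differ by four orders of magnitude.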