Deep Learning Foundations
Training Dynamics
Deep networks fail to train more often than they succeed, and not because the math is wrong. The math is fine. The problem is gradient flow. Every technique covered in this lesson (initialization, normalization, mixed precision) exists specifically to keep gradients well-behaved. Understanding the failure modes first makes every subsequent solution obvious rather than mysterious.
The Core Problem
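During backpropagation, gradients are computed by repeatedly applying the chain rule through each layer. For a network with $L$ layers, the gradient at layer 1 involves multiplying roughly $L$ Jacobian matrices together:
$$\frac{\partial \mathcal{L}}{\partial h_1} = \frac{\partial \mathcal{L}}{\partial h_L} \, \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_2}{\partial h_1}$$
Here $h_l$ denotes the output of layer $l$ and $\mathcal{L}$ the loss.
If each Jacobian has typical singular value $\sigma < 1$, then after $L$ multiplications the gradient shrinks by a factor of about $\sigma^L$, exponentially toward zero. With sigmoid activations and $\sigma \approx 0.25$ (the sigmoid's derivative never exceeds 0.25), a 10-layer network attenuates the gradient by roughly $0.25^{10} \approx 10^{-6}$.
If each Jacobian has typical singular value $\sigma > 1$, the gradient grows by a factor of about $\sigma^L$, exponentially toward infinity. The gradient explodes, weights receive enormous updates, and training diverges.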
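To make the exponential effect concrete, here is a minimal sketch that treats every layer's Jacobian as a single scalar gain. This is a strong simplification of the matrix product, and the function name gradient_magnitudes and the example gains are illustrative assumptions, not part of the lesson.

```python
from typing import List


def gradient_magnitudes(gain: float, depth: int, output_grad: float = 1.0) -> List[float]:
    """Gradient magnitude seen at each layer, ordered from layer 1 to layer `depth`.

    Each layer is modeled as multiplying the backward signal by the same
    scalar `gain`, standing in for the typical singular value of its Jacobian.
    """
    magnitudes = [0.0] * depth
    grad = output_grad
    for layer in range(depth, 0, -1):  # walk backward from the last layer
        magnitudes[layer - 1] = grad
        grad *= gain                   # one more Jacobian applied
    return magnitudes


# 0.25 is the maximum slope of the sigmoid; 1.5 is an arbitrary gain above 1.
for gain in (0.25, 1.0, 1.5):
    mags = gradient_magnitudes(gain, depth=10)
    print(f"gain {gain}: layer 1 = {mags[0]:.3e}, layer 10 = {mags[-1]:.3e}")
```

With a gain below 1 the first layer sees a gradient that is orders of magnitude smaller than the last layer's; with a gain above 1 the opposite happens.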
[Figure: gradient magnitude by layer depth]
Why This Matters in Practice
Vanishing gradients mean the first layers of the network essentially don't train. The parameters that encode low-level features (in a language model, things like basic syntax and token patterns) receive almost no learning signal. The network can memorize patterns in its later layers but cannot build abstractions from scratch.
Exploding gradients manifest differently: the loss goes to NaN after a few steps, or you observe loss values jumping by orders of magnitude between batches. The training run is simply broken.
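Both symptoms can be checked automatically during training. The sketch below is one possible monitor; the function name loss_looks_unstable and the default threshold of a 100x jump are assumptions chosen for illustration, and a sensible threshold depends on the task.

```python
import math


def loss_looks_unstable(previous_loss: float, current_loss: float,
                        jump_factor: float = 100.0) -> bool:
    """Flag a step whose loss is NaN/inf or jumped by more than `jump_factor` times."""
    if not math.isfinite(current_loss):
        return True  # NaN or infinite loss: the run has already diverged
    if previous_loss > 0 and current_loss > jump_factor * previous_loss:
        return True  # loss jumped by orders of magnitude between batches
    return False


# Hypothetical loss values from consecutive batches.
history = [2.31, 2.28, 2.25, 950.0, float("nan")]
for prev, cur in zip(history, history[1:]):
    if loss_looks_unstable(prev, cur):
        print(f"unstable step: loss went from {prev} to {cur}")
```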
Gradient Clipping: The Exploding Gradient Fix
For exploding gradients, the standard solution is gradient clipping:
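A minimal pure-Python sketch of clipping by global norm is shown here. The function name clip_by_global_norm and the list-of-lists gradient layout are illustrative assumptions; in PyTorch the corresponding utility is torch.nn.utils.clip_grad_norm_, which takes the model parameters and a max_norm argument and applies essentially the same rule.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], max_norm: float = 1.0) -> float:
    """Rescale `grads` in place so their global L2 norm is at most `max_norm`.

    `grads` holds one list of gradient values per parameter tensor.
    Returns the global norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total_norm > max_norm:
        scale = max_norm / total_norm  # same factor for every entry
        for tensor in grads:
            for i in range(len(tensor)):
                tensor[i] *= scale
    return total_norm


# Two hypothetical parameter tensors; global norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, grads)
```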
This rescales the gradient vector if its global norm exceeds max_norm. It's a blunt instrument: it prevents runaway updates without addressing the underlying cause, but it works reliably. Transformer training recipes very commonly clip at a max_norm of 1.0.
The intuition: if the gradient is pointing in a sensible direction but is too large, scale it down so that its norm equals max_norm (1.0 in the common recipe). Direction preserved, magnitude controlled.
For vanishing gradients, clipping does nothing: it only ever scales gradients down, so it cannot amplify a near-zero gradient. The real solutions are initialization and normalization.
If your training loss goes NaN in the first few steps, check gradient norms before clipping. A gradient norm of $10^6$ on step 1 usually means your initialization is too large or your learning rate is too high. If the loss goes NaN after 10,000 steps, check for a degenerate input batch (all zeros, NaN features) or a numerical issue in your loss function ($\log(0)$, division by zero).
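These checks are cheap to script. The following is a rough sketch using plain Python lists in place of tensors. All three helper names are hypothetical, and clamping the argument of the logarithm with a small eps is just one common guard against log(0), not the only option.

```python
import math
from typing import List


def global_grad_norm(grads: List[List[float]]) -> float:
    """L2 norm over all gradient values, measured before any clipping."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def batch_problems(batch: List[List[float]]) -> List[str]:
    """Return human-readable problems found in a batch of feature rows."""
    values = [v for row in batch for v in row]
    if not values:
        return ["empty batch"]
    problems = []
    if any(math.isnan(v) for v in values):
        problems.append("NaN features")
    if any(math.isinf(v) for v in values):
        problems.append("infinite features")
    if all(v == 0.0 for v in values):
        problems.append("all zeros")
    return problems


def safe_log(p: float, eps: float = 1e-12) -> float:
    """Logarithm with the argument clamped away from zero to avoid log(0)."""
    return math.log(max(p, eps))


print(global_grad_norm([[3.0, 4.0]]))        # inspect this on step 1
print(batch_problems([[0.0, 0.0], [0.0, 0.0]]))
print(batch_problems([[1.0, float("nan")]]))
```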