Deep Learning Foundations
Neural Network Foundations
Representation Learning
Generative Models Beyond Language
Vision and Modern Self-Supervised Learning
Practical Training Decisions
Calculus and Automatic Differentiation
Backpropagation is not magic. It is a mechanical application of a single calculus rule, the chain rule, applied repeatedly across the layers of a neural network. Once you see this clearly, the entire training algorithm becomes transparent and the gradient pathologies that practitioners encounter (vanishing gradients, exploding gradients, dead neurons) become predictable failures with principled fixes.
Beginner: understand that backpropagation is just the chain rule applied to a computation graph.
Intermediate: implement autograd from scratch (the CodeExecutor problem in this lesson).
Advanced: study how JAX's tracing-based AD enables higher-order derivatives and how this connects to meta-learning.
The Chain Rule
If , then . The derivative of a composed function is the product of the derivatives of its parts. This generalizes to any chain of composed functions, and that is exactly what a neural network is.
For a 3-layer network with output , the derivative of with respect to the input of is the product of three Jacobians: . In practice, the chain rule is applied backward, from loss to inputs, because we need gradients with respect to all parameters, not just inputs. This is reverse-mode automatic differentiation.
Gradients and What They Tell You
The gradient of a scalar function with respect to a vector is a vector of partial derivatives, the gradient vector points in the direction of steepest ascent of . The negative gradient points in the direction of steepest descent.
In neural network training, is the loss and contains all the parameters (weights and biases). The goal is to find that minimizes . Gradient descent does this by repeatedly taking small steps in the direction of the negative gradient .
Gradient descent — SGD on Convex bowl
The learning rate controls step size. Too large and the steps overshoot the minimum, causing divergence. Too small and training takes exponentially longer to converge. Finding the right learning rate schedule is one of the core engineering challenges in deep learning training.
The Vanishing Gradient Problem
Every time a gradient passes backward through a layer, it is multiplied by the local derivative of that layer's activation function. If this derivative is consistently less than 1, the gradient shrinks exponentially as it travels deeper into the network.
The sigmoid activation function is a classic example. Its derivative is , which has a maximum of 0.25 at and falls toward 0 for large positive or negative inputs. In a 10-layer network where each layer multiplies the gradient by at most 0.25, the gradient reaching the first layer is at most of its value at the output. The first layers receive essentially no learning signal.
Sigmoid and its derivative
ReLU (Rectified Linear Unit) largely solves vanishing gradients: its derivative is either 0 (for negative inputs) or 1 (for positive inputs). A gradient passing through an active ReLU neuron is unchanged in magnitude. This is why ReLU and its variants (Leaky ReLU, GELU, SiLU) replaced sigmoid activations in deep networks. The trade-off: dead neurons. If a neuron's pre-activation is always negative, its gradient is always 0 and it never updates. It is "dead."
The vanishing gradient problem is not just a historical artifact. It resurfaces whenever you use activations with saturating regions (sigmoid, tanh) or build very deep networks without residual connections. ResNets solved the problem for CNNs by adding skip connections that let gradients bypass entire blocks. Transformers solve it via layer normalization and residual connections around every attention and FFN block.
Automatic differentiation works by building an explicit record of every computation. This record, the computational graph, is what separates modern deep learning frameworks from hand-differentiated code. Understanding it explains why PyTorch's .backward() works, why you must call .zero_grad() before each step, and why certain operations break gradient flow.
What is a Computational Graph
A computational graph represents a mathematical expression as a directed acyclic graph (DAG). Each node is an operation (add, multiply, relu, log) and each edge carries a tensor value from one operation to the next. Leaf nodes are the input tensors (data and parameters). The output node is the loss.
Consider the expression loss = (w * x + b - y) ** 2. This decomposes into a graph: multiply w and x, add b, subtract y, square the result. Each intermediate value is a node. The graph makes explicit which values depend on which other values, and therefore which gradients must be computed.
Chain rule on a computational graph
Dynamic vs Static Graphs
TensorFlow 1.x used static graphs: you define the graph once (declare_operations), compile it, then run it with different data. The graph is fixed at compile time. This enables compiler optimizations (operator fusion, memory planning) but makes debugging painful because errors appear at execution time, not definition time.
PyTorch uses dynamic graphs (also called define-by-run): the graph is built as your Python code executes. Every tensor operation adds a node to a graph that is constructed on the fly. The graph is different for each forward pass. This is how PyTorch handles if-statements and for-loops that depend on input values. This is essential for variable-length sequences and dynamic architectures like tree-structured networks.
The practical consequence: PyTorch is easier to debug (use a Python debugger, print intermediate values) but slightly harder to deploy efficiently (static graph compilers like torch.compile or TensorRT can optimize static subgraphs). JAX takes a middle road: functional programming style with explicit JIT compilation that traces through Python control flow.
The reason you must call optimizer.zero_grad() before every training step in PyTorch is that gradients accumulate (+=) in the .grad attribute of each tensor. If you do not zero them, the gradient from the current step is added to the gradient from the previous step. This is intentional. It enables gradient accumulation over multiple forward passes to simulate a larger batch size. But if you are not doing gradient accumulation, forgetting zero_grad() silently corrupts your training.
The most common autograd bug: calling loss.backward() without optimizer.zero_grad() first. PyTorch ACCUMULATES gradients by default — each backward() call ADDS to the existing .grad tensor. If you forget zero_grad(), your gradients grow every iteration and training diverges within a few steps. This design exists for gradient accumulation across micro-batches, but it trips up every beginner.
Gradient Tapes and Memory
PyTorch builds the graph lazily: when you call .backward(), it traverses the graph in reverse topological order, computing gradients using the chain rule and accumulating them at leaf nodes. After backward(), the intermediate activations (which were needed to compute gradients) can be freed.
This is the source of the memory asymmetry in training: training uses 3-4x more memory than inference because the forward pass must retain all intermediate activations until the backward pass is complete. With a 10-layer network and batch size 32, every activation tensor from every layer must be held in GPU memory simultaneously during forward pass. This is why gradient checkpointing (recomputing activations on demand during backward rather than storing them) is a standard technique for training very large models on limited memory.
Always use F.cross_entropy(), never F.softmax() + F.nll_loss(). The fused version is numerically stable and computes the elegant gradient softmax(z) - one_hot(y) directly.
In-place operations on tensors that require gradients will silently corrupt your backward pass. Operations like x.add_(1) or x[:, 0] = 0 modify the tensor that the autograd graph references, so the backward closure uses the mutated value instead of the original forward-pass value. PyTorch will raise a RuntimeError in many cases, but some in-place mutations slip through undetected. Always use out-of-place operations (x = x + 1, x = torch.cat(...)) when the tensor is part of the computation graph.
If your training suddenly produces NaN loss, the most common causes are: (1) learning rate too high, (2) log of zero or negative number in the loss, (3) division by zero in normalization, (4) float16 overflow. Add torch.autograd.set_detect_anomaly(True) temporarily to find which operation produced the NaN — but remove it for production training as it slows things down significantly.
When debugging gradient issues, use tensor.register_hook(lambda g: print(g.shape, g.norm())) to inspect gradients at any point in the computation graph. If a gradient is all zeros, the computation was detached somewhere upstream. If it is NaN, an upstream operation divided by zero or took log of a negative number. This is faster than torch.autograd.set_detect_anomaly(True) for targeted debugging.
Non-Differentiable Operations
Not every operation in a deep learning pipeline is differentiable. Argmax (taking the index of the maximum value) produces an integer index. Its gradient is zero almost everywhere, making it useless for training. This is why softmax + cross-entropy is used instead of argmax: softmax converts logits to a differentiable probability distribution.
Sampling (drawing from a distribution) is another non-differentiable operation. This matters for variational autoencoders and reinforcement learning. The reparameterization trick works around it: instead of sampling directly from , write where . The gradient flows through and , not through the sampling step. This is one of the most elegant tricks in differentiable programming.
Modern deep learning frameworks use reverse-mode automatic differentiation (AD), also called backpropagation, almost exclusively. But forward-mode AD is equally valid mathematically. Understanding why reverse mode dominates, and when forward mode is actually preferred, reveals a fundamental asymmetry in how parameters and outputs relate to each other.
Two Modes of Automatic Differentiation
Both forward and reverse mode AD compute exact derivatives using the chain rule, no approximation, unlike finite differences. The difference is in the order of computation.
Forward mode starts from an input variable and propagates derivative information forward through the graph. For each input variable, one forward pass computes how every intermediate and output value changes with respect to that one input. To compute all partial derivatives of all outputs with respect to a single input, forward mode requires one pass. To compute gradients with respect to all inputs, it requires passes.
Reverse mode starts from the output (the loss) and propagates gradient information backward. One backward pass computes how the loss changes with respect to every intermediate value and input. The full gradient with respect to all parameters is computed in a single backward pass regardless of how many parameters the network has.
The asymmetry is critical. A neural network has billions of parameters (inputs) and a scalar loss (output). Reverse mode computes the full gradient in one pass. Forward mode would require one pass per parameter, billions of passes. This is why backpropagation is tractable.
| Scenario | Inputs | Outputs | Forward mode passes | Reverse mode passes | Winner |
| Neural net training | 100M params | 1 (loss) | 100,000,000 | 1 | Reverse |
| Physics simulation | 8 initial conditions | 50,000 sensors | 8 | 50,000 | Forward |
| Single neuron | 1 weight | 1 output | 1 | 1 | Tie |
Autograd: forward values, backward gradients
When Forward Mode Wins
The rule is simple: use reverse mode when the function has many inputs and few outputs (loss functions), use forward mode when the function has few inputs and many outputs.
Physical simulations are a classic case for forward mode: a simulation may have 10 initial conditions (inputs) and produce 10,000 measurements (outputs). Reverse mode requires 10,000 backward passes (one per output); forward mode requires only 10 (one per input). JAX's jvp (Jacobian-vector product) enables efficient forward mode AD for exactly these cases.
The Jacobian-vector product (JVP) is the forward mode primitive: for a function and a direction vector , the JVP computes the directional derivative in a single forward pass. The vector-Jacobian product (VJP) is the reverse mode primitive: given a function and a cotangent vector , it computes in a single backward pass. PyTorch's autograd computes VJPs; JAX supports both.
The Survey Paper: Baydin et al. 2018
The canonical reference for understanding automatic differentiation in the machine learning context is "Automatic Differentiation in Machine Learning: a Survey" by Baydin, Pearlmutter, Radul, and Siskind (2018). It formalizes the distinction between forward and reverse mode, covers higher-order differentiation, discusses implementation strategies (operator overloading vs source transformation), and surveys the ecosystem of AD tools.
The key insight from the paper: what deep learning practitioners call "backpropagation" is reverse-mode AD applied to a specific class of computation (feedforward networks with scalar loss). The more general framework supports any differentiable computation, physics simulations, probabilistic programs, differential equations, not just neural networks.
JAX's design unifies forward and reverse mode in a single abstraction. jax.grad computes reverse-mode gradients (most ML use cases). jax.jvp computes forward-mode directional derivatives. jax.jacfwd and jax.jacrev compute full Jacobians in forward and reverse mode respectively. For functions with mixed input-output shapes, you can compose them: jacfwd(jacrev(f)) computes second-order derivatives (Hessians) efficiently by using reverse mode for the first derivative and forward mode for the second.
Higher-Order Derivatives
Reverse mode AD produces first derivatives. But many algorithms need second-order information: the Hessian matrix of second partial derivatives, used in Newton's method and natural gradient optimization.
The Hessian is expensive: an -parameter model has an Hessian, which is infeasible to compute or store for billion-parameter models. But Hessian-vector products (HVPs) can be computed efficiently without materializing the full Hessian: apply reverse mode AD to get the gradient, then apply forward mode AD to the gradient computation to get the HVP in a single forward-reverse pass pair. This enables second-order optimization methods that scale to large models.
In practice, full second-order optimization is rarely used for training large models due to memory and compute costs. But HVPs are used in: K-FAC (approximate natural gradient), influence functions (measuring how each training example affects the model), and some meta-learning algorithms.
The best way to understand automatic differentiation is to implement it. Building a minimal autograd engine from scratch forces you to confront questions that PyTorch hides: how are gradients stored? what happens when a node is used twice? how does topological ordering determine the backward pass sequence?
Andrej Karpathy's micrograd is the gold standard minimal implementation, under 100 lines of Python, yet it correctly implements reverse-mode AD for scalars and can train a multi-layer perceptron. Walking through it reveals that backpropagation is not a mysterious algorithm but a straightforward application of the chain rule to an explicit graph.
Core Data Structure: the Value Node
Each node in the computational graph is a Value object that wraps a scalar and maintains:
data: the forward pass valuegrad: the accumulated gradient (initialized to 0)_prev: the set of Value nodes that this node depends on (its inputs)_backward: a closure that, when called, computes and accumulates gradients into_prevnodes
The backward closure is defined at the time the operation is executed, capturing both the input values and the output node in its closure scope. This is the key design: the backward function is constructed lazily, at the same time as the forward computation.
Topological Sort and Backward
Before calling backward closures, we need to ensure we compute gradients in the right order: output first, inputs last. This is the reverse topological order of the graph, if node A feeds into node B, we must compute B's gradient before A's.
A DFS-based topological sort achieves this:
The post-order DFS ensures every node appears after all nodes that depend on it (in forward direction), equivalently, before all nodes it depends on (in backward direction).
The gradient accumulation in backward closures (using +=, not =) is not a coincidence. It is necessary for correctness in DAGs where a node is used more than once. If x appears in both a = x * 2 and b = x + 1, and we compute c = a + b, then dc/dx should be 3 (from both paths). Accumulation ensures both paths contribute their gradients to x.grad. If you used assignment (=) instead of +=, only the last backward closure to run would contribute, producing wrong gradients.
Why This Generalizes
The micrograd structure generalizes directly to tensor operations. In PyTorch, each operation on a tensor creates a grad_fn (the analog of _backward) that knows how to compute the gradient with respect to its inputs given the output gradient. The only difference is that scalars become tensors and the closure must implement the full Jacobian-vector product rather than a scalar multiply.
The autograd tape in PyTorch is essentially the _prev graph across all operations, with grad_fn playing the role of _backward. The backward() method performs the same topological sort and closure invocation, just at scale.
Practice: Implement Minimal Autograd
Build a Value class that supports addition, multiplication, and ReLU with correct backward passes. Your implementation should handle the DAG case (a node used multiple times) correctly via gradient accumulation.
Loading problem...