Calculus and Automatic Differentiation

Deep Learning Foundations

Practical Training Decisions

Calculus and Automatic Differentiation

Topics Covered

Chain Rule and Gradients

The Chain Rule

Gradients and What They Tell You

The Vanishing Gradient Problem

Computational Graphs

What is a Computational Graph

Dynamic vs Static Graphs

Gradient Tapes and Memory

Non-Differentiable Operations

Forward vs Reverse Mode Automatic Differentiation

Two Modes of Automatic Differentiation

When Forward Mode Wins

The Survey Paper: Baydin et al. 2018

Higher-Order Derivatives

Implementing Autograd

Core Data Structure: the Value Node

Topological Sort and Backward

Why This Generalizes

Practice: Implement Minimal Autograd

Backpropagation is not magic. It is a mechanical application of a single calculus rule, the chain rule, applied repeatedly across the layers of a neural network. Once you see this clearly, the entire training algorithm becomes transparent and the gradient pathologies that practitioners encounter (vanishing gradients, exploding gradients, dead neurons) become predictable failures with principled fixes.

Level Expectations

Beginner: understand that backpropagation is just the chain rule applied to a computation graph.

Intermediate: implement autograd from scratch (the CodeExecutor problem in this lesson).

Advanced: study how JAX's tracing-based AD enables higher-order derivatives and how this connects to meta-learning.

The Chain Rule

If $y = f(g(x))$ , then $\dfrac{dy}{dx} = \dfrac{df}{dg} \cdot \dfrac{dg}{dx}$ . The derivative of a composed function is the product of the derivatives of its parts. This generalizes to any chain of composed functions, and that is exactly what a neural network is.

For a 3-layer network with output $L = f_3(f_2(f_1(x)))$ , the derivative of $L$ with respect to the input of $f_1$ is the product of three Jacobians: $\dfrac{\partial L}{\partial f_3} \cdot \dfrac{\partial f_3}{\partial f_2} \cdot \dfrac{\partial f_2}{\partial f_1}$ . In practice, the chain rule is applied backward, from loss to inputs, because we need gradients with respect to all parameters, not just inputs. This is reverse-mode automatic differentiation.

\dfrac{dy}{dx} = \dfrac{df}{dg} \cdot \dfrac{dg}{dx}

python

1# y = f(g(x)). Forward and backward by hand.
2import numpy as np
3
4g = lambda x: x ** 2          # inner function
5f = lambda u: 3 * u + 1       # outer function
6
7x  = 2.0
8u  = g(x)                     # 4.0
9y  = f(u)                     # 13.0
10
11# Local derivatives
12dg_dx = 2 * x                 # = 4
13df_du = 3                     # = 3
14
15# Chain rule: multiply local derivatives along the path
16dy_dx = df_du * dg_dx         # = 12
17print(dy_dx)

The chain rule in code, each layer contributes a local derivative, and you multiply them along the path from output to input.

Gradients and What They Tell You

The gradient of a scalar function $L(w)$ with respect to a vector $w$ is a vector of partial derivatives, the gradient vector points in the direction of steepest ascent of $L$ . The negative gradient points in the direction of steepest descent.

In neural network training, $L$ is the loss and $w$ contains all the parameters (weights and biases). The goal is to find $w$ that minimizes $L$ . Gradient descent does this by repeatedly taking small steps in the direction of the negative gradient $w_{\text{new}} = w - \eta \nabla_w L(w)$ .

Gradient descent — SGD on Convex bowl

Negative gradient points toward the minimum; each step follows steepest descent on the loss surface.

The learning rate $\eta$ controls step size. Too large and the steps overshoot the minimum, causing divergence. Too small and training takes exponentially longer to converge. Finding the right learning rate schedule is one of the core engineering challenges in deep learning training.

The Vanishing Gradient Problem

Every time a gradient passes backward through a layer, it is multiplied by the local derivative of that layer's activation function. If this derivative is consistently less than 1, the gradient shrinks exponentially as it travels deeper into the network.

The sigmoid activation function is a classic example. Its derivative is $\sigma(x)(1 - \sigma(x))$ , which has a maximum of 0.25 at $x = 0$ and falls toward 0 for large positive or negative inputs. In a 10-layer network where each layer multiplies the gradient by at most 0.25, the gradient reaching the first layer is at most $(0.25)^{10} = 1/1{,}000{,}000$ of its value at the output. The first layers receive essentially no learning signal.

Sigmoid and its derivative

Sigmoid saturates for large inputs, its derivative peaks at 0.25 and vanishes at the tails, causing gradients to shrink exponentially through layers.

ReLU (Rectified Linear Unit) largely solves vanishing gradients: its derivative is either 0 (for negative inputs) or 1 (for positive inputs). A gradient passing through an active ReLU neuron is unchanged in magnitude. This is why ReLU and its variants (Leaky ReLU, GELU, SiLU) replaced sigmoid activations in deep networks. The trade-off: dead neurons. If a neuron's pre-activation is always negative, its gradient is always 0 and it never updates. It is "dead."

Key Insight

The vanishing gradient problem is not just a historical artifact. It resurfaces whenever you use activations with saturating regions (sigmoid, tanh) or build very deep networks without residual connections. ResNets solved the problem for CNNs by adding skip connections that let gradients bypass entire blocks. Transformers solve it via layer normalization and residual connections around every attention and FFN block.

Computational Graphs

Automatic differentiation works by building an explicit record of every computation. This record, the computational graph, is what separates modern deep learning frameworks from hand-differentiated code. Understanding it explains why PyTorch's .backward() works, why you must call .zero_grad() before each step, and why certain operations break gradient flow.

What is a Computational Graph

A computational graph represents a mathematical expression as a directed acyclic graph (DAG). Each node is an operation (add, multiply, relu, log) and each edge carries a tensor value from one operation to the next. Leaf nodes are the input tensors (data and parameters). The output node is the loss.

Consider the expression loss = (w * x + b - y) ** 2. This decomposes into a graph: multiply w and x, add b, subtract y, square the result. Each intermediate value is a node. The graph makes explicit which values depend on which other values, and therefore which gradients must be computed.

Chain rule on a computational graph

Forward pass: compute each intermediate value. Backward pass: apply chain rule along each edge to accumulate gradients at every node.

Dynamic vs Static Graphs

TensorFlow 1.x used static graphs: you define the graph once (declare_operations), compile it, then run it with different data. The graph is fixed at compile time. This enables compiler optimizations (operator fusion, memory planning) but makes debugging painful because errors appear at execution time, not definition time.

PyTorch uses dynamic graphs (also called define-by-run): the graph is built as your Python code executes. Every tensor operation adds a node to a graph that is constructed on the fly. The graph is different for each forward pass. This is how PyTorch handles if-statements and for-loops that depend on input values. This is essential for variable-length sequences and dynamic architectures like tree-structured networks.

The practical consequence: PyTorch is easier to debug (use a Python debugger, print intermediate values) but slightly harder to deploy efficiently (static graph compilers like torch.compile or TensorRT can optimize static subgraphs). JAX takes a middle road: functional programming style with explicit JIT compilation that traces through Python control flow.

Interview Tip

The reason you must call optimizer.zero_grad() before every training step in PyTorch is that gradients accumulate (+=) in the .grad attribute of each tensor. If you do not zero them, the gradient from the current step is added to the gradient from the previous step. This is intentional. It enables gradient accumulation over multiple forward passes to simulate a larger batch size. But if you are not doing gradient accumulation, forgetting zero_grad() silently corrupts your training.

Common Pitfall

The most common autograd bug: calling loss.backward() without optimizer.zero_grad() first. PyTorch ACCUMULATES gradients by default — each backward() call ADDS to the existing .grad tensor. If you forget zero_grad(), your gradients grow every iteration and training diverges within a few steps. This design exists for gradient accumulation across micro-batches, but it trips up every beginner.

Gradient Tapes and Memory

PyTorch builds the graph lazily: when you call .backward(), it traverses the graph in reverse topological order, computing gradients using the chain rule and accumulating them at leaf nodes. After backward(), the intermediate activations (which were needed to compute gradients) can be freed.

This is the source of the memory asymmetry in training: training uses 3-4x more memory than inference because the forward pass must retain all intermediate activations until the backward pass is complete. With a 10-layer network and batch size 32, every activation tensor from every layer must be held in GPU memory simultaneously during forward pass. This is why gradient checkpointing (recomputing activations on demand during backward rather than storing them) is a standard technique for training very large models on limited memory.

python

1import torch
2from torch.utils.checkpoint import checkpoint
3# Wrap each transformer block to trade compute for memory
4for block in transformer.layers:
5    x = checkpoint(block, x, use_reentrant=False)

Interview Tip

Always use F.cross_entropy(), never F.softmax() + F.nll_loss(). The fused version is numerically stable and computes the elegant gradient softmax(z) - one_hot(y) directly.

Common Pitfall

In-place operations on tensors that require gradients will silently corrupt your backward pass. Operations like x.add_(1) or x[:, 0] = 0 modify the tensor that the autograd graph references, so the backward closure uses the mutated value instead of the original forward-pass value. PyTorch will raise a RuntimeError in many cases, but some in-place mutations slip through undetected. Always use out-of-place operations (x = x + 1, x = torch.cat(...)) when the tensor is part of the computation graph.

Common Pitfall

If your training suddenly produces NaN loss, the most common causes are: (1) learning rate too high, (2) log of zero or negative number in the loss, (3) division by zero in normalization, (4) float16 overflow. Add torch.autograd.set_detect_anomaly(True) temporarily to find which operation produced the NaN — but remove it for production training as it slows things down significantly.

Interview Tip

When debugging gradient issues, use tensor.register_hook(lambda g: print(g.shape, g.norm())) to inspect gradients at any point in the computation graph. If a gradient is all zeros, the computation was detached somewhere upstream. If it is NaN, an upstream operation divided by zero or took log of a negative number. This is faster than torch.autograd.set_detect_anomaly(True) for targeted debugging.

Non-Differentiable Operations

Not every operation in a deep learning pipeline is differentiable. Argmax (taking the index of the maximum value) produces an integer index. Its gradient is zero almost everywhere, making it useless for training. This is why softmax + cross-entropy is used instead of argmax: softmax converts logits to a differentiable probability distribution.

Sampling (drawing from a distribution) is another non-differentiable operation. This matters for variational autoencoders and reinforcement learning. The reparameterization trick works around it: instead of sampling $z$ directly from $\mathcal{N}(\mu, \sigma^2)$ , write $z = \mu + \sigma \cdot \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$ . The gradient flows through $\mu$ and $\sigma$ , not through the sampling step. This is one of the most elegant tricks in differentiable programming.

Forward vs Reverse Mode Automatic Differentiation

Modern deep learning frameworks use reverse-mode automatic differentiation (AD), also called backpropagation, almost exclusively. But forward-mode AD is equally valid mathematically. Understanding why reverse mode dominates, and when forward mode is actually preferred, reveals a fundamental asymmetry in how parameters and outputs relate to each other.

Two Modes of Automatic Differentiation

Both forward and reverse mode AD compute exact derivatives using the chain rule, no approximation, unlike finite differences. The difference is in the order of computation.

Forward mode starts from an input variable and propagates derivative information forward through the graph. For each input variable, one forward pass computes how every intermediate and output value changes with respect to that one input. To compute all partial derivatives of all outputs with respect to a single input, forward mode requires one pass. To compute gradients with respect to all $n$ inputs, it requires $n$ passes.

Reverse mode starts from the output (the loss) and propagates gradient information backward. One backward pass computes how the loss changes with respect to every intermediate value and input. The full gradient with respect to all parameters is computed in a single backward pass regardless of how many parameters the network has.

The asymmetry is critical. A neural network has billions of parameters (inputs) and a scalar loss (output). Reverse mode computes the full gradient in one pass. Forward mode would require one pass per parameter, billions of passes. This is why backpropagation is tractable.

Scenario	Inputs	Outputs	Forward mode passes	Reverse mode passes	Winner
Neural net training	100M params	1 (loss)	100,000,000	1	Reverse
Physics simulation	8 initial conditions	50,000 sensors	8	50,000	Forward
Single neuron	1 weight	1 output	1	1	Tie

Autograd: forward values, backward gradients

Forward pass populates values (blue). Backward pass walks the graph in reverse (orange, dashed), accumulating gradients at every node in one linear sweep.

When Forward Mode Wins

The rule is simple: use reverse mode when the function has many inputs and few outputs (loss functions), use forward mode when the function has few inputs and many outputs.

Physical simulations are a classic case for forward mode: a simulation may have 10 initial conditions (inputs) and produce 10,000 measurements (outputs). Reverse mode requires 10,000 backward passes (one per output); forward mode requires only 10 (one per input). JAX's jvp (Jacobian-vector product) enables efficient forward mode AD for exactly these cases.

The Jacobian-vector product (JVP) is the forward mode primitive: for a function $f$ and a direction vector $v$ , the JVP computes the directional derivative $(\partial f / \partial x) v$ in a single forward pass. The vector-Jacobian product (VJP) is the reverse mode primitive: given a function $f$ and a cotangent vector $v$ , it computes $v^\top (\partial f / \partial x)$ in a single backward pass. PyTorch's autograd computes VJPs; JAX supports both.

The Survey Paper: Baydin et al. 2018

The canonical reference for understanding automatic differentiation in the machine learning context is "Automatic Differentiation in Machine Learning: a Survey" by Baydin, Pearlmutter, Radul, and Siskind (2018). It formalizes the distinction between forward and reverse mode, covers higher-order differentiation, discusses implementation strategies (operator overloading vs source transformation), and surveys the ecosystem of AD tools.

The key insight from the paper: what deep learning practitioners call "backpropagation" is reverse-mode AD applied to a specific class of computation (feedforward networks with scalar loss). The more general framework supports any differentiable computation, physics simulations, probabilistic programs, differential equations, not just neural networks.

Key Insight

JAX's design unifies forward and reverse mode in a single abstraction. jax.grad computes reverse-mode gradients (most ML use cases). jax.jvp computes forward-mode directional derivatives. jax.jacfwd and jax.jacrev compute full Jacobians in forward and reverse mode respectively. For functions with mixed input-output shapes, you can compose them: jacfwd(jacrev(f)) computes second-order derivatives (Hessians) efficiently by using reverse mode for the first derivative and forward mode for the second.

Higher-Order Derivatives

Reverse mode AD produces first derivatives. But many algorithms need second-order information: the Hessian matrix of second partial derivatives, used in Newton's method and natural gradient optimization.

The Hessian is expensive: an $n$ -parameter model has an $n \times n$ Hessian, which is infeasible to compute or store for billion-parameter models. But Hessian-vector products (HVPs) can be computed efficiently without materializing the full Hessian: apply reverse mode AD to get the gradient, then apply forward mode AD to the gradient computation to get the HVP in a single forward-reverse pass pair. This enables second-order optimization methods that scale to large models.

In practice, full second-order optimization is rarely used for training large models due to memory and compute costs. But HVPs are used in: K-FAC (approximate natural gradient), influence functions (measuring how each training example affects the model), and some meta-learning algorithms.

Implementing Autograd

The best way to understand automatic differentiation is to implement it. Building a minimal autograd engine from scratch forces you to confront questions that PyTorch hides: how are gradients stored? what happens when a node is used twice? how does topological ordering determine the backward pass sequence?

Andrej Karpathy's micrograd is the gold standard minimal implementation, under 100 lines of Python, yet it correctly implements reverse-mode AD for scalars and can train a multi-layer perceptron. Walking through it reveals that backpropagation is not a mysterious algorithm but a straightforward application of the chain rule to an explicit graph.

Core Data Structure: the Value Node

Each node in the computational graph is a Value object that wraps a scalar and maintains:

data: the forward pass value
grad: the accumulated gradient (initialized to 0)
_prev: the set of Value nodes that this node depends on (its inputs)
_backward: a closure that, when called, computes and accumulates gradients into _prev nodes

The backward closure is defined at the time the operation is executed, capturing both the input values and the output node in its closure scope. This is the key design: the backward function is constructed lazily, at the same time as the forward computation.

python

1def __mul__(self, other):
2    out = Value(self.data * other.data, (self, other), '*')
3    def _backward():
4        # chain rule: d(self*other)/d(self) = other
5        self.grad += other.data * out.grad
6        other.grad += self.data * out.grad
7    out._backward = _backward
8    return out

Topological Sort and Backward

Before calling backward closures, we need to ensure we compute gradients in the right order: output first, inputs last. This is the reverse topological order of the graph, if node A feeds into node B, we must compute B's gradient before A's.

A DFS-based topological sort achieves this:

python

1def backward(self):
2    topo = []
3    visited = set()
4    def build_topo(v):
5        if v not in visited:
6            visited.add(v)
7            for child in v._prev:
8                build_topo(child)
9            topo.append(v)
10    build_topo(self)
11    self.grad = 1.0  # seed: dL/dL = 1
12    for node in reversed(topo):
13        node._backward()

The post-order DFS ensures every node appears after all nodes that depend on it (in forward direction), equivalently, before all nodes it depends on (in backward direction).

Interview Tip

The gradient accumulation in backward closures (using +=, not =) is not a coincidence. It is necessary for correctness in DAGs where a node is used more than once. If x appears in both a = x * 2 and b = x + 1, and we compute c = a + b, then dc/dx should be 3 (from both paths). Accumulation ensures both paths contribute their gradients to x.grad. If you used assignment (=) instead of +=, only the last backward closure to run would contribute, producing wrong gradients.

Why This Generalizes

The micrograd structure generalizes directly to tensor operations. In PyTorch, each operation on a tensor creates a grad_fn (the analog of _backward) that knows how to compute the gradient with respect to its inputs given the output gradient. The only difference is that scalars become tensors and the closure must implement the full Jacobian-vector product rather than a scalar multiply.

The autograd tape in PyTorch is essentially the _prev graph across all operations, with grad_fn playing the role of _backward. The backward() method performs the same topological sort and closure invocation, just at scale.

Practice: Implement Minimal Autograd

Build a Value class that supports addition, multiplication, and ReLU with correct backward passes. Your implementation should handle the DAG case (a node used multiple times) correctly via gradient accumulation.

Loading problem...

Course

Deep Learning Foundations

Mathematical Foundations

Neural Network Foundations

Representation Learning

Generative Models Beyond Language

Vision and Modern Self-Supervised Learning

Practical Training Decisions