Diffusion Models

The Forward Noising Process

If you had asked ML researchers in 2019 to bet on the next dominant image generation paradigm, almost none would have picked diffusion. GANs were the reigning champion, VAEs were the principled alternative, and autoregressive pixel models were the slow-but-steady fallback. Then in 2020 a team at Berkeley published "Denoising Diffusion Probabilistic Models" (Ho et al.), and within two years diffusion was powering DALL-E 2, Stable Diffusion, Imagen, and Midjourney. Nearly every major image generator shipped in 2022 or later is some variant of this idea.

The idea is almost absurdly simple once you see it. Start with a real image. Gradually add small amounts of Gaussian noise until the image is indistinguishable from pure random noise. Now train a neural network to reverse that process: given a noisy image at some timestep, predict what noise was added. Once the network is good at that, you can start from pure noise and iteratively denoise until you arrive at a novel image. The forward process is fixed and requires no learning. Only the reverse process is learned, and even then, we frame it as a plain regression task: predict the noise.

This section covers the forward process, the part where we corrupt the data. We will look at why it is set up the way it is, what the math is really doing, and how you can sample any intermediate timestep in a single line of code rather than running a thousand-step loop.

Why Start by Destroying Data?

The first question anyone asks is: why would we want to destroy the data in the first place? Why not just learn to generate images directly, like a GAN does?

The honest answer is that learning to generate images directly is hard. A GAN generator has to map a single random noise vector to a full photorealistic image in one forward pass. That is a huge leap for a neural network to make, and together with the adversarial training objective it is a large part of why GANs are so unstable to train. Diffusion models break the problem into hundreds or thousands of small, local steps. At each step, the network only has to remove a tiny bit of noise from a slightly-more-corrupted image. Each step is a small, well-defined regression problem. The full generation process emerges from chaining these easy steps together.

Think of it like carving a sculpture. You could try to sculpt the final figure in one swing of a hammer, which is almost impossible, or you could make thousands of small, incremental chips, each of which is easy, and end up with the same result. Diffusion takes the second path.

Figure: Forward diffusion gradually noises $x_0$ until it becomes pure Gaussian. A fixed Markov chain adds Gaussian noise at each step. After $T$ steps, $x_T$ is indistinguishable from $\mathcal{N}(0, I)$. No learning is required; only the reverse process is learned.

The Forward Process Formally

The forward process is a Markov chain. Start with a clean image $x_0$ sampled from the data distribution. At each timestep, we produce a slightly noisier version by sampling from a Gaussian centered on a rescaled version of the previous image:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I)$$

There are two things happening at each step. First, we shrink the existing image by a factor of $\sqrt{1 - \beta_t}$. Second, we add fresh Gaussian noise with variance $\beta_t$. The sequence $\beta_1, \dots, \beta_T$ is called the noise schedule; each $\beta_t$ determines how aggressively we corrupt the image at timestep $t$. A typical choice is a linear schedule that starts at $\beta_1 = 0.0001$ and ends at $\beta_T = 0.02$ over $T = 1000$ steps.
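The two operations in a single step can be sketched in a few lines of plain Python. This is only an illustration: the helper names `linear_beta_schedule` and `forward_step` are hypothetical, and the "image" is a flat list of three pixel values rather than a real tensor.

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return [beta_1, ..., beta_T], evenly spaced from beta_start to beta_end."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def forward_step(x_prev, beta_t, rng=random):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I) for a flat list of pixels."""
    shrink = math.sqrt(1.0 - beta_t)  # rescale the existing signal
    sigma = math.sqrt(beta_t)         # std of the fresh noise (its variance is beta_t)
    return [shrink * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]

betas = linear_beta_schedule()
x = [0.5, -0.25, 1.0]        # toy "image" with three pixel values
for beta_t in betas:         # run the full 1000-step chain, one step at a time
    x = forward_step(x, beta_t)
```

After the loop, `x` holds one draw of $x_T$ for the toy input. Running the chain step by step like this is only useful for building intuition.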

Why do we shrink the image on the way in? Because if we only added noise without shrinking, the variance of $x_t$ would keep growing with every step. The $\sqrt{1 - \beta_t}$ scaling keeps the total variance bounded: if the data is normalized to unit variance, $x_t$ stays at unit variance for every $t$. By the end of the process, $x_T$ is approximately a standard Gaussian, regardless of what $x_0$ was. This is the property that lets us start the reverse process from pure noise at inference time.
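The variance argument can be checked with a short calculation. As a simplifying assumption made here for illustration, treat a single pixel as a scalar random variable with variance $\sigma_{t-1}^2$, and use the fact that the fresh noise is independent of $x_{t-1}$:

$$
\begin{aligned}
\text{with shrinking:} \quad & \sigma_t^2 = (1 - \beta_t)\,\sigma_{t-1}^2 + \beta_t \;\;\Longrightarrow\;\; \sigma_t^2 - 1 = (1 - \beta_t)\,(\sigma_{t-1}^2 - 1) \\
\text{without shrinking:} \quad & \sigma_t^2 = \sigma_{t-1}^2 + \beta_t \;\;\Longrightarrow\;\; \sigma_T^2 = \sigma_0^2 + \sum_{t=1}^{T} \beta_t
\end{aligned}
$$

With shrinking, the gap between the variance and 1 contracts by a factor of $1 - \beta_t$ at every step, so the variance is pulled toward 1 whatever it started at. Without it, the noise variances simply accumulate: for the linear schedule above the sum is about 10, and it would keep rising with more steps.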

The Beautiful Shortcut: Direct Sampling

Here is the part of DDPM that often does not get enough appreciation. Suppose you want to compute $x_{500}$ given $x_0$. Do you have to run the forward process 500 times, adding noise at each step?

No. Each step is a linear rescaling plus independent Gaussian noise, and a composition of such steps is again a linear rescaling of $x_0$ plus Gaussian noise. You can therefore collapse all 500 steps into a single sampling operation. Define:

$$\alpha_t = 1 - \beta_t, \quad \bar\alpha_t = \prod_{s=1}^t \alpha_s$$

Then the forward process gives you, for any timestep $t$:

$$q(x_t \mid x_0) = \mathcal{N}(x_t;\, \sqrt{\bar\alpha_t}\, x_0,\, (1 - \bar\alpha_t) I)$$

Which we can rewrite as a closed-form sampling step using the reparameterization trick:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
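The collapse rests on one fact: a sum of independent Gaussians is Gaussian, with the variances added. A sketch for two consecutive steps, where $\epsilon_{t-1}$, $\epsilon_t$, and $\bar\epsilon$ are notation introduced here for independent standard Gaussian draws:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})}\, \epsilon_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar\epsilon
\end{aligned}
$$

The last line merges the two noise terms, whose variances add up to $\alpha_t (1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$. Repeating the substitution down to $x_0$ turns the product $\alpha_t \alpha_{t-1} \cdots \alpha_1$ into $\bar\alpha_t$, which gives the one-shot sampling formula.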

This is how you actually train diffusion models. At every training step, you sample a random timestep $t$, compute $x_t$ directly from $x_0$ in one shot, and train the network to predict the $\epsilon$ that was used. The Markov chain is never simulated at training time; the closed form makes the step-by-step loop unnecessary.

```python
import torch

# Precomputed schedule tensors, each of shape [T]:
#   betas, alphas = 1 - betas, alpha_bars = cumprod(alphas)
# sqrt_alpha_bar and sqrt_one_minus_alpha_bar are derived from alpha_bars.

def q_sample(x0, t, sqrt_alpha_bar, sqrt_one_minus_alpha_bar):
    """Sample x_t directly from x_0 in one shot.

    x0: image batch of shape [B, C, H, W]; t: integer timestep indices of shape [B].
    """
    epsilon = torch.randn_like(x0)
    # gather per-sample scaling factors and broadcast over channel and spatial dims
    sab = sqrt_alpha_bar[t].view(-1, 1, 1, 1)
    somb = sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)
    return sab * x0 + somb * epsilon, epsilon
```

The returned $\epsilon$ is exactly the training target. Remember that. The entire training loop hinges on this.
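As a usage sketch, the schedule tensors that `q_sample` expects can be precomputed once and reused for every batch. The batch size, image shape, and random stand-in data below are assumptions for illustration, and `q_sample` is repeated only so the block runs on its own. Index `0` of each schedule tensor corresponds to timestep $t = 1$.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear DDPM schedule, shape [T]
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # running product of alphas, shape [T]
sqrt_alpha_bar = alpha_bars.sqrt()
sqrt_one_minus_alpha_bar = (1.0 - alpha_bars).sqrt()

def q_sample(x0, t, sqrt_alpha_bar, sqrt_one_minus_alpha_bar):
    """Same one-shot sampler as above, repeated so this block is self-contained."""
    epsilon = torch.randn_like(x0)
    sab = sqrt_alpha_bar[t].view(-1, 1, 1, 1)
    somb = sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)
    return sab * x0 + somb * epsilon, epsilon

x0 = torch.randn(8, 3, 32, 32)               # stand-in batch; real data would be normalized images
t = torch.randint(0, T, (x0.shape[0],))      # one 0-based timestep index per sample
x_t, epsilon = q_sample(x0, t, sqrt_alpha_bar, sqrt_one_minus_alpha_bar)
```

Each sample in the batch gets its own timestep, so a single batch covers many noise levels at once.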

Key Insight

The forward process is fixed and requires zero learnable parameters. You can implement it in ten lines of code. All the learning happens in the reverse process, where a neural network predicts the noise that was added. This asymmetry is what makes diffusion models tractable to train: the difficult part (density estimation) is reduced to a plain MSE regression, and the easy part (corrupting data) does not need a model at all.

Noise Schedules

The linear schedule from DDPM works, but it is not optimal. It destroys information faster than necessary, so a sizable share of the later timesteps is spent on images that are already almost pure noise. Later work proposed better schedules:

Cosine schedule (Nichol and Dhariwal, 2021): $\bar\alpha_t = f(t) / f(0)$ with $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$ and a small offset $s$. This spends more timesteps in the "useful" middle range where the image is partially noisy. A common choice in later work.
Sigmoid schedule: Similar motivation, smoother at the endpoints.
Variance preserving vs variance exploding: DDPM is variance preserving (for unit-variance data, the total variance stays at 1). Some methods, such as the variance exploding formulation in score-based SDEs, leave the signal unscaled and let the noise variance grow very large instead. Different math, similar results.

For most practical work, the cosine schedule or a close variant is a sensible default. The quality difference between linear and cosine is small but consistent, and there is rarely a reason not to use the better one.
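The difference between the two schedules is easy to inspect numerically. The sketch below assumes the cosine schedule as described by Nichol and Dhariwal: $\bar\alpha_t$ normalized by its value at $t = 0$, offset $s = 0.008$, and per-step betas clipped at 0.999. The function names are hypothetical helpers, not part of any library.

```python
import math

def cosine_alpha_bars(T=1000, s=0.008):
    """Return [alpha_bar_0, ..., alpha_bar_T], normalized so that alpha_bar_0 = 1."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    f0 = f(0)
    return [f(t) / f0 for t in range(T + 1)]

def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    """Recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped to avoid beta near 1."""
    return [
        min(1.0 - alpha_bars[t] / alpha_bars[t - 1], max_beta)
        for t in range(1, len(alpha_bars))
    ]

def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for the linear schedule, with alpha_bar_0 = 1."""
    out, running = [1.0], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

cos_ab = cosine_alpha_bars()
lin_ab = linear_alpha_bars()
betas = betas_from_alpha_bars(cos_ab)
# alpha_bar is the fraction of signal variance left; compare the two at the midpoint
print(f"alpha_bar at t=500: cosine={cos_ab[500]:.3f}, linear={lin_ab[500]:.3f}")
```

Halfway through the chain, the cosine schedule retains much more of the original signal than the linear one, which is the sense in which it spends more steps on partially noisy images.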

What Does "t" Actually Mean to the Network?

When the network sees a noisy image $x_t$, during training or at inference time, it needs to know how noisy the image is. That is what the timestep $t$ tells it. Early in the denoising trajectory (large $t$), the network has very little signal to work with and should make broad, uncertain predictions. Late in the trajectory (small $t$), the image is almost clean and the network should make precise, detailed corrections. The timestep serves as a kind of "how much work is left" signal.

The timestep is typically encoded using a sinusoidal positional embedding (borrowed from transformers), projected through an MLP, and then injected into every block of the denoising network via FiLM-style modulation or additive conditioning. This is a small implementation detail but an essential one. Without the timestep signal, the network cannot know whether it is looking at a lightly corrupted image (small $t$) or a heavily noised one (large $t$), and the same network would have to do two very different jobs blindly.

If you ever debug a diffusion model producing garbage output, one of the first things to check is whether the timestep embedding is actually wired into every residual block. The symptom is a model that trains to a reasonable loss but generates samples that look like random noise, because the network cannot tell the timesteps apart and ends up averaging over all of them.
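The first stage of that pipeline, the sinusoidal embedding itself, can be sketched in plain Python. The layout used here (all sines followed by all cosines, frequencies spaced geometrically down from a base period of 10000, even `dim`) is one common convention and is an assumption; implementations differ in the details. The MLP projection and the injection into the network blocks are not shown.

```python
import math

def sinusoidal_embedding(t, dim=128, max_period=10000.0):
    """Map a scalar timestep t to a dim-vector of sines and cosines (dim assumed even)."""
    half = dim // 2
    # frequencies fall geometrically from 1 toward 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]

emb_small_t = sinusoidal_embedding(10)    # lightly noised input
emb_large_t = sinusoidal_embedding(900)   # heavily noised input
```

Distinct timesteps map to distinct vectors with entries in $[-1, 1]$, which gives the downstream MLP a smooth, bounded representation of the noise level to work with.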
