Diffusion Models
If you had asked an ML researcher in 2019 to bet on the next dominant image generation paradigm, almost none of them would have picked diffusion. GANs were the reigning champion, VAEs were the principled alternative, and autoregressive pixel models were the slow-but-steady fallback. Then in 2020 a team at Berkeley published "Denoising Diffusion Probabilistic Models" (Ho et al.), and within two years diffusion was running DALL-E 2, Stable Diffusion, Imagen, and Midjourney. Nearly every major image generator shipped in 2022 or later is some variant of this idea.
The idea is almost absurdly simple once you see it. Start with a real image. Gradually add small amounts of Gaussian noise until the image is indistinguishable from pure random noise. Now train a neural network to reverse that process: given a noisy image at some timestep, predict what noise was added. Once the network is good at that, you can start from pure noise and iteratively denoise until you arrive at a novel image. The forward process is fixed and requires no learning. Only the reverse process is learned, and even then, we frame it as a plain regression task: predict the noise.
This section covers the forward process, the part where we corrupt the data. We will look at why it is set up the way it is, what the math is really doing, and how you can sample any intermediate timestep in a single line of code rather than running a thousand-step loop.
Why Start by Destroying Data?
The first question anyone asks is: why would we want to destroy the data in the first place? Why not just learn to generate images directly, like a GAN does?
The honest answer is that learning to generate images directly is hard. A GAN generator has to map a single random noise vector to a full photorealistic image in one forward pass. That is a huge leap for a neural network to make, and together with the adversarial training objective it is part of the reason GANs are so unstable to train. Diffusion models break the problem into many small, local steps, typically around a thousand. At each step, the network only has to remove a tiny bit of noise from a slightly-more-corrupted image. Each step is a small, well-defined regression problem. The full generation process emerges from chaining these easy steps together.
Think of it like carving a sculpture. You could try to sculpt the final figure in one swing of a hammer, which is almost impossible, or you could make thousands of small, incremental chips, each of which is easy, and end up with the same result. Diffusion takes the second path.
Forward diffusion: gradually noise x₀ until it becomes pure Gaussian
The Forward Process Formally
The forward process is a Markov chain. Start with a clean image $x_0$ sampled from the data distribution. At each timestep $t$, we produce a slightly noisier version $x_t$ by sampling from a Gaussian centered on a rescaled version of the previous image:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$
There are two things happening at each step. First, we shrink the existing image by a factor of $\sqrt{1 - \beta_t}$. Second, we add fresh Gaussian noise with variance $\beta_t$. The sequence $\beta_1, \dots, \beta_T$ is called the noise schedule and determines how aggressively we corrupt the image at timestep $t$. A typical choice is a linear schedule that starts at $\beta_1 = 10^{-4}$ and ends at $\beta_T = 0.02$ over $T = 1000$ steps.
Why do we shrink the image on the way in? Because if we only added noise without shrinking, the variance of $x_t$ would grow without bound over time. The scaling keeps the total variance bounded: if $x_{t-1}$ has unit variance, then $x_t$ has variance $(1 - \beta_t) \cdot 1 + \beta_t = 1$. By the end of the process, $x_T$ is approximately a standard Gaussian, regardless of what $x_0$ was. This is the property that lets us start the reverse process from pure noise at inference time.
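The step-by-step chain can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names are made up for this example, the toy "image" is random data scaled to [-1, 1] (an assumption about preprocessing), and the schedule uses the linear DDPM values.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (linear DDPM values)."""
    return np.linspace(beta_start, beta_end, T)

def forward_chain(x0, betas, rng):
    """Run the Markov chain one step at a time and return x_T."""
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # Shrink the current image, then add fresh noise with variance beta.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(64, 64))  # toy "image", assumed scaled to [-1, 1]
betas = linear_beta_schedule()
xT = forward_chain(x0, betas, rng)

# After the full chain, the pixel statistics should be close to mean 0, variance 1.
print(xT.mean(), xT.var())
```

Dropping the `np.sqrt(1.0 - beta)` factor in the loop makes the variance grow with every step instead of settling near 1, which is the failure mode the rescaling avoids.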
The Beautiful Shortcut: Direct Sampling
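Here is the part of DDPM that often does not get enough appreciation. Suppose you want to compute $x_{500}$ given $x_0$. Do you have to run the forward process 500 times, adding noise at each step?
No. Because the forward process is linear in $x_{t-1}$ plus Gaussian noise, you can collapse all 500 steps into a single sampling operation. Define:
$$\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$
Then the forward process gives you, for any timestep $t$:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, \mathbf{I}\right)$$
Which we can rewrite as a closed-form sampling step using the reparameterization trick:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$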
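To see why the steps collapse, it helps to unroll just two of them. The sketch below writes $\alpha_t = 1 - \beta_t$, uses $\epsilon_t$ and $\epsilon_{t-1}$ for the independent noise draws at the two steps, and introduces $\bar{\epsilon}$ (notation chosen here for illustration) for the merged noise:

```latex
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \sqrt{\alpha_t (1 - \alpha_{t-1})}\, \epsilon_{t-1}
       + \sqrt{1 - \alpha_t}\, \epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \bar{\epsilon},
\qquad \bar{\epsilon} \sim \mathcal{N}(0, \mathbf{I})
\end{aligned}
```

The last line uses the fact that two independent zero-mean Gaussians add to a Gaussian whose variance is the sum of the two: $\alpha_t (1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$. Repeating the same substitution all the way down to $x_0$ turns the product of the $\alpha$ terms into $\bar{\alpha}_t$.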
This is how you actually train diffusion models. At every training step, you sample a random timestep $t$, compute $x_t$ directly from $x_0$ in one shot, and train the network to predict the $\epsilon$ that was used. The Markov chain is never actually simulated at training time; the closed form replaces it entirely.
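A minimal NumPy sketch of the one-shot sampler follows. The names `make_alpha_bar` and `q_sample` are illustrative choices rather than any library's API, timesteps are assumed to be 1-indexed, and the input is toy data scaled to [-1, 1].

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_1..alpha_bar_T for a linear schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in one shot.

    t is 1-indexed (1..T). Returns (x_t, noise) so the caller can use
    the noise as the regression target.
    """
    a = alpha_bar[t - 1]
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))  # toy image, assumed scaled to [-1, 1]

# Jump straight to timestep 500 without simulating steps 1..499.
x_500, eps = q_sample(x0, 500, alpha_bar, rng)
```

Returning the noise together with `x_t` is a deliberate design choice in this sketch: the training loop needs both, and drawing the noise inside the function guarantees the target matches the sample.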
The sampled noise $\epsilon$ is exactly the training target. Remember that. The entire training loop hinges on this.
The forward process is fixed and requires zero learnable parameters. You can implement it in ten lines of code. All the learning happens in the reverse process, where a neural network predicts the noise that was added. This asymmetry is what makes diffusion models tractable to train: the difficult part (density estimation) is reduced to a plain MSE regression, and the easy part (corrupting data) does not need a model at all.
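The MSE regression mentioned above can be sketched as a single loss evaluation. This is a hedged illustration only: `ddpm_loss` is a name invented here, `predict_noise` stands for any callable taking `(x_t, t)`, and `zero_predictor` is a hypothetical stand-in for a trained network so that the example runs without a deep learning framework.

```python
import numpy as np

def ddpm_loss(predict_noise, x0_batch, alpha_bar, rng):
    """One Monte Carlo estimate of the noise-prediction MSE for a batch."""
    B = x0_batch.shape[0]
    T = len(alpha_bar)
    t = rng.integers(1, T + 1, size=B)  # one timestep per example, uniform in 1..T
    # Reshape alpha_bar_t to (B, 1, 1, ...) so it broadcasts over each image.
    a = alpha_bar[t - 1].reshape(B, *([1] * (x0_batch.ndim - 1)))
    noise = rng.standard_normal(x0_batch.shape)
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * noise
    pred = predict_noise(x_t, t)
    return np.mean((pred - noise) ** 2)

def zero_predictor(x_t, t):
    """Hypothetical placeholder for a network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0_batch = rng.uniform(-1.0, 1.0, size=(16, 3, 8, 8))  # toy batch

loss = ddpm_loss(zero_predictor, x0_batch, alpha_bar, rng)
print(loss)  # roughly 1, the variance of the noise, since nothing is predicted
```

A predictor that ignores its input scores about 1 here, which gives a useful baseline: a real network should drive the loss well below that.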
Noise Schedules
The linear schedule from DDPM works, but it is not optimal. It destroys information too quickly, so a sizable share of the later timesteps is spent on images that are already almost pure noise. Later work proposed better schedules:
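- Cosine schedule (Nichol and Dhariwal, 2021): the cumulative signal level is set directly as $\bar{\alpha}_t = f(t) / f(0)$, where $f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)$, with a small offset $s = 0.008$. This spends more timesteps in the "useful" middle range where the image is partially noisy. Standard for most modern systems.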
- Sigmoid schedule: Similar motivation, smoother at the endpoints.
- Variance preserving vs variance exploding: DDPM is variance preserving (total variance stays bounded at 1). Some methods, such as the variance exploding formulation of score-based SDEs, leave the signal unscaled and let the noise variance grow very large instead. Different math, similar results.
For almost any practical work you will use the cosine schedule or a close variant. The quality difference between linear and cosine is small but consistent, and there is no reason not to use the better one.
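The difference between the two schedules is easy to inspect numerically. The sketch below compares the cumulative signal level for the linear and cosine schedules; the function names are illustrative, the threshold of 0.01 for "almost pure noise" is an arbitrary choice made for this example, and the clipping of individual betas used by Nichol and Dhariwal is omitted for brevity.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_1..alpha_bar_T for the linear DDPM schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1.0 + s)) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]  # drop t = 0 so index i holds alpha_bar_{i+1}

lin_ab = linear_alpha_bar()
cos_ab = cosine_alpha_bar()

# Signal level remaining at a few points along the schedule.
for frac in (0.25, 0.5, 0.75):
    i = int(frac * 1000) - 1
    print(f"t/T={frac}: linear={lin_ab[i]:.4f}  cosine={cos_ab[i]:.4f}")

# Fraction of timesteps where less than 1% of the signal variance remains.
print("near-noise steps, linear:", np.mean(lin_ab < 0.01))
print("near-noise steps, cosine:", np.mean(cos_ab < 0.01))
```

With these settings the linear schedule leaves a much larger fraction of timesteps below the threshold than the cosine schedule does, which matches the motivation given above.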
What Does "t" Actually Mean to the Network?
When the network sees a noisy image at inference time, it needs to know how noisy the image is. That is what the timestep $t$ tells it. Early in the denoising trajectory (large $t$), the network has very little signal to work with and should make broad, uncertain predictions. Late in the trajectory (small $t$), the image is almost clean and the network should make precise, detailed corrections. The timestep serves as a kind of "how much work is left" signal.
The timestep is typically encoded using a sinusoidal positional embedding (borrowed from transformers), projected through an MLP, and then injected into every block of the denoising network via FiLM-style modulation or additive conditioning. This is a small implementation detail but an essential one. Without the timestep signal, the network cannot know whether it is looking at a lightly corrupted image (small $t$) or a heavily noised one (large $t$), and the same network would have to do two very different jobs blindly.
If you ever debug a diffusion model producing garbage output, one of the first things to check is whether the timestep embedding is actually wired into every residual block. The symptom is a model that trains to a reasonable loss but generates samples that look like random noise, because the network is averaging over all timesteps since it cannot tell them apart.
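The sinusoidal encoding itself is short. The following is a sketch of one common formulation, with assumptions stated up front: `timestep_embedding` is a name chosen for this example, `max_period=10000` follows the transformer convention, and `dim` is assumed to be even. The MLP projection and the per-block injection are not shown.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of integer timesteps.

    t: array-like of shape (B,). dim: even embedding width.
    Returns an array of shape (B, dim) with values in [-1, 1].
    """
    t = np.asarray(t, dtype=np.float64)
    half = dim // 2
    # Geometrically spaced frequencies from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=1)

emb = timestep_embedding([1, 500, 1000], dim=128)
print(emb.shape)  # (3, 128)
```

Each timestep maps to a distinct vector, so a network that actually receives this embedding can tell a nearly clean input from a nearly pure-noise one. When debugging, printing or asserting on the embedding at the point where it enters each block is a cheap way to confirm the wiring.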