Autoencoders and Variational Autoencoders

Topics Covered

Vanilla Autoencoders

The Encoder-Decoder Decomposition

Why the Bottleneck Is the Point

The Relationship to PCA

Undercomplete vs Overcomplete

Architectural Choices That Matter

Where Vanilla Autoencoders Fall Short

Denoising and Sparse Autoencoders

The Problem With Pure Reconstruction

Denoising Autoencoders

Why Denoising Works as a Regularizer

Sparse Autoencoders

Sparse Autoencoders Come Back for Interpretability

Choosing Between Vanilla, Denoising, and Sparse

Variational Autoencoders

What Probabilistic Structure Buys You

The Generative Model

The Variational Trick

Understanding the KL Term

The Reconstruction-KL Trade-Off

Posterior Collapse

Why VAEs Matter Today

The Reparameterization Trick

The Problem: Backprop Through Sampling

The Trick: Move the Randomness Outside

Why This Is Remarkable

The Code Pattern

The Full Forward Pass

Beyond Gaussian: Where the Trick Generalizes

From VAEs to Diffusion Models

An autoencoder is the simplest idea in representation learning that actually works: train a neural network to copy its input to its output, but force the information to pass through a bottleneck in the middle. If the bottleneck is narrower than the input, the network cannot learn the identity function. It has to compress. And to compress, it has to find the regularities in the data, the patterns that let many examples share a shorter description.

That is the whole lesson of self-supervised learning in one paragraph. You do not need labels to learn what matters about your data. You just need a reconstruction task and an architectural constraint that prevents cheating.

The Encoder-Decoder Decomposition

A vanilla autoencoder factors into two networks. The encoder $f_\theta$ maps an input $x \in \mathbb{R}^D$ to a latent code $z \in \mathbb{R}^d$ where $d \ll D$. The decoder $g_\phi$ maps the latent code back to a reconstruction $\hat{x} \in \mathbb{R}^D$. Training minimizes the reconstruction error:

$$\mathcal{L}_{\text{AE}}(\theta, \phi) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ \| x - g_\phi(f_\theta(x)) \|^2 \right]$$

For images this is typically mean squared error (MSE) on pixel values. For binary data it is binary cross-entropy. For discrete categorical data it is cross-entropy over the vocabulary. The choice of reconstruction loss implicitly defines the likelihood, or noise model, that the decoder assumes for the data: MSE corresponds to Gaussian noise with fixed variance, and cross-entropy corresponds to Bernoulli or categorical noise. This detail matters when we get to VAEs.
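To make the link between reconstruction loss and noise model concrete, here is a short derivation. It assumes, as an illustration rather than something stated above, that the decoder output is the mean of an isotropic Gaussian with a fixed variance $\sigma^2$ (the symbol $\sigma$ is introduced here only for this sketch), and in the binary case that each $\hat{x}_i \in (0, 1)$ is a Bernoulli probability:

```latex
% Gaussian noise model: negative log-likelihood is MSE up to scale and an additive constant
-\log p(x \mid z)
  = -\log \mathcal{N}\!\left(x;\; g_\phi(z),\; \sigma^2 I\right)
  = \frac{1}{2\sigma^2}\,\| x - g_\phi(z) \|^2 + \frac{D}{2}\log\!\left(2\pi\sigma^2\right)

% Bernoulli noise model: negative log-likelihood is binary cross-entropy
-\log p(x \mid z)
  = -\sum_{i=1}^{D} \left[ x_i \log \hat{x}_i + (1 - x_i)\log\!\left(1 - \hat{x}_i\right) \right],
  \qquad \hat{x} = g_\phi(z)
```

With $\sigma$ held fixed, the second term of the Gaussian case does not depend on the parameters, so minimizing the negative log-likelihood is the same as minimizing squared error.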

Autoencoder: encode through a bottleneck, then reconstruct
[Diagram: x (input) → enc 1 → enc 2 (compress) → z (latent) → dec 1 → dec 2 (expand) → x̂ (recon); a Loss node compares them as ||x − x̂||², and its gradient ∇L drives the Optimizer (SGD / Adam).]
The encoder compresses x to a narrow latent z; the decoder reconstructs x̂ from z. The Loss node compares input against reconstruction; gradient descent updates encoder and decoder.
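The loop in the diagram can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions that are not from the text: a synthetic dataset lying near a 2-D subspace of an 8-D space, a one-layer tanh encoder, a one-layer linear decoder, hand-written gradients, and plain gradient descent with an arbitrary step size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 256 points in R^8 lying near a 2-D subspace.
D, d, n = 8, 2, 256
x = rng.normal(size=(n, d)) @ rng.normal(size=(d, D))
x += 0.01 * rng.normal(size=(n, D))
x /= x.std()  # unit overall scale keeps plain gradient descent stable

# Encoder f_theta(x) = tanh(x W1 + b1); decoder g_phi(z) = z W2 + b2.
W1, b1 = 0.1 * rng.normal(size=(D, d)), np.zeros(d)
W2, b2 = 0.1 * rng.normal(size=(d, D)), np.zeros(D)

def forward(x):
    z = np.tanh(x @ W1 + b1)  # compress: R^D -> R^d
    x_hat = z @ W2 + b2       # expand:   R^d -> R^D
    return z, x_hat

def recon_loss(x, x_hat):
    # Mean over examples of the squared L2 reconstruction error.
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

lr, steps = 0.01, 500
initial_loss = recon_loss(x, forward(x)[1])
for _ in range(steps):
    z, x_hat = forward(x)
    # Backpropagate the loss by hand.
    g_xhat = 2.0 * (x_hat - x) / n
    g_W2, g_b2 = z.T @ g_xhat, g_xhat.sum(axis=0)
    g_pre = (g_xhat @ W2.T) * (1.0 - z ** 2)  # tanh'(a) = 1 - tanh(a)^2
    g_W1, g_b1 = x.T @ g_pre, g_pre.sum(axis=0)
    # One gradient-descent step on encoder and decoder together.
    W1 -= lr * g_W1
    b1 -= lr * g_b1
    W2 -= lr * g_W2
    b2 -= lr * g_b2
final_loss = recon_loss(x, forward(x)[1])
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

In practice a framework with automatic differentiation and an optimizer such as Adam would replace the manual gradient code. The structure stays the same: encode, decode, compare, update.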

Why the Bottleneck Is the Point

If the latent dimension $d$ equals the input dimension $D$, nothing interesting happens. The encoder can learn the identity map, the decoder inverts it, and the reconstruction is perfect but the representation is useless. You learned nothing about the data.

If $d \ll D$, the encoder is forced to throw away information. It has to decide which information to keep. Keeping useful information (edges, shapes, object identity) means the decoder can still reconstruct the input well. Keeping useless information (sensor noise, exact pixel values, random artifacts) means the reconstruction error goes up. So gradient descent naturally steers the encoder toward a representation that preserves what matters for reconstruction and discards what does not.

This is why autoencoders learn features without labels. The reconstruction task itself defines "useful" implicitly: useful means predictive of the input. A neuron that fires on edges survives training because edges help reconstruction. A neuron that fires on random noise dies because it wastes bottleneck capacity on non-predictive information.

Key Insight

The word compression is load-bearing here. PCA is also a bottleneck compressor, but it is restricted to linear projections onto the top-d principal components. An autoencoder with nonlinear encoder and decoder generalizes this to arbitrary nonlinear manifolds, and with a deep architecture it can discover features that live on curved surfaces PCA cannot see. A single-layer linear autoencoder with MSE loss provably recovers the same subspace as PCA. Adding nonlinearity and depth unlocks everything else.

The Relationship to PCA

A useful grounding exercise. Suppose the encoder and decoder are both single linear layers with no activation function, the data is centered, and the loss is MSE. The encoder is $z = W_e x$, the decoder is $\hat{x} = W_d z$, and the overall reconstruction is $\hat{x} = W_d W_e x$. Minimizing $\| x - W_d W_e x \|^2$ over the data gives a solution where the columns of $W_d$ span the top-$d$ principal subspace. Up to an invertible change of basis within that subspace, this is exactly PCA.
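The claim can be checked numerically without training anything. The sketch below assumes synthetic centered data and builds the weights directly from an SVD. It then confirms that re-mixing the latent coordinates with an arbitrary invertible matrix `A` (a name introduced only for this example) leaves the reconstruction unchanged, while a random subspace does no better.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 500, 6, 2
x = rng.normal(size=(n, D)) @ rng.normal(size=(D, D))
x -= x.mean(axis=0)  # the PCA argument assumes centered data

# PCA via SVD: the rows of Vt are the principal directions.
_, s, Vt = np.linalg.svd(x, full_matrices=False)
V_d = Vt[:d].T  # (D, d): top-d principal directions as columns

def recon_error(W_e, W_d):
    # Rows of x are examples, so z = x W_e^T and x_hat = z W_d^T.
    x_hat = x @ W_e.T @ W_d.T
    return np.sum((x - x_hat) ** 2)

# PCA solution: W_e = V_d^T, W_d = V_d.
err_pca = recon_error(V_d.T, V_d)

# Same subspace, different basis: W_d W_e is still V_d V_d^T.
A = rng.normal(size=(d, d)) + 3.0 * np.eye(d)
err_mixed = recon_error(A @ V_d.T, V_d @ np.linalg.inv(A))

# A random d-dimensional subspace cannot beat the principal one.
Q, _ = np.linalg.qr(rng.normal(size=(D, d)))
err_random = recon_error(Q.T, Q)

print(err_pca, err_mixed, err_random)
```

The PCA error equals the sum of the discarded squared singular values, `np.sum(s[d:] ** 2)`. This is why a linear autoencoder identifies only the subspace and not the individual principal components.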

Nonlinear autoencoders inherit this geometric picture but with curved subspaces instead of flat ones. They find a $d$-dimensional manifold embedded in $\mathbb{R}^D$ that best covers the data. On natural images, that manifold looks roughly like "the set of images that could plausibly exist in the world", which is vanishingly small compared to the full space of $256^{3 \cdot H \cdot W}$ possible pixel arrays.
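To get a feel for the size of that full space, the count can be evaluated in log space. The helper name and the 32x32 example size below are illustrative choices, not values from the text.

```python
import math

def log10_num_pixel_arrays(height, width, channels=3, levels=256):
    # The count is levels ** (channels * height * width); work with its
    # base-10 logarithm because the integer itself is astronomically large.
    return channels * height * width * math.log10(levels)

# Number of decimal digits in 256 ** (3 * 32 * 32), for a tiny 32x32 RGB image.
digits_32 = math.floor(log10_num_pixel_arrays(32, 32)) + 1
print(digits_32)  # 7399
```

Even at 32x32 the count has thousands of digits, so the plausible images form a negligible fraction of the space.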

Undercomplete vs Overcomplete

Not all autoencoders have a narrow bottleneck. An undercomplete autoencoder has $d < D$. An overcomplete autoencoder has $d \geq D$. If the latent is wider than the input, the compression argument breaks. Why would you ever want that?

The answer is regularization. If you impose a constraint on the latent code (sparsity, noise injection, a prior distribution), an overcomplete latent can still learn useful features because the constraint forces the representation to be structured in some other way. A sparse autoencoder, for example, uses a wide latent but penalizes the number of non-zero activations. Each neuron ends up specializing in a specific pattern, like a single stroke in a handwritten digit or a specific texture in a photo. Denoising autoencoders (coming up in the next section) use a different regularization: reconstruct the clean input from a noisy one.

Both approaches give you representation learning without a narrow bottleneck. The bottleneck is a sufficient condition for learning useful features, not a necessary one. What is always necessary is some constraint that prevents the identity function from being the optimal solution.
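Both constraints amount to small changes to the training objective. The sketch below is one possible formulation, with assumptions labeled: an L1 penalty stands in for "number of non-zero activations" (a common convex surrogate, not necessarily the one used later in this series), and the corruption is additive Gaussian noise. The names `l1_weight` and `noise_std` are hypothetical hyperparameters.

```python
import numpy as np

def recon_error(x, x_hat):
    # Mean over examples of the squared L2 reconstruction error.
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

def sparse_loss(x, x_hat, z, l1_weight=1e-3):
    # Reconstruction error plus an L1 penalty on the latent code z.
    # L1 is used here as a convex surrogate for counting non-zero activations.
    sparsity = np.mean(np.sum(np.abs(z), axis=1))
    return recon_error(x, x_hat) + l1_weight * sparsity

def corrupt(x, rng, noise_std=0.1):
    # Additive Gaussian corruption (assumed noise type for this sketch).
    return x + noise_std * rng.normal(size=x.shape)

def denoising_loss(x_clean, x_hat_from_noisy):
    # The network sees corrupt(x) but is scored against the clean x.
    return recon_error(x_clean, x_hat_from_noisy)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
z_wide = np.ones((4, 6))  # overcomplete code: 6 latent units for 3 inputs
print(sparse_loss(x, x, z_wide, l1_weight=0.5))  # 0 reconstruction error + 0.5 * 6
```

In both cases the identity map stops being optimal: copying the input costs either sparsity penalty or denoising error.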

Architectural Choices That Matter

For images, modern autoencoders use convolutional encoders with strided downsampling (reducing spatial resolution by 2x per layer) and transposed convolutions or pixel-shuffle upsampling in the decoder. A typical MNIST autoencoder encodes 28x28 images into 32-dimensional latents through 3-4 conv layers. A typical ImageNet autoencoder encodes 256x256 images into 4096-dimensional latents, often with a factor of 16x spatial compression.
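The spatial arithmetic behind those numbers follows from the standard convolution output-size formula. The sketch assumes 3x3 kernels with stride 2 and padding 1, and a 16-channel latent for the 256x256 case. Those specifics are illustrative assumptions chosen to reproduce the 4096-dimensional figure, not details given above.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    # Standard output-size formula for a strided convolution.
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(size, num_layers, **conv_kwargs):
    # Spatial size after each of num_layers identical strided conv layers.
    sizes = [size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1], **conv_kwargs))
    return sizes

mnist_sizes = encoder_sizes(28, 3)      # [28, 14, 7, 4]
imagenet_sizes = encoder_sizes(256, 4)  # [256, 128, 64, 32, 16]

# 16x spatial compression leaves a 16x16 grid; with 16 latent channels
# (assumed) that is 16 * 16 * 16 = 4096 latent values.
latent_channels = 16
latent_dim = imagenet_sizes[-1] ** 2 * latent_channels
print(mnist_sizes, imagenet_sizes, latent_dim)
```

For MNIST, the final 4x4 feature map would typically be flattened and passed through a linear layer to reach the 32-dimensional code.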

For text, the encoder is usually a transformer producing a single pooled vector or a sequence of latent vectors. The decoder is an autoregressive transformer. But text autoencoders are not dominant in modern NLP: the BERT and GPT families mostly skip the compression step and use masked or causal language modeling instead. Autoencoders come back into play for text embeddings used in retrieval, where the pooled latent is the only thing you care about.

For tabular data or time series, MLPs work fine. An autoencoder with two or three fully connected layers in the encoder, a narrow bottleneck, and a mirror-image decoder is often enough. The architecture should match the structure of the data, not the other way around.

Where Vanilla Autoencoders Fall Short

Vanilla autoencoders have two big weaknesses that motivate everything that comes next. First, they learn a representation but not a generative model. The decoder can map any point in latent space to an output, but there is no reason to think that random latent codes will produce plausible outputs. If you pick a random $z$ and run it through the decoder, you usually get garbage. The latent space has no probabilistic structure, just the points that happened to be visited during training.

Second, the latent space is not smooth. Two nearby inputs might end up at wildly different latent codes because the encoder is under no pressure to organize the latent space geometrically. Interpolating between two latents in a straight line usually passes through non-data regions of the latent space, and the decoder produces nonsense along the interpolation path.
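The straight-line interpolation described here is simple to write down. In the sketch, `interpolate_latents` is a hypothetical helper name, and the two latent codes are made-up values standing in for the encoder outputs of two real inputs.

```python
import numpy as np

def interpolate_latents(z_a, z_b, num_steps=5):
    # Points on the straight line from z_a to z_b, endpoints included.
    t = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - t) * z_a + t * z_b

z_a = np.array([0.0, 2.0, -1.0])  # stand-in for the code of input a
z_b = np.array([4.0, 0.0, 1.0])   # stand-in for the code of input b
path = interpolate_latents(z_a, z_b)
print(path.shape)  # (5, 3)
```

Decoding each row of `path` with a trained decoder is the usual way to inspect the latent space. For a vanilla autoencoder, nothing constrains the intermediate rows to land where the decoder has seen training codes.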

Both of these problems get fixed when we impose probabilistic structure on the latent space, which is exactly what VAEs do. First, though, we need to cover two useful variants of the vanilla autoencoder that fix a different set of problems: denoising and sparse autoencoders.