Regularization and Generalization
The most seductive trap in machine learning is a training loss curve that keeps going down. You check every 100 steps. It drops. Every 1000 steps - still dropping. You feel like you are making progress. Meanwhile, the model is doing something more sinister: it is memorizing the training set rather than learning anything about the underlying distribution.
This is overfitting. And understanding why it happens, not just that it happens, changes how you think about model design.
Figure: Overfitting. Training loss keeps dropping while validation loss turns up.
What Overfitting Actually Is
A neural network with millions of parameters often has enough capacity to memorize its training set outright. Given 50,000 labeled images, a large enough network can learn a lookup table: "if the pixel pattern matches example 7,423, output class 3." The training loss reaches zero. The model has learned nothing generalizable.
The problem is that the training set is a sample from some underlying distribution. A model that memorizes the sample has not learned the distribution. It has learned the noise. When you show it a new sample from the same distribution, it fails, because new samples contain different noise.
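To make the lookup-table picture concrete, here is a deliberately extreme sketch: a classifier that only memorizes, trained on labels that are pure noise. The class name `LookupTableClassifier`, the data sizes, and the majority-class fallback are assumptions made for illustration; a real network memorizes far less literally than this.

```python
import numpy as np


class LookupTableClassifier:
    """Hypothetical model that memorizes every training example exactly."""

    def fit(self, X, y):
        # Key each training row by its raw bytes, store its label.
        self.table = {row.tobytes(): int(label) for row, label in zip(X, y)}
        # For inputs never seen before, fall back to the most common label.
        self.fallback = int(np.bincount(y).argmax())
        return self

    def predict(self, X):
        return np.array([self.table.get(row.tobytes(), self.fallback) for row in X])


rng = np.random.default_rng(0)

# Labels are drawn independently of the inputs: there is no pattern to learn.
X_train = rng.normal(size=(500, 8))
y_train = rng.integers(0, 2, size=500)
X_test = rng.normal(size=(500, 8))
y_test = rng.integers(0, 2, size=500)

model = LookupTableClassifier().fit(X_train, y_train)
train_acc = float(np.mean(model.predict(X_train) == y_train))
test_acc = float(np.mean(model.predict(X_test) == y_test))

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Training accuracy is perfect because every training row is in the table, while test accuracy should hover around chance, since nothing about the underlying distribution was learned.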
Formally, a model's expected error on new data decomposes into three parts:
- Bias: Error from incorrect assumptions about the data. A linear model fitting a curved relationship has high bias. It will be wrong even with infinite data.
- Variance: Error from sensitivity to the training sample. A high-capacity model that memorizes noise has high variance: a different training sample would produce a wildly different model.
- Irreducible noise: Error from randomness in the data itself. This sets a floor that no model can beat.
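For squared-error loss, the three parts above can be written out explicitly. The notation here is an assumption introduced for illustration, not taken from the lesson: the data follow y = f(x) + ε with zero-mean noise of variance σ², and f̂_D is the model fit on a training set D.

```latex
% Assumed setup: y = f(x) + \varepsilon, \mathbb{E}[\varepsilon] = 0,
% \operatorname{Var}(\varepsilon) = \sigma^2, and \hat{f}_D is the model
% trained on a random training set D (independent of \varepsilon).
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation over D is what makes variance a statement about sensitivity to the training sample: it measures how much the fitted model moves when the sample is redrawn.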
The bias-variance tradeoff says: models that reduce bias tend to increase variance, and vice versa. A degree-1 polynomial fit to data has high bias (underfitting). A degree-15 polynomial has low bias but high variance (overfitting). The sweet spot is in between.
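The polynomial example can be tried directly. The sketch below assumes a noisy sine curve as the true relationship, 20 training points, and a middle degree of 4; none of these choices come from the lesson, and the exact numbers will change with the random seed.

```python
import numpy as np
from numpy.polynomial import Polynomial


def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Least-squares polynomial fit; returns (train_mse, test_mse)."""
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse = float(np.mean((poly(x_test) - y_test) ** 2))
    return train_mse, test_mse


rng = np.random.default_rng(0)


def sample(n):
    # Assumed ground truth: a curved signal plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.3, size=n)
    return x, y


x_train, y_train = sample(20)     # small training set
x_test, y_test = sample(1000)     # large held-out set from the same distribution

results = {d: fit_and_score(d, x_train, y_train, x_test, y_test) for d in (1, 4, 15)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Training error can only fall as the degree rises, because each higher-degree family contains the lower-degree fits. Held-out error is the number to watch: it is typically worst at the two extremes, for opposite reasons.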
Why the Classical Tradeoff Breaks Down
Classical statistics predicts a U-shaped test error curve as model complexity increases: error falls as bias shrinks, hits a minimum, then rises as variance dominates. Neural networks routinely violate this prediction.
Modern overparameterized models, where the number of parameters exceeds the number of training examples, can achieve near-zero training loss (perfectly interpolating the data) and generalize well to new examples. This is the double descent phenomenon, covered in detail in the final section of this lesson.
For now, the practical implication is this: the classical bias-variance tradeoff describes the classical regime, where the model has too little capacity to interpolate the training data (roughly, fewer parameters than training examples). Beyond the interpolation threshold, something else is operating. Regularization techniques were designed for the classical regime, but they continue to help in the overparameterized regime, for different reasons.
The bias-variance decomposition is an accounting of expected test error, not a constraint on what a model can achieve. It says that sensitivity to the training sample (variance) contributes its own share of the error. Regularization reduces that share by constraining the model's solution space. The question is always: which solutions are you ruling out, and are the solutions you keep better than the ones you discard?
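One way to see "constraining the solution space" is L2-penalized (ridge) linear regression, used here purely as an illustrative stand-in; the lesson does not single out this technique. In the sketch below, the helper `ridge_fit`, the synthetic data, and the penalty values are all assumptions.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(0)

# Assumed setup: 30 examples, 20 features, only 3 of which matter.
X = rng.normal(size=(30, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(0.0, 0.5, size=30)

# A larger penalty rules out solutions with large weights.
norms = {}
for lam in (1e-3, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    norms[lam] = float(np.linalg.norm(w))
    print(f"lambda = {lam:7.3f}  ->  ||w|| = {norms[lam]:.3f}")
```

As the penalty grows, the fitted weight vector shrinks toward zero. Whether the solutions that survive are better than the ones ruled out depends on the data, which is exactly the question posed above.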
Diagnosing the Problem
The signature of overfitting is a growing gap between training loss and validation loss. Training loss decreases; validation loss plateaus or increases. The size of the gap tells you how much the model has overfit.
Underfitting looks different: both training and validation loss are high and roughly equal. The model has not yet learned the pattern in the data. Adding capacity (more layers, more parameters) or training longer usually helps.
The diagnostic tells you which regime you are in, which tells you which remedy applies. If overfitting: regularize, add data, reduce capacity. If underfitting: add capacity, reduce regularization, check data quality.
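The diagnostic can be sketched as a small rule-based check on recorded loss curves. The function name `diagnose` and the thresholds `gap_tol` and `high_loss` are hypothetical; sensible values depend entirely on the loss scale of the task, so treat this as a starting point rather than a rule.

```python
def diagnose(train_losses, val_losses, gap_tol=0.1, high_loss=1.0):
    """Classify a run as 'overfitting', 'underfitting', or 'ok'.

    gap_tol and high_loss are illustrative thresholds, not universal constants.
    """
    if not train_losses or len(train_losses) != len(val_losses):
        raise ValueError("need two non-empty loss histories of equal length")

    # Gap between validation and training loss at each recorded step.
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    final_gap = gaps[-1]
    gap_is_growing = final_gap > gaps[len(gaps) // 2]

    # Overfitting: validation loss sits well above training loss, and the gap is widening.
    if final_gap > gap_tol and gap_is_growing:
        return "overfitting"

    # Underfitting: both losses are still high and roughly equal.
    if train_losses[-1] > high_loss and abs(final_gap) <= gap_tol:
        return "underfitting"

    return "ok"


overfit_run = diagnose([2.0, 1.0, 0.5, 0.2, 0.05], [2.1, 1.2, 0.9, 1.0, 1.3])
underfit_run = diagnose([2.0, 1.9, 1.85, 1.8], [2.05, 1.95, 1.9, 1.85])
healthy_run = diagnose([2.0, 1.0, 0.4, 0.2], [2.1, 1.1, 0.45, 0.25])
print(overfit_run, underfit_run, healthy_run)
```

In practice the loss histories would come from whatever logging the training loop already does, and the returned label points to the matching remedy list above.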