Deep Learning Foundations
Probability, Information Theory, and Statistics
Every time you train a neural network, you are doing statistics, even if the framework hides it from you. The loss function you minimize, the predictions you produce, and the uncertainty your model expresses are all probabilistic concepts. Understanding this foundation transforms loss functions from magic incantations into meaningful quantities you can reason about and debug.
Start with the question every learning algorithm is implicitly answering: given the data I've observed, what should I believe about the parameters of my model?
Bayes' Theorem
Bayes' theorem is a rule for updating beliefs in light of evidence:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
The left side, $P(\theta \mid D)$, is the posterior: what you believe about the parameters $\theta$ after seeing the data $D$. On the right, $P(D \mid \theta)$ is the likelihood, how probable the observed data is under a given parameter setting. $P(\theta)$ is the prior, your belief before seeing any data. $P(D)$ is a normalizing constant that ensures the posterior sums (or integrates) to one.
Bayes update: prior × likelihood ∝ posterior
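Concrete example: a spam classifier starts with prior $P(\text{spam}) = 0.1$. Observing "FREE!!!" has likelihood 0.8 in spam versus 0.01 in legitimate mail. After the Bayes update, the posterior jumps to about 0.90, since $0.1 \times 0.8 / (0.1 \times 0.8 + 0.9 \times 0.01) \approx 0.90$. Each additional observed word shifts the posterior further, multiplicatively compounding evidence (assuming the words are conditionally independent given the class).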
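The spam update can be checked numerically. The sketch below is a minimal illustration, not a real spam filter: it assumes a prior of 0.1 (the value consistent with the quoted likelihoods and the posterior of about 0.90), and the second word's likelihoods are made-up numbers used only to show how evidence compounds.

```python
def bayes_update(prior, lik_if_spam, lik_if_ham):
    """Return P(spam | word) from P(spam), P(word | spam), P(word | not spam)."""
    evidence = prior * lik_if_spam + (1.0 - prior) * lik_if_ham  # P(word)
    return prior * lik_if_spam / evidence

# Assumed prior P(spam) = 0.1; likelihoods 0.8 and 0.01 are the ones in the text.
posterior = bayes_update(prior=0.1, lik_if_spam=0.8, lik_if_ham=0.01)
print(round(posterior, 3))  # 0.899

# A second word with hypothetical likelihoods (0.3 vs 0.02). Yesterday's
# posterior becomes today's prior, assuming the words are conditionally
# independent given the class.
posterior_2 = bayes_update(prior=posterior, lik_if_spam=0.3, lik_if_ham=0.02)
print(posterior_2 > posterior)  # True
```

Reusing the previous posterior as the next prior is what "multiplicatively compounding evidence" means in practice.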
Maximum Likelihood Estimation (MLE)
In most deep learning contexts, the prior is ignored: the parameters are treated as fixed unknowns rather than random variables. This is Maximum Likelihood Estimation: find the parameters that maximize the probability of the observed data.
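Formally, for independent observations $x_1, \dots, x_N$ making up the data $D$:
$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(D \mid \theta) = \arg\max_{\theta} \prod_{i=1}^{N} P(x_i \mid \theta)$$
Because products of small probabilities underflow to zero numerically, we maximize the log-likelihood instead (log is monotone, so the argmax is the same):
$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{i=1}^{N} \log P(x_i \mid \theta)$$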
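Both claims, the underflow and the unchanged argmax, can be seen on a toy problem. The sketch below assumes a Bernoulli (coin-flip) model with made-up counts and finds the maximizer by a simple grid search. It is an illustration only, not how MLE is carried out for neural networks.

```python
import math

def likelihood(p, heads, tails):
    """Raw product of per-flip probabilities for a Bernoulli model."""
    return (p ** heads) * ((1 - p) ** tails)

def log_likelihood(p, heads, tails):
    """Sum of per-flip log-probabilities for the same model."""
    return heads * math.log(p) + tails * math.log(1 - p)

grid = [i / 100 for i in range(1, 100)]  # candidate parameters 0.01 .. 0.99

# Small hypothetical dataset (7 heads, 3 tails): both objectives agree.
best_by_product = max(grid, key=lambda p: likelihood(p, 7, 3))
best_by_log = max(grid, key=lambda p: log_likelihood(p, 7, 3))
print(best_by_product, best_by_log)  # 0.7 0.7

# Larger hypothetical dataset (7000 heads, 3000 tails): the raw product
# underflows to 0.0, so it can no longer rank candidates, while the
# log-likelihood stays finite and still finds the maximizer.
print(likelihood(0.7, 7000, 3000))  # 0.0
best_by_log_big = max(grid, key=lambda p: log_likelihood(p, 7000, 3000))
print(best_by_log_big)  # 0.7
```

The maximizer matches the sample frequency of heads, which is the closed-form MLE for this model.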
When you train a classifier with cross-entropy loss, you are doing MLE on a categorical distribution. The cross-entropy loss IS the negative log-likelihood. Minimizing it is exactly maximizing the probability that your model assigns to the correct labels.
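To make the identity concrete, the sketch below computes both quantities for a single example using only the standard library. The logits and the label are hypothetical values, and the softmax helper is written by hand for illustration rather than taken from any framework.

```python
import math

def softmax(logits):
    """Convert raw scores into a categorical distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs):
    """H(y, p) = -sum_k y_k * log(p_k)."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0]  # hypothetical model outputs for one example
label = 0                  # hypothetical correct class index

probs = softmax(logits)
one_hot = [1.0 if k == label else 0.0 for k in range(len(logits))]

ce = cross_entropy(one_hot, probs)
nll = -math.log(probs[label])  # negative log-likelihood of the observed label

print(abs(ce - nll) < 1e-12)  # True
```

With a one-hot target, every term of the cross-entropy sum vanishes except the one for the correct class, which is why the two numbers coincide.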
Maximum A Posteriori (MAP) Estimation
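MAP estimation puts the prior back in. Instead of maximizing just the likelihood, you maximize the posterior:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right]$$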
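The log-prior acts as a regularizer. If your prior is a Gaussian centered at zero, $P(\theta) = \mathcal{N}(\theta;\, 0, \sigma^2 I)$, then the log-prior term is $-\frac{1}{2\sigma^2}\|\theta\|_2^2$ plus a constant that does not depend on $\theta$. Maximizing the posterior corresponds to minimizing the negative log-likelihood plus an L2 penalty $\frac{\lambda}{2}\|\theta\|_2^2$ on the weights, with $\lambda = 1/\sigma^2$. This is exactly L2 regularization (weight decay).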
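The correspondence can be verified numerically. The sketch below assumes an isotropic Gaussian prior with standard deviation sigma and the penalty convention (lambda / 2) * ||theta||^2 with lambda = 1 / sigma^2. The parameter vectors are arbitrary made-up values. It checks that the negative log-prior and the L2 penalty differ only by a constant, which cannot change the argmax.

```python
import math

def neg_log_gaussian_prior(theta, sigma):
    """-log N(theta; 0, sigma^2 I), including the normalizing constant."""
    d = len(theta)
    sq_norm = sum(t * t for t in theta)
    return sq_norm / (2 * sigma ** 2) + 0.5 * d * math.log(2 * math.pi * sigma ** 2)

def l2_penalty(theta, lam):
    """(lam / 2) * ||theta||^2, the assumed penalty convention."""
    return 0.5 * lam * sum(t * t for t in theta)

sigma = 2.0              # assumed prior standard deviation
lam = 1.0 / sigma ** 2   # matching regularization strength

theta_a = [0.3, -1.2, 0.7]   # two arbitrary parameter vectors
theta_b = [2.0, 0.1, -0.5]

gap_a = neg_log_gaussian_prior(theta_a, sigma) - l2_penalty(theta_a, lam)
gap_b = neg_log_gaussian_prior(theta_b, sigma) - l2_penalty(theta_b, lam)

# The gap is the same for every theta, so both objectives share a minimizer.
print(abs(gap_a - gap_b) < 1e-12)  # True
```

Other conventions (for example a penalty written without the factor of one half, or a loss averaged over examples) rescale lambda by a constant but leave the argument unchanged.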
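In other words: when you use weight decay with plain gradient descent, you are implicitly choosing a Gaussian prior over the parameters. (With adaptive optimizers such as Adam, decoupled weight decay and an L2 penalty are no longer identical, so the correspondence is only approximate.)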
L2 regularization and MAP estimation with a Gaussian prior are the same thing. When you set weight_decay=0.01 in your optimizer, you are encoding a prior belief that the model's parameters should be close to zero. The regularization strength $\lambda$ is, up to a convention-dependent constant factor, the inverse variance of that prior. This is why heavier regularization produces 'simpler' models: the stronger the prior, the more the posterior is pulled toward zero.
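For plain gradient descent, the link between a weight-decay step and an L2-penalized loss can be written out directly. The sketch below is hand-written for illustration and does not reproduce any library's optimizer. The toy quadratic data loss, the targets, and the learning rate are assumptions, and the decay strength 0.01 is the value quoted above.

```python
targets = [1.0, -2.0, 0.5]  # hypothetical values the toy data loss pulls toward

def data_grad(theta):
    """Gradient of the toy data loss 0.5 * sum((theta_i - target_i)^2)."""
    return [t - y for t, y in zip(theta, targets)]

def step_with_weight_decay(theta, lr, weight_decay):
    """Gradient step on the data loss, plus shrinking each weight toward zero."""
    g = data_grad(theta)
    return [t - lr * gi - lr * weight_decay * t for t, gi in zip(theta, g)]

def step_with_l2_penalty(theta, lr, lam):
    """Gradient step on data loss + (lam / 2) * ||theta||^2."""
    g = data_grad(theta)
    return [t - lr * (gi + lam * t) for t, gi in zip(theta, g)]

theta = [0.5, -1.0, 2.0]  # arbitrary starting parameters
a = step_with_weight_decay(theta, lr=0.1, weight_decay=0.01)
b = step_with_l2_penalty(theta, lr=0.1, lam=0.01)

# The two updates coincide up to floating-point rounding.
print(all(abs(x - y) < 1e-12 for x, y in zip(a, b)))  # True
```

The equality relies on the update being a plain gradient step. Once per-parameter scaling enters the update, as in adaptive optimizers, the two formulations generally produce different steps.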