Probability, Information Theory, and Statistics

Topics Covered

Bayes' Theorem and Estimation

Bayes' Theorem

Maximum Likelihood Estimation (MLE)

Maximum A Posteriori (MAP) Estimation

Entropy and Information

Entropy as Surprise

Why Entropy Measures Uncertainty

Attention Entropy and Entropy Regularization

Entropy of Model Predictions

Cross-Entropy and KL Divergence

Cross-Entropy: Derivation from MLE

The Formula

KL Divergence: The Distance Between Distributions

KL Divergence is Asymmetric

Mutual Information

Softmax and Temperature

Softmax: Normalized Exponentials

Numerical Stability: The Log-Sum-Exp Trick

Temperature Scaling

Temperature in Language Model Generation

Top-p (Nucleus) Sampling

Practice

Every time you train a neural network, you are doing statistics, even if the framework hides it from you. The loss function you minimize, the predictions you produce, and the uncertainty your model expresses are all probabilistic concepts. Understanding this foundation transforms loss functions from magic incantations into meaningful quantities you can reason about and debug.

Start with the question every learning algorithm is implicitly answering: given the data I've observed, what should I believe about the parameters of my model?

Bayes' Theorem

Bayes' theorem is a rule for updating beliefs in light of evidence:

$$P(\theta \mid \text{data}) = \dfrac{P(\text{data} \mid \theta)\, P(\theta)}{P(\text{data})}$$

The left side is the posterior: what you believe about the parameters after seeing the data. On the right, $P(\text{data} \mid \theta)$ is the likelihood: how probable the observed data is under a given parameter setting. $P(\theta)$ is the prior: your belief before seeing any data. $P(\text{data})$ is a normalizing constant that ensures the posterior sums (or, for continuous parameters, integrates) to one.

Figure: Bayes update, prior × likelihood ∝ posterior. (Plot of density against θ with three curves: Prior, Likelihood, Posterior.)
Prior belief is wide; the likelihood concentrates probability where data is consistent; the posterior is narrower than both.

Concrete example: a spam classifier starts with prior $P(\text{spam}) = 0.1$. Observing "FREE!!!" has likelihood 0.8 in spam versus 0.01 in legitimate mail. After the Bayes update, the posterior $P(\text{spam} \mid \text{FREE!!!})$ jumps to about 0.90, since $0.8 \times 0.1 / (0.8 \times 0.1 + 0.01 \times 0.9) \approx 0.90$. Each additional observed word shifts the posterior further; if words are treated as conditionally independent given the class, the evidence compounds multiplicatively.
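The spam example can be checked with a few lines of Python. This is a minimal sketch: the function name `posterior_spam` is made up for illustration, and the second word and its likelihoods (0.3 and 0.02) are hypothetical numbers chosen only to show how evidence compounds when words are assumed conditionally independent given the class.

```python
def posterior_spam(prior_spam, p_word_given_spam, p_word_given_ham):
    """Posterior P(spam | word) for a two-class problem via Bayes' theorem."""
    # P(word) = P(word | spam) P(spam) + P(word | ham) P(ham)
    evidence = (p_word_given_spam * prior_spam
                + p_word_given_ham * (1.0 - prior_spam))
    return p_word_given_spam * prior_spam / evidence


# Numbers from the text: prior 0.1, likelihoods 0.8 (spam) and 0.01 (legitimate).
p_after_free = posterior_spam(0.1, 0.8, 0.01)
print(f"P(spam | 'FREE!!!') = {p_after_free:.3f}")  # 0.899

# Hypothetical second word: yesterday's posterior becomes today's prior.
p_after_second = posterior_spam(p_after_free, 0.3, 0.02)
print(f"After a second spammy word: {p_after_second:.3f}")
```

Feeding the first posterior back in as the prior for the second word is what "compounding evidence" means in practice.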

Maximum Likelihood Estimation (MLE)

In most deep learning contexts the prior is ignored, and the parameters are treated as fixed unknowns rather than random variables. This is Maximum Likelihood Estimation: find the parameters that maximize the probability of the observed data.

Formally:

$$\theta_{MLE} = \arg\max_{\theta} P(\text{data} \mid \theta)$$

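For independent observations $x_i$, the likelihood factorizes into a product of per-example probabilities. Because products of many small probabilities underflow to zero numerically, we maximize the log-likelihood instead (log is monotone, so the argmax is the same):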

$$\theta_{MLE} = \arg\max_{\theta} \sum_i \log P(x_i \mid \theta)$$
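To see both the underflow problem and the log-likelihood fix, here is a small NumPy sketch. It assumes a toy Bernoulli (coin-flip) model with simulated data; the sample size, the true parameter 0.3, and the grid search are illustrative choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data (an assumption for illustration): 5000 flips, true P(heads) = 0.3.
x = rng.random(5000) < 0.3


def log_likelihood(theta, x):
    """Sum over i of log P(x_i | theta) for i.i.d. Bernoulli observations."""
    heads = x.sum()
    tails = x.size - heads
    return heads * np.log(theta) + tails * np.log1p(-theta)


theta = 0.3
# The raw product of 5000 probabilities underflows to exactly 0.0 in float64 ...
raw_product = np.prod(np.where(x, theta, 1.0 - theta))
# ... while the sum of logs stays a perfectly ordinary finite number.
ll = log_likelihood(theta, x)
print(raw_product, ll)

# Maximize the log-likelihood over a grid of candidate parameters.
grid = np.linspace(0.001, 0.999, 999)
theta_mle = grid[np.argmax(log_likelihood(grid, x))]
print(theta_mle, x.mean())
```

For a Bernoulli model the maximum-likelihood estimate is the sample mean, so the grid maximizer should land within one grid step of `x.mean()`.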

When you train a classifier with cross-entropy loss, you are doing MLE on a categorical distribution. The cross-entropy loss IS the negative log-likelihood. Minimizing it is exactly maximizing the probability that your model assigns to the correct labels.
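The identity between cross-entropy and negative log-likelihood can be verified numerically. The sketch below uses NumPy with made-up logits and labels; `log_softmax` is a small helper written here for illustration, not a library call.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log of the softmax, shifted by the row max for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


# Hypothetical model outputs for 3 examples and 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.0, 1.0, 1.0]])
labels = np.array([0, 2, 1])

log_probs = log_softmax(logits)

# Negative log-likelihood: minus the mean log-probability of the correct label.
nll = -log_probs[np.arange(len(labels)), labels].mean()

# Cross-entropy against one-hot target distributions.
one_hot = np.eye(logits.shape[1])[labels]
cross_entropy = -(one_hot * log_probs).sum(axis=1).mean()

print(nll, cross_entropy)  # the two numbers agree
```

Because the one-hot target puts all its mass on the correct class, the cross-entropy sum collapses to the single log-probability that the negative log-likelihood picks out.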

Maximum A Posteriori (MAP) Estimation

MAP estimation puts the prior back in. Instead of maximizing just the likelihood, you maximize the posterior:

$$\theta_{MAP} = \arg\max_{\theta} \left[\log P(\text{data} \mid \theta) + \log P(\theta)\right]$$

The log-prior acts as a regularizer. If your prior is a Gaussian centered at zero, $P(\theta) \propto \exp(-\lambda \|\theta\|^2)$, then the log-prior term is $-\lambda \|\theta\|^2$ (up to an additive constant that does not affect the argmax). Maximizing the posterior corresponds to minimizing the negative log-likelihood plus an $L_2$ penalty on the weights. This is exactly $L_2$ regularization (weight decay).

In other words: every time you use weight decay in your optimizer, you are implicitly choosing a Gaussian prior over the parameters.
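The equivalence can be made concrete on a model where everything has a closed form. The sketch below assumes linear regression with unit-variance Gaussian noise and the prior $P(\theta) \propto \exp(-\lambda \|\theta\|^2)$ from above; the data, the weights `true_w`, and the value `lam = 5.0` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.5, 3.0])   # hypothetical ground truth
y = X @ true_w + rng.normal(size=50)            # unit-variance Gaussian noise

lam = 5.0  # prior strength, as in P(w) proportional to exp(-lam * ||w||^2)


def neg_log_posterior(w):
    """Negative log-likelihood plus negative log-prior, constants dropped."""
    nll = 0.5 * np.sum((X @ w - y) ** 2)
    neg_log_prior = lam * np.sum(w ** 2)
    return nll + neg_log_prior


# MLE: ordinary least squares (no prior).
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP: setting the gradient X^T (Xw - y) + 2*lam*w to zero gives ridge regression.
w_map = np.linalg.solve(X.T @ X + 2 * lam * np.eye(X.shape[1]), X.T @ y)

# Sanity check: the gradient of the negative log-posterior vanishes at w_map.
grad_at_map = X.T @ (X @ w_map - y) + 2 * lam * w_map

print(np.linalg.norm(w_mle), np.linalg.norm(w_map))
```

The MAP weights have a smaller norm than the MLE weights: the Gaussian prior pulls the solution toward zero, which is the shrinkage an $L_2$ penalty produces.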

Key Insight

L2 regularization and MAP estimation with a Gaussian prior are the same thing. When you set weight_decay=0.01 in your optimizer, you are encoding a prior belief that the model's parameters should be close to zero. The regularization strength $\lambda$ is proportional to the inverse variance of that prior. This is why heavier regularization produces 'simpler' models: the stronger the prior, the more the posterior is pulled toward zero.
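A short derivation makes the link between $\lambda$ and the prior variance explicit. It assumes an isotropic Gaussian prior with variance $\sigma^2$ per parameter; the symbol $\sigma^2$ is introduced here for illustration and does not appear in the text above.

$$
P(\theta) = \mathcal{N}(\theta;\, 0,\, \sigma^2 I) \propto \exp\!\left(-\frac{\|\theta\|^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
\log P(\theta) = -\frac{\|\theta\|^2}{2\sigma^2} + \text{const}
$$

Matching this against the log-prior term $-\lambda \|\theta\|^2$ gives

$$
\lambda = \frac{1}{2\sigma^2}
$$

A narrow prior (small $\sigma^2$) therefore means a large $\lambda$ and a strong pull toward zero. How a given optimizer's weight_decay value maps onto $\lambda$ depends on conventions such as whether the loss is summed or averaged over examples, and the exact correspondence with an $L_2$ penalty holds for plain gradient descent rather than for every optimizer, so the numeric value is best read as setting the prior strength up to a scale factor.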