Practical Training Decisions

Topics Covered

Choosing Loss Functions

Classification: Cross-Entropy Is Not Optional

Class Imbalance: Focal Loss

Ranking and Retrieval: Triplet and Contrastive

Regression: MSE, L1, or Huber?

Why "What to Optimize" Is Half the Battle

A Quick Decision Table

Cross-reference

Hyperparameter Tuning Strategies

The Knob Ranking (From 10 Years of Watching People Train Models)

Karpathy's Learning Rate Finder

Random Search Beats Grid Search (The Bergstra Result)

Bayesian Optimization: When to Bother

Linear Scaling Rule for Batch Size

The Practical Tuning Workflow

When Training Fails

The Loss Curve: Reading the Patterns

Gradient Norms: The Second Most Important Signal

Weight Norms: The Slower Signal

NaN Debugging: A Recipe

Common Failure Modes and Their Fixes

What to Log From Day One

Debugging with Deliberate Overfitting

Why This Works

The Procedure

What It Catches

The Scaling Rule

A Real Bug I Caught in 30 Seconds

When 100 Examples Is Not Enough

The Bigger Principle

Cross-reference

Half of the training bugs I have ever debugged were actually loss function bugs. The model architecture was fine, the data was fine, the optimizer was fine, but the thing being optimized was not what the engineer actually wanted. Pick the wrong loss and a perfectly designed model will confidently converge to something useless.

Andrej Karpathy's "A Recipe for Training Neural Networks" opens with the warning that "neural net training is a leaky abstraction." What he means is that model.fit(x, y) hides an enormous number of decisions, and the most important of those decisions is: what number are we actually minimizing? Your gradients do not know what you want. They only know what the loss tells them.

This section is a tour of the loss functions that matter for the majority of real-world problems, with hard rules for when to reach for each one.

Classification: Cross-Entropy Is Not Optional

For classification problems, use cross-entropy. Not mean squared error on one-hot labels, not hinge loss, not something clever you read about on Twitter. Cross-entropy.

For a true class y and predicted probabilities p, cross-entropy is $\mathcal{L}_{CE} = -\log p_y$. When the model assigns high probability to the correct class, the loss is small. When it assigns vanishingly small probability to the correct class, the loss blows up toward infinity, producing a large gradient that rapidly pulls the prediction in the right direction.

The reason cross-entropy is the right choice is subtle but important. If you use MSE on one-hot labels, the gradient with respect to the logits picks up the derivative of the output nonlinearity (p(1 - p) for a sigmoid), and that factor goes to zero as the output saturates. A model that is confidently wrong therefore gets a tiny learning signal, exactly the opposite of what you want. Cross-entropy's gradient with respect to the logits is simply p - y, which stays large precisely when the model is very wrong. The gradient scales with the size of the error in a way MSE's does not.

[Figure: Cross-entropy vs MSE: gradient signal when the model is wrong. x-axis: predicted probability of true class (p); y-axis: |gradient| at logit. Curves: cross-entropy, |1 − p|; MSE on sigmoid, 2|p − 1|·p(1 − p).]
Plot of |∂L/∂logit| vs predicted probability of the true class. Cross-entropy gives the largest gradient exactly where the model is most wrong (p → 0). MSE on a sigmoid output saturates: a confidently-wrong model gets almost no learning signal. This is why classification needs cross-entropy, not MSE.
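The two gradient magnitudes compared here can be checked with a few lines of arithmetic. The sketch below assumes a single sigmoid output and a true label of 1, so the gradients at the logit are |p − 1| for cross-entropy and 2|p − 1|·p(1 − p) for squared error. The function names are illustrative, not from any library.

```python
def ce_grad_magnitude(p: float) -> float:
    """|dL/dz| for L = -log(p), where p = sigmoid(z) and the true label is 1."""
    return abs(p - 1.0)


def mse_grad_magnitude(p: float) -> float:
    """|dL/dz| for L = (p - 1)^2, where p = sigmoid(z) and the true label is 1.

    The chain rule contributes the sigmoid derivative p * (1 - p),
    which vanishes as p approaches 0 or 1.
    """
    return 2.0 * abs(p - 1.0) * p * (1.0 - p)


# Compare the two signals from "confidently wrong" to "mostly right".
for p in (0.001, 0.01, 0.1, 0.5, 0.9):
    print(f"p={p:<6} CE={ce_grad_magnitude(p):.4f}  MSE={mse_grad_magnitude(p):.4f}")
```

At small p the cross-entropy column stays close to 1 while the squared-error column collapses toward 0, which is the saturation effect described above.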

In PyTorch, always use F.cross_entropy(logits, targets), which takes raw logits and computes log-softmax + NLL in a numerically stable way. Never apply softmax yourself, take the log, and pass the result into F.nll_loss: small probabilities lose precision or underflow to zero before the log is taken. This is a real bug I have seen ship to production: a team computed softmax probabilities for logging, then reused those probabilities for the loss, and their 70B model trained with 4-5 bits of effective precision in the loss computation.
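A small experiment makes the precision problem visible. The logits below are made up and chosen so that the true class gets a probability too small to represent in float32. The sketch compares F.cross_entropy on raw logits against the softmax-then-log route.

```python
import torch
import torch.nn.functional as F

# One example, two classes (illustrative values): the true class is 0,
# but the model puts almost all of its mass on class 1.
logits = torch.tensor([[0.0, 200.0]])
targets = torch.tensor([0])

# Stable path: log-softmax and NLL are computed together from raw logits.
stable = F.cross_entropy(logits, targets)

# Fragile path: the softmax probability of class 0 underflows in float32,
# so taking its log afterwards cannot recover the true loss value.
probs = F.softmax(logits, dim=-1)
naive = F.nll_loss(torch.log(probs), targets)

print("stable:", stable.item())
print("naive: ", naive.item())
```

The stable path reports a finite loss near 200, while the naive path has already lost the information by the time the log is applied.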

Class Imbalance: Focal Loss

Cross-entropy handles balanced classification problems beautifully. But when 99% of your examples are class 0 and 1% are class 1 (think fraud detection, medical imaging, object detection backgrounds), cross-entropy runs into a specific failure mode: the model learns to predict the majority class for everything, reaches 99% accuracy, and the gradient signal from the remaining 1% of minority examples gets drowned out.

Focal loss, introduced in the RetinaNet paper by Lin et al. at FAIR, fixes this by explicitly down-weighting easy examples so that the gradient focuses on the hard ones. The formula multiplies cross-entropy by a factor that shrinks toward zero as the model becomes confident on an example.

$$\text{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)$$
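```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Per-example cross-entropy, i.e. -log(p_t) for the true class.
    ce = F.cross_entropy(logits, targets, reduction='none')
    # Recover p_t, the predicted probability of the true class.
    p_t = torch.exp(-ce)
    # Down-weight confident examples by (1 - p_t)^gamma, then average.
    return ((1 - p_t) ** gamma * ce).mean()
```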
Focal loss down-weights well-classified examples so gradients concentrate on hard cases. Use it when class imbalance is severe (1:100 or worse) and standard reweighting does not converge.

The key parameter is gamma, the focusing strength. At gamma=0, focal loss reduces to regular cross-entropy. At gamma=2 (the RetinaNet default), an example with 90% confidence gets its loss weight multiplied by (1 - 0.9)^2 = 0.01, effectively removed from the gradient. Examples where the model is wrong (low confidence on the correct class) keep nearly their full weight, because (1 - p_t)^gamma approaches 1 as p_t approaches 0.
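The effect of gamma is easy to tabulate. This is a minimal sketch of the modulating factor alone; focal_weight is a hypothetical helper name, and the probabilities are arbitrary sample points.

```python
def focal_weight(p_t: float, gamma: float = 2.0) -> float:
    """Modulating factor (1 - p_t)^gamma that multiplies the cross-entropy term."""
    return (1.0 - p_t) ** gamma


# From a badly misclassified example (p_t = 0.1) to a very easy one (p_t = 0.99).
for p_t in (0.1, 0.5, 0.9, 0.99):
    print(f"p_t={p_t:<5} gamma=0: {focal_weight(p_t, 0.0):.4f}  gamma=2: {focal_weight(p_t, 2.0):.4f}")
```

With gamma=0 every example keeps weight 1, which is plain cross-entropy. With gamma=2 the weight falls off quadratically as the model grows confident.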

Focal loss is the right tool for dense prediction tasks like object detection, where every image has thousands of negative "no object" locations and maybe five positive "object here" locations. It is also useful for fraud detection and rare-event classification. It is not a drop-in replacement for cross-entropy on balanced problems, where it tends to hurt calibration.

Ranking and Retrieval: Triplet and Contrastive

When the goal is not "predict the right label" but "embed similar things near each other and dissimilar things far apart," cross-entropy is the wrong framing entirely. These are ranking and retrieval tasks, and they need metric losses.

Triplet loss takes three examples at a time: an anchor a, a positive p (something the anchor should be close to), and a negative n (something the anchor should be far from). The loss pushes the anchor toward the positive and away from the negative until the distance to the negative exceeds the distance to the positive by at least a margin.

$$\mathcal{L}_{triplet} = \max(0, d(a, p) - d(a, n) + \text{margin})$$

When the negative is already sufficiently far (d(a,n) > d(a,p) + margin), the loss is zero and that triplet contributes no gradient. This is a feature: it means training focuses on hard triplets where the model has not yet separated positive and negative. FaceNet built its face recognition embeddings on this loss because it directly optimizes the thing you care about at inference time: the relative distance between embeddings. Later systems in the ArcFace lineage pursue the same goal with margin-based softmax losses rather than explicit triplets.
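The formula translates almost directly into PyTorch. The sketch below assumes Euclidean distance for d and a margin of 0.2; both are illustrative choices, and the toy embeddings are made up to show one easy and one hard negative.

```python
import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin), averaged over the batch."""
    d_ap = torch.linalg.norm(anchor - positive, dim=1)  # distance to positive
    d_an = torch.linalg.norm(anchor - negative, dim=1)  # distance to negative
    # relu implements the max(0, .) hinge: satisfied triplets contribute zero.
    return F.relu(d_ap - d_an + margin).mean()


anchor = torch.tensor([[0.0, 0.0]])
positive = torch.tensor([[1.0, 0.0]])
easy_negative = torch.tensor([[5.0, 0.0]])  # already far beyond the margin
hard_negative = torch.tensor([[1.1, 0.0]])  # barely farther than the positive

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(easy.item(), hard.item())
```

The easy triplet produces zero loss and therefore no gradient, while the hard triplet is still penalized. This is the focusing behavior described above.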

Contrastive loss (InfoNCE, used in CLIP and SimCLR) generalizes this to many negatives at once. Instead of one positive and one negative, you contrast the positive against a batch of negatives and apply cross-entropy over the similarity scores. This is the loss behind most modern retrieval models, including dense passage retrieval and the two-tower architectures used in search.
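A minimal in-batch version of this idea fits in a few lines. This is a one-directional sketch under stated assumptions: cosine similarity, a temperature of 0.07, and row i of keys as the positive for row i of queries. CLIP-style training typically averages this loss over both directions, which is omitted here.

```python
import torch
import torch.nn.functional as F


def info_nce(queries, keys, temperature=0.07):
    """In-batch contrastive loss: keys[i] is the positive for queries[i];
    every other row of keys serves as a negative."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.T / temperature  # (batch, batch) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)


torch.manual_seed(0)
x = torch.randn(8, 16)
aligned = info_nce(x, x)                      # each query matches its own key
misaligned = info_nce(x, x.roll(1, dims=0))   # positives shifted off the diagonal
print(aligned.item(), misaligned.item())
```

The loss is low when each query is closest to its own key and high when the matching key sits in a different row.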

The rule: if your problem is "find similar items" or "rank by relevance," you need a metric loss. Cross-entropy on classification labels will not give you embeddings that respect distance the way you expect.

Regression: MSE, L1, or Huber?

For continuous-target regression, the choice between MSE (L2), MAE (L1), and Huber depends on what your outliers mean.

MSE penalizes errors quadratically. A prediction that is off by 10 contributes 100 to the loss; one that is off by 1 contributes 1. This makes MSE extremely sensitive to outliers. One bad label or one genuine tail-of-the-distribution example can dominate training. MSE is appropriate when your errors are Gaussian and your labels are clean, which is rarely the case in the real world.

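L1 penalizes errors linearly. A prediction off by 10 contributes 10, off by 1 contributes 1. This is much more robust to outliers, but the gradient has constant magnitude and flips sign at zero, where the loss is not differentiable. The update size does not shrink as the error shrinks, so the optimizer tends to bounce around the optimum, which can cause optimization pain near convergence and gives you a slow final approach.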

Huber loss (smooth L1) is quadratic like MSE near zero and linear like L1 for large errors. It has the robustness of L1 far from the optimum and the smooth convergence of MSE near it. This is the right default for most regression problems with noisy data, and it is what the Fast R-CNN and Faster R-CNN object detectors use for bounding box regression. If you do not know which to pick, pick Huber.
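The three penalties can be compared side by side on a single error value. The sketch below uses the common Huber definition with a threshold delta (assumed here to be 1.0) and a factor of 1/2 on the quadratic branch, so near zero it is half the squared error rather than equal to it.

```python
def squared_error(e: float) -> float:
    """MSE-style penalty: grows quadratically with the error."""
    return e * e


def absolute_error(e: float) -> float:
    """L1-style penalty: grows linearly with the error."""
    return abs(e)


def huber(e: float, delta: float = 1.0) -> float:
    """Quadratic for |e| <= delta, linear beyond it; the branches meet smoothly."""
    a = abs(e)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


# A small error, the threshold itself, and an outlier.
for e in (0.5, 1.0, 10.0):
    print(f"error={e:<5} squared={squared_error(e):<7} absolute={absolute_error(e):<5} huber={huber(e)}")
```

For the outlier (error 10) the squared penalty is 100 while Huber stays close to the L1 value, so a single bad label has far less influence on the total loss.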

Why "What to Optimize" Is Half the Battle

The hardest lesson I ever learned about training: you will not accidentally optimize the right thing. The loss you write down is the thing the model will become. If your loss does not capture what you want, no amount of architecture engineering or data augmentation will fix it.

A concrete example. A team at a recommendation startup I consulted for was training a model to rank items for users. They used binary cross-entropy on click labels and spent months tuning the architecture. Their model was great at predicting clicks. But the business cared about purchases, not clicks, and clicks are a biased proxy dominated by clickbait. They switched the loss to a weighted combination of click and purchase signals, and the same architecture beat every previous version by a huge margin. The problem was never the model. It was the loss.

Key Insight

Before you touch architecture or hyperparameters, write down what you actually want to optimize. Not what is easy to compute, not what the library provides, but what you would measure to decide if the model is good. If your loss does not match that, fix the loss first. Every other change is rearranging deck chairs.

A Quick Decision Table

| Problem type | Use | Don't use |
| --- | --- | --- |
| Balanced classification | Cross-entropy | MSE, hinge |
| Imbalanced classification (1:100+) | Focal loss or weighted CE | Plain CE |
| Ranking / retrieval / embeddings | Triplet, contrastive (InfoNCE) | Classification CE |
| Clean regression, Gaussian noise | MSE | - |
| Noisy regression with outliers | Huber (smooth L1) | MSE |
| Object detection box regression | Huber or GIoU loss | MSE |
| Multi-label classification | Binary cross-entropy per label | Softmax CE |
| Sequence-to-sequence (translation) | Cross-entropy with label smoothing | Plain CE |

Label smoothing deserves a note. For classification over large vocabularies (language modeling, translation) the one-hot target is too confident and the model learns to make overconfident predictions that hurt calibration. Smoothing the target to (1 - eps) for the true class and eps / (V - 1) for others, with eps = 0.1, is a reliable win in these settings. It was part of the original Transformer training recipe and remains a common default for translation models for a reason.
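To make the smoothed target concrete, here is a minimal sketch that builds it exactly as described: (1 - eps) on the true class and eps / (V - 1) on each other class. The helper names are hypothetical. Library implementations may split the mass slightly differently: PyTorch's label_smoothing argument to F.cross_entropy, for example, mixes the one-hot target with a uniform distribution over all V classes, so treat this as an illustration of the idea rather than a reproduction of any particular API.

```python
import math


def smoothed_target(true_class: int, num_classes: int, eps: float = 0.1) -> list:
    """Target distribution with (1 - eps) on the true class and the
    remaining eps spread evenly over the other classes."""
    off_value = eps / (num_classes - 1)
    target = [off_value] * num_classes
    target[true_class] = 1.0 - eps
    return target


def soft_cross_entropy(probs: list, target: list) -> float:
    """Cross-entropy between a soft target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


target = smoothed_target(true_class=2, num_classes=5)
print(target)

# Illustrative predictions: a fairly confident model vs. a near-one-hot one.
confident = [0.02, 0.02, 0.92, 0.02, 0.02]
overconfident = [1e-6, 1e-6, 1 - 4e-6, 1e-6, 1e-6]
print(soft_cross_entropy(confident, target), soft_cross_entropy(overconfident, target))
```

Under the smoothed target, the near-one-hot prediction is penalized more than the merely confident one, which is the pressure against overconfidence that smoothing is meant to provide.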

Cross-reference

The cross-entropy gradient is one reason Adam-style optimizers work so well for classification. See the optimization lesson for why the gradient shape of your loss interacts with your optimizer choice.