BERT HuggingFace gives NaN `Loss`
Introduction
A NaN loss during BERT training is almost never “random.” It usually means one part of the training pipeline has already gone numerically invalid: the labels are wrong, the learning rate is too aggressive, mixed precision overflow is not handled, or the batch contains values the loss function cannot interpret.
Check the Dataset Before Blaming the Model
With Hugging Face models, the model code is usually the least suspicious part. Start by validating the tensors going into the forward pass.
For sequence classification, label values must be in the expected range. If the model has num_labels=3, then valid labels are 0, 1, and 2. For token classification, masked positions are typically -100, while valid class IDs must still stay within range.
If labels are floats for a classification problem, or if they contain stray values from preprocessing, the loss can become invalid quickly.
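The label rules above can be checked before training starts. The following is a minimal sketch, assuming PyTorch tensors and the usual -100 ignore index; the helper name find_label_problems is hypothetical and not part of any library.

```python
import torch

IGNORE_INDEX = -100  # positions the loss should skip (token classification)


def find_label_problems(labels: torch.Tensor, num_labels: int) -> list[str]:
    """Return human-readable problems; an empty list means the labels look valid."""
    problems = []
    if labels.is_floating_point():
        problems.append(f"labels have dtype {labels.dtype}; class IDs should be integers")
        return problems
    active = labels[labels != IGNORE_INDEX]  # drop masked positions
    if active.numel() == 0:
        problems.append("every label is masked, so a mean loss averages over zero items")
        return problems
    low, high = int(active.min()), int(active.max())
    if low < 0 or high >= num_labels:
        problems.append(f"label range [{low}, {high}] falls outside [0, {num_labels - 1}]")
    return problems


print(find_label_problems(torch.tensor([0, 2, 1]), num_labels=3))       # valid sequence labels
print(find_label_problems(torch.tensor([0, 3, 1]), num_labels=3))       # 3 is out of range
print(find_label_problems(torch.tensor([[1, -100, 0]]), num_labels=2))  # valid token labels
```

Running a check like this over a few batches from the training dataloader is usually cheaper than waiting for the loss to fail.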
A second useful check is to inspect attention masks and padding. A broken collator can produce malformed batches that look structurally correct but create nonsense activations later.
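A structural check on the collated batch might look like the sketch below. It assumes the batch is a dict with 2-D input_ids and attention_mask tensors, as Hugging Face collators typically produce; find_batch_problems is a hypothetical helper, and the vocabulary size of 30522 assumes bert-base-uncased.

```python
import torch


def find_batch_problems(batch: dict, vocab_size: int) -> list[str]:
    """Return structural problems found in one collated batch."""
    problems = []
    input_ids = batch["input_ids"]
    mask = batch["attention_mask"]
    if input_ids.shape != mask.shape:
        problems.append(
            f"input_ids {tuple(input_ids.shape)} and attention_mask "
            f"{tuple(mask.shape)} differ in shape"
        )
        return problems
    if not bool(((mask == 0) | (mask == 1)).all()):
        problems.append("attention_mask contains values other than 0 and 1")
    if bool((mask.sum(dim=1) == 0).any()):
        problems.append("at least one sequence is fully masked")
    if int(input_ids.min()) < 0 or int(input_ids.max()) >= vocab_size:
        problems.append("input_ids contains IDs outside the vocabulary")
    return problems


good = {
    "input_ids": torch.tensor([[101, 7592, 102, 0]]),
    "attention_mask": torch.tensor([[1, 1, 1, 0]]),
}
fully_masked = {
    "input_ids": torch.tensor([[101, 7592, 102, 0]]),
    "attention_mask": torch.tensor([[0, 0, 0, 0]]),
}
print(find_batch_problems(good, vocab_size=30522))
print(find_batch_problems(fully_masked, vocab_size=30522))
```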
Use a Conservative Training Configuration First
Many NaN issues come from trying to train too aggressively before the baseline run is stable. Start with a small learning rate and standard settings, then speed up only after the loss curve behaves normally.
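A conservative baseline could be expressed as the keyword arguments below. This is a sketch rather than a tuned recipe: the output directory, batch size, epoch count, and logging interval are illustrative assumptions, while the argument names are standard TrainingArguments parameters.

```python
# Keyword arguments for transformers.TrainingArguments.
# The values are illustrative starting points, not tuned recommendations.
conservative_args = {
    "output_dir": "bert-debug-run",       # hypothetical output directory
    "learning_rate": 2e-5,                # low learning rate
    "max_grad_norm": 1.0,                 # gradient clipping threshold
    "fp16": False,                        # stay in full precision until the run is stable
    "per_device_train_batch_size": 16,
    "num_train_epochs": 1,
    "logging_steps": 10,                  # log often so a bad step shows up early
}

# With transformers installed, the dict can be unpacked directly:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**conservative_args)
```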
The important stabilizers for a first run are:
- a low learning rate such as 2e-5
- gradient clipping through max_grad_norm
- disabling mixed precision until the run is known to be stable
If training works in full precision and breaks only with fp16=True, the issue is probably overflow rather than bad labels.
Detect the First Bad Batch
Do not wait until the end of an epoch to investigate. Add checks around the forward pass so you can find the first invalid batch.
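One way to add such checks is sketched below. It uses a toy linear layer in place of a BERT classifier so that it runs anywhere; first_bad_batch is a hypothetical helper, and in a real loop the same two checks would sit just before optimizer.step().

```python
import torch
from torch import nn


def first_bad_batch(model, loss_fn, batches):
    """Return (index, reason) for the first batch with a non-finite loss or gradient, else None."""
    for index, (inputs, targets) in enumerate(batches):
        model.zero_grad()
        loss = loss_fn(model(inputs), targets)
        if not torch.isfinite(loss):
            return index, "non-finite loss"
        loss.backward()
        for name, param in model.named_parameters():
            if param.grad is not None and not torch.isfinite(param.grad).all():
                return index, f"non-finite gradient in {name}"
    return None


# Toy stand-in for a classifier; the second batch deliberately contains an inf input.
torch.manual_seed(0)
model = nn.Linear(4, 3)
batches = [
    (torch.randn(2, 4), torch.tensor([0, 2])),
    (torch.tensor([[float("inf"), 0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0]]), torch.tensor([1, 0])),
]
print(first_bad_batch(model, nn.CrossEntropyLoss(), batches))
```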
This turns a vague training failure into a specific reproducible batch. Once you have that batch, inspect token IDs, labels, sequence lengths, and preprocessing assumptions.
Mixed Precision and Gradient Scaling
Half precision speeds up training, but it narrows the numeric range. Large activations or unstable gradients can overflow to infinity, which then propagates into NaN values.
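The overflow path can be reproduced in a few lines. This small illustration assumes PyTorch and runs on CPU; the value 70000 is arbitrary, chosen only because it exceeds the float16 maximum.

```python
import torch

fp16_max = torch.finfo(torch.float16).max  # largest finite float16 value (65504)
value = torch.tensor([70000.0])            # fits easily in float32

as_half = value.to(torch.float16)          # exceeds the float16 range, becomes inf
difference = as_half - as_half             # inf - inf is undefined, giving nan

print(fp16_max, as_half, difference)
```

Once a single inf appears in an activation or gradient, ordinary arithmetic such as subtraction or normalization is enough to turn it into NaN.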
If you need mixed precision, introduce it only after a stable full-precision run. When using the Hugging Face Trainer, rely on the built-in AMP support rather than custom half-precision code glued onto the loop. If you wrote the loop yourself, use torch.autocast (torch.cuda.amp.autocast in older PyTorch releases) and a GradScaler instead of manually casting tensors.
A lot of “BERT gives NaN loss” reports come down to turning on fp16 too early and then debugging the wrong layer.
Watch the Optimizer and Initialization Choices
Using the wrong optimizer settings can also destabilize fine-tuning. BERT fine-tuning is usually done with AdamW, low learning rates, and warmup for larger runs. If you replaced the classifier head or added custom layers, confirm those additions are initialized sensibly and match the loss function.
For example:
- use CrossEntropyLoss for integer class IDs
- use BCEWithLogitsLoss for multi-label targets
- do not feed one-hot vectors into the wrong loss by accident
Loss-function mismatch is one of the quickest ways to get useless gradients.
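The pairing of loss function and target format can be illustrated as follows. This is a small sketch with made-up logits and targets, assuming three classes and a batch of two.

```python
import torch
from torch import nn

logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1, 0.3, -0.2]])  # batch of 2, 3 classes

# Single-label classification: targets are integer class IDs, shape (batch,).
class_ids = torch.tensor([0, 1])
single_label_loss = nn.CrossEntropyLoss()(logits, class_ids)

# Multi-label classification: targets are float multi-hot vectors, shape (batch, classes).
multi_hot = torch.tensor([[1.0, 0.0, 1.0],
                          [0.0, 1.0, 0.0]])
multi_label_loss = nn.BCEWithLogitsLoss()(logits, multi_hot)

print(single_label_loss.item(), multi_label_loss.item())
```

Note that recent PyTorch versions also let CrossEntropyLoss accept float targets, which it interprets as class probabilities, so a mismatched target may run without raising an error. Checking target dtype and shape explicitly is safer than relying on an exception.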
A Practical Debugging Order
A good sequence is:
- run one small batch on CPU or full precision GPU
- verify labels and shapes
- lower the learning rate
- disable fp16
- add finite-value checks on loss and gradients
- inspect the first bad batch instead of scanning the full dataset blindly
That order usually gets you to the real cause faster than changing many hyperparameters at once.
Common Pitfalls
The most common mistake is assuming NaN means the model architecture is broken. In practice, bad inputs and unstable training settings are far more common.
Another frequent issue is mixing task types. Sequence classification, token classification, regression, and multi-label classification do not use the same target format.
Developers also turn on mixed precision before confirming the baseline run is stable. That hides simpler issues behind overflow symptoms.
Finally, avoid debugging only from aggregate logs. You need the exact batch that first produced a non-finite loss.
Summary
- NaN loss usually comes from invalid inputs or unstable training settings, not from BERT itself.
- Check label ranges, tensor shapes, and collator output first.
- Start with a conservative learning rate and full precision.
- Add finite-value checks so you can isolate the first bad batch.
- Only enable mixed precision after the baseline training run is already stable.

