BERT HuggingFace gives NaN Loss

Introduction

A NaN loss during BERT training is almost never “random.” It usually means one part of the training pipeline has already gone numerically invalid: the labels are wrong, the learning rate is too aggressive, mixed precision overflow is not handled, or the batch contains values the loss function cannot interpret.

Check the Dataset Before Blaming the Model

With Hugging Face models, the model code is usually the least suspicious part. Start by validating the tensors going into the forward pass.

For sequence classification, label values must be in the expected range. If the model has num_labels=3, then valid labels are 0, 1, and 2. For token classification, positions that should be ignored are typically labeled -100 (the default ignore_index of PyTorch's cross-entropy loss), while every other class ID must still stay within range.

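```python
import torch
from transformers import AutoTokenizer

num_labels = 2  # must match the num_labels passed to the model

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["good movie", "bad movie"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Token IDs are integers, so check the vocabulary range rather than finiteness.
assert batch["input_ids"].min().item() >= 0
assert batch["input_ids"].max().item() < tokenizer.vocab_size

# Class IDs must be integers in [0, num_labels).
assert labels.dtype == torch.long
assert labels.min().item() >= 0
assert labels.max().item() < num_labels
```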

If labels are floats for a classification problem, or if they contain stray values from preprocessing, the loss can become invalid quickly.
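For token classification, the same range check has to skip the ignored positions. The sketch below assumes PyTorch's default ignore index of -100; the helper name `check_token_labels` is hypothetical. It also rejects a batch in which every position is ignored, because the mean-reduced cross-entropy then averages over zero tokens and returns NaN.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def check_token_labels(labels: torch.Tensor, num_labels: int) -> None:
    """Raise ValueError if token-level labels are unusable (hypothetical helper)."""
    if labels.dtype != torch.long:
        raise ValueError(f"expected int64 labels, got {labels.dtype}")
    active = labels[labels != IGNORE_INDEX]
    if active.numel() == 0:
        raise ValueError("every position is ignored; the mean loss would be NaN")
    if active.min().item() < 0 or active.max().item() >= num_labels:
        raise ValueError("label id outside [0, num_labels)")


# Demonstration of the all-ignored case with random logits.
logits = torch.randn(4, 3)                                       # 4 tokens, 3 classes
all_ignored = torch.full((4,), IGNORE_INDEX, dtype=torch.long)   # nothing to score
loss = F.cross_entropy(logits, all_ignored)                      # mean over zero tokens
print(torch.isnan(loss).item())
```

Calling the helper on each batch before the forward pass is one way to catch a chunk of text that was tokenized into padding or special tokens only.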

A second useful check is to inspect attention masks and padding. A broken collator can produce malformed batches that look structurally correct but create nonsense activations later.
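One possible shape for that check is sketched below. The helper name `check_batch` is hypothetical, and the batch is assumed to be a dictionary with `input_ids` and `attention_mask` tensors, as the Hugging Face tokenizers return.

```python
import torch


def check_batch(batch: dict, vocab_size: int) -> list:
    """Return a list of problems found in a tokenized batch (hypothetical helper)."""
    problems = []
    ids, mask = batch["input_ids"], batch["attention_mask"]
    if ids.shape != mask.shape:
        problems.append(f"shape mismatch: {tuple(ids.shape)} vs {tuple(mask.shape)}")
        return problems
    if ids.min().item() < 0 or ids.max().item() >= vocab_size:
        problems.append("token id outside [0, vocab_size)")
    if not ((mask == 0) | (mask == 1)).all():
        problems.append("attention_mask contains values other than 0 and 1")
    if (mask.sum(dim=1) == 0).any():
        problems.append("at least one row is fully masked")
    return problems


# Illustrative ids only; 30522 is the vocabulary size of bert-base-uncased.
good = {
    "input_ids": torch.tensor([[101, 2204, 102], [101, 102, 0]]),
    "attention_mask": torch.tensor([[1, 1, 1], [1, 1, 0]]),
}
print(check_batch(good, vocab_size=30522))
```

Returning a list instead of raising on the first problem makes it easier to log everything that is wrong with a batch in one pass.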

Use a Conservative Training Configuration First

Many NaN issues come from trying to train too aggressively before the baseline run is stable. Start with a small learning rate and standard settings, then speed up only after the loss curve behaves normally.

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

args = TrainingArguments(
    output_dir="./model-output",
    learning_rate=2e-5,              # conservative fine-tuning rate
    per_device_train_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    max_grad_norm=1.0,               # gradient clipping
    fp16=False,                      # full precision until the run is stable
    logging_steps=10,
)
```

The important stabilizers here are:

a low learning rate such as 2e-5
gradient clipping through max_grad_norm
disabling mixed precision until the run is known to be stable

If training works in full precision and breaks only with fp16=True, the issue is probably overflow rather than bad labels.

Detect the First Bad Batch

Do not wait until the end of an epoch to investigate. Add checks around the forward pass so you can find the first invalid batch.

```python
import torch

# `model` and `train_dataloader` are assumed to be defined already.
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for step, batch in enumerate(train_dataloader):
    batch = {k: v.to(model.device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss

    # Stop at the first non-finite loss and show the start of the batch.
    if not torch.isfinite(loss):
        print("Bad batch at step", step)
        for key, value in batch.items():
            print(key, value[:2])
        break

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()
```

This turns a vague training failure into a specific reproducible batch. Once you have that batch, inspect token IDs, labels, sequence lengths, and preprocessing assumptions.
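To make that batch reproducible outside the training loop, one option is to write it to disk at the moment the check fires. The following is a minimal sketch using `torch.save` and `torch.load`; the tensors and the file name `bad_batch.pt` are placeholders.

```python
import os
import tempfile

import torch

# Stand-in for the batch that produced a non-finite loss.
bad_batch = {
    "input_ids": torch.tensor([[101, 102]]),
    "attention_mask": torch.tensor([[1, 1]]),
    "labels": torch.tensor([1]),
}

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.gettempdir(), "bad_batch.pt")
torch.save({k: v.cpu() for k, v in bad_batch.items()}, path)

# Later, in a notebook or a unit test:
restored = torch.load(path)
for key, value in restored.items():
    print(key, tuple(value.shape), value.dtype)
```

Moving the tensors to CPU before saving means the file can be opened on a machine without a GPU.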

Mixed Precision and Gradient Scaling

Half precision speeds up training, but it narrows the numeric range: the largest finite fp16 value is 65504. Large activations or unstable gradients can overflow to infinity, which then propagates into NaN values.
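The overflow path is easy to reproduce in isolation. This small illustration uses an arbitrary value of 70000 and is not tied to any particular BERT layer:

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0, the largest finite fp16 value

x = torch.tensor(70000.0)               # fits easily in fp32
x_half = x.to(torch.float16)            # exceeds the fp16 range
print(torch.isinf(x_half).item())       # True

# Once an inf exists, ordinary arithmetic turns it into NaN (inf - inf).
as_float = x_half.float()
print(torch.isnan(as_float - as_float).item())  # True
```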

If you need mixed precision, introduce it only after a stable full-precision run. When using the Hugging Face Trainer, rely on the built-in automatic mixed precision (AMP) support rather than custom half-precision code glued onto the loop. If you wrote the loop yourself, use torch.autocast (torch.cuda.amp.autocast in older PyTorch releases) and a GradScaler instead of manually casting tensors.

A lot of “BERT gives NaN loss” reports come down to turning on fp16 too early and then debugging the wrong layer.

Watch the Optimizer and Initialization Choices

Using the wrong optimizer settings can also destabilize fine-tuning. BERT fine-tuning is usually done with AdamW, low learning rates, and warmup for larger runs. If you replaced the classifier head or added custom layers, confirm those additions are initialized sensibly and match the loss function.

For example:

use CrossEntropyLoss for integer class IDs
use BCEWithLogitsLoss for multi-label targets
do not feed one-hot vectors into the wrong loss by accident

Loss-function mismatch is one of the quickest ways to get useless gradients.
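The pairings above can be made concrete with a few tensors. This is an illustrative sketch with random logits and made-up targets, showing the shape and dtype each loss expects:

```python
import torch
from torch import nn

logits = torch.randn(4, 3)  # batch of 4 examples, 3 classes

# Single-label: integer class IDs, shape (batch,), dtype int64.
class_ids = torch.tensor([0, 2, 1, 2])
ce = nn.CrossEntropyLoss()(logits, class_ids)

# Multi-label: multi-hot float targets, shape (batch, num_labels).
multi_hot = torch.tensor(
    [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
)
bce = nn.BCEWithLogitsLoss()(logits, multi_hot)

# If preprocessing produced one-hot rows for a single-label task,
# convert them back to class IDs before computing cross-entropy.
one_hot = nn.functional.one_hot(class_ids, num_classes=3)
recovered = one_hot.argmax(dim=-1)

print(ce.item(), bce.item(), recovered.tolist())
```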

A Practical Debugging Order

A good sequence is:

run one small batch on CPU or on a GPU in full precision
verify labels and shapes
lower the learning rate
disable fp16
add finite-value checks on loss and gradients
inspect the first bad batch instead of scanning the full dataset blindly

That order usually gets you to the real cause faster than changing many hyperparameters at once.
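The gradient check in that sequence can be a small helper called right after `loss.backward()`. The sketch below uses the hypothetical name `first_bad_gradient` and a tiny `nn.Linear` model as a stand-in; the NaN is injected by hand purely to show what the helper reports.

```python
import torch
from torch import nn


def first_bad_gradient(model: nn.Module):
    """Return the name of the first parameter with a non-finite gradient, else None."""
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            return name
    return None


model = nn.Linear(4, 2)                      # stand-in for a real model
loss = model(torch.randn(3, 4)).sum()
loss.backward()
clean = first_bad_gradient(model)            # gradients are finite here

# Simulate a corrupted gradient to show the failure report.
model.weight.grad[0, 0] = float("nan")
corrupted = first_bad_gradient(model)

print(clean, corrupted)
```

Knowing which parameter goes non-finite first narrows the search: a NaN that appears in the classifier head points at labels or the loss, while one that appears deep in the encoder points at the inputs or the precision settings.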

Common Pitfalls

The most common mistake is assuming NaN means the model architecture is broken. In practice, bad inputs and unstable training settings are far more common.

Another frequent issue is mixing task types. Sequence classification, token classification, regression, and multi-label classification do not use the same target format.

Developers also often turn on mixed precision before confirming that the baseline run is stable. That hides simpler issues behind overflow symptoms.

Finally, avoid debugging only from aggregate logs. You need the exact batch that first produced a non-finite loss.

Summary

NaN loss usually comes from invalid inputs or unstable training settings, not from BERT itself.
Check label ranges, tensor shapes, and collator output first.
Start with a conservative learning rate and full precision.
Add finite-value checks so you can isolate the first bad batch.
Only enable mixed precision after the baseline training run is already stable.