Adam optimizer goes haywire after 200k batches, training loss grows

Introduction

Adam is usually a strong default optimizer, but late-training instability is a real failure mode. A model can appear healthy for tens or hundreds of thousands of batches and then suddenly start producing larger updates, noisier gradients, or steadily rising training loss. When that happens, the problem is often not "Adam is broken" so much as "the optimizer, learning rate, and numerical regime are no longer a good match for the current training state."

Why Adam Can Destabilize Late in Training

Adam keeps exponential moving averages of both gradients and squared gradients. That is what gives it adaptive step sizes, but it also means the optimizer has its own internal state that evolves over time. If the gradient distribution changes enough, those running estimates can stop being helpful and start amplifying instability.
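To make the mechanism concrete, here is a minimal pure-Python sketch of the standard Adam update for a single scalar parameter. The helper name `adam_step` and the gradient values are illustrative assumptions, not measurements from a real run. It simulates a long quiet phase of tiny gradients followed by one large gradient.

```python
import math

def adam_step(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar gradient g at step t (1-indexed)."""
    m = b1 * m + (1 - b1) * g          # moving average of gradients
    v = b2 * v + (1 - b2) * g * g      # moving average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return update, m, v

m = v = 0.0
steady_update = 0.0
for t in range(1, 20001):
    # Long quiet phase: the same tiny gradient at every step.
    steady_update, m, v = adam_step(1e-4, m, v, t)

# A single large gradient arrives while v still reflects the quiet phase.
spike_update, m, v = adam_step(1.0, m, v, 20001)
ratio = spike_update / steady_update
```

In this toy setting the step taken on the spike is roughly three times the steady-state step, because the second-moment estimate is still dominated by the quiet phase. Real models are messier, but the same lag between the two moving averages is what lets a sudden change in gradient scale translate into an oversized update.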

In practice, late loss growth usually comes from one or more of these:

the learning rate is still too high for the current phase of training
gradients occasionally spike and Adam reacts badly to them
mixed precision or small denominators create numerical issues
the data distribution changes across epochs or shards
regularization, batch norm, or loss-scaling choices interact badly with long training

The important point is that Adam is not guaranteed to become more stable just because training has run longer.

What To Check First

Before changing the optimizer, confirm whether the issue is actually optimizer state and not a data or implementation bug.

A useful debugging checklist is:

Does validation loss also rise, or only training loss?
Did the learning rate scheduler change around the same time?
Are there NaNs or Infs in gradients, activations, or loss terms?
Did a new data shard, curriculum stage, or augmentation policy start late?
Are gradient norms exploding intermittently rather than growing smoothly?

In PyTorch, logging gradient norms makes this easier:

```python
import torch

def grad_norm(model):
    """Return the global L2 norm over all parameter gradients."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# inside training loop
loss.backward()
print("grad_norm:", grad_norm(model))
optimizer.step()
optimizer.zero_grad()
```

If the norm suddenly jumps when the loss starts rising, that is a strong signal that updates need to be controlled more tightly.
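To tell intermittent spikes from smooth growth, one option is to compare each logged gradient norm against a rolling median. The `SpikeDetector` class, its window size, and the factor of 5 are hypothetical choices for illustration, and the norms below are synthetic.

```python
from collections import deque
from statistics import median

class SpikeDetector:
    """Flag values that exceed `factor` times the median of recent history."""

    def __init__(self, window=100, factor=5.0, min_history=10):
        self.history = deque(maxlen=window)
        self.factor = factor
        self.min_history = min_history

    def update(self, value):
        # Only judge once there is enough history to compare against.
        is_spike = (
            len(self.history) >= self.min_history
            and value > self.factor * median(self.history)
        )
        self.history.append(value)
        return is_spike

detector = SpikeDetector()
norms = [1.0] * 50 + [20.0] + [1.0] * 10  # synthetic gradient norms
spike_steps = [i for i, n in enumerate(norms) if detector.update(n)]
```

A median is used rather than a mean so that one spike does not immediately raise the threshold for the next one. In a training loop, `detector.update(...)` would be fed the gradient norm after each backward pass.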

The Most Common Fixes

Lower the learning rate

The simplest fix is often the correct one. Adam is usually less sensitive to the exact learning rate than plain SGD, but a rate that is fine early in training can still be too aggressive later.

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)
# Call scheduler.step(val_loss) once per evaluation so it can detect a plateau.
```

If training destabilizes after a long plateau, decaying the learning rate is usually a better first move than changing everything at once.
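The scheduler only acts when it is fed the monitored metric. The following is a small sketch of that call pattern, using a single dummy parameter in place of a real model and a synthetic list of validation losses (both are assumptions for illustration).

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))  # stand-in for model.parameters()
optimizer = torch.optim.Adam([param], lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

val_losses = [1.0, 1.0, 1.0, 1.0, 1.0]  # synthetic: no improvement after the first
for val_loss in val_losses:
    # ... one epoch of training steps would run here ...
    scheduler.step(val_loss)  # pass the monitored metric, once per evaluation

current_lr = optimizer.param_groups[0]["lr"]
```

With `patience=3`, the learning rate is halved once more than three consecutive evaluations fail to improve on the best loss, so `current_lr` ends up at half the initial value here. Logging `optimizer.param_groups[0]["lr"]` alongside the loss makes it easy to see whether a scheduler change lines up with the instability.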

Add gradient clipping

Gradient clipping is one of the most reliable protections against sudden bad steps.

```python
loss.backward()
# Clip after backward() and before step() so the update uses the clipped gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```

Clipping does not fix the root cause, but it can stop rare spikes from wrecking the optimizer state.
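`clip_grad_norm_` also returns the total gradient norm measured before clipping, so it can double as the logging hook. The sketch below uses a single parameter with a hand-set gradient (a synthetic value chosen so the norm is easy to verify) to show both the returned norm and the effect of clipping.

```python
import torch

param = torch.nn.Parameter(torch.zeros(2))
param.grad = torch.tensor([3.0, 4.0])  # synthetic gradient with L2 norm 5

# Gradients are rescaled in place; the return value is the pre-clip norm.
pre_clip_norm = torch.nn.utils.clip_grad_norm_([param], max_norm=1.0)
post_clip_norm = param.grad.norm(2)

was_clipped = pre_clip_norm.item() > 1.0
```

Tracking how often `was_clipped` is true is a cheap diagnostic: if nearly every step is clipped, the threshold is acting as a hidden learning-rate reduction rather than as protection against rare spikes.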

Increase `eps` slightly

If the issue is numerical, especially in mixed-precision or very small-gradient regimes, increasing Adam's eps can help stabilize the denominator.

```python
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-7,  # PyTorch's default is 1e-8
)
```

This is not always necessary, but it is a reasonable targeted change when the failure looks numerical rather than purely optimization-related.
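Two small calculations illustrate why `eps` matters. The first shows that the default value is not representable in float16, which is only relevant if the optimizer arithmetic itself runs in half precision (an assumption; many mixed-precision setups keep optimizer state in float32). The second uses hypothetical moment values to show how `eps` damps the step when the second-moment estimate is very small.

```python
import torch

# float16's smallest positive value is about 6e-8, so 1e-8 rounds to zero there.
eps_default_fp16 = torch.tensor(1e-8, dtype=torch.float16).item()
eps_larger_fp16 = torch.tensor(1e-7, dtype=torch.float16).item()

def step_size(m_hat, v_hat, eps, lr=1e-4):
    """Magnitude of one Adam step given bias-corrected moments."""
    return lr * abs(m_hat) / (v_hat ** 0.5 + eps)

# Hypothetical moments for a parameter whose gradients have become tiny.
m_hat, v_hat = 1e-9, 1e-20
small_eps_step = step_size(m_hat, v_hat, eps=1e-8)
large_eps_step = step_size(m_hat, v_hat, eps=1e-7)
```

When the square root of `v_hat` is much smaller than `eps`, the denominator is effectively `eps`, so raising it by a factor of ten shrinks the step by roughly the same factor for those parameters while leaving parameters with larger gradients almost unaffected.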

When To Change the Optimizer

If a lower learning rate, clipping, and basic numeric checks do not help, the next step is often to try a variant of Adam rather than abandoning adaptive optimization entirely.

Two common options are:

AdamW, which decouples weight decay from the adaptive update
AMSGrad, which was proposed partly to address convergence pathologies in Adam

Example:

```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-4,
)
```

For some models, especially transformer-style architectures, AdamW is a better default than classic Adam anyway.
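AMSGrad does not need a separate optimizer class in PyTorch; it is exposed as a flag on `Adam` and `AdamW`. A minimal sketch, with a placeholder `Linear` model and the same illustrative hyperparameters as above:

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-4,
    amsgrad=True,  # use the running maximum of the second-moment estimate
)
```

Because AMSGrad divides by the maximum second-moment estimate seen so far, the effective per-parameter step size cannot grow when recent gradients shrink. Whether that helps in a given run is an empirical question, so it is worth testing as a single isolated change.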

Another practical strategy is optimizer reset: keep the learned weights, but recreate the optimizer state after reducing the learning rate. That is sometimes enough when the model weights are good but Adam's accumulated moments have become unhelpful.
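A reset can be sketched as follows. The helper `reset_optimizer` and its `lr_scale` parameter are hypothetical names, and the sketch assumes a single parameter group and plain Adam; other hyperparameters such as `betas` would need to be carried over explicitly.

```python
import torch

def reset_optimizer(model, old_optimizer, lr_scale=0.5):
    """Build a fresh Adam with empty state and a scaled-down learning rate."""
    old_lr = old_optimizer.param_groups[0]["lr"]  # assumes one param group
    return torch.optim.Adam(model.parameters(), lr=old_lr * lr_scale)

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One step so the old optimizer accumulates moment estimates.
model(torch.randn(8, 4)).sum().backward()
optimizer.step()
optimizer.zero_grad()

weights_before = model.weight.detach().clone()
optimizer = reset_optimizer(model, optimizer)  # weights kept, moments discarded
```

The new optimizer starts with empty state, so its first few steps go through Adam's bias correction again. Pairing the reset with a lower learning rate, as here, keeps those early steps from undoing the progress stored in the weights.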

Watch the Data Pipeline Too

A lot of "optimizer goes haywire" reports are actually late-appearing data issues. Examples include:

one corrupted shard enters the rotation after many steps
augmentation occasionally produces extreme samples
label noise is concentrated in a later subset
curriculum learning changes the batch difficulty abruptly

If the instability starts around the same batch every run, that is a strong hint to inspect the input pipeline, not only the optimizer hyperparameters.
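One lightweight way to check this is to log which shard each batch came from and then count which shards are associated with unusually high losses. The record layout (`step`, `shard`, `loss`), the helper name `suspect_shards`, and the factor of 3 are all assumptions for illustration; the log below is synthetic.

```python
from collections import Counter
from statistics import median

def suspect_shards(records, factor=3.0):
    """Count, per shard, the steps whose loss exceeds factor x the median loss."""
    typical = median(r["loss"] for r in records)
    return Counter(r["shard"] for r in records if r["loss"] > factor * typical)

# Synthetic log: shard "c" produces unusually high losses.
records = (
    [{"step": i, "shard": "a", "loss": 0.5} for i in range(20)]
    + [{"step": 20 + i, "shard": "b", "loss": 0.6} for i in range(20)]
    + [{"step": 40 + i, "shard": "c", "loss": 4.0} for i in range(5)]
)
counts = suspect_shards(records)
```

If the high-loss steps cluster on one shard or one augmentation setting, inspecting a handful of raw samples from it is usually faster than another round of hyperparameter changes.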

Common Pitfalls

The biggest mistake is changing five things at once. If you lower the learning rate, switch to AdamW, add clipping, and modify the scheduler together, you will not know what actually fixed the problem.

Another mistake is assuming rising training loss always means overfitting. Overfitting usually shows up first in validation behavior; rising training loss often points to optimization or data instability instead.

People also often ignore optimizer state. Adam is stateful, so restarting training from weights alone versus restoring the full optimizer state can lead to very different behavior.
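A checkpoint that preserves Adam's behavior needs both state dicts. The sketch below uses an in-memory buffer in place of a checkpoint file and a placeholder `Linear` model; the dictionary keys `"model"` and `"optimizer"` are arbitrary names chosen for this example.

```python
import io
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model(torch.randn(8, 4)).sum().backward()
optimizer.step()  # populates the per-parameter moment estimates

# Save both state dicts; the buffer stands in for a checkpoint file.
buffer = io.BytesIO()
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()}, buffer
)

# Restore into freshly constructed objects.
buffer.seek(0)
checkpoint = torch.load(buffer)
restored_model = torch.nn.Linear(4, 2)
restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=1e-4)
restored_model.load_state_dict(checkpoint["model"])
restored_optimizer.load_state_dict(checkpoint["optimizer"])
```

If only the model entry is restored, the optimizer starts from empty moments, which is effectively the optimizer reset described earlier. That can be useful, but it should be a deliberate choice rather than a side effect of how checkpoints are written.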

Finally, do not overlook numerics. A single hidden NaN, poor mixed-precision scaling choice, or unstable custom loss term can make Adam look guilty when the real issue is elsewhere.
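A quick way to rule out hidden NaNs is to scan gradients after each backward pass. The helper name `nonfinite_grad_params` is hypothetical, and the NaN below is injected by hand purely to demonstrate the check.

```python
import torch

def nonfinite_grad_params(model):
    """Return names of parameters whose gradients contain NaN or Inf."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]

model = torch.nn.Linear(4, 2)
model(torch.randn(8, 4)).sum().backward()
clean = nonfinite_grad_params(model)

model.bias.grad[0] = float("nan")  # inject a fault for demonstration
bad = nonfinite_grad_params(model)
```

Running this check before `optimizer.step()` and skipping or logging the offending batch keeps a single bad value from contaminating Adam's moment estimates, and the parameter names point toward the layer or loss term responsible.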

Summary

Late-training loss growth with Adam is a real pattern, not just user error.
The first suspects are learning rate, gradient spikes, numeric stability, and data shifts.
Lowering the learning rate and adding gradient clipping are the most common first fixes.
AdamW, AMSGrad, or resetting optimizer state can help when classic Adam remains unstable.
If the failure starts at a repeatable point, inspect the data pipeline as carefully as the optimizer.
