Adam optimizer goes haywire after 200k batches, training loss grows

Introduction

Adam is usually a strong default optimizer, but late-training instability is a real failure mode. A model can appear healthy for tens or hundreds of thousands of batches and then suddenly start producing larger updates, noisier gradients, or steadily rising training loss. When that happens, the problem is often not "Adam is broken" so much as "the optimizer, learning rate, and numerical regime are no longer a good match for the current training state."

Why Adam Can Destabilize Late in Training

Adam keeps exponential moving averages of both gradients and squared gradients. That is what gives it adaptive step sizes, but it also means the optimizer has its own internal state that evolves over time. If the gradient distribution changes enough, those running estimates can stop being helpful and start amplifying instability.
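For reference, the standard Adam update can be written as below. The symbols are notation introduced here for illustration: g_t is the gradient at step t, \alpha is the learning rate, and \beta_1, \beta_2, \epsilon are the usual Adam hyperparameters.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```

Because each step is divided by \sqrt{\hat{v}_t}, a long stretch of small gradients shrinks the denominator. A later gradient that is large relative to that recent history can then produce an update noticeably larger than the typical step of roughly \alpha.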

In practice, late loss growth usually comes from one or more of these:

  • the learning rate is still too high for the current phase of training
  • gradients occasionally spike and Adam reacts badly to them
  • mixed precision or small denominators create numerical issues
  • the data distribution changes across epochs or shards
  • regularization, batch norm, or loss-scaling choices interact badly with long training

The important point is that Adam is not guaranteed to become more stable just because training has run longer.

What To Check First

Before changing the optimizer, confirm whether the issue is actually optimizer state and not a data or implementation bug.

A useful debugging checklist is:

  • Does validation loss also rise, or only training loss?
  • Did the learning rate scheduler change around the same time?
  • Are there NaNs or Infs in gradients, activations, or loss terms?
  • Did a new data shard, curriculum stage, or augmentation policy start late?
  • Are gradient norms exploding intermittently rather than growing smoothly?

In PyTorch, logging gradient norms makes this easier:

python
import torch

def grad_norm(model):
    """Return the global L2 norm over all parameter gradients."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# inside training loop
loss.backward()
print("grad_norm:", grad_norm(model))
optimizer.step()
optimizer.zero_grad()

If the norm suddenly jumps when the loss starts rising, that is a strong signal that updates need to be controlled more tightly.

The Most Common Fixes

Lower the learning rate

The simplest fix is often the correct one. Adam is usually less sensitive to the exact learning rate than plain SGD, but a rate that is fine early can still be too aggressive later.

python
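optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate once the monitored loss has not improved
# for more than 3 consecutive checks.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)
# After each validation pass, call: scheduler.step(val_loss)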

If training destabilizes after a long plateau, decaying the learning rate is usually a better first move than changing everything at once.
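ReduceLROnPlateau only reacts when it is handed the metric it monitors. A minimal sketch of that wiring follows, assuming a stand-in linear model and a hypothetical validation loss that never improves; neither comes from a real training run.

```python
import torch

model = torch.nn.Linear(4, 1)  # stand-in model, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

lr_history = []
for epoch in range(10):
    val_loss = 1.0  # hypothetical validation loss stuck on a plateau
    scheduler.step(val_loss)  # pass the monitored metric on every check
    lr_history.append(optimizer.param_groups[0]["lr"])

print(lr_history)
```

Logging the learning rate next to the loss, as `lr_history` does here, also helps answer whether a scheduler change coincided with the instability.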

Add gradient clipping

Gradient clipping is one of the most reliable protections against sudden bad steps.

python
loss.backward()
# Rescale gradients in place so their global L2 norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()

Clipping does not fix the root cause, but it can stop rare spikes from wrecking the optimizer state.
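`clip_grad_norm_` also returns the total gradient norm it measured before clipping, so the same call can double as a monitoring hook. The sketch below uses a stand-in model with artificially large gradients to show the effect.

```python
import torch

model = torch.nn.Linear(4, 1)  # stand-in model, for illustration only
for p in model.parameters():
    p.grad = torch.full_like(p, 10.0)  # artificially large gradients

# The return value is the global norm measured before clipping.
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the global norm after clipping to confirm it was rescaled.
post_clip_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(float(pre_clip_norm), float(post_clip_norm))
```

Logging the pre-clip value over time shows how often clipping is actually active. If it fires on nearly every step, the threshold is acting more like a learning-rate reduction than a safety net.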

Increase eps slightly

If the issue is numerical, especially in mixed-precision or very small-gradient regimes, increasing Adam's eps can help stabilize the denominator.

python
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    eps=1e-7,  # raised from the PyTorch default of 1e-8
)

This is not always necessary, but it is a reasonable targeted change when the failure looks numerical rather than purely optimization-related.
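To see why eps matters mainly when gradients are tiny, consider the idealized case of a gradient that has been constant for many steps. The bias-corrected moments then reduce to the gradient and its square. The helper `steady_state_step` below is a hypothetical illustration of that arithmetic, not part of any library.

```python
import math

def steady_state_step(grad, lr=1e-4, eps=1e-8):
    """Adam step magnitude when the gradient has been constant at `grad`.

    With a constant gradient, the bias-corrected moments reduce to
    m_hat = grad and v_hat = grad ** 2.
    """
    return lr * abs(grad) / (math.sqrt(grad ** 2) + eps)

# Compare a moderate gradient with a tiny one under two eps values.
for grad in (1e-3, 1e-9):
    for eps in (1e-8, 1e-6):
        relative = steady_state_step(grad, eps=eps) / 1e-4
        print(f"grad={grad:.0e} eps={eps:.0e} step/lr={relative:.4f}")
```

For the moderate gradient, the step stays close to the learning rate under both eps values. For the tiny gradient, eps dominates the denominator, so a larger eps damps the update instead of letting Adam rescale what is effectively noise into a full-size step.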

When To Change the Optimizer

If lower learning rate, clipping, and basic numeric checks do not help, the next step is often to try a variant rather than abandoning adaptive optimization entirely.

Two common options are:

  • AdamW, which decouples weight decay from the adaptive update
  • AMSGrad, which was proposed partly to address convergence pathologies in Adam

Example:

python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-4,  # applied to the weights directly, not via the gradient
)

For some models, especially transformer-style architectures, AdamW is a better default than classic Adam anyway.
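AMSGrad does not need a separate optimizer class in PyTorch: it is exposed as the `amsgrad` flag on both `Adam` and `AdamW`. A minimal sketch, using a stand-in model and the same illustrative hyperparameters as above:

```python
import torch

model = torch.nn.Linear(4, 1)  # stand-in model, for illustration only

# amsgrad=True keeps a running maximum of the second-moment estimate,
# so the effective per-parameter step size cannot grow back after it shrinks.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-4,
    amsgrad=True,
)
```

Whether this helps is model-dependent, so it is worth testing as a single isolated change.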

Another practical strategy is optimizer reset: keep the learned weights, but recreate the optimizer state after reducing the learning rate. That is sometimes enough when the model weights are good but Adam's accumulated moments have become unhelpful.
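A minimal sketch of such a reset, assuming a stand-in model and illustrative learning rates: the model object is reused, so its weights are untouched, while the new optimizer starts with empty moment estimates.

```python
import torch

model = torch.nn.Linear(4, 1)  # stand-in model, for illustration only
old_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Take one step so the old optimizer accumulates moment estimates.
model(torch.randn(8, 4)).pow(2).mean().backward()
old_optimizer.step()
old_optimizer.zero_grad()

weights_before = [p.detach().clone() for p in model.parameters()]

# Reset: same parameters, lower learning rate, no accumulated state.
new_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

print(len(old_optimizer.state), len(new_optimizer.state))
```

Any learning rate scheduler attached to the old optimizer holds a reference to it, so it would need to be recreated against the new one as well.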

Watch the Data Pipeline Too

A lot of "optimizer goes haywire" reports are actually late-appearing data issues. Examples include:

  • one corrupted shard enters the rotation after many steps
  • augmentation occasionally produces extreme samples
  • label noise is concentrated in a later subset
  • curriculum learning changes the batch difficulty abruptly

If the instability starts around the same batch every run, that is a strong hint to inspect the input pipeline, not only the optimizer hyperparameters.
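One way to check for a repeatable failure point is to record the per-batch loss and flag steps that stand out against the recent baseline. The helper `find_loss_spikes` below is a hypothetical sketch, and its `window` and `factor` defaults are arbitrary illustrative choices.

```python
from statistics import median

def find_loss_spikes(losses, window=50, factor=3.0):
    """Return indices whose loss exceeds `factor` times the median
    of the preceding `window` losses."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = median(losses[i - window:i])
        if baseline > 0 and losses[i] > factor * baseline:
            spikes.append(i)
    return spikes

# Toy run: a flat loss with one bad batch at step 120.
losses = [1.0] * 200
losses[120] = 10.0
spike_steps = find_loss_spikes(losses)
print(spike_steps)
```

If the flagged steps line up across runs with different random seeds, mapping them back to shard or batch identifiers is usually the next thing to try.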

Common Pitfalls

The biggest mistake is changing five things at once. If you lower the learning rate, switch to AdamW, add clipping, and modify the scheduler together, you will not know what actually fixed the problem.

Another mistake is assuming rising training loss always means overfitting. Overfitting usually shows up as validation loss rising while training loss keeps falling; rising training loss often points to optimization or data instability instead.

People also often ignore optimizer state. Adam is stateful, so restarting training from weights alone versus restoring the full optimizer can lead to very different behavior.
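The difference can be sketched with `state_dict()` and `load_state_dict()`, which both modules and optimizers provide. The model and checkpoint layout below are illustrative assumptions; the checkpoint is kept in memory to keep the example short.

```python
import copy
import torch

model = torch.nn.Linear(4, 1)  # stand-in model, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Take one step so the optimizer has moment estimates worth saving.
model(torch.randn(8, 4)).pow(2).mean().backward()
optimizer.step()
optimizer.zero_grad()

# Checkpoint both the weights and the optimizer state.
checkpoint = {
    "model": copy.deepcopy(model.state_dict()),
    "optimizer": copy.deepcopy(optimizer.state_dict()),
}

# Full resume: weights and Adam moments are both restored.
resumed_model = torch.nn.Linear(4, 1)
resumed_model.load_state_dict(checkpoint["model"])
resumed_optimizer = torch.optim.Adam(resumed_model.parameters(), lr=1e-4)
resumed_optimizer.load_state_dict(checkpoint["optimizer"])

# Weights-only resume: the optimizer starts from empty state.
fresh_optimizer = torch.optim.Adam(resumed_model.parameters(), lr=1e-4)
```

In a real run the same dictionary would typically be written to disk with `torch.save` and read back with `torch.load`.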

Finally, do not overlook numerics. A single hidden NaN, poor mixed-precision scaling choice, or unstable custom loss term can make Adam look guilty when the real issue is elsewhere.
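A cheap guard is to scan gradients for non-finite values before each optimizer step. The helper `nonfinite_grad_params` below is a hypothetical sketch; the NaN is injected by hand purely to show what the check reports.

```python
import torch

def nonfinite_grad_params(model):
    """Return the names of parameters whose gradients contain NaN or Inf."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]

model = torch.nn.Linear(4, 1)  # stand-in model, for illustration only
for p in model.parameters():
    p.grad = torch.zeros_like(p)
model.bias.grad[0] = float("nan")  # inject a NaN for illustration

bad = nonfinite_grad_params(model)
print(bad)
```

Skipping the optimizer step when this list is non-empty keeps a single bad batch from contaminating Adam's moment estimates. For tracing where a NaN originates, `torch.autograd.set_detect_anomaly(True)` may also help, at a noticeable speed cost.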

Summary

  • Late-training loss growth with Adam is a real pattern, not just user error.
  • The first suspects are learning rate, gradient spikes, numeric stability, and data shifts.
  • Lowering the learning rate and adding gradient clipping are the most common first fixes.
  • AdamW, AMSGrad, or resetting optimizer state can help when classic Adam remains unstable.
  • If the failure starts at a repeatable point, inspect the data pipeline as carefully as the optimizer.
