Adam optimizer goes haywire after 200k batches, training loss grows
Introduction
Adam is usually a strong default optimizer, but late-training instability is a real failure mode. A model can appear healthy for tens or hundreds of thousands of batches and then suddenly start producing larger updates, noisier gradients, or steadily rising training loss. When that happens, the problem is often not "Adam is broken" so much as "the optimizer, learning rate, and numerical regime are no longer a good match for the current training state."
Why Adam Can Destabilize Late in Training
Adam keeps exponential moving averages of both gradients and squared gradients. That is what gives it adaptive step sizes, but it also means the optimizer has its own internal state that evolves over time. If the gradient distribution changes enough, those running estimates can stop being helpful and start amplifying instability.
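To make that internal state concrete, here is a minimal plain-Python sketch of the standard Adam update for a single scalar parameter. The function name `adam_step` and the toy gradient sequence are assumptions made for illustration; real implementations operate on tensors and add more options.

```python
import math

def adam_step(grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative only).

    Returns the parameter delta and the updated moment estimates.
    """
    m = beta1 * m + (1 - beta1) * grad          # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for step t >= 1
    v_hat = v / (1 - beta2 ** t)
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v

# A long run of small, steady gradients followed by one large spike.
m, v = 0.0, 0.0
for t in range(1, 1001):
    steady_delta, m, v = adam_step(1e-3, m, v, t)
spike_delta, m, v = adam_step(1.0, m, v, 1001)

print(f"steady step: {steady_delta:.2e}, step after spike: {spike_delta:.2e}")
```

In this toy run the step taken right after the spike is larger than the steady-state step, and the spike stays in both moving averages for many steps afterwards, which is the sense in which the optimizer state carries history.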
In practice, late loss growth usually comes from one or more of these:
- the learning rate is still too high for the current phase of training
- gradients occasionally spike and Adam reacts badly to them
- mixed precision or small denominators create numerical issues
- the data distribution changes across epochs or shards
- regularization, batch norm, or loss-scaling choices interact badly with long training
The important point is that Adam is not guaranteed to become more stable just because training has run longer.
What To Check First
Before changing the optimizer, confirm whether the issue is actually optimizer state and not a data or implementation bug.
A useful debugging checklist is:
- does validation loss also rise, or only training loss?
- did the learning rate scheduler change around the same time?
- are there NaNs or Infs in gradients, activations, or loss terms?
- did a new data shard, curriculum stage, or augmentation policy start late?
- are gradient norms exploding intermittently rather than growing smoothly?
In PyTorch, logging gradient norms makes this easier:
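A minimal sketch of such logging, assuming a toy `nn.Linear` model and random data in place of a real training loop. The helper name `global_grad_norm` is hypothetical, not a PyTorch API.

```python
import torch
from torch import nn

def global_grad_norm(model: nn.Module) -> float:
    """L2 norm over all parameter gradients; call after backward()."""
    squared_sum = 0.0
    for param in model.parameters():
        if param.grad is not None:
            squared_sum += param.grad.detach().pow(2).sum().item()
    return squared_sum ** 0.5

# Toy setup purely for illustration.
torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)

grad_norms = []
for step in range(3):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    grad_norms.append(global_grad_norm(model))  # log before the update
    optimizer.step()
    print(f"step={step} loss={loss.item():.4f} grad_norm={grad_norms[-1]:.4f}")
```

In a real run you would send the value to whatever metrics logger you already use and plot it next to the loss curve.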
If the norm suddenly jumps when the loss starts rising, that is a strong signal that updates need to be controlled more tightly.
The Most Common Fixes
Lower the learning rate
The simplest fix is often the correct one. Adam is often less sensitive to the exact learning rate than plain SGD, but a rate that is fine early can still be too aggressive later.
If training destabilizes after a long plateau, decaying the learning rate is usually a better first move than changing everything at once.
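One way to apply a manual decay, sketched under the assumption that no scheduler is attached to the optimizer. The helper `scale_learning_rate` and the factor of 0.5 are illustrative choices, not recommendations from the original post.

```python
import torch
from torch import nn

def scale_learning_rate(optimizer: torch.optim.Optimizer, factor: float) -> None:
    """Multiply the learning rate of every parameter group by `factor`."""
    for group in optimizer.param_groups:
        group["lr"] *= factor

model = nn.Linear(4, 1)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

scale_learning_rate(optimizer, 0.5)  # halve the rate at the chosen step
current_lr = optimizer.param_groups[0]["lr"]
print(current_lr)
```

If a scheduler is attached, it may recompute the rate on its next step and undo a manual change, so in that case it is usually cleaner to adjust the schedule itself.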
Add gradient clipping
Gradient clipping is one of the most reliable protections against sudden bad steps.
Clipping does not fix the root cause, but it can stop rare spikes from wrecking the optimizer state.
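A minimal sketch using `torch.nn.utils.clip_grad_norm_`, with a toy model and deliberately large inputs so that clipping actually triggers. The threshold `max_norm = 1.0` is an assumed value that needs tuning per model.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
max_norm = 1.0  # assumed threshold, tune for your model

# Large inputs to provoke a big gradient in this toy example.
inputs = torch.randn(8, 4) * 100.0
targets = torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# Clip after backward() and before step(); the call returns the pre-clip norm.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()

print(f"before={float(norm_before):.2f} after={float(norm_after):.2f}")
```

Logging the returned pre-clip norm is useful on its own: it shows how often clipping is active and how large the spikes are.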
Increase eps slightly
If the issue is numerical, especially in mixed-precision or very small-gradient regimes, increasing Adam's eps can help stabilize the denominator.
This is not always necessary, but it is a reasonable targeted change when the failure looks numerical rather than purely optimization-related.
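A short sketch of the change, assuming a toy model. The value `1e-6` is an illustrative choice, not a recommendation from the original post. The last two lines show why precision matters: PyTorch's default `eps` of `1e-8` rounds to zero when stored in float16, so it only protects the denominator if that arithmetic runs in higher precision (which many mixed-precision setups do).

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # stand-in for the real model

# Same optimizer, with eps raised from the PyTorch default of 1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-6)

# How the two eps values survive a cast to float16.
default_eps_in_fp16 = torch.tensor(1e-8, dtype=torch.float16).item()
raised_eps_in_fp16 = torch.tensor(1e-6, dtype=torch.float16).item()
print(default_eps_in_fp16, raised_eps_in_fp16)
```

A larger `eps` also damps updates for parameters whose gradients are very small, so it changes optimization behavior as well as numerics.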
When To Change the Optimizer
If lowering the learning rate, adding clipping, and running basic numeric checks do not help, the next step is often to try a variant rather than abandoning adaptive optimization entirely.
Two common options are:
- AdamW, which decouples weight decay from the adaptive update
- AMSGrad, which was proposed partly to address convergence pathologies in Adam
Example:
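A minimal sketch of both options in PyTorch, assuming a toy model. The learning rate and weight decay values are placeholders.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # stand-in for the real model

# AdamW: weight decay is applied separately from the adaptive update.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AMSGrad: enabled through a flag on the regular Adam optimizer.
adam_amsgrad = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)
```

Only one of the two would be used in a given run; they are created side by side here just to show the constructor arguments.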
For some models, especially transformer-style architectures, AdamW is a better default than classic Adam anyway.
Another practical strategy is optimizer reset: keep the learned weights, but recreate the optimizer state after reducing the learning rate. That is sometimes enough when the model weights are good but Adam's accumulated moments have become unhelpful.
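The reset described above can be sketched as follows, assuming a toy model and a tenfold learning rate reduction chosen only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Take one step so Adam accumulates moment estimates.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
state_entries_before = len(optimizer.state)

# Reset: keep the weights, discard the moments, lower the learning rate.
weights_before = [p.detach().clone() for p in model.parameters()]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
state_entries_after = len(optimizer.state)

print(state_entries_before, state_entries_after)
```

The new optimizer starts with empty state, so its moment estimates and bias correction restart from step one while the model weights are untouched.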
Watch the Data Pipeline Too
A lot of "optimizer goes haywire" reports are actually late-appearing data issues. Examples include:
- one corrupted shard enters the rotation after many steps
- augmentation occasionally produces extreme samples
- label noise is concentrated in a later subset
- curriculum learning changes the batch difficulty abruptly
If the instability starts around the same batch every run, that is a strong hint to inspect the input pipeline, not only the optimizer hyperparameters.
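One rough way to check for a repeatable starting point is to compare where the loss first jumps in each run. The sketch below is plain Python; the helper `first_spike_step`, its `window` and `factor` defaults, and the two synthetic loss logs are all assumptions for illustration, and it presumes positive loss values.

```python
from statistics import median

def first_spike_step(losses, window=50, factor=3.0):
    """Return the first step whose loss exceeds `factor` times the median
    of the previous `window` losses, or None if no such step exists."""
    for step in range(window, len(losses)):
        baseline = median(losses[step - window:step])
        if losses[step] > factor * baseline:
            return step
    return None

# Synthetic loss logs from two hypothetical runs with different seeds.
run_a = [1.0] * 100 + [5.0] * 10
run_b = [1.1] * 100 + [6.0] * 10

spike_steps = [first_spike_step(run_a), first_spike_step(run_b)]
print(spike_steps)
```

If the reported steps line up across runs, mapping that step back to the shard or samples being read at that point is a natural next move.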
Common Pitfalls
The biggest mistake is changing five things at once. If you lower the learning rate, switch to AdamW, add clipping, and modify the scheduler together, you will not know what actually fixed the problem.
Another mistake is assuming rising training loss always means overfitting. Overfitting usually shows up first in validation behavior; rising training loss often points to optimization or data instability instead.
People also often ignore optimizer state. Adam is stateful, so resuming training from the weights alone can behave very differently from resuming with the full optimizer state restored.
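A minimal sketch of carrying optimizer state across a restart, assuming a toy model. The checkpoint is kept in memory to keep the example short; in practice the same dictionary would typically be written with `torch.save` and read back with `torch.load`.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One step so the optimizer has moment estimates worth saving.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

# Checkpoint both the weights and the optimizer state.
checkpoint = {
    "model": copy.deepcopy(model.state_dict()),
    "optimizer": copy.deepcopy(optimizer.state_dict()),
}

# Resume: rebuild both objects, then load both state dicts.
resumed_model = nn.Linear(4, 1)
resumed_model.load_state_dict(checkpoint["model"])
resumed_optimizer = torch.optim.Adam(resumed_model.parameters(), lr=1e-3)
resumed_optimizer.load_state_dict(checkpoint["optimizer"])

restored_keys = sorted(resumed_optimizer.state[resumed_model.weight].keys())
print(restored_keys)
```

Skipping the `load_state_dict` call on the optimizer gives the weights-only restart, which begins with empty moment estimates.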
Finally, do not overlook numerics. A single hidden NaN, poor mixed-precision scaling choice, or unstable custom loss term can make Adam look guilty when the real issue is elsewhere.
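A small check for non-finite gradients can catch this early. In the sketch below the helper name `nonfinite_grad_names` is hypothetical, and the NaN is injected by hand only to show what a hit looks like.

```python
import torch
from torch import nn

def nonfinite_grad_names(model: nn.Module) -> list:
    """Names of parameters whose gradient contains a NaN or Inf."""
    return [
        name
        for name, param in model.named_parameters()
        if param.grad is not None and not torch.isfinite(param.grad).all()
    ]

torch.manual_seed(0)
model = nn.Linear(4, 1)  # stand-in for the real model
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

clean = nonfinite_grad_names(model)

# Inject a NaN by hand to simulate a numerical failure.
model.weight.grad[0, 0] = float("nan")
flagged = nonfinite_grad_names(model)

print(clean, flagged)
```

Running the check after `backward()` and before `optimizer.step()` lets you skip or inspect the batch before a bad value reaches Adam's moment estimates.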
Summary
- Late-training loss growth with Adam is a real pattern, not just user error.
- The first suspects are learning rate, gradient spikes, numeric stability, and data shifts.
- Lowering the learning rate and adding gradient clipping are the most common first fixes.
- AdamW, AMSGrad, or resetting optimizer state can help when classic Adam remains unstable.
- If the failure starts at a repeatable point, inspect the data pipeline as carefully as the optimizer.

