Adam optimizer with warmup on PyTorch
Introduction
Warmup means starting training with a smaller learning rate and increasing it gradually over the first part of training. In PyTorch, that pairs well with Adam or AdamW when the target learning rate is large enough that jumping there immediately would make early optimization unstable.
Why Warmup Helps Adam
Adam already adapts parameter-wise step sizes, so warmup is not mandatory for every project. But it often helps when:
- the model is large
- gradients are noisy at the beginning
- the target learning rate is aggressive
- training uses transformers or other architectures that benefit from careful early optimization
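The idea is simple: let the optimizer "settle in" before using the full step size. One common explanation is that Adam's moment estimates are built from only a handful of gradients in the first steps, so smaller updates limit the effect of those noisy early estimates.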
Basic AdamW Setup in PyTorch
PyTorch separates the optimizer from the scheduler. You create Adam or AdamW first, then attach a scheduler that controls the learning rate over time.
The optimizer holds the target base learning rate. Warmup is implemented by scheduling smaller values for the early steps.
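As a minimal sketch of that separation, the snippet below builds an AdamW optimizer on its own, before any scheduler exists. The toy nn.Linear model, the 3e-4 learning rate, and the 0.01 weight decay are illustrative assumptions, not recommended values.

```python
import torch
from torch import nn

# Toy model used only for illustration; any nn.Module works here.
model = nn.Linear(16, 4)

# The lr passed here is the target (base) learning rate that warmup ramps toward.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Each parameter group stores its own current learning rate.
base_lr = optimizer.param_groups[0]["lr"]
print(base_lr)
```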
Linear Warmup with Built-In Schedulers
One clean approach is to use LinearLR for warmup and then hand off to another scheduler such as cosine annealing.
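A sketch of that hand-off might look like the following. The toy model, the 3e-4 base rate, and the 10,000-step total are assumptions chosen for illustration; SequentialLR switches from the first scheduler to the second at the given milestone.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = nn.Linear(16, 4)  # placeholder model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 500
total_steps = 10_000  # assumed length of the whole run, in optimizer steps

# Phase 1: scale the LR from 10% of the base value up to 100% over warmup_steps.
warmup = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps)

# Phase 2: cosine decay over the remaining steps.
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)

# Use `warmup` until step 500, then switch to `cosine`.
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])
```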
In this example, the learning rate starts at ten percent of the base value, rises linearly over 500 optimizer steps, and then transitions into cosine decay.
Training Loop Placement
The order of calls matters. A typical loop looks like this:
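A minimal sketch of such a loop, assuming a toy regression model, an MSE loss, and synthetic batches standing in for a real DataLoader:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LinearLR

torch.manual_seed(0)
model = nn.Linear(16, 4)  # placeholder model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = LinearLR(optimizer, start_factor=0.1, total_iters=500)
loss_fn = nn.MSELoss()

# Synthetic batches stand in for a real DataLoader.
batches = [(torch.randn(8, 16), torch.randn(8, 4)) for _ in range(20)]

for inputs, targets in batches:
    optimizer.zero_grad()                    # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)   # forward pass
    loss.backward()                          # backward pass
    optimizer.step()                         # update the weights first
    scheduler.step()                         # then advance the schedule by one step
```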
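Calling scheduler.step() after optimizer.step() is the common pattern for step-based training schedules. PyTorch emits a warning if the scheduler is stepped before the optimizer, because that order skips the first value of the schedule.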
If you log the learning rate during training, you can see the warmup happen explicitly:
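One way to do that, sketched here with a warmup-only LinearLR schedule and assumed values, is to read scheduler.get_last_lr() after each step. The forward and backward passes are omitted to keep the focus on the schedule.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LinearLR

model = nn.Linear(16, 4)  # placeholder model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = LinearLR(optimizer, start_factor=0.1, total_iters=500)

lr_history = []
for step in range(500):
    # No backward pass here: optimizer.step() skips parameters without
    # gradients, which is enough to show how the schedule evolves.
    optimizer.step()
    scheduler.step()

    lr = scheduler.get_last_lr()[0]  # one entry per parameter group
    lr_history.append(lr)
    if step % 100 == 0:
        print(f"step {step:4d}  lr {lr:.2e}")
```

In a real run, the same value could be sent to whatever logging tool the project already uses.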
A Custom Lambda Warmup
If built-in schedulers do not match the schedule you want, LambdaLR is a flexible option.
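A minimal sketch, assuming a 500-step warmup and a toy model; the helper name warmup_factor is hypothetical. LambdaLR multiplies the base learning rate by whatever the function returns for the current step.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(16, 4)  # placeholder model (assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps = 500  # assumed warmup length


def warmup_factor(step: int) -> float:
    # Multiplier on the base LR: ramps linearly up to 1.0, then stays there.
    return min(1.0, (step + 1) / warmup_steps)


scheduler = LambdaLR(optimizer, lr_lambda=warmup_factor)
```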
This version performs pure warmup and then keeps the base learning rate fixed.
Choosing Warmup Length
There is no universal number of warmup steps. A short warmup may be enough for small models, while larger transformer-style training jobs often use a longer warmup phase.
The practical way to choose is to watch early training stability:
- if loss spikes immediately, warmup may be too short or the base LR too high
- if training starts very slowly, warmup may be too long
Warmup is not a substitute for a sensible base learning rate. It only smooths the transition into it.
Common Pitfalls
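A common pitfall is forgetting that the scheduler operates on the optimizer's parameter groups: it rewrites the lr stored in each group. If you inspect a stale or hard-coded learning-rate value instead of optimizer.param_groups or scheduler.get_last_lr(), or step the scheduler at the wrong time, the schedule may not match your intent.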
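To make the parameter-group point concrete, here is a small sketch with two groups at different base rates. The backbone/head split and the specific rates are assumptions for illustration.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LinearLR

backbone = nn.Linear(16, 8)  # hypothetical two-part model
head = nn.Linear(8, 4)

# Two parameter groups, each with its own base learning rate.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-4},
    {"params": head.parameters(), "lr": 1e-3},
])
scheduler = LinearLR(optimizer, start_factor=0.1, total_iters=100)

# Current (warmed-down) values live in the optimizer's parameter groups ...
current_lrs = [group["lr"] for group in optimizer.param_groups]
# ... while the targets the schedule ramps toward are kept on the scheduler.
target_lrs = scheduler.base_lrs

print(current_lrs, target_lrs)
```

The scheduler scales every group by the same factor, so each group keeps its own target.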
Another issue is combining warmup with a very small base learning rate. In that case, warmup may only slow training without improving stability.
Developers also sometimes say "Adam with warmup" while actually using decoupled weight decay behavior from AdamW. That is fine in practice, but it is worth being precise because the optimizer variants are not identical.
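The difference can be seen in a deliberately artificial sketch: a single weight with a zero gradient, so that only weight decay can move it. The helper one_step and the lr and weight_decay values are hypothetical choices for the demonstration.

```python
import torch


def one_step(optimizer_cls):
    # Single scalar weight with a zero gradient, so only weight decay acts on it.
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)
    opt.step()
    return w.item()


# Adam folds the decay term into the gradient, so it passes through
# Adam's adaptive scaling like any other gradient.
adam_weight = one_step(torch.optim.Adam)

# AdamW shrinks the weight directly by a factor of (1 - lr * weight_decay).
adamw_weight = one_step(torch.optim.AdamW)

print(adam_weight, adamw_weight)
```

The two results differ after a single step, which is why the same weight_decay value is not interchangeable between the two optimizers.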
Finally, be consistent about whether warmup is defined in steps or epochs. Most modern implementations use steps, which gives finer-grained control when each epoch contains many batches.
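If a recipe specifies warmup in epochs but the scheduler is stepped once per optimizer update, the conversion can be sketched as below. The helper name and its parameters are hypothetical, and it assumes the final partial batch of each epoch is kept rather than dropped.

```python
import math


def warmup_steps_from_epochs(num_samples, batch_size, warmup_epochs, grad_accum_steps=1):
    # Batches per epoch, keeping the last partial batch (assumption).
    batches_per_epoch = math.ceil(num_samples / batch_size)
    # With gradient accumulation, several batches make up one optimizer step.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return int(warmup_epochs * steps_per_epoch)


# Example: 50,000 samples, batch size 128, two warmup epochs.
steps = warmup_steps_from_epochs(50_000, 128, 2)
print(steps)
```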
Summary
- Warmup gradually increases the learning rate during the first part of training.
- In PyTorch, Adam or AdamW warmup is usually implemented with a scheduler, not a special optimizer class.
- LinearLR, LambdaLR, and SequentialLR are common building blocks.
- Step the optimizer first and then the scheduler in a standard step-based loop.
- Warmup helps with early stability, but it still depends on choosing a sensible base learning rate.

