Adam optimizer with warmup on PyTorch

Introduction

Warmup means starting training with a smaller learning rate and increasing it gradually over the first part of training. In PyTorch, that pairs well with Adam or AdamW when the target learning rate is large enough that jumping there immediately would make early optimization unstable.

Why Warmup Helps Adam

Adam already adapts parameter-wise step sizes, so warmup is not mandatory for every project. But it often helps when:

the model is large
gradients are noisy at the beginning
the target learning rate is aggressive
training uses transformers or other architectures that benefit from careful early optimization

The idea is simple: let the optimizer "settle in" before using the full step size.
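As a rough illustration of that idea, linear warmup can be written as plain arithmetic, independent of any framework. The function name `linear_warmup_lr` and the 500-step warmup length are assumptions chosen for this sketch.

```python
def linear_warmup_lr(step, base_lr, warmup_steps):
    """Learning rate for a 0-indexed optimizer step under linear warmup."""
    if warmup_steps <= 0:
        return base_lr
    # Fraction of warmup completed, capped at 1.0 once warmup is over.
    factor = min(1.0, (step + 1) / warmup_steps)
    return base_lr * factor


base_lr = 3e-4
for step in (0, 249, 499, 500, 5000):
    print(step, linear_warmup_lr(step, base_lr, warmup_steps=500))
```

The rate grows in equal increments until it reaches `base_lr` and stays there. The PyTorch schedulers shown later apply the same kind of multiplier to the optimizer's learning rate.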

Basic AdamW Setup in PyTorch

PyTorch separates the optimizer from the scheduler. You create Adam or AdamW first, then attach a scheduler that controls the learning rate over time.

python

import torch
from torch import nn
from torch.optim import AdamW

model = nn.Linear(128, 10)
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

The optimizer holds the target base learning rate. Warmup is implemented by scheduling smaller values for the early steps.

Linear Warmup with Built-In Schedulers

One clean approach is to use LinearLR for warmup and then hand off to another scheduler such as cosine annealing.

python

import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = nn.Linear(128, 10)
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Warmup: scale the base LR from 0.1x to 1.0x over the first 500 steps.
warmup_scheduler = LinearLR(
    optimizer,
    start_factor=0.1,
    end_factor=1.0,
    total_iters=500,
)

# Decay: cosine annealing over the 9500 steps that follow warmup.
decay_scheduler = CosineAnnealingLR(
    optimizer,
    T_max=9500,
)

# Switch from warmup to decay once 500 scheduler steps have been taken.
scheduler = SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, decay_scheduler],
    milestones=[500],
)

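In this example, the learning rate starts at ten percent of the base value (3e-5), rises linearly to 3e-4 over the first 500 optimizer steps, and then follows cosine decay for the next 9500 steps. CosineAnnealingLR defaults to eta_min=0, so the rate approaches zero at the end of the schedule. T_max should therefore match the number of steps that remain after warmup.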
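One way to check a schedule like this is to step it without any real training and record the learning rate at each step. The sketch below assumes a run of 10,000 optimizer steps (500 warmup plus 9500 decay) and a recent PyTorch release; `total_steps`, `lrs`, and `final_lr` are names introduced here for illustration.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

total_steps = 10_000   # assumed run length: 500 warmup + 9500 decay steps
warmup_steps = 500

model = nn.Linear(128, 10)
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_steps),
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)

lrs = []
for _ in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])  # LR that this step would use
    optimizer.step()   # no gradients were computed, so parameters stay unchanged
    scheduler.step()
final_lr = optimizer.param_groups[0]["lr"]

for step in (0, 250, 500, 5250, total_steps - 1):
    print(f"step {step:>5}: lr = {lrs[step]:.2e}")
print(f"after the last step: lr = {final_lr:.2e}")
```

If the printed values do not start near 3e-5, peak at 3e-4 around step 500, and fall toward zero by the end, the milestone or T_max probably does not match the intended run length.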

Training Loop Placement

The order of calls matters. A typical loop looks like this:

python

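# Assumes model, loss_fn, optimizer, scheduler, and train_loader are already defined.
for inputs, targets in train_loader:
    optimizer.zero_grad()              # clear gradients from the previous step
    outputs = model(inputs)
    loss = loss_fn(outputs, targets)
    loss.backward()
    optimizer.step()                   # update parameters using the current LR
    scheduler.step()                   # then advance the schedule by one step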

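Since PyTorch 1.1.0, scheduler.step() is expected to come after optimizer.step(). Calling it first skips the first value of the schedule, and PyTorch emits a warning. For a step-based schedule like this one, step the scheduler once per batch rather than once per epoch.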

If you log the learning rate during training, you can see the warmup happen explicitly:

python

current_lr = optimizer.param_groups[0]["lr"]
print(current_lr)
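To make the logging concrete, here is a small self-contained loop that records the learning rate at every step. It is only a sketch: the random tensors stand in for a real dataset, and the 20-step warmup and 30-step run are arbitrary values picked so the output stays short. It reads the rate through `scheduler.get_last_lr()`, which returns one value per parameter group.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR

torch.manual_seed(0)

# Synthetic stand-ins for a real dataset (illustrative only).
inputs = torch.randn(64, 128)
targets = torch.randint(0, 10, (64,))

model = nn.Linear(128, 10)
loss_fn = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=20)

lr_history = []
for step in range(30):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # Read the LR before advancing the schedule: this is the value just used.
    lr_history.append(scheduler.get_last_lr()[0])
    scheduler.step()
    if step % 5 == 0:
        print(f"step {step:>2}  loss {loss.item():.4f}  lr {lr_history[-1]:.2e}")
```

Reading the value before `scheduler.step()` logs the rate that the optimizer just used; reading it afterwards logs the rate for the next step. Either works as long as the choice is consistent.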

A Custom Lambda Warmup

If built-in schedulers do not match the schedule you want, LambdaLR is a flexible option.

python

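from torch.optim.lr_scheduler import LambdaLR

warmup_steps = 1000

def lr_lambda(step):
    # Returns a multiplier that LambdaLR applies to the base learning rate.
    if step < warmup_steps:
        return float(step + 1) / float(warmup_steps)
    return 1.0

# Assumes optimizer was created as in the earlier AdamW setup.
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)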

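LambdaLR multiplies the optimizer's base learning rate by whatever the function returns, so lr_lambda yields a factor, not an absolute learning rate. This version performs pure warmup, starting at 1/1000 of the base rate, and then keeps the base learning rate fixed.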
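The same mechanism can express warmup followed by decay in a single function. The sketch below swaps in a linear decay after warmup, which is a variation on the example above. The `total_steps` value of 10,000 is a hypothetical run length chosen for illustration.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

warmup_steps = 1000
total_steps = 10_000   # hypothetical run length, chosen for this sketch


def lr_lambda(step):
    # Linear warmup: the factor grows from 1/warmup_steps up to 1.0.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    # Linear decay: the factor falls from 1.0 to 0.0 at total_steps.
    remaining = total_steps - step
    return max(0.0, remaining / (total_steps - warmup_steps))


model = nn.Linear(128, 10)
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = LambdaLR(optimizer, lr_lambda=lr_lambda)

# LambdaLR applies the factor for step 0 as soon as it is constructed.
print(optimizer.param_groups[0]["lr"])
```

Keeping the whole schedule in one function makes it easy to unit-test the factor on its own before attaching it to an optimizer.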

Choosing Warmup Length

There is no universal number of warmup steps. A short warmup may be enough for small models, while larger transformer-style training jobs often use a longer warmup phase.

The practical way to choose is to watch early training stability:

if loss spikes immediately, warmup may be too short or the base LR too high
if training starts very slowly, warmup may be too long

Warmup is not a substitute for a sensible base learning rate. It only smooths the transition into it.

Common Pitfalls

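The biggest pitfall is forgetting that the scheduler operates on the optimizer's parameter groups: it rewrites the lr entry of every group. If the optimizer has several groups, optimizer.param_groups[0]["lr"] shows only the first one. And if you step the scheduler at the wrong time, for example once per epoch when the schedule was specified in steps, the schedule will not match your intent.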
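The parameter-group behavior is easy to see with two groups that have different base learning rates. In this sketch the `backbone` and `head` modules, their learning rates, and the 100-step warmup are all hypothetical values chosen for illustration.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR

# Hypothetical two-part model: a "backbone" and a "head" with different base LRs.
backbone = nn.Linear(128, 64)
head = nn.Linear(64, 10)

optimizer = AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-4},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.01,
)
scheduler = LinearLR(optimizer, start_factor=0.1, end_factor=1.0, total_iters=100)

# get_last_lr() returns one entry per parameter group, in group order.
start_lrs = scheduler.get_last_lr()

for _ in range(100):
    optimizer.step()   # no gradients were computed, so parameters stay unchanged
    scheduler.step()

end_lrs = [group["lr"] for group in optimizer.param_groups]
print("start:", start_lrs)
print("end:  ", end_lrs)
```

Both groups are scaled by the same warmup factor, so the two rates stay a factor of ten apart throughout. Logging only the first group would hide the head's learning rate entirely.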

Another issue is combining warmup with a very small base learning rate. In that case, warmup may only slow training without improving stability.

Developers also sometimes say "Adam with warmup" while actually using AdamW. That is fine in practice, but it is worth being precise because the two are not identical: Adam's weight_decay adds an L2 term to the gradient before the adaptive scaling, while AdamW applies decoupled weight decay directly to the weights.
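A small experiment can make the difference visible. The sketch below takes a single optimizer step on one weight with a zero gradient, so that weight decay is the only thing acting on it. The helper `one_step` and the deliberately large values lr=0.1 and weight_decay=0.1 are assumptions chosen to exaggerate the effect.

```python
import torch
from torch.optim import Adam, AdamW


def one_step(optimizer_cls):
    # Single weight at 1.0 with a zero gradient, so only weight decay acts on it.
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=0.1, weight_decay=0.1)
    w.grad = torch.zeros_like(w)
    opt.step()
    return w.item()


adam_w = one_step(Adam)     # decay is added to the gradient, then rescaled by Adam
adamw_w = one_step(AdamW)   # decay shrinks the weight directly by lr * weight_decay
print(f"Adam:  {adam_w:.4f}")
print(f"AdamW: {adamw_w:.4f}")
```

Under these settings the Adam weight should land near 0.9, because the decay term is normalized into a full-size step, while the AdamW weight should land near 0.99. The same weight_decay value therefore means different things for the two optimizers.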

Finally, be consistent about whether warmup is defined in steps or epochs. Most modern implementations use steps, which gives smoother control when batch counts are large.
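If a warmup length is given in epochs, it can be converted to optimizer steps before the scheduler is built. The helper below is a minimal sketch: `warmup_steps_from_epochs` and its `grad_accum_steps` parameter are hypothetical names, and it assumes the data loader keeps the last partial batch.

```python
import math


def warmup_steps_from_epochs(warmup_epochs, dataset_size, batch_size, grad_accum_steps=1):
    """Convert a warmup length in epochs to a number of optimizer steps."""
    # Assumes the last partial batch is kept (drop_last=False).
    batches_per_epoch = math.ceil(dataset_size / batch_size)
    # With gradient accumulation, several batches make up one optimizer step.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return math.ceil(warmup_epochs * steps_per_epoch)


print(warmup_steps_from_epochs(2, dataset_size=50_000, batch_size=128))
```

The result can be passed as total_iters to LinearLR or as the warmup_steps constant in a LambdaLR function, provided the scheduler is stepped once per optimizer step.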

Summary

Warmup gradually increases the learning rate during the first part of training.
In PyTorch, Adam or AdamW warmup is usually implemented with a scheduler, not a special optimizer class.
LinearLR, LambdaLR, and SequentialLR are common building blocks.
Step the optimizer first and then the scheduler in a standard step-based loop.
Warmup helps with early stability, but it still depends on choosing a sensible base learning rate.