What is the proper way to apply weight decay with the Adam optimizer?

Introduction

With SGD, people often treat L2 regularization and weight decay as almost interchangeable. With Adam, that shortcut breaks down: the proper modern approach is usually decoupled weight decay, commonly known as AdamW, rather than adding an L2 penalty term directly into the adaptive gradient update.

Why naive L2 is not the same thing in Adam

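Classic L2 regularization adds a penalty of (lambda / 2) * ||w||^2 to the loss, which contributes lambda * w to the gradient. In plain SGD without momentum and with learning rate lr, that is equivalent to shrinking each weight by a factor of (1 - lr * lambda) on every step, which is what "weight decay" originally meant.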
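As a quick sanity check of the SGD case, the sketch below compares one SGD step with the L2 term folded into the gradient against one SGD step preceded by direct shrinkage of the weight. The function names and numeric values are made up for illustration, and momentum is assumed to be off.

```python
import math

def sgd_step_with_l2(w, loss_grad, lr, lam):
    # L2 penalty enters through the gradient: g = loss_grad + lam * w
    return w - lr * (loss_grad + lam * w)

def sgd_step_with_decay(w, loss_grad, lr, lam):
    # Shrink the weight directly, then apply the plain gradient step
    return (1.0 - lr * lam) * w - lr * loss_grad

w, loss_grad, lr, lam = 2.0, 0.5, 0.1, 0.01
a = sgd_step_with_l2(w, loss_grad, lr, lam)
b = sgd_step_with_decay(w, loss_grad, lr, lam)
print(a, b, math.isclose(a, b))
```

The two results agree up to floating-point rounding, which is why the two terms are used interchangeably for plain SGD.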

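Adam is different because it divides each parameter's update by a running estimate of that parameter's gradient magnitude (the square root of the second moment). Once you inject the L2 term into the gradient, that term passes through the same moment estimates and gets rescaled too, so weights with a history of large gradients are regularized less than weights with small gradients.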

That means the effect is no longer the simple shrinkage most people mean by "weight decay."
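To make the distortion concrete, here is a minimal sketch that isolates how much of Adam's very first update comes from the L2 term, for two weights with the same value but different loss gradients. It relies on the fact that on step 1 the bias-corrected moments reduce to the gradient and its square. The helper names and the numbers are illustrative assumptions, not part of any library.

```python
import math

def l2_share_of_adam_step(w, loss_grad, lr=1e-3, lam=1e-2, eps=1e-8):
    """Portion of Adam's first update that comes from the L2 term.

    On the first step, the bias-corrected moments reduce to
    m_hat = g and v_hat = g**2, where g = loss_grad + lam * w.
    """
    g = loss_grad + lam * w
    v_hat = g * g
    return lr * (lam * w) / (math.sqrt(v_hat) + eps)

def decoupled_shrinkage(w, lr=1e-3, lam=1e-2):
    # Direct shrinkage does not depend on the gradient at all
    return lr * lam * w

w = 1.0
big = l2_share_of_adam_step(w, loss_grad=10.0)    # weight with a large loss gradient
small = l2_share_of_adam_step(w, loss_grad=0.1)   # weight with a small loss gradient
print(big, small, decoupled_shrinkage(w))
```

With these values the weight that sees the large gradient receives far less regularization than the one that sees the small gradient, while the decoupled shrinkage is identical for both.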

Decoupled weight decay

The decoupled idea is simple: apply Adam's gradient-based update as usual, then shrink the weights separately.

Conceptually:

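The Adam step handles the loss gradient.
The weight decay step shrinks the weights directly, either by multiplying them by a factor slightly below 1 or by subtracting a small fraction of each weight.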

That separation preserves the intended behavior of weight decay more faithfully.
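The two steps above can be sketched for a single scalar weight as follows. This is a simplified, hand-written illustration rather than any framework's actual implementation: the function name `adamw_step` is hypothetical, the hyperparameter defaults are just common choices, and the ordering (decay first, then the Adam update) is one common convention.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One decoupled-weight-decay Adam update for a single scalar weight.

    t is the 1-based step count; m and v are the running moments.
    """
    # 1) Decay: shrink the weight directly, independent of the gradient
    w = w - lr * weight_decay * w
    # 2) Adam: update the moments using the loss gradient only
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# With a zero loss gradient, only the decay term moves the weight
w, m, v = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(w)
```

Note that the decay term never touches `m` or `v`, so it is not rescaled by Adam's per-parameter normalization.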

PyTorch example

In modern PyTorch, the clean answer is usually AdamW.

```python
import torch

model = torch.nn.Linear(10, 1)
# AdamW applies weight decay directly to the weights (decoupled)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Dummy batch: 8 samples, 10 features
x = torch.randn(8, 10)
y = torch.randn(8, 1)

prediction = model(x)
loss = torch.nn.functional.mse_loss(prediction, y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

This uses decoupled weight decay rather than smuggling the penalty through Adam's gradient moments.
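One thing to watch for: torch.optim.Adam also accepts a weight_decay argument, but its documentation describes it as an L2 penalty, meaning it is added to the gradient before the adaptive scaling. The sketch below is a small experiment, not a benchmark: it forces the loss gradient to zero so that any movement of the weight comes from the weight_decay setting alone. The helper name and the hyperparameter values are illustrative.

```python
import torch

def one_step_with_zero_loss_grad(optimizer_cls):
    # Single weight starting at 1.0; the loss gradient is forced to zero,
    # so any change in the weight comes from the weight_decay setting alone.
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=0.1, weight_decay=0.01)
    loss = (w * 0.0).sum()
    loss.backward()
    opt.step()
    return w.item()

adam_result = one_step_with_zero_loss_grad(torch.optim.Adam)
adamw_result = one_step_with_zero_loss_grad(torch.optim.AdamW)
print(adam_result, adamw_result)
```

Under these settings, Adam should move the weight by roughly the learning rate (to about 0.9), because the L2 term is normalized like any other gradient. AdamW should shrink it only by lr * weight_decay (to about 0.999).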

Keras example

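Keras offers the same idea through optimizers that accept a weight_decay argument directly. Recent releases ship an AdamW optimizer; older TensorFlow versions may need a separate implementation such as the one in TensorFlow Addons.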

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

# weight_decay is passed to the optimizer, not added to the loss
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4)
model.compile(optimizer=optimizer, loss="mse")
```

Again, the key point is explicit weight decay, not just a loss penalty added indirectly.

What about layer-level regularizers

You can still use layer regularizers such as kernel_regularizer in Keras. Those are useful and sometimes intentional, but they work by adding a penalty to the loss, so the penalty's gradient passes through Adam's adaptive scaling like any other gradient. If your goal is specifically "proper Adam-style weight decay," decoupled optimizer support is usually the right tool.

The two approaches are related, but they are not equivalent under Adam.

Which parameters should decay

In deep-learning practice, weight decay is usually applied selectively. Biases and normalization parameters are often excluded, while weight matrices are decayed.

Framework defaults vary, so if this detail matters for your training setup, verify which parameter groups are actually receiving decay.
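In PyTorch, one way to control this is to pass explicit parameter groups to the optimizer. The sketch below uses a common heuristic, treating every parameter with fewer than two dimensions (biases, normalization scales and shifts) as "no decay". That rule and the toy model are assumptions for illustration; some architectures need a name-based or module-based rule instead.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.LayerNorm(32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

decay, no_decay = [], []
for name, p in model.named_parameters():
    if not p.requires_grad:
        continue
    # Heuristic (assumption): 1-D tensors are biases or normalization parameters
    if p.ndim < 2:
        no_decay.append(p)
    else:
        decay.append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)

# Inspect which groups actually receive decay
for group in optimizer.param_groups:
    print(len(group["params"]), group["weight_decay"])
```

Printing the parameter groups like this is a cheap way to confirm the split before starting a long training run.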

Tuning advice

The right decay value depends on model size, batch size, learning rate, and dataset size. There is no universal best coefficient.

What matters is that once you decide to use weight decay with Adam, you apply it through the correct mechanism and then tune it empirically rather than assuming old SGD heuristics transfer unchanged.
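Empirical tuning can be as simple as a small sweep over log-spaced decay values, compared on held-out data. The sketch below uses random synthetic data and a tiny linear model purely as placeholders, so its numbers carry no meaning; the candidate grid, step count, and helper name are assumptions to adapt to your own setup.

```python
import torch

torch.manual_seed(0)
# Placeholder data; substitute real training and validation sets
x_train, y_train = torch.randn(64, 10), torch.randn(64, 1)
x_val, y_val = torch.randn(32, 10), torch.randn(32, 1)

def val_loss_for(weight_decay, steps=50):
    torch.manual_seed(0)  # same initialization for every candidate
    model = torch.nn.Linear(10, 1)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.nn.functional.mse_loss(model(x_val), y_val).item()

candidates = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]
results = {wd: val_loss_for(wd) for wd in candidates}
best = min(results, key=results.get)
print(results, best)
```

Because AdamW scales the shrinkage by the learning rate, it is usually worth re-running such a sweep whenever the learning rate or schedule changes.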

Common Pitfalls

A common mistake is saying "I added L2 to Adam, so I am doing weight decay." In adaptive optimizers, that is not the same as decoupled decay.

Another issue is decaying every parameter indiscriminately, including biases or normalization scales, without checking whether that harms training.

It is also easy to compare results across frameworks without verifying whether one experiment used true AdamW and the other used L2 regularization in the loss.

Summary

For Adam, the proper modern weight-decay approach is usually decoupled decay, as in AdamW.
Naive L2 regularization and weight decay are not equivalent under adaptive updates.
Use optimizer APIs that expose weight decay directly when possible.
Be deliberate about which parameter groups should decay.
Tune the decay coefficient empirically instead of copying SGD-era defaults blindly.