How to accumulate gradients for large batch sizes in Keras
Introduction
Gradient accumulation lets you simulate a larger batch size by summing gradients across several smaller mini-batches before applying an optimizer step. This is useful when the effective batch size you want does not fit into GPU memory in one forward and backward pass.
Why Gradient Accumulation Works
Suppose you want an effective batch size of 256, but memory only allows batches of 32. You can:
- run 8 mini-batches of size 32
- accumulate their gradients
- apply one optimizer update after those 8 steps
That approximates the effect of training on a batch of 256, while keeping peak memory usage close to the smaller batch size.
The key implementation detail is scaling. With a mean-reduced loss, divide each mini-batch loss (or its gradient) by the number of accumulation steps so that the summed gradient matches the gradient of the intended effective batch.
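As a rough numerical check of the scaling argument, the sketch below uses a one-parameter least-squares model (a made-up toy problem, not a Keras API) and compares the gradient of a mean loss over 256 samples with the sum of eight mini-batch gradients, each divided by the number of accumulation steps. The function name `mse_grad` and all data values are illustrative assumptions.

```python
import random


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(256)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

w = 0.5                      # current weight value (arbitrary)
mini_batch_size = 32
accumulation_steps = 8       # 8 * 32 = 256 samples per optimizer update

# Reference: gradient of the mean loss over the full batch of 256.
full_grad = mse_grad(w, xs, ys)

scaled_sum = 0.0             # accumulation with scaling
raw_sum = 0.0                # accumulation without scaling
for k in range(accumulation_steps):
    lo, hi = k * mini_batch_size, (k + 1) * mini_batch_size
    g = mse_grad(w, xs[lo:hi], ys[lo:hi])
    scaled_sum += g / accumulation_steps   # equivalent to dividing the loss
    raw_sum += g

print("full batch :", full_grad)
print("scaled sum :", scaled_sum)
print("raw sum    :", raw_sum)
```

Under these assumptions the scaled sum agrees with the full-batch gradient up to floating-point rounding, while the unscaled sum comes out eight times larger.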
A Custom Keras Model With Accumulation
In modern Keras, the cleanest approach is to override train_step.
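A minimal sketch of such a model is shown below. It assumes TensorFlow 2.x with a `tf.keras` release that provides `Model.compute_loss`, dense gradients for every trainable variable, and a model compiled with `run_eagerly=True`, because the counter and gradient buffer are kept as plain Python state for simplicity. The class name `GradientAccumulationModel`, its private attributes, and the toy regression data are illustrative choices, not part of Keras.

```python
import numpy as np
import tensorflow as tf


class GradientAccumulationModel(tf.keras.Model):
    """Applies one optimizer update per `accumulation_steps` mini-batches."""

    def __init__(self, *args, accumulation_steps=4, **kwargs):
        super().__init__(*args, **kwargs)
        self.accumulation_steps = accumulation_steps
        self._micro_step = 0     # mini-batches seen in the current cycle
        self._grad_sum = None    # running sum of scaled gradients

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compute_loss(x=x, y=y, y_pred=y_pred)
            # Scale so the summed gradients average over the whole cycle.
            scaled_loss = loss / self.accumulation_steps
        grads = tape.gradient(scaled_loss, self.trainable_variables)

        # Accumulate (assumes no gradient is None or sparse).
        if self._grad_sum is None:
            self._grad_sum = grads
        else:
            self._grad_sum = [a + g for a, g in zip(self._grad_sum, grads)]
        self._micro_step += 1

        # Apply and reset once the cycle is complete.
        if self._micro_step == self.accumulation_steps:
            self.optimizer.apply_gradients(
                zip(self._grad_sum, self.trainable_variables)
            )
            self._grad_sum = None
            self._micro_step = 0

        # Report the unscaled loss and any compiled metrics.
        for metric in self.metrics:
            if metric.name == "loss":
                metric.update_state(loss)
            else:
                metric.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}


# Toy regression data: 32 samples, 4 features (illustrative values only).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(32, 4)).astype("float32")
y_train = rng.normal(size=(32, 1)).astype("float32")

inputs = tf.keras.Input(shape=(4,))
outputs = tf.keras.layers.Dense(1)(inputs)
model = GradientAccumulationModel(
    inputs=inputs, outputs=outputs, accumulation_steps=4
)

# run_eagerly=True is needed because train_step keeps Python-side state.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss="mse",
    run_eagerly=True,
)
# Mini-batches of 8 with 4 accumulation steps: 32 samples per update.
model.fit(x_train, y_train, batch_size=8, epochs=1, shuffle=False, verbose=0)
```

Eager execution keeps the sketch short but is slower than graph mode; a graph-compatible variant would need to hold the counter and buffers in `tf.Variable` objects instead.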
This model divides the loss by accumulation_steps, accumulates gradients, and applies them only after the chosen number of mini-batches.
A Minimal Usage Example
For example, with a mini-batch size of 8 and accumulation_steps set to 4, the effective batch size is 32, because four mini-batch gradients are accumulated before each optimizer update.
Things That Still Change
Gradient accumulation is not always identical to true large-batch training. Differences can appear when your model uses:
- batch normalization
- gradient clipping
- adaptive optimizer internals
- data augmentation randomness per mini-batch
So the technique is extremely useful, but it is best understood as a close approximation rather than a magical equivalence in every detail.
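To make the gradient clipping item concrete, the sketch below compares two orderings on hypothetical two-dimensional gradients: clipping the averaged gradient once (what a true large batch would see) versus clipping every mini-batch gradient before averaging. The helper `clip_by_norm` and the gradient values are assumptions chosen for illustration, not taken from Keras.

```python
import math


def clip_by_norm(vec, max_norm):
    """Rescale vec so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm:
        return list(vec)
    return [v * max_norm / norm for v in vec]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


# Hypothetical mini-batch gradients; the first one is an outlier.
mini_batch_grads = [[4.0, 0.0], [0.0, 0.5], [0.0, 0.5], [0.0, 0.5]]
max_norm = 1.0

# Large-batch style: average first, then clip once.
clip_after = clip_by_norm(mean_vector(mini_batch_grads), max_norm)

# Per-mini-batch style: clip each gradient, then average.
clip_before = mean_vector([clip_by_norm(g, max_norm) for g in mini_batch_grads])

print("clip after averaging :", clip_after)
print("clip before averaging:", clip_before)
```

In this toy case the outlier dominates the result when clipping happens after averaging, and is damped when each mini-batch is clipped first, so the two orderings produce different updates.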
When to Use It
Use gradient accumulation when:
- GPU memory is the limiting factor
- you want a larger effective batch for stability or throughput reasons
- changing model size is less desirable than changing update frequency
It is especially common in large language models, high-resolution vision training, and any workload where activation memory dominates.
Common Pitfalls
The biggest pitfall is forgetting to scale the loss or gradients properly. If you accumulate raw gradients without accounting for the number of mini-batches, the effective update magnitude can be too large.
Another common mistake is thinking gradient accumulation fixes all large-batch concerns. It solves memory pressure, but it does not erase every optimization or normalization difference between small and truly large batches.
Developers also sometimes forget to zero the accumulation buffers after applying gradients, which causes updates to include stale gradients from previous cycles.
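The stale-buffer problem can be reproduced with a few lines of plain Python. The function `run_cycles` and the scalar gradient values below are hypothetical; they stand in for the accumulation buffer of a real training loop.

```python
def run_cycles(grads_per_cycle, reset_after_apply):
    """Return the gradient handed to the optimizer at the end of each cycle."""
    buffer = 0.0
    applied = []
    for cycle in grads_per_cycle:
        for g in cycle:
            buffer += g / len(cycle)   # scaled accumulation
        applied.append(buffer)         # stands in for the optimizer step
        if reset_after_apply:
            buffer = 0.0               # clear the buffer for the next cycle
    return applied


# Two cycles of four mini-batches each (hypothetical scalar gradients).
cycles = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]

correct = run_cycles(cycles, reset_after_apply=True)
stale = run_cycles(cycles, reset_after_apply=False)

print(correct)  # [1.0, 2.0]
print(stale)    # [1.0, 3.0]
```

Without the reset, the second update still contains the first cycle's gradient, so it no longer reflects only the most recent mini-batches.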
Summary
- Gradient accumulation simulates a larger effective batch size using multiple smaller mini-batches.
- In Keras, overriding train_step is a clean way to implement it.
- Divide the loss consistently so the accumulated update scale matches the intended batch behavior.
- The method reduces memory pressure but is not always perfectly identical to true large-batch training.
- Reset accumulated gradients after each optimizer step to avoid corrupt updates.

