How to accumulate gradients in TensorFlow?
Introduction
Gradient accumulation lets you simulate a larger effective batch size without loading that full batch into memory at once. Instead of applying gradients after every mini-batch, you sum gradients across several smaller batches and update the model only after the configured number of accumulation steps.
This is useful when the batch size you want for training stability does not fit in GPU memory. The key detail is that you are not skipping optimization. You are delaying it until enough mini-batches have contributed to the gradient.
When gradient accumulation helps
Suppose you want an effective batch size of 128, but your GPU can only fit 32 examples at a time. You can process four mini-batches of size 32, accumulate gradients from each one, and apply the combined update once.
In effect:
- mini-batch size: 32
- accumulation steps: 4
- effective batch size: 128
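This does not make training identical to a true batch of 128 in every setup, because batch-dependent layers such as batch normalization still compute their statistics per mini-batch. When the loss is scaled correctly and the mini-batches are the same size, though, the accumulated gradient closely matches the large-batch gradient.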
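The arithmetic above can be sketched in a few lines of plain Python. The helper name accumulation_plan and the dataset size of 1280 examples are assumptions made for illustration, and the update count assumes any leftover mini-batches at the end of the epoch are still applied.

```python
import math


def accumulation_plan(num_examples, mini_batch_size, accumulation_steps):
    """Summarize batch sizes and update counts for one epoch (illustrative helper)."""
    mini_batches = math.ceil(num_examples / mini_batch_size)
    return {
        "effective_batch_size": mini_batch_size * accumulation_steps,
        "mini_batches_per_epoch": mini_batches,
        # Assumes a final partial accumulation window still triggers an update.
        "optimizer_updates_per_epoch": math.ceil(mini_batches / accumulation_steps),
    }


plan = accumulation_plan(num_examples=1280, mini_batch_size=32, accumulation_steps=4)
print(plan)
```

With these numbers the model sees 40 mini-batches per epoch but the optimizer steps only 10 times.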
Accumulate gradients with GradientTape
In TensorFlow 2, the most direct approach is to use a custom training loop with tf.GradientTape.
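A minimal sketch of the setup such a loop needs. The toy Keras model, the SGD optimizer, the loss, and the value 4 for accumulation_steps are assumptions chosen for illustration; only the names gradient_buffer and accumulation_steps come from the text.

```python
import tensorflow as tf

# Hypothetical toy model, optimizer, and loss; substitute your own.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.MeanSquaredError()

# Number of mini-batches to accumulate before each optimizer update.
accumulation_steps = 4

# One non-trainable, zero-initialized variable per trainable weight.
gradient_buffer = [
    tf.Variable(tf.zeros(v.shape, dtype=v.dtype), trainable=False)
    for v in model.trainable_variables
]
```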
The gradient_buffer, typically one zero-initialized variable per trainable weight, stores the running sum of gradients between optimizer updates.
Scale the loss and apply gradients periodically
Inside the training loop, compute gradients for each mini-batch and add them into the buffer. A common pattern is to divide the loss by accumulation_steps so the final accumulated gradient has the same scale as one large averaged batch.
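One way this step could look, as a sketch rather than a definitive implementation. It runs eagerly and assumes a toy model, an SGD optimizer, and a 1-based step counter passed in by the caller; the signature of train_step and its boolean return value are illustrative choices, not something the text prescribes.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Hypothetical toy setup; substitute your own model, optimizer, and loss.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.MeanSquaredError()
accumulation_steps = 4
gradient_buffer = [
    tf.Variable(tf.zeros(v.shape, dtype=v.dtype), trainable=False)
    for v in model.trainable_variables
]


def train_step(x, y, step):
    """Accumulate one mini-batch; return True if an optimizer update was applied.

    `step` is the 1-based index of the mini-batch (an assumed convention).
    """
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        # Scale the loss so the summed gradients match one large averaged batch.
        loss = loss_fn(y, predictions) / accumulation_steps
    grads = tape.gradient(loss, model.trainable_variables)

    # Add this mini-batch's gradients to the running sum.
    for buffer, grad in zip(gradient_buffer, grads):
        buffer.assign_add(grad)

    if step % accumulation_steps != 0:
        return False

    # Apply the accumulated gradients, then reset the buffer to zero.
    accumulated = [tf.convert_to_tensor(b) for b in gradient_buffer]
    optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
    for buffer in gradient_buffer:
        buffer.assign(tf.zeros_like(buffer))
    return True
```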
Then call train_step as you iterate through the dataset:
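A sketch of that loop over a synthetic tf.data dataset of 8 mini-batches. The data, the toy model, and the train_step signature are assumptions for illustration, and train_step is defined in compact form inside the block so that it runs on its own.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Hypothetical toy setup; substitute your own model, optimizer, and loss.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.MeanSquaredError()
accumulation_steps = 4
gradient_buffer = [
    tf.Variable(tf.zeros(v.shape, dtype=v.dtype), trainable=False)
    for v in model.trainable_variables
]


def train_step(x, y, step):
    """Accumulate scaled gradients; update and reset every accumulation_steps calls."""
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True)) / accumulation_steps
    grads = tape.gradient(loss, model.trainable_variables)
    for buffer, grad in zip(gradient_buffer, grads):
        buffer.assign_add(grad)
    if step % accumulation_steps != 0:
        return False
    accumulated = [tf.convert_to_tensor(b) for b in gradient_buffer]
    optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
    for buffer in gradient_buffer:
        buffer.assign(tf.zeros_like(buffer))
    return True


# Synthetic data: 256 examples split into 8 mini-batches of 32.
features = tf.random.normal((256, 4))
targets = tf.random.normal((256, 1))
dataset = tf.data.Dataset.from_tensor_slices((features, targets)).batch(32)

update_steps = []  # mini-batch indices at which the optimizer stepped
for step, (x, y) in enumerate(dataset, start=1):
    if train_step(x, y, step):
        update_steps.append(step)

print(update_steps)  # [4, 8]
```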
With accumulation_steps set to 4, this applies the optimizer update only on every fourth mini-batch.
Handle the last partial accumulation window
One easy mistake is forgetting the final incomplete group of mini-batches. If the epoch length is not divisible by accumulation_steps, you still need to apply whatever gradients remain in the buffer at the end.
In a Python-driven loop, you can track that explicitly:
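A hedged sketch of that bookkeeping, assuming an eager loop over 10 mini-batches so that 2 are left over after two full windows. The helper names accumulate and apply_and_reset, the pending counter, and the toy model and data are all illustrative assumptions.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Hypothetical toy setup; substitute your own model, optimizer, and loss.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()
accumulation_steps = 4
gradient_buffer = [
    tf.Variable(tf.zeros(v.shape, dtype=v.dtype), trainable=False)
    for v in model.trainable_variables
]


def accumulate(x, y):
    """Add the scaled gradients of one mini-batch to the buffer."""
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True)) / accumulation_steps
    grads = tape.gradient(loss, model.trainable_variables)
    for buffer, grad in zip(gradient_buffer, grads):
        buffer.assign_add(grad)


def apply_and_reset():
    """Apply whatever is in the buffer, then zero it."""
    accumulated = [tf.convert_to_tensor(b) for b in gradient_buffer]
    optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
    for buffer in gradient_buffer:
        buffer.assign(tf.zeros_like(buffer))


# 80 examples in mini-batches of 8 gives 10 mini-batches, not divisible by 4.
features = tf.random.normal((80, 4))
targets = tf.random.normal((80, 1))
dataset = tf.data.Dataset.from_tensor_slices((features, targets)).batch(8)

updates = 0
pending = 0  # mini-batches accumulated since the last optimizer update
for x, y in dataset:
    accumulate(x, y)
    pending += 1
    if pending == accumulation_steps:
        apply_and_reset()
        updates += 1
        pending = 0

# Final flush: apply the gradients left over from the incomplete window.
if pending > 0:
    apply_and_reset()
    updates += 1
    pending = 0
```

Because every loss was divided by accumulation_steps, the flushed update here carries half the weight of a full window (2 of 4 mini-batches). If that matters for your training run, one option is to rescale the buffer by accumulation_steps / pending before applying it.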
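Without this final flush, the gradients from the last few mini-batches of each epoch are either discarded or carried over into the first update of the next epoch, depending on whether the buffer is reset between epochs.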
Think about learning rate and batch-dependent layers
Gradient accumulation changes the effective batch size, so optimizer behavior may shift. In some training setups, you may need to revisit the learning rate because the optimization step now reflects more examples per update.
Also remember that layers such as batch normalization still see the physical mini-batch size, not the accumulated effective batch size. Gradient accumulation changes the gradient the optimizer receives, but it does not change how batch-dependent layers compute their statistics.
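The batch normalization point can be illustrated without any framework. The sketch below uses made-up activation values and only computes the mean and variance that a batch-dependent layer would normalize with; it is not a batch normalization implementation.

```python
from statistics import fmean, pvariance

# Made-up activations for one feature across an "effective batch" of 8 examples.
full_batch = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]

# With accumulation, the layer only ever sees the physical mini-batches.
mini_batches = [full_batch[:4], full_batch[4:]]

full_stats = (fmean(full_batch), pvariance(full_batch))
mini_stats = [(fmean(mb), pvariance(mb)) for mb in mini_batches]

print("full batch (mean, variance):", full_stats)
print("per mini-batch (mean, variance):", mini_stats)
```

Here the full batch has mean 7.0 and variance 21.5, while each mini-batch has variance 1.25 around its own mean (2.5 and 11.5), so the normalized activations differ even though the accumulated gradient covers all 8 examples.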
Common Pitfalls
The most common mistake is accumulating gradients without resetting the buffer after apply_gradients. That causes gradients to keep growing across unrelated updates.
Another issue is forgetting to scale the loss or otherwise account for how many mini-batches contributed to the final update. If you do not normalize correctly, the effective update magnitude changes.
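The effect of the scaling can be checked by hand on a one-parameter model. This sketch assumes a mean squared error loss for a prediction w * x, equal-sized mini-batches, and made-up data; dividing each mini-batch gradient by accumulation_steps is equivalent to dividing the loss, since differentiation is linear.

```python
def mean_gradient(w, batch):
    """Gradient with respect to w of the mean squared error for y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.5
# Made-up (x, y) pairs standing in for one effective batch of 8 examples.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0),
        (5.0, 7.0), (6.0, 8.0), (7.0, 9.0), (8.0, 9.0)]

accumulation_steps = 4
size = len(data) // accumulation_steps
mini_batches = [data[i:i + size] for i in range(0, len(data), size)]

# Reference: gradient of the loss averaged over the whole effective batch.
full_batch = mean_gradient(w, data)

# Accumulating without scaling sums four averaged gradients.
unscaled = sum(mean_gradient(w, mb) for mb in mini_batches)

# Accumulating with scaling reproduces the full-batch average.
scaled = sum(mean_gradient(w, mb) / accumulation_steps for mb in mini_batches)

print(full_batch, scaled, unscaled)
```

The scaled sum matches the full-batch gradient, while the unscaled sum is accumulation_steps times larger, which acts like an unintended increase in the learning rate.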
Developers also often forget the final partial accumulation window at the end of an epoch, which silently drops training signal.
Finally, gradient accumulation is not the same as batch normalization over a larger batch. It helps optimizer updates, but it does not change every batch-sensitive component in the model.
Summary
- Gradient accumulation simulates a larger effective batch size by delaying optimizer updates.
- In TensorFlow, a custom GradientTape loop is the most direct way to implement it.
- Accumulate gradients into a buffer, apply them every accumulation_steps, then reset the buffer.
- Do not forget to flush the final partial buffer at the end of the epoch.
- Revisit learning rate and batch-sensitive layers when changing effective batch size.

