How to perform gradient accumulation WITH distributed training in TF 2.0 / 1.14.0-eager and custom training loop gradient tape?
Introduction
Gradient accumulation is a common way to simulate a larger batch size when device memory is limited. With TensorFlow custom training loops and distributed strategies, the important part is not just adding gradients together, but doing it in a way that keeps scaling and synchronization correct across replicas.
The core idea
Suppose your model can only fit a micro-batch of 8 examples per GPU, but you want the effect of a batch size of 64. On a single GPU, one option is to run 8 forward and backward passes, accumulate the gradients, and apply the optimizer once.
In distributed training, each replica computes local gradients first. Those gradients are then reduced across replicas, accumulated for several micro-steps, and finally applied. Because every micro-step already covers 8 examples on each replica, the number of accumulation steps needed for a given effective batch size shrinks as replicas are added.
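The batch arithmetic can be written down as a small helper. This is only an illustration: the function name accumulation_steps and its parameters are hypothetical and not part of TensorFlow.

```python
def accumulation_steps(target_batch, per_replica_batch, num_replicas=1):
    """Micro-steps per optimizer update needed to reach target_batch.

    Hypothetical helper for illustration; not a TensorFlow API.
    """
    # Examples processed by all replicas together in one micro-step.
    per_micro_step = per_replica_batch * num_replicas
    if target_batch % per_micro_step != 0:
        raise ValueError("target_batch must be a multiple of per_replica_batch * num_replicas")
    return target_batch // per_micro_step


print(accumulation_steps(64, 8))                  # one GPU, micro-batch of 8
print(accumulation_steps(64, 8, num_replicas=2))  # two replicas, same micro-batch
```

With one GPU this gives the 8 passes from the example; with two replicas, 4 micro-steps reach the same effective batch of 64.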
A working pattern with MirroredStrategy
The clean pattern is:
- create accumulator variables with the same shapes as the trainable variables
- run one distributed train step per micro-batch
- add each reduced gradient into the accumulators
- divide the loss or gradients so the final update matches the intended effective batch size
- apply the accumulated gradients and reset the accumulators after every accum_steps micro-batches
The training loop then calls a distributed micro-step function (distributed_micro_step) for each micro-batch, adds the returned gradients into the accumulators, and runs an apply function (apply_accumulated()) once every accum_steps micro-batches.
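A minimal sketch of this pattern follows, assuming TensorFlow 2.2 or later (where strategy.run is available). The toy model, the synthetic data, the constants, and the helpers micro_step and apply_step are assumptions made for illustration. One detail is a design choice rather than a requirement: Keras optimizers sum gradients across replicas when called inside strategy.run, so the sketch hands each replica 1/NUM_REPLICAS of the already-reduced accumulator. Some TensorFlow versions expose an optimizer flag to skip that aggregation instead.

```python
import tensorflow as tf

ACCUM_STEPS = 4          # micro-batches per optimizer update (illustrative)
PER_REPLICA_BATCH = 8    # examples per replica per micro-step (illustrative)

strategy = tf.distribute.MirroredStrategy()  # one replica if only one device exists
NUM_REPLICAS = strategy.num_replicas_in_sync
GLOBAL_MICRO_BATCH = PER_REPLICA_BATCH * NUM_REPLICAS

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
    # One non-trainable accumulator per trainable variable, same shape and dtype.
    accumulators = [
        tf.Variable(tf.zeros(v.shape, dtype=v.dtype), trainable=False)
        for v in model.trainable_variables
    ]


def micro_step(x, y):
    """Runs on each replica: gradients of the scaled loss for one micro-batch."""
    with tf.GradientTape() as tape:
        per_example = tf.reduce_sum(tf.square(model(x, training=True) - y), axis=-1)
        # Average over the global micro-batch, then scale by 1 / ACCUM_STEPS.
        loss = tf.nn.compute_average_loss(
            per_example, global_batch_size=GLOBAL_MICRO_BATCH) / ACCUM_STEPS
    return tape.gradient(loss, model.trainable_variables)


@tf.function
def distributed_micro_step(x, y):
    """Runs micro_step on every replica and sums the gradients across replicas."""
    per_replica = strategy.run(micro_step, args=(x, y))
    return [strategy.reduce(tf.distribute.ReduceOp.SUM, g, axis=None) for g in per_replica]


def apply_step():
    # The optimizer sums across replicas here, and the accumulators already hold
    # that sum, so each replica contributes 1 / NUM_REPLICAS of it.
    grads = [tf.convert_to_tensor(acc) / NUM_REPLICAS for acc in accumulators]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))


@tf.function
def apply_accumulated():
    """Applies the accumulated gradients once, then zeroes the accumulators."""
    strategy.run(apply_step)
    for acc in accumulators:
        acc.assign(tf.zeros(acc.shape, dtype=acc.dtype))


# Synthetic data: enough micro-batches for two optimizer updates.
num_examples = GLOBAL_MICRO_BATCH * ACCUM_STEPS * 2
features = tf.random.normal([num_examples, 4])
labels = tf.random.normal([num_examples, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(
    GLOBAL_MICRO_BATCH, drop_remainder=True)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

updates = 0
with strategy.scope():
    for step, (x, y) in enumerate(dist_dataset, start=1):
        reduced = distributed_micro_step(x, y)
        for acc, g in zip(accumulators, reduced):
            acc.assign_add(g)  # cross-replica context: updates every mirrored copy
        if step % ACCUM_STEPS == 0:
            apply_accumulated()
            updates += 1
print("optimizer updates:", updates)
```

The sketch assumes every trainable variable receives a gradient; a model with conditionally used parts would need a None check before the reduction and the accumulation.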
Why dividing by accum_steps matters
If you accumulate gradients without scaling, you are effectively multiplying the learning rate by the number of accumulation steps. Sometimes that is intentional, but usually it is not.
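The most common fix is to divide the micro-batch loss by accum_steps before taking gradients. With equal-sized micro-batches, the sum of those scaled gradients equals the gradient of the mean loss over the full effective batch, so the accumulated update is comparable to one large-batch update.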
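The effect of the scaling can be checked numerically on a single device. The sketch below uses a made-up linear model and random data (all names are illustrative) and compares the full-batch gradient against accumulated micro-batch gradients with and without the division.

```python
import numpy as np
import tensorflow as tf

accum_steps = 4
x = tf.random.normal([32, 3], seed=1)
y = tf.random.normal([32, 1], seed=2)
w = tf.Variable([[0.5], [-1.0], [2.0]])


def mean_loss(xb, yb):
    """Mean squared error of a linear model on one batch."""
    return tf.reduce_mean(tf.square(tf.matmul(xb, w) - yb))


# Reference: gradient of the mean loss over all 32 examples at once.
with tf.GradientTape() as tape:
    full_loss = mean_loss(x, y)
full_grad = tape.gradient(full_loss, w)

scaled_sum = tf.zeros_like(w)    # accumulates gradients of loss / accum_steps
unscaled_sum = tf.zeros_like(w)  # accumulates gradients of the raw loss
for xb, yb in zip(tf.split(x, accum_steps), tf.split(y, accum_steps)):
    with tf.GradientTape(persistent=True) as tape:
        raw = mean_loss(xb, yb)
        scaled = raw / accum_steps
    unscaled_sum += tape.gradient(raw, w)
    scaled_sum += tape.gradient(scaled, w)
    del tape  # release the persistent tape

print("scaled matches full batch:",
      np.allclose(scaled_sum.numpy(), full_grad.numpy(), rtol=1e-4, atol=1e-5))
print("unscaled / full batch:", (unscaled_sum / full_grad).numpy().ravel())
```

The unscaled accumulation comes out accum_steps times larger than the full-batch gradient, which has the same effect on a plain SGD step as multiplying the learning rate by accum_steps.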
Handling None gradients safely
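Some variables may not receive gradients on a given step, in which case the tape returns None for them. In accumulation code, guard against that explicitly by skipping None entries before adding into the accumulators.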
That keeps the update loop robust when parts of the model are conditionally used.
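A minimal sketch of such a guard is shown below. The helper name accumulate and the two toy variables are hypothetical; the second variable is deliberately left out of the loss so that its gradient is None.

```python
import tensorflow as tf


def accumulate(accumulators, grads):
    """Adds each gradient into its accumulator, skipping missing (None) gradients."""
    for acc, g in zip(accumulators, grads):
        if g is None:
            continue  # this variable did not contribute to the loss on this step
        acc.assign_add(g)


used = tf.Variable([1.0, 2.0])
unused = tf.Variable([3.0])  # never touched by the loss below
variables = [used, unused]
accumulators = [tf.Variable(tf.zeros_like(v), trainable=False) for v in variables]

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(used * used)
grads = tape.gradient(loss, variables)  # the entry for `unused` is None

accumulate(accumulators, grads)
print([a.numpy() for a in accumulators])
```

An alternative is to ask the tape for zeros instead of None by passing unconnected_gradients=tf.UnconnectedGradients.ZERO to tape.gradient, at the cost of materializing zero tensors for unused variables.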
What changes in TF 1.14 eager mode
The same principle applies in TensorFlow 1.14 eager execution. The mechanics are a bit more manual, but the algorithm is unchanged:
- compute gradients with GradientTape
- reduce them across replicas if using distribution
- add into accumulator variables
- apply every accum_steps
- zero the accumulators afterward
The main difference is API maturity, not the mathematics.
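The steps listed for eager mode can be sketched on a single device as follows. This is an assumption-laden illustration rather than a verified TF 1.14 program: it enables eager execution only when needed, so the same script should also run on TF 2.x, and it uses the v1 gradient descent optimizer through tf.compat.v1. The data and variable names are made up.

```python
import numpy as np
import tensorflow as tf

# TF 1.x starts in graph mode; TF 2.x is already eager.
if not tf.executing_eagerly():
    tf.compat.v1.enable_eager_execution()

accum_steps = 4
w = tf.Variable([[0.0], [0.0]])
accum = tf.Variable(tf.zeros_like(w), trainable=False)  # gradient accumulator
optimizer = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.1)

# Four micro-batches of one example each (toy data).
x = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = tf.constant([[1.0], [2.0], [3.0], [4.0]])

for step in range(1, 5):
    xb, yb = x[step - 1:step], y[step - 1:step]
    with tf.GradientTape() as tape:
        # Scale the micro-batch loss so the accumulated gradient is an average.
        loss = tf.reduce_mean(tf.square(tf.matmul(xb, w) - yb)) / accum_steps
    accum.assign_add(tape.gradient(loss, w))
    if step % accum_steps == 0:
        # Apply once per accum_steps micro-batches, then zero the accumulator.
        optimizer.apply_gradients([(tf.convert_to_tensor(accum), w)])
        accum.assign(tf.zeros_like(accum))

print(w.numpy())
```

For the distributed case, releases of that period named the replica-launch method strategy.experimental_run_v2 rather than strategy.run; the reduce, accumulate, apply, and reset steps stay the same.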
Common Pitfalls
The biggest mistake is accumulating per-replica gradients without reducing them correctly across devices first. That gives inconsistent scaling and wrong updates.
Another common error is forgetting to divide the loss or gradients by accum_steps. The code runs, but the optimizer sees a much larger update than intended.
It is also easy to forget to reset accumulator variables after apply_gradients, which causes gradients from previous effective batches to leak into later ones.
Finally, do not mix gradient accumulation with batch-statistics assumptions blindly. Layers that depend strongly on micro-batch statistics may still behave differently from true large-batch training.
Summary
- Gradient accumulation simulates a larger effective batch by applying updates less often.
- In distributed TensorFlow, reduce gradients across replicas before accumulating them.
- Divide the loss or gradients by accum_steps to preserve expected update scale.
- Store accumulated values in non-trainable variables and reset them after each optimizer step.
- The same accumulation logic works in TF 2 custom loops and TF 1.14 eager mode, even though the APIs differ.

