How to perform gradient accumulation WITH distributed training in TF 2.0 / 1.14.0-eager and custom training loop gradient tape?

Introduction

Gradient accumulation is a common way to simulate a larger batch size when device memory is limited. With TensorFlow custom training loops and distributed strategies, the important part is not just adding gradients together, but doing it in a way that keeps scaling and synchronization correct across replicas.

The core idea

Suppose each GPU can only fit a micro-batch of 8 examples, but you want the effect of 64 examples per GPU. One option is to run 8 forward and backward passes, accumulate the gradients, and apply the optimizer once. With N replicas in sync, the effective global batch is then 64 × N examples per optimizer step.
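The equivalence can be checked with plain arithmetic. The sketch below is not TensorFlow code: it assumes a made-up one-parameter model (prediction = w * x) with mean squared error, and the names mse_grad, xs, and ys are illustrative only. It accumulates 8 micro-batch gradients, each divided by the number of steps, and compares the result with the gradient over all 64 examples.

```python
# Toy model (an assumption for illustration): prediction = w * x, MSE loss.
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


w = 0.5
xs = [float(i) for i in range(64)]    # 64 made-up examples
ys = [3.0 * x + 1.0 for x in xs]

micro_batch = 8
accum_steps = len(xs) // micro_batch  # 8 micro-steps

accumulated = 0.0
for step in range(accum_steps):
    lo, hi = step * micro_batch, (step + 1) * micro_batch
    # Each micro-batch gradient is scaled by 1 / accum_steps before adding.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

full_batch = mse_grad(w, xs, ys)      # gradient of one 64-example batch
print(accumulated, full_batch)
```

The two numbers agree up to floating-point rounding, which is the sense in which accumulation "simulates" the larger batch.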

In distributed training, each replica computes local gradients first. Those gradients are then reduced across replicas, accumulated for several micro-steps, and finally applied.
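The order of operations in that paragraph can be simulated without any devices. The following sketch assumes two pretend replicas, a toy one-parameter model, and a plain Python sum standing in for the cross-replica SUM reduction; all names and sizes are hypothetical. Each replica divides by the global micro-batch size, which mirrors what averaging over the global batch does.

```python
def sum_grad(w, xs, ys):
    """Sum (not mean) of per-example gradients of (w*x - y)^2 w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


w = 0.5
num_replicas = 2
per_replica_batch = 4
accum_steps = 4
global_micro_batch = num_replicas * per_replica_batch  # 8 examples per micro-step
effective_batch = global_micro_batch * accum_steps     # 32 examples per update

xs = [float(i % 7) for i in range(effective_batch)]    # made-up data
ys = [2.0 * x - 1.0 for x in xs]

accumulator = 0.0
for step in range(accum_steps):
    start = step * global_micro_batch
    local_grads = []
    for r in range(num_replicas):
        lo = start + r * per_replica_batch
        hi = lo + per_replica_batch
        # Divide by the GLOBAL micro-batch size, not the replica's local share.
        local = sum_grad(w, xs[lo:hi], ys[lo:hi]) / global_micro_batch
        local_grads.append(local / accum_steps)
    reduced = sum(local_grads)   # stand-in for a cross-replica SUM reduction
    accumulator += reduced       # accumulate the reduced gradient

# Gradient of the mean loss over all 32 examples at once.
reference = sum_grad(w, xs, ys) / effective_batch
print(accumulator, reference)
```

Dividing by the local batch size instead would make the summed result num_replicas times too large.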

A working pattern with MirroredStrategy

The clean pattern is:

  1. create accumulator variables with the same shapes as the trainable variables
  2. run one distributed train step per micro-batch
  3. add each reduced gradient into the accumulators
  4. divide the loss or gradients so the final update matches the intended effective batch size
  5. apply and reset after accum_steps

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
accum_steps = 4

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    # Build the model so its variables exist before the accumulators are
    # created. The feature size of 16 is a placeholder; use your own.
    model.build(input_shape=(None, 16))
    optimizer = tf.keras.optimizers.Adam(1e-3)
    # Keep per-example losses; the averaging happens in compute_loss.
    loss_fn = tf.keras.losses.MeanSquaredError(reduction=tf.keras.losses.Reduction.NONE)

    # One non-trainable accumulator per trainable variable.
    accumulators = [
        tf.Variable(tf.zeros_like(v), trainable=False)
        for v in model.trainable_variables
    ]


def compute_loss(labels, predictions):
    per_example = loss_fn(labels, predictions)
    # Averages over the global batch (all replicas), not the local one.
    return tf.nn.compute_average_loss(per_example)


@tf.function
def distributed_micro_step(dist_inputs):
    def replica_step(inputs):
        features, labels = inputs
        with tf.GradientTape() as tape:
            predictions = model(features, training=True)
            # Scale so accum_steps micro-steps sum to one full-batch gradient.
            loss = compute_loss(labels, predictions) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        return grads, loss

    # strategy.run is named experimental_run_v2 in TF 2.0 and 2.1.
    per_replica_grads, per_replica_loss = strategy.run(replica_step, args=(dist_inputs,))

    # Sum each variable's gradient across replicas.
    reduced_grads = [
        strategy.reduce(tf.distribute.ReduceOp.SUM, grad, axis=None)
        for grad in per_replica_grads
    ]

    loss = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)
    return reduced_grads, loss


@tf.function
def apply_accumulated():
    # Caveat: some TF 2.x Keras optimizers raise a RuntimeError when
    # apply_gradients is called in cross-replica context, as it is here.
    # In that case the call has to run inside strategy.run, with the
    # optimizer's own cross-replica aggregation disabled so the
    # already-summed gradients are not reduced a second time.
    optimizer.apply_gradients(zip(accumulators, model.trainable_variables))
    for acc in accumulators:
        acc.assign(tf.zeros_like(acc))
```

The training loop then calls distributed_micro_step, adds gradients into the accumulators, and runs apply_accumulated() every accum_steps micro-batches.
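The scheduling part of that loop can be sketched on its own. The helper below is hypothetical and framework-agnostic: train_with_accumulation and its parameter names are not part of TensorFlow, and the three callables are placeholders for a micro-step function, the code that adds gradients into the accumulators, and the apply-and-reset function.

```python
def train_with_accumulation(batches, micro_step, accumulate,
                            apply_and_reset, accum_steps):
    """Run one micro-step per batch and apply once every accum_steps.

    Returns (number_of_optimizer_updates, leftover_micro_steps).
    """
    pending = 0   # micro-steps accumulated since the last update
    updates = 0
    for batch in batches:
        grads, _loss = micro_step(batch)  # forward/backward on one micro-batch
        accumulate(grads)                 # add gradients into the accumulators
        pending += 1
        if pending == accum_steps:
            apply_and_reset()             # optimizer step, then zero accumulators
            pending = 0
            updates += 1
    # Leftover micro-steps are reported rather than applied, so the caller
    # decides whether to drop them or apply a smaller final update.
    return updates, pending
```

With the functions from this article, micro_step would correspond to distributed_micro_step and apply_and_reset to apply_accumulated. Returning the leftover count makes the end-of-epoch case explicit: a partial accumulation has a smaller effective batch, so applying it unchanged would give a smaller update than the others.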

Why dividing by accum_steps matters

If you accumulate gradients without scaling, the summed gradient is accum_steps times larger than a single large-batch gradient. With plain SGD that is the same as multiplying the learning rate by accum_steps. Sometimes that is intentional, but usually it is not.

The most common fix is to divide the micro-batch loss by accum_steps before taking gradients, as shown above. That keeps the final accumulated gradient comparable to one large batch update.
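The scaling argument can be written out as a short derivation. Here $K$ stands for accum_steps, $L_k$ for the mean loss on micro-batch $k$, and $g_k$ for its gradient. These symbols are introduced only for illustration, and equal-sized micro-batches are assumed.

```latex
\begin{align*}
\text{unscaled:}\quad
  & \sum_{k=1}^{K} \nabla_\theta L_k = \sum_{k=1}^{K} g_k = K\,\bar{g},
  \qquad \bar{g} = \frac{1}{K}\sum_{k=1}^{K} g_k \\
\text{scaled:}\quad
  & \sum_{k=1}^{K} \nabla_\theta \frac{L_k}{K}
    = \frac{1}{K}\sum_{k=1}^{K} g_k = \bar{g}
\end{align*}
```

Because the micro-batches have equal size, $\bar{g}$ is the gradient of the mean loss over all $K$ micro-batches combined, which is what a single large-batch step would compute.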

Handling None gradients safely

Some variables may not receive gradients on a given step. In accumulation code, guard against that explicitly:

```python
for acc, grad in zip(accumulators, reduced_grads):
    if grad is not None:
        acc.assign_add(grad)
```

That keeps the update loop robust when parts of the model are conditionally used.

What changes in TF 1.14 eager mode

The same principle applies in TensorFlow 1.14 eager execution. The mechanics are a bit more manual, but the algorithm is unchanged:

  • compute gradients with GradientTape
  • reduce them across replicas if using distribution
  • add into accumulator variables
  • apply every accum_steps
  • zero the accumulators afterward

The main difference is API maturity, not the mathematics.
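Since the algorithm is the same in both versions, the accumulate, apply, and reset bookkeeping can be sketched independently of any TensorFlow API. The class below is a hypothetical illustration that uses plain floats in place of tensors; GradientAccumulator, add, and take are made-up names, not library calls.

```python
class GradientAccumulator:
    """Hypothetical helper: sums gradients and signals when to apply them."""

    def __init__(self, num_variables, accum_steps):
        self.accum_steps = accum_steps
        self.sums = [0.0] * num_variables  # one accumulator per variable
        self.count = 0                     # micro-steps since the last reset

    def add(self, grads):
        """Add one micro-step of already-reduced gradients, skipping None."""
        for i, g in enumerate(grads):
            if g is not None:
                self.sums[i] += g
        self.count += 1
        # True once accum_steps micro-steps have been collected.
        return self.count == self.accum_steps

    def take(self):
        """Return the accumulated gradients and zero the accumulators."""
        out = list(self.sums)
        self.sums = [0.0] * len(self.sums)
        self.count = 0
        return out


acc = GradientAccumulator(num_variables=2, accum_steps=2)
acc.add([1.0, None])           # the second variable got no gradient this step
ready = acc.add([2.0, 0.5])    # second micro-step completes the cycle
if ready:
    grads_to_apply = acc.take()  # would be passed to the optimizer
```

In either TensorFlow version, the floats would be replaced by non-trainable variables and the addition by assign_add, but the control flow stays the same.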

Common Pitfalls

The biggest mistake is accumulating per-replica gradients without reducing them correctly across devices first. That gives inconsistent scaling and wrong updates.

Another common error is forgetting to divide the loss or gradients by accum_steps. The code runs, but the optimizer receives a gradient accum_steps times larger than intended.

It is also easy to forget to reset accumulator variables after apply_gradients, which causes gradients from previous effective batches to leak into later ones.

Finally, gradient accumulation does not reproduce large-batch statistics. Layers that normalize over the batch, such as batch normalization, still compute their statistics per micro-batch, so they can behave differently from true large-batch training.
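A small numeric example shows why micro-batch statistics are not interchangeable with large-batch statistics. The data below is made up for illustration: two micro-batches that each have a small spread but sit far apart.

```python
from statistics import pvariance

# Two made-up micro-batches with the same spread but different means.
micro_batches = [[1.0, 2.0, 3.0, 4.0], [10.0, 11.0, 12.0, 13.0]]
full_batch = [x for mb in micro_batches for x in mb]

# What a batch-statistics layer sees: variance within each micro-batch.
per_micro = [pvariance(mb) for mb in micro_batches]
avg_micro_variance = sum(per_micro) / len(per_micro)  # 1.25

# What it would see if the whole batch were processed at once.
full_variance = pvariance(full_batch)                 # 21.5

print(avg_micro_variance, full_variance)
```

Accumulating gradients does nothing to close that gap, because the normalization happens inside each forward pass.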

Summary

  • Gradient accumulation simulates a larger effective batch by applying updates less often.
  • In distributed TensorFlow, reduce gradients across replicas before accumulating them.
  • Divide the loss or gradients by accum_steps to preserve expected update scale.
  • Store accumulated values in non-trainable variables and reset them after each optimizer step.
  • The same accumulation logic works in TF 2 custom loops and TF 1.14 eager mode, even though the APIs differ.
