How to accumulate gradients in TensorFlow?

Introduction

Gradient accumulation lets you simulate a larger effective batch size without loading that full batch into memory at once. Instead of applying gradients after every mini-batch, you sum gradients across several smaller batches and update the model only after the configured number of accumulation steps.

This is useful when the batch size you want for training stability does not fit in GPU memory. The key detail is that you are not skipping optimization. You are delaying it until enough mini-batches have contributed to the gradient.

When gradient accumulation helps

Suppose you want an effective batch size of 128, but your GPU can only fit 32 examples at a time. You can process four mini-batches of size 32, accumulate gradients from each one, and apply the combined update once.

In effect:

mini-batch size: 32
accumulation steps: 4
effective batch size: 128

The result is not always identical to training on a true batch of 128. Layers that compute per-batch statistics, such as batch normalization, still see 32 examples at a time. In practice it is often close enough to be very useful.
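The arithmetic above can be sketched in plain Python. The helper name `accumulation_plan` and the dataset size of 1,100 examples are assumptions made for illustration; neither comes from TensorFlow or from the original example.

```python
import math


def accumulation_plan(num_examples, mini_batch_size, accumulation_steps):
    """Summarize one epoch of gradient accumulation (hypothetical helper)."""
    # Number of mini-batches the data pipeline yields, counting a short last one.
    mini_batches = math.ceil(num_examples / mini_batch_size)
    # Windows that contain exactly accumulation_steps mini-batches, plus the rest.
    full_windows, leftover = divmod(mini_batches, accumulation_steps)
    return {
        "effective_batch_size": mini_batch_size * accumulation_steps,
        "mini_batches": mini_batches,
        "full_updates": full_windows,
        # Mini-batches still sitting in the buffer after the last full update.
        "leftover_mini_batches": leftover,
    }


plan = accumulation_plan(num_examples=1100, mini_batch_size=32, accumulation_steps=4)
print(plan)
```

A non-zero `leftover_mini_batches` value means the epoch ends with a partial accumulation window, which needs the explicit flush covered later in this article.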

Accumulate gradients with `GradientTape`

In TensorFlow 2, the most direct approach is to use a custom training loop with tf.GradientTape.

python

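import tensorflow as tf

model = tf.keras.Sequential([
    # The input width (20) is an assumption for illustration. Declaring an
    # Input builds the model, so trainable_variables is not empty below.
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])

optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

accumulation_steps = 4

# One non-trainable buffer per trainable variable, initialized to zeros.
gradient_buffer = [
    tf.Variable(tf.zeros_like(var), trainable=False)
    for var in model.trainable_variables
]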

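The gradient_buffer stores the running sum of gradients between optimizer updates. It is built from model.trainable_variables, so the model's weights must already exist at this point: a Keras model with no declared input shape has no variables until it is first called, which would leave the buffer empty.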

Scale the loss and apply gradients periodically

Inside the training loop, compute gradients for each mini-batch and add them into the buffer. A common pattern is to divide the loss by accumulation_steps so the final accumulated gradient has the same scale as one large averaged batch.

python

@tf.function
def train_step(x, y, step):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        # Scale the loss so the summed gradients match one averaged large batch.
        loss = loss_fn(y, logits) / accumulation_steps

    gradients = tape.gradient(loss, model.trainable_variables)

    # Add this mini-batch's gradients to the running sum.
    for buffer_var, grad in zip(gradient_buffer, gradients):
        if grad is not None:
            buffer_var.assign_add(grad)

    # Apply the accumulated update once every accumulation_steps mini-batches.
    if tf.equal((step + 1) % accumulation_steps, 0):
        optimizer.apply_gradients(zip(gradient_buffer, model.trainable_variables))

        # Reset the buffer for the next accumulation window.
        for buffer_var in gradient_buffer:
            buffer_var.assign(tf.zeros_like(buffer_var))

Then call train_step as you iterate through the dataset:

python

for step, (x_batch, y_batch) in enumerate(train_dataset):
    train_step(x_batch, y_batch, tf.constant(step))

This applies the optimizer update only every fourth mini-batch.
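To see why dividing by accumulation_steps gives the right scale, here is a small framework-free check. It uses a one-parameter linear model with mean squared error and made-up data, all of which are assumptions for illustration. Because differentiation is linear, dividing each mini-batch gradient by accumulation_steps is equivalent to dividing the loss, as long as the mini-batches are the same size.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data, assumed purely for illustration: 8 examples of a 1-D regression.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

accumulation_steps = 4
mini_batch_size = 2

# Gradient of the loss averaged over the full batch of 8 examples.
full_batch_grad = mse_gradient(w, xs, ys)

# Accumulate the gradients of (mini-batch loss / accumulation_steps).
accumulated_grad = 0.0
for i in range(accumulation_steps):
    start = i * mini_batch_size
    batch_x = xs[start:start + mini_batch_size]
    batch_y = ys[start:start + mini_batch_size]
    accumulated_grad += mse_gradient(w, batch_x, batch_y) / accumulation_steps

# The two values agree up to floating-point rounding.
print(full_batch_grad, accumulated_grad)
```

If the mini-batches had different sizes, the simple division by accumulation_steps would weight examples in smaller batches more heavily, so the match would no longer be exact.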

Handle the last partial accumulation window

One easy mistake is forgetting the final incomplete group of mini-batches. If the epoch length is not divisible by accumulation_steps, you still need to apply whatever gradients remain in the buffer at the end.

In a Python-driven loop, you can track that explicitly:

python

pending_steps = 0

for x_batch, y_batch in train_dataset:
    with tf.GradientTape() as tape:
        logits = model(x_batch, training=True)
        loss = loss_fn(y_batch, logits) / accumulation_steps

    gradients = tape.gradient(loss, model.trainable_variables)

    for buffer_var, grad in zip(gradient_buffer, gradients):
        if grad is not None:
            buffer_var.assign_add(grad)

    pending_steps += 1

    if pending_steps == accumulation_steps:
        optimizer.apply_gradients(zip(gradient_buffer, model.trainable_variables))
        for buffer_var in gradient_buffer:
            buffer_var.assign(tf.zeros_like(buffer_var))
        pending_steps = 0

# Flush the final partial window, then clear the buffer so stale gradients
# do not carry over into the next epoch. These gradients were scaled by
# 1 / accumulation_steps, so a partial window yields a smaller update.
if pending_steps > 0:
    optimizer.apply_gradients(zip(gradient_buffer, model.trainable_variables))
    for buffer_var in gradient_buffer:
        buffer_var.assign(tf.zeros_like(buffer_var))
    pending_steps = 0

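Without this final flush, the gradients from the last few mini-batches are never applied as their own update. They are either discarded or, if the buffer is not cleared, folded into the first update of the next epoch.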
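One caveat with a partial window: the loss was divided by accumulation_steps, yet fewer mini-batches contributed, so the flushed update is smaller than a full one. An alternative, which is a sketch and not the approach used in the loop above, is to sum unscaled gradients and divide by the number of mini-batches actually seen at flush time. The class name `GradientAccumulator` is hypothetical, and plain Python floats stand in for tensors.

```python
class GradientAccumulator:
    """Hypothetical helper: sums gradients and averages by the real count."""

    def __init__(self, num_params, accumulation_steps):
        self.accumulation_steps = accumulation_steps
        self.buffer = [0.0] * num_params
        self.pending = 0

    def add(self, grads):
        """Add one mini-batch of unscaled gradients; return True when full."""
        for i, g in enumerate(grads):
            if g is not None:  # mirrors the None check for unused variables
                self.buffer[i] += g
        self.pending += 1
        return self.pending == self.accumulation_steps

    def flush(self):
        """Return averaged gradients and reset, or None if nothing is pending."""
        if self.pending == 0:
            return None
        averaged = [g / self.pending for g in self.buffer]
        self.buffer = [0.0] * len(self.buffer)
        self.pending = 0
        return averaged


# Six mini-batches with accumulation_steps=4: one full window, one partial.
acc = GradientAccumulator(num_params=2, accumulation_steps=4)
updates = []
for step in range(6):
    grads = [1.0, float(step)]  # stand-in gradients for two parameters
    if acc.add(grads):
        updates.append(acc.flush())

final = acc.flush()  # flush the partial window of two mini-batches
if final is not None:
    updates.append(final)

print(updates)
```

Averaging by the real count keeps the partial update on the same scale as the full ones, at the cost of giving each example in the short window more weight than examples in a full window. Which trade-off is preferable depends on the training setup.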

Think about learning rate and batch-dependent layers

Gradient accumulation changes the effective batch size, so optimizer behavior may shift. In some training setups, you may need to revisit the learning rate because the optimization step now reflects more examples per update.

Also remember that layers such as batch normalization still see the physical mini-batch size, not the accumulated effective batch size. Gradient accumulation changes how many examples contribute to each optimizer update, but it does not change the statistics that batch-dependent layers compute in the forward pass.
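The difference is easy to see with a few numbers. The sketch below uses made-up activation values for a single feature (an assumption for illustration) and compares the mean and variance over the whole effective batch with the values computed per mini-batch, which are what a batch-dependent layer would actually work with.

```python
from statistics import fmean, pvariance

# Assumed toy activations for one feature across an effective batch of 8.
activations = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
mini_batch_size = 4

# Statistics computed over the full effective batch.
full_mean = fmean(activations)
full_var = pvariance(activations)

# Statistics computed when only one mini-batch is visible at a time.
mini_batch_stats = []
for start in range(0, len(activations), mini_batch_size):
    chunk = activations[start:start + mini_batch_size]
    mini_batch_stats.append((fmean(chunk), pvariance(chunk)))

print("full batch:", (full_mean, full_var))
print("mini-batches:", mini_batch_stats)
```

Each mini-batch here has a much smaller variance and a different mean than the full batch, so normalizing per mini-batch produces different outputs no matter how the gradients are accumulated afterwards.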

Common Pitfalls

The most common mistake is accumulating gradients without resetting the buffer after apply_gradients. That causes gradients to keep growing across unrelated updates.

Another issue is forgetting to scale the loss or otherwise account for how many mini-batches contributed to the final update. If you do not normalize correctly, the effective update magnitude changes.

Developers also often forget the final partial accumulation window at the end of an epoch, which silently drops training signal.

Finally, gradient accumulation is not the same as batch normalization over a larger batch. It helps optimizer updates, but it does not change every batch-sensitive component in the model.

Summary

Gradient accumulation simulates a larger effective batch size by delaying optimizer updates.
In TensorFlow, a custom GradientTape loop is the most direct way to implement it.
Accumulate gradients into a buffer, apply them every accumulation_steps, then reset the buffer.
Do not forget to flush the final partial buffer at the end of the epoch.
Revisit learning rate and batch-sensitive layers when changing effective batch size.
