Reproducible results using Keras with TensorFlow backend

Introduction

Reproducibility in Keras with TensorFlow means that the same code, data, and environment produce the same training results across runs. Getting close to that goal requires more than setting one seed, because randomness can enter through Python, NumPy, TensorFlow ops, data pipelines, hardware, and even package versions.

Set seeds the modern way

The most practical starting point is Keras' seed helper:

python

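import keras
import tensorflow as tf

# Seed Python's random module, NumPy, and TensorFlow in one call.
keras.utils.set_random_seed(42)
# Require deterministic implementations of TensorFlow ops.
tf.config.experimental.enable_op_determinism()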

keras.utils.set_random_seed sets the Python, NumPy, and TensorFlow random seeds in one call. enable_op_determinism() configures TensorFlow to run ops deterministically; ops that have no deterministic implementation raise an error instead of silently falling back to a nondeterministic kernel. Together, these two calls are the closest thing to a default reproducibility switch in the current Keras and TensorFlow stack.

If you use your own random generators, seed them explicitly too. For example, a generator created with numpy.random.default_rng() is not affected by the legacy global NumPy seed, so pass the seed to it directly:

python

import numpy as np

rng = np.random.default_rng(42)
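
The difference between the legacy global seed and Generator objects can be checked directly. The following is a small sketch that uses only NumPy; the variable names are illustrative.

```python
import numpy as np

# The legacy global seed only affects the module-level np.random.* functions.
np.random.seed(42)
legacy_a = np.random.rand(3)
np.random.seed(42)
legacy_b = np.random.rand(3)

# Generator objects carry their own state; seed each one explicitly.
seeded_a = np.random.default_rng(42).random(3)
seeded_b = np.random.default_rng(42).random(3)

# An unseeded Generator draws fresh OS entropy, regardless of np.random.seed.
np.random.seed(42)
unseeded_a = np.random.default_rng().random(3)
np.random.seed(42)
unseeded_b = np.random.default_rng().random(3)

print(np.array_equal(legacy_a, legacy_b))      # True
print(np.array_equal(seeded_a, seeded_b))      # True
print(np.array_equal(unseeded_a, unseeded_b))  # almost certainly False
```

Passing one seeded Generator around, rather than calling default_rng() in several places, keeps all of the NumPy randomness in a project traceable to a single seed.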

Make the data pipeline deterministic

Your model can still vary if the training data arrives in a different order on each run. Keep data loading deterministic by fixing shuffle seeds and avoiding unnecessary nondeterministic preprocessing.

python

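import tensorflow as tf

# x_train and y_train are assumed to be in-memory arrays of equal length.
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# A full-size buffer with a fixed seed gives one repeatable shuffled order.
dataset = dataset.shuffle(buffer_size=len(x_train), seed=42, reshuffle_each_iteration=False)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)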

The reshuffle_each_iteration=False flag pins one shuffled order and reuses it for every epoch. With the default value of True, the example order changes from epoch to epoch even when the seed is fixed.
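
The effect of the flag can be observed on a toy dataset. This is a minimal sketch using tf.data.Dataset.range in place of real training data; epoch_orders is a hypothetical helper name, not part of TensorFlow.

```python
import tensorflow as tf

def epoch_orders(epochs=2):
    """Return the element order seen in each pass over a seeded, non-reshuffling dataset."""
    dataset = tf.data.Dataset.range(10).shuffle(
        buffer_size=10, seed=42, reshuffle_each_iteration=False
    )
    # Each full iteration over the dataset stands in for one training epoch.
    return [[int(value) for value in dataset] for _ in range(epochs)]

fixed = epoch_orders()
rebuilt = epoch_orders()

print(fixed[0] == fixed[1])  # same order in both epochs
print(fixed == rebuilt)      # same order when the pipeline is rebuilt
```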

If you use image augmentation layers or random TensorFlow ops inside the pipeline, seed those components too. Reproducibility breaks quickly when one augmentation step is left unseeded.
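
One way to seed augmentation inside a tf.data pipeline is to use TensorFlow's stateless random image ops, which take an explicit seed for every call. The sketch below assumes small in-memory images and derives each example's seed from a base seed plus its index; make_augmented_dataset and base_seed are illustrative names, not part of any library.

```python
import numpy as np
import tensorflow as tf

def make_augmented_dataset(images, base_seed=42):
    """Build a pipeline whose random flips depend only on base_seed and the example index."""
    # enumerate() pairs every image with an int64 index.
    dataset = tf.data.Dataset.from_tensor_slices(images).enumerate()

    def augment(index, image):
        # Stateless ops take a shape-[2] seed, so the flip decision is a
        # pure function of (base_seed, index).
        seed = tf.stack([tf.constant(base_seed, dtype=tf.int64), index])
        return tf.image.stateless_random_flip_left_right(image, seed=seed)

    return dataset.map(augment).batch(4)

# Placeholder data: eight 4x4 single-channel "images".
images = np.arange(8 * 4 * 4, dtype="float32").reshape(8, 4, 4, 1)

first = np.concatenate([batch.numpy() for batch in make_augmented_dataset(images)])
second = np.concatenate([batch.numpy() for batch in make_augmented_dataset(images)])
print(np.array_equal(first, second))
```

Because the seed is tied to the example index rather than to global state, the augmentation does not depend on how many random ops ran earlier in the program.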

Build and train with controlled randomness

Here is a minimal reproducible Keras example:

python

import keras
import tensorflow as tf
import numpy as np

# Seed Python, NumPy, and TensorFlow, then require deterministic ops.
keras.utils.set_random_seed(42)
tf.config.experimental.enable_op_determinism()

# Synthetic data from an explicitly seeded generator.
x = np.random.default_rng(42).normal(size=(1000, 20)).astype("float32")
y = (x.sum(axis=1) > 0).astype("float32")

model = keras.Sequential(
    [
        keras.Input(shape=(20,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# shuffle=False keeps the batch order fixed across runs.
history = model.fit(x, y, epochs=3, batch_size=32, shuffle=False, verbose=0)
print(history.history)

This example avoids data-order randomness by setting shuffle=False in fit. In a real project, you can shuffle deterministically, but the key point is to make the randomness explicit rather than accidental.
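
A simple way to confirm that a setup is reproducible is to train twice in the same process and compare the loss curves. The following sketch reuses the same synthetic data and model shape; train_once is a hypothetical helper, and exact equality is only expected on the same machine and software stack.

```python
import keras
import numpy as np
import tensorflow as tf

tf.config.experimental.enable_op_determinism()

def train_once(seed=42):
    """Reseed everything, rebuild the model from scratch, and return the per-epoch loss."""
    keras.utils.set_random_seed(seed)

    x = np.random.default_rng(seed).normal(size=(1000, 20)).astype("float32")
    y = (x.sum(axis=1) > 0).astype("float32")

    model = keras.Sequential(
        [
            keras.Input(shape=(20,)),
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dropout(0.2),
            keras.layers.Dense(1, activation="sigmoid"),
        ]
    )
    model.compile(optimizer="adam", loss="binary_crossentropy")

    history = model.fit(x, y, epochs=3, batch_size=32, shuffle=False, verbose=0)
    return history.history["loss"]

first = train_once()
second = train_once()
print(first == second)
```

If the two lists differ, the usual suspects are an unseeded generator, a shuffled input pipeline, or state left over from a model built before the seed was set.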

Pin the environment, not just the code

Even with perfect seeds, exact reproducibility is not guaranteed across different software and hardware environments. TensorFlow itself documents that determinism is tied to running on the same hardware and in the same software stack.

For serious experiment tracking, record:

Python version
Keras version
TensorFlow version
CUDA and cuDNN versions when using GPUs
Operating system
CPU or GPU model

A container image or a locked dependency file is often the difference between "mostly repeatable" and "actually reproducible."
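
Most of the items listed above can be captured programmatically and stored next to the experiment results. This is a minimal sketch using only the standard library; environment_snapshot and package_version are hypothetical helper names, and CUDA, cuDNN, and GPU model are not covered here.

```python
import json
import platform
from importlib import metadata

def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def environment_snapshot(packages=("keras", "tensorflow", "numpy")):
    """Collect interpreter, OS, CPU, and package version details as a plain dict."""
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "packages": {name: package_version(name) for name in packages},
    }

snapshot = environment_snapshot()
print(json.dumps(snapshot, indent=2))
```

On GPU builds, tf.sysconfig.get_build_info() is one place to look for the CUDA and cuDNN versions TensorFlow was built against; that lookup is left out here to keep the sketch free of heavy imports.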

Know the limits

Deterministic execution can be slower than the default, and not every operation or distribution strategy behaves identically in every setup. Multi-worker training, parameter-server strategies, and some custom ops are common sources of drift.

There is also a difference between reproducible training and reproducible inference. Inference is usually easier because it removes dropout, shuffling, optimizer state, and much of the training-time randomness.

The practical goal is usually one of these:

Bitwise-identical reruns on the same machine
Functionally consistent metrics across reruns
Fully documented experiments that can be re-executed later with the same environment

Be explicit about which goal you need.
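
The first two goals correspond to different checks. The sketch below compares two runs both ways using NumPy; compare_runs is a hypothetical helper, the tolerance is an arbitrary choice, and the loss values are made-up numbers for illustration only.

```python
import numpy as np

def compare_runs(a, b, rtol=1e-3):
    """Report whether two metric sequences are exactly equal and whether they agree within rtol."""
    a = np.asarray(a, dtype="float64")
    b = np.asarray(b, dtype="float64")
    return {
        "bitwise_identical": bool(np.array_equal(a, b)),
        "within_tolerance": bool(np.allclose(a, b, rtol=rtol, atol=0.0)),
    }

# Illustrative per-epoch losses, not real measurements.
run_a = [0.6931, 0.5012, 0.4127]
run_b = [0.6931, 0.5013, 0.4127]

print(compare_runs(run_a, run_a))  # {'bitwise_identical': True, 'within_tolerance': True}
print(compare_runs(run_a, run_b))  # {'bitwise_identical': False, 'within_tolerance': True}
```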

Common Pitfalls

The biggest mistake is setting only tf.random.set_seed and assuming that covers Python and NumPy randomness too. It does not.
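
If the Keras helper is not available, each library can be seeded separately. The following is a sketch of such a manual equivalent; seed_everything is a hypothetical name, and it covers only Python, NumPy, and TensorFlow's global seed.

```python
import random

import numpy as np
import tensorflow as tf

def seed_everything(seed):
    """Seed the three global random sources that commonly affect a Keras run."""
    random.seed(seed)         # Python's built-in random module
    np.random.seed(seed)      # NumPy's legacy global generator
    tf.random.set_seed(seed)  # TensorFlow's global seed

def draw():
    """Take one sample from each library's global generator."""
    return (random.random(), float(np.random.rand()), float(tf.random.uniform(())))

seed_everything(42)
first = draw()
seed_everything(42)
second = draw()
print(first == second)
```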

Another common issue is forgetting the input pipeline. Deterministic model code still produces different training results if the dataset is shuffled differently.

People also change hardware or package versions and then blame the seed. Reproducibility is always conditional on the environment.

Finally, do not assume determinism is free. Some deterministic execution paths can reduce throughput, especially on GPU workloads.

Summary

Use keras.utils.set_random_seed(...) to seed Python, NumPy, and TensorFlow together.
Enable deterministic TensorFlow ops with tf.config.experimental.enable_op_determinism().
Make the data pipeline deterministic with fixed shuffle seeds and stable preprocessing.
Record the full software and hardware environment, not just the training script.
Treat reproducibility as a system property, not a single line of code.