Reproducible results using Keras with TensorFlow backend
Introduction
Reproducibility in Keras with TensorFlow means that the same code, data, and environment produce the same training results across runs. Getting close to that goal requires more than setting one seed, because randomness can enter through Python, NumPy, TensorFlow ops, data pipelines, hardware, and even package versions.
Set seeds the modern way
The most practical starting point is Keras' seed helper, keras.utils.set_random_seed, which sets the Python, NumPy, and TensorFlow random seeds together.
Pair it with tf.config.experimental.enable_op_determinism(), which makes TensorFlow ops run deterministically and raises an error for ops that have no deterministic implementation. Together, these two calls are the closest thing to a default reproducibility switch in the current Keras and TensorFlow stack.
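A minimal sketch of that setup, assuming a TensorFlow 2.x installation recent enough to provide both calls. The SEED value and the draw() helper are illustrative names, not part of any library:

```python
import random

import keras
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary illustrative value

# Seed Python, NumPy, and TensorFlow in one call.
keras.utils.set_random_seed(SEED)

# Ask TensorFlow to run ops deterministically.
tf.config.experimental.enable_op_determinism()


def draw():
    """Re-seed, then draw one number from each of the three generators."""
    keras.utils.set_random_seed(SEED)
    return (
        random.random(),
        float(np.random.rand()),
        float(tf.random.uniform([])),
    )


first = draw()
second = draw()
print(first == second)
```

Because draw() re-seeds before sampling, both calls should return the same three numbers within one process.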
If you use your own random generators, seed them explicitly too. For example, numpy.random.default_rng() ignores the older global NumPy seed, so its output is only repeatable when you pass a seed to it directly.
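The difference can be seen in a short sketch that uses only NumPy. The seed values are arbitrary:

```python
import numpy as np

# The legacy global seed does not reach default_rng(): an unseeded
# Generator pulls fresh entropy from the operating system each time.
np.random.seed(0)
unseeded_a = np.random.default_rng().random(3)
np.random.seed(0)
unseeded_b = np.random.default_rng().random(3)

# Passing the seed directly makes the Generator repeatable.
seeded_a = np.random.default_rng(123).random(3)
seeded_b = np.random.default_rng(123).random(3)

print(np.array_equal(unseeded_a, unseeded_b))
print(np.array_equal(seeded_a, seeded_b))
```

The unseeded draws differ between the two generators even though the global seed was reset in between, while the explicitly seeded draws match.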
Make the data pipeline deterministic
Your model can still vary if the training data arrives in a different order on each run. Keep data loading deterministic by fixing shuffle seeds and avoiding unnecessary nondeterministic preprocessing.
The reshuffle_each_iteration=False argument to tf.data.Dataset.shuffle matters when you need the same example order in every epoch. By default the dataset is reshuffled on each epoch, so the order differs from one epoch to the next even when the seed is fixed.
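As a sketch of a shuffle with a fixed order, assuming a toy dataset of ten integers in place of real training data (make_dataset and epoch_order are illustrative helpers):

```python
import tensorflow as tf


def make_dataset(seed=42):
    """Build a small dataset whose shuffle order is fixed by `seed`."""
    ds = tf.data.Dataset.range(10)
    ds = ds.shuffle(buffer_size=10, seed=seed, reshuffle_each_iteration=False)
    return ds.batch(5)


def epoch_order(ds):
    """Iterate once over the dataset and return the flattened element order."""
    return [int(v) for batch in ds for v in batch.numpy()]


dataset = make_dataset()
first_epoch = epoch_order(dataset)
second_epoch = epoch_order(dataset)
print(first_epoch == second_epoch)
```

Setting buffer_size to the dataset size gives a full shuffle here. With larger datasets the buffer is usually smaller, which changes how thoroughly the data is mixed but not whether the order is repeatable.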
If you use image augmentation layers or random TensorFlow ops inside the pipeline, seed those components too. Reproducibility breaks quickly when one augmentation step is left unseeded.
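One way to keep augmentation repeatable is to use TensorFlow's stateless image ops, which take an explicit seed on every call. The sketch below assumes a toy batch of random images and derives a per-example seed from the example's index. The augment function and the base seed 42 are illustrative choices:

```python
import tensorflow as tf

# Toy stand-in for real images: 8 RGB images of size 4x4.
images = tf.random.stateless_uniform([8, 4, 4, 3], seed=[1, 2])


def augment(index, image):
    """Apply seeded augmentations; the seed depends only on the example index."""
    seed = tf.stack([tf.constant(42, dtype=tf.int64), index])
    image = tf.image.stateless_random_flip_left_right(image, seed=seed)
    image = tf.image.stateless_random_brightness(image, max_delta=0.2, seed=seed)
    return image


def make_dataset():
    # enumerate() pairs each image with its int64 position in the dataset.
    return tf.data.Dataset.from_tensor_slices(images).enumerate().map(augment)


same = all(
    bool(tf.reduce_all(a == b)) for a, b in zip(make_dataset(), make_dataset())
)
print(same)
```

Because each seed is a pure function of the example index, two passes over the pipeline should produce the same augmented images.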
Build and train with controlled randomness
A minimal reproducible Keras example seeds the run once, builds the model, and trains with shuffle=False in fit so that data order cannot vary between runs.
In a real project, you can shuffle deterministically instead. The key point is to make the randomness explicit rather than accidental.
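The sketch below shows one way such an example might look. It assumes synthetic data and a deliberately tiny model. The train_once helper, layer sizes, and hyperparameters are illustrative, not a recommended configuration:

```python
import keras
import numpy as np
import tensorflow as tf

tf.config.experimental.enable_op_determinism()


def train_once(seed=42):
    """Seed everything, train a tiny model, and return the per-epoch losses."""
    keras.utils.set_random_seed(seed)

    # Synthetic data from an explicitly seeded generator.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(256, 8)).astype("float32")
    y = (x.sum(axis=1) > 0).astype("float32")

    model = keras.Sequential(
        [
            keras.Input(shape=(8,)),
            keras.layers.Dense(16, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ]
    )
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # shuffle=False removes data-order randomness from fit().
    history = model.fit(x, y, epochs=3, batch_size=32, shuffle=False, verbose=0)
    return history.history["loss"]


losses_a = train_once()
losses_b = train_once()
print(losses_a)
print(losses_b)
```

On the same machine and software stack, the two loss lists should match. If they do not, some source of randomness is still unaccounted for.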
Pin the environment, not just the code
Even with perfect seeds, exact reproducibility is not guaranteed across different software and hardware environments. TensorFlow itself documents that determinism is tied to running on the same hardware and in the same software stack.
For serious experiment tracking, record:
- Python version
- Keras version
- TensorFlow version
- CUDA and cuDNN versions when using GPUs
- Operating system
- CPU or GPU model
A container image or a locked dependency file is often the difference between "mostly repeatable" and "actually reproducible."
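Part of that record can be captured from inside the training script. The following sketch uses only the standard library. The describe_environment helper and its default package list are illustrative, and the sketch does not capture CUDA, cuDNN, or GPU model details, which need to be recorded separately:

```python
import json
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def describe_environment(packages=("tensorflow", "keras", "numpy")):
    """Collect basic interpreter, OS, and package version information."""
    return {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "packages": {name: package_version(name) for name in packages},
    }


environment = describe_environment()
print(json.dumps(environment, indent=2))
```

Saving this JSON next to each experiment's results makes it easier to tell later whether two runs were actually comparable.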
Know the limits
Deterministic TensorFlow can still run slower, and not every operation or distribution strategy behaves identically in every setup. Multi-worker training, parameter-server strategies, and some custom ops are common sources of drift.
There is also a difference between reproducible training and reproducible inference. Inference is usually easier because it removes dropout, shuffling, optimizer state, and much of the training-time randomness.
The practical goal is usually one of these:
- Bitwise-identical reruns on the same machine
- Functionally consistent metrics across reruns
- Fully documented experiments that can be re-executed later with the same environment
Be explicit about which goal you need.
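The first two goals call for different checks. As a sketch with made-up numbers, bitwise identity can be tested with exact array equality, while functional consistency can be tested as a spread of a metric within a tolerance. Both helper names and the tolerance values are illustrative:

```python
import numpy as np


def bitwise_identical(weights_a, weights_b):
    """True only if both runs produced exactly the same arrays."""
    if len(weights_a) != len(weights_b):
        return False
    return all(np.array_equal(a, b) for a, b in zip(weights_a, weights_b))


def functionally_consistent(metric_values, tolerance):
    """True if a metric varies by at most `tolerance` across reruns."""
    values = np.asarray(metric_values, dtype=float)
    return float(values.max() - values.min()) <= tolerance


# Made-up weights from two hypothetical runs that differ by a tiny amount.
run_a = [np.array([0.1, 0.2], dtype=np.float32)]
run_b = [run_a[0] + np.float32(1e-6)]

# Made-up validation accuracies from three hypothetical reruns.
accuracies = [0.912, 0.915, 0.913]

print(bitwise_identical(run_a, run_a))
print(bitwise_identical(run_a, run_b))
print(functionally_consistent(accuracies, tolerance=0.005))
```

The acceptable tolerance is a project decision. It depends on how much run-to-run variation would change the conclusions drawn from the experiment.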
Common Pitfalls
The biggest mistake is setting only tf.random.set_seed and assuming that covers Python and NumPy randomness too. It does not.
Another common issue is forgetting the input pipeline. Deterministic model code still produces different training results if the dataset is shuffled differently.
People also change hardware or package versions and then blame the seed. Reproducibility is always conditional on the environment.
Finally, do not assume determinism is free. Some deterministic execution paths can reduce throughput, especially on GPU workloads.
Summary
- Use keras.utils.set_random_seed(...) to seed Python, NumPy, and TensorFlow together.
- Enable deterministic TensorFlow ops with tf.config.experimental.enable_op_determinism().
- Make the data pipeline deterministic with fixed shuffle seeds and stable preprocessing.
- Record the full software and hardware environment, not just the training script.
- Treat reproducibility as a system property, not a single line of code.

