Keras occupies an indefinitely increasing amount of memory for each epoch
Introduction
If memory usage appears to grow every epoch during Keras training, the cause is not always a true framework leak. The common causes are graph accumulation, callbacks or logs retaining data, dataset pipelines caching too much, generators holding references, or simply TensorFlow’s allocator reserving memory in a way that looks like a leak from the outside.
First Distinguish A Leak From Reserved Memory
TensorFlow often reserves memory aggressively, especially on GPU. That can make system monitors show high usage even when the framework intends to reuse that memory.
A real leak usually has one of these patterns:
- every epoch increases host memory and it never stabilizes
- every repeated model creation increases memory because old graphs stay alive
- callbacks or saved histories grow with training duration
A simple one-time jump that stays flat is usually allocation strategy, not a leak.
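The distinction above can be sketched as a small heuristic over per-epoch memory readings. This is only an illustration: the function name `classify_memory_trend`, the tolerance, and the number of warm-up epochs are assumptions, and the readings are made-up numbers rather than measurements.

```python
def classify_memory_trend(readings_mb, tolerance_mb=5.0, settle_epochs=2):
    """Label per-epoch memory readings as 'stable' or 'growing'.

    readings_mb: one host-memory reading (in MB) per epoch.
    tolerance_mb: growth below this, after warm-up, is treated as noise.
    settle_epochs: early epochs ignored so that a one-time allocation
                   jump is not mistaken for a leak.
    """
    tail = readings_mb[settle_epochs:]
    if len(tail) < 2:
        return "not enough data"
    growth = tail[-1] - tail[0]
    return "growing" if growth > tolerance_mb else "stable"


# One-time jump that then stays flat: typical allocator behavior.
flat_after_jump = classify_memory_trend([800, 2100, 2105, 2104, 2106, 2105])

# Steady climb every epoch: worth investigating as a real leak.
steady_climb = classify_memory_trend([800, 1200, 1600, 2000, 2400, 2800])

print(flat_after_jump, steady_climb)
```

A check this simple only helps if the run is long enough for the curve to settle, so a handful of epochs on a reduced dataset is usually a better test than a single long run.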
Common Real Causes
The most frequent causes are:
- creating new models in a loop without clearing old ones
- calling fit repeatedly inside code that also accumulates tensors or histories
- custom callbacks storing full predictions or batch outputs every epoch
- tf.data caching or prefetch pipelines holding more data than expected
- Python references preventing garbage collection
The right fix depends on which of these is happening.
A Safe Pattern For Repeated Model Creation
If you rebuild models many times, clear the old graph between runs.
Use clear_session() between separate model lifecycles, not as a routine inside every epoch of one normal training run.
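A minimal sketch of that pattern, assuming a hyperparameter sweep that builds one small model per setting. The `build_model` helper, the layer sizes, and the random data are hypothetical and exist only to make the example runnable.

```python
import gc

import numpy as np
import tensorflow as tf


def build_model(units):
    # Hypothetical tiny model, used only to illustrate the lifecycle.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(1),
    ])


x = np.random.rand(64, 4).astype("float32")
y = np.random.rand(64, 1).astype("float32")

final_losses = {}
for units in (8, 16, 32):
    model = build_model(units)
    model.compile(optimizer="adam", loss="mse")
    history = model.fit(x, y, epochs=2, batch_size=16, verbose=0)

    # Keep only a small scalar, not the model or the full History object.
    final_losses[units] = float(history.history["loss"][-1])

    # End of this model's lifecycle: drop references, then reset Keras state.
    del model, history
    tf.keras.backend.clear_session()
    gc.collect()

print(final_losses)
```

The cleanup sits at the end of each model's lifecycle, after all epochs of `fit` have finished, which is the placement the text above recommends.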
Example Of A Self-Inflicted Leak
Consider a callback that stores every prediction for every epoch in memory.
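A minimal sketch of such a callback follows. The class name `StoreAllPredictions`, the toy model, and the random data are assumptions made for illustration; the point is only the list that grows by one full array per epoch.

```python
import numpy as np
import tensorflow as tf


class StoreAllPredictions(tf.keras.callbacks.Callback):
    """Anti-pattern: keeps a full prediction array for every epoch."""

    def __init__(self, x_val):
        super().__init__()
        self.x_val = x_val
        self.all_predictions = []  # grows by one full array per epoch

    def on_epoch_end(self, epoch, logs=None):
        preds = self.model.predict(self.x_val, verbose=0)
        self.all_predictions.append(preds)  # reference is never released


x = np.random.rand(256, 4).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

leaky_cb = StoreAllPredictions(x)
model.fit(x, y, epochs=3, batch_size=32, verbose=0, callbacks=[leaky_cb])

# Retained bytes scale with (number of epochs) x (size of one prediction array).
retained_bytes = sum(p.nbytes for p in leaky_cb.all_predictions)
print(len(leaky_cb.all_predictions), "arrays retained,", retained_bytes, "bytes")
```

With a realistic validation set and hundreds of epochs, the same list can reach gigabytes even though the model itself is not growing.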
The model may look like it leaks, but the real issue is the callback retaining large arrays indefinitely.
A safer callback stores summary statistics instead of full tensors.
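One possible shape for such a callback is sketched below. The class name `PredictionSummary` and the choice of mean and standard deviation are assumptions; any small, fixed-size statistic would serve the same purpose.

```python
import numpy as np
import tensorflow as tf


class PredictionSummary(tf.keras.callbacks.Callback):
    """Keeps a few floats per epoch instead of the full prediction array."""

    def __init__(self, x_val):
        super().__init__()
        self.x_val = x_val
        self.summaries = []

    def on_epoch_end(self, epoch, logs=None):
        preds = self.model.predict(self.x_val, verbose=0)
        # Convert to plain Python floats so no array stays referenced.
        self.summaries.append({
            "epoch": epoch,
            "mean": float(preds.mean()),
            "std": float(preds.std()),
        })
        # preds goes out of scope here and can be garbage collected.


x = np.random.rand(256, 4).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

summary_cb = PredictionSummary(x)
model.fit(x, y, epochs=3, batch_size=32, verbose=0, callbacks=[summary_cb])
print(summary_cb.summaries)
```

If full predictions are genuinely needed later, writing them to disk each epoch is another option that keeps host memory flat.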
Watch The Input Pipeline
A tf.data.Dataset.cache() call can intentionally hold the entire dataset in memory. That is sometimes correct, but it can also be the hidden reason memory grows after the first pass.
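As a rough illustration, `cache()` with no argument keeps elements in memory, while passing a filename writes the cache to disk instead. The toy dataset and the temporary cache path below are assumptions chosen to keep the sketch small.

```python
import os
import tempfile

import tensorflow as tf

# Toy dataset standing in for a real, much larger input pipeline.
ds = tf.data.Dataset.range(1000).map(lambda i: tf.cast(i, tf.float32))

# In-memory cache: after the first full pass, every element stays in RAM.
in_memory = ds.cache().batch(32).prefetch(tf.data.AUTOTUNE)

# File-backed cache: elements are written to disk rather than held in RAM.
cache_path = os.path.join(tempfile.mkdtemp(), "train_cache")
on_disk = ds.cache(cache_path).batch(32).prefetch(tf.data.AUTOTUNE)

# Two passes over the file-backed pipeline; the second pass reads the cache.
for epoch in range(2):
    n_batches = sum(1 for _batch in on_disk)

print(n_batches)
```

Removing the `cache()` call entirely for one test run is a quick way to check whether it explains the growth seen after the first epoch.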
Likewise, custom Python generators may keep references to large batches or decoded data structures if they are written carelessly.
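The difference between a careless and a careful generator can be shown with weak references. The generator names and the helper `first_batch_still_alive` below are hypothetical, and the check relies on CPython's reference counting to free objects promptly.

```python
import weakref

import numpy as np


def leaky_batches(n_batches, batch_size=32, features=8):
    seen = []  # anti-pattern: every batch ever produced stays referenced
    for _ in range(n_batches):
        batch = np.zeros((batch_size, features), dtype="float32")
        seen.append(batch)
        yield batch


def clean_batches(n_batches, batch_size=32, features=8):
    for _ in range(n_batches):
        # The local name is rebound on each iteration, so old batches are freed.
        batch = np.zeros((batch_size, features), dtype="float32")
        yield batch


def first_batch_still_alive(gen):
    first = next(gen)
    ref = weakref.ref(first)
    del first      # the consumer drops its own reference
    next(gen)      # advance so the generator moves on to a new batch
    return ref() is not None


leaky_result = first_batch_still_alive(leaky_batches(3))
clean_result = first_batch_still_alive(clean_batches(3))
print(leaky_result, clean_result)
```

In the leaky version the first batch survives because the internal list still points to it; in the clean version nothing does once the generator has advanced.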
Practical Debugging Steps
Use a short checklist:
- train for a few epochs and see whether memory stabilizes
- disable custom callbacks
- simplify the input pipeline
- run once on CPU only to separate host and GPU behavior
- rerun the script in a clean process and compare runs
That process tells you whether the issue is allocator behavior, pipeline retention, or repeated graph creation.
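Part of the checklist can be sketched as a stripped-down run: GPUs hidden, no custom logic beyond a logger, and a reading taken after every epoch. The `MemoryLogger` class is hypothetical, and it uses `tracemalloc`, which tracks allocations made through Python's allocator but not memory held by TensorFlow's own native allocator, so it is best suited to spotting retained Python objects.

```python
import os

# Hide GPUs for this run; this must be set before TensorFlow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tracemalloc

import numpy as np
import tensorflow as tf


class MemoryLogger(tf.keras.callbacks.Callback):
    """Records Python-level traced memory at the end of each epoch."""

    def on_train_begin(self, logs=None):
        tracemalloc.start()
        self.readings_mb = []

    def on_epoch_end(self, epoch, logs=None):
        current, _peak = tracemalloc.get_traced_memory()
        self.readings_mb.append(current / 1e6)
        print(f"epoch {epoch}: {current / 1e6:.2f} MB traced")

    def on_train_end(self, logs=None):
        tracemalloc.stop()


x = np.random.rand(256, 4).astype("float32")
y = np.random.rand(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

mem_logger = MemoryLogger()
model.fit(x, y, epochs=3, batch_size=32, verbose=0, callbacks=[mem_logger])
```

Re-enabling callbacks and pipeline stages one at a time while watching these readings shows which component makes the numbers start to climb.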
Common Pitfalls
The most common mistake is calling clear_session() inside a normal epoch loop while still using the same model. That is not the right fix and can break training logic.
Another mistake is assuming Keras is leaking when TensorFlow is simply keeping GPU memory reserved for reuse.
A third issue is forgetting that Python containers such as lists, histories, and callback fields can keep large arrays alive long after an epoch ends.
Summary
- Memory growth per epoch can come from retained Python objects, graph accumulation, dataset caching, or allocator behavior.
- A one-time reserved-memory jump is not the same as a real leak.
- Use clear_session() between separate model creations, not as an every-epoch ritual.
- Inspect callbacks and input pipelines before blaming the framework.
- Simplify the training loop to isolate where memory actually starts accumulating.

