Could validation data be a generator in tensorflow.keras 2.0?
Introduction
Yes, tensorflow.keras can use generator-style input for validation data. This is useful when the validation set is too large to keep in memory or when you already have a batch pipeline for both training and evaluation.
How validation_data Works
Keras treats validation as a separate evaluation pass that runs at the end of each epoch. The important distinction is that validation_data can come from an iterator-like source, while validation_split only works when training data is already in memory as arrays or tensors.
In practice, validation_data can be:
- a tuple like (x_val, y_val)
- a tf.data.Dataset
- a Python generator that yields batches
- a keras.utils.Sequence object
That means the answer to the title question is yes, with one condition: Keras must know how many validation batches to consume. If the generator is finite and ends naturally, Keras can read until exhaustion, although a plain Python generator object can only be iterated once, so it is a poor fit for training over several epochs. If it is effectively endless, you must set validation_steps.
A Simple Generator Example
The most direct approach is a Python generator that yields batches of features and labels.
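A minimal sketch of that approach follows. The helper name batch_generator, the random toy data, and the tiny model are assumptions made for illustration and are not from the original article. The generator loops forever, so both steps_per_epoch and validation_steps are set explicitly.

```python
import numpy as np
import tensorflow as tf


def batch_generator(features, labels, batch_size):
    """Yield (inputs, targets) tuples forever, cycling over the arrays."""
    n = len(features)
    while True:
        for start in range(0, n, batch_size):
            end = start + batch_size
            yield features[start:end], labels[start:end]


# Toy data, purely illustrative.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(64, 4)).astype("float32")
y_train = rng.integers(0, 2, size=(64, 1)).astype("float32")
x_val = rng.normal(size=(32, 4)).astype("float32")
y_val = rng.integers(0, 2, size=(32, 1)).astype("float32")

batch_size = 16

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(
    batch_generator(x_train, y_train, batch_size),
    steps_per_epoch=len(x_train) // batch_size,    # batches per training epoch
    validation_data=batch_generator(x_val, y_val, batch_size),
    validation_steps=len(x_val) // batch_size,     # batches per validation pass
    epochs=2,
    verbose=0,
)

print(sorted(history.history.keys()))
```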
This works because the generator returns the exact structure Keras expects: one batch of inputs and one batch of targets on each iteration.
Why Sequence Is Often Better
Plain Python generators are valid, but keras.utils.Sequence is usually the safer choice for production code. A Sequence knows its length, supports deterministic indexing, and integrates better with worker processes. It also makes the number of validation batches explicit, which reduces edge cases during training.
For most hand-written pipelines, Sequence gives the convenience of a generator without the ambiguity of an unbounded iterator.
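As a rough sketch of the same idea with a Sequence, the class below wraps two in-memory arrays. The class name ArrayBatches and the toy data are hypothetical. Calling super().__init__(**kwargs) is harmless on older releases and is expected by newer Keras versions, where Sequence accepts worker-related keyword arguments.

```python
import math

import numpy as np
import tensorflow as tf


class ArrayBatches(tf.keras.utils.Sequence):
    """Hypothetical Sequence that serves fixed-size batches from two arrays."""

    def __init__(self, features, labels, batch_size, **kwargs):
        super().__init__(**kwargs)
        self.features = features
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per pass; the last batch may be smaller.
        return math.ceil(len(self.features) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        end = start + self.batch_size
        return self.features[start:end], self.labels[start:end]


# Toy data, purely illustrative.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(64, 4)).astype("float32")
y_train = rng.integers(0, 2, size=(64, 1)).astype("float32")
x_val = rng.normal(size=(40, 4)).astype("float32")
y_val = rng.integers(0, 2, size=(40, 1)).astype("float32")

train_seq = ArrayBatches(x_train, y_train, batch_size=16)
val_seq = ArrayBatches(x_val, y_val, batch_size=16)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# No validation_steps needed: the length comes from __len__.
history = model.fit(train_seq, validation_data=val_seq, epochs=2, verbose=0)
```

Because the batch count comes from __len__, the validation pass covers every sample exactly once per epoch, including the smaller final batch.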
When to Use tf.data Instead
If you are already using TensorFlow 2.x idioms, tf.data.Dataset is often the cleanest solution. It makes batching, caching, shuffling, and prefetching explicit, and it is usually easier to optimize than a custom generator.
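The sketch below shows one way this might look, again with illustrative toy data and a tiny model that are not from the original article. It assumes a TensorFlow release that exposes tf.data.AUTOTUNE; older 2.x releases spell it tf.data.experimental.AUTOTUNE.

```python
import numpy as np
import tensorflow as tf

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(64, 4)).astype("float32")
y_train = rng.integers(0, 2, size=(64, 1)).astype("float32")
x_val = rng.normal(size=(32, 4)).astype("float32")
y_val = rng.integers(0, 2, size=(32, 1)).astype("float32")

# Training pipeline: shuffle each epoch, then batch and prefetch.
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(buffer_size=len(x_train))
    .batch(16)
    .prefetch(tf.data.AUTOTUNE)
)

# Validation pipeline: no shuffling, so every epoch sees identical batches.
val_ds = (
    tf.data.Dataset.from_tensor_slices((x_val, y_val))
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# A finite dataset is re-iterated every epoch, so validation_steps is optional.
history = model.fit(train_ds, validation_data=val_ds, epochs=2, verbose=0)
```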
The key design choice is not whether validation data must be an in-memory array. It does not. The real choice is which iterable API is easiest to reason about and maintain.
Common Pitfalls
Using validation_split with a generator does not work because Keras cannot split a streaming source the same way it can split a NumPy array.
Forgetting validation_steps on an endless validation generator can cause validation to run forever at the end of an epoch.
Applying random augmentation to validation batches can make metrics noisy and hard to compare between epochs. Validation data should usually be deterministic.
Returning the wrong tuple shape from the generator, such as inputs without labels, causes confusing runtime errors during fit.
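One way to catch the last pitfall early is to pull a single batch from the generator and inspect it before calling fit. The helper check_validation_batch below is a hypothetical name, not a Keras API, and it assumes plain array inputs rather than dictionaries or nested structures.

```python
import numpy as np


def check_validation_batch(batch):
    """Raise a readable error unless batch looks like (inputs, targets[, weights])."""
    if not isinstance(batch, tuple) or len(batch) not in (2, 3):
        raise TypeError(
            "expected a tuple (inputs, targets) or "
            "(inputs, targets, sample_weights), "
            f"got {type(batch).__name__}"
        )
    inputs, targets = batch[0], batch[1]
    if len(inputs) != len(targets):
        raise ValueError(
            f"batch size mismatch: {len(inputs)} inputs vs {len(targets)} targets"
        )
    return np.shape(inputs), np.shape(targets)


def good_generator():
    # Yields an (inputs, targets) tuple, the structure fit expects.
    while True:
        yield np.zeros((8, 4), dtype="float32"), np.zeros((8, 1), dtype="float32")


def bad_generator():
    # Yields inputs only; the labels are missing.
    while True:
        yield np.zeros((8, 4), dtype="float32")


print(check_validation_batch(next(good_generator())))

try:
    check_validation_batch(next(bad_generator()))
except TypeError as err:
    print("rejected:", err)
```

Calling next consumes one batch, so it is simplest to run the check on a fresh generator and create another one for fit.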
Summary
- validation_data in tensorflow.keras can be a Python generator, Sequence, or tf.data.Dataset.
- validation_split is different and only works with in-memory data.
- Use validation_steps when the validation iterator does not naturally terminate.
- Prefer keras.utils.Sequence or tf.data when you want clearer behavior and easier maintenance.
- Keep validation preprocessing stable so reported metrics are meaningful across epochs.

