Could validation data be a generator in tensorflow.keras 2.0?

Introduction

Yes, tensorflow.keras can use generator-style input for validation data. This is useful when the validation set is too large to keep in memory or when you already have a batch pipeline for both training and evaluation.

How validation_data Works

Keras treats validation as a separate evaluation pass that runs at the end of each epoch. The important distinction is that validation_data can come from an iterator-like source, while validation_split only works when training data is already in memory as arrays or tensors.

In practice, validation_data can be:

  • a tuple like (x_val, y_val)
  • a tf.data.Dataset
  • a Python generator that yields batches
  • a keras.utils.Sequence object

So the answer to the question in the title is yes, with one condition: Keras must know how many validation batches to consume. A finite source that can be restarted, such as a Sequence or a tf.data.Dataset, is simply read to the end on each validation pass. A plain Python generator object can be iterated only once, so in practice it is written to loop forever, and then you must set validation_steps.
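
To make the batch-count condition concrete, here is a minimal sketch in plain Python with no Keras involved. The helper name finite_batches and the sizes (200 samples, batches of 32) are illustrative assumptions, not part of any Keras API.

```python
import math


def finite_batches(n_samples, batch_size):
    """Hypothetical helper: yield (start, end) index pairs once, then stop."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


gen = finite_batches(n_samples=200, batch_size=32)

first_pass = list(gen)   # consumes every batch
second_pass = list(gen)  # empty: a generator object cannot be rewound

# Number of batches that covers the whole validation set exactly once.
validation_steps = math.ceil(200 / 32)

print(len(first_pass), len(second_pass), validation_steps)  # 7 0 7
```

The empty second pass is the reason hand-written generators usually wrap their loop in while True and rely on validation_steps to mark where one validation pass ends.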

A Simple Generator Example

The most direct approach is a Python generator that yields batches of features and labels.

python
import math
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data: the label is 1 when the row sum exceeds 5.
x_train = np.random.rand(1000, 10).astype("float32")
y_train = (x_train.sum(axis=1) > 5).astype("float32")

x_val = np.random.rand(200, 10).astype("float32")
y_val = (x_val.sum(axis=1) > 5).astype("float32")


def batch_generator(x, y, batch_size):
    # Loop forever so the same generator can serve every epoch.
    while True:
        for start in range(0, len(x), batch_size):
            end = start + batch_size
            yield x[start:end], y[start:end]


model = tf.keras.Sequential(
    [
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

train_gen = batch_generator(x_train, y_train, batch_size=32)
val_gen = batch_generator(x_val, y_val, batch_size=32)

# Both generators are endless, so Keras needs explicit step counts.
model.fit(
    train_gen,
    steps_per_epoch=math.ceil(len(x_train) / 32),
    epochs=3,
    validation_data=val_gen,
    validation_steps=math.ceil(len(x_val) / 32),
)

This works because the generator returns the exact structure Keras expects: one batch of inputs and one batch of targets on each iteration.
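
Before handing a generator to fit, it can help to pull one batch and inspect it. The sketch below uses only NumPy; describe_batch is a hypothetical helper written for this illustration, not a Keras function, and it assumes batches are tuples of array-like objects.

```python
import numpy as np


def batch_generator(x, y, batch_size):
    # Same looping generator as in the example above.
    while True:
        for start in range(0, len(x), batch_size):
            end = start + batch_size
            yield x[start:end], y[start:end]


def describe_batch(batch):
    """Hypothetical helper: return the shape of each element of one batch."""
    if not isinstance(batch, tuple) or len(batch) not in (2, 3):
        raise ValueError(
            "expected (inputs, targets) or (inputs, targets, sample_weights)"
        )
    return tuple(np.asarray(part).shape for part in batch)


x_val = np.random.rand(200, 10).astype("float32")
y_val = (x_val.sum(axis=1) > 5).astype("float32")

shapes = describe_batch(next(batch_generator(x_val, y_val, batch_size=32)))
print(shapes)  # ((32, 10), (32,))
```

A check like this catches a missing label array or an unexpected shape before training starts, when the error is still easy to trace.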

Why Sequence Is Often Better

Plain Python generators are valid, but keras.utils.Sequence is usually the safer choice for production code. A Sequence knows its length, supports deterministic indexing, and integrates better with worker processes. It also makes the number of validation batches explicit, which reduces edge cases during training.

python
import math
import numpy as np
import tensorflow as tf


class ArraySequence(tf.keras.utils.Sequence):
    def __init__(self, x, y, batch_size, **kwargs):
        # Newer Keras releases expect the base constructor to be called;
        # the call is harmless on older versions.
        super().__init__(**kwargs)
        self.x = x
        self.y = y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch, counting a final partial batch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, index):
        # Return batch number `index` as an (inputs, targets) pair.
        start = index * self.batch_size
        end = start + self.batch_size
        return self.x[start:end], self.y[start:end]


x = np.random.rand(256, 8).astype("float32")
y = (x.mean(axis=1) > 0.5).astype("float32")

train_seq = ArraySequence(x[:200], y[:200], batch_size=16)
val_seq = ArraySequence(x[200:], y[200:], batch_size=16)

model = tf.keras.Sequential(
    [
        tf.keras.layers.Input(shape=(8,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(optimizer="adam", loss="binary_crossentropy")
# No validation_steps needed: the Sequence reports its own length.
model.fit(train_seq, epochs=2, validation_data=val_seq)

For most hand-written pipelines, Sequence gives the convenience of a generator without the ambiguity of an unbounded iterator.

When to Use tf.data Instead

If you are already using TensorFlow 2.x idioms, tf.data.Dataset is often the cleanest solution. It makes batching, caching, shuffling, and prefetching explicit, and it is usually easier to optimize than a custom generator.

python
import tensorflow as tf

features = tf.random.uniform((300, 6))
labels = tf.cast(tf.reduce_sum(features, axis=1) > 3.0, tf.float32)

# Training pipeline: shuffle, batch, and prefetch.
train_ds = tf.data.Dataset.from_tensor_slices((features[:240], labels[:240]))
train_ds = train_ds.shuffle(240).batch(32).prefetch(tf.data.AUTOTUNE)

# Validation pipeline: batch only, so every epoch sees the same data in the same order.
val_ds = tf.data.Dataset.from_tensor_slices((features[240:], labels[240:]))
val_ds = val_ds.batch(32)

model = tf.keras.Sequential(
    [
        tf.keras.layers.Input(shape=(6,)),
        tf.keras.layers.Dense(12, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# The dataset is finite, so Keras reads it to the end on each validation pass.
model.fit(train_ds, epochs=2, validation_data=val_ds)

Validation data does not have to be an in-memory array. The real design choice is which iterable API is easiest to reason about and maintain.

Common Pitfalls

Using validation_split with a generator does not work because Keras cannot split a streaming source the same way it can split a NumPy array.
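
One workaround, sketched here under the assumption that the raw arrays (or at least their indices) fit in memory, is to split by hand first and then build a separate generator, Sequence, or Dataset for each part. The 20% fraction and the fixed seed are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible
x = rng.random((1000, 10), dtype=np.float32)
y = (x.sum(axis=1) > 5).astype("float32")

val_fraction = 0.2  # illustrative value
order = rng.permutation(len(x))  # shuffled row indices
n_val = int(val_fraction * len(x))

val_idx, train_idx = order[:n_val], order[n_val:]
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]

print(x_train.shape, x_val.shape)  # (800, 10) (200, 10)
```

This sketch shuffles before splitting, which differs from validation_split: Keras takes the validation samples from the end of the arrays without shuffling them first.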

Forgetting validation_steps on an endless validation generator can cause validation to run forever at the end of an epoch.

Applying random augmentation to validation batches can make metrics noisy and hard to compare between epochs. Validation data should usually be deterministic.
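
A simple way to keep validation deterministic is to make augmentation an explicit switch on the generator. The sketch below is a hypothetical example: the batches function, its augment flag, and the Gaussian noise used as "augmentation" are all assumptions made for illustration.

```python
import numpy as np


def batches(x, y, batch_size, augment=False, seed=0):
    """Hypothetical generator: adds small Gaussian noise only when augment=True."""
    rng = np.random.default_rng(seed)
    while True:
        for start in range(0, len(x), batch_size):
            xb = x[start:start + batch_size]
            yb = y[start:start + batch_size]
            if augment:
                noise = rng.normal(0.0, 0.01, size=xb.shape).astype(xb.dtype)
                xb = xb + noise
            yield xb, yb


x = np.random.rand(8, 3).astype("float32")
y = np.zeros(8, dtype="float32")

train_gen = batches(x, y, batch_size=4, augment=True)
val_gen = batches(x, y, batch_size=4)  # no augmentation for validation

# Two batches cover the 8 samples once, so each list is one validation pass.
val_first_epoch = [next(val_gen)[0] for _ in range(2)]
val_second_epoch = [next(val_gen)[0] for _ in range(2)]

same = all(
    np.array_equal(a, b) for a, b in zip(val_first_epoch, val_second_epoch)
)
print(same)  # True
```

Because the validation generator never touches the random number generator, every validation pass sees identical inputs and the reported metrics stay comparable between epochs.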

Returning the wrong tuple shape from the generator, such as inputs without labels, causes confusing runtime errors during fit.

Summary

  • validation_data in tensorflow.keras can be a Python generator, Sequence, or tf.data.Dataset.
  • validation_split is different and only works with in-memory data.
  • Use validation_steps when the validation iterator does not naturally terminate.
  • Prefer keras.utils.Sequence or tf.data when you want clearer behavior and easier maintenance.
  • Keep validation preprocessing stable so reported metrics are meaningful across epochs.
