How can I combine ImageDataGenerator with TensorFlow datasets in TF2?

Introduction

You can combine ImageDataGenerator with a tf.data.Dataset in TensorFlow 2, but it is usually a compatibility workaround rather than the best design. ImageDataGenerator comes from the older Keras preprocessing workflow and expects NumPy arrays, while tf.data is designed for tensor-native pipelines. In modern TF2 code, the cleaner answer is often to skip ImageDataGenerator and do augmentation with tf.image or Keras preprocessing layers instead.

Why the Two APIs Do Not Fit Naturally

The awkward part is the data model. ImageDataGenerator wants Python-side arrays and yields batches from a generator. tf.data.Dataset wants tensor transformations that can be batched, parallelized, prefetched, and optimized inside TensorFlow.

That creates two realistic paths:

  • bridge legacy augmentation into tf.data
  • keep the entire pipeline native to TensorFlow

The first path is possible. The second is usually the better long-term answer.

Bridge a Dataset Item Through tf.numpy_function

If you already have a working ImageDataGenerator configuration and need to keep it for compatibility, you can wrap it with tf.numpy_function. The idea is to hand each dataset element to Python as a NumPy array, let the generator transform it, then return the result to the pipeline as a tensor.

python
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=20,
    horizontal_flip=True,
    rescale=1.0 / 255.0,
)


def augment_numpy(image, label):
    # Runs in plain Python: flow() expects a 4-D batch, so add a batch axis.
    batch = np.expand_dims(image, axis=0)
    augmented = next(augmenter.flow(batch, batch_size=1, shuffle=False))[0]
    return augmented.astype(np.float32), label


def augment_tf(image, label):
    # Wrap the NumPy function so it can be used inside Dataset.map.
    image, label = tf.numpy_function(
        augment_numpy,
        [image, label],
        [tf.float32, label.dtype],
    )
    # tf.numpy_function drops static shapes, so restore what is known.
    image.set_shape([None, None, 3])
    label.set_shape([])
    return image, label


ds = tfds.load("tf_flowers", split="train", as_supervised=True)
ds = ds.map(augment_tf)

This works, but it has real costs. You cross the Python boundary, lose some graph optimizations, and often have to restore static shapes manually with set_shape.
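
The shape loss is easy to see on a toy dataset. The sketch below is only an illustration: the helper names (scale_numpy, scale_tf, scale_tf_with_shape) and the 8x8 image size are made up for this example, and a simple rescale stands in for the real augmentation.

```python
import numpy as np
import tensorflow as tf


def scale_numpy(image):
    # Ordinary NumPy code that runs outside the TensorFlow graph.
    return (image / 255.0).astype(np.float32)


def scale_tf(image):
    # The returned tensor has no static shape information.
    return tf.numpy_function(scale_numpy, [image], tf.float32)


def scale_tf_with_shape(image):
    out = tf.numpy_function(scale_numpy, [image], tf.float32)
    # Restore the shape we know the NumPy function produces.
    out.set_shape([8, 8, 3])
    return out


# Hypothetical input: four blank 8x8 RGB images.
images = np.zeros((4, 8, 8, 3), dtype=np.uint8)
ds = tf.data.Dataset.from_tensor_slices(images)

unknown = ds.map(scale_tf)
known = ds.map(scale_tf_with_shape)

print(unknown.element_spec.shape)  # shape with unknown rank
print(known.element_spec.shape)    # fully defined shape
```

Layers that need to know the input shape when they are built will typically reject elements from the first dataset, which is why the set_shape call matters in practice.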

Another Bridge Pattern: Build the Dataset From a Generator

If the legacy generator already owns the batching and augmentation flow, another option is to wrap that generator directly with Dataset.from_generator.

python
import numpy as np
import tensorflow as tf

# Dummy data: 16 RGB images of 64x64 pixels with binary labels.
x = np.random.randint(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)
y = np.random.randint(0, 2, size=(16,), dtype=np.int32)

augmenter = tf.keras.preprocessing.image.ImageDataGenerator(rotation_range=10)
flow = augmenter.flow(x, y, batch_size=4)

# flow yields (float32 image batch, int32 label batch) pairs.
output_signature = (
    tf.TensorSpec(shape=(None, 64, 64, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

# flow never stops iterating, so this dataset is infinite; bound it with
# ds.take(len(flow)) or pass steps_per_epoch=len(flow) when training.
ds = tf.data.Dataset.from_generator(lambda: flow, output_signature=output_signature)

This is workable when the generator is the source of truth, but it still keeps the pipeline in Python-land rather than using TensorFlow-native augmentation.

The Preferred TF2 Approach

In TensorFlow 2, the cleaner solution is usually to keep augmentation inside the dataset pipeline or inside the model with preprocessing layers.

python
import tensorflow as tf
import tensorflow_datasets as tfds


def preprocess(image, label):
    image = tf.image.resize(image, [224, 224])
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    # The brightness shift can push values outside [0, 1], so clip them back.
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label


ds = tfds.load("tf_flowers", split="train", as_supervised=True)
ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)

This keeps the pipeline tensor-native, which usually means better throughput and fewer shape headaches.

Keras preprocessing layers are another strong option.

python
import tensorflow as tf

augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

Those layers can live in the model itself, which keeps training-time augmentation close to the network definition.
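
As a rough sketch of what living in the model can look like, the example below places the same two layers in front of a small classifier. The network itself (one Conv2D layer, 64x64 inputs, 5 output classes) is an arbitrary assumption chosen only to keep the example short.

```python
import numpy as np
import tensorflow as tf

augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

# Hypothetical small classifier; only the position of `augmentation` matters.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    augmentation,
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5),
])

images = np.random.rand(2, 64, 64, 3).astype("float32")

# The random layers only transform their input when training=True.
passthrough = augmentation(images, training=False)
augmented = augmentation(images, training=True)
logits = model(images, training=False)
```

Because the random layers pass inputs through unchanged when training is False, the same model can be used for evaluation and inference without a separate, augmentation-free input pipeline.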

How To Choose

Use the bridge only when you already depend on ImageDataGenerator behavior and do not want to rewrite the pipeline immediately. Use native tf.data or preprocessing layers for new projects or when you are already refactoring the input pipeline.

That is the practical distinction. This is less about what is theoretically possible and more about where you want your data pipeline to live.

Common Pitfalls

  • Expecting ImageDataGenerator to consume tensors from a tf.data.Dataset directly without a compatibility layer.
  • Using tf.numpy_function and then forgetting to restore static shape information.
  • Keeping Python-side augmentation in a performance-sensitive pipeline and then wondering why throughput drops.
  • Mixing batch-level generator behavior with item-level dataset mapping without being clear which layer owns batching.
  • Carrying forward legacy preprocessing code when native TF2 augmentation would be simpler and easier to maintain.
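
The batching pitfall can be illustrated with a stand-in generator. In the sketch below, legacy_batches is a hypothetical placeholder for a legacy generator that already yields batches of 4; one way to hand batching back to tf.data is to flatten those batches with unbatch() before choosing a new batch size.

```python
import numpy as np
import tensorflow as tf


def legacy_batches():
    # Hypothetical stand-in for a legacy generator that owns batching:
    # it yields three batches of 4 images with their labels.
    for start in range(0, 12, 4):
        images = np.full((4, 8, 8, 3), start, dtype=np.float32)
        labels = np.arange(start, start + 4, dtype=np.int32)
        yield images, labels


signature = (
    tf.TensorSpec(shape=(None, 8, 8, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

batched = tf.data.Dataset.from_generator(legacy_batches, output_signature=signature)

# Calling batched.batch(...) directly would stack whole batches and add an
# extra leading dimension. Flatten to single items first, then let tf.data
# own the batch size.
items = batched.unbatch()
rebatched = items.batch(6)

shapes = [tuple(images.shape) for images, _ in rebatched]
print(shapes)
```

After unbatch(), item-level operations such as shuffle and map see single examples again, so it is clear that tf.data, not the generator, decides the final batch size.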

Summary

  • ImageDataGenerator can be combined with tf.data, but the fit is awkward because the APIs were designed for different execution models.
  • The usual bridge is tf.numpy_function or Dataset.from_generator.
  • Those bridges work, but they give up some TensorFlow pipeline advantages.
  • In modern TF2 code, native tf.image transforms or Keras preprocessing layers are usually the better solution.
  • Use ImageDataGenerator bridging mainly as a migration step, not as the preferred architecture for new pipelines.
