
Keras flow_from_directory: Getting File Names as Batches Are Generated


Introduction

When flow_from_directory yields batches, it returns image tensors and labels, but it does not return the file names as a third value by default. The useful detail is that the iterator already knows every path it will load, so you can recover batch file names if you understand how the generator indexes its samples.

What flow_from_directory Returns

ImageDataGenerator.flow_from_directory(...) creates a DirectoryIterator. That iterator scans the directory tree once, builds a stable list of samples, and then yields batches of images and labels.

A basic setup looks like this:

python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255.0)

generator = datagen.flow_from_directory(
    "data/train",
    target_size=(224, 224),
    batch_size=4,
    class_mode="categorical",
    shuffle=False
)

The important point is that the iterator already stores metadata about all discovered files. In many Keras versions, you can inspect:

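  • generator.filenames: paths relative to the directory you passed in
  • generator.filepaths: full paths, including that directory
  • generator.classes: the integer class index of each sample, in the same order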

If you only want the full list once, that is enough:

python
print(generator.filepaths[:5])
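
To see how these attributes line up, the sketch below pairs each path with a readable class name. It is written against plain lists so it runs without a dataset. label_samples is a hypothetical helper, the sample values are made up, and class_indices is assumed to be the name-to-index dictionary the iterator exposes as generator.class_indices.

```python
def label_samples(filepaths, classes, class_indices):
    """Pair every file path with its class name.

    filepaths     -- list of paths, e.g. generator.filepaths
    classes       -- integer class index per sample, e.g. generator.classes
    class_indices -- dict of class name -> index, e.g. generator.class_indices
    """
    # Invert the mapping so an integer index can be turned back into a name.
    index_to_name = {index: name for name, index in class_indices.items()}
    return [(path, index_to_name[int(c)]) for path, c in zip(filepaths, classes)]


# Stand-in values shaped like the iterator's attributes (made up for illustration).
example_paths = [
    "data/train/cats/a.jpg",
    "data/train/cats/b.jpg",
    "data/train/dogs/c.jpg",
]
example_classes = [0, 0, 1]
example_class_indices = {"cats": 0, "dogs": 1}

labeled = label_samples(example_paths, example_classes, example_class_indices)
for path, name in labeled:
    print(name, path)
```

With a real iterator you would pass generator.filepaths, generator.classes, and generator.class_indices instead of the stand-in lists.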

Getting File Names for the Current Batch

The trickier case is matching file names to the batch you just pulled with next(generator). The simplest way is to disable shuffling and derive the slice from the batch position.

python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255.0)
generator = datagen.flow_from_directory(
    "data/train",
    target_size=(224, 224),
    batch_size=4,
    class_mode="categorical",
    shuffle=False
)

images, labels = next(generator)

# batch_index already points at the next batch and wraps to 0 after the
# last batch of an epoch, so step back one batch modulo the epoch length.
batch_number = (generator.batch_index - 1) % len(generator)
start = batch_number * generator.batch_size
end = start + len(images)
batch_paths = generator.filepaths[start:end]

print(batch_paths)

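This works because, with shuffle=False, the generator walks the samples in their stored order, so batch k always covers filepaths[k * batch_size : (k + 1) * batch_size]. The - 1 is needed because batch_index has already advanced by the time next(generator) returns.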
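
The slice arithmetic needs care at the end of an epoch, where the iterator's batch counter goes back to zero. The following is a minimal sketch of that bookkeeping as a pure function, so it can be checked without loading images. last_batch_bounds is a hypothetical helper, and it assumes batch_index behaves as described here: read after next(), already advanced, and reset to 0 after the final batch of an epoch.

```python
import math


def last_batch_bounds(batch_index, batch_size, n_samples):
    """Return (start, end) of the batch that was just yielded.

    batch_index is assumed to be the value read *after* calling next(),
    so it already points at the upcoming batch and is 0 after the final
    batch of an epoch.
    """
    batches_per_epoch = math.ceil(n_samples / batch_size)
    # Step back one batch; the modulo turns -1 into the last batch of the epoch.
    batch_number = (batch_index - 1) % batches_per_epoch
    start = batch_number * batch_size
    end = min(start + batch_size, n_samples)
    return start, end


# 10 samples in batches of 4: the batches cover [0:4], [4:8], and [8:10].
# The counter reads 1, 2, and then 0 after each of those batches.
bounds = [last_batch_bounds(i, 4, 10) for i in (1, 2, 0)]
print(bounds)
```

The last entry is the case a plain (batch_index - 1) * batch_size gets wrong, because the counter has wrapped and the start would come out negative.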

What Happens When shuffle=True

Once shuffling is enabled, the simple slice approach is no longer enough. The batch order is driven by an index array rather than the original file order.

In that case, a practical solution is to subclass the iterator so you can capture the exact index_array used for each batch:

python
from tensorflow.keras.preprocessing.image import ImageDataGenerator, DirectoryIterator


class DirectoryIteratorWithPaths(DirectoryIterator):
    def _get_batches_of_transformed_samples(self, index_array):
        # index_array holds the sample indices chosen for this batch,
        # already shuffled when shuffle=True.
        batch_x, batch_y = super()._get_batches_of_transformed_samples(index_array)
        batch_paths = [self.filepaths[i] for i in index_array]
        return batch_x, batch_y, batch_paths


datagen = ImageDataGenerator(rescale=1.0 / 255.0)

iterator = DirectoryIteratorWithPaths(
    "data/train",
    datagen,
    target_size=(224, 224),
    batch_size=4,
    class_mode="categorical",
    shuffle=True
)

# Use this iterator for inspection or a custom loop: Keras training loops
# read a third tuple element as sample weights, not as file paths.
images, labels, paths = next(iterator)
print(paths)

That gives you the actual file paths associated with the batch currently being generated.
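
If you would rather not subclass, a wrapper can do the same job. The sketch below is one possible shape for it, not an official Keras API. batches_with_paths is a hypothetical helper, and it assumes the iterator exposes index_generator, filepaths, and _get_batches_of_transformed_samples, which are internal details that may differ between Keras versions. FakeIterator is a stand-in so the example runs without TensorFlow or image files.

```python
def batches_with_paths(iterator):
    """Yield (batch, paths) pairs from a Keras-style directory iterator.

    Assumes the iterator exposes `index_generator`, `filepaths`, and
    `_get_batches_of_transformed_samples`. These are internal details,
    so treat this as a sketch rather than a stable interface.
    """
    for index_array in iterator.index_generator:
        batch = iterator._get_batches_of_transformed_samples(index_array)
        paths = [iterator.filepaths[i] for i in index_array]
        yield batch, paths


class FakeIterator:
    """Tiny stand-in so the sketch runs without TensorFlow or image files."""

    def __init__(self):
        self.filepaths = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
        # One pre-shuffled epoch, split into two batches of two samples.
        self.index_generator = iter([[2, 0], [3, 1]])

    def _get_batches_of_transformed_samples(self, index_array):
        # A real iterator would load and augment images here.
        return [f"tensor-{i}" for i in index_array]


collected = list(batches_with_paths(FakeIterator()))
for batch, paths in collected:
    print(paths, batch)
```

A real iterator's index generator is expected to loop indefinitely, so with real data you would stop after len(iterator) batches instead of calling list() on the wrapper.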

When You Only Need Predictions Paired with Names

A very common use case is inference rather than training. In that situation, the cleanest answer is often:

  • set shuffle=False
  • run prediction
  • pair outputs with generator.filepaths

python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import pandas as pd

datagen = ImageDataGenerator(rescale=1.0 / 255.0)
generator = datagen.flow_from_directory(
    "data/test",
    target_size=(224, 224),
    batch_size=8,
    class_mode=None,   # yield images only, no labels
    shuffle=False      # keep predictions in the same order as filepaths
)

# `model` is assumed to be an already trained Keras model.
predictions = model.predict(generator)

results = pd.DataFrame({
    "file": generator.filepaths,
    "score": predictions[:, 0]   # first output column of each prediction
})

print(results.head())

That is usually better than trying to print file names during every training step.
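
For a multi-class model you usually want the winning class name next to each file rather than a single score column. The following is a small sketch of that pairing using plain Python sequences. top_prediction_per_file is a hypothetical helper, the sample scores are made up, and class_indices is assumed to be a name-to-index dictionary such as the one a training generator exposes.

```python
def top_prediction_per_file(filepaths, predictions, class_indices):
    """Return one (path, class name, score) row per file.

    predictions is assumed to be a sequence of per-class score rows in the
    same order as filepaths, which holds only when shuffle=False.
    """
    if len(filepaths) != len(predictions):
        raise ValueError("predictions and filepaths must have the same length")
    index_to_name = {index: name for name, index in class_indices.items()}
    rows = []
    for path, scores in zip(filepaths, predictions):
        scores = list(scores)
        # Index of the highest score in this row.
        best = max(range(len(scores)), key=scores.__getitem__)
        rows.append((path, index_to_name[best], scores[best]))
    return rows


# Made-up values standing in for generator.filepaths and model.predict output.
sample_paths = ["data/test/cats/a.jpg", "data/test/dogs/b.jpg"]
sample_predictions = [[0.9, 0.1], [0.2, 0.8]]
sample_class_indices = {"cats": 0, "dogs": 1}

rows = top_prediction_per_file(sample_paths, sample_predictions, sample_class_indices)
for row in rows:
    print(row)
```

The length check is there because a mismatch between the two lists is the usual sign that the generator and the predictions came from different runs.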

A Note on Current Keras Practice

flow_from_directory is still common in older codebases, but many newer Keras workflows prefer image_dataset_from_directory or custom tf.data pipelines. Those APIs can be easier to extend when you need sample metadata such as file paths. Still, if you are already using flow_from_directory, the iterator metadata is enough for most filename-tracking needs.

Common Pitfalls

  • Forgetting that shuffle=True breaks the assumption that a batch maps to a straight slice of filepaths.
  • Using generator.filenames when you actually need full paths from generator.filepaths.
  • Relying on batch_index math without testing what happens at epoch boundaries.
  • Logging source file names during heavily augmented training without remembering that the tensor is transformed while the path is not.
  • Adding per-batch filename plumbing when a simple shuffle=False prediction pass would solve the real problem more cleanly.

Summary

  • flow_from_directory does not yield file names by default, but the iterator stores them internally
  • Use generator.filepaths or generator.filenames when you only need the sample list
  • With shuffle=False, you can map a batch to a slice of file paths
  • With shuffle=True, a subclass or wrapper is the safest way to capture the current batch paths
  • For prediction workflows, pairing model.predict(...) with generator.filepaths is often the cleanest design

