Loading Folders of Images in TensorFlow

Introduction

TensorFlow can load image datasets directly from folders, which is convenient for classification projects where each subdirectory represents a class. The key is to organize the files predictably and build a dataset pipeline that handles resizing, batching, and prefetching efficiently.

Use image_dataset_from_directory for Standard Folder Layouts

The easiest entry point is tf.keras.utils.image_dataset_from_directory. It expects a directory structure like this:

text
dataset/
  cats/
    cat1.jpg
    cat2.jpg
  dogs/
    dog1.jpg
    dog2.jpg

Each subfolder name becomes a class label when labels="inferred" is used.
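
As a rough illustration of that inference rule, the sketch below uses only the standard library to list the subdirectories of a dataset root and number them in sorted order, which mirrors the alphanumeric ordering applied to inferred class names. The helper name `infer_class_names` and the temporary folder are made up for this example; TensorFlow's own implementation handles more cases.

```python
import tempfile
from pathlib import Path


def infer_class_names(root):
    """Return subdirectory names in sorted order (hypothetical helper)."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


# Build a throwaway layout with the same class folders as the tree above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dogs", "cats"):
        (Path(tmp) / name).mkdir()
    class_names = infer_class_names(tmp)
    # The position in the sorted list becomes the integer label.
    class_to_index = {name: i for i, name in enumerate(class_names)}

print(class_names)
print(class_to_index)
```

The creation order of the folders does not matter here; only the sorted names decide which class gets which index.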

python
import tensorflow as tf

# Training subset: 80% of the files under "dataset".
train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset",
    labels="inferred",
    label_mode="int",
    image_size=(224, 224),
    batch_size=32,
    validation_split=0.2,
    subset="training",
    seed=42,
)

# Validation subset: use the same seed and split so the two subsets do not overlap.
val_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset",
    labels="inferred",
    label_mode="int",
    image_size=(224, 224),
    batch_size=32,
    validation_split=0.2,
    subset="validation",
    seed=42,
)

# Class names in label-index order.
print(train_ds.class_names)

This produces a tf.data.Dataset that yields image batches and integer labels ready for model training.

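If you prefer one-hot labels for a softmax classifier, set label_mode="categorical" and train with a categorical cross-entropy loss; the integer labels from label_mode="int" pair with sparse categorical cross-entropy instead. For binary classification with a single output unit, label_mode="binary" can be more convenient.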
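
To make the difference concrete, here is a small sketch with made-up labels and predictions rather than a real dataset. It shows the label shapes the two modes correspond to and that each pairs with a different Keras loss. The numbers are illustrative only.

```python
import tensorflow as tf

num_classes = 3
int_labels = tf.constant([0, 2, 1])                   # shape (3,), like label_mode="int"
onehot_labels = tf.one_hot(int_labels, num_classes)   # shape (3, 3), like label_mode="categorical"

# Made-up softmax outputs for three samples.
probs = tf.constant([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Integer labels go with the sparse loss, one-hot labels with the dense one.
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy()(int_labels, probs)
dense_loss = tf.keras.losses.CategoricalCrossentropy()(onehot_labels, probs)

print(onehot_labels.shape)
print(float(sparse_loss), float(dense_loss))
```

Both losses measure the same quantity here, so the choice of label mode is mostly about which format the rest of your training code expects.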

Add Performance Steps to the Pipeline

Once the dataset is created, improve throughput with caching and prefetching.

python
AUTOTUNE = tf.data.AUTOTUNE

# The dataset is already batched, so shuffle(1000) reorders whole batches.
train_ds = train_ds.cache().shuffle(1000).prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)

prefetch() overlaps CPU-side loading and preprocessing with model execution, while cache() keeps the loaded images after the first epoch so later epochs skip disk reads and decoding. For small and medium datasets, cache() can make repeated epochs much faster. For very large datasets, an in-memory cache may consume too much memory, so profile the real workload instead of assuming it is always beneficial.
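
When the data does not fit in memory, one option is to give cache() a file path so the cache is written to disk instead. The sketch below uses a synthetic `Dataset.range` in place of real images, and the cache location is a temporary directory chosen purely for the example.

```python
import os
import tempfile

import tensorflow as tf

# Synthetic stand-in for an image dataset: ten integer elements.
ds = tf.data.Dataset.range(10)

cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "train_cache")  # hypothetical location

# Passing a filename makes cache() store elements on disk rather than in RAM.
ds = ds.cache(cache_path).shuffle(10).prefetch(tf.data.AUTOTUNE)

first_epoch = sorted(int(x) for x in ds)   # the first full pass fills the cache
second_epoch = sorted(int(x) for x in ds)  # later passes read from it
print(first_epoch == second_epoch)
```

Whether a disk cache is actually faster than re-reading the original files depends on the storage and the decoding cost, so it is worth measuring on the real data.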

You can also map preprocessing or augmentation into the dataset:

python
# Scale pixel values from [0, 255] to [0, 1].
normalizer = tf.keras.layers.Rescaling(1.0 / 255)

train_ds = train_ds.map(lambda x, y: (normalizer(x), y))
val_ds = val_ds.map(lambda x, y: (normalizer(x), y))
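
Augmentation can be mapped in the same way. The following is a minimal sketch that assumes the images are already scaled to [0, 1]. The `augment` function and the synthetic batch are invented for illustration, and the specific transforms (horizontal flip, small brightness shift) are only examples of what a project might choose.

```python
import tensorflow as tf


def augment(image, label):
    """Apply light random augmentation to a batch of images."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    # Brightness shifts can leave [0, 1], so clip back into range.
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label


# Synthetic data standing in for real images: 4 RGB images in [0, 1].
images = tf.random.uniform((4, 224, 224, 3))
labels = tf.constant([0, 1, 0, 1])
train_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(2)

# Augment only the training split; validation data should stay unchanged.
train_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)

for batch_images, batch_labels in train_ds.take(1):
    print(batch_images.shape, batch_labels.shape)
```

Random augmentation is usually placed after cache() so that each epoch sees freshly augmented images rather than one cached variant.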

Use a Custom tf.data Pipeline When Folder Rules Are Not Enough

If labels come from filenames, metadata files, or nested folder patterns that do not match TensorFlow's default assumptions, build the dataset manually with tf.data.

python
import tensorflow as tf

# Match every .jpg file one level below the class folders.
files = tf.data.Dataset.list_files("dataset/*/*.jpg", shuffle=True)


def load_image(path):
    # Read, decode, and resize the image, then scale pixels to [0, 1].
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [224, 224])
    image = image / 255.0

    # The parent folder name is the class: "dogs" maps to 1, anything else to 0.
    parts = tf.strings.split(path, "/")
    label_name = parts[-2]
    label = tf.cast(tf.equal(label_name, "dogs"), tf.int32)
    return image, label


ds = files.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)

This gives you more control over parsing logic and is the right option when the folder names alone are not the full labeling scheme.
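
The loader above produces a binary label. For more than two classes, one possible approach is to compare the folder name against a fixed list of class names and take the index of the match. The sketch below assumes a three-class list (`CLASS_NAMES`) and example paths that are not part of the original dataset.

```python
import tensorflow as tf

# Assumed class list; in practice this would come from your own folder scan.
CLASS_NAMES = tf.constant(["birds", "cats", "dogs"])


def label_from_path(path):
    """Map the parent folder name of `path` to an integer class index."""
    parts = tf.strings.split(path, "/")
    matches = tf.equal(parts[-2], CLASS_NAMES)      # True only at the matching class
    return tf.argmax(tf.cast(matches, tf.int32))    # index of the matching class


# Hypothetical paths used only to exercise the function.
paths = tf.data.Dataset.from_tensor_slices([
    "dataset/cats/cat1.jpg",
    "dataset/dogs/dog1.jpg",
    "dataset/birds/bird1.jpg",
])
labels = [int(label) for label in paths.map(label_from_path)]
print(labels)
```

Note that a folder name missing from the list would silently map to index 0 in this sketch, so a real pipeline may want to validate folder names up front.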

A custom pipeline is also useful when training data lives in nested year or source folders, when labels come from a CSV file, or when you want different decoding rules for JPEG and PNG images in the same corpus.
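
As a sketch of the CSV case, the example below parses a small, made-up label file with the standard library and turns it into parallel lists of paths and integer labels. The column names `filename` and `label` and the file names themselves are assumptions for illustration.

```python
import csv
import io

# Hypothetical CSV contents; a real project would open a file on disk instead.
csv_text = """filename,label
2021/source_a/img001.jpg,cat
2022/source_b/img002.png,dog
2022/source_a/img003.jpg,cat
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Assign integer indices in sorted order so the mapping is reproducible.
class_names = sorted({row["label"] for row in rows})
class_to_index = {name: i for i, name in enumerate(class_names)}

paths = [row["filename"] for row in rows]
labels = [class_to_index[row["label"]] for row in rows]

print(class_to_index)
print(list(zip(paths, labels)))
```

The two lists could then be passed to tf.data.Dataset.from_tensor_slices((paths, labels)) and mapped through a loader function that reads and decodes each path.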

It is also the better option when you need deterministic filtering, custom train and validation splits, or metadata-driven label remapping before the model ever sees a batch.
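
One way to get a deterministic custom split is to hash each file path and assign the file to a split based on the hash, so the assignment never changes between runs or when new files are added. This is only a sketch: the `assign_split` helper, the 20% validation share, and the generated file list are all assumptions made for the example.

```python
import hashlib


def assign_split(path, val_percent=20):
    """Deterministically assign a file to 'train' or 'val' from its path."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100   # stable value in [0, 100)
    return "val" if bucket < val_percent else "train"


# Hypothetical file list for illustration.
paths = [f"dataset/cats/cat{i}.jpg" for i in range(50)]
paths += [f"dataset/dogs/dog{i}.jpg" for i in range(50)]

train_paths = [p for p in paths if assign_split(p) == "train"]
val_paths = [p for p in paths if assign_split(p) == "val"]

print(len(train_paths), len(val_paths))
```

The split sizes are only approximately 80/20 because they depend on how the hashes fall, but every path always lands in the same split.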

Common Pitfalls

  • Using a directory layout that does not match the label inference rules you expect.
  • Forgetting that image_size resizes every image, which may affect aspect ratio and model behavior.
  • Caching a dataset that is too large to fit comfortably in memory.
  • Building a custom pipeline without parallel mapping or prefetching, which can bottleneck training.
  • Assuming folder names are returned in your preferred label order without checking class_names explicitly.
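
To illustrate the aspect-ratio pitfall, the sketch below compares a plain resize with tf.image.resize_with_pad on a synthetic wide image. The 100x300 input size is an arbitrary choice for the example. Padding is only one possible remedy, and cropping may suit some datasets better.

```python
import tensorflow as tf

# Synthetic 100x300 RGB image (a wide 1:3 aspect ratio).
image = tf.random.uniform((100, 300, 3))

# Plain resize stretches the image to fit, distorting its proportions.
stretched = tf.image.resize(image, [224, 224])

# resize_with_pad scales the image to fit and pads the remainder with zeros.
padded = tf.image.resize_with_pad(image, 224, 224)

print(stretched.shape, padded.shape)
```

Both results have the same shape, so the distortion is easy to miss unless you look at the images themselves.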

Summary

  • image_dataset_from_directory is the easiest way to load class-organized image folders in TensorFlow.
  • Folder names can be inferred as labels automatically.
  • Add cache, shuffle, and prefetch to improve training throughput.
  • Use a custom tf.data pipeline when labels or file layout need more control.
  • Good image loading is not just about reading files, but about building a pipeline the model can consume efficiently.
