Loading Folders of Images in TensorFlow
Introduction
TensorFlow can load image datasets directly from folders, which is convenient for classification projects where each subdirectory represents a class. The key is to organize the files predictably and build a dataset pipeline that handles resizing, batching, and prefetching efficiently.
Use image_dataset_from_directory for Standard Folder Layouts
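The easiest entry point is tf.keras.utils.image_dataset_from_directory. It expects a root directory containing one subdirectory per class, with that class's image files inside, for example train/cats/ and train/dogs/.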
Each subfolder name becomes a class label when labels="inferred" is used.
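As a minimal sketch of that layout and the loading call, the snippet below builds a throwaway folder with two hypothetical classes (cats and dogs, filled with random pixels) and loads it. The image size, batch size, and seed are arbitrary values chosen for illustration.

```python
import pathlib
import tempfile

import tensorflow as tf

# Build a tiny demo dataset: two class folders with four random PNGs each.
# The class names "cats" and "dogs" are hypothetical placeholders.
root = pathlib.Path(tempfile.mkdtemp()) / "train"
for class_name in ("cats", "dogs"):
    class_dir = root / class_name
    class_dir.mkdir(parents=True)
    for i in range(4):
        pixels = tf.cast(
            tf.random.uniform([32, 48, 3], 0, 256, dtype=tf.int32), tf.uint8
        )
        tf.io.write_file(str(class_dir / f"{i}.png"), tf.io.encode_png(pixels))

train_ds = tf.keras.utils.image_dataset_from_directory(
    str(root),
    labels="inferred",    # take labels from the subfolder names
    label_mode="int",     # integer class indices
    image_size=(64, 64),  # every image is resized to this height and width
    batch_size=4,
    shuffle=True,
    seed=123,
)

print(train_ds.class_names)
images, labels = next(iter(train_ds))
print(images.shape, labels.shape)
```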
The call returns a tf.data.Dataset that yields batches of images and integer labels, ready for model training.
If you prefer one-hot labels for a softmax classifier, set label_mode="categorical". For binary classification with a single output unit, label_mode="binary" can be more convenient.
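The following sketch compares the label shapes the two modes produce. The make_demo_folder helper and the class names are hypothetical and exist only to create small folders of random images to load.

```python
import pathlib
import tempfile

import tensorflow as tf


def make_demo_folder(class_names, per_class=2):
    """Write a few random PNGs per class into a temp directory (demo only)."""
    root = pathlib.Path(tempfile.mkdtemp())
    for name in class_names:
        (root / name).mkdir()
        for i in range(per_class):
            pixels = tf.cast(
                tf.random.uniform([16, 16, 3], 0, 256, dtype=tf.int32), tf.uint8
            )
            tf.io.write_file(str(root / name / f"{i}.png"), tf.io.encode_png(pixels))
    return str(root)


# One-hot labels: one column per class.
onehot_ds = tf.keras.utils.image_dataset_from_directory(
    make_demo_folder(["bird", "cat", "dog"]),
    label_mode="categorical",
    image_size=(32, 32),
    batch_size=6,
)
_, onehot_labels = next(iter(onehot_ds))
print(onehot_labels.shape)

# Binary labels: a single 0/1 value per image (requires exactly two classes).
binary_ds = tf.keras.utils.image_dataset_from_directory(
    make_demo_folder(["neg", "pos"]),
    label_mode="binary",
    image_size=(32, 32),
    batch_size=4,
)
_, binary_labels = next(iter(binary_ds))
print(binary_labels.shape)
```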
Add Performance Steps to the Pipeline
Once the dataset is created, improve throughput with caching and prefetching.
These steps help overlap CPU-side loading with model execution. For small and medium datasets, cache() can make repeated epochs much faster. For very large datasets, caching may consume too much memory, so profile the real workload instead of assuming it is always beneficial.
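A minimal sketch of these steps follows. To keep it self-contained, train_ds here is a synthetic stand-in for the dataset that image_dataset_from_directory would return; the sizes and the shuffle buffer are arbitrary assumptions.

```python
import tensorflow as tf

# Stand-in for a folder-backed dataset:
# 32 synthetic 64x64 RGB images with integer labels, in batches of 8.
images = tf.random.uniform([32, 64, 64, 3], maxval=255.0)
labels = tf.random.uniform([32], maxval=2, dtype=tf.int32)
train_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)

AUTOTUNE = tf.data.AUTOTUNE

train_ds = (
    train_ds
    .cache()                 # keep loaded batches in memory after the first pass
    .shuffle(buffer_size=4)  # the dataset is already batched, so this reorders batches
    .prefetch(AUTOTUNE)      # prepare upcoming batches while the model is busy
)

for epoch in range(2):
    n_batches = sum(1 for _ in train_ds)
    print(f"epoch {epoch}: {n_batches} batches")
```

When the data does not fit in memory, cache() also accepts a file path so that the cache is written to disk instead.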
You can also apply preprocessing or augmentation inside the pipeline with Dataset.map, so those steps run alongside loading rather than as a separate pass.
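One way this might look is sketched below, again using a synthetic stand-in dataset. The preprocess and augment functions are hypothetical names, and the specific augmentations (horizontal flip, small brightness shift) are illustrative choices rather than recommendations.

```python
import tensorflow as tf

# Synthetic stand-in for a batched image dataset with pixel values in [0, 255].
images = tf.random.uniform([16, 64, 64, 3], maxval=255.0)
labels = tf.random.uniform([16], maxval=2, dtype=tf.int32)
train_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)


def preprocess(image, label):
    # Scale pixel values from [0, 255] to [0, 1].
    return image / 255.0, label


def augment(image, label):
    # Light augmentation; intended for the training split only.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.clip_by_value(image, 0.0, 1.0)  # keep values in [0, 1]
    return image, label


train_ds = (
    train_ds
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

batch_images, batch_labels = next(iter(train_ds))
print(batch_images.shape, batch_images.dtype)
```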
Use a Custom tf.data Pipeline When Folder Rules Are Not Enough
If labels come from filenames, metadata files, or nested folder patterns that do not match TensorFlow's default assumptions, build the dataset manually with tf.data.
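As a sketch under stated assumptions, the example below reads labels from a filename prefix such as cat_0.png. The naming scheme, the CLASS_NAMES list, and the load_image helper are all hypothetical; the demo files are random images written to a temporary folder.

```python
import pathlib
import tempfile

import tensorflow as tf

# Demo data: a flat folder where the label is the filename prefix.
CLASS_NAMES = ["cat", "dog"]
root = pathlib.Path(tempfile.mkdtemp())
for name in CLASS_NAMES:
    for i in range(3):
        pixels = tf.cast(
            tf.random.uniform([20, 30, 3], 0, 256, dtype=tf.int32), tf.uint8
        )
        tf.io.write_file(str(root / f"{name}_{i}.png"), tf.io.encode_png(pixels))

# Resolve paths and labels in plain Python, sorted for a deterministic order.
paths = sorted(str(p) for p in root.glob("*.png"))
labels = [CLASS_NAMES.index(pathlib.Path(p).name.split("_")[0]) for p in paths]


def load_image(path, label):
    # Read, decode, and resize one image file.
    data = tf.io.read_file(path)
    image = tf.io.decode_image(data, channels=3, expand_animations=False)
    image = tf.image.resize(image, (64, 64))
    return image, label


ds = (
    tf.data.Dataset.from_tensor_slices((paths, labels))
    .shuffle(len(paths), seed=0)
    .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)
)

batch_images, batch_labels = next(iter(ds))
print(batch_images.shape, batch_labels.shape)
```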
Building the pipeline yourself gives you more control over parsing logic and is the right option when folder names alone do not define the full labeling scheme.
A custom pipeline is also useful when training data lives in nested year or source folders, when labels come from a CSV file, or when you want different decoding rules for JPEG and PNG images in the same corpus.
It is also the better option when you need deterministic filtering, custom train and validation splits, or metadata-driven label remapping before the model ever sees a batch.
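The sketch below illustrates metadata-driven labels with a deterministic split. The CSV contents, the LABEL_MAP remapping, and the hash-based is_validation rule are all hypothetical; the resulting datasets hold (path, label) pairs, to which a decoding map step would be added in a real pipeline.

```python
import csv
import io
import zlib

import tensorflow as tf

# Hypothetical metadata file: one row per image with a raw label string.
METADATA_CSV = """path,label
2021/a/img001.jpg,kitten
2021/b/img002.jpg,cat
2022/a/img003.jpg,dog
2022/b/img004.jpg,puppy
"""

# Metadata-driven remapping: collapse raw labels into the classes to train on.
LABEL_MAP = {"kitten": 0, "cat": 0, "puppy": 1, "dog": 1}


def is_validation(path, val_percent=25):
    # Stable hash of the path, so a file always lands in the same split.
    return zlib.crc32(path.encode("utf-8")) % 100 < val_percent


def to_dataset(pairs):
    # Explicit dtypes keep the element types stable even for an empty split.
    paths = tf.constant([p for p, _ in pairs], dtype=tf.string)
    labels = tf.constant([label for _, label in pairs], dtype=tf.int32)
    return tf.data.Dataset.from_tensor_slices((paths, labels))


rows = list(csv.DictReader(io.StringIO(METADATA_CSV)))
pairs = [(r["path"], LABEL_MAP[r["label"]]) for r in rows]
train = [pair for pair in pairs if not is_validation(pair[0])]
val = [pair for pair in pairs if is_validation(pair[0])]

train_ds = to_dataset(train)
val_ds = to_dataset(val)
print(len(train), "training files,", len(val), "validation files")
```

Hashing the path rather than drawing random numbers means the split does not change between runs or when new files are added.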
Common Pitfalls
- Using a directory layout that does not match the label inference rules you expect.
- Forgetting that image_size resizes every image, which may affect aspect ratio and model behavior.
- Caching a dataset that is too large to fit comfortably in memory.
- Building a custom pipeline without parallel mapping or prefetching, which can bottleneck training.
- Assuming folder names are returned in your preferred label order without checking class_names explicitly.
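The label-order pitfall can be illustrated with a small sketch. The numeric folder names here are a hypothetical case in which the default ordering differs from numeric order; passing class_names explicitly is one way to pin the order.

```python
import pathlib
import tempfile

import tensorflow as tf

# Demo folders named "1", "2", and "10", each holding one blank image.
root = pathlib.Path(tempfile.mkdtemp())
for name in ("1", "2", "10"):
    (root / name).mkdir()
    pixels = tf.zeros([8, 8, 3], dtype=tf.uint8)
    tf.io.write_file(str(root / name / "img.png"), tf.io.encode_png(pixels))

# Default: class names are sorted as strings, so "10" comes before "2".
default_ds = tf.keras.utils.image_dataset_from_directory(
    str(root), image_size=(8, 8), batch_size=3
)
print(default_ds.class_names)

# Explicit order: label index 0 -> "1", 1 -> "2", 2 -> "10".
ordered_ds = tf.keras.utils.image_dataset_from_directory(
    str(root), class_names=["1", "2", "10"], image_size=(8, 8), batch_size=3
)
print(ordered_ds.class_names)
```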
Summary
- image_dataset_from_directory is the easiest way to load class-organized image folders in TensorFlow.
- Folder names can be inferred as labels automatically.
- Add cache, shuffle, and prefetch to improve training throughput.
- Use a custom tf.data pipeline when labels or file layout need more control.
- Good image loading is not just about reading files, but about building a pipeline the model can consume efficiently.

