Detecting Corrupt Images in TensorFlow
Introduction
Corrupt image files are a common reason TensorFlow image pipelines fail halfway through training. The safest time to detect them is before the dataset reaches a long-running training loop, because a single bad file can break a tf.data job after you have already paid the cost of shuffling, decoding, and preprocessing thousands of valid examples.
What Counts as a Corrupt Image
In practice, an image is corrupt if TensorFlow cannot decode it into the shape and format your model expects. That can mean the file is truncated, mislabeled, not actually an image, or valid enough to decode but still wrong for the pipeline because it has an unexpected channel count or shape.
That last case matters. A file can be readable and still be unusable for your model.
Pre-Scan Files with TensorFlow Decoding
A reliable approach is to scan the file paths first and let TensorFlow try to decode each file. That catches real parser failures instead of relying on filename extensions.
Recording every failed path gives you visibility into which files are bad instead of silently skipping them.
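A minimal sketch of such a pre-scan follows, assuming eager execution (the TensorFlow 2.x default). The helper name find_corrupt_images is hypothetical, and the choice of tf.io.decode_image with three channels is an assumption about what the pipeline expects.

```python
import tensorflow as tf


def find_corrupt_images(paths):
    """Split paths into (valid, corrupt) by attempting a full decode.

    Hypothetical helper for illustration; runs eagerly, one file at a time.
    """
    valid, corrupt = [], []
    for path in paths:
        try:
            data = tf.io.read_file(path)
            # expand_animations=False keeps GIFs from decoding to a 4-D tensor.
            tf.io.decode_image(data, channels=3, expand_animations=False)
            valid.append(path)
        except tf.errors.OpError as err:
            # Covers missing files (NotFoundError) and undecodable bytes
            # (InvalidArgumentError); keep the error type for the report.
            corrupt.append((path, type(err).__name__))
    return valid, corrupt
```

Calling valid, corrupt = find_corrupt_images(all_paths) once before training yields a clean path list plus a report of rejected files that can be logged or written to disk.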
Build the tf.data Pipeline from Valid Files
Once the valid file list is known, keep the main input pipeline strict.
This is the most predictable training setup because all corruption handling happened before the parallel pipeline started.
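One way this strict pipeline might look is sketched below. It assumes valid_paths is a non-empty list of paths that already passed a pre-scan; IMAGE_SIZE, load_image, and build_dataset are illustrative names, and the 224x224 target size is an assumption rather than a requirement.

```python
import tensorflow as tf

IMAGE_SIZE = (224, 224)  # assumed model input size


def load_image(path):
    """Read, decode, resize, and scale one image to [0, 1]."""
    data = tf.io.read_file(path)
    # No error handling here on purpose: any failure should stop training.
    image = tf.io.decode_image(data, channels=3, expand_animations=False)
    image = tf.image.resize(image, IMAGE_SIZE)
    return image / 255.0


def build_dataset(valid_paths, batch_size=32):
    """Build a strict input pipeline from pre-validated file paths."""
    ds = tf.data.Dataset.from_tensor_slices(valid_paths)
    ds = ds.shuffle(buffer_size=len(valid_paths))
    ds = ds.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Because the pipeline contains no error handling, a decode failure at this stage signals that the file changed after validation, which is usually worth failing loudly on.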
Use ignore_errors() Carefully
TensorFlow can skip bad records during pipeline execution, but that should be a conscious choice rather than the default.
Skipping bad records keeps training alive, but it can also hide a broken export job or a bad download unless you log the rejected files somewhere else. Silent skipping is convenient, yet it reduces your visibility into data quality.
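As a rough illustration, the sketch below drops elements whose decode step raised an error. It assumes a recent TensorFlow 2.x release where ignore_errors() is available as a Dataset method; older releases expose the same behavior as tf.data.experimental.ignore_errors() applied through Dataset.apply. The function names are hypothetical.

```python
import tensorflow as tf


def decode(path):
    """Read and decode one image; raises inside the pipeline on bad files."""
    data = tf.io.read_file(path)
    return tf.io.decode_image(data, channels=3, expand_animations=False)


def build_tolerant_dataset(paths):
    """Build a dataset that skips files that fail to decode."""
    ds = tf.data.Dataset.from_tensor_slices(paths)
    ds = ds.map(decode, num_parallel_calls=tf.data.AUTOTUNE)
    # Place ignore_errors() directly after the failing step and before
    # batch(), so one bad file drops one element rather than a whole batch.
    # log_warning=True asks TensorFlow to log each dropped element's error.
    return ds.ignore_errors(log_warning=True)
```

The logged warnings describe the error, but they are not a substitute for a persistent record of which files were rejected.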
Validate the Exact Contract You Need
If your model expects JPEG files with three channels and at least a certain size, validate exactly that. Generic decoding is often not enough. A file that decodes as grayscale or that is too small may still create downstream problems during resize, augmentation, or batching.
That is why many pipelines check more than readability:
- decoder compatibility with the expected format
- channel count
- minimum image size
- shape assumptions after decoding
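The checks above could be combined into a single validator, sketched here under the assumption that the contract is "JPEG, three channels, at least 32x32 pixels". The function name check_image_contract and the default sizes are illustrative, and the code assumes eager execution.

```python
import tensorflow as tf


def check_image_contract(path, min_height=32, min_width=32):
    """Return None if the file meets the contract, else a reason string.

    Hypothetical helper; the thresholds are placeholders for your own.
    """
    try:
        data = tf.io.read_file(path)
        # is_jpeg inspects the file header, not the filename extension.
        if not bool(tf.io.is_jpeg(data)):
            return "not a JPEG"
        # channels=0 keeps the channel count stored in the file, so a
        # grayscale image is reported instead of silently converted.
        image = tf.io.decode_jpeg(data, channels=0)
    except tf.errors.OpError as err:
        return "decode failed: " + type(err).__name__
    height, width, channels = image.shape.as_list()
    if channels != 3:
        return "expected 3 channels, got %d" % channels
    if height < min_height or width < min_width:
        return "image too small: %dx%d" % (height, width)
    return None
```

Returning a reason string instead of a boolean makes it straightforward to group rejected files by failure type when reviewing a dataset.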
Common Pitfalls
- Assuming file enumeration proves the images are valid.
- Relying only on file extensions instead of real decoding.
- Using ignore_errors() without tracking which files were skipped.
- Using a generic decoder when the pipeline actually expects one specific image format.
Summary
- The best corruption check is to let TensorFlow actually decode the image.
- Pre-scan files when you want strong visibility into which examples are invalid.
- Build the main tf.data pipeline from validated paths whenever possible.
- Use ignore_errors() only when you accept the tradeoff between resilience and visibility.
- Validate shape and channel assumptions, not just whether the file can be opened.

