Detecting Corrupt Images in TensorFlow

Introduction

Corrupt image files are a common reason TensorFlow image pipelines fail halfway through training. The safest time to detect them is before the dataset reaches a long-running training loop, because a single bad file can break a tf.data job after you have already paid the cost of shuffling, decoding, and preprocessing thousands of valid examples.

What Counts as a Corrupt Image

In practice, an image is corrupt if TensorFlow cannot decode it into the shape and format your model expects. That can mean the file is truncated, mislabeled, not actually an image, or valid enough to decode but still wrong for the pipeline because it has an unexpected channel count or shape.

That last case matters. A file can be readable and still be unusable for your model.
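To make the distinction concrete, here is a minimal sketch that builds a few tiny in-memory byte strings and tries to decode each one. The helper name `try_decode` and the synthetic inputs are illustrative assumptions, not part of any existing pipeline. Decoding with `channels=0` asks TensorFlow to keep the channel count stored in the file.

```python
import tensorflow as tf


def try_decode(data, channels=0):
    """Return the decoded shape as a list, or None if decoding fails.

    Hypothetical helper for illustration; channels=0 keeps the channel
    count stored in the JPEG instead of converting it.
    """
    try:
        return tf.io.decode_jpeg(data, channels=channels).shape.as_list()
    except tf.errors.InvalidArgumentError:
        return None


# Synthetic inputs (assumptions for the demo, not real dataset files).
rgb = tf.io.encode_jpeg(tf.zeros([16, 16, 3], tf.uint8)).numpy()
gray = tf.io.encode_jpeg(tf.zeros([16, 16, 1], tf.uint8)).numpy()
truncated = rgb[: len(rgb) // 2]  # simulates an interrupted download
not_an_image = b"this is plain text, not image data"

for name, data in [
    ("rgb", rgb),
    ("gray", gray),
    ("truncated", truncated),
    ("not_an_image", not_an_image),
]:
    print(name, try_decode(data))
```

The truncated and plain-text inputs should fail to decode at all, while the grayscale image decodes but reports a single channel, which is the readable-but-unusable case described above.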

Pre-Scan Files with TensorFlow Decoding

A reliable approach is to scan the file paths first and let TensorFlow try to decode each file. That catches real parser failures instead of relying on filename extensions.

```python
from pathlib import Path

import tensorflow as tf


def is_valid_jpeg(path: str) -> bool:
    """Return True if TensorFlow can read and decode the file as a JPEG."""
    try:
        data = tf.io.read_file(path)
        # Decoding is the real test: truncated or non-image bytes raise here.
        image = tf.image.decode_jpeg(data, channels=3)
        tf.ensure_shape(image, [None, None, 3])
        return True
    except (tf.errors.InvalidArgumentError, tf.errors.NotFoundError):
        return False


image_dir = Path("images")
valid_files = []
invalid_files = []

# Sort the paths so the scan order is the same on every run.
for path in sorted(image_dir.glob("*.jpg")):
    if is_valid_jpeg(str(path)):
        valid_files.append(str(path))
    else:
        invalid_files.append(str(path))

print(f"valid: {len(valid_files)}")
print(f"invalid: {len(invalid_files)}")
```

This gives you visibility into which files are bad instead of silently skipping them.
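One way to keep that visibility after the scan finishes is to persist the result. The sketch below assumes the `valid_files` and `invalid_files` lists from the scan above; the function name `write_scan_report` and the default file name `scan_report.json` are hypothetical choices for illustration.

```python
import json
from pathlib import Path


def write_scan_report(valid_files, invalid_files, report_path="scan_report.json"):
    """Write a small JSON summary of a pre-scan and return it as a dict.

    Hypothetical helper; the report layout is an assumption, not a standard.
    """
    report = {
        "valid_count": len(valid_files),
        "invalid_count": len(invalid_files),
        # Sorted so that reports from different runs are easy to diff.
        "invalid_files": sorted(invalid_files),
    }
    Path(report_path).write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report
```

Calling `write_scan_report(valid_files, invalid_files)` after the loop leaves a file you can archive next to the training run or compare against the previous export.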

Build the `tf.data` Pipeline from Valid Files

Once the valid file list is known, keep the main input pipeline strict.

```python
import tensorflow as tf


def load_image(path):
    """Read, decode, resize, and scale one JPEG to [0, 1]."""
    data = tf.io.read_file(path)
    image = tf.image.decode_jpeg(data, channels=3)
    image = tf.image.resize(image, [224, 224])
    image = tf.cast(image, tf.float32) / 255.0
    return image


# valid_files is the list produced by the pre-scan step.
paths = tf.data.Dataset.from_tensor_slices(valid_files)
dataset = paths.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
```

This is the most predictable training setup because all corruption handling happened before the parallel pipeline started.

Use `ignore_errors()` Carefully

TensorFlow can skip bad records during pipeline execution, but that should be a conscious choice rather than the default.

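```python
# Feed every path, including the known-bad ones, and drop failing elements.
paths = tf.data.Dataset.from_tensor_slices(valid_files + invalid_files)
dataset = paths.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
# Newer TensorFlow releases also expose this as dataset.ignore_errors().
dataset = dataset.apply(tf.data.experimental.ignore_errors())
```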

This keeps training alive, but it can also hide a broken export job or a bad download if you are not logging the rejected files somewhere else. Silent skipping is convenient, yet it reduces your visibility into data quality.
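If you do rely on skipping, one possible way to recover the list of dropped files is to carry each path through the pipeline next to its image and compare what survives with what went in. This is a minimal sketch under that assumption; `load_path_and_image` and `find_skipped` are hypothetical names, and the function runs a separate diagnostic pass rather than hooking into the training pipeline.

```python
import tensorflow as tf


def load_path_and_image(path):
    """Decode the image but keep its path alongside it."""
    data = tf.io.read_file(path)
    image = tf.image.decode_jpeg(data, channels=3)
    return path, image


def find_skipped(all_paths):
    """Return the paths that ignore_errors() would drop (hypothetical helper).

    all_paths must be a non-empty list of strings.
    """
    dataset = tf.data.Dataset.from_tensor_slices(all_paths)
    dataset = dataset.map(load_path_and_image)
    dataset = dataset.apply(tf.data.experimental.ignore_errors())
    # Elements that raised during read or decode never reach this loop.
    survived = {path.numpy().decode("utf-8") for path, _ in dataset}
    return sorted(set(all_paths) - survived)
```

Logging the result of `find_skipped(valid_files + invalid_files)` once per dataset version gives you a record of what was dropped without making the training pipeline itself stricter.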

Validate the Exact Contract You Need

If your model expects JPEG files with three channels and at least a certain size, validate exactly that. Generic decoding is often not enough. A file that decodes as grayscale or that is too small may still create downstream problems during resize, augmentation, or batching.

That is why many pipelines check more than readability:

- decoder compatibility with the expected format
- channel count
- minimum image size
- shape assumptions after decoding
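The checks listed above can be sketched as a single function that returns both a verdict and a reason. The name `check_jpeg_contract` and the default thresholds (three channels, at least 32x32 pixels) are assumptions chosen for illustration; substitute whatever your model actually requires.

```python
import tensorflow as tf


def check_jpeg_contract(data, channels=3, min_height=32, min_width=32):
    """Check encoded image bytes against a simple contract.

    Hypothetical helper: returns (ok, reason). The defaults are
    illustrative assumptions, not recommended values.
    """
    try:
        # channels=0 keeps the channel count stored in the file, so a
        # grayscale image is reported as 1 channel instead of converted.
        image = tf.io.decode_jpeg(data, channels=0)
    except tf.errors.InvalidArgumentError:
        return False, "decode failed"

    shape = image.shape.as_list()
    if len(shape) != 3:
        return False, f"expected a 3-D image, got shape {shape}"

    height, width, found = shape
    if found != channels:
        return False, f"expected {channels} channels, found {found}"
    if height < min_height or width < min_width:
        return False, f"image is {height}x{width}, below {min_height}x{min_width}"
    return True, "ok"
```

For a file on disk, `check_jpeg_contract(tf.io.read_file(path))` applies the same checks. Returning the reason alongside the verdict keeps the rejection log useful when you later ask why a file was excluded.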

Common Pitfalls

- Assuming file enumeration proves the images are valid.
- Relying only on file extensions instead of real decoding.
- Using `ignore_errors()` without tracking which files were skipped.
- Using a generic decoder when the pipeline actually expects one specific image format.

Summary

- The best corruption check is to let TensorFlow actually decode the image.
- Pre-scan files when you want strong visibility into which examples are invalid.
- Build the main `tf.data` pipeline from validated paths whenever possible.
- Use `ignore_errors()` only when you accept the tradeoff between resilience and visibility.
- Validate shape and channel assumptions, not just whether the file can be opened.
