Accessing already downloaded dataset with tensorflow_datasets API

Introduction

tensorflow_datasets, usually shortened to TFDS, already knows how to reuse prepared datasets instead of downloading them again. The key is understanding where TFDS looks, what counts as an already prepared dataset, and when to disable downloading explicitly.

Know the Difference Between Raw Downloads and Prepared Data

TFDS manages two related but different things:

  • Source files it downloads or expects you to place manually.
  • Prepared dataset artifacts stored under a TFDS data_dir.

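If a dataset has already been prepared in the expected directory layout, TFDS can open it directly without fetching anything from the network. In many projects the simplest solution is to point TFDS at the same data_dir that was used before. When no data_dir is given, TFDS falls back to ~/tensorflow_datasets, or to the TFDS_DATA_DIR environment variable if it is set.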

python
import tensorflow_datasets as tfds

# Open the prepared copy in data_dir; never fall back to a download.
ds = tfds.load(
    "mnist",
    split="train",
    data_dir="/tmp/tfds-cache",
    download=False,
)

for example in tfds.as_numpy(ds.take(1)):
    print(example["label"])

The important flag here is download=False. It tells TFDS to fail instead of quietly attempting a download if the prepared dataset is missing.
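
To make that failure easier to diagnose, the call can be wrapped so that a missing dataset produces an error naming the directory that was searched. The following is a minimal sketch, not part of TFDS: load_prepared and its loader parameter are hypothetical names introduced here, and the code assumes TFDS signals missing prepared data with an AssertionError, which may differ between versions.

```python
def load_prepared(name, split, data_dir, loader=None):
    """Load an already prepared dataset without ever touching the network."""
    if loader is None:
        # Imported lazily so the helper can be exercised without TFDS installed.
        import tensorflow_datasets as tfds

        loader = tfds.load
    try:
        return loader(name, split=split, data_dir=data_dir, download=False)
    except AssertionError as err:
        # Assumption: TFDS raises AssertionError when prepared data is missing.
        raise FileNotFoundError(
            f"No prepared copy of {name!r} under {data_dir!r}; "
            "prepare it once with downloads enabled, or fix data_dir."
        ) from err
```

A call such as load_prepared("mnist", "train", "/tmp/tfds-cache") then either returns the dataset or fails with a message that points at the cache path.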

Use tfds.load When the Dataset Is Already Prepared

The common case is a shared cache directory on your machine, on a build agent, or inside a container volume. When the dataset version already exists there, tfds.load is enough:

python
import tensorflow_datasets as tfds

# with_info=True also returns the DatasetInfo stored next to the data.
train_ds, info = tfds.load(
    "imdb_reviews",
    split="train",
    data_dir="/datasets/tfds",
    with_info=True,
    download=False,
)

print(info.full_name)
print(info.splits["train"].num_examples)

This is preferable to custom filesystem logic because TFDS still handles schema, splits, and decoders in a consistent way.

If you want more control, create the builder yourself:

python
import tensorflow_datasets as tfds

# as_dataset() only reads prepared files; it does not download anything.
builder = tfds.builder("mnist", data_dir="/datasets/tfds")
ds = builder.as_dataset(split="train")

for row in tfds.as_numpy(ds.take(2)):
    print(row["image"].shape, row["label"])

The builder API is useful when you need metadata from builder.info before deciding which split or decoding options to use.
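
As an illustration of deciding from metadata, the sketch below picks the first split a dataset actually provides from a preference list. pick_split is a hypothetical helper, not part of TFDS; it only assumes that builder.info.splits can be iterated like a mapping keyed by split name.

```python
def pick_split(available, preferred=("validation", "test", "train")):
    """Return the first preferred split name that the dataset provides."""
    names = set(available)
    for name in preferred:
        if name in names:
            return name
    raise ValueError(f"None of {preferred} found in {sorted(names)}")


# With TFDS this would be used roughly as follows (not executed here):
#   builder = tfds.builder("mnist", data_dir="/datasets/tfds")
#   split = pick_split(builder.info.splits)
#   ds = builder.as_dataset(split=split)

# Stand-in for a dataset that has only train and test splits.
print(pick_split({"train": 60000, "test": 10000}))  # prints: test
```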

Register Reusable Dataset Directories

TFDS can search more than one dataset directory. That helps when a team has a shared read-only cache plus a local fallback cache.

python
import tensorflow_datasets as tfds

# Register extra locations that TFDS should search for prepared datasets.
tfds.core.add_data_dir("/mnt/shared-tfds")
tfds.core.add_data_dir("/Users/me/tfds")

builder = tfds.builder("mnist")
print(builder.data_dir)

Once those directories are registered, builders created without an explicit data_dir can discover already prepared datasets in any registered location.

This is cleaner than hardcoding many alternate paths in application code.
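
To see what that discovery amounts to, the sketch below mimics the idea with plain filesystem checks: it walks a list of candidate directories and returns the first one that holds a prepared copy. It assumes the <data_dir>/<name>/<version>/dataset_info.json layout of a dataset without a named config. find_prepared is a hypothetical helper for illustration, not how TFDS implements its own lookup.

```python
from pathlib import Path


def find_prepared(name, data_dirs):
    """Return the first version directory holding a prepared `name`, or None."""
    for data_dir in data_dirs:
        root = Path(data_dir) / name
        if not root.is_dir():
            continue
        # Sorted for a deterministic result; this is a plain string sort,
        # not a semantic version comparison.
        for version_dir in sorted(root.iterdir(), reverse=True):
            if (version_dir / "dataset_info.json").is_file():
                return version_dir
    return None


# Directory order expresses priority: shared cache first, local fallback second.
print(find_prepared("mnist", ["/mnt/shared-tfds", "/Users/me/tfds"]))
```

A helper like this is mainly useful for diagnostics, for example logging which cache a job ended up using.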

Handling Datasets That Require Manual Source Files

Some TFDS datasets cannot be downloaded automatically because they require a login, license acceptance, or manual acquisition. In those cases, "already downloaded" may mean the raw archive is sitting in the dataset's manual download directory, not that TFDS has already prepared the final dataset.

The usual workflow is:

  1. Put the required files in the expected manual directory (by default <data_dir>/downloads/manual/).
  2. Run download_and_prepare() once.
  3. Reuse the prepared result afterward with download=False.

python
import tensorflow_datasets as tfds

builder = tfds.builder("mnist", data_dir="/datasets/tfds")
# One-time preparation step; later runs can read the prepared files directly.
builder.download_and_prepare()

ds = builder.as_dataset(split="train")
print(next(iter(tfds.as_numpy(ds.take(1))))["label"])

For public datasets like mnist, manual files are not needed, but the pattern is the same once preparation is complete.
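
For a dataset that does need manual files, a small pre-flight check can report what is missing before preparation starts. The sketch below is illustrative only: missing_manual_files and the archive names are hypothetical, and it assumes the default manual location <data_dir>/downloads/manual, which a given setup may override.

```python
from pathlib import Path


def missing_manual_files(data_dir, expected):
    """List expected source files that are absent from the manual directory."""
    # Assumption: the default manual location is <data_dir>/downloads/manual.
    manual_dir = Path(data_dir) / "downloads" / "manual"
    return [name for name in expected if not (manual_dir / name).is_file()]


# Hypothetical archive names; the dataset's own documentation lists the real ones.
missing = missing_manual_files("/datasets/tfds", ["train.tar", "labels.csv"])
if missing:
    print("Place these files in the manual directory first:", missing)
```

If the files live somewhere else, passing download_config=tfds.download.DownloadConfig(manual_dir=...) to download_and_prepare() points TFDS at that location instead.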

A Good Offline-Friendly Pattern

If you want a script to work both online and offline, make the behavior explicit instead of relying on hidden downloads:

python
from pathlib import Path

import tensorflow_datasets as tfds

data_dir = Path("/datasets/tfds")

builder = tfds.builder("mnist", data_dir=str(data_dir))

# data_path is the versioned directory where the prepared files would live.
if not builder.data_path.exists():
    raise FileNotFoundError(f"Prepared dataset not found in {data_dir}")

ds = builder.as_dataset(split="train")
print(builder.info.name)

This makes cache requirements visible to anyone running the code in CI or production.

Common Pitfalls

  • Confusing raw source archives with TFDS-prepared dataset files. Having one does not always mean you have the other.
  • Forgetting to pass the same data_dir that was used when the dataset was originally prepared.
  • Leaving download=True as the default and then being surprised when TFDS tries the network.
  • Assuming every dataset can be prepared automatically. Some require manual source files first.
  • Hardcoding one machine-specific cache path instead of using a configurable or registered data directory.
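
Several of these pitfalls go away when the cache path and the download policy come from configuration rather than from code. A minimal sketch, assuming the TFDS_DATA_DIR environment variable that TFDS reads itself and a hypothetical, application-specific MYAPP_ALLOW_DOWNLOAD switch that TFDS knows nothing about:

```python
import os
from pathlib import Path


def resolve_tfds_settings(env=None):
    """Return (data_dir, download) derived from environment variables."""
    env = os.environ if env is None else env
    # TFDS_DATA_DIR is honored by TFDS itself; ~/tensorflow_datasets is its default.
    data_dir = env.get("TFDS_DATA_DIR") or str(Path.home() / "tensorflow_datasets")
    # MYAPP_ALLOW_DOWNLOAD is hypothetical; downloads stay off unless it is "1".
    download = env.get("MYAPP_ALLOW_DOWNLOAD", "0") == "1"
    return data_dir, download


data_dir, download = resolve_tfds_settings()
print(f"data_dir={data_dir} download={download}")
```

The resulting pair can be passed straight through as tfds.load(..., data_dir=data_dir, download=download), so offline behavior is the default and going online is an explicit choice.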

Summary

  • TFDS can reuse prepared datasets directly as long as it can find the right data_dir.
  • Use download=False when you want offline-safe behavior and a clear failure if data is missing.
  • tfds.load is fine for simple cases, while tfds.builder gives more control and metadata.
  • tfds.core.add_data_dir lets you register shared cache locations.
  • Distinguish between manually downloaded source files and a fully prepared TFDS dataset.
