
How to read data from numpy files in TensorFlow?

Introduction

TensorFlow can consume NumPy data very naturally, but the best loading strategy depends on the dataset size. For small and medium datasets, the simplest path is to load the arrays with NumPy and feed them into tf.data. For very large arrays, memory mapping or a generator-based pipeline is usually more practical.

The good news is that you do not need a special TensorFlow-only reader for .npy or .npz files. NumPy already knows how to read them, and TensorFlow interoperates with NumPy arrays directly.

Load .npy and .npz Files with NumPy

A .npy file stores a single array. A .npz file stores multiple named arrays in one archive. In both cases, NumPy should be your first step.

python
import numpy as np
import tensorflow as tf

features = np.load("features.npy")
labels = np.load("labels.npy")

features_tensor = tf.convert_to_tensor(features)
labels_tensor = tf.convert_to_tensor(labels)

print(features_tensor.shape)
print(labels_tensor.shape)

For an .npz archive:

python
import numpy as np

archive = np.load("dataset.npz")
print(archive.files)

x_train = archive["x_train"]
y_train = archive["y_train"]

This approach is ideal when the files fit comfortably in memory and you are doing experiments, notebooks, or smaller training runs.
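
To make the archive layout concrete, here is a minimal round-trip sketch. The toy arrays and the temporary file are assumptions made for illustration; the key names simply mirror the x_train and y_train names used above. It writes an archive with np.savez and reads it back inside a with block so the file handle is closed afterwards.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy data standing in for a real dataset.
x_train = np.arange(12, dtype=np.float32).reshape(4, 3)
y_train = np.array([0, 1, 0, 1], dtype=np.int64)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.npz")

    # The keyword names become the keys of the archive.
    np.savez(path, x_train=x_train, y_train=y_train)

    # Each array is read from disk when its key is accessed, so pull out
    # everything you need before the archive is closed.
    with np.load(path) as archive:
        keys = sorted(archive.files)
        x_loaded = archive["x_train"]
        y_loaded = archive["y_train"]

print(keys)
print(x_loaded.shape, y_loaded.shape)
```

The arrays returned by indexing the archive are ordinary in-memory NumPy arrays, so they remain usable after the with block ends.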

Build a tf.data Pipeline for Training

Most TensorFlow training code is easier to manage when the arrays are wrapped in a dataset pipeline. That gives you batching, shuffling, and prefetching without changing the underlying file format.

python
import numpy as np
import tensorflow as tf

x = np.load("features.npy")
y = np.load("labels.npy")

dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.shuffle(buffer_size=len(x))
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)

for batch_x, batch_y in dataset.take(1):
    print(batch_x.shape, batch_y.shape)

If you already have the full arrays in memory, this is usually the cleanest answer. It keeps the input code small while giving you the standard TensorFlow input pipeline behavior.

Use Memory Mapping for Large Arrays

Large .npy files can be expensive to load eagerly. NumPy supports memory mapping, which lets you access slices without pulling the entire file into RAM immediately.

python
import numpy as np

x = np.load("huge_features.npy", mmap_mode="r")
print(x.shape)
print(x[0])

A memory-mapped array behaves like an array for indexing, but the OS pages data in as needed. That makes it useful when the dataset is too large for a full np.load, but the data still lives in a format you want to keep.
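
The following sketch illustrates that behavior on a small file. The file is created in a temporary directory purely so the example can run on its own; the array size and slice bounds are arbitrary assumptions. It shows that np.load returns an np.memmap object and that copying a slice pulls only those rows into memory.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    # Small stand-in for a file that would normally be too large to load eagerly.
    path = os.path.join(tmp_dir, "huge_features.npy")
    np.save(path, np.arange(1000, dtype=np.float32).reshape(100, 10))

    x = np.load(path, mmap_mode="r")
    is_memmap = isinstance(x, np.memmap)

    # Slicing gives another memory-mapped view; np.array copies just
    # these 32 rows into a regular in-memory array.
    batch = np.array(x[32:64])

    # Drop the mapping before the temporary directory is cleaned up.
    del x

print(is_memmap, type(batch).__name__, batch.shape)
```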

To combine that with TensorFlow, wrap the access pattern in a generator:

python
import numpy as np
import tensorflow as tf

# Memory-map both files so samples are read from disk on demand.
x = np.load("features.npy", mmap_mode="r")
y = np.load("labels.npy", mmap_mode="r")


def generator():
    # Yield one (features, label) pair at a time, in file order.
    for i in range(len(x)):
        yield x[i], y[i]


dataset = tf.data.Dataset.from_generator(
    generator,
    output_signature=(
        # shape[1:] is the per-sample shape; it is () for a 1-D label array.
        tf.TensorSpec(shape=x.shape[1:], dtype=tf.as_dtype(x.dtype)),
        tf.TensorSpec(shape=y.shape[1:], dtype=tf.as_dtype(y.dtype)),
    ),
)

dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)

This gives you a streaming-style pipeline without rewriting the dataset into another format.
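
The generator above always walks the arrays in file order. If you want a different sample order each epoch, one option is to shuffle the indices rather than the data. The sketch below is one possible way to do that; make_shuffled_generator and its seed parameter are hypothetical names introduced here, and small in-memory arrays stand in for the memory-mapped ones.

```python
import numpy as np


def make_shuffled_generator(x, y, seed=0):
    """Return a zero-argument generator function that yields (x[i], y[i])
    pairs in a freshly shuffled order every time it is called."""
    rng = np.random.default_rng(seed)

    def generator():
        # Only the index order is shuffled; the arrays themselves are
        # never copied or reordered.
        for i in rng.permutation(len(x)):
            yield x[i], y[i]

    return generator


# Stand-in arrays; in practice these would come from np.load(..., mmap_mode="r").
x = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10, dtype=np.int64)

gen = make_shuffled_generator(x, y, seed=42)
first_epoch = [int(label) for _, label in gen()]
second_epoch = [int(label) for _, label in gen()]
print(first_epoch)
print(second_epoch)
```

The returned function can be passed to tf.data.Dataset.from_generator in place of the plain generator. Keep in mind that fully random access into a memory-mapped file can be much slower than sequential reads, so this trade-off is worth measuring on your own storage.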

Know When NumPy Files Stop Scaling

.npy and .npz are very convenient, but they are still general-purpose array formats. When training becomes distributed, when files become extremely large, or when the data has to be streamed sequentially from many shards, a TensorFlow-native format such as TFRecord can scale better operationally.

That does not mean NumPy is wrong. It just means the most convenient format for local experimentation is not always the best format for long-term production pipelines.
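
If you do reach that point, the conversion itself is short. The following is a rough sketch rather than a recommended layout: the feature keys "x" and "y", the file name train.tfrecord, and the toy arrays are all assumptions made for illustration. It writes each sample as a tf.train.Example and reads the file back through tf.data.TFRecordDataset.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical toy data: 4 samples with 3 float features and an integer label.
x = np.arange(12, dtype=np.float32).reshape(4, 3)
y = np.array([0, 1, 0, 1], dtype=np.int64)


def serialize(features, label):
    """Encode one sample as a serialized tf.train.Example."""
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                "x": tf.train.Feature(
                    float_list=tf.train.FloatList(value=features.tolist())
                ),
                "y": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])
                ),
            }
        )
    )
    return example.SerializeToString()


def parse(record):
    """Decode one serialized example back into (features, label) tensors."""
    spec = {
        "x": tf.io.FixedLenFeature([3], tf.float32),
        "y": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(record, spec)
    return parsed["x"], parsed["y"]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.tfrecord")

    with tf.io.TFRecordWriter(path) as writer:
        for features, label in zip(x, y):
            writer.write(serialize(features, label))

    # Read the records back while the temporary file still exists.
    dataset = tf.data.TFRecordDataset(path).map(parse).batch(2)
    batches = [(bx.numpy(), by.numpy()) for bx, by in dataset]

print(len(batches), batches[0][0].shape)
```

The fixed feature length of 3 in the parsing spec matches the toy data only; a real pipeline would use the per-sample shape of its own arrays.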

Common Pitfalls

The most common mistake is loading a huge array eagerly and running out of memory. If the file size is large, choose memory mapping or chunked processing before you hit that limit.

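Another issue is mismatched sample counts between features and labels. from_tensor_slices raises an error as soon as the first dimensions differ, but a generator-based pipeline builds without complaint and only fails, or silently ignores the extra samples, once iteration reaches the mismatch. Check the counts right after loading.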
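
A small guard right after loading catches this early. The helper below is a minimal sketch; check_aligned is a hypothetical name, not part of NumPy or TensorFlow, and it only compares the first dimension of each array.

```python
import numpy as np


def check_aligned(features, labels):
    """Raise ValueError unless both arrays hold the same number of samples."""
    if features.shape[0] != labels.shape[0]:
        raise ValueError(
            f"sample count mismatch: {features.shape[0]} features "
            f"vs {labels.shape[0]} labels"
        )
    return features.shape[0]


# Matching counts pass and return the number of samples.
n_samples = check_aligned(np.zeros((5, 3)), np.zeros(5))
print(n_samples)

# Mismatched counts fail immediately instead of partway through training.
try:
    check_aligned(np.zeros((5, 3)), np.zeros(4))
except ValueError as err:
    message = str(err)
    print(message)
```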

It is also easy to overcomplicate the problem. If the arrays fit in memory, np.load plus from_tensor_slices is usually enough. You do not need a custom parser just because the training framework is TensorFlow.

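Finally, pay attention to dtypes. NumPy creates float64 arrays by default, while most TensorFlow models compute in float32, and tf.convert_to_tensor and from_tensor_slices keep the NumPy dtype rather than casting for you. Cast arrays deliberately when the model expects float32 features, integer labels, or normalized input ranges.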
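
As a concrete illustration, the sketch below casts before the arrays ever reach TensorFlow. The raw arrays are made-up stand-ins (8-bit image values and labels that were saved as floats), and dividing by 255 is just one common normalization, assumed here for the example.

```python
import numpy as np

# Hypothetical raw data: 8-bit pixel values and labels stored as floats.
raw_images = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.uint8)
raw_labels = np.array([1.0, 0.0])

# Cast to float32 first, then scale into the [0, 1] range.
images = raw_images.astype(np.float32) / 255.0

# Convert labels to integers for losses that expect class indices.
labels = raw_labels.astype(np.int64)

print(images.dtype, labels.dtype)
```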

Summary

  • Read .npy and .npz files with NumPy first.
  • Convert arrays directly to tensors or wrap them in tf.data.Dataset.
  • Use memory mapping when eager loading would consume too much RAM.
  • Validate shapes and dtypes before starting training.
  • Move to a more streaming-friendly format only when the NumPy workflow stops scaling.
