How to prepare a dataset for Keras?

Introduction

Preparing data for Keras involves loading raw data, cleaning it, splitting it into training and test sets, normalizing numerical features, encoding categorical variables, and reshaping arrays to match the model's expected input shape. Keras and TensorFlow provide utilities like tf.data.Dataset, ImageDataGenerator, and tf.keras.utils.text_dataset_from_directory to streamline this process. Proper preprocessing directly impacts model accuracy and training stability.

Loading Data

python

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# From CSV
df = pd.read_csv("data.csv")
X = df.drop("target", axis=1).values  # Features as NumPy array
y = df["target"].values                # Labels as NumPy array

# From a built-in dataset
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape)  # (60000, 28, 28)

Keras expects NumPy arrays or tf.data.Dataset objects. Convert DataFrames to arrays with .values or .to_numpy().

Train/Test Split

python

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Use stratify=y for classification tasks to maintain the same class distribution in both splits.
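
To make the effect of stratification concrete, here is a minimal NumPy sketch of the idea. The helper `stratified_split` is a hypothetical name for illustration only; it is not how scikit-learn implements `stratify`, just a simplified per-class split under the assumption of integer class labels.

```python
import numpy as np

def stratified_split(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both splits."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)        # positions of this class
        rng.shuffle(idx)                      # randomize within the class
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    # Shuffle again so the indices are not grouped by class
    return rng.permutation(train_idx), rng.permutation(test_idx)

# Imbalanced toy labels: 90 samples of class 0, 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
train_idx, test_idx = stratified_split(y)

train_share = y[train_idx].mean()  # fraction of class 1 in the training split
test_share = y[test_idx].mean()    # fraction of class 1 in the test split
print(train_share, test_share)     # both splits keep about 10% of class 1
```

Without stratification, a random 20-sample test split of this data could easily contain zero or one minority sample, which makes evaluation metrics unreliable.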

Normalizing Numerical Features

python

# Option 1 (image data): scale pixel values to [0, 1]
X_train = X_train.astype("float32") / 255.0
X_test = X_test.astype("float32") / 255.0

# Option 2 (tabular data): standardize with StandardScaler instead.
# Use one option or the other, not both; StandardScaler expects 2D input.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit on train only
X_test = scaler.transform(X_test)        # Transform test with same params

Always fit the scaler on the training set and apply it to the test set. Fitting on the full dataset leaks test information into training.
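
As a rough illustration of what "fit on train, transform both" means, the sketch below reproduces the standardization arithmetic by hand in NumPy. The synthetic data and variable names are assumptions for the example; in practice `StandardScaler` does this bookkeeping for you.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic tabular data (assumed for illustration): 3 numeric features
X_train = rng.normal(loc=50.0, scale=10.0, size=(800, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

# "Fit": compute statistics from the training rows only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# "Transform": reuse the same training statistics for both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately zero by construction
print(X_test_scaled.mean(axis=0))   # near zero, but not exactly
```

The test set ends up close to, but not exactly at, zero mean and unit variance. That small mismatch is expected: it reflects that the test data played no part in choosing the scaling parameters.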

Encoding Categorical Labels

python

from tensorflow.keras.utils import to_categorical

# For multi-class classification: one-hot encode
y_train_cat = to_categorical(y_train, num_classes=10)
y_test_cat = to_categorical(y_test, num_classes=10)
print(y_train_cat[0])  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] (if label was 5)

# For binary classification: keep as 0/1
# No encoding needed; use a sigmoid output with binary_crossentropy

Use one-hot encoding with categorical_crossentropy loss or integer labels with sparse_categorical_crossentropy.
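
The two label formats carry the same information. The NumPy sketch below (with made-up labels) shows the one-hot layout and how `argmax` recovers the integer labels; it illustrates the encoding itself and is not a substitute for `to_categorical`.

```python
import numpy as np

y = np.array([5, 0, 4, 1, 9])   # integer class labels (example values)
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype="float32")[y]
print(one_hot.shape)  # (5, 10)
print(one_hot[0])     # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]

# argmax along the class axis recovers the integer labels
recovered = one_hot.argmax(axis=1)
```

Integer labels use less memory, which matters when the number of classes is large.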

Reshaping for CNNs

python

# MNIST images: (60000, 28, 28) → (60000, 28, 28, 1) for CNN
X_train = X_train.reshape(-1, 28, 28, 1)
X_test = X_test.reshape(-1, 28, 28, 1)
print(X_train.shape)  # (60000, 28, 28, 1)

# RGB images: ensure shape is (samples, height, width, 3)
# Grayscale: ensure shape is (samples, height, width, 1)

CNNs expect 4D input: (batch_size, height, width, channels). Add the channel dimension with reshape or np.expand_dims.
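
For the `np.expand_dims` route mentioned above, a short sketch on a small placeholder array (zeros standing in for real grayscale images):

```python
import numpy as np

# Five placeholder 28x28 grayscale images
images = np.zeros((5, 28, 28), dtype="float32")

# Append a trailing channel axis without hard-coding height and width
with_channel = np.expand_dims(images, axis=-1)
print(with_channel.shape)  # (5, 28, 28, 1)

# Equivalent indexing form
same = images[..., np.newaxis]
```

Unlike `reshape(-1, 28, 28, 1)`, this form does not need to know the image size in advance.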

Using tf.data.Dataset

python

import tensorflow as tf

# Create a dataset from NumPy arrays
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(buffer_size=10000).batch(32).prefetch(tf.data.AUTOTUNE)

test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
test_ds = test_ds.batch(32).prefetch(tf.data.AUTOTUNE)

# Train with dataset (assumes `model` is an already compiled Keras model)
model.fit(train_ds, validation_data=test_ds, epochs=10)

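tf.data.Dataset provides efficient batching, shuffling, and prefetching. Note that from_tensor_slices still holds the full arrays in memory; for datasets that do not fit in memory, build the dataset from files instead, for example with tf.keras.utils.image_dataset_from_directory or tf.data.Dataset.from_generator.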
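
The `buffer_size` argument to `shuffle` deserves a closer look. The pure-Python sketch below mimics the documented idea of a shuffle buffer (fill a buffer, emit a random element, replace it with the next incoming one). The function `buffered_shuffle` is a hypothetical illustration, not TensorFlow's actual implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        i = rng.randrange(buffer_size)   # pick a random slot
        yield buffer[i]                  # emit its element
        buffer[i] = item                 # refill the slot with the new item
    rng.shuffle(buffer)                  # drain what is left in random order
    yield from buffer

out = list(buffered_shuffle(range(100), buffer_size=10, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
print(out[:10])
```

With a buffer smaller than the dataset, early outputs can only come from early inputs, so the shuffle is local rather than uniform. If the data is sorted by class, a buffer size at least as large as the dataset is the safe choice.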

Image Data with ImageDataGenerator

python

# Note: ImageDataGenerator is deprecated in recent TensorFlow releases. It still
# works, but tf.keras.utils.image_dataset_from_directory combined with
# preprocessing layers is the recommended approach for new code.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Data augmentation for training
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True
)

# Rescaling only for evaluation data (no augmentation)
test_gen = ImageDataGenerator(rescale=1.0 / 255)

train_data = train_gen.flow_from_directory(
    "data/train", target_size=(224, 224), batch_size=32, class_mode="categorical"
)

test_data = test_gen.flow_from_directory(
    "data/test", target_size=(224, 224), batch_size=32, class_mode="categorical"
)

# Assumes `model` is an already compiled Keras model
model.fit(train_data, validation_data=test_data, epochs=10)
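
To show what two of these options do to the underlying arrays, here is a small NumPy sketch of rescaling plus a random horizontal flip. The helper `augment_batch` and the random input batch are assumptions for illustration; the real generator also handles rotation, shifts, and interpolation, which are omitted here.

```python
import numpy as np

def augment_batch(images, rng):
    """Rescale pixels to [0, 1] and flip each image left-right with probability 0.5."""
    scaled = images.astype("float32") / 255.0       # same effect as rescale=1.0 / 255
    flip = rng.random(len(scaled)) < 0.5            # one coin toss per image
    scaled[flip] = scaled[flip][:, :, ::-1, :]      # reverse the width axis
    return scaled

rng = np.random.default_rng(0)
# Placeholder batch: 4 RGB images of 8x8 pixels with values in 0..255
batch = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
augmented = augment_batch(batch, rng)
print(augmented.shape, augmented.dtype)  # (4, 8, 8, 3) float32
```

Because the flips are drawn fresh on every call, the model sees a slightly different version of each image in every epoch.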

Text Data Preparation

python

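# Note: Tokenizer is deprecated in recent TensorFlow releases;
# tf.keras.layers.TextVectorization is recommended for new code.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["I love this movie", "Terrible film", "Great acting and story"]
labels = [1, 0, 1]

# Keep only the 10,000 most frequent words; index 0 is reserved for padding
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# [[1, 2, 3, 4], [5, 6], [7, 8, 9, 10]]

# Pad with zeros at the end so every sequence has length 20
X = pad_sequences(sequences, maxlen=20, padding="post")
print(X.shape)  # (3, 20)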
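
The padding step can be pictured with a few lines of NumPy. The helper `pad_post` below is a hypothetical, simplified stand-in for `pad_sequences(..., padding="post")`; it assumes integer token IDs and truncates long sequences from the front, mirroring the default `truncating="pre"`.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad integer sequences with `value` to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]              # keep the last `maxlen` tokens if too long
        out[row, :len(trimmed)] = trimmed    # tokens first, padding after
    return out

sequences = [[1, 2, 3, 4], [5, 6], [7, 8, 9, 10]]
X = pad_post(sequences, maxlen=20)
print(X.shape)   # (3, 20)
print(X[1, :4])  # [5 6 0 0]
```

Fixed-length rows are what allow the sequences to be stacked into a single 2D array for an Embedding layer.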

Common Pitfalls

Fitting the scaler on the entire dataset: Fitting StandardScaler or MinMaxScaler on both train and test data leaks information. Always call fit_transform on training data only, then transform on test data.
Forgetting to convert labels for the loss function: Using categorical_crossentropy with integer labels (not one-hot) causes shape mismatches. Either one-hot encode with to_categorical or use sparse_categorical_crossentropy with integer labels.
Wrong input shape for CNNs: Passing 3D arrays (samples, height, width) to a Conv2D layer that expects 4D (samples, height, width, channels) causes an error. Always reshape grayscale images to include the channel dimension.
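Not shuffling the training data: If training data is ordered by class (all class 0 first, then class 1, etc.), each batch contains mostly one class and the gradient updates swing toward whichever class came last. Shuffle with train_test_split (which shuffles by default) or dataset.shuffle(). Also note that the validation_split argument of model.fit takes the last samples before any shuffling, so shuffle ordered data yourself first.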
Using augmentation on validation/test data: Data augmentation (rotation, flips, shifts) should only be applied to training data. The validation and test sets should use only rescaling to provide consistent evaluation metrics.
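
For the shuffling pitfall above, the key detail when working with plain NumPy arrays is to apply the same permutation to features and labels. A minimal sketch with toy data (values chosen only for illustration):

```python
import numpy as np

# Toy data sorted by class: row k holds [2k, 2k + 1]; the first three rows are class 0
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])

rng = np.random.default_rng(42)
perm = rng.permutation(len(X))   # one shared random ordering

# Index both arrays with the same permutation so pairs stay aligned
X_shuffled, y_shuffled = X[perm], y[perm]
print(y_shuffled)
```

Shuffling `X` and `y` with two separate calls would silently break the pairing between samples and labels.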

Summary

Convert data to NumPy arrays or tf.data.Dataset objects before passing to Keras
Split into train/test with train_test_split using stratify for classification
Normalize features (divide by 255 for images, StandardScaler for tabular data) — fit on train only
One-hot encode labels for categorical_crossentropy, or use integer labels with sparse_categorical_crossentropy
Reshape images to 4D (samples, height, width, channels) for CNN models
Use tf.data.Dataset for large datasets and ImageDataGenerator for image augmentation