How to prepare a dataset for Keras?
Introduction
Preparing data for Keras involves loading raw data, cleaning it, splitting it into training and test sets, normalizing numerical features, encoding categorical variables, and reshaping arrays to match the model's expected input shape. Keras and TensorFlow provide utilities like tf.data.Dataset, ImageDataGenerator, and tf.keras.utils.text_dataset_from_directory to streamline this process. Proper preprocessing directly impacts model accuracy and training stability.
Loading Data
Keras models accept NumPy arrays or tf.data.Dataset objects. Convert pandas DataFrames to arrays with .to_numpy() (or the older .values attribute).
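As a minimal sketch of that conversion, the snippet below builds a small DataFrame in memory and separates features from labels. The column names (feature_a, feature_b, label) and the values are hypothetical; in practice the frame would usually come from something like pd.read_csv.

```python
import numpy as np
import pandas as pd

# Hypothetical table standing in for data loaded from a file.
df = pd.DataFrame({
    "feature_a": [0.5, 1.2, 3.3, 4.1],
    "feature_b": [10, 20, 30, 40],
    "label": [0, 1, 0, 1],
})

# Features as a float32 array, labels as a 1D integer array.
X = df.drop(columns=["label"]).to_numpy(dtype="float32")
y = df["label"].to_numpy()

print(X.shape, y.shape)
```

Casting features to float32 up front is a common choice because it matches the default dtype of Keras layers.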
Train/Test Split
Split the data with scikit-learn's train_test_split. For classification tasks, pass stratify=y to keep the same class distribution in both splits.
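The effect of stratification is easiest to see on imbalanced labels. The following sketch assumes scikit-learn is installed and uses synthetic data (80 samples of class 0, 20 of class 1); the test_size and random_state values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 80% class 0, 20% class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)).astype("float32")
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train class-1 share:", y_train.mean())
print("test class-1 share:", y_test.mean())
```

Without stratify, a small test set can end up with too few minority-class samples, which makes evaluation metrics noisy.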
Normalizing Numerical Features
Always fit the scaler on the training set and apply it to the test set. Fitting on the full dataset leaks test information into training.
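A small sketch of the fit-on-train, transform-on-test pattern, assuming scikit-learn's StandardScaler and made-up feature values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical tabular features with very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
# Mean and standard deviation are learned from the training set only.
X_train_scaled = scaler.fit_transform(X_train)
# The test set reuses the training statistics; it is never fitted.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)
print(X_test_scaled)
```

Note that the scaled test values need not have zero mean or unit variance, since they are expressed relative to the training distribution.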
Encoding Categorical Labels
Use one-hot encoding with categorical_crossentropy loss or integer labels with sparse_categorical_crossentropy.
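The two label formats can be compared side by side. This sketch assumes TensorFlow is installed and uses hypothetical labels and predicted probabilities for a 3-class problem; it one-hot encodes with to_categorical and checks that both losses give the same per-sample values.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.utils import to_categorical

# Hypothetical integer labels for a 3-class problem.
y_int = np.array([0, 2, 1, 2])

# One-hot version: one row per sample, one column per class.
y_onehot = to_categorical(y_int, num_classes=3)

# Hypothetical softmax outputs from a model (each row sums to 1).
probs = np.array(
    [[0.7, 0.2, 0.1],
     [0.1, 0.2, 0.7],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]],
    dtype="float32",
)

# One-hot labels pair with categorical_crossentropy;
# integer labels pair with sparse_categorical_crossentropy.
loss_onehot = tf.keras.losses.categorical_crossentropy(y_onehot, probs)
loss_sparse = tf.keras.losses.sparse_categorical_crossentropy(y_int, probs)

print(np.asarray(loss_onehot))
print(np.asarray(loss_sparse))
```

Integer labels with the sparse loss avoid materializing a large one-hot matrix, which can matter when the number of classes is high.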
Reshaping for CNNs
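Conv2D layers expect 4D input: (batch_size, height, width, channels) in the default channels-last layout. For grayscale images stored as (samples, height, width), add the channel dimension with reshape or np.expand_dims.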
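Both approaches produce the same result, as this short sketch shows. The array is a hypothetical batch of eight 28x28 grayscale images filled with zeros.

```python
import numpy as np

# Hypothetical batch of grayscale images: (samples, height, width).
images = np.zeros((8, 28, 28), dtype="float32")

# Option 1: append a trailing channel axis.
images_4d = np.expand_dims(images, axis=-1)

# Option 2: reshape explicitly; -1 lets NumPy infer the sample count.
images_reshaped = images.reshape(-1, 28, 28, 1)

print(images_4d.shape)        # (8, 28, 28, 1)
print(images_reshaped.shape)  # (8, 28, 28, 1)
```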
Using tf.data.Dataset
tf.data.Dataset provides efficient batching, shuffling, and prefetching for large datasets that do not fit in memory.
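A minimal pipeline might look like the sketch below, which assumes TensorFlow is installed. The arrays, buffer size, and batch size are illustrative. For simplicity it builds the dataset from in-memory arrays with from_tensor_slices; data that truly does not fit in memory would typically be streamed from files instead.

```python
import numpy as np
import tensorflow as tf

# Small synthetic dataset: 10 samples, 2 features each.
X = np.arange(20, dtype="float32").reshape(10, 2)
y = np.arange(10) % 2

dataset = (
    tf.data.Dataset.from_tensor_slices((X, y))
    .shuffle(buffer_size=10, seed=0)   # buffer >= dataset size gives a full shuffle
    .batch(4)                          # last batch is smaller (2 samples)
    .prefetch(tf.data.AUTOTUNE)        # overlap data preparation with training
)

batch_shapes = [tuple(xb.shape) for xb, _ in dataset]
print(batch_shapes)  # [(4, 2), (4, 2), (2, 2)]
```

A dataset built this way can be passed directly to model.fit, in which case batch_size should not be given to fit because the dataset is already batched.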
Image Data with ImageDataGenerator
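ImageDataGenerator rescales and augments image batches on the fly. The sketch below is one possible setup, not a definitive recipe: it assumes TensorFlow and SciPy are installed, uses random arrays in place of real images, and picks arbitrary augmentation parameters. Recent TensorFlow documentation marks ImageDataGenerator as deprecated in favor of tf.keras.utils.image_dataset_from_directory combined with preprocessing layers, so treat this as a legacy-API example.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random arrays standing in for 16 RGB images of size 32x32.
rng = np.random.default_rng(0)
x_train = rng.integers(0, 256, size=(16, 32, 32, 3)).astype("float32")
y_train = rng.integers(0, 2, size=(16,))

# Training generator: rescale to [0, 1] and apply random augmentation.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)

# Validation/test generator: rescaling only, no augmentation.
val_gen = ImageDataGenerator(rescale=1.0 / 255)

train_iter = train_gen.flow(x_train, y_train, batch_size=8, shuffle=True, seed=0)
x_batch, y_batch = next(train_iter)
print(x_batch.shape, y_batch.shape)
```

Keeping a separate rescale-only generator for validation data prevents augmentation from distorting evaluation metrics.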
Text Data Preparation
Common Pitfalls
- Fitting the scaler on the entire dataset: Fitting StandardScaler or MinMaxScaler on both train and test data leaks information. Always call fit_transform on training data only, then transform on test data.
- Forgetting to convert labels for the loss function: Using categorical_crossentropy with integer labels (not one-hot) causes shape mismatches. Either one-hot encode with to_categorical or use sparse_categorical_crossentropy with integer labels.
- Wrong input shape for CNNs: Passing 3D arrays (samples, height, width) to a Conv2D layer that expects 4D (samples, height, width, channels) causes an error. Always reshape grayscale images to include the channel dimension.
- Not shuffling the training data: If training data is ordered by class (all class 0 first, then class 1, etc.), each batch contains mostly one class and the gradient updates are biased toward it. Shuffle with train_test_split(shuffle=True) or dataset.shuffle(buffer_size).
- Using augmentation on validation/test data: Data augmentation (rotation, flips, shifts) should only be applied to training data. The validation and test sets should use only rescaling to provide consistent evaluation metrics.
Summary
- Convert data to NumPy arrays or tf.data.Dataset objects before passing to Keras
- Split into train/test with train_test_split using stratify for classification
- Normalize features (divide by 255 for images, StandardScaler for tabular data); fit on train only
- One-hot encode labels for categorical_crossentropy, or use integer labels with sparse_categorical_crossentropy
- Reshape images to 4D (samples, height, width, channels) for CNN models
- Use tf.data.Dataset for large datasets and ImageDataGenerator for image augmentation

