What is the relation between validation_data and validation_split in Keras' fit function?

Introduction

Both validation_data and validation_split tell Keras to evaluate the model on held-out data during training, but they do it in different ways. validation_split slices a fraction out of the training arrays you pass to fit, while validation_data gives Keras an explicit separate dataset, and the explicit dataset takes priority when both are supplied.

What validation_split Does

validation_split is a convenience option for in-memory array data. If you write:

python
model.fit(x_train, y_train, validation_split=0.2, epochs=5)

Keras reserves the last 20 percent of the provided training arrays for validation, taking that slice before any shuffling, and trains on the remaining 80 percent.

That means validation_split only works cleanly when:

  • the input data is indexable, such as NumPy arrays or tensors
  • inputs and labels are aligned row by row
  • you are comfortable letting Keras create the held-out subset for you

It is quick, but it is also less explicit.
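The tail slice described above can be sketched with plain NumPy. This is a minimal emulation of the behavior, not Keras's own code; the toy arrays and the `split_at` name are assumptions made for illustration.

```python
import math
import numpy as np

# Toy data: 10 samples with 2 features each. The labels are simply the
# row index, so it is easy to see which rows end up on which side.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

validation_split = 0.2

# Index where the validation tail begins (an emulation, not Keras's code).
split_at = math.floor(len(x) * (1.0 - validation_split))

x_train, x_val = x[:split_at], x[split_at:]
y_train, y_val = y[:split_at], y[split_at:]

print(y_train)  # [0 1 2 3 4 5 6 7]
print(y_val)    # [8 9]
```

Because the slice depends only on row position, the order of the arrays you pass to fit decides which samples are held out.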

What validation_data Does

validation_data gives Keras the validation set directly.

python
model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val),
    epochs=5,
)

This is the more flexible option because your validation set can come from a separate preprocessing step, a manual split, or even a dataset object.

It is the right choice when:

  • you already created a train and validation split yourself
  • you need reproducible control over the exact validation rows
  • your validation set is generated differently from training data
  • you are using dataset pipelines rather than simple arrays

The Relation Between Them

Conceptually, both options produce validation metrics at the end of each epoch: val_loss always, plus a val_ version of every metric you compiled, such as val_accuracy. The only difference is how the validation data is sourced.

A practical way to think about them is:

  • validation_split means "please carve validation rows out of the training arrays for me"
  • validation_data means "use this validation dataset that I prepared myself"

If both are passed, the explicit validation_data wins and the split argument is effectively ignored.
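The precedence rule can be modeled in a few lines of plain Python. The helper below, `resolve_validation`, is hypothetical and exists only to illustrate the rule; it is not part of Keras, and the real fit method does considerably more.

```python
def resolve_validation(x, y, validation_split=0.0, validation_data=None):
    """Hypothetical helper mirroring the precedence rule; not Keras code."""
    if validation_data is not None:
        # Explicit data wins: train on every row and ignore the fraction.
        return (x, y), validation_data
    if validation_split:
        # No explicit data: hold out the tail of the training arrays.
        split_at = int(len(x) * (1.0 - validation_split))
        return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])
    # Neither argument given: no validation at all.
    return (x, y), None


x = list(range(10))
y = list(range(10))

# Both arguments supplied: all 10 rows stay in training.
train, val = resolve_validation(
    x, y, validation_split=0.2, validation_data=([100], [1])
)
print(len(train[0]), val)  # 10 ([100], [1])
```

In this model, supplying validation_data means no rows are removed from the training arrays, which matches the "split is ignored" behavior described above.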

Example With A Manual Split

A manual split is often clearer, especially when you want stratification or custom preprocessing.

python
import numpy as np
from tensorflow import keras
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data: 100 samples, 10 features.
x = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=(100,))

# Hold out 20 percent for validation; random_state makes the split repeatable.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=42
)

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=3, verbose=0)

This approach makes the split explicit and reproducible.

When validation_split Is Fine

For small experiments with arrays already in memory, validation_split is perfectly reasonable.

python
model.fit(x, y, validation_split=0.2, epochs=3, verbose=0)

It is especially handy for quick prototypes, but it should not be confused with a carefully designed validation strategy.

Important Behavior Details

A few details matter in practice:

  • validation_split applies only to the arrays passed to fit
  • it is not supported when the input is a dataset, a generator, or another streaming source
  • it takes the last rows of the arrays as provided, before shuffling, with no random or stratified sampling
  • validation_data is usually clearer when data order matters or preprocessing differs

That last point matters for time series and grouped data, where a naive slice may produce misleading validation results.
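For grouped data, one option is to hold out whole groups yourself and pass the result as validation_data. The sketch below assumes a hypothetical `groups` array (for example, a patient or user ID per row) and synthetic data; it shows one possible approach, not a prescribed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 12 rows from 4 groups, 3 rows per group.
groups = np.repeat(np.arange(4), 3)  # [0 0 0 1 1 1 2 2 2 3 3 3]
x = rng.random((12, 5))
y = rng.integers(0, 2, size=12)

# Pick whole groups for validation so no group appears on both sides.
val_groups = rng.choice(np.unique(groups), size=1, replace=False)
val_mask = np.isin(groups, val_groups)

x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]

# The arrays can then be passed explicitly, for example:
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5)
```

A plain tail slice could cut through the middle of a group if the rows were ordered differently, so rows from the same group could land in both training and validation.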

Common Pitfalls

The most common mistake is assuming validation_split performs a sophisticated train-validation split. It does not. It is just a convenience slice on the arrays you supplied.

Another mistake is passing both validation_split and validation_data and expecting Keras to combine them. It will not. The explicit validation dataset takes precedence.

A third issue is using validation_split on ordered data such as time series without thinking about leakage or sampling bias.
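The ordering pitfall is easy to reproduce with NumPy alone. The sketch below assumes a hypothetical dataset sorted by label and inspects the tail that validation_split=0.2 would hold out. Shuffling first is only appropriate when the rows are not time-dependent; for time series, keep the order and choose the split deliberately.

```python
import numpy as np

# Hypothetical dataset sorted by label: 80 rows of class 0, then 20 of class 1.
y = np.concatenate([np.zeros(80, dtype=int), np.ones(20, dtype=int)])
x = np.random.default_rng(0).random((100, 10))

# Class counts in the tail that validation_split=0.2 would hold out
# from the arrays as given: every validation row is class 1.
print(np.bincount(y[80:], minlength=2))  # [ 0 20]

# Shuffle inputs and labels with the same permutation before calling fit,
# so the tail is a random sample instead of the end of a sorted array.
perm = np.random.default_rng(42).permutation(len(x))
x_shuffled, y_shuffled = x[perm], y[perm]
tail_counts = np.bincount(y_shuffled[80:], minlength=2)
print(tail_counts)
```

In the unshuffled case the model would also train on class 0 only, so the validation metrics would say little about real performance.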

Summary

  • Both options provide validation metrics during fit.
  • validation_split carves validation rows out of the training arrays automatically.
  • validation_data uses an explicit separate validation dataset.
  • If both are supplied, validation_data takes priority.
  • Use validation_split for quick experiments and validation_data when you need full control.
