How Are Training and Test Data Split?

Introduction

Training and test splitting is how you simulate the real-world question every model must answer: can it generalize to data it has never seen before? The training set teaches the model, while the test set stays hidden until evaluation time. The split itself is simple. The important part is doing it in a way that avoids leakage, preserves the data distribution, and matches the actual prediction scenario.

The Basic Idea

A dataset is commonly divided into at least two parts:

  • training data, used to fit the model
  • test data, used only for final evaluation

A simple example with scikit-learn looks like this:

python
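from sklearn.model_selection import train_test_split

# X (features) and y (labels) are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the rows for testing
    random_state=42    # fix the shuffle so the split is reproducible
)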

Here, 80 percent of the data becomes training data and 20 percent becomes test data.

Why the Test Set Must Stay Untouched

The test set exists to estimate how the trained model will behave on future unseen data. If you use the test set during tuning, feature selection, preprocessing decisions, or threshold adjustment, it stops being a fair test.

That is why many workflows also introduce a validation set or cross-validation for model selection, leaving the test set for the very end.

Common Split Ratios

There is no universal split ratio, but common patterns are:

  • 80/20
  • 70/30
  • 60/20/20 for train, validation, and test

The right ratio depends on dataset size. If you have a very large dataset, even 10 percent test data can be plenty. If you have a very small dataset, cross-validation is often more informative than a single holdout split.
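
As a rough illustration of the 60/20/20 pattern, one common approach is to call train_test_split twice. This is only a sketch: the toy arrays and the names X_temp, y_temp, X_val, and y_val are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Step 1: hold out 20% of all rows as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: take 25% of the remaining 80% as validation
# (0.25 * 0.8 = 0.2 of the full dataset).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The second call uses 0.25 rather than 0.2 because it operates on the 80 percent that remains after the test set has been removed.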

Use Stratification for Classification

If the target classes are imbalanced, random splitting can produce distorted class proportions. Stratified splitting helps keep the label distribution similar across train and test sets.

python
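from sklearn.model_selection import train_test_split

# X (features) and y (class labels) are assumed to be loaded already.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y         # keep class proportions similar in both sets
)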

This is especially important when minority classes are small. Without stratification, the test set may end up with too few examples of the rare class to evaluate meaningfully.
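
To see the effect, the following sketch builds a small imbalanced label array (made up for this example, with placeholder features) and checks the positive rate on each side of a stratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder features

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both parts keep roughly the original 10% positive rate.
train_pos_rate = y_train.mean()
test_pos_rate = y_test.mean()
print(train_pos_rate, test_pos_rate)
```

Dropping the stratify argument and rerunning with a few different random_state values is a quick way to see how much the test-set positive rate can drift on small data.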

Split Before Fitting Preprocessing

One of the most important rules in machine learning is: split first, fit preprocessing second.

Correct pattern:

  1. split into training and test sets
  2. fit scalers, encoders, or imputers on training data only
  3. apply the learned transforms to both training and test data

Wrong pattern:

  1. fit a scaler on the full dataset
  2. split afterward

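The wrong pattern leaks information from the test set into training, because statistics such as the scaler's mean and variance are computed from rows that later end up in the test set.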
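
The correct pattern might look like the sketch below. It uses StandardScaler as one example of a preprocessing step, and the synthetic data is only a stand-in for a real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 samples, 3 features with non-zero mean.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training data only.
scaler = StandardScaler()
scaler.fit(X_train)

# 3. Apply the learned mean and scale to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Wrapping the scaler and the model in a scikit-learn Pipeline is another way to get the same guarantee, since the pipeline fits each step only on the data passed to fit.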

Time Series Data Is Different

For time series, random splitting is often wrong because it lets the model train on future information and test on the past.

A time-aware split should preserve chronology:

python
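# Assumes X and y are already sorted by time, oldest first.
split_index = 800  # rows before this index train, the rest test
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]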

The general rule is that the split should reflect the actual prediction scenario. If the model will predict the future from the past, the data split should do the same.
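
When several chronological folds are useful, scikit-learn's TimeSeriesSplit is one option. The sketch below assumes a tiny, already time-ordered series that is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series of 12 observations, already sorted by time.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, test_idx in tscv.split(X):
    # In every fold, all training indices come before all test indices.
    folds.append((train_idx, test_idx))
    print("train:", train_idx, "test:", test_idx)
```

Each successive fold trains on a longer prefix of the series and tests on the block that immediately follows it.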

Grouped Data Needs Grouped Splits

If multiple rows belong to the same user, patient, device, or document, random row-level splitting can leak entity-specific information across train and test.

In those cases, grouped splitting is safer so the same entity does not appear in both sets.

This is a common hidden source of inflated scores in real-world tabular problems.
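
One way to do this in scikit-learn is GroupShuffleSplit. The following sketch assumes a hypothetical dataset of 10 users with 5 rows each, where the groups array holds the user ID for every row.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 users, 5 rows per user.
groups = np.repeat(np.arange(10), 5)
X = np.arange(50).reshape(-1, 1)  # placeholder features
y = np.zeros(50)                  # placeholder labels

# test_size is a fraction of the groups, not of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
print(sorted(test_groups))
```

Checking that train_groups and test_groups are disjoint is a cheap sanity test worth keeping in a real pipeline.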

Validation and Cross-Validation

A single train-test split is often enough for a first baseline. But if you need to tune hyperparameters or compare many models, use one of the following:

  • a validation set
  • cross-validation on the training data

Then evaluate once on the test set at the end. That keeps the final estimate honest.
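
A minimal sketch of that workflow follows. The synthetic data, the logistic regression model, and the candidate values for C are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Set the test set aside before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline refits the scaler inside each cross-validation fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid of regularization strengths to compare.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)  # tuning sees training data only

# Single final evaluation on the untouched test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

All model selection happens inside search.fit on the training portion, so the test score is computed only once, after the choice of C is already fixed.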

Common Pitfalls

A common mistake is preprocessing the entire dataset before the split. That leaks test-set information.

Another issue is using the test set repeatedly while tuning the model, which turns it into a validation set in disguise.

Developers also often forget stratification for imbalanced classification problems, leading to misleading evaluation.

Finally, random splitting is inappropriate for time series or grouped data where rows are not independent.

Summary

  • Training data is for fitting the model; test data is for final evaluation only.
  • Split the data before fitting scalers, encoders, or other preprocessing steps.
  • Use stratified splits for imbalanced classification when appropriate.
  • Use time-aware or group-aware splits when the data structure requires it.
  • Treat the test set as a final exam, not as part of model development.
