How to Normalize Train and Test Data Using sklearn's MinMaxScaler

Introduction

When you use MinMaxScaler, the rule is simple: fit on the training data only, then apply that same fitted scaler to both the training and test data. If you fit on the test set as well, its own statistics shape the preprocessing, which leaks information and distorts the evaluation.

What MinMaxScaler Does

MinMaxScaler rescales each feature into a chosen range, 0 to 1 by default (configurable through the feature_range parameter). It does this by learning the minimum and maximum value of each feature from the training set.

A basic example:

python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

X_train = np.array([[1.0], [2.0], [4.0], [8.0]])

# Learn the minimum (1.0) and maximum (8.0) from X_train, then rescale it.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

print(X_train_scaled)

After fitting, the scaler stores the training-set statistics and uses them later during transform.
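To make the stored statistics concrete, here is a minimal sketch that inspects them and reproduces transform by hand. It assumes the default range of 0 to 1; the X_new values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [2.0], [4.0], [8.0]])
scaler = MinMaxScaler().fit(X_train)

# Statistics learned from the training data during fit.
print(scaler.data_min_)  # per-feature minimum
print(scaler.data_max_)  # per-feature maximum

# With the default feature_range=(0, 1), transform applies
# (x - data_min_) / (data_max_ - data_min_) to each feature.
X_new = np.array([[3.0], [6.0]])  # illustrative values, not from the article
manual = (X_new - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)

print(np.allclose(manual, scaler.transform(X_new)))
```

The manual result matching scaler.transform shows that nothing about X_new influences the scaling; only the training minimum and maximum do.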

Correct Train and Test Workflow

The standard pattern is:

  1. split the data
  2. fit the scaler on X_train
  3. transform both X_train and X_test

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([
    [10.0, 100.0],
    [20.0, 300.0],
    [15.0, 200.0],
    [30.0, 500.0],
])
y = np.array([0, 1, 0, 1])

# stratify=y keeps both classes in each half of this tiny dataset,
# which a classifier needs in order to fit on y_train.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit and transform on train
X_test_scaled = scaler.transform(X_test)        # transform only on test

print(X_train_scaled)
print(X_test_scaled)

Notice that fit_transform is used only on X_train. The test data gets only transform.

Why Fitting on Test Data Is Wrong

If you call fit_transform on the test set as well, the test set gets scaled using its own minimum and maximum values. That leaks information from the test distribution into preprocessing and makes the evaluation less realistic.

The model should see test data processed the same way new unseen production data would be processed: using training-time statistics only.
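A small sketch makes the difference visible. The numbers below are invented for illustration: the test values 40 and 60 sit in the middle of the training range, but a scaler fitted on the test set alone pushes them to the ends of the range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data, chosen so the arithmetic is easy to follow.
X_train = np.array([[0.0], [50.0], [100.0]])
X_test = np.array([[40.0], [60.0]])

# Correct: reuse the scaler fitted on the training data.
scaler = MinMaxScaler().fit(X_train)
test_correct = scaler.transform(X_test)

# Incorrect: a second scaler fitted on the test data itself.
test_wrong = MinMaxScaler().fit_transform(X_test)

print(test_correct.ravel())  # positions relative to the training range
print(test_wrong.ravel())    # positions relative to the test set's own range
```

With the separate scaler, a raw value of 60 is encoded the same way a raw value of 100 was during training, so the model receives inputs that no longer mean what they meant when it was fitted.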

Use a Pipeline When Possible

The safest way to avoid mistakes is to use a pipeline:

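python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# X_train, X_test, and y_train come from the train/test split shown earlier;
# y_train must contain both classes for LogisticRegression to fit.
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# fit() fits the scaler on X_train only; predict() reuses those statistics.
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)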

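When you pass the whole pipeline to a cross-validation utility such as cross_val_score, the scaler is refitted on the training folds of each split and never sees the held-out fold. In a normal fit, the scaler is fitted only on the data you pass to fit, so passing the training split keeps the test split out of preprocessing.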

That matters a lot once your workflow grows beyond one simple split, because leakage bugs become much harder to spot by inspection alone.
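As a rough sketch of the cross-validation case, the pipeline can be handed to cross_val_score directly. The synthetic dataset from make_classification and the choice of five folds are assumptions made only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
])

# For each of the 5 splits, a fresh copy of the pipeline is fitted on the
# training folds only, then scored on the held-out fold.
scores = cross_val_score(pipeline, X, y, cv=5)

print(scores)
print(scores.mean())
```

Scaling X once up front and then cross-validating only the model would look similar in code, but every fold's scaler statistics would then include the held-out rows.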

Inverse Transform Can Restore Original Scale

If you need to convert scaled values back to their original units, use inverse_transform:

python
original_values = scaler.inverse_transform(X_train_scaled)
print(original_values)

This is useful for debugging and for interpreting predictions in the original feature space.

It is also helpful when your model predicts scaled numeric targets and you need to present results back in human-readable units.
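Here is one way that target case might look. It is a sketch under assumptions: the data follows y = 3x + 5 exactly, LinearRegression stands in for whatever model you use, and y_scaler is a second scaler kept separate from the feature scaler.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Illustrative data following y = 3x + 5 exactly.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([8.0, 11.0, 14.0, 17.0])

# MinMaxScaler expects 2D input, so the 1D target becomes a single column.
y_scaler = MinMaxScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))

model = LinearRegression().fit(X_train, y_train_scaled)

X_new = np.array([[5.0]])
pred_scaled = model.predict(X_new)              # prediction in scaled units
pred = y_scaler.inverse_transform(pred_scaled)  # back in original units

print(pred_scaled.ravel())
print(pred.ravel())
```

The scaled prediction is above 1 because x = 5 lies beyond the training range; inverse_transform still maps it back to the original units correctly.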

Common Pitfalls

One common mistake is scaling the full dataset before the train-test split. The scaler's minimum and maximum are then computed over rows that later end up in the test set, so information from the test set leaks into training.
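The effect is easy to see with a deliberately extreme, made-up example in which the single test row is much larger than any training row:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: the test row is far outside the training range.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[100.0]])

# Leaky: the scaler sees the test row, so its maximum becomes 100.
leaky = MinMaxScaler().fit(np.vstack([X_train, X_test]))
train_leaky = leaky.transform(X_train)

# Correct: only training rows determine the minimum and maximum.
clean = MinMaxScaler().fit(X_train)
train_clean = clean.transform(X_train)

print(train_leaky.ravel())  # training values squeezed toward 0
print(train_clean.ravel())  # training values span the full 0 to 1 range
```

The training features themselves change depending on what is in the test set, which is exactly the dependency a held-out evaluation is supposed to rule out.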

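Another issue is fitting separate scalers to train and test data. The same raw value then maps to different scaled values in each set, so the model is evaluated on features that are encoded differently from the ones it was trained on.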

It is also easy to forget that values in the test set can fall outside the training-set range. In that case, the transformed test values can be less than 0 or greater than 1, and that is expected behavior.
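A short sketch with invented values shows this behavior. It also uses the clip=True option, which recent scikit-learn versions provide for limiting transformed values to the feature range; whether clipping suits your model is a separate decision, and it is shown here only as an option.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: 5 is below and 40 is above the training range of 10 to 30.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler().fit(X_train)
scaled = scaler.transform(X_test)
print(scaled.ravel())  # first value is below 0, last value is above 1

# Optional: clip=True limits transformed values to feature_range.
clipping_scaler = MinMaxScaler(clip=True).fit(X_train)
clipped = clipping_scaler.transform(X_test)
print(clipped.ravel())
```

Clipping discards the information that a value was out of range, so leaving the values unclipped is often the more honest default.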

Summary

  • Fit MinMaxScaler on the training data only.
  • Use the fitted scaler to transform both training and test features.
  • Do not call fit on the test set or on the full dataset before splitting.
  • Pipelines are the safest way to prevent preprocessing leakage.
  • Test values outside the training range can scale outside 0 to 1, and that is normal.
