Label encoding across multiple columns in scikit-learn

Introduction

Encoding categorical features is part of almost every tabular machine-learning pipeline. The important detail is that categories belong to separate columns, so an encoder must preserve that separation instead of mixing all labels into one shared integer space.

Core Sections

Why `LabelEncoder` is usually the wrong tool for feature columns

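LabelEncoder was designed for target labels such as a class column, not for an entire feature matrix. It maps one one-dimensional array of values to integers and stores a single mapping in classes_. If you refit one LabelEncoder on several feature columns in turn, each fit overwrites the previous mapping, so only the last column can be transformed or inverted later. If you instead fit it on the pooled values of all columns, unrelated categories end up sharing one integer space.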

For example, red, XL, and Canada are unrelated categories even though all are strings. Treating them as members of one category pool creates arbitrary numeric codes that do not reflect feature meaning.
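To make the problem concrete, here is a minimal sketch that refits one LabelEncoder on two columns. The variable names and sample values are illustrative assumptions, not part of any real dataset.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# First fit: the encoder learns the color categories.
le.fit(["red", "blue", "green"])
color_classes = list(le.classes_)

# Second fit on another column replaces the stored mapping entirely.
le.fit(["S", "M", "L", "XL"])
size_classes = list(le.classes_)

# The color mapping is gone, so transforming a color now fails.
try:
    le.transform(["red"])
    color_still_known = True
except ValueError:
    color_still_known = False

print(color_classes)
print(size_classes)
print(color_still_known)
```

After the second fit, the encoder only knows the size categories, which is why a shared instance cannot serve several columns at inference time.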

Instead, use encoders that understand columns explicitly:

OrdinalEncoder for one integer code per category per column
OneHotEncoder when category order should not be implied
ColumnTransformer to combine categorical and numeric preprocessing cleanly

Encode multiple columns with `OrdinalEncoder`

OrdinalEncoder works across several columns at once and stores a separate category list for each column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame(
    {
        "color": ["red", "blue", "green", "red"],
        "size": ["S", "M", "L", "XL"],
        "material": ["cotton", "wool", "cotton", "linen"],
    }
)

# Categories unseen during fit are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

encoded = encoder.fit_transform(X)
print(encoded)              # one column of integer codes per input column
print(encoder.categories_)  # one sorted category array per input column
```

This produces a numeric matrix with one mapping per original column. The categories_ attribute is important because it tells you exactly how each column was encoded.
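As a rough illustration of how to read categories_, the sketch below turns it into one dictionary per column. The mapping variable is a hypothetical helper introduced here, and the data repeats the example above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame(
    {
        "color": ["red", "blue", "green", "red"],
        "size": ["S", "M", "L", "XL"],
        "material": ["cotton", "wool", "cotton", "linen"],
    }
)

encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
).fit(X)

# The position of a category inside categories_[i] is its integer code
# for column i, so enumerate() recovers the mapping.
mapping = {
    column: {category: code for code, category in enumerate(categories)}
    for column, categories in zip(X.columns, encoder.categories_)
}

for column, codes in mapping.items():
    print(column, codes)
```

By default the categories are sorted, so the codes follow alphabetical order rather than any domain meaning.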

Ordinal encoding is often acceptable for tree-based models because they split on thresholds and are less sensitive to arbitrary spacing between category ids. Even then, the ordering is still artificial, so measure performance instead of assuming it is harmless.
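When a column does have a real order, one option is to pass that order through the categories parameter instead of accepting the default sorted order. The sketch below assumes that S < M < L < XL is the intended ranking; the size_order list is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]})

# Assumed domain order; categories takes one list per encoded column.
size_order = ["S", "M", "L", "XL"]
ordered = OrdinalEncoder(categories=[size_order])

codes = ordered.fit_transform(sizes).ravel().tolist()
print(codes)
```

With the explicit list, the codes follow the declared order rather than the alphabetical one (L, M, S, XL).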

Prefer `OneHotEncoder` for linear models and distance-based models

Many models interpret larger numbers as larger values. That makes ordinal encoding risky when the categories have no real order. OneHotEncoder avoids that problem by expanding each category into a separate binary feature.

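```python
from sklearn.preprocessing import OneHotEncoder

# Reuses X from the OrdinalEncoder example.
# sparse_output requires scikit-learn 1.2 or newer.
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded_ohe = ohe.fit_transform(X[["color", "size"]])

print(encoded_ohe)
print(ohe.get_feature_names_out())  # e.g. color_blue, color_green, ...
```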

Now red is not “greater than” blue. The model sees independent indicator columns instead of invented numeric ranks.
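One way to see the difference is to compare distances between encoded categories, which is what a distance-based model would effectively do. This is a small sketch with an illustrative three-color column; the distance helper is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["blue"], ["green"], ["red"]])

ordinal = OrdinalEncoder().fit_transform(colors)
onehot = OneHotEncoder().fit_transform(colors).toarray()


def distance(matrix, i, j):
    """Euclidean distance between encoded rows i and j."""
    return float(np.linalg.norm(matrix[i] - matrix[j]))


# Ordinal codes make blue look closer to green than to red.
ordinal_blue_green = distance(ordinal, 0, 1)
ordinal_blue_red = distance(ordinal, 0, 2)

# One-hot rows keep every pair of distinct categories equally far apart.
onehot_blue_green = distance(onehot, 0, 1)
onehot_blue_red = distance(onehot, 0, 2)

print(ordinal_blue_green, ordinal_blue_red)
print(onehot_blue_green, onehot_blue_red)
```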

Put encoding inside a pipeline

The safest production pattern is to keep preprocessing inside a Pipeline so training and inference use the same fitted encoder.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame(
    {
        "color": ["red", "blue", "green", "red"],
        "size": ["S", "M", "L", "XL"],
        "weight": [1.1, 2.0, 2.7, 3.5],
    }
)
y = [0, 1, 0, 1]

# One-hot encode the categorical columns and scale the numeric column.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "size"]),
        ("num", StandardScaler(), ["weight"]),
    ]
)

# Fitting the pipeline fits the encoder, the scaler, and the classifier together.
model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("classifier", LogisticRegression(max_iter=500)),
    ]
)

model.fit(X, y)
print(model.predict(X.head(2)))
```

This structure also makes persistence straightforward because the preprocessing and model live in one fitted object.

Handle unknown categories deliberately

Real input data changes. If training saw red, blue, and green, production may later send black. With the default handle_unknown="error", both OrdinalEncoder and OneHotEncoder raise a ValueError when transform meets a category that was absent during fit.

For ordinal encoding, use handle_unknown="use_encoded_value" and a reserved value like -1. For one-hot encoding, use handle_unknown="ignore", which encodes an unseen category as all zeros in that feature's indicator columns. Then verify that downstream code can tolerate that representation.
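The three behaviors can be compared side by side. This is a minimal sketch on a made-up color column, where black stands in for a category that appears only after training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train = pd.DataFrame({"color": ["red", "blue", "green"]})
new = pd.DataFrame({"color": ["black", "red"]})

# Default configuration: unseen categories raise at transform time.
strict = OrdinalEncoder().fit(train)
try:
    strict.transform(new)
    strict_failed = False
except ValueError:
    strict_failed = True

# Ordinal encoding with a reserved code for unknown values.
ordinal = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
).fit(train)
ordinal_codes = ordinal.transform(new).ravel().tolist()

# One-hot encoding that ignores unknown values (all-zero indicator row).
onehot = OneHotEncoder(handle_unknown="ignore").fit(train)
onehot_rows = onehot.transform(new).toarray().tolist()

print(strict_failed)
print(ordinal_codes)
print(onehot_rows)
```

Whether -1 or an all-zero row is acceptable depends on the model that consumes it, so it is worth testing that path explicitly.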

You should also fit on the training split only. Fitting on the full dataset leaks information from validation or test data into the preprocessing step.
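A minimal sketch of that order of operations follows. The data is invented, and shuffle=False is used only so the split is deterministic for this illustration; a real split would normally shuffle.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"color": ["red", "blue", "green", "red", "blue", "black"]})

# Split first. With shuffle=False the last two rows become the test set.
X_train, X_test = train_test_split(X, test_size=2, shuffle=False)

encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

# Fit on the training rows only, then reuse the fitted mapping on the test rows.
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

learned_categories = list(encoder.categories_[0])
test_codes = X_test_encoded.ravel().tolist()

print(learned_categories)
print(test_codes)
```

Here black exists only in the test rows, so the encoder never learns it and maps it to the reserved value, which mirrors what would happen in production.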

Save the fitted preprocessing with the model

If you save only the estimator and rebuild the encoder later, category ordering can drift. Persist the full pipeline instead.

```python
import joblib

# model and X come from the pipeline example.
joblib.dump(model, "product_model.joblib")
loaded_model = joblib.load("product_model.joblib")
print(loaded_model.predict(X.tail(1)))
```

That keeps the fitted category mapping stable across training and deployment.
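If you want to check this claim, one possible approach is to compare the fitted categories and predictions before and after a round trip through joblib. The sketch below rebuilds the toy pipeline and writes to a temporary directory; the fitted_categories helper is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame(
    {
        "color": ["red", "blue", "green", "red"],
        "size": ["S", "M", "L", "XL"],
        "weight": [1.1, 2.0, 2.7, 3.5],
    }
)
y = [0, 1, 0, 1]

model = Pipeline(
    steps=[
        (
            "preprocess",
            ColumnTransformer(
                transformers=[
                    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "size"]),
                    ("num", StandardScaler(), ["weight"]),
                ]
            ),
        ),
        ("classifier", LogisticRegression(max_iter=500)),
    ]
)
model.fit(X, y)

# Save and reload the whole fitted pipeline in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "product_model.joblib"
    joblib.dump(model, path)
    loaded_model = joblib.load(path)


def fitted_categories(pipeline):
    """Return the category lists learned by the pipeline's one-hot encoder."""
    encoder = pipeline.named_steps["preprocess"].named_transformers_["cat"]
    return [list(categories) for categories in encoder.categories_]


same_categories = fitted_categories(model) == fitted_categories(loaded_model)
same_predictions = bool((model.predict(X) == loaded_model.predict(X)).all())

print(same_categories, same_predictions)
```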

Common Pitfalls

Reusing one LabelEncoder instance across unrelated feature columns, which mixes category spaces and creates unstable mappings.
Fitting encoders before the train and test split, which leaks information and inflates validation scores.
Using ordinal encoding for categories with no real order when the model interprets larger values as meaningful.
Forgetting to define behavior for unseen categories, which causes inference-time crashes on new input values.
Saving only the trained estimator and not the preprocessing pipeline, which breaks category consistency after deployment.

Summary

LabelEncoder is mainly for target labels, not multi-column feature tables.
Use OrdinalEncoder or OneHotEncoder so each categorical column keeps its own mapping.
Prefer OneHotEncoder when category order should not influence the model.
Wrap preprocessing in a Pipeline and ColumnTransformer for reproducible training and inference.
Persist the fitted preprocessing with the model to prevent category drift.
