Label encoding across multiple columns in scikit-learn
Introduction
Encoding categorical features is part of almost every tabular machine-learning pipeline. The important detail is that categories belong to separate columns, so an encoder must preserve that separation instead of mixing all labels into one shared integer space.
Core Sections
Why LabelEncoder is usually the wrong tool for feature columns
LabelEncoder was designed for target labels such as a class column, not for an entire feature matrix. It maps one flat list of values to integers. If you reuse a single LabelEncoder across several feature columns, each refit overwrites the previous mapping, and fitting on the pooled values merges unrelated categories into one integer space. Either way, the mappings become difficult to reason about.
For example, red, XL, and Canada are unrelated categories even though all are strings. Treating them as members of one category pool creates arbitrary numeric codes that do not reflect feature meaning.
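As a small illustration of the reuse problem, the sketch below refits one LabelEncoder on a second column and shows that the first column's mapping is gone. The DataFrame and its column names are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["XL", "S", "S"]})

le = LabelEncoder()
le.fit(df["color"])
le.fit(df["size"])  # refitting discards the mapping learned for "color"

remaining_classes = list(le.classes_)  # only the size labels are left

# The encoder no longer knows any color value, so transform raises ValueError.
try:
    le.transform(df["color"])
    color_still_encodable = True
except ValueError:
    color_still_encodable = False

print(remaining_classes, color_still_encodable)
```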
Instead, use encoders that understand columns explicitly:
- OrdinalEncoder for one integer code per category per column
- OneHotEncoder when category order should not be implied
- ColumnTransformer to combine categorical and numeric preprocessing cleanly
Encode multiple columns with OrdinalEncoder
OrdinalEncoder works across several columns at once and stores a separate category list for each column.
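A minimal sketch of that behavior follows, assuming a small pandas DataFrame whose column names and values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],
    "size": ["S", "XL", "M", "S"],
    "country": ["Canada", "France", "Canada", "Japan"],
})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(df)  # float array, one column per input column

# Each input column gets its own category array.
for name, cats in zip(df.columns, encoder.categories_):
    print(name, list(cats))
print(encoded)
```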
The result is a numeric matrix with one mapping per original column. The categories_ attribute is important because it tells you exactly how each column was encoded: it holds one array per column, and a value's position in that array is its integer code.
Ordinal encoding is often acceptable for tree-based models because they split on thresholds and are less sensitive to arbitrary spacing between category ids. Even then, the ordering is still artificial, so measure performance instead of assuming it is harmless.
Prefer OneHotEncoder for linear models and distance-based models
Many models interpret larger numbers as larger values. That makes ordinal encoding risky when the categories have no real order. OneHotEncoder avoids that problem by expanding each category into a separate binary feature.
Now red is not “greater than” blue. The model sees independent indicator columns instead of invented numeric ranks.
Put encoding inside a pipeline
The safest production pattern is to keep preprocessing inside a Pipeline so training and inference use the same fitted encoder.
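One way to arrange this is sketched below. The data, the column names, and the choice of LogisticRegression and StandardScaler are assumptions made for the example, not requirements.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names and labels are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "size": ["S", "XL", "M", "S", "M", "XL"],
    "price": [10.0, 25.0, 15.0, 12.0, 14.0, 30.0],
})
y = [0, 1, 0, 0, 0, 1]

# Route categorical and numeric columns to different transformers.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "size"]),
    ("num", StandardScaler(), ["price"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])

model.fit(df, y)                 # fits the encoders and the classifier together
predictions = model.predict(df)  # reuses the fitted encoders at inference
print(predictions)
```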
This structure also makes persistence straightforward because the preprocessing and model live in one fitted object.
Handle unknown categories deliberately
Real input data changes. If training saw red, blue, and green, production may later send black. If the encoder is not configured for unknown categories, inference can fail unexpectedly.
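For ordinal encoding, use handle_unknown="use_encoded_value" together with unknown_value set to a reserved code such as -1. For one-hot encoding, use handle_unknown="ignore", which encodes an unseen category as an all-zero row for that column. Then verify downstream code can tolerate that representation.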
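Both settings can be sketched on a tiny example where black appears only at inference time. The data is invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical training data and later input containing an unseen value.
train = pd.DataFrame({"color": ["red", "blue", "green"]})
new = pd.DataFrame({"color": ["blue", "black"]})

# Ordinal: unseen categories map to the reserved code -1.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(train)
ordinal_codes = ordinal.transform(new)

# One-hot: unseen categories become a row of zeros.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(train)
onehot_rows = onehot.transform(new).toarray()

print(ordinal_codes)
print(onehot_rows)
```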
You should also fit on the training split only. Fitting on the full dataset leaks information from validation or test data into the preprocessing step.
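A short sketch of the split-then-fit order is shown here, assuming a made-up single-column dataset and an arbitrary random_state.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; values are illustrative only.
df = pd.DataFrame(
    {"color": ["red", "blue", "green", "blue", "red", "green", "blue", "red"]}
)

# Split first, so the encoder never sees the held-out rows while fitting.
train, test = train_test_split(df, test_size=0.25, random_state=0)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)                    # categories come from training rows only
test_codes = encoder.transform(test)  # held-out rows are only transformed

print(list(encoder.categories_[0]))
print(test_codes)
```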
Save the fitted preprocessing with the model
If you save only the estimator and rebuild the encoder later, category ordering can drift. Persist the full pipeline instead.
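A minimal sketch of persisting the whole pipeline follows, assuming joblib for serialization and a temporary directory as the storage location. The file name model.joblib and the toy data are placeholders.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; values are illustrative only.
X = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
y = [0, 1, 0, 1]

pipeline = Pipeline([
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression()),
]).fit(X, y)

# Save encoder and model as one object, then load it back.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")  # placeholder location
joblib.dump(pipeline, path)
restored = joblib.load(path)

# The restored pipeline carries the same fitted category mapping.
same_categories = (
    list(restored.named_steps["encode"].categories_[0])
    == list(pipeline.named_steps["encode"].categories_[0])
)
same_predictions = bool((restored.predict(X) == pipeline.predict(X)).all())
print(same_categories, same_predictions)
```

Files written this way should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.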
That keeps the fitted category mapping stable across training and deployment.
Common Pitfalls
- Reusing one LabelEncoder instance across unrelated feature columns, which mixes category spaces and creates unstable mappings.
- Fitting encoders before the train and test split, which leaks information and inflates validation scores.
- Using ordinal encoding for categories with no real order when the model interprets larger values as meaningful.
- Forgetting to define behavior for unseen categories, which causes inference-time crashes on new input values.
- Saving only the trained estimator and not the preprocessing pipeline, which breaks category consistency after deployment.
Summary
- LabelEncoder is mainly for target labels, not multi-column feature tables.
- Use OrdinalEncoder or OneHotEncoder so each categorical column keeps its own mapping.
- Prefer OneHotEncoder when category order should not influence the model.
- Wrap preprocessing in a Pipeline and ColumnTransformer for reproducible training and inference.
- Persist the fitted preprocessing with the model to prevent category drift.

