Combining feature extraction classes in scikit-learn
Introduction
In scikit-learn, combining feature extraction steps is usually about turning different kinds of raw input into one matrix that a model can learn from. The clean way to do that depends on whether the feature extractors operate on different columns or on the same input in different ways.
The Two Main Tools: ColumnTransformer and FeatureUnion
Most of the time, ColumnTransformer is the right choice when your dataset has different columns with different data types. For example, one column may contain text, another may contain categories, and a third may contain numeric values.
FeatureUnion solves a different problem. It combines multiple transformers that all consume the same input and then concatenates their outputs.
A practical rule is:
- use ColumnTransformer for different columns
- use FeatureUnion for multiple views of the same input
A Real Example With Mixed Data
Suppose you want to classify products using:
- a text title
- a category label
- a numeric price
You can combine these feature extractors into one pipeline.
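A minimal sketch of such a pipeline is shown here. The toy data, the column names (title, category, price), the labels, and the choice of LogisticRegression are all assumptions made for illustration, not something prescribed by the scenario.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; names and values are made up for illustration.
products = pd.DataFrame({
    "title": ["red cotton shirt", "wireless mouse", "blue denim jeans", "usb keyboard"],
    "category": ["apparel", "electronics", "apparel", "electronics"],
    "price": [19.99, 24.50, 39.00, 29.99],
})
labels = ["clothing", "gadget", "clothing", "gadget"]

preprocess = ColumnTransformer([
    # A string selector hands the vectorizer a 1-D column of text.
    ("title", TfidfVectorizer(), "title"),
    # List selectors hand the encoder and scaler 2-D, single-column tables.
    ("category", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ("price", StandardScaler(), ["price"]),
])

model = Pipeline([
    ("features", preprocess),
    ("clf", LogisticRegression()),
])
model.fit(products, labels)

# Prediction reuses the fitted vocabulary, categories, and scaling.
new_items = pd.DataFrame({
    "title": ["green wool sweater"],
    "category": ["apparel"],
    "price": [45.0],
})
predictions = model.predict(new_items)
```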
This works because ColumnTransformer applies the right transformer to each column and then concatenates the outputs into one feature matrix.
When FeatureUnion Is the Better Tool
Sometimes you want multiple feature extractors from the same raw text. For example, you may want both TF-IDF features and a handcrafted numeric feature such as text length.
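One way to sketch this, assuming a small helper named text_length (hypothetical, written for this example) wrapped in a FunctionTransformer:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def text_length(texts):
    # Hypothetical helper: one column holding each string's character count.
    return np.array([[len(t)] for t in texts], dtype=float)


texts = ["short note", "a much longer piece of text about products"]

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    # validate=False keeps the raw strings from being checked as numeric input.
    ("length", FunctionTransformer(text_length, validate=False)),
])

# Outputs are stacked side by side in list order: TF-IDF columns, then length.
X = features.fit_transform(texts)
```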
Here both transformers see the same list of strings, and FeatureUnion merges their outputs.
Why Pipelines Matter
You could run each transformer manually and concatenate matrices yourself, but pipelines are safer for three reasons:
- the preprocessing is tied to the model
- training and prediction use the same feature logic
- cross-validation does not leak preprocessing from one split into another
That last point is especially important. If you fit a text vectorizer outside the pipeline before cross-validation, you can accidentally leak vocabulary information across folds.
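The sketch below shows the leak-free arrangement under stated assumptions: the texts and labels are invented, and make_pipeline with LogisticRegression is just one reasonable choice. Because the vectorizer sits inside the pipeline, cross_val_score refits it on the training portion of every fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 0 = clothing, 1 = gadget.
texts = [
    "cheap red shirt",
    "blue cotton shirt",
    "soft wool sweater",
    "wireless optical mouse",
    "mechanical usb keyboard",
    "usb optical mouse",
]
labels = [0, 0, 0, 1, 1, 1]

# The vectorizer is a pipeline step, so each fold learns its own vocabulary
# and IDF weights from that fold's training texts only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(pipeline, texts, labels, cv=2)
```

The leaky alternative would call TfidfVectorizer().fit_transform(texts) once on the full corpus and cross-validate only the classifier on the resulting matrix.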
Choosing Between the Two
If your data starts as a table with named columns, start by asking whether each feature extractor belongs to a different column. If yes, ColumnTransformer is usually the clean answer.
If you have one input source and want several independent feature views from it, use FeatureUnion.
You can also combine them. A common pattern is a ColumnTransformer at the outer level and a nested pipeline or union inside one branch for a specific column.
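As a rough illustration of that nesting, the sketch below puts a FeatureUnion of word-level and character-level TF-IDF inside the text branch of a ColumnTransformer. The data, column names, and the particular n-gram settings are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table for illustration.
products = pd.DataFrame({
    "title": ["red cotton shirt", "wireless mouse", "blue denim jeans"],
    "price": [19.99, 24.50, 39.00],
})

# Inner union: two views of the same text column.
title_features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word")),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))),
])

# Outer ColumnTransformer: routes each column to its own branch.
preprocess = ColumnTransformer([
    ("title", title_features, "title"),
    ("price", StandardScaler(), ["price"]),
])

X = preprocess.fit_transform(products)
```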
Common Pitfalls
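The most common mistake is using FeatureUnion when the real problem is column-specific preprocessing. Every transformer in a union receives the whole input, so each branch then has to select its own column before it can do any work, which makes the input handling awkward.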
Another mistake is fitting transformers outside a pipeline and then cross-validating only the estimator. That can leak information and make the evaluation too optimistic.
A third pitfall is forgetting that text vectorizers expect one-dimensional text input, while encoders and scalers usually expect column-shaped tabular input.
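The shape difference can be seen directly with a small, made-up table. In a ColumnTransformer the same distinction is expressed through the column selector: a plain string such as "title" yields 1-D input, while a list such as ["price"] yields 2-D input.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table for illustration.
df = pd.DataFrame({
    "title": ["red shirt", "usb mouse", "blue jeans"],
    "price": [19.99, 24.50, 39.00],
})

# Single brackets give a 1-D Series: one string per row, as vectorizers expect.
title_matrix = TfidfVectorizer().fit_transform(df["title"])

# Double brackets give a 2-D, one-column DataFrame, as scalers expect.
price_matrix = StandardScaler().fit_transform(df[["price"]])

# Handing the scaler a 1-D Series is rejected with a ValueError.
try:
    StandardScaler().fit_transform(df["price"])
    scaler_rejected_1d = False
except ValueError:
    scaler_rejected_1d = True
```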
Summary
- Use ColumnTransformer to combine feature extraction across different columns.
- Use FeatureUnion to combine multiple transformers on the same input.
- Wrap feature extraction and the estimator in a Pipeline so training and inference stay aligned.
- Pipelines also help prevent data leakage during validation.
- The right combinator depends on whether your transformers operate on different columns or on different views of the same data.

