Combining feature extraction classes in scikit-learn

Introduction

In scikit-learn, combining feature extraction steps is usually about turning different kinds of raw input into one matrix that a model can learn from. The clean way to do that depends on whether the feature extractors operate on different columns or on the same input in different ways.

The Two Main Tools: `ColumnTransformer` and `FeatureUnion`

ColumnTransformer is usually the right choice when your dataset has several columns that need different preprocessing. For example, one column may contain text, another may contain categories, and a third may contain numeric values.

FeatureUnion solves a different problem. It combines multiple transformers that all consume the same input and then concatenates their outputs.

A practical rule is:

- use ColumnTransformer for different columns
- use FeatureUnion for multiple views of the same input

A Real Example With Mixed Data

Suppose you want to classify products using:

- a text title
- a category label
- a numeric price

You can combine these feature extractors into one pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy product data: one text column, one categorical column, one numeric column
X = pd.DataFrame(
    {
        "title": [
            "fresh apples",
            "gaming laptop",
            "sweet oranges",
            "office keyboard",
        ],
        "category": ["fruit", "electronics", "fruit", "electronics"],
        "price": [1.5, 1200.0, 2.0, 80.0],
    }
)
y = [0, 1, 0, 1]

# "title" is selected with a string (1-D input for the vectorizer);
# the other columns are selected with lists (2-D input for encoder and scaler)
preprocess = ColumnTransformer(
    transformers=[
        ("title_tfidf", TfidfVectorizer(), "title"),
        ("category_ohe", OneHotEncoder(handle_unknown="ignore"), ["category"]),
        ("price_scale", StandardScaler(), ["price"]),
    ]
)

# Feature extraction and the classifier are fitted together as one estimator
model = Pipeline(
    steps=[
        ("features", preprocess),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

model.fit(X, y)
print(model.predict(X))
```

This works because ColumnTransformer applies the right transformer to each column and then concatenates the outputs into one feature matrix.
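To make the concatenation concrete, the following sketch fits only the preprocessing step on the same toy data and compares the width of the combined matrix with the number of columns each branch contributes. It is an illustration rather than part of the original example; the variable names `n_title` and `n_category` are introduced here for clarity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame(
    {
        "title": ["fresh apples", "gaming laptop", "sweet oranges", "office keyboard"],
        "category": ["fruit", "electronics", "fruit", "electronics"],
        "price": [1.5, 1200.0, 2.0, 80.0],
    }
)

preprocess = ColumnTransformer(
    transformers=[
        ("title_tfidf", TfidfVectorizer(), "title"),
        ("category_ohe", OneHotEncoder(handle_unknown="ignore"), ["category"]),
        ("price_scale", StandardScaler(), ["price"]),
    ]
)

features = preprocess.fit_transform(X)

# Fitted sub-transformers are available by name after fitting
n_title = len(preprocess.named_transformers_["title_tfidf"].vocabulary_)
n_category = len(preprocess.named_transformers_["category_ohe"].categories_[0])

# One row per product; the width is the sum of the three branches
print(features.shape)
print(n_title, n_category, 1)
```

The blocks appear in the order the transformers are listed, so the TF-IDF columns come first, then the one-hot columns, then the scaled price.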

When `FeatureUnion` Is the Better Tool

Sometimes you want multiple feature extractors from the same raw text. For example, you may want both TF-IDF features and a handcrafted numeric feature such as text length.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression

texts = ["short note", "very long technical document", "tiny", "another long article"]
y = [0, 1, 0, 1]

class TextLengthTransformer(BaseEstimator, TransformerMixin):
    """Handcrafted feature: the character length of each text."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless
        return self

    def transform(self, X):
        # Return a 2-D array with one column so it can be concatenated
        return np.array([[len(text)] for text in X], dtype=float)

# Both transformers receive the same list of strings
features = FeatureUnion(
    transformer_list=[
        ("tfidf", TfidfVectorizer()),
        ("length", TextLengthTransformer()),
    ]
)

model = Pipeline(
    steps=[
        ("features", features),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

model.fit(texts, y)
print(model.predict(texts))
```

Here both transformers see the same list of strings, and FeatureUnion merges their outputs.
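As a quick check of what the merge produces, this sketch fits the union on its own and looks at the combined matrix. It assumes the same `TextLengthTransformer` as above, repeated here so the block runs by itself; `combined` and `length_column` are names chosen for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

texts = ["short note", "very long technical document", "tiny", "another long article"]

class TextLengthTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)

union = FeatureUnion(
    transformer_list=[
        ("tfidf", TfidfVectorizer()),
        ("length", TextLengthTransformer()),
    ]
)

# The TF-IDF output is sparse, so the union returns a sparse matrix
combined = union.fit_transform(texts)

# Outputs are stacked side by side in list order: TF-IDF columns, then length
length_column = combined.toarray()[:, -1]
print(combined.shape)
print(length_column)
```

The last column holds the raw character counts, which sit on a very different scale from the TF-IDF weights. Depending on the model, it may be worth scaling that branch, for example by wrapping it in a small pipeline with a scaler.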

Why Pipelines Matter

You could run each transformer manually and concatenate matrices yourself, but pipelines are safer for three reasons:

- the preprocessing is tied to the model
- training and prediction use the same feature logic
- cross-validation does not leak preprocessing from one split into another

That last point is especially important. If you fit a text vectorizer outside the pipeline before cross-validation, you can accidentally leak vocabulary information across folds.
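A minimal sketch of the leak-free setup, assuming a small invented set of texts and labels: the whole pipeline is passed to `cross_val_score`, so the vectorizer is refitted on the training part of every split and never sees the held-out texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Invented toy data, only to make the example runnable
texts = [
    "fresh apples",
    "gaming laptop",
    "sweet oranges",
    "office keyboard",
    "ripe bananas",
    "wireless mouse",
    "juicy grapes",
    "laptop charger",
]
y = [0, 1, 0, 1, 0, 1, 0, 1]

model = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

# Each fold clones the pipeline and fits the vectorizer on that fold's
# training texts only, so no vocabulary leaks from the validation texts
scores = cross_val_score(model, texts, y, cv=2)
print(scores)
```

The leaky alternative would be to call `TfidfVectorizer().fit_transform(texts)` once on all the data and then cross-validate only the classifier on the resulting matrix.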

Choosing Between the Two

If your data starts as a table with named columns, start by asking whether each feature extractor belongs to a different column. If yes, ColumnTransformer is usually the clean answer.

If you have one input source and want several independent feature views from it, use FeatureUnion.

You can also combine them. A common pattern is a ColumnTransformer at the outer level and a nested pipeline or union inside one branch for a specific column.
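One way this nesting might look, reusing the product data and the text-length transformer from the earlier examples: the `title` branch of the ColumnTransformer is itself a FeatureUnion, while the other columns keep their simple transformers. The step name `title_views` is a label chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class TextLengthTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)

X = pd.DataFrame(
    {
        "title": ["fresh apples", "gaming laptop", "sweet oranges", "office keyboard"],
        "category": ["fruit", "electronics", "fruit", "electronics"],
        "price": [1.5, 1200.0, 2.0, 80.0],
    }
)

# Two views of the same text column
title_views = FeatureUnion(
    transformer_list=[
        ("tfidf", TfidfVectorizer()),
        ("length", TextLengthTransformer()),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        # The string selector hands the union a 1-D column of titles
        ("title_views", title_views, "title"),
        ("category_ohe", OneHotEncoder(handle_unknown="ignore"), ["category"]),
        ("price_scale", StandardScaler(), ["price"]),
    ]
)

features = preprocess.fit_transform(X)
print(features.shape)
```

The resulting `preprocess` object can be placed in a Pipeline in front of a classifier exactly like the simpler ColumnTransformer shown earlier.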

Common Pitfalls

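The most common mistake is using FeatureUnion when the real problem is column-specific preprocessing. Every transformer in a FeatureUnion receives the whole input, so each branch then needs its own column-selection step, which makes the input handling awkward.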

Another mistake is fitting transformers outside a pipeline and then cross-validating only the estimator. That can leak information and make the evaluation too optimistic.

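A third pitfall is forgetting that text vectorizers expect one-dimensional text input, while encoders and scalers usually expect two-dimensional tabular input. In a ColumnTransformer, that means selecting a text column with a plain string such as "title" and selecting other columns with a list such as ["category"].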
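The shape difference can be seen directly on a small DataFrame. This sketch, using invented data, calls the transformers outside any ColumnTransformer to show what each one does with one-dimensional and two-dimensional input; the variable names are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame(
    {
        "title": ["fresh apples", "gaming laptop", "sweet oranges", "office keyboard"],
        "price": [1.5, 1200.0, 2.0, 80.0],
    }
)

# A string selects a 1-D Series; a list selects a 2-D DataFrame
one_dim = X["title"]
two_dim = X[["price"]]
print(one_dim.shape, two_dim.shape)

# The vectorizer iterates over its input. Iterating a Series yields the
# texts, but iterating a DataFrame yields its column labels, so the 2-D
# form is treated as a single document
tfidf_ok = TfidfVectorizer().fit_transform(X["title"])
tfidf_wrong = TfidfVectorizer().fit_transform(X[["title"]])
print(tfidf_ok.shape[0], tfidf_wrong.shape[0])

# The scaler is the other way round: it accepts 2-D input and rejects 1-D
scaled = StandardScaler().fit_transform(X[["price"]])
try:
    StandardScaler().fit_transform(X["price"])
    scaler_rejected_1d = False
except ValueError:
    scaler_rejected_1d = True
print(scaled.shape, scaler_rejected_1d)
```

The vectorizer case is the more dangerous of the two because it does not raise an error on its own; it silently produces a matrix with the wrong number of rows.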

Summary

- Use ColumnTransformer to combine feature extraction across different columns.
- Use FeatureUnion to combine multiple transformers on the same input.
- Wrap feature extraction and the estimator in a Pipeline so training and inference stay aligned.
- Pipelines also help prevent data leakage during validation.
- The right combinator depends on whether your transformers operate on different columns or on different views of the same data.
