Multiple pipelines that merge within a sklearn Pipeline?

Introduction

A standard scikit-learn Pipeline is linear, so it cannot branch and merge by itself. When preprocessing needs to split into multiple paths and then recombine, the usual solution is ColumnTransformer or FeatureUnion, with a normal Pipeline wrapped around that branching step. The important mental model is simple: sequence lives in Pipeline, branching lives in a union-style transformer.

The Right Tool for Tabular Data: ColumnTransformer

For mixed tabular data, the most common branching pattern is:

  • one pipeline for numeric columns
  • one pipeline for categorical columns
  • merge both outputs
  • feed the merged features into the estimator
python
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data with missing values in both numeric columns
X = pd.DataFrame(
    {
        "age": [22, 35, None, 41],
        "income": [50000, 80000, 62000, None],
        "city": ["Toronto", "Montreal", "Toronto", "Vancouver"],
    }
)
y = [0, 1, 0, 1]

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Branch 1: impute, then scale the numeric columns
numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Branch 2: impute, then one-hot encode the categorical columns
categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Merge point: each branch sees only its own columns
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features),
    ]
)

# Outer linear pipeline: merged features go into the estimator
model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

model.fit(X, y)

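This is the standard pattern for merged preprocessing on tabular data. ColumnTransformer concatenates the output of each branch column-wise, and by default it drops any column that is not assigned to a branch (remainder="drop").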

When to Use FeatureUnion

FeatureUnion is useful when several transformers operate on the same input and you want to concatenate their outputs.

For example, in text classification you might want TF-IDF features plus a handcrafted text-length feature:

python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression

class TextLengthTransformer(BaseEstimator, TransformerMixin):
    """Stateless transformer that emits one column: character count per text."""

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        # Shape (n_samples, 1) so it can be stacked next to other features
        return np.array([[len(text)] for text in X])

texts = ["great movie", "terrible ending", "funny and warm", "boring"]
y = [1, 0, 1, 0]

# Both transformers receive the same raw texts
features = FeatureUnion(
    transformer_list=[
        ("tfidf", TfidfVectorizer()),
        ("length", TextLengthTransformer()),
    ]
)

model = Pipeline(
    steps=[
        ("features", features),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

model.fit(texts, y)

FeatureUnion passes the full input to every transformer and concatenates their outputs column-wise. That is different from ColumnTransformer, which routes different subsets of columns to different branches.
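A quick way to check that behavior is to compare the width of the merged matrix with the width of each branch. The sketch below reuses the texts and the TextLengthTransformer idea from the example above; the variable names are illustrative only.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextLengthTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X])


texts = ["great movie", "terrible ending", "funny and warm", "boring"]

union = FeatureUnion(
    transformer_list=[
        ("tfidf", TfidfVectorizer()),
        ("length", TextLengthTransformer()),
    ]
)
merged = union.fit_transform(texts)

# Width of the TF-IDF branch alone, for comparison
tfidf_width = TfidfVectorizer().fit_transform(texts).shape[1]

# The union is as wide as both branches side by side
print(merged.shape[1] == tfidf_width + 1)

# Because the TF-IDF branch is sparse, the merged matrix is sparse too
print(sparse.issparse(merged))

# The last column is the handcrafted length feature
lengths = merged[:, -1].toarray().ravel()
print(lengths)
```

Each branch is fitted on the same four texts, so the row count is preserved and only the column count grows.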

Nesting Pipelines Is Normal

The composition often looks like this:

  • small branch pipeline
  • another branch pipeline
  • union or column transformer
  • final estimator pipeline

That nesting is normal. In fact, it is the cleanest way to keep preprocessing logic readable and cross-validation-safe.

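Because everything stays inside scikit-learn estimators, tools such as GridSearchCV can still tune branch hyperparameters as part of the full workflow. Nested parameters are addressed by joining step names with double underscores, for example preprocessor__num__imputer__strategy.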
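As a rough illustration of tuning a nested branch, the sketch below searches over the numeric imputation strategy and the classifier's regularization strength. The twelve-row dataset is made up for this example, and the grid values are arbitrary; only the double-underscore naming scheme is the point.

```python
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, large enough for 3-fold stratified CV
X = pd.DataFrame(
    {
        "age": [22, 35, None, 41, 29, 53, 47, None, 31, 38, 26, 60],
        "income": [50000, 80000, 62000, None, 58000, 91000,
                   87000, 45000, None, 72000, 52000, 99000],
        "city": ["Toronto", "Montreal", "Toronto", "Vancouver"] * 3,
    }
)
y = [0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]

numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, ["age", "income"]),
        ("cat", categorical_pipeline, ["city"]),
    ]
)
model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

# Keys follow <outer step>__<branch name>__<inner step>__<parameter>
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0],
}

search = GridSearchCV(model, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Every candidate refits the whole nested pipeline on each training split, so the branch being tuned is evaluated under the same conditions as the final estimator.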

Why This Matters for Data Leakage

The biggest practical reason to build merged pipelines properly is not elegance. It is leakage prevention.

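If imputation, scaling, encoding, and feature extraction happen inside the pipeline, cross-validation refits them on the training portion of each split and only applies the fitted transforms to the held-out portion. If you preprocess the full dataset before splitting, statistics such as medians, means, and vocabularies are computed with the validation rows included, so validation data influences the training transform.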

That is one of the strongest reasons to keep even complex branching workflows inside scikit-learn's estimator API.
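The difference can be made concrete with a small sketch. It uses a synthetic dataset from make_classification and a single scaler rather than a full branching preprocessor, purely to keep the example short; the same reasoning applies to every fitted step inside a ColumnTransformer or FeatureUnion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Safe: the scaler is refit on the training rows of every split
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean())

# What leakage looks like: statistics computed on all rows differ from
# statistics computed on the training rows of a single split
train_idx, valid_idx = next(cv.split(X))
all_rows_mean = StandardScaler().fit(X).mean_
train_only_mean = StandardScaler().fit(X[train_idx]).mean_
print(np.abs(all_rows_mean - train_only_mean).max())
```

On a dataset this simple the effect on the score is usually small, but with target-dependent steps such as feature selection the gap between leaky and clean estimates can be large.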

Common Pitfalls

  • Using plain Pipeline for a genuinely branching problem. Fix: introduce ColumnTransformer or FeatureUnion at the split point.
  • Applying the wrong transformer to the wrong columns. Fix: verify the column selectors passed into ColumnTransformer.
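  • Forgetting sparse versus dense output behavior. FeatureUnion returns a sparse matrix if any branch does, and ColumnTransformer returns one when the overall density falls below its sparse_threshold (0.3 by default). Fix: make sure the downstream estimator can consume the merged feature matrix.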
  • Doing preprocessing before cross-validation. Fix: keep transformations inside the estimator pipeline to avoid leakage.
  • Overcomplicating simple cases. Fix: use one linear pipeline when the data flow does not actually need branching.
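The sparse-versus-dense pitfall is the easiest one to check in code. The following sketch, using made-up data with one distinct city per row, shows how the same ColumnTransformer returns a sparse matrix by default and a dense array when sparse_threshold is set to 0.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: a high-cardinality categorical column
X = pd.DataFrame(
    {
        "income": [50000.0, 80000.0, 62000.0, 71000.0,
                   45000.0, 90000.0, 55000.0, 68000.0],
        "city": ["Toronto", "Montreal", "Ottawa", "Vancouver",
                 "Calgary", "Halifax", "Winnipeg", "Regina"],
    }
)

transformers = [
    ("num", StandardScaler(), ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
]

# Default sparse_threshold=0.3: the merged output is mostly zeros, so it stays sparse
default_out = ColumnTransformer(transformers).fit_transform(X)
print(sparse.issparse(default_out))

# sparse_threshold=0 always returns a dense array
dense_out = ColumnTransformer(transformers, sparse_threshold=0).fit_transform(X)
print(isinstance(dense_out, np.ndarray))
```

Forcing dense output is only reasonable when the merged matrix is small; for wide one-hot or TF-IDF features it is usually better to pick an estimator that accepts sparse input.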

Summary

  • Use Pipeline for sequential steps.
  • Use ColumnTransformer when different columns need different preprocessing branches.
  • Use FeatureUnion when multiple transformers should run on the same input and their outputs should be concatenated.
  • Nesting these components is normal and often the cleanest design.
  • Keeping the full workflow inside scikit-learn pipelines helps prevent data leakage during evaluation.
