Multiple pipelines that merge within a sklearn Pipeline?

Introduction

A standard scikit-learn Pipeline is linear: each step feeds the next, so it cannot branch and merge on its own. When preprocessing needs to split into multiple paths and then recombine, the usual solution is ColumnTransformer or FeatureUnion, with a normal Pipeline wrapped around that branching step. The mental model is simple: sequence lives in Pipeline, branching lives in a union-style transformer.

The Right Tool for Tabular Data: `ColumnTransformer`

For mixed tabular data, the most common branching pattern is:

one pipeline for numeric columns
one pipeline for categorical columns
merge both outputs
feed the merged features into the estimator

```python
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: None becomes NaN in the numeric columns.
X = pd.DataFrame(
    {
        "age": [22, 35, None, 41],
        "income": [50000, 80000, 62000, None],
        "city": ["Toronto", "Montreal", "Toronto", "Vancouver"],
    }
)
y = [0, 1, 0, 1]

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Branch 1: impute missing numbers, then standardize.
numeric_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Branch 2: impute missing categories, then one-hot encode.
categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Merge point: each branch sees only its own columns,
# and the outputs are concatenated column-wise.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_features),
        ("cat", categorical_pipeline, categorical_features),
    ]
)

# Outer linear pipeline: merged features feed the estimator.
model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

model.fit(X, y)
```

This is the standard answer for merged preprocessing in structured-data workflows.

When to Use `FeatureUnion`

FeatureUnion is useful when several transformers operate on the same input and you want to concatenate their outputs.

For example, in text classification you might want TF-IDF features plus a handcrafted text-length feature:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class TextLengthTransformer(BaseEstimator, TransformerMixin):
    """Stateless transformer that emits one column: the character count."""

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, X):
        # Return a 2D array of shape (n_samples, 1).
        return np.array([[len(text)] for text in X])


texts = ["great movie", "terrible ending", "funny and warm", "boring"]
y = [1, 0, 1, 0]

# Both transformers receive the same raw texts;
# their outputs are concatenated column-wise.
features = FeatureUnion(
    transformer_list=[
        ("tfidf", TfidfVectorizer()),
        ("length", TextLengthTransformer()),
    ]
)

model = Pipeline(
    steps=[
        ("features", features),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

model.fit(texts, y)
```

FeatureUnion applies every transformer to the full input and concatenates the outputs column-wise. ColumnTransformer also concatenates its outputs, but each branch sees only the subset of columns assigned to it.
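The difference shows up in the output widths. The following is a minimal sketch on random numeric data; the variable names and the choice of PCA and MinMaxScaler are illustrative assumptions, not part of the example above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_num = rng.normal(size=(6, 4))  # 6 rows, 4 columns

# FeatureUnion: every transformer sees all 4 columns.
union = FeatureUnion(
    transformer_list=[
        ("scaled", StandardScaler()),  # contributes 4 columns
        ("pca", PCA(n_components=2)),  # contributes 2 columns
    ]
)
union_out = union.fit_transform(X_num)

# ColumnTransformer: each transformer sees only its assigned columns.
router = ColumnTransformer(
    transformers=[
        ("left", StandardScaler(), [0, 1]),  # contributes 2 columns
        ("right", MinMaxScaler(), [2, 3]),   # contributes 2 columns
    ]
)
routed_out = router.fit_transform(X_num)

print(union_out.shape)   # 4 + 2 columns
print(routed_out.shape)  # 2 + 2 columns
```

With FeatureUnion the output width is the sum of what each transformer produces from the whole input. With ColumnTransformer it is the sum of what each branch produces from its own slice.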

Nesting Pipelines Is Normal

The composition often looks like this:

small branch pipeline
another branch pipeline
union or column transformer
final estimator pipeline

That nesting is normal. In fact, it is the cleanest way to keep preprocessing logic readable and cross-validation-safe.

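Because everything stays inside scikit-learn estimators, tools such as GridSearchCV can still tune branch hyperparameters as part of the full workflow. Nested parameters are addressed by joining step names with double underscores, for example preprocessor__num__imputer__strategy.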
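As a rough illustration of tuning a branch hyperparameter through the nesting, the sketch below searches over the numeric imputation strategy and the classifier's regularization strength. The synthetic data, the variable names, and the grid values are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, large enough for 4-fold cross-validation.
rng = np.random.default_rng(0)
n = 40
X_grid = pd.DataFrame(
    {
        "age": rng.integers(18, 70, size=n).astype(float),
        "income": rng.normal(60000, 15000, size=n),
        "city": rng.choice(["Toronto", "Montreal", "Vancouver"], size=n),
    }
)
X_grid.loc[::7, "age"] = np.nan      # inject some missing values
y_grid = np.tile([0, 1], n // 2)     # balanced, arbitrary labels

pipe = Pipeline(
    steps=[
        (
            "preprocessor",
            ColumnTransformer(
                transformers=[
                    (
                        "num",
                        Pipeline(
                            [
                                ("imputer", SimpleImputer()),
                                ("scaler", StandardScaler()),
                            ]
                        ),
                        ["age", "income"],
                    ),
                    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
                ]
            ),
        ),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

# Parameter names follow the nesting: step__branch__substep__parameter.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=4)
search.fit(X_grid, y_grid)

n_candidates = len(search.cv_results_["params"])
print(search.best_params_)
```

Calling `pipe.get_params().keys()` lists every tunable name, which is a handy check when a nested path is misspelled.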

Why This Matters for Data Leakage

The biggest practical reason to build merged pipelines properly is not elegance. It is leakage prevention.

If imputation, scaling, encoding, and feature extraction happen inside the pipeline, cross-validation applies them separately inside each training fold. If you preprocess outside the pipeline first, you may accidentally let validation data influence the training transform.

That is one of the strongest reasons to keep even complex branching workflows inside scikit-learn's estimator API.
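A small sketch can make the leakage point concrete. It uses synthetic data and illustrative names; the comparison of scaler means is only meant to show that a transformer fitted on all rows has seen the validation rows, whereas the pipeline refits it on each training fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
y_leak = np.tile([0, 1], 15)          # 30 alternating labels
X_leak = rng.normal(size=(30, 3))
X_leak[:, 0] += y_leak                # make the first feature informative

cv = KFold(n_splits=3)

# Safe: the scaler is refit on the training part of every fold.
safe_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(safe_pipeline, X_leak, y_leak, cv=cv)

# What leakage looks like: statistics computed on all rows
# differ from statistics computed on the training rows only.
train_idx, test_idx = next(cv.split(X_leak))
fold_mean = StandardScaler().fit(X_leak[train_idx]).mean_
global_mean = StandardScaler().fit(X_leak).mean_

print(scores)
print(fold_mean, global_mean)
```

The gap between `fold_mean` and `global_mean` is exactly the information that would leak from the validation rows if scaling were done once, up front.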

Common Pitfalls

Using plain Pipeline for a genuinely branching problem. Fix: introduce ColumnTransformer or FeatureUnion at the split point.
Applying the wrong transformer to the wrong columns. Fix: verify the column selectors passed into ColumnTransformer.
Forgetting sparse versus dense output behavior. One-hot and TF-IDF branches produce sparse matrices, so the merged result may be sparse as well. Fix: make sure the downstream estimator can consume the merged feature matrix.
Doing preprocessing before cross-validation. Fix: keep transformations inside the estimator pipeline to avoid leakage.
Overcomplicating simple cases. Fix: use one linear pipeline when the data flow does not actually need branching.
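The sparse versus dense pitfall above can be checked directly. The sketch below assumes a ColumnTransformer with a high-cardinality one-hot branch; the data and names are illustrative. `sparse_threshold` is the ColumnTransformer parameter that controls whether the stacked output is returned as a sparse matrix, and setting it to 0 always returns a dense array.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 20 rows with 20 distinct categories, so the one-hot block is mostly zeros.
X_sp = pd.DataFrame(
    {
        "value": np.arange(20, dtype=float),
        "code": [f"c{i}" for i in range(20)],
    }
)


def build_transformer(sparse_threshold):
    """Build the same two-branch transformer with a given sparse_threshold."""
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["value"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["code"]),
        ],
        sparse_threshold=sparse_threshold,
    )


# Default threshold (0.3): low overall density, so the output stays sparse.
default_out = build_transformer(0.3).fit_transform(X_sp)

# Threshold 0: the merged output is always a dense NumPy array.
dense_out = build_transformer(0).fit_transform(X_sp)

print(sparse.issparse(default_out), type(dense_out).__name__)
```

Forcing dense output is convenient for estimators that reject sparse input, but it can use far more memory when the one-hot block is wide.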

Summary

Use Pipeline for sequential steps.
Use ColumnTransformer when different columns need different preprocessing branches.
Use FeatureUnion when multiple transformers should run on the same input and their outputs should be concatenated.
Nesting these components is normal and often the cleanest design.
Keeping the full workflow inside scikit-learn pipelines helps prevent data leakage during evaluation.
