How to create pandas output for custom transformers?

Introduction

When you write a custom transformer for a scikit-learn pipeline, the hardest part is often not the math. It is preserving column names, row order, and DataFrame structure so downstream steps still make sense. If you want pandas output, design the transformer to accept a DataFrame and return one explicitly.

Start with a Proper Transformer Class

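A custom transformer should normally inherit from BaseEstimator and TransformerMixin. BaseEstimator supplies get_params() and set_params(), which pipelines rely on when they clone or configure a step, and TransformerMixin supplies fit_transform(). You write fit() and transform() yourself, which gives the class the familiar interface that works inside pipelines.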

Here is a transformer that adds a log-scaled feature and returns a pandas DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class LogFeatureAdder(BaseEstimator, TransformerMixin):
    def __init__(self, source_column: str, output_column: str):
        # Store constructor arguments unchanged so get_params() and clone() work.
        self.source_column = source_column
        self.output_column = output_column

    def fit(self, X, y=None):
        # Nothing to learn from the data; fit() only has to return self.
        return self

    def transform(self, X):
        # Copy first so the caller's DataFrame is not modified.
        X = X.copy()
        # log1p computes log(1 + x), which is defined for x = 0.
        X[self.output_column] = np.log1p(X[self.source_column])
        return X
```

This transformer is simple, but it makes the important design choice up front: transform() returns a DataFrame, not a NumPy array.

Preserve Index and Column Names

Returning a raw array is often what breaks a pipeline that started with pandas. Arrays lose:

- column names
- the row index
- per-column dtypes, because a frame with mixed types collapses to a single array dtype (often object)

With a DataFrame-oriented transformer, those labels stay intact.

```python
import pandas as pd

df = pd.DataFrame({
    "sales": [10, 20, 40],
    "region": ["east", "west", "east"],
})

transformer = LogFeatureAdder("sales", "sales_log")
result = transformer.fit_transform(df)

print(result)
print(type(result))
```

This pattern is especially useful when the next step in the pipeline refers to named columns instead of numeric positions.
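To make that concrete, here is a minimal sketch in which the step after the custom transformer selects columns by name. The ColumnTransformer layout and the choice of StandardScaler and OneHotEncoder are assumptions made for illustration; LogFeatureAdder is repeated from the earlier section so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class LogFeatureAdder(BaseEstimator, TransformerMixin):
    # Same transformer as in the earlier section, repeated for self-containment.
    def __init__(self, source_column, output_column):
        self.source_column = source_column
        self.output_column = output_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.output_column] = np.log1p(X[self.source_column])
        return X


df = pd.DataFrame({
    "sales": [10, 20, 40],
    "region": ["east", "west", "east"],
})

# This step refers to columns by name, so it needs a DataFrame from the
# previous step. Columns not listed here ("sales") are dropped by default.
by_name = ColumnTransformer([
    ("scale", StandardScaler(), ["sales_log"]),
    ("encode", OneHotEncoder(), ["region"]),
])

pipeline = Pipeline([
    ("log_feature", LogFeatureAdder("sales", "sales_log")),
    ("by_name", by_name),
])

features = pipeline.fit_transform(df)
print(features.shape)  # one scaled column plus one column per region
```

If LogFeatureAdder returned a NumPy array instead, the name-based selection in the second step would have no column labels to match against.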

Use the Transformer Inside a Pipeline

The transformer behaves like any other scikit-learn step.

```python
import pandas as pd
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    "sales": [10, 20, 40],
    "region": ["east", "west", "east"],
})

pipeline = Pipeline([
    ("log_feature", LogFeatureAdder("sales", "sales_log")),
])

result = pipeline.fit_transform(df)
print(result)
```

Because transform() returns a DataFrame, the pipeline output is still pandas at this stage.

Add `get_feature_names_out` for Better Interoperability

If your transformer changes columns, implementing get_feature_names_out() makes the output contract clearer. This matters when the transformer is used alongside other sklearn components that reason about feature names.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RatioTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Record the input column names seen during fit.
        self.feature_names_in_ = np.array(X.columns, dtype=object)
        return self

    def transform(self, X):
        # Copy so the caller's DataFrame is not modified.
        X = X.copy()
        X["sales_per_order"] = X["sales"] / X["orders"]
        return X

    def get_feature_names_out(self, input_features=None):
        # Output schema: every input column, followed by the new ratio column.
        if input_features is None:
            input_features = self.feature_names_in_
        return np.append(np.array(input_features, dtype=object), "sales_per_order")
```

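Defining get_feature_names_out() does not switch the output to pandas by itself. It does give other sklearn components a reliable list of output column names, and in scikit-learn 1.2 and later it is also what makes set_output() available on a TransformerMixin subclass.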
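One way to keep this contract honest is to compare the reported names with the columns that transform() actually returns. The sketch below repeats RatioTransformer so it runs on its own; the sample data and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RatioTransformer(BaseEstimator, TransformerMixin):
    # Same transformer as above, repeated for self-containment.
    def fit(self, X, y=None):
        self.feature_names_in_ = np.array(X.columns, dtype=object)
        return self

    def transform(self, X):
        X = X.copy()
        X["sales_per_order"] = X["sales"] / X["orders"]
        return X

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = self.feature_names_in_
        return np.append(np.array(input_features, dtype=object), "sales_per_order")


# Illustrative data, not taken from the article.
orders_df = pd.DataFrame({"sales": [10.0, 20.0, 40.0], "orders": [2, 4, 5]})

transformer = RatioTransformer().fit(orders_df)
out = transformer.transform(orders_df)
names = transformer.get_feature_names_out()

# The declared schema should match the real output columns, in order.
names_match = list(names) == list(out.columns)
print(list(names))
print(names_match)
```

A check like this fits naturally into a unit test, so the declared names cannot drift away from the real output when the transformer changes.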

Work Well with `set_output`

scikit-learn 1.2 and later can produce pandas output from many built-in transformers with set_output(transform="pandas"). That is useful in mixed pipelines, but it works best when the transformer behaves like a good sklearn citizen and exposes sensible feature names.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")
scaled = scaler.fit_transform(df[["sales"]])
print(scaled)
```

For a custom transformer, the safest route is still to return a DataFrame intentionally unless you have a specific reason to stay array-based.

Be Clear About Input Expectations

If your transformer expects a DataFrame, say so in the code and fail clearly when it receives something else. Silent conversion can hide bugs.

```python
# Replacement for LogFeatureAdder.transform with an explicit input check.
def transform(self, X):
    if not isinstance(X, pd.DataFrame):
        raise TypeError("LogFeatureAdder expects a pandas DataFrame")
    X = X.copy()
    X[self.output_column] = np.log1p(X[self.source_column])
    return X
```

That makes pipeline debugging easier, especially when earlier steps may convert the data type unexpectedly.
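The sketch below shows the guard catching such a conversion. It assumes a SimpleImputer placed before the custom step, which returns a NumPy array under default settings; the imputer and the sample data are illustrative choices, and the error message is extended with the received type as an optional extra.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class LogFeatureAdder(BaseEstimator, TransformerMixin):
    # LogFeatureAdder from the earlier section, with the input check added.
    def __init__(self, source_column, output_column):
        self.source_column = source_column
        self.output_column = output_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError(
                "LogFeatureAdder expects a pandas DataFrame, "
                f"got {type(X).__name__}"
            )
        X = X.copy()
        X[self.output_column] = np.log1p(X[self.source_column])
        return X


numeric = pd.DataFrame({"sales": [10.0, np.nan, 40.0]})

pipeline = Pipeline([
    # With default settings, SimpleImputer hands a NumPy array to the next step.
    ("impute", SimpleImputer(strategy="median")),
    ("log_feature", LogFeatureAdder("sales", "sales_log")),
])

try:
    pipeline.fit_transform(numeric)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Calling set_output(transform="pandas") on the imputer is one way to make the handoff a DataFrame again, assuming scikit-learn 1.2 or later.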

Common Pitfalls

- Returning a NumPy array and losing column names that downstream code expects.
- Mutating the original DataFrame instead of copying it inside transform().
- Forgetting to document which input columns the transformer needs.
- Skipping get_feature_names_out() when the transformer changes the schema.
- Mixing pandas-aware and array-only pipeline steps without checking the handoff type.
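Several of these pitfalls can be caught with a small automated check. The helper below, check_pandas_contract, is a hypothetical function written for this sketch and is not part of scikit-learn or pandas; it verifies that a transformer returns a DataFrame, keeps the row index, and leaves its input untouched.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LogFeatureAdder(BaseEstimator, TransformerMixin):
    # Same transformer as in the earlier section, repeated for self-containment.
    def __init__(self, source_column, output_column):
        self.source_column = source_column
        self.output_column = output_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.output_column] = np.log1p(X[self.source_column])
        return X


def check_pandas_contract(transformer, X):
    """Hypothetical helper: run fit_transform and report simple pass/fail checks."""
    before = X.copy()
    result = transformer.fit_transform(X)
    is_frame = isinstance(result, pd.DataFrame)
    return {
        "returns_dataframe": is_frame,
        "index_preserved": is_frame and result.index.equals(X.index),
        "input_unchanged": X.equals(before),
    }


df = pd.DataFrame(
    {"sales": [10, 20, 40], "region": ["east", "west", "east"]},
    index=[101, 102, 103],  # non-default index so a reset would be noticed
)

checks = check_pandas_contract(LogFeatureAdder("sales", "sales_log"), df)
print(checks)
```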

Summary

- Custom sklearn transformers can return pandas DataFrames directly.
- Preserve index and column names when downstream steps depend on labeled data.
- Implement fit() and transform() cleanly with BaseEstimator and TransformerMixin.
- Add get_feature_names_out() when the transformer changes the output schema.
- Be explicit about input and output types so pipeline behavior stays predictable.
