How to create pandas output for custom transformers?

Introduction

When you write a custom transformer for a scikit-learn pipeline, the hardest part is often not the math. It is preserving column names, row order, and DataFrame structure so downstream steps still make sense. If you want pandas output, design the transformer to accept a DataFrame and return one explicitly.

Start with a Proper Transformer Class

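A custom transformer should normally inherit from BaseEstimator and TransformerMixin. BaseEstimator supplies get_params() and set_params(), and TransformerMixin supplies fit_transform(), so once you write fit() and transform() the class works inside pipelines like any built-in step.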

Here is a transformer that adds a log-scaled feature and returns a pandas DataFrame:

python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class LogFeatureAdder(BaseEstimator, TransformerMixin):
    def __init__(self, source_column: str, output_column: str):
        # Store constructor arguments unchanged so get_params() and clone() work.
        self.source_column = source_column
        self.output_column = output_column

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so calls can be chained.
        return self

    def transform(self, X):
        # Copy first so the caller's DataFrame is not modified.
        X = X.copy()
        X[self.output_column] = np.log1p(X[self.source_column])
        return X

This transformer is simple, but it makes the important design choice up front: transform() returns a DataFrame, not a NumPy array.
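As a small sketch of what the two base classes contribute, the snippet below repeats the class so that it runs on its own, then calls get_params(), clone(), and fit_transform(). The sample data is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone


class LogFeatureAdder(BaseEstimator, TransformerMixin):
    """Same transformer as above, repeated so the sketch is self-contained."""

    def __init__(self, source_column, output_column):
        self.source_column = source_column
        self.output_column = output_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.output_column] = np.log1p(X[self.source_column])
        return X


adder = LogFeatureAdder("sales", "sales_log")

# get_params() comes from BaseEstimator and reads the __init__ arguments.
params = adder.get_params()
print(params)

# clone() builds a fresh, unfitted copy from those parameters.
adder_copy = clone(adder)

# fit_transform() comes from TransformerMixin; the result is still a DataFrame.
out = adder_copy.fit_transform(pd.DataFrame({"sales": [0, 9]}))
print(list(out.columns))
```

This is also why the constructor only stores its arguments: get_params() and clone() rely on attribute names matching the __init__ parameter names.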

Preserve Index and Column Names

Returning a raw array is often what breaks a pipeline that started with pandas. Arrays lose:

  • column names
  • row index
  • per-column dtypes, since a mixed-type frame collapses into a single array dtype (often object)

With a DataFrame-oriented transformer, those labels stay intact.

python
import pandas as pd

df = pd.DataFrame({
    "sales": [10, 20, 40],
    "region": ["east", "west", "east"],
})

transformer = LogFeatureAdder("sales", "sales_log")
result = transformer.fit_transform(df)

print(result)
print(type(result))

This pattern is especially useful when the next step in the pipeline refers to named columns instead of numeric positions.
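To make the point about labels concrete, here is a minimal sketch that uses a non-default index. The store IDs ("s1", "s2", "s3") are hypothetical, and the class is repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LogFeatureAdder(BaseEstimator, TransformerMixin):
    """Same transformer as above, repeated so the sketch is self-contained."""

    def __init__(self, source_column, output_column):
        self.source_column = source_column
        self.output_column = output_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.output_column] = np.log1p(X[self.source_column])
        return X


# Hypothetical store IDs used as the row index.
df_indexed = pd.DataFrame(
    {"sales": [10, 20, 40], "region": ["east", "west", "east"]},
    index=["s1", "s2", "s3"],
)

labeled = LogFeatureAdder("sales", "sales_log").fit_transform(df_indexed)

# The DataFrame result can still be addressed by row label and column name.
print(labeled.loc["s2", "sales_log"])

# Converting to an array drops both the index and the column labels.
as_array = labeled.to_numpy()
print(type(as_array).__name__, as_array.shape)
```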

Use the Transformer Inside a Pipeline

The transformer behaves like any other scikit-learn step.

python
import pandas as pd
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    "sales": [10, 20, 40],
    "region": ["east", "west", "east"],
})

pipeline = Pipeline([
    ("log_feature", LogFeatureAdder("sales", "sales_log")),
])

result = pipeline.fit_transform(df)
print(result)

Because transform() returns a DataFrame, the pipeline output is still pandas at this stage.
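The benefit shows up once a second step depends on the labels. The sketch below chains the transformer with a hypothetical ColumnSelector step (not part of scikit-learn, written here only for illustration) that picks columns by name.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LogFeatureAdder(BaseEstimator, TransformerMixin):
    """Same transformer as above, repeated so the sketch is self-contained."""

    def __init__(self, source_column, output_column):
        self.source_column = source_column
        self.output_column = output_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[self.output_column] = np.log1p(X[self.source_column])
        return X


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical step that keeps only the named columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Selecting by name only works because the previous step kept labels.
        return X[list(self.columns)]


df = pd.DataFrame({
    "sales": [10, 20, 40],
    "region": ["east", "west", "east"],
})

two_step = Pipeline([
    ("log_feature", LogFeatureAdder("sales", "sales_log")),
    ("select", ColumnSelector(["region", "sales_log"])),
])

selected = two_step.fit_transform(df)
print(selected)
```

If the first step returned a NumPy array instead, the name-based selection in the second step would fail.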

Add get_feature_names_out for Better Interoperability

If your transformer changes columns, implementing get_feature_names_out() makes the output contract clearer. This matters when the transformer is used alongside other sklearn components that reason about feature names.

python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RatioTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Record the input column names seen during fit.
        self.feature_names_in_ = np.array(X.columns, dtype=object)
        return self

    def transform(self, X):
        X = X.copy()
        X["sales_per_order"] = X["sales"] / X["orders"]
        return X

    def get_feature_names_out(self, input_features=None):
        # Output schema: every input column, followed by the new ratio column.
        if input_features is None:
            input_features = self.feature_names_in_
        return np.append(np.array(input_features, dtype=object), "sales_per_order")

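On its own, this method does not make transform() return pandas. What it does, for a TransformerMixin subclass on scikit-learn 1.2 or later, is make set_output() available on the transformer, and it lets other sklearn components that track feature names report the correct ones.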
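A quick way to check the contract is to compare the names reported by get_feature_names_out() with the columns that transform() actually returns. The sketch below repeats the class and uses made-up sales and orders data; the hasattr check assumes scikit-learn 1.2 or later.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RatioTransformer(BaseEstimator, TransformerMixin):
    """Same transformer as above, repeated so the sketch is self-contained."""

    def fit(self, X, y=None):
        self.feature_names_in_ = np.array(X.columns, dtype=object)
        return self

    def transform(self, X):
        X = X.copy()
        X["sales_per_order"] = X["sales"] / X["orders"]
        return X

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            input_features = self.feature_names_in_
        return np.append(np.array(input_features, dtype=object), "sales_per_order")


# Illustrative data; the values are made up.
orders_df = pd.DataFrame({"sales": [10.0, 20.0, 40.0], "orders": [2, 4, 5]})

ratio = RatioTransformer().fit(orders_df)
names = list(ratio.get_feature_names_out())
ratio_out = ratio.transform(orders_df)

# The declared names should match the columns of the actual output.
print(names)
print(list(ratio_out.columns) == names)

# On scikit-learn 1.2+, defining get_feature_names_out exposes set_output.
print(hasattr(ratio, "set_output"))
```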

Work Well with set_output

Since version 1.2, scikit-learn can produce pandas output from many built-in transformers with set_output(transform="pandas"). That is useful in mixed pipelines. The wrapped output takes its column names from get_feature_names_out(), so a custom transformer has to implement that method and return names that match its output columns.

python
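import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"sales": [10, 20, 40], "region": ["east", "west", "east"]})

# set_output requires scikit-learn 1.2 or later.
scaler = StandardScaler().set_output(transform="pandas")
scaled = scaler.fit_transform(df[["sales"]])
print(scaled)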

For a custom transformer, the safest route is still to return a DataFrame intentionally unless you have a specific reason to stay array-based.

Be Clear About Input Expectations

If your transformer expects a DataFrame, say so in the code and fail clearly when it receives something else. Silent conversion can hide bugs.

python
def transform(self, X):
    if not isinstance(X, pd.DataFrame):
        raise TypeError("LogFeatureAdder expects a pandas DataFrame")
    X = X.copy()
    X[self.output_column] = np.log1p(X[self.source_column])
    return X

That makes pipeline debugging easier, especially when earlier steps may convert the data type unexpectedly.
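The sketch below shows that failure mode end to end. StrictLogFeatureAdder is a hypothetical variant with the type check built in, plus a missing-column check added here as an assumption. Placing it after a default StandardScaler, which hands off a NumPy array, triggers the error immediately instead of failing later with a confusing message.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class StrictLogFeatureAdder(BaseEstimator, TransformerMixin):
    """Hypothetical variant of LogFeatureAdder that validates its input."""

    def __init__(self, source_column, output_column):
        self.source_column = source_column
        self.output_column = output_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError(
                f"{type(self).__name__} expects a pandas DataFrame, "
                f"got {type(X).__name__}"
            )
        # Extra check (an assumption of this sketch): fail on a missing column.
        if self.source_column not in X.columns:
            raise ValueError(f"missing required column: {self.source_column!r}")
        X = X.copy()
        X[self.output_column] = np.log1p(X[self.source_column])
        return X


df = pd.DataFrame({"sales": [10.0, 20.0, 40.0]})

# The scaler returns an ndarray by default, so the next step sees no DataFrame.
broken = Pipeline([
    ("scale", StandardScaler()),
    ("log_feature", StrictLogFeatureAdder("sales", "sales_log")),
])

message = None
try:
    broken.fit_transform(df)
except TypeError as exc:
    message = str(exc)

print(message)
```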

Common Pitfalls

  • Returning a NumPy array and losing column names that downstream code expects.
  • Mutating the original DataFrame instead of copying it inside transform().
  • Forgetting to document which input columns the transformer needs.
  • Skipping get_feature_names_out() when the transformer changes the schema.
  • Mixing pandas-aware and array-only pipeline steps without checking the handoff type.
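The mutation pitfall from the list above is easy to miss because the output looks correct either way. A minimal sketch of how to check for it, using two hypothetical helper functions that stand in for the body of transform():

```python
import numpy as np
import pandas as pd


def add_log_in_place(X):
    """Pitfall: writes the new column into the caller's DataFrame."""
    X["sales_log"] = np.log1p(X["sales"])
    return X


def add_log_on_copy(X):
    """Safer: works on a copy, so the caller's DataFrame is untouched."""
    X = X.copy()
    X["sales_log"] = np.log1p(X["sales"])
    return X


mutated_input = pd.DataFrame({"sales": [10, 20, 40]})
safe_input = pd.DataFrame({"sales": [10, 20, 40]})

add_log_in_place(mutated_input)
add_log_on_copy(safe_input)

# The in-place version has changed its input; the copying version has not.
print(list(mutated_input.columns))
print(list(safe_input.columns))
```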

Summary

  • Custom sklearn transformers can return pandas DataFrames directly.
  • Preserve index and column names when downstream steps depend on labeled data.
  • Implement fit() and transform() cleanly with BaseEstimator and TransformerMixin.
  • Add get_feature_names_out() when the transformer changes the output schema.
  • Be explicit about input and output types so pipeline behavior stays predictable.
