Impute entire DataFrame all columns using Scikit-learn sklearn without iterating over columns

Introduction

You do not need to loop over DataFrame columns one by one to impute missing values with scikit-learn. If every column can use the same strategy, `SimpleImputer` can transform the whole table in one call. If the DataFrame mixes numeric and categorical data, `ColumnTransformer` lets you apply different imputers by type without writing your own column iteration logic.

Use `SimpleImputer` on the Whole DataFrame

If all columns are numeric, or all columns should use the same replacement rule, the shortest solution is:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# None in a numeric column is stored by pandas as NaN
df = pd.DataFrame({
    "age": [30, None, 45],
    "income": [50000, 62000, None],
    "score": [0.8, None, 0.9],
})

# One imputer, one call: each column is filled with its own median
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,  # keep the original row labels
)

print(filled)
```

fit_transform() computes the median of each column and fills missing values column-wise. Scikit-learn still works feature by feature internally, but you do not have to manually write a Python loop.

This approach is clean and fast when the entire DataFrame is compatible with one strategy.
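To see what the imputer actually learned, you can inspect its fitted `statistics_` attribute. The sketch below reuses the illustrative table from this section (written with `np.nan` for the missing cells) and compares the learned values with `df.median()`; it is a quick sanity check, not a required step.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same illustrative table as above; np.nan marks the missing cells.
df = pd.DataFrame({
    "age": [30, np.nan, 45],
    "income": [50000, 62000, np.nan],
    "score": [0.8, np.nan, 0.9],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(df)

# statistics_ holds one fill value per column, in column order.
learned = pd.Series(imputer.statistics_, index=df.columns)
print(learned)

# pandas skips NaN by default, so df.median() should give matching numbers.
print(df.median())
```

If the two printouts agree, the imputer is filling each column with the median of that column's observed values.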

Handle Mixed Types Without Manual Loops

Real datasets often mix numbers and strings. In that case, one imputer for the whole DataFrame is usually wrong. Numeric columns may need median imputation, while categorical columns may need the most frequent value or a constant like "missing".

Use ColumnTransformer with selectors:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer

# np.nan (not None) marks missing strings: SimpleImputer's default
# missing_values=np.nan may not treat None in object columns as missing.
df = pd.DataFrame({
    "age": [30, np.nan, 45],
    "income": [50000, 62000, np.nan],
    "city": ["Toronto", np.nan, "Montreal"],
    "segment": [np.nan, "B", "A"],
})

# Select columns by dtype instead of listing or looping over them.
# If your string columns use pandas' dedicated string dtype instead of
# object, widen the categorical selector accordingly.
numeric_selector = make_column_selector(dtype_include=["number"])
categorical_selector = make_column_selector(dtype_include=["object"])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), numeric_selector),
        ("cat", SimpleImputer(strategy="most_frequent"), categorical_selector),
    ]
)

result = preprocessor.fit_transform(df)
print(result)
```

That still avoids manual iteration in user code. Scikit-learn selects the relevant columns and applies the right transformer to each group.
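If you would rather flag missing categories than guess them, the categorical branch can use a constant instead of the most frequent value. The following is a minimal sketch of that variant on a smaller, made-up table; the `dtype_exclude="number"` selector is an assumption that everything non-numeric should be treated as categorical, which may not hold for your data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer

# Small illustrative table; np.nan marks the missing cells.
df = pd.DataFrame({
    "age": [30, np.nan, 45],
    "city": ["Toronto", np.nan, "Montreal"],
})

preprocessor = ColumnTransformer(
    transformers=[
        # Numeric columns: fill with the column median.
        ("num", SimpleImputer(strategy="median"),
         make_column_selector(dtype_include="number")),
        # Everything else (assumed categorical): fill with a sentinel label.
        ("cat", SimpleImputer(strategy="constant", fill_value="missing"),
         make_column_selector(dtype_exclude="number")),
    ]
)

# The output is a NumPy object array: numeric columns first, then categorical.
result = preprocessor.fit_transform(df)
print(result)
```

A sentinel such as "missing" keeps the fact that a value was absent visible to later steps, whereas most-frequent imputation hides it.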

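If you want a DataFrame back instead of a NumPy array, ask the transformer to produce pandas output with `set_output` (available in scikit-learn 1.2 and later):

```python
preprocessor.set_output(transform="pandas")
filled_df = preprocessor.fit_transform(df)
print(filled_df)
```

That makes downstream debugging much easier because the result keeps column names. By default `ColumnTransformer` prefixes each name with its transformer label (for example `num__age`) and groups the columns by transformer, so the column order can differ from the input.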

Fit on Training Data Only

Imputation learns statistics from data, so treat it like any other preprocessing step. Fit the imputer on training data, then apply the learned values to validation or test data.

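```python
from sklearn.model_selection import train_test_split

# df and preprocessor are the objects defined in the earlier examples
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

preprocessor.set_output(transform="pandas")
# Learn medians and modes from the training rows only
train_filled = preprocessor.fit_transform(train_df)
# Reuse those learned values on the test rows; no refitting
test_filled = preprocessor.transform(test_df)
```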

This prevents data leakage. If you compute medians or modes using the full dataset before splitting, the test set quietly influences the training pipeline.

Pipelines are especially useful when imputation is only one step:

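```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# preprocessor is the ColumnTransformer defined earlier.
# Note: LogisticRegression needs numeric input, so string columns must
# also be encoded (for example with OneHotEncoder) inside the
# preprocessor before fit() will succeed.
model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
```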

That keeps preprocessing and modeling tied together so you do not forget to apply the same transformation at inference time.
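For an end-to-end run, the categorical branch needs an encoder after the imputer, because a classifier such as `LogisticRegression` cannot consume strings. The sketch below shows one way to wire that up; the toy data and the target name `churned` are hypothetical, and `OneHotEncoder` is just one reasonable encoding choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "churned" is an invented target for illustration.
X = pd.DataFrame({
    "age": [25, np.nan, 47, 52, 33, np.nan],
    "city": ["Toronto", np.nan, "Montreal", "Toronto", "Montreal", "Toronto"],
})
y = pd.Series([0, 0, 1, 1, 0, 1], name="churned")

# Categorical branch: impute first, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"),
     make_column_selector(dtype_include="number")),
    ("cat", categorical, make_column_selector(dtype_exclude="number")),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# New rows with gaps go through the same fitted imputers automatically.
new_rows = pd.DataFrame({
    "age": [np.nan, 40],
    "city": ["Montreal", np.nan],
})
predictions = model.predict(new_rows)
print(predictions)
```

Because imputation lives inside the pipeline, `model.predict` fills gaps in new rows with the values learned during `fit`, with no separate preprocessing call.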

Common Pitfalls

The biggest mistake is using one global strategy on mixed data just because it is convenient. Median makes sense for numbers, but not for text fields. Most-frequent imputation can work for categories, but it may distort distributions if overused.

Another common issue is losing column names. By default, `fit_transform()` returns a NumPy array (or a sparse matrix) rather than a DataFrame, which can make later feature tracking harder. Use `set_output(transform="pandas")` when you want a DataFrame result.

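Be careful with entirely empty columns as well. By default, `SimpleImputer` drops a column that has no observed values at fit time (except with `strategy="constant"`), so the output can have fewer columns than the input. In scikit-learn 1.2 and later, pass `keep_empty_features=True` if you want such columns kept.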

Finally, do not confuse "no explicit Python loop" with "no per-column behavior." Good preprocessing still respects column types; it just delegates the repetitive work to scikit-learn rather than hand-written loops.

Summary

- `SimpleImputer` can impute an entire DataFrame at once when one strategy fits all columns.
- Mixed-type DataFrames are better handled with `ColumnTransformer`.
- `make_column_selector` avoids hand-written loops by selecting columns by dtype.
- Fit imputers on training data only to avoid leakage.
- Use pandas output when you want to preserve column names after transformation.
