Selecting/excluding sets of columns in pandas

Introduction

Selecting and excluding columns in pandas is easy when the schema is small, but it becomes error-prone once column groups are chosen by name patterns, dtypes, or reusable business rules. The most useful approach is to make the selection rule explicit so later pipeline steps can rely on stable column sets.

Select Exact Column Sets with a List

The clearest form is an explicit list of column names.

python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2],
        "name": ["Ada", "Ben"],
        "email": ["ada@example.com", "ben@example.com"],  # placeholder addresses
        "score": [98, 87],
    }
)

# Select exactly these columns, in this order
cols = ["id", "name", "email"]
out = df[cols]
print(out)

This is the best option when the schema is stable and you want missing columns to fail loudly with a KeyError.
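To make the "fail loudly" behavior concrete, here is a minimal sketch. The frame `df_missing` and the list `required` are hypothetical names chosen for illustration; the frame deliberately lacks the email column.

```python
import pandas as pd

# Hypothetical frame that is missing the "email" column
df_missing = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Ben"]})

required = ["id", "name", "email"]

try:
    df_missing[required]
    failed = False
except KeyError as exc:
    # The exception message names the labels that were not found
    failed = True
    print(f"Schema check failed: {exc}")
```

Catching the error is only for demonstration here; in a strict pipeline you would usually let the KeyError propagate.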

Exclude Columns with drop

If the easier rule is "keep everything except these columns", use drop.

python
exclude = ["email"]
out = df.drop(columns=exclude)
print(out)

You can also use a boolean mask:

python
out = df.loc[:, ~df.columns.isin(exclude)]
print(out)

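That pattern is useful when the exclusion list is computed dynamically. Unlike drop, which raises a KeyError by default when a listed column is absent, the mask silently ignores names that are not in the frame.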
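As a rough illustration of how the two exclusion styles treat a column that does not exist, the sketch below uses a hypothetical frame `df_demo` and an exclusion list that names an absent "email" column. It also shows the `errors="ignore"` option of drop.

```python
import pandas as pd

# Hypothetical frame without an "email" column
df_demo = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Ben"], "score": [98, 87]})
exclude_demo = ["email", "score"]  # "email" is not in the frame

# drop raises by default when a listed label is missing
try:
    df_demo.drop(columns=exclude_demo)
    drop_raised = False
except KeyError:
    drop_raised = True

# errors="ignore" drops only the labels that exist
lenient = df_demo.drop(columns=exclude_demo, errors="ignore")

# The boolean mask never raises for unknown names
masked = df_demo.loc[:, ~df_demo.columns.isin(exclude_demo)]

print(drop_raised, lenient.columns.tolist(), masked.columns.tolist())
```

Which behavior is preferable depends on whether an unknown name in the exclusion list should count as an error.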

Select Columns by Pattern

Real datasets often group columns by prefixes or suffixes such as feature_, _id, or metric_2024. filter is a compact way to select by regex.

python
df = pd.DataFrame(
    {
        "feature_age": [20, 30],
        "feature_score": [0.8, 0.3],
        "target": [1, 0],
        "user_id": [10, 20],
    }
)

# Keep only columns whose names start with "feature_"
features = df.filter(regex=r"^feature_")
print(features)

The important habit is to inspect the matched column names first when the regex is nontrivial.
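One way to do that inspection is to list the matched names before using the result. The sketch below assumes a hypothetical extra column, `is_feature_enabled`, to show how an unanchored pattern can match more than intended; filter applies the regex as a search anywhere in the column name.

```python
import pandas as pd

# Hypothetical frame with one column that merely contains "feature_"
df_feat = pd.DataFrame(
    {
        "feature_age": [20, 30],
        "feature_score": [0.8, 0.3],
        "is_feature_enabled": [True, False],
        "target": [1, 0],
    }
)

# Unanchored: matches "feature_" anywhere in the name
loose = df_feat.filter(regex=r"feature_").columns.tolist()

# Anchored: matches only names that start with "feature_"
strict = df_feat.filter(regex=r"^feature_").columns.tolist()

print(loose)
print(strict)
```

Comparing the two lists makes an overly broad pattern visible before it reaches later steps.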

Select by Data Type for Preprocessing

For preprocessing pipelines, dtype-based selection is often cleaner than name-based selection.

python
numeric = df.select_dtypes(include=["number"])
print(numeric)

This is common when numeric features and categorical features go through different model-preparation steps.
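A minimal sketch of that split, assuming a hypothetical mixed-type frame `df_mixed`: `include="number"` collects the numeric columns and `exclude="number"` collects everything else. Note that boolean columns are not treated as numeric by this selector.

```python
import pandas as pd

# Hypothetical frame with numeric, text, and boolean columns
df_mixed = pd.DataFrame(
    {
        "age": [20, 30],
        "score": [0.8, 0.3],
        "city": ["Oslo", "Lima"],
        "active": [True, False],
    }
)

# Numeric columns go to one preprocessing branch
numeric_cols = df_mixed.select_dtypes(include="number").columns.tolist()

# Everything else goes to the other branch
other_cols = df_mixed.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Storing the column name lists, rather than the sub-frames, lets the same split be reapplied to new data with the same schema.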

Keep Original Column Order Intentionally

If you build a column set conditionally, a list comprehension over df.columns preserves original order.

python
# Iterate over df.columns so the result follows the frame's own order
wanted = [
    c for c in df.columns
    if c.startswith("feature_") or c in {"target", "user_id"}
]

out = df[wanted]
print(out.columns.tolist())

This matters because downstream code often assumes a stable feature order.
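The same idea applies when the requested names arrive as a set, which carries no guaranteed order. The sketch below, using a hypothetical frame `df_order` and set `requested`, filters the frame's own columns instead of converting the set to a list.

```python
import pandas as pd

# Hypothetical frame; its column order is the order we want to keep
df_order = pd.DataFrame(
    {
        "user_id": [10],
        "feature_age": [20],
        "target": [1],
        "feature_score": [0.8],
    }
)

# A set of requested names: iteration order is not something to rely on
requested = {"target", "feature_age", "user_id"}

# Walk df_order.columns so the result is deterministic
ordered = [c for c in df_order.columns if c in requested]
out_ordered = df_order[ordered]

print(out_ordered.columns.tolist())
```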

Make Selection Rules Reusable

For larger projects, wrap column rules in small helper functions so notebooks and pipelines do not drift apart.

python
def feature_columns(frame: pd.DataFrame) -> list[str]:
    # Single source of truth for what counts as a feature column
    return [c for c in frame.columns if c.startswith("feature_")]

selected = df[feature_columns(df)]

That is easier to maintain than copying regex or exclusion logic across several files.

Be Explicit About Missing-Column Policy

One of the most useful design choices is deciding whether missing columns should fail loudly or be ignored. For strict ETL pipelines, a KeyError is often correct because it catches schema drift early. For looser reporting code, intersecting the requested list with df.columns may be the more practical behavior.
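Both policies can be expressed in one small helper. The function below is a sketch, not a pandas API: the name `select_columns` and its `strict` flag are assumptions made for illustration.

```python
import pandas as pd


def select_columns(frame: pd.DataFrame, columns: list, strict: bool = True) -> pd.DataFrame:
    """Hypothetical helper: select columns under an explicit missing-column policy."""
    missing = [c for c in columns if c not in frame.columns]
    if strict and missing:
        # Strict policy: surface schema drift immediately
        raise KeyError(f"Missing columns: {missing}")
    # Lenient policy: keep only the requested columns that exist, in requested order
    present = [c for c in columns if c in frame.columns]
    return frame[present]


df_policy = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Ben"]})
requested_cols = ["id", "name", "email"]

lenient_out = select_columns(df_policy, requested_cols, strict=False)
print(lenient_out.columns.tolist())

try:
    select_columns(df_policy, requested_cols, strict=True)
    strict_raised = False
except KeyError:
    strict_raised = True
print(strict_raised)
```

Making the policy a named argument keeps the decision visible at each call site instead of hiding it inside ad hoc intersections.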

Reusable Helpers Keep Pipelines Consistent

When the same column groups appear in training, scoring, and reporting code, wrap the rule in a helper function instead of repeating ad hoc list logic. Consistency matters as much as the one-off selection itself, because small differences in selected columns can silently change downstream results.

Common Pitfalls

  • Forgetting that direct list selection raises KeyError when any requested column is missing.
  • Writing a broad regex that accidentally matches more columns than intended.
  • Losing column order by converting sets back into lists arbitrarily.
  • Selecting by dtype when some columns were parsed into the wrong dtype upstream.
  • Duplicating column-group logic across notebooks, scripts, and production code.
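The dtype pitfall is easy to reproduce. In the sketch below, a hypothetical `amount` column arrives as strings, so select_dtypes skips it until it is converted with pd.to_numeric.

```python
import pandas as pd

# Hypothetical frame where "amount" was parsed as text upstream
df_raw = pd.DataFrame({"id": [1, 2], "amount": ["10.5", "7"]})

# "amount" is not numeric yet, so it is silently left out
before = df_raw.select_dtypes(include="number").columns.tolist()

# Convert explicitly; errors="raise" fails on values that are not numbers
df_fixed = df_raw.assign(amount=pd.to_numeric(df_raw["amount"], errors="raise"))
after = df_fixed.select_dtypes(include="number").columns.tolist()

print(before)
print(after)
```

Checking df.dtypes right after loading is a cheap way to catch this before any dtype-based selection runs.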

Summary

  • Use explicit column lists when the schema is stable and missing columns should fail loudly.
  • Use drop(columns=...) or isin masks when exclusion is the easier rule.
  • Use filter(regex=...) for prefix, suffix, or general pattern-based groups.
  • Use select_dtypes for preprocessing workflows driven by data type.
  • Preserve column order intentionally and centralize reusable selection rules.
