Selecting with complex criteria from pandas.DataFrame

pandas

DataFrame

data manipulation

complex queries

Python programming

Selecting with complex criteria from pandas.DataFrame

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Selecting rows with complex criteria in pandas is mostly about turning business rules into clear boolean masks. The hard part is rarely the syntax itself. The hard part is keeping the filter readable, handling missing data explicitly, and making sure each condition means what you think it means.

Build Filters with Explicit Masks

The most dependable pattern is to compute one mask per rule and then combine them with & and |. That makes each rule inspectable and easy to test.

python

1import pandas as pd
2
3orders = pd.DataFrame({
4    "customer": ["A", "B", "C", "D", "E", "F"],
5    "amount": [120, 45, 310, 80, 150, 200],
6    "country": ["CA", "US", "CA", "US", "CA", "CA"],
7    "priority": ["high", "low", "high", "medium", "medium", "high"],
8    "refunded": [False, False, True, False, False, False],
9    "segment": ["retail", None, "enterprise", "retail", None, "retail"],
10})
11
12high_value = orders["amount"].ge(100)
13canadian = orders["country"].eq("CA")
14not_refunded = ~orders["refunded"]
15allowed_priority = orders["priority"].isin(["high", "medium"])
16
17result = orders.loc[high_value & canadian & not_refunded & allowed_priority]
18print(result)

Notice two habits here. First, the filter logic is named instead of being crammed into one long line. Second, the comparison methods such as .ge() and .eq() make it obvious that the result is a boolean series.

Pandas requires bitwise operators for column-wise logic, so use &, |, and ~, not and, or, or not. Wrap each condition in parentheses when you write the combined form directly.

Handle Missing Values as Part of the Rule

Many complex selections quietly fail because missing data is ignored. If a column can contain NaN or None, make the missing-value rule explicit instead of hoping the comparison behaves as expected.

python

1segment_known = orders["segment"].notna()
2segment_retail = orders["segment"].eq("retail")
3
4filtered = orders.loc[segment_known & segment_retail & not_refunded]
5print(filtered)

If missing values should count as a fallback category, do that intentionally:

python

1normalized_segment = orders["segment"].fillna("unknown")
2mask = normalized_segment.isin(["retail", "unknown"])
3
4print(orders.loc[mask, ["customer", "segment"]])

fillna() is useful, but only when replacing missing values is actually part of the business rule.

Use `query` When the Expression Reads Like Plain English

For notebook work and data exploration, DataFrame.query() can make a long expression easier to read.

python

1filtered = orders.query(
2    "amount >= 100 and country == 'CA' and priority in ['high', 'medium'] and refunded == False"
3)
4print(filtered)

query() is convenient, but it is not automatically better than boolean masks. As soon as your logic depends on Python variables, custom functions, or a lot of null-handling, masks are usually easier to debug.

Build Derived Criteria Before Filtering

Real filters often depend on a derived value instead of a single column. In those cases, compute the feature first, then filter on it.

python

1orders = orders.assign(
2    risk_score=orders["amount"] * orders["priority"].map({"low": 0.3, "medium": 0.6, "high": 1.0})
3)
4
5risky = orders["risk_score"].gt(90)
6known_segment = orders["segment"].fillna("unknown")
7
8selected = orders.loc[risky & known_segment.isin(["retail", "enterprise"])]
9print(selected[["customer", "amount", "priority", "risk_score"]])

This pattern is easier to maintain than embedding all transformation logic inside one filter expression.

Combine Row-Level and Group-Level Conditions

Some criteria depend on per-group totals, averages, or counts. In that case, calculate the group metric and join it back to the original frame.

python

1group_totals = (
2    orders.groupby("country", as_index=False)["amount"]
3    .sum()
4    .rename(columns={"amount": "country_total"})
5)
6joined = orders.merge(group_totals, on="country", how="left")
7
8selected = joined.loc[
9    joined["country_total"].ge(300)
10    & joined["amount"].ge(100)
11    & ~joined["refunded"]
12]
13
14print(selected[["customer", "country", "amount", "country_total"]])

This is a common pattern in fraud rules, marketing segmentation, and reporting pipelines where row selection depends on both row values and group context.

Wrap Business Rules in a Function

When the same filter is used in production code, wrap it in a named function. That keeps the logic in one place and gives you a clean unit-test target.

python

1def eligible_orders(
2    df: pd.DataFrame,
3    min_amount: int = 100,
4    allowed_countries: list[str] | None = None,
5) -> pd.DataFrame:
6    countries = allowed_countries or ["CA"]
7    mask = (
8        df["amount"].ge(min_amount)
9        & df["country"].isin(countries)
10        & df["priority"].isin(["high", "medium"])
11        & ~df["refunded"]
12    )
13    return df.loc[mask]
14
15print(eligible_orders(orders))

If the filter starts returning the wrong rows, inspect each mask independently and print how many True values each one contains. That is usually faster than staring at a single large expression.

Common Pitfalls

Using and or or with pandas series instead of & and |.
Forgetting parentheses and getting the wrong precedence.
Comparing columns that contain missing values without deciding how missing data should behave.
Writing one giant filter expression that cannot be inspected or tested.
Filtering on group-level conditions without first merging the aggregate back to the original rows.

Summary

Build complex selections from small named boolean masks.
Treat missing values as part of the filtering rule, not as an afterthought.
Use query() when it improves readability, but prefer masks for more involved logic.
Compute derived columns or group metrics before filtering on them.
Put production filtering rules in functions so they are reusable and testable.

Selecting with complex criteria from pandas.DataFrame

Master System Design with Codemia

Introduction

Build Filters with Explicit Masks

Handle Missing Values as Part of the Rule

Use query When the Expression Reads Like Plain English

Build Derived Criteria Before Filtering

Combine Row-Level and Group-Level Conditions

Wrap Business Rules in a Function

Common Pitfalls

Summary

Use `query` When the Expression Reads Like Plain English