pandas
DataFrame
data manipulation
complex queries
Python programming

Selecting with complex criteria from pandas.DataFrame

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Selecting rows with complex criteria in pandas is mostly about turning business rules into clear boolean masks. The hard part is rarely the syntax itself. The hard part is keeping the filter readable, handling missing data explicitly, and making sure each condition means what you think it means.

Build Filters with Explicit Masks

The most dependable pattern is to compute one mask per rule and then combine them with & and |. That makes each rule inspectable and easy to test.

python
1import pandas as pd
2
3orders = pd.DataFrame({
4    "customer": ["A", "B", "C", "D", "E", "F"],
5    "amount": [120, 45, 310, 80, 150, 200],
6    "country": ["CA", "US", "CA", "US", "CA", "CA"],
7    "priority": ["high", "low", "high", "medium", "medium", "high"],
8    "refunded": [False, False, True, False, False, False],
9    "segment": ["retail", None, "enterprise", "retail", None, "retail"],
10})
11
12high_value = orders["amount"].ge(100)
13canadian = orders["country"].eq("CA")
14not_refunded = ~orders["refunded"]
15allowed_priority = orders["priority"].isin(["high", "medium"])
16
17result = orders.loc[high_value & canadian & not_refunded & allowed_priority]
18print(result)

Notice two habits here. First, the filter logic is named instead of being crammed into one long line. Second, the comparison methods such as .ge() and .eq() make it obvious that the result is a boolean series.

Pandas requires bitwise operators for column-wise logic, so use &, |, and ~, not and, or, or not. Wrap each condition in parentheses when you write the combined form directly.

Handle Missing Values as Part of the Rule

Many complex selections quietly fail because missing data is ignored. If a column can contain NaN or None, make the missing-value rule explicit instead of hoping the comparison behaves as expected.

python
1segment_known = orders["segment"].notna()
2segment_retail = orders["segment"].eq("retail")
3
4filtered = orders.loc[segment_known & segment_retail & not_refunded]
5print(filtered)

If missing values should count as a fallback category, do that intentionally:

python
1normalized_segment = orders["segment"].fillna("unknown")
2mask = normalized_segment.isin(["retail", "unknown"])
3
4print(orders.loc[mask, ["customer", "segment"]])

fillna() is useful, but only when replacing missing values is actually part of the business rule.

Use query When the Expression Reads Like Plain English

For notebook work and data exploration, DataFrame.query() can make a long expression easier to read.

python
1filtered = orders.query(
2    "amount >= 100 and country == 'CA' and priority in ['high', 'medium'] and refunded == False"
3)
4print(filtered)

query() is convenient, but it is not automatically better than boolean masks. As soon as your logic depends on Python variables, custom functions, or a lot of null-handling, masks are usually easier to debug.

Build Derived Criteria Before Filtering

Real filters often depend on a derived value instead of a single column. In those cases, compute the feature first, then filter on it.

python
1orders = orders.assign(
2    risk_score=orders["amount"] * orders["priority"].map({"low": 0.3, "medium": 0.6, "high": 1.0})
3)
4
5risky = orders["risk_score"].gt(90)
6known_segment = orders["segment"].fillna("unknown")
7
8selected = orders.loc[risky & known_segment.isin(["retail", "enterprise"])]
9print(selected[["customer", "amount", "priority", "risk_score"]])

This pattern is easier to maintain than embedding all transformation logic inside one filter expression.

Combine Row-Level and Group-Level Conditions

Some criteria depend on per-group totals, averages, or counts. In that case, calculate the group metric and join it back to the original frame.

python
1group_totals = (
2    orders.groupby("country", as_index=False)["amount"]
3    .sum()
4    .rename(columns={"amount": "country_total"})
5)
6joined = orders.merge(group_totals, on="country", how="left")
7
8selected = joined.loc[
9    joined["country_total"].ge(300)
10    & joined["amount"].ge(100)
11    & ~joined["refunded"]
12]
13
14print(selected[["customer", "country", "amount", "country_total"]])

This is a common pattern in fraud rules, marketing segmentation, and reporting pipelines where row selection depends on both row values and group context.

Wrap Business Rules in a Function

When the same filter is used in production code, wrap it in a named function. That keeps the logic in one place and gives you a clean unit-test target.

python
1def eligible_orders(
2    df: pd.DataFrame,
3    min_amount: int = 100,
4    allowed_countries: list[str] | None = None,
5) -> pd.DataFrame:
6    countries = allowed_countries or ["CA"]
7    mask = (
8        df["amount"].ge(min_amount)
9        & df["country"].isin(countries)
10        & df["priority"].isin(["high", "medium"])
11        & ~df["refunded"]
12    )
13    return df.loc[mask]
14
15print(eligible_orders(orders))

If the filter starts returning the wrong rows, inspect each mask independently and print how many True values each one contains. That is usually faster than staring at a single large expression.

Common Pitfalls

  • Using and or or with pandas series instead of & and |.
  • Forgetting parentheses and getting the wrong precedence.
  • Comparing columns that contain missing values without deciding how missing data should behave.
  • Writing one giant filter expression that cannot be inspected or tested.
  • Filtering on group-level conditions without first merging the aggregate back to the original rows.

Summary

  • Build complex selections from small named boolean masks.
  • Treat missing values as part of the filtering rule, not as an afterthought.
  • Use query() when it improves readability, but prefer masks for more involved logic.
  • Compute derived columns or group metrics before filtering on them.
  • Put production filtering rules in functions so they are reusable and testable.

Course illustration
Course illustration

All Rights Reserved.