pandas
regex
data manipulation
Python
data filtering

How to filter rows in pandas by regex

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Filtering rows by regular expression in pandas usually means testing one or more string columns with Series.str.contains. The key is to choose the right regex pattern, decide how to handle missing values, and remember that regex matching is applied to strings, so mixed dtypes often need cleanup first.

Basic Regex Filtering with str.contains

The standard pattern is:

python
1import pandas as pd
2
3frame = pd.DataFrame({
4    "name": ["Alice", "Bob", "Charlie", "Eve"],
5    "city": ["Austin", "Boston", "Chicago", "Atlanta"]
6})
7
8mask = frame["name"].str.contains(r"^A|e$", regex=True)
9print(frame[mask])

This creates a boolean mask and uses it to keep matching rows.

Handle Missing Values Deliberately

If the column contains NaN, str.contains can return missing values in the mask unless you tell pandas what to do.

python
1import pandas as pd
2
3frame = pd.DataFrame({
4    "name": ["Alice", None, "Charlie", "Eve"]
5})
6
7mask = frame["name"].str.contains(r"a", case=False, regex=True, na=False)
8print(frame[mask])

na=False is often the safest default for row filtering because it treats missing strings as non-matches.

Match Whole Words or Anchored Patterns

Regex becomes especially useful when simple substring matching is too broad.

python
1import pandas as pd
2
3frame = pd.DataFrame({
4    "code": ["AB-12", "AB-99", "XY-77", "ABCD"]
5})
6
7mask = frame["code"].str.contains(r"^AB-\d{2}$", regex=True)
8print(frame[mask])

This matches only strings that start with AB- and end with exactly two digits.

Filter Across Multiple Columns

Sometimes you want rows where any text column matches a regex. One practical approach is to build masks per column and combine them.

python
1import pandas as pd
2
3frame = pd.DataFrame({
4    "name": ["Alice", "Bob", "Charlie"],
5    "city": ["Austin", "Denver", "Boston"]
6})
7
8mask = (
9    frame["name"].str.contains(r"^A", na=False) |
10    frame["city"].str.contains(r"^A", na=False)
11)
12
13print(frame[mask])

That keeps rows where either column matches.

Use case=False for Simple Case-Insensitive Matching

If you do not need full regex flags, pandas can handle common case-insensitive filtering directly.

python
mask = frame["name"].str.contains(r"alice", case=False, na=False)

This is easier to read than embedding case-insensitive regex modifiers in the pattern for simple cases.

Convert Non-String Data Carefully

If the column is not purely textual, cast it intentionally before regex operations.

python
1import pandas as pd
2
3frame = pd.DataFrame({
4    "id": [101, 202, 303]
5})
6
7mask = frame["id"].astype(str).str.contains(r"^2")
8print(frame[mask])

This is acceptable for quick filtering, but if the column is truly numeric, it is usually better to use numeric comparisons instead of regex.

Negating a Regex Filter

Sometimes you want the opposite set: rows that do not match a pattern. In that case, invert the boolean mask with ~.

python
mask = ~frame["name"].str.contains(r"^A", na=False)
print(frame[mask])

This pattern is useful when removing temporary rows, excluding test accounts, or dropping records that follow a naming convention you do not want in the final result. It is also a clean way to express exclusion rules without rewriting the regex itself.

Common Pitfalls

A common mistake is forgetting na=False, which can produce masks containing missing values instead of simple True and False. Another is writing an unanchored regex and then matching far more rows than intended. Developers also often use regex where exact matching or startswith would be clearer and faster. Finally, if a column has mixed types, str.contains can fail or behave inconsistently until the data is normalized.

Summary

  • Use Series.str.contains to build a boolean regex mask.
  • Set na=False when missing values should simply count as non-matches.
  • Use anchors and explicit patterns when you need precise filtering.
  • Combine masks with | or & for multi-column logic.
  • Prefer simpler string or numeric operations when regex is more power than the problem actually needs.

Course illustration
Course illustration

All Rights Reserved.