How to filter rows in pandas by regex
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
Filtering rows by regular expression in pandas usually means testing one or more string columns with Series.str.contains. The key is to choose the right regex pattern, decide how to handle missing values, and remember that regex matching is applied to strings, so mixed dtypes often need cleanup first.
Basic Regex Filtering with str.contains
The standard pattern is:
This creates a boolean mask and uses it to keep matching rows.
Handle Missing Values Deliberately
If the column contains NaN, str.contains can return missing values in the mask unless you tell pandas what to do.
na=False is often the safest default for row filtering because it treats missing strings as non-matches.
Match Whole Words or Anchored Patterns
Regex becomes especially useful when simple substring matching is too broad.
This matches only strings that start with AB- and end with exactly two digits.
Filter Across Multiple Columns
Sometimes you want rows where any text column matches a regex. One practical approach is to build masks per column and combine them.
That keeps rows where either column matches.
Use case=False for Simple Case-Insensitive Matching
If you do not need full regex flags, pandas can handle common case-insensitive filtering directly.
This is easier to read than embedding case-insensitive regex modifiers in the pattern for simple cases.
Convert Non-String Data Carefully
If the column is not purely textual, cast it intentionally before regex operations.
This is acceptable for quick filtering, but if the column is truly numeric, it is usually better to use numeric comparisons instead of regex.
Negating a Regex Filter
Sometimes you want the opposite set: rows that do not match a pattern. In that case, invert the boolean mask with ~.
This pattern is useful when removing temporary rows, excluding test accounts, or dropping records that follow a naming convention you do not want in the final result. It is also a clean way to express exclusion rules without rewriting the regex itself.
Common Pitfalls
A common mistake is forgetting na=False, which can produce masks containing missing values instead of simple True and False. Another is writing an unanchored regex and then matching far more rows than intended. Developers also often use regex where exact matching or startswith would be clearer and faster. Finally, if a column has mixed types, str.contains can fail or behave inconsistently until the data is normalized.
Summary
- Use
Series.str.containsto build a boolean regex mask. - Set
na=Falsewhen missing values should simply count as non-matches. - Use anchors and explicit patterns when you need precise filtering.
- Combine masks with
|or&for multi-column logic. - Prefer simpler string or numeric operations when regex is more power than the problem actually needs.

