Ignoring NaNs with str.contains

Introduction

When you filter string data with str.contains() in pandas, missing values in an object-dtype Series produce NaN in the result instead of True or False, and a mask that contains NaN cannot be used for boolean indexing. The fix is to pass na=False (or na=True) to tell pandas how to treat missing values. Without this parameter, filtering either raises an error or, with nullable dtypes, quietly leaves out the rows with missing values.

The Problem

python
import pandas as pd
import numpy as np

s = pd.Series(["apple pie", "banana split", "cherry tart", np.nan, "apple sauce"])

result = s.str.contains("apple")
print(result)
# 0     True
# 1    False
# 2    False
# 3      NaN   <-- NaN, not True or False
# 4     True

Using this result directly as a boolean filter raises a ValueError:

python
filtered = s[s.str.contains("apple")]
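# ValueError: Cannot mask with non-boolean array containing NA / NaN values
# (The exact message depends on the pandas version. With the nullable "string"
# dtype, rows whose mask value is <NA> are silently excluded instead.)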
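
Before filtering, it can help to confirm that the mask really contains missing values. The sketch below is an illustration, not part of the original example. It assumes an object-dtype Series (set explicitly with dtype=object, since newer pandas versions may infer a dedicated string dtype that behaves differently), counts the NaN entries in the mask, and then catches the error raised by boolean indexing.

```python
import numpy as np
import pandas as pd

# dtype=object is an assumption made here so the behaviour matches the
# example above regardless of the default string dtype in your pandas version.
s = pd.Series(
    ["apple pie", "banana split", "cherry tart", np.nan, "apple sauce"],
    dtype=object,
)

mask = s.str.contains("apple")
missing_in_mask = int(mask.isna().sum())  # number of NaN entries in the mask
print(mask.dtype, missing_in_mask)

try:
    s[mask]
    error_message = None
except ValueError as exc:
    # Boolean indexing rejects an object-dtype mask that contains NaN.
    error_message = str(exc)
print(error_message)
```

A non-zero count from mask.isna().sum() is a quick signal that the na parameter is needed.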

The Fix: na=False

Pass na=False to treat NaN values as "does not contain the pattern":

python
result = s.str.contains("apple", na=False)
print(result)
# 0     True
# 1    False
# 2    False
# 3    False   <-- NaN becomes False
# 4     True

filtered = s[s.str.contains("apple", na=False)]
print(filtered)
# 0    apple pie
# 4    apple sauce

Setting na=True treats NaN as matching:

python
result = s.str.contains("apple", na=True)
print(result)
# 0     True
# 1    False
# 2    False
# 3     True   <-- NaN becomes True
# 4     True
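
One way to read na=True is "match, or missing". As a quick check of that reading, the sketch below compares it with an explicit combination of na=False and isna(). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["apple pie", "banana split", "cherry tart", np.nan, "apple sauce"])

# na=True: rows that match "apple" OR are missing.
match_or_missing = s.str.contains("apple", na=True)

# The same mask, spelled out explicitly.
explicit = s.str.contains("apple", na=False) | s.isna()

print(match_or_missing.tolist())
print(explicit.tolist())
```

This can be useful when rows with missing values should be kept for manual review instead of being discarded.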

How na= Works with Other str Methods

The na parameter is available on the pandas string accessor methods that return a boolean result:

python
s = pd.Series(["Hello", "world", np.nan, "HELLO", "Python"])

# str.startswith
s.str.startswith("H", na=False)
# 0     True
# 1    False
# 2    False  <-- NaN → False
# 3     True
# 4    False

# str.endswith
s.str.endswith("lo", na=False)

# str.match (regex match from start)
s.str.match(r"[A-Z]", na=False)

# str.fullmatch
s.str.fullmatch(r"\w+", na=False)
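
The last three calls above are shown without output. As a sketch, the following collects all four masks into one DataFrame so they can be compared side by side. The column names are made up for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(["Hello", "world", np.nan, "HELLO", "Python"])

masks = pd.DataFrame({
    "value": s,
    "startswith_H": s.str.startswith("H", na=False),
    "endswith_lo": s.str.endswith("lo", na=False),
    # match anchors the pattern at the start of the string
    "match_upper": s.str.match(r"[A-Z]", na=False),
    # fullmatch requires the whole string to match
    "fullmatch_word": s.str.fullmatch(r"\w+", na=False),
})
print(masks)
```

The NaN row comes out as False in all four mask columns.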

Using Regex with str.contains

str.contains supports regular expressions by default:

python
# Placeholder addresses for illustration
df = pd.DataFrame({
    "email": ["alice@gmail.com", "bob@yahoo.com", np.nan,
              "carol@gmail.com", "dave@outlook.com"]
})

# Find Gmail addresses using regex
gmail = df[df["email"].str.contains(r"@gmail\.com$", na=False)]
print(gmail)
#              email
# 0  alice@gmail.com
# 3  carol@gmail.com

# Case-insensitive search
df["email"].str.contains("GMAIL", case=False, na=False)

To search for literal special characters, disable regex:

python
s = pd.Series(["price is $10", "cost (estimated)", np.nan, "value: $20"])

# regex=True (default): $ and ( are regex metacharacters
# This would NOT work as expected:
# s.str.contains("$10", na=False)  # $ means end-of-string in regex

# Use regex=False for literal matching
s.str.contains("$10", regex=False, na=False)
# 0     True
# 1    False
# 2    False
# 3    False
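
When regex features are still needed, for example to search for several literal terms at once, re.escape() from the standard library is an alternative to regex=False. The following is a minimal sketch. The list of search terms is invented for illustration.

```python
import re

import numpy as np
import pandas as pd

s = pd.Series(["price is $10", "cost (estimated)", np.nan, "value: $20"])

# Escape each literal term, then join them with the regex alternation operator.
terms = ["$10", "(estimated)"]
pattern = "|".join(re.escape(term) for term in terms)

mask = s.str.contains(pattern, na=False)
print(pattern)
print(mask.tolist())

# For a single term, escaping gives the same mask as regex=False.
escaped_single = s.str.contains(re.escape("$10"), na=False)
literal_single = s.str.contains("$10", regex=False, na=False)
```

Escaping keeps the metacharacters literal while leaving the | operator available for combining terms.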

Filtering a DataFrame

python
df = pd.DataFrame({
    "product": ["iPhone 15", "Galaxy S24", np.nan, "Pixel 8", "iPhone 14"],
    "price": [999, 849, 0, 699, 799]
})

# Filter rows where product contains "iPhone"
iphones = df[df["product"].str.contains("iPhone", na=False)]
print(iphones)
#      product  price
# 0  iPhone 15    999
# 4  iPhone 14    799

# Negate: find products that do NOT contain "iPhone"
non_iphones = df[~df["product"].str.contains("iPhone", na=False)]
print(non_iphones)
#       product  price
# 1  Galaxy S24    849
# 2         NaN      0
# 3     Pixel 8    699
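
Note that the negated filter above keeps the row with the missing product, because na=False turned it into a non-match. If rows with missing values should be left out of the negated result as well, one option (a sketch, not the only approach) is to pass na=True before negating:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product": ["iPhone 15", "Galaxy S24", np.nan, "Pixel 8", "iPhone 14"],
    "price": [999, 849, 0, 699, 799],
})

# na=True marks the missing product as a "match", so ~ removes it as well.
known_non_iphones = df[~df["product"].str.contains("iPhone", na=True)]
print(known_non_iphones)
```

Only the Galaxy S24 and Pixel 8 rows remain, since the missing product is treated as a match and then negated away.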

Alternative: Drop NaN First

Instead of using na=False, you can drop or fill NaN values before filtering:

python
# Option 1: dropna before filtering
clean = s.dropna()
result = clean[clean.str.contains("apple")]

# Option 2: fillna before filtering
mask = s.fillna("").str.contains("apple")
result = s[mask]

# Option 3: notna combined with str.contains
# (notna() is redundant once na=False is passed; it only makes the intent explicit)
mask = s.notna() & s.str.contains("apple", na=False)
result = s[mask]

Using na=False is the most concise and idiomatic approach.
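
As a sanity check, the sketch below runs na=False and the dropna() and fillna() alternatives on the same data and confirms that they select the same rows. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(["apple pie", "banana split", "cherry tart", np.nan, "apple sauce"])

# Approach 1: let str.contains handle the missing value.
via_na = s[s.str.contains("apple", na=False)]

# Approach 2: remove missing values first, then filter.
clean = s.dropna()
via_dropna = clean[clean.str.contains("apple")]

# Approach 3: replace missing values with an empty string, then filter.
via_fillna = s[s.fillna("").str.contains("apple")]

print(via_na.equals(via_dropna), via_na.equals(via_fillna))
```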

Performance Comparison

python
# %timeit is an IPython/Jupyter magic; run these lines in a notebook or IPython shell
s = pd.Series(["text"] * 100000 + [np.nan] * 10000)

# na=False (fastest in this run; single pass)
%timeit s.str.contains("tex", na=False)
# ~5 ms

# fillna then contains (slower; two passes)
%timeit s.fillna("").str.contains("tex")
# ~8 ms

# dropna then contains (slower, and the result is shorter than the original Series)
%timeit s.dropna().str.contains("tex")
# ~7 ms

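In this benchmark, na=False is the fastest option, and it returns a mask aligned with the original index. Exact timings depend on the machine and the pandas version.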
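
The %timeit lines above only run in IPython or Jupyter. For a plain Python script, a rough equivalent can be written with the standard timeit module, as sketched below. The repeat count is an arbitrary choice, and no expected timings are given because they differ across machines and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(["text"] * 100_000 + [np.nan] * 10_000)

# Each candidate is wrapped in a lambda so timeit can call it repeatedly.
candidates = {
    "na=False": lambda: s.str.contains("tex", na=False),
    "fillna then contains": lambda: s.fillna("").str.contains("tex"),
    "dropna then contains": lambda: s.dropna().str.contains("tex"),
}

runs = 5  # arbitrary; increase for more stable numbers
timings_ms = {}
for label, func in candidates.items():
    total_seconds = timeit.timeit(func, number=runs)
    timings_ms[label] = total_seconds / runs * 1000
    print(f"{label}: {timings_ms[label]:.2f} ms per call")
```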

Common Pitfalls

  • Forgetting na=False: Without it, NaN values produce NaN in the boolean mask, causing ValueError or silent data loss when used for filtering. Always pass na=False unless you specifically want NaN rows included.
  • Not escaping regex metacharacters: Characters like ., $, (, *, and + have special meaning in regex. Use regex=False for literal string matching, or escape with re.escape().
  • Assuming na=False removes NaN entries from the mask: It does not. It sets the NaN entries to False, so the mask keeps the same length and index as the original Series. Filtering with the mask excludes the NaN rows, while filtering with the negated mask (~) includes them.
  • Case sensitivity: str.contains is case-sensitive by default. Use case=False for case-insensitive matching rather than converting the entire column with str.lower() first.
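  • Using str.contains on non-string columns: In an object column with mixed types (strings and integers), str.contains does not convert the non-string values. They produce NaN in the result, or the value passed as na if one is given. A column with a purely numeric dtype raises AttributeError as soon as .str is accessed. Convert the column explicitly before filtering if numeric values should be searched as text.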
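
The mixed-type pitfall is easy to check directly. In the sketch below (the sample values are invented), an object column holding both strings and an integer is searched with and without na=False, and an all-integer column is used to show that the .str accessor is rejected outright.

```python
import numpy as np
import pandas as pd

mixed = pd.Series(["item 10", 10, np.nan, "item 20"], dtype=object)

# Non-string entries are not converted to text; they come back as missing.
raw = mixed.str.contains("10")
print(raw.tolist())

# With na=False, both the integer and the NaN become False.
safe = mixed.str.contains("10", na=False)
print(safe.tolist())

# A column with no strings at all does not support the .str accessor.
numbers = pd.Series([10, 20, 30])
try:
    numbers.str.contains("10")
    accessor_error = None
except AttributeError as exc:
    accessor_error = str(exc)
print(accessor_error)
```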

Summary

  • Pass na=False to str.contains() to treat NaN values as non-matching (False)
  • Pass na=True to treat NaN values as matching (True)
  • The na parameter also works on str.startswith, str.endswith, str.match, and str.fullmatch
  • Use regex=False when searching for literal strings that contain regex metacharacters
  • na=False is more concise than the fillna("") or dropna() workarounds, and it was the fastest option in the benchmark above
