Get pandas.read_csv to read empty values as empty string instead of nan

Introduction

By default, pandas.read_csv interprets empty fields as missing values (NaN, or <NA> when nullable dtypes are in use), which is usually correct for analytics. But some workflows need empty strings preserved exactly, such as CSV round-tripping, text normalization pipelines, or downstream systems that distinguish "" from null. Unless you configure the parser, pandas silently converts blanks into missing markers, changing semantics before you begin processing.
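The default behavior is easy to confirm on a tiny input. The following is a minimal sketch; the sample data and variable names are made up for illustration.

```python
import pandas as pd
from io import StringIO

# Illustrative data: Bob's city is blank.
sample = "name,city\nAlice,New York\nBob,\n"

default_df = pd.read_csv(StringIO(sample))

# With default settings the blank field is parsed as a missing value,
# so pd.isna reports True for that cell.
blank_is_missing = bool(pd.isna(default_df.loc[1, "city"]))
print(blank_is_missing)
```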

To keep empty cells as empty strings, configure NA detection and column dtypes deliberately. The right settings depend on whether you want this behavior globally or only for specific columns. This article shows practical patterns, tradeoffs, and validation techniques.

Core Sections

Disable NA detection for strict string preservation

The most direct approach is na_filter=False, which stops missing-value detection.

python
import pandas as pd
from io import StringIO

csv_data = StringIO("""name,city,score
Alice,New York,10
Bob,,
,London,7
""")

# na_filter=False turns off missing-value detection for every column.
df = pd.read_csv(csv_data, na_filter=False)
print(df)

Empty fields are read as "" instead of NaN. This is useful when a blank means intentionally empty text.
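One side effect is worth checking before adopting this setting everywhere: a numeric column that contains a blank can no longer be inferred as numeric. The sketch below uses made-up variable names and the same three-row data to inspect the score column.

```python
import pandas as pd
from io import StringIO

sample = "name,city,score\nAlice,New York,10\nBob,,\n,London,7\n"

unfiltered = pd.read_csv(StringIO(sample), na_filter=False)

# The blank in "score" blocks numeric inference, so the values stay as text.
score_values = unfiltered["score"].tolist()
score_is_numeric = pd.api.types.is_numeric_dtype(unfiltered["score"])
print(score_values, score_is_numeric)
```

If the column has to be numeric later, it needs an explicit conversion step after reading.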

Control default NA tokens explicitly

Pandas treats tokens like "NA", "N/A", and empty strings as missing by default. Override this behavior with keep_default_na=False.

python
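import pandas as pd
from io import StringIO

csv_data = StringIO("""code,comment
NA,keep literal
,blank field
""")

# Drop the built-in NA token list; no na_values are supplied, so nothing is treated as missing.
df = pd.read_csv(csv_data, keep_default_na=False)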

With this option, "NA" remains literal text, and blank fields remain empty strings unless you supply your own missing-value markers through na_values.
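When a feed does use one explicit marker for missing data, keep_default_na=False can be combined with na_values so that only that marker is treated as missing. In this sketch the "NULL" marker and the sample rows are assumptions chosen for illustration.

```python
import pandas as pd
from io import StringIO

# Illustrative data: "NULL" is the only missing marker this hypothetical feed uses.
sample = "code,comment\nNA,keep literal\nNULL,explicit null\n,blank field\n"

df_custom = pd.read_csv(
    StringIO(sample),
    keep_default_na=False,  # drop the built-in NA token list
    na_values=["NULL"],     # treat only this token as missing
)

print(df_custom["code"].tolist())
```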

Enforce string dtype where needed

If a column should be textual, set its dtype explicitly. This prevents numeric coercion (for example, ZIP codes losing their leading zeros) and keeps missing-value handling consistent within the column.

python
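import pandas as pd
from io import StringIO

csv_data = StringIO("""id,zip
1,00123
2,
3,04567
""")

df = pd.read_csv(
    csv_data,
    keep_default_na=False,    # blanks are not treated as missing
    dtype={"zip": "string"},  # read zip as text so leading zeros survive
)

print(df["zip"].tolist())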

Typed string columns reduce surprises when exporting, validating, or joining with external text datasets.
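The coercion risk is easiest to see on a column with no blanks at all, where pandas is free to infer integers. A small sketch with illustrative data and variable names:

```python
import pandas as pd
from io import StringIO

sample = "id,zip\n1,00123\n2,04567\n"

# Same input, read once with inferred dtypes and once with zip forced to text.
inferred = pd.read_csv(StringIO(sample))
typed = pd.read_csv(StringIO(sample), dtype={"zip": "string"})

print(inferred["zip"].tolist())  # numeric values, leading zeros dropped
print(typed["zip"].tolist())     # text values, leading zeros kept
```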

Handle selective columns differently

Sometimes you want empty strings in text columns but true missing values in numeric fields.

python
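import pandas as pd
from io import StringIO

csv_data = StringIO("""name,age,nickname
Alice,30,
Bob,,Bobby
""")

# Stage 1: read everything with blanks preserved as "".
raw = pd.read_csv(csv_data, keep_default_na=False)
# Stage 2: convert age to numeric; blanks and other unparseable text become NaN.
raw["age"] = pd.to_numeric(raw["age"], errors="coerce")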

This two-stage parse keeps textual empties intact while restoring numeric missingness intentionally.
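A single-pass alternative is to pass na_values as a dict, so that the empty string counts as missing only in the columns you name. This is a sketch with illustrative names; it relies on per-column na_values together with keep_default_na=False, which is worth verifying against the pandas version you run.

```python
import pandas as pd
from io import StringIO

sample = "name,age,nickname\nAlice,30,\nBob,,Bobby\n"

df_mixed = pd.read_csv(
    StringIO(sample),
    keep_default_na=False,    # no built-in NA tokens anywhere
    na_values={"age": [""]},  # blank means missing only in the age column
)

print(df_mixed)
```

Compared with the two-stage version, this keeps the rule in the read call itself, but it only handles blanks; other unparseable values in age would still leave the column as text.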

Normalize after reading when upstream is inconsistent

If you cannot control read settings globally, normalize columns post-load.

python
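def normalize_text_empties(df, cols):
    # Note: modifies the passed DataFrame in place and also returns it.
    for c in cols:
        # Replace missing values with "" and store the column as pandas string dtype.
        df[c] = df[c].fillna("").astype("string")
    return df

# raw is the DataFrame from the previous example.
clean = normalize_text_empties(raw, ["name", "nickname"])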

Post-processing is useful in shared codebases where different data sources require different missing-data rules.

Validate behavior with assertions

Always add checks so parser options do not regress.

python
# df here is the frame from the first example (name, city, score read with na_filter=False).
assert df.loc[1, "city"] == ""
assert df.loc[2, "name"] == ""

Small assertions save time when pandas version updates or utility wrappers change defaults.
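Cell-by-cell assertions can be generalized into a column-level check. The helper below is a hypothetical sketch (the name assert_no_missing is not a pandas function) that fails if any of the listed text columns contains a missing value.

```python
import pandas as pd
from io import StringIO

def assert_no_missing(df, cols):
    """Raise AssertionError if any listed column contains a missing value."""
    missing_counts = df[cols].isna().sum()
    bad = missing_counts[missing_counts > 0]
    assert bad.empty, f"unexpected missing values: {bad.to_dict()}"

sample = "name,city\nAlice,New York\nBob,\n"

# Read with blanks preserved, then confirm no text column picked up NaN.
checked = pd.read_csv(StringIO(sample), keep_default_na=False)
assert_no_missing(checked, ["name", "city"])
```

Running the same check on a frame read with default settings would raise, which is the regression signal you want.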

Common Pitfalls

  • Assuming empty cells stay empty strings by default, then discovering downstream code receives NaN.
  • Disabling NA parsing globally and accidentally keeping true missing markers as literal text everywhere.
  • Letting pandas infer mixed dtypes, which causes inconsistent treatment of blanks across columns.
  • Converting columns to numeric too early and losing intentional empty-string semantics.
  • Skipping validation tests, allowing parser configuration drift to silently alter data contracts.

Summary

To read empty CSV values as empty strings in pandas, configure parser options intentionally, usually with na_filter=False or keep_default_na=False, and set column dtypes where needed. For mixed-use datasets, parse broadly then normalize per column so textual blanks and numeric nulls each keep the meaning you want. Treat this as part of your data contract and add assertions to prevent regressions when dependencies or parsing utilities evolve.

In multi-pipeline organizations, wrap these parser settings in a shared utility function rather than repeating raw read_csv calls everywhere. Centralization prevents inconsistent defaults, makes code reviews easier, and allows one controlled update when parser behavior needs to change. This also helps onboarding, because analysts can rely on one documented import function instead of memorizing multiple parser flag combinations.
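Such a shared utility might look like the following sketch. The function name read_csv_keep_blanks and its numeric_cols parameter are hypothetical; a real wrapper would reflect your own data contract.

```python
import pandas as pd
from io import StringIO

def read_csv_keep_blanks(source, numeric_cols=(), **kwargs):
    """Hypothetical shared reader: blanks stay "", listed columns become numeric."""
    # keep_default_na is fixed here on purpose; passing it again in kwargs raises TypeError.
    df = pd.read_csv(source, keep_default_na=False, **kwargs)
    for col in numeric_cols:
        # Blanks and unparseable text in numeric columns become NaN.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

people = read_csv_keep_blanks(
    StringIO("name,age,nickname\nAlice,30,\nBob,,Bobby\n"),
    numeric_cols=["age"],
)
print(people)
```

Fixing keep_default_na inside the wrapper means callers cannot silently override the policy, while other read_csv options still pass through.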
