Compare two DataFrames and output their differences side-by-side

data analysis

pandas

python

DataFrame comparison

data science

Compare two DataFrames and output their differences side-by-side

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Comparing two pandas DataFrames is a standard step in ETL validation, migration testing, and data quality checks. Useful diff output depends on goal: row presence changes, cell value changes, or schema drift. A reliable side-by-side comparison workflow aligns frames first, then applies explicit comparison rules for nulls and numeric tolerance.

Align DataFrames Before Diffing

Most false positives come from mismatched ordering, not actual value drift. Align index and columns first.

python

1import pandas as pd
2
3left = pd.DataFrame(
4    {
5        "id": [1, 2, 3],
6        "name": ["A", "B", "C"],
7        "score": [90.0, 81.0, 75.0],
8    }
9).set_index("id")
10
11right = pd.DataFrame(
12    {
13        "id": [1, 2, 3],
14        "name": ["A", "B2", "C"],
15        "score": [90.0, 80.9999, 75.0],
16    }
17).set_index("id")
18
19left, right = left.align(right, join="outer", axis=0)
20left, right = left.align(right, join="outer", axis=1)

Without this step, you can report differences that are only sort-order artifacts.

Cell-Level Side-by-Side Diff with `compare`

For value-level changes, DataFrame.compare is the clearest built-in method.

python

cell_diff = left.compare(right, keep_shape=True, keep_equal=False)
print(cell_diff)

Output pairs changed values by column using self and other subcolumns. This is usually the best format for human review in debugging and pull request attachments.

Row-Level Added and Removed Records

Cell diff does not show missing rows clearly. Use merge indicator for presence-level drift.

python

1l = left.reset_index()
2r = right.reset_index()
3
4row_presence = l.merge(r, on="id", how="outer", indicator=True, suffixes=("_left", "_right"))
5print(row_presence[row_presence["_merge"] != "both"])

This identifies rows only in left or only in right quickly.

Numeric Tolerance for Floating Fields

Exact comparison is often too strict for floating values because minor precision differences are expected.

python

1import numpy as np
2
3mask_close = np.isclose(left["score"], right["score"], rtol=1e-5, atol=1e-8)
4print(mask_close)
5print("material score diffs:", list(left.index[~mask_close]))

Tolerance-aware checks reduce noise and focus attention on meaningful drift.

Null and Type Normalization Rules

Before diffing, normalize nulls and data types according to business rules. Null handling choices should be explicit.

python

1def normalize_for_diff(df: pd.DataFrame) -> pd.DataFrame:
2    out = df.copy()
3
4    for col in out.columns:
5        if pd.api.types.is_string_dtype(out[col]):
6            out[col] = out[col].fillna("")
7
8    return out
9
10left_n = normalize_for_diff(left)
11right_n = normalize_for_diff(right)
12print(left_n.compare(right_n, keep_shape=True, keep_equal=False))

If your policy treats null and empty string differently, do not normalize this way. The point is to define one rule and apply it consistently.

Create Compact Diff Reports for Review

Full side-by-side output can be large. For review workflows, build concise report listing key and changed columns.

python

1changes = []
2
3for idx in left.index.intersection(right.index):
4    cols = [c for c in left.columns if left.at[idx, c] != right.at[idx, c]]
5    if cols:
6        changes.append({"id": idx, "changed_columns": cols})
7
8report = pd.DataFrame(changes)
9print(report)

This is easier for non-analyst stakeholders than raw matrix output.

Export Side-by-Side Snapshot

For audit artifacts, export left and right prefixed columns together.

python

1combined = pd.concat(
2    [left.add_prefix("left_"), right.add_prefix("right_")],
3    axis=1,
4)
5combined.to_csv("df_side_by_side_diff.csv")

Prefixing avoids column-name collisions and keeps value provenance obvious.

Add Schema Drift Checks First

Before value comparison, verify structural compatibility.

python

1missing_in_right = set(left.columns) - set(right.columns)
2missing_in_left = set(right.columns) - set(left.columns)
3
4print("missing in right:", sorted(missing_in_right))
5print("missing in left:", sorted(missing_in_left))
6print("dtype differences:")
7print(left.dtypes.compare(right.dtypes))

Schema mismatches should usually fail fast before cell-level comparison runs.

Common Pitfalls

Comparing unaligned DataFrames and reporting false differences.
Using exact float equality for noisy numeric fields.
Ignoring null normalization policy and getting inconsistent diff output.
Mixing row-level and cell-level diff concerns in one unreadable report.
Duplicating custom diff logic across scripts instead of centralizing helper utilities.

Summary

Align index and columns before any DataFrame comparison.
Use compare for side-by-side cell-level differences.
Use merge indicator for row presence changes.
Apply explicit tolerance and null rules to reduce noisy diffs.
Run schema drift checks before value checks.
Export concise, review-friendly diff reports for automated and manual workflows.

Compare two DataFrames and output their differences side-by-side

Master System Design with Codemia

Introduction

Align DataFrames Before Diffing

Cell-Level Side-by-Side Diff with compare

Row-Level Added and Removed Records

Numeric Tolerance for Floating Fields

Null and Type Normalization Rules

Create Compact Diff Reports for Review

Export Side-by-Side Snapshot

Add Schema Drift Checks First

Common Pitfalls

Summary

Cell-Level Side-by-Side Diff with `compare`