data analysis
pandas
python
DataFrame comparison
data science

Compare two DataFrames and output their differences side-by-side

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Comparing two pandas DataFrames is a standard step in ETL validation, migration testing, and data quality checks. Useful diff output depends on goal: row presence changes, cell value changes, or schema drift. A reliable side-by-side comparison workflow aligns frames first, then applies explicit comparison rules for nulls and numeric tolerance.

Align DataFrames Before Diffing

Most false positives come from mismatched ordering, not actual value drift. Align index and columns first.

python
1import pandas as pd
2
3left = pd.DataFrame(
4    {
5        "id": [1, 2, 3],
6        "name": ["A", "B", "C"],
7        "score": [90.0, 81.0, 75.0],
8    }
9).set_index("id")
10
11right = pd.DataFrame(
12    {
13        "id": [1, 2, 3],
14        "name": ["A", "B2", "C"],
15        "score": [90.0, 80.9999, 75.0],
16    }
17).set_index("id")
18
19left, right = left.align(right, join="outer", axis=0)
20left, right = left.align(right, join="outer", axis=1)

Without this step, you can report differences that are only sort-order artifacts.

Cell-Level Side-by-Side Diff with compare

For value-level changes, DataFrame.compare is the clearest built-in method.

python
cell_diff = left.compare(right, keep_shape=True, keep_equal=False)
print(cell_diff)

Output pairs changed values by column using self and other subcolumns. This is usually the best format for human review in debugging and pull request attachments.

Row-Level Added and Removed Records

Cell diff does not show missing rows clearly. Use merge indicator for presence-level drift.

python
1l = left.reset_index()
2r = right.reset_index()
3
4row_presence = l.merge(r, on="id", how="outer", indicator=True, suffixes=("_left", "_right"))
5print(row_presence[row_presence["_merge"] != "both"])

This identifies rows only in left or only in right quickly.

Numeric Tolerance for Floating Fields

Exact comparison is often too strict for floating values because minor precision differences are expected.

python
1import numpy as np
2
3mask_close = np.isclose(left["score"], right["score"], rtol=1e-5, atol=1e-8)
4print(mask_close)
5print("material score diffs:", list(left.index[~mask_close]))

Tolerance-aware checks reduce noise and focus attention on meaningful drift.

Null and Type Normalization Rules

Before diffing, normalize nulls and data types according to business rules. Null handling choices should be explicit.

python
1def normalize_for_diff(df: pd.DataFrame) -> pd.DataFrame:
2    out = df.copy()
3
4    for col in out.columns:
5        if pd.api.types.is_string_dtype(out[col]):
6            out[col] = out[col].fillna("")
7
8    return out
9
10left_n = normalize_for_diff(left)
11right_n = normalize_for_diff(right)
12print(left_n.compare(right_n, keep_shape=True, keep_equal=False))

If your policy treats null and empty string differently, do not normalize this way. The point is to define one rule and apply it consistently.

Create Compact Diff Reports for Review

Full side-by-side output can be large. For review workflows, build concise report listing key and changed columns.

python
1changes = []
2
3for idx in left.index.intersection(right.index):
4    cols = [c for c in left.columns if left.at[idx, c] != right.at[idx, c]]
5    if cols:
6        changes.append({"id": idx, "changed_columns": cols})
7
8report = pd.DataFrame(changes)
9print(report)

This is easier for non-analyst stakeholders than raw matrix output.

Export Side-by-Side Snapshot

For audit artifacts, export left and right prefixed columns together.

python
1combined = pd.concat(
2    [left.add_prefix("left_"), right.add_prefix("right_")],
3    axis=1,
4)
5combined.to_csv("df_side_by_side_diff.csv")

Prefixing avoids column-name collisions and keeps value provenance obvious.

Add Schema Drift Checks First

Before value comparison, verify structural compatibility.

python
1missing_in_right = set(left.columns) - set(right.columns)
2missing_in_left = set(right.columns) - set(left.columns)
3
4print("missing in right:", sorted(missing_in_right))
5print("missing in left:", sorted(missing_in_left))
6print("dtype differences:")
7print(left.dtypes.compare(right.dtypes))

Schema mismatches should usually fail fast before cell-level comparison runs.

Common Pitfalls

  • Comparing unaligned DataFrames and reporting false differences.
  • Using exact float equality for noisy numeric fields.
  • Ignoring null normalization policy and getting inconsistent diff output.
  • Mixing row-level and cell-level diff concerns in one unreadable report.
  • Duplicating custom diff logic across scripts instead of centralizing helper utilities.

Summary

  • Align index and columns before any DataFrame comparison.
  • Use compare for side-by-side cell-level differences.
  • Use merge indicator for row presence changes.
  • Apply explicit tolerance and null rules to reduce noisy diffs.
  • Run schema drift checks before value checks.
  • Export concise, review-friendly diff reports for automated and manual workflows.

Course illustration
Course illustration

All Rights Reserved.