How to make good reproducible pandas examples

Introduction

A reproducible pandas example (often called a "minimal reproducible example" or MRE) is a self-contained code snippet that anyone can copy-paste and run to reproduce your result or problem. Good MREs include sample data, all necessary imports, the code that produces the issue, and the expected vs actual output. This is essential when asking questions on Stack Overflow, filing bug reports, or collaborating with teammates.

The Core Components

Every good pandas MRE needs four things:

Imports — all libraries used in the example
Sample data — a small DataFrame that reproduces the issue
Code — the operation that demonstrates the problem
Output — expected result vs actual result

Creating Sample DataFrames

From a Dictionary (Most Common)

python

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [30, 25, 35, 28],
    'salary': [70000, 50000, 90000, 65000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Marketing']
})
print(df)
#       name  age  salary   department
# 0    Alice   30   70000  Engineering
# 1      Bob   25   50000    Marketing
# 2  Charlie   35   90000  Engineering
# 3    Diana   28   65000    Marketing

From a String (Copy-Paste Friendly)

python

import pandas as pd
from io import StringIO

data = """name,age,salary
Alice,30,70000
Bob,25,50000
Charlie,35,90000"""

df = pd.read_csv(StringIO(data))
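
The same StringIO technique also works for a table pasted as printed, whitespace-aligned output rather than CSV. A minimal sketch, assuming no cell value contains a space (the sample text and the choice of `sep=r'\s+'` are illustrative additions, not part of the original example):

```python
import pandas as pd
from io import StringIO

# Printed DataFrame output, pasted as-is (index column included)
text = """
      name  age  salary
0    Alice   30   70000
1      Bob   25   50000
2  Charlie   35   90000
"""

# sep=r'\s+' splits on runs of whitespace. The header row has one field
# fewer than the data rows, so pandas uses the first column as the index.
df = pd.read_csv(StringIO(text), sep=r'\s+')
print(df)
```

This is handy when answering someone else's question, since printed output is what most people paste.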

Using pd.read_clipboard() (For Quick Sharing)

python

import pandas as pd

# Copy a table from Excel/web, then:
df = pd.read_clipboard()
print(df.to_dict())
# Share the dict output so others can recreate without clipboard

Generating Random Data

python

import pandas as pd
import numpy as np

np.random.seed(42)  # Makes random data reproducible

df = pd.DataFrame({
    'value': np.random.randn(100),
    'category': np.random.choice(['A', 'B', 'C'], 100),
    'date': pd.date_range('2025-01-01', periods=100, freq='D')
})

Always set np.random.seed() so others get the same random values.
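
NumPy also offers a newer Generator interface, `np.random.default_rng`, which keeps the seed local to one generator object instead of setting global state. A sketch of the same DataFrame built that way (the helper name `make_df` is made up for illustration):

```python
import numpy as np
import pandas as pd

def make_df(seed=42):
    """Build the sample frame from a locally seeded generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        'value': rng.standard_normal(100),
        'category': rng.choice(['A', 'B', 'C'], 100),
        'date': pd.date_range('2025-01-01', periods=100, freq='D'),
    })

df = make_df()
# Same seed, same data: a second call reproduces the frame exactly
print(df.equals(make_df()))
```

The values differ from those produced by `np.random.seed(42)`, so use one approach consistently within an example.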

Sharing Your DataFrame

When asking for help, include the DataFrame definition in a copy-pasteable format:

python

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})

# Option 1: to_dict() keeps the index labels, so it is the most reliable
print(df.to_dict())
# {'name': {0: 'Alice', 1: 'Bob'}, 'age': {0: 30, 1: 25}}

# Option 2: to_dict('list') drops the index; cleaner for simple DataFrames
print(df.to_dict('list'))
# {'name': ['Alice', 'Bob'], 'age': [30, 25]}

# Recreate from dict:
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})
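
Before posting, it can be worth checking that the shared dict really rebuilds the same DataFrame. A small round-trip sketch, assuming a default integer index (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})

# This is the text you would paste into the question
shared = df.to_dict('list')

# Rebuild from the shared dict and compare with the original
rebuilt = pd.DataFrame(shared)
pd.testing.assert_frame_equal(df, rebuilt)  # raises AssertionError on mismatch
print("round trip OK")
```

For a larger frame, `df.head(5).to_dict('list')` keeps the pasted text short. If the index carries meaning, plain `df.to_dict()` is the safer choice because `'list'` discards it.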

Including Data Types

If data types matter (they often do), show them:

python

print(df.dtypes)
# name          object
# age            int64
# salary         int64
# department    object
# dtype: object

# If a column needs a specific type, convert explicitly
# ('date' and 'category' stand in for columns in your own DataFrame)
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
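
One reason dtypes are worth showing: a single missing value changes the dtype pandas infers for a column, which can change how later operations behave. A brief sketch with made-up data:

```python
import pandas as pd

complete = pd.DataFrame({'x': [1, 2, 3, 4]})
with_missing = pd.DataFrame({'x': [1, 2, None, 4]})

# Integers stay integers when nothing is missing
print(complete['x'].dtype)

# With a None present, pandas stores the column as floating point
# (the None becomes NaN)
print(with_missing['x'].dtype)
```

Readers who see only the printed values may not notice the difference, which is why printing `df.dtypes` alongside the data helps.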

Show Expected vs Actual Output

python

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 30, 40]
})

# What I tried:
result = df.groupby('group').sum()
print(result)

# What I got:
#        value
# group
# A         30
# B         70

# What I expected:
# (describe expected output here)
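
Where the expected result is known, writing it out as an actual DataFrame lets readers compare it with the result programmatically. A sketch under the assumption (hypothetical, not stated in the example above) that the asker wanted the per-group mean rather than the sum:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 30, 40]
})

# What I tried
result = df.groupby('group').sum()

# What I expected (hypothetical): the mean of each group
expected = pd.DataFrame(
    {'value': [15.0, 35.0]},
    index=pd.Index(['A', 'B'], name='group'),
)

# False here, which confirms the mismatch the question is about
print(result.equals(expected))
```

An answerer can then paste a fix and check it against `expected` directly instead of comparing printed tables by eye.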

Including Version Information

python

import sys

import numpy as np
import pandas as pd

print(pd.__version__)   # e.g. 2.2.0
print(np.__version__)   # e.g. 1.26.4
print(sys.version)      # e.g. starts with 3.12.1

Behavior can change between pandas, NumPy, and Python releases, so always include version information in bug reports.
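
The version numbers can also be collected into a single block that is easy to paste under a question. A minimal sketch (the `versions` dict is just an illustrative layout):

```python
import platform

import numpy as np
import pandas as pd

versions = {
    'python': platform.python_version(),
    'pandas': pd.__version__,
    'numpy': np.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```

For bug reports against pandas itself, `pd.show_versions()` prints a fuller report that also covers optional dependencies.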

Template for Stack Overflow

python

import pandas as pd

# Sample data
df = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03']),
    'value': [100, 200, 150],
    'category': ['A', 'B', 'A']
})

# What I tried
result = df.groupby('category')['value'].mean()

# Actual output
print(result)

# Expected output
# category
# A    125.0
# B    200.0

# Versions
print(f"pandas: {pd.__version__}")

Tips for Minimal Data

Keep the DataFrame as small as possible while still reproducing the issue:

python

# BAD: 1000 rows when 4 would suffice
df = pd.DataFrame({'x': range(1000), 'y': range(1000)})

# GOOD: minimal rows that show the problem
df = pd.DataFrame({'x': [1, 2, None, 4], 'y': [10, 20, 30, 40]})
# The None value is what causes the issue, so 4 rows are enough

Include edge cases that trigger the problem: None/NaN values, duplicate indices, mixed types, or empty groups.
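
A few of these edge cases can be constructed in a line or two each. The frames below are made-up illustrations, not data from a real issue:

```python
import pandas as pd

# Duplicate index labels
dup = pd.DataFrame({'value': [1, 2, 3]}, index=['a', 'a', 'b'])
print(dup.index.is_unique)   # False

# Mixed types in one column: pandas falls back to the generic object dtype
mixed = pd.DataFrame({'col': [1, '2', 3.0]})
print(mixed['col'].dtype)    # object

# Empty group: category 'C' is declared but never appears in the data
cat = pd.DataFrame({
    'key': pd.Categorical(['A', 'B', 'A'], categories=['A', 'B', 'C']),
    'value': [10, 20, 30],
})
totals = cat.groupby('key', observed=False)['value'].sum()
print(totals['C'])           # 0
```

Include only the edge case that actually triggers your problem; the others would be noise.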

Common Pitfalls

Using real data: Never share proprietary or sensitive data. Create synthetic data that reproduces the same structure and problem.
Missing imports: Always include import pandas as pd and any other libraries. Do not assume the reader has already imported them.
Omitting the seed: np.random.seed(42) must appear before any random data generation. Without it, every person gets different data and may not reproduce the issue.
Too much data: If your DataFrame has 50 columns but the issue involves 2, include only those 2 columns. Strip everything irrelevant.
Not showing the error: Include the full traceback, not just "it doesn't work." Use print(result) to show the actual output explicitly.
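
Several of these pitfalls can be addressed in one pass: select only the relevant columns and replace sensitive values with neutral labels. A sketch with hypothetical column names (`customer`, `amount`, `notes`) standing in for real data:

```python
import pandas as pd

# Stand-in for a real, wider DataFrame with sensitive values
real = pd.DataFrame({
    'customer': ['Acme Corp', 'Globex', 'Acme Corp', 'Initech'],
    'amount': [120.0, 75.5, 60.0, 210.0],
    'notes': ['internal', 'internal', 'internal', 'internal'],
})

# Keep only the columns the question is about
sample = real[['customer', 'amount']].copy()

# Replace each distinct customer with a neutral label,
# preserving which rows share the same customer
codes, _ = pd.factorize(sample['customer'])
sample['customer'] = [f"cust_{code}" for code in codes]

print(sample.to_dict('list'))
```

Relabelling keeps the grouping structure intact, so an issue that depends on repeated keys still reproduces. Check that the problem still occurs on the reduced frame before posting it.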

Summary

Include all imports, sample data, your code, and expected vs actual output
Create small DataFrames with pd.DataFrame({...}) — keep rows under 10 when possible
Use np.random.seed() for reproducible random data
Share data with df.to_dict() for easy copy-paste reconstruction
Include pd.__version__ for version-sensitive issues
Strip the example to the minimum data and code that reproduces the problem
