How to make good reproducible pandas examples

Introduction

A reproducible pandas example (often called a "minimal reproducible example" or MRE) is a self-contained code snippet that anyone can copy-paste and run to reproduce your result or problem. Good MREs include sample data, all necessary imports, the code that produces the issue, and the expected vs actual output. This is essential when asking questions on Stack Overflow, filing bug reports, or collaborating with teammates.

The Core Components

Every good pandas MRE needs four things:

  1. Imports — all libraries used in the example
  2. Sample data — a small DataFrame that reproduces the issue
  3. Code — the operation that demonstrates the problem
  4. Output — expected result vs actual result

Creating Sample DataFrames

From a Dictionary (Most Common)

python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [30, 25, 35, 28],
    'salary': [70000, 50000, 90000, 65000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Marketing']
})
print(df)
#       name  age  salary   department
# 0    Alice   30   70000  Engineering
# 1      Bob   25   50000    Marketing
# 2  Charlie   35   90000  Engineering
# 3    Diana   28   65000    Marketing

From a String (Copy-Paste Friendly)

python
import pandas as pd
from io import StringIO

data = """name,age,salary
Alice,30,70000
Bob,25,50000
Charlie,35,90000"""

df = pd.read_csv(StringIO(data))

Using pd.read_clipboard() (For Quick Sharing)

python
# Copy a table from Excel/web, then:
df = pd.read_clipboard()
print(df.to_dict())
# Share the dict output so others can recreate without clipboard

Generating Random Data

python
import pandas as pd
import numpy as np

np.random.seed(42)  # Makes random data reproducible

df = pd.DataFrame({
    'value': np.random.randn(100),
    'category': np.random.choice(['A', 'B', 'C'], 100),
    'date': pd.date_range('2025-01-01', periods=100, freq='D')
})

Always set np.random.seed() so others get the same random values.
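
As a variation on the block above, NumPy's newer Generator API keeps the seed inside one object instead of setting global state. This is a minimal sketch rather than part of the original example, and the helper name make_sample is hypothetical. Note that a Generator seeded with 42 does not produce the same numbers as np.random.seed(42), so use one approach consistently within an example.

```python
import numpy as np
import pandas as pd


def make_sample(seed=42, n=100):
    """Build the same random DataFrame every time for a given seed."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    return pd.DataFrame({
        'value': rng.standard_normal(n),
        'category': rng.choice(['A', 'B', 'C'], size=n),
        'date': pd.date_range('2025-01-01', periods=n, freq='D'),
    })


df = make_sample()
print(df.head())
```

Calling make_sample() twice with the same seed returns equal frames, which is the property a reader needs in order to reproduce the issue.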

Sharing Your DataFrame

When asking for help, include the DataFrame definition in a copy-pasteable format:

python
# Option 1: to_dict(), the most reliable
print(df.to_dict())
# {'name': {0: 'Alice', 1: 'Bob'}, 'age': {0: 30, 1: 25}}

# Option 2: to_dict('list'), cleaner for simple DataFrames
print(df.to_dict('list'))
# {'name': ['Alice', 'Bob'], 'age': [30, 25]}

# Recreate from dict:
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})
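
To save readers a step, the dict output can be wrapped in the constructor call before pasting. The following is a small sketch under the assumption that the frame holds only plain integers, floats, and strings; datetime or categorical values do not print as valid Python literals, so they need the explicit conversions described in the next section. The variable names snippet and namespace are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [30, 25]})

# Build a one-line constructor that can be pasted straight into a question.
snippet = f"df = pd.DataFrame({df.to_dict('list')})"
print(snippet)

# Check the round trip: running the snippet should rebuild an equal frame.
namespace = {'pd': pd}
exec(snippet, namespace)
print(namespace['df'].equals(df))
```

Running the generated line yourself before posting is a cheap way to confirm that others will get the same DataFrame.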

Including Data Types

If data types matter (they often do), show them:

python
print(df.dtypes)
# name          object
# age            int64
# salary         int64
# department    object
# dtype: object

# If a column needs a specific type, convert explicitly
# (this assumes df has 'date' and 'category' columns, as in the random-data example)
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
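
When the sample data is shared as CSV text, the types can also be restored at load time. This is a minimal sketch with invented data; the column names date, category, and value are assumptions chosen for illustration.

```python
from io import StringIO

import pandas as pd

data = """date,category,value
2025-01-01,A,1.5
2025-01-02,B,2.0
2025-01-03,A,3.5"""

# Without hints, 'date' and 'category' are read as plain strings.
plain = pd.read_csv(StringIO(data))

# With hints, the frame comes back with the dtypes the problem depends on.
typed = pd.read_csv(
    StringIO(data),
    parse_dates=['date'],
    dtype={'category': 'category'},
)
print(typed.dtypes)
```

Putting the type hints in the read_csv call keeps the example to a single statement that readers can copy.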

Show Expected vs Actual Output

python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 30, 40]
})

# What I tried:
result = df.groupby('group').sum()

# What I got:
#        value
# group
# A         30
# B         70

# What I expected:
# (describe expected output here)
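
The expected output can also be written as a DataFrame, which removes any ambiguity about index names and dtypes. The sketch below uses pandas.testing.assert_frame_equal to compare the two; the variable names expected and matches are illustrative. In this toy case the result and the expectation agree, whereas in a real question they would differ and the printed message would describe the first mismatch.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 30, 40]
})

result = df.groupby('group').sum()

# Expected output, written as a DataFrame instead of a description.
expected = pd.DataFrame(
    {'value': [30, 70]},
    index=pd.Index(['A', 'B'], name='group'),
)

try:
    assert_frame_equal(result, expected)
    matches = True
except AssertionError as err:
    matches = False
    print(err)  # describes the first difference found

print(matches)
```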

Including Version Information

python
import sys

import numpy as np
import pandas as pd

print(pd.__version__)   # e.g. 2.2.0
print(np.__version__)   # e.g. 1.26.4
print(sys.version)      # e.g. 3.12.1 (main, ...) followed by build details

Behavior can differ between versions, so always include this information in bug reports.
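
One way to gather this in a single pasteable line is sketched below. The variable name versions is illustrative; pd.show_versions() is a pandas function that prints a fuller environment report and is left commented out here because its output is long.

```python
import platform

import numpy as np
import pandas as pd

# Collect the versions into one line that can be pasted under the example.
versions = (
    f"python {platform.python_version()}, "
    f"pandas {pd.__version__}, "
    f"numpy {np.__version__}"
)
print(versions)

# For bug reports, pandas can also print a full environment report:
# pd.show_versions()
```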

Template for Stack Overflow

python
import pandas as pd
import numpy as np

# Sample data
df = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03']),
    'value': [100, 200, 150],
    'category': ['A', 'B', 'A']
})

# What I tried
result = df.groupby('category')['value'].mean()

# Actual output
print(result)

# Expected output
# category
# A    125.0
# B    200.0

# Versions
print(f"pandas: {pd.__version__}")

Tips for Minimal Data

Keep the DataFrame as small as possible while still reproducing the issue:

python
# BAD: 1000 rows when 4 would suffice
df = pd.DataFrame({'x': range(1000), 'y': range(1000)})

# GOOD: minimal rows that show the problem
df = pd.DataFrame({'x': [1, 2, None, 4], 'y': [10, 20, 30, 40]})
# The None value is what causes the issue, so 4 rows are enough

Include edge cases that trigger the problem: None/NaN values, duplicate indices, mixed types, or empty groups.
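
Each of these edge cases fits in a frame of three or four rows. The following sketch builds one tiny example per case; all names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Missing value: one NaN is enough to change dtype and aggregation behavior.
with_nan = pd.DataFrame({'x': [1, 2, np.nan, 4], 'y': [10, 20, 30, 40]})

# Duplicate index labels: label-based selection can return several rows.
dup_index = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'a', 'b'])

# Mixed types: numbers and strings in one column give an object dtype.
mixed = pd.DataFrame({'x': [1, '2', 3.0]})

# Empty group: a categorical with an unused category 'C'.
empty_group = pd.DataFrame({
    'group': pd.Categorical(['A', 'B', 'A'], categories=['A', 'B', 'C']),
    'value': [1, 2, 3],
})

print(with_nan['x'].dtype)
print(dup_index.loc['a'])
print(mixed['x'].dtype)
print(empty_group.groupby('group', observed=False)['value'].sum())
```

Start from whichever of these matches your problem and add rows only if the issue stops reproducing.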

Common Pitfalls

  • Using real data: Never share proprietary or sensitive data. Create synthetic data that reproduces the same structure and problem.
  • Missing imports: Always include import pandas as pd and any other libraries. Do not assume the reader has already imported them.
  • Omitting the seed: np.random.seed(42) must appear before any random data generation. Without it, every person gets different data and may not reproduce the issue.
  • Too much data: If your DataFrame has 50 columns but the issue involves 2, include only those 2 columns. Strip everything irrelevant.
  • Not showing the error: Include the full traceback, not just "it doesn't work." Use print(result) to show the actual output explicitly.
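
The first and fourth pitfalls can often be handled together: drop the irrelevant columns, then swap real values for neutral ones. The sketch below shows one possible approach, not a complete anonymization method; the frame real and its contents are invented, and the cust_ label scheme is hypothetical.

```python
import pandas as pd

# Stand-in for a sensitive frame; these values are invented for illustration.
real = pd.DataFrame({
    'customer': ['Acme Corp', 'Globex', 'Acme Corp', 'Initech'],
    'revenue': [1200.0, 850.0, 430.0, 990.0],
    'internal_notes': ['n/a', 'n/a', 'n/a', 'n/a'],  # irrelevant to the issue
})

# Keep only the columns the issue involves.
shareable = real[['customer', 'revenue']].copy()

# Replace real labels with neutral ones, preserving which rows share a label.
codes, _ = pd.factorize(shareable['customer'])
shareable['customer'] = [f'cust_{code}' for code in codes]

# Replace real numbers with simple made-up ones of the same dtype.
shareable['revenue'] = [float(i * 10) for i in range(1, len(shareable) + 1)]

print(shareable)
```

After substituting values, rerun your code on the new frame to confirm that the problem still appears before posting it.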

Summary

  • Include all imports, sample data, your code, and expected vs actual output
  • Create small DataFrames with pd.DataFrame({...}) — keep rows under 10 when possible
  • Use np.random.seed() for reproducible random data
  • Share data with df.to_dict() for easy copy-paste reconstruction
  • Include pd.__version__ for version-sensitive issues
  • Strip the example to the minimum data and code that reproduces the problem
