How to make good reproducible pandas examples
Introduction
A reproducible pandas example (often called a "minimal reproducible example" or MRE) is a self-contained code snippet that anyone can copy-paste and run to reproduce your result or problem. Good MREs include sample data, all necessary imports, the code that produces the issue, and the expected vs actual output. This is essential when asking questions on Stack Overflow, filing bug reports, or collaborating with teammates.
The Core Components
Every good pandas MRE needs four things:
- Imports — all libraries used in the example
- Sample data — a small DataFrame that reproduces the issue
- Code — the operation that demonstrates the problem
- Output — expected result vs actual result
Creating Sample DataFrames
From a Dictionary (Most Common)
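A minimal sketch of the dictionary approach. The column names and values here are hypothetical placeholders, not data from any real question:

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column,
# each list holds that column's values.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [25, 30, 35],
    "city": ["NYC", "LA", "NYC"],
})

print(df)
```

Because the whole frame lives in one literal, a reader can paste it and have identical data without needing any file.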
From a String (Copy-Paste Friendly)
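One way to do this, assuming the data is pasted as whitespace-separated text (the table below is made up for illustration), is to wrap the string in io.StringIO and hand it to pd.read_csv:

```python
import io

import pandas as pd

# Hypothetical table pasted as plain text, columns separated by whitespace.
raw = """name   age  city
Alice   25   NYC
Bob     30   LA
Carol   35   NYC
"""

# sep=r"\s+" treats any run of whitespace as a column separator.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

print(df)
```

This only works cleanly when no cell contains spaces. For such data, a comma-separated string with the default separator is the safer choice.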
Using pd.read_clipboard() (For Quick Sharing)
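A rough sketch, assuming a table has already been copied to the system clipboard. The helper name load_from_clipboard is hypothetical, and the call needs a working clipboard, so it will not run on a headless machine:

```python
import pandas as pd


def load_from_clipboard():
    """Parse whatever table is currently on the clipboard (hypothetical helper)."""
    # The default separator is whitespace; pass sep="," for copied CSV text.
    return pd.read_clipboard()


# Usage, after copying a printed DataFrame from a question:
# df = load_from_clipboard()
# print(df)
```

The clipboard is convenient for reading someone else's data, but it is not reproducible on its own, so convert the result to a literal definition before posting it.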
Generating Random Data
Always set np.random.seed() so others get the same random values.
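As a small sketch of seeded random data (the column names and sizes are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

# Fix the seed first so every reader generates the same values.
np.random.seed(42)

df = pd.DataFrame({
    "group": np.random.choice(["a", "b", "c"], size=6),
    "value": np.random.randn(6),
})

print(df)
```

Random data suits questions about performance or shape. When specific values matter, a hand-written DataFrame is usually easier to reason about.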
Sharing Your DataFrame
When asking for help, include the DataFrame definition in a copy-pasteable format:
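One possible way to produce such a definition is df.to_dict(), sketched here on a made-up two-column frame:

```python
import pandas as pd

# Hypothetical frame standing in for the data you want to share.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Print a dict literal that can be pasted straight into a question.
print(df.to_dict(orient="list"))
# {'a': [1, 2, 3], 'b': ['x', 'y', 'z']}

# A reader rebuilds the frame by passing the pasted literal to pd.DataFrame.
rebuilt = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(rebuilt.equals(df))
```

Note that orient="list" does not carry the index. If a custom index is part of the problem, include it explicitly in the definition.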
Including Data Types
If data types matter (they often do), show them:
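A short sketch of how dtypes might be shown, using a hypothetical frame where a numeric-looking column is actually stored as strings:

```python
import pandas as pd

# Hypothetical data: "price" looks numeric but holds strings.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "price": ["10.5", "20.0", "n/a"],
    "when": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

# One dtype per column; paste this output into the question.
print(df.dtypes)

# info() also reports non-null counts and memory usage.
df.info()
```

Printed frames hide this kind of difference: the string "10.5" and the float 10.5 look the same in print(df).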
Show Expected vs Actual Output
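As an illustration, here is a sketch built around one well-known behavior, groupby dropping NaN keys by default. The data and the "expected" result are invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing key.
df = pd.DataFrame({"key": ["a", "b", np.nan, "a"], "val": [1, 2, 3, 4]})

# Actual output: rows whose key is NaN are dropped by default.
actual = df.groupby("key")["val"].sum()
print("Actual:")
print(actual)

# Expected output: written by hand, so readers see exactly what you want.
expected = pd.Series(
    [5, 2, 3],
    index=pd.Index(["a", "b", np.nan], name="key"),
    name="val",
)
print("Expected:")
print(expected)
```

Building the expected result by hand, rather than describing it in words, removes any ambiguity about the shape, index, and dtype you are after.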
Including Version Information
Behavior can differ between pandas versions, so always state the versions you are running in bug reports.
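A minimal sketch for collecting version details. Which libraries to list is a judgment call, and NumPy is included here only as a common example:

```python
import sys

import numpy as np
import pandas as pd

# Short form: usually enough for a Stack Overflow question.
print("python:", sys.version.split()[0])
print("pandas:", pd.__version__)
print("numpy:", np.__version__)

# For bug reports, pd.show_versions() prints a fuller report
# that also covers optional dependencies.
```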
Template for Stack Overflow
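One possible skeleton is sketched below. The data, the operation, and the expected values are all placeholders to be replaced with your own:

```python
# 1. Imports: everything the snippet needs.
import pandas as pd

# 2. Sample data: small and hypothetical, same structure as the real data.
df = pd.DataFrame({"team": ["x", "x", "y"], "score": [1, 2, 3]})

# 3. Code: the single operation that shows the problem.
result = df.groupby("team")["score"].mean()

# 4. Actual output: print it instead of describing it.
print(result)

# 5. Expected output: state it explicitly, for example
#    "one row per team with the mean score: x -> 1.5, y -> 3.0".

# 6. Versions: include for anything that might be version-dependent.
print(pd.__version__)
```

In the question itself, put one or two sentences of context above the code and paste any traceback in full below it.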
Tips for Minimal Data
Keep the DataFrame as small as possible while still reproducing the issue:
Include edge cases that trigger the problem: None/NaN values, duplicate indices, mixed types, or empty groups.
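The trimming and edge-case advice above could look something like this. The frame named big is a hypothetical stand-in for a large real dataset, and the column names are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a wide, long real dataset (100 rows, 50 columns).
big = pd.DataFrame(
    np.arange(5000).reshape(100, 50),
    columns=[f"c{i}" for i in range(50)],
)

# Keep only the columns involved in the issue and a handful of rows.
small = big.loc[:, ["c0", "c1"]].head(5)
print(small.to_dict(orient="list"))

# A tiny frame that deliberately includes edge cases:
# a missing key, a NaN value, and a duplicated index label.
edge = pd.DataFrame(
    {"key": ["a", "a", None], "val": [1.0, np.nan, 3.0]},
    index=[0, 0, 1],
)
print(edge)
```

After trimming, rerun the failing code on the small frame to confirm that the problem still appears before posting.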
Common Pitfalls
- Using real data: Never share proprietary or sensitive data. Create synthetic data that reproduces the same structure and problem.
- Missing imports: Always include import pandas as pd and any other libraries. Do not assume the reader has already imported them.
- Omitting the seed: np.random.seed(42) must appear before any random data generation. Without it, every person gets different data and may not reproduce the issue.
- Too much data: If your DataFrame has 50 columns but the issue involves 2, include only those 2 columns. Strip everything irrelevant.
- Not showing the error: Include the full traceback, not just "it doesn't work." Use print(result) to show the actual output explicitly.
Summary
- Include all imports, sample data, your code, and expected vs actual output
- Create small DataFrames with pd.DataFrame({...}) and keep rows under 10 when possible
- Use np.random.seed() for reproducible random data
- Share data with df.to_dict() for easy copy-paste reconstruction
- Include pd.__version__ for version-sensitive issues
- Strip the example to the minimum data and code that reproduces the problem

