pandas
dataframes
concatenate
Python
data manipulation

Concatenate a list of pandas dataframes together

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Combining multiple DataFrames into one is one of the most frequent operations in data analysis with pandas. Whether you are merging monthly reports, stacking results from parallel computations, or assembling data loaded from multiple CSV files, pd.concat() is the tool you reach for. Understanding its parameters -- especially axis, join, and ignore_index -- will help you avoid subtle bugs with misaligned indices and missing columns.

Basic Concatenation with pd.concat

The simplest case is stacking DataFrames that share the same columns vertically (row-wise):

python
1import pandas as pd
2
3df1 = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 85]})
4df2 = pd.DataFrame({"name": ["Carol", "Dave"], "score": [88, 92]})
5df3 = pd.DataFrame({"name": ["Eve"], "score": [95]})
6
7result = pd.concat([df1, df2, df3])
8print(result)

Output:

 
1    name  score
20  Alice     90
31    Bob     85
40  Carol     88
51   Dave     92
60    Eve     95

Notice that the original indices (0, 1, 0, 1, 0) are preserved. This is often not what you want.

Resetting the Index with ignore_index

When stacking rows, duplicate index values are confusing and can cause bugs in downstream code that assumes unique indices. Use ignore_index=True to generate a fresh sequential index:

python
result = pd.concat([df1, df2, df3], ignore_index=True)
print(result)

Output:

 
1    name  score
20  Alice     90
31    Bob     85
42  Carol     88
53   Dave     92
64    Eve     95

This is the most common pattern and should be your default when concatenating rows.

Row-wise vs Column-wise Concatenation

The axis parameter controls the direction. The default axis=0 stacks rows. Setting axis=1 concatenates columns side by side:

python
1df_names = pd.DataFrame({"name": ["Alice", "Bob", "Carol"]})
2df_scores = pd.DataFrame({"score": [90, 85, 88]})
3
4result = pd.concat([df_names, df_scores], axis=1)
5print(result)

Output:

 
1    name  score
20  Alice     90
31    Bob     85
42  Carol     88

Column-wise concatenation aligns on the row index, so both DataFrames must share the same index values for the result to make sense.

Handling Different Columns with join

When DataFrames have different columns, pd.concat uses an outer join by default, filling missing values with NaN:

python
1df1 = pd.DataFrame({"name": ["Alice"], "score": [90]})
2df2 = pd.DataFrame({"name": ["Bob"], "grade": ["A"]})
3
4# Outer join (default) -- keeps all columns
5outer = pd.concat([df1, df2], ignore_index=True)
6print(outer)

Output:

 
    name  score grade
0  Alice   90.0   NaN
1    Bob    NaN     A

If you only want the columns that exist in every DataFrame, use join='inner':

python
inner = pd.concat([df1, df2], ignore_index=True, join="inner")
print(inner)

Output:

 
    name
0  Alice
1    Bob

Choose inner when you need a clean result with no missing values, and outer when you want to preserve all data even if some columns are absent.

Using the keys Parameter for Hierarchical Indexing

The keys parameter creates a MultiIndex that identifies which original DataFrame each row came from. This is useful when you need to trace data back to its source:

python
1df_jan = pd.DataFrame({"revenue": [100, 200]})
2df_feb = pd.DataFrame({"revenue": [150, 250]})
3
4result = pd.concat([df_jan, df_feb], keys=["January", "February"])
5print(result)

Output:

 
1                revenue
2January  0      100
3         1      200
4February 0      150
5         1      250

You can then select a specific group with result.loc["January"].

Performance: Concat Once vs Append in a Loop

A critical performance lesson: never grow a DataFrame row by row inside a loop. Each append creates a full copy of the data, making the operation O(n^2) overall.

python
1# BAD -- quadratic time complexity
2result = pd.DataFrame()
3for filepath in file_list:
4    df = pd.read_csv(filepath)
5    result = pd.concat([result, df])  # copies all previous data each time
6
7# GOOD -- collect first, concat once
8frames = [pd.read_csv(filepath) for filepath in file_list]
9result = pd.concat(frames, ignore_index=True)

The correct pattern is to collect all DataFrames into a list first and then call pd.concat once. This runs in linear time because pandas allocates the output array once and copies each input DataFrame into it.

Comparison with merge and join

While pd.concat stacks DataFrames along an axis, pd.merge and DataFrame.join combine DataFrames based on shared column values or indices, similar to SQL joins:

python
1# concat -- stacks rows or columns positionally
2pd.concat([df1, df2], ignore_index=True)
3
4# merge -- combines on shared column values (like SQL JOIN)
5pd.merge(df1, df2, on="user_id", how="inner")
6
7# join -- merges on index by default
8df1.join(df2, how="left")

Use pd.concat when you are combining DataFrames that represent the same kind of data (same columns, different rows). Use merge or join when you are combining DataFrames that represent related but different data and need to match rows by a key.

Common Pitfalls

  • Appending in a loop instead of collecting and concatenating once makes your code orders of magnitude slower on large datasets.
  • Forgetting ignore_index=True leaves duplicate index values that cause confusing results with .loc and .iloc.
  • Assuming column order is preserved across DataFrames with different column sets. Always check .columns after concatenation.
  • Concatenating DataFrames with mismatched dtypes in the same column silently upcasts (int to float when NaN is introduced). Inspect dtypes with result.dtypes after concatenation.
  • Using axis=1 with misaligned indices produces a DataFrame full of NaN values because pandas aligns on the index, not on position.

Summary

  • Use pd.concat(list_of_dfs, ignore_index=True) as the standard pattern for stacking DataFrames row-wise.
  • Set axis=1 for column-wise concatenation, and ensure indices align.
  • Control missing-column behavior with join='inner' or join='outer' (default).
  • Use the keys parameter to create a MultiIndex that tracks which source DataFrame each row came from.
  • Always collect DataFrames into a list and call pd.concat once instead of appending inside a loop.
  • Use pd.merge or DataFrame.join when you need to combine DataFrames by matching on key columns rather than stacking.

Course illustration
Course illustration

All Rights Reserved.