Import multiple CSV files into pandas and concatenate into one DataFrame

Introduction

Combining many CSV files into one pandas DataFrame is a common preprocessing step in reporting and analysis workflows. A reliable pattern is to collect the file paths, read each file with consistent options, and concatenate the resulting DataFrames in a single pd.concat() call.

Read Files with Path.glob() and pd.read_csv()

If all CSV files live in one directory, pathlib and pandas work well together:

python
from pathlib import Path
import pandas as pd

folder = Path("data")
files = sorted(folder.glob("*.csv"))

frames = [pd.read_csv(file) for file in files]
combined = pd.concat(frames, ignore_index=True)

print(combined.head())

This pattern is simple and easy to review:

  • glob("*.csv") finds the files
  • sorted(...) makes the processing order deterministic
  • pd.read_csv(...) turns each file into a DataFrame
  • pd.concat(...) stacks them vertically

ignore_index=True is usually the right choice because read_csv() gives each frame its own default index starting at 0. Without it, the combined frame repeats those index values once per file; with it, the result gets a fresh index running from 0 to n-1.
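
The effect is easy to see on a small case. The following is a minimal sketch that uses two hand-built frames (hypothetical data) in place of real CSV files:

```python
import pandas as pd

# Two small frames standing in for two CSV files; each has its own 0-based index.
first = pd.DataFrame({"sales": [10, 20]})
second = pd.DataFrame({"sales": [30, 40]})

kept = pd.concat([first, second])                      # original indexes are kept
reset = pd.concat([first, second], ignore_index=True)  # a fresh index is assigned

print(kept.index.tolist())   # [0, 1, 0, 1]
print(reset.index.tolist())  # [0, 1, 2, 3]
```

With duplicate labels, a lookup such as kept.loc[0] returns two rows, which is rarely what downstream code expects.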

Preserve the Source File Name

In real projects, it is often useful to know which row came from which file. Add that information before concatenating:

python
from pathlib import Path
import pandas as pd

folder = Path("data")
frames = []

for file in sorted(folder.glob("*.csv")):
    frame = pd.read_csv(file)
    frame["source_file"] = file.name
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)
print(combined[["source_file"]].drop_duplicates())

This makes debugging much easier when one file has malformed rows or unexpected column values.

Handle Schema Differences Explicitly

Concatenation works best when the CSV files share the same columns. If one file has extra columns or missing columns, pandas aligns by column name and fills missing values with NaN.
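
As a small illustration of that alignment, the sketch below reads two in-memory CSV snippets (hypothetical data and column names) where only the second one has a region column:

```python
import io
import pandas as pd

# In-memory stand-ins for two CSV files with different columns.
jan = io.StringIO("date,sales\n2024-01-01,100\n")
feb = io.StringIO("date,sales,region\n2024-02-01,150,EU\n")

combined = pd.concat([pd.read_csv(jan), pd.read_csv(feb)], ignore_index=True)

print(combined.columns.tolist())           # ['date', 'sales', 'region']
print(combined["region"].isna().tolist())  # [True, False]
```

No error or warning is raised here: the row from the first file simply has NaN in region.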

That behavior is helpful, but it can also hide data quality problems. It is often worth validating the columns before concatenating:

python
from pathlib import Path
import pandas as pd

folder = Path("data")
expected_columns = None
frames = []

for file in sorted(folder.glob("*.csv")):
    frame = pd.read_csv(file)
    columns = tuple(frame.columns)

    if expected_columns is None:
        # The first file (in sorted order) defines the expected schema.
        expected_columns = columns
    elif columns != expected_columns:
        # Tuple comparison is order-sensitive: reordered columns also fail.
        raise ValueError(f"Unexpected columns in {file.name}: {columns}")

    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)

This is a better failure mode than silently merging inconsistent data and discovering the issue much later.

Watch Memory Use on Large Imports

Reading dozens of large CSV files into memory at once can become expensive. If the dataset is large, a few strategies help:

  • read only required columns with usecols
  • specify dtype where possible to skip type inference and, with narrower types, reduce memory use
  • process files in chunks if you do not need one huge in-memory frame

Example with selected columns:

python
frame = pd.read_csv(file, usecols=["date", "sales", "region"])

If the final output must be a single DataFrame, you still need enough memory to hold it. At that point, the main optimization is to avoid unnecessary columns and unnecessary intermediate copies.
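
When a single in-memory frame is not required, chunked reading can be sketched as follows. This is an illustrative example only: the data, the column names, and the per-region total are assumptions, and the tiny chunksize is chosen just to force several chunks.

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV file (hypothetical data and column names).
csv_text = "region,sales\nEU,10\nUS,20\nEU,30\nUS,40\nEU,50\n"

totals = {}
# With chunksize set, read_csv returns an iterator of DataFrames
# holding at most that many rows each.
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["region", "sales"],
    dtype={"sales": "int32"},
    chunksize=2,
)
for chunk in reader:
    # Aggregate each chunk, then fold the partial sums into the running totals.
    for region, amount in chunk.groupby("region")["sales"].sum().items():
        totals[region] = totals.get(region, 0) + int(amount)

print(totals)  # {'EU': 90, 'US': 60}
```

Only one chunk and the small totals dictionary are held at a time, so peak memory depends on chunksize rather than on the file size.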

Common Pitfalls

A common mistake is assuming every CSV has the same schema. Pandas will align columns by name, which is convenient, but it can mask missing or renamed columns: a renamed column shows up as two half-empty columns instead of an error.

Another issue is forgetting file encoding. If some files are UTF-8 and others are not, read_csv() may fail or produce corrupted text unless you pass the correct encoding.
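
One way to cope with mixed encodings is to try a short list of candidates in order. The helper below is a hypothetical sketch, not a pandas API; the function name, the candidate encodings, and the file name legacy.csv are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: try each encoding in order until one decodes."""
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file; try the next one
    raise ValueError(f"Could not decode {path} with any of {encodings}")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "legacy.csv"
    # "Jos\u00e9" is "José"; encoded as cp1252 it is not valid UTF-8.
    path.write_bytes("name\nJos\u00e9\n".encode("cp1252"))
    frame = read_csv_with_fallback(path)

print(frame["name"].tolist())
```

Order matters: single-byte encodings such as cp1252 accept almost any byte sequence, so they should come last. A successful decode does not prove the encoding was the right one, so when the true encoding is known, passing it explicitly is safer.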

Empty file lists are another practical problem. If glob("*.csv") matches nothing, pd.concat([]) raises ValueError ("No objects to concatenate"), which says nothing about the missing files. It is better to handle that case explicitly in production code.
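
A minimal guard might look like the sketch below. The load_folder helper and its error message are hypothetical, and raising FileNotFoundError is only one reasonable choice.

```python
import tempfile
from pathlib import Path
import pandas as pd

def load_folder(folder):
    """Hypothetical helper: concatenate every CSV in folder, or fail clearly."""
    files = sorted(Path(folder).glob("*.csv"))
    if not files:
        raise FileNotFoundError(f"No CSV files found in {folder}")
    return pd.concat([pd.read_csv(file) for file in files], ignore_index=True)

with tempfile.TemporaryDirectory() as tmp:
    try:
        load_folder(tmp)  # the directory is still empty here
    except FileNotFoundError as exc:
        message = str(exc)

    # Create two tiny files and load them.
    (Path(tmp) / "a.csv").write_text("x\n1\n")
    (Path(tmp) / "b.csv").write_text("x\n2\n")
    combined = load_folder(tmp)

print(message)
print(combined["x"].tolist())  # [1, 2]
```

Depending on the pipeline, returning an empty DataFrame with the expected columns may be preferable to raising.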

Finally, avoid concatenating inside the loop on every iteration. Build a list of frames and call pd.concat() once. Each call to pd.concat() copies all the rows accumulated so far into a new object, so repeated concatenation does far more copying than a single call as the number of files grows.

Summary

  • Use Path.glob() to gather CSV files and pd.read_csv() to load each one.
  • Concatenate once with pd.concat(frames, ignore_index=True).
  • Add a source_file column when traceability matters.
  • Validate columns early so schema mismatches fail fast.
  • Be deliberate about memory by selecting only the columns and types you need.
