Import multiple CSV files into pandas and concatenate into one DataFrame
Introduction
Combining many CSV files into one pandas DataFrame is a common preprocessing step in reporting and analysis workflows. The safest pattern is to collect the file paths, read each file with consistent options, and concatenate the resulting DataFrames in one call.
Read Files with Path.glob() and pd.read_csv()
If all CSV files live in one directory, pathlib and pandas work well together:
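A minimal sketch of this approach, assuming the files share the same header row. The function name load_csv_dir and the directory passed to it are illustrative, not part of pandas:

```python
from pathlib import Path

import pandas as pd


def load_csv_dir(data_dir):
    """Read every *.csv file in data_dir and stack the rows into one DataFrame.

    load_csv_dir is a hypothetical helper name used for illustration.
    """
    # sorted() makes the processing order deterministic across runs.
    csv_paths = sorted(Path(data_dir).glob("*.csv"))
    # Read each file with the same options, then concatenate once.
    frames = [pd.read_csv(path) for path in csv_paths]
    # ignore_index=True replaces the per-file indexes with one fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

A call such as load_csv_dir("data") would then return the combined frame, where "data" stands in for whatever directory holds the files.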
This pattern is simple and easy to review:
- glob("*.csv") finds the files
- sorted(...) makes the processing order deterministic
- pd.read_csv(...) turns each file into a DataFrame
- pd.concat(...) stacks them vertically
ignore_index=True is usually the right choice because each source file often starts its row index at 0. Resetting the index avoids duplicate index values in the final result.
Preserve the Source File Name
In real projects, it is often useful to know which row came from which file. Add that information before concatenating:
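One way to do this, sketched under the assumption that the file name alone (without its directory) is enough to identify the source. The helper name load_csv_dir_with_source is hypothetical; source_file is the column name used in this article's summary:

```python
from pathlib import Path

import pandas as pd


def load_csv_dir_with_source(data_dir):
    """Stack all *.csv files in data_dir, tagging each row with its file name.

    load_csv_dir_with_source is a hypothetical helper name.
    """
    frames = []
    for path in sorted(Path(data_dir).glob("*.csv")):
        frame = pd.read_csv(path)
        # path.name is the file name without the directory, e.g. "january.csv".
        frame["source_file"] = path.name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

If files in different directories can share a name, storing str(path) instead of path.name would keep the rows distinguishable.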
This makes debugging much easier when one file has malformed rows or unexpected column values.
Handle Schema Differences Explicitly
Concatenation works best when the CSV files share the same columns. If one file has extra columns or missing columns, pandas aligns by column name and fills missing values with NaN.
That behavior is helpful, but it can also hide data quality problems. It is often worth validating the columns before concatenating:
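A possible validation step, assuming the expected column names are known up front and that column order inside each file does not matter. The function name and the error message format are illustrative:

```python
from pathlib import Path

import pandas as pd


def load_csv_dir_strict(data_dir, expected_columns):
    """Stack all *.csv files, raising if any file's columns differ from expected.

    load_csv_dir_strict is a hypothetical helper name.
    """
    expected = list(expected_columns)
    frames = []
    for path in sorted(Path(data_dir).glob("*.csv")):
        frame = pd.read_csv(path)
        actual = set(frame.columns)
        if actual != set(expected):
            missing = sorted(set(expected) - actual)
            extra = sorted(actual - set(expected))
            raise ValueError(
                f"{path.name}: missing columns {missing}, unexpected columns {extra}"
            )
        # Reorder so every frame has the same column order before stacking.
        frames.append(frame[expected])
    return pd.concat(frames, ignore_index=True)
```

Naming the offending file in the error keeps the failure easy to trace back to its source.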
This is a better failure mode than silently merging inconsistent data and discovering the issue much later.
Watch Memory Use on Large Imports
Reading dozens of large CSV files into memory at once can become expensive. If the dataset is large, a few strategies help:
- read only required columns with usecols
- specify dtype where possible to reduce inference overhead
- process files in chunks if you do not need one huge in-memory frame
Example with selected columns:
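A short sketch that passes usecols and dtype to every read_csv() call. The function name is hypothetical, and the column names in the usage note are placeholders for whatever the real files contain:

```python
from pathlib import Path

import pandas as pd


def load_selected_columns(data_dir, columns, dtypes):
    """Stack all *.csv files, keeping only `columns` and applying `dtypes`.

    load_selected_columns is a hypothetical helper name.
    """
    frames = [
        # usecols skips unneeded columns at parse time; dtype avoids type inference.
        pd.read_csv(path, usecols=columns, dtype=dtypes)
        for path in sorted(Path(data_dir).glob("*.csv"))
    ]
    return pd.concat(frames, ignore_index=True)
```

For instance, load_selected_columns("data", ["order_id", "amount"], {"order_id": "int64", "amount": "float64"}) would load just those two assumed columns. With a list passed to usecols, the resulting columns follow the order in the file, not the order of the list.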
If the final output must be a single DataFrame, you still need enough memory to hold it. At that point, the main optimization is to avoid unnecessary columns and unnecessary intermediate copies.
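When a single in-memory frame is not required, the chunked strategy from the list above can be sketched as follows. This example assumes the goal is a simple aggregate (the sum of one numeric column); the function name and the default chunk size are illustrative choices:

```python
from pathlib import Path

import pandas as pd


def total_by_chunks(data_dir, column, chunksize=100_000):
    """Sum one numeric column across all *.csv files without loading them whole.

    total_by_chunks is a hypothetical helper name.
    """
    total = 0.0
    for path in sorted(Path(data_dir).glob("*.csv")):
        # With chunksize set, read_csv yields DataFrames of at most that many rows.
        for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
            total += float(chunk[column].sum())
    return total
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize and not on the total size of the files.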
Common Pitfalls
The most common mistake is assuming every CSV has the same schema. Pandas will align columns by name, which is convenient, but it can mask missing or renamed columns.
Another issue is forgetting file encoding. If some files are UTF-8 and others are not, read_csv() may fail or produce corrupted text unless you pass the correct encoding.
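If the encoding of each file is known, passing it directly as encoding=... is the most reliable option. When it is not known, one possible workaround is to try a short list of candidates in order. The sketch below is that fallback approach, not a pandas feature; the function name and the default candidate list are assumptions:

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order and return the first successful read.

    read_csv_with_fallback is a hypothetical helper name.
    """
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError:
            # This encoding cannot decode the file; try the next candidate.
            continue
    raise ValueError(f"Could not decode {path} with any of {list(encodings)}")
```

Order matters here: single-byte encodings such as cp1252 accept most byte sequences, so they belong at the end of the list. A successful read under a fallback encoding can still produce wrong characters if the guess is wrong.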
Empty file lists are another practical problem. If glob("*.csv") matches nothing, pd.concat([]) raises ValueError ("No objects to concatenate"). It is better to handle that case explicitly in production code.
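A minimal guard for the empty case, assuming that a missing input should stop the job with a clear message. The function name and the choice of FileNotFoundError are illustrative; returning an empty DataFrame is another reasonable policy:

```python
from pathlib import Path

import pandas as pd


def load_csv_dir_checked(data_dir):
    """Stack all *.csv files in data_dir, failing clearly if none are found.

    load_csv_dir_checked is a hypothetical helper name.
    """
    csv_paths = sorted(Path(data_dir).glob("*.csv"))
    if not csv_paths:
        # Fail before pd.concat() so the message names the directory searched.
        raise FileNotFoundError(f"No CSV files found in {data_dir}")
    frames = [pd.read_csv(path) for path in csv_paths]
    return pd.concat(frames, ignore_index=True)
```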
Finally, avoid concatenating inside the loop on every iteration. Build a list of frames and call pd.concat() once. Repeated concatenation is slower because each call copies all the rows accumulated so far into a new DataFrame, so the total copying grows roughly quadratically with the number of files.
Summary
- Use Path.glob() to gather CSV files and pd.read_csv() to load each one.
- Concatenate once with pd.concat(frames, ignore_index=True).
- Add a source_file column when traceability matters.
- Validate columns early so schema mismatches fail fast.
- Be deliberate about memory by selecting only the columns and types you need.

