Creating an empty Pandas DataFrame, and then filling it

Introduction

Creating an empty DataFrame is easy in Pandas; filling it efficiently is the part that matters. The best approach depends on whether you are adding a handful of rows interactively or building a large table inside a loop, where repeated row-by-row mutation becomes unnecessarily expensive.

Core Sections

Start with the schema you actually need

If you already know the column names, create the empty DataFrame with those columns up front. That makes later inserts clearer and makes it harder to create a stray column through a typo.

python
import pandas as pd

df = pd.DataFrame(columns=["user_id", "score", "passed"])
print(df)

Output:

text
Empty DataFrame
Columns: [user_id, score, passed]
Index: []

You can also set dtypes early if the downstream code depends on them:

python
import pandas as pd

# One empty, typed Series per column fixes the dtypes before any data arrives.
df = pd.DataFrame({
    "user_id": pd.Series(dtype="int64"),
    "score": pd.Series(dtype="float64"),
    "passed": pd.Series(dtype="bool"),
})

That is useful when an empty DataFrame would otherwise default to object in places you did not intend.
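To make the difference concrete, the following minimal sketch builds the same empty schema both ways and prints the resulting dtypes. The variable names `untyped` and `typed` are illustrative only.

```python
import pandas as pd

columns = ["user_id", "score", "passed"]

# Only column names are given, so every column starts out as object dtype.
untyped = pd.DataFrame(columns=columns)

# One empty, typed Series per column declares the intended dtypes up front.
typed = pd.DataFrame({
    "user_id": pd.Series(dtype="int64"),
    "score": pd.Series(dtype="float64"),
    "passed": pd.Series(dtype="bool"),
})

print(untyped.dtypes)
print(typed.dtypes)
```

Both frames have zero rows; only the declared column types differ.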

For a few rows, .loc is fine

If you are adding only a small number of rows, direct assignment with .loc is readable and perfectly acceptable.

python
import pandas as pd

df = pd.DataFrame(columns=["user_id", "score", "passed"])

df.loc[len(df)] = [101, 87.5, True]
df.loc[len(df)] = [102, 61.0, False]

print(df)

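That works because, on a default 0-based integer index, len(df) is the next unused row label. If the index has gaps or custom labels, len(df) may match an existing label and overwrite that row. The pattern is simple for quick scripts, prototypes, and notebook work.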
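The len(df) trick depends on the index running from 0 to n-1. As a hedged illustration with made-up data, the sketch below shows what can happen after a row is dropped, and one way to restore the assumption with reset_index before adding more rows.

```python
import pandas as pd

df = pd.DataFrame({"user_id": [101, 102, 103], "score": [87.5, 61.0, 92.0]})

# Dropping the first row leaves index labels [1, 2], while len(df) is 2.
df = df.drop(index=0)

# Label 2 already exists, so this overwrites user 103 instead of adding a row.
df.loc[len(df)] = [104, 70.0]
rows_after_overwrite = len(df)

# Restoring a 0..n-1 index makes len(df) the next unused label again.
df = df.reset_index(drop=True)
df.loc[len(df)] = [105, 55.0]

print(df)
```

If the index carries meaning (for example, it holds IDs or timestamps), resetting it is not appropriate, and collecting rows in a list is usually the safer route.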

For many rows, collect first and build once

Row-by-row growth is slow because a DataFrame stores its data column by column. Adding a row can force pandas to allocate new storage, copy the existing data, and rebuild the index, so the total cost grows much faster than the number of rows. For larger inputs, collect rows in a list and create the DataFrame once.

python
import pandas as pd

rows = []

for user_id, score in [(101, 87.5), (102, 61.0), (103, 92.0)]:
    rows.append({
        "user_id": user_id,
        "score": score,
        "passed": score >= 70,
    })

df = pd.DataFrame(rows)
print(df)

This pattern is faster, easier to test, and more natural when the data is coming from an API response, parser, or transformation pipeline.
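One way to see the gap on your own machine is to time both approaches on the same synthetic data. This is a rough sketch, not a benchmark: the helper names `grow_with_loc` and `build_once` are hypothetical, the row count is arbitrary, and the timings depend on the pandas version and hardware.

```python
import time

import pandas as pd


def grow_with_loc(n):
    """Add n rows one at a time with .loc (hypothetical helper)."""
    df = pd.DataFrame(columns=["user_id", "score", "passed"])
    for i in range(n):
        score = float(i % 100)
        df.loc[len(df)] = [i, score, score >= 70]
    return df


def build_once(n):
    """Collect n dictionaries in a list, then build the DataFrame once."""
    rows = []
    for i in range(n):
        score = float(i % 100)
        rows.append({"user_id": i, "score": score, "passed": score >= 70})
    return pd.DataFrame(rows)


n = 500  # small enough to finish quickly; raise it to widen the gap

start = time.perf_counter()
grown = grow_with_loc(n)
loc_seconds = time.perf_counter() - start

start = time.perf_counter()
built = build_once(n)
list_seconds = time.perf_counter() - start

print(f".loc growth: {loc_seconds:.4f} s")
print(f"build once:  {list_seconds:.4f} s")
```

Both functions produce the same rows; only the construction strategy differs.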

Avoid the old append pattern

Older examples on the internet often show df = df.append(row, ignore_index=True). That pattern copied the whole DataFrame on every call; DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. If you need to combine chunks, use pd.concat on a list of DataFrames instead.

python
import pandas as pd

chunk_a = pd.DataFrame([
    {"user_id": 101, "score": 87.5, "passed": True},
])

chunk_b = pd.DataFrame([
    {"user_id": 102, "score": 61.0, "passed": False},
])

df = pd.concat([chunk_a, chunk_b], ignore_index=True)
print(df)

That approach scales better than repeatedly rebuilding the table one row at a time.
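When chunks arrive in a loop, the same idea applies: append each chunk to a plain list and call pd.concat once at the end, rather than concatenating inside the loop. The sketch below assumes a hypothetical `make_chunk` helper that stands in for whatever produces each batch (a file, an API page, a query result).

```python
import pandas as pd


def make_chunk(start_id, scores):
    """Build one small DataFrame; a stand-in for a real batch source (hypothetical)."""
    return pd.DataFrame({
        "user_id": range(start_id, start_id + len(scores)),
        "score": scores,
    })


# Made-up batches: (first user_id, scores in that batch).
batches = [(101, [87.5, 61.0]), (103, [92.0]), (104, [45.5, 78.0])]

chunks = []
for start_id, scores in batches:
    chunk = make_chunk(start_id, scores)
    chunk["passed"] = chunk["score"] >= 70
    chunks.append(chunk)

# A single concat at the end copies each chunk once.
df = pd.concat(chunks, ignore_index=True)
print(df)
```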

Choose the right pattern for your data source

A good rule is:

  • interactive or tiny data: use .loc
  • streaming or loop-generated rows: store dictionaries in a list, then call pd.DataFrame
  • chunked processing: build DataFrames per chunk and combine with pd.concat

If you are reading structured records from JSON, SQL, or CSV, it is often better to skip the empty DataFrame entirely and construct the final DataFrame directly from the source records.
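As a small illustration of skipping the empty DataFrame, the sketch below builds the table directly from a JSON string and from CSV text. The inline payloads are made up for the example; in practice they would come from a file, a response body, or a database query.

```python
import io
import json

import pandas as pd

# JSON records: parse to a list of dicts and hand it straight to the constructor.
json_payload = '[{"user_id": 101, "score": 87.5}, {"user_id": 102, "score": 61.0}]'
from_json = pd.DataFrame(json.loads(json_payload))

# CSV text: read_csv accepts any file-like object, including an in-memory buffer.
csv_text = "user_id,score\n101,87.5\n102,61.0\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Derived columns are computed on the whole column at once, not row by row.
for frame in (from_json, from_csv):
    frame["passed"] = frame["score"] >= 70

print(from_json)
print(from_csv)
```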

Keep an eye on indexes and dtypes

When filling an initially empty DataFrame, two silent issues show up often:

  • unexpected index values
  • columns becoming object because early rows contain mixed types

Pass ignore_index=True when concatenating to get a fresh 0-based index instead of repeated labels from each chunk:

python
df = pd.concat([chunk_a, chunk_b], ignore_index=True)

And cast explicitly when the schema matters:

python
df = df.astype({
    "user_id": "int64",
    "score": "float64",
    "passed": "bool",
})

That is especially useful before exporting to Parquet, writing tests, or handing the DataFrame to code that expects stable types.
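The cast above assumes every value is present. A plain int64 or bool column cannot represent a missing value, so when records may have gaps, one option is the nullable extension dtypes "Int64" and "boolean". The following is a minimal sketch with made-up records, not a recommendation for every schema.

```python
import pandas as pd

rows = [
    {"user_id": 101, "score": 87.5, "passed": True},
    {"user_id": None, "score": None, "passed": None},  # record with missing fields
]
df = pd.DataFrame(rows)

# Nullable dtypes keep missing entries as <NA> instead of failing or guessing.
typed = df.astype({
    "user_id": "Int64",    # nullable integer (capital I)
    "score": "float64",    # float64 already represents missing values as NaN
    "passed": "boolean",   # nullable boolean
})

print(typed.dtypes)
print(typed)
```

Whether downstream code accepts these extension dtypes is worth checking before adopting them.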

Common Pitfalls

  • Growing a DataFrame one row at a time inside a large loop and then wondering why it is slow.
  • Copying df.append examples from old blog posts; the method was removed in pandas 2.0.
  • Forgetting to declare columns up front and accidentally creating misspelled column names.
  • Letting empty-column defaults turn everything into object dtype when numeric or boolean types were intended.
  • Ignoring index behavior during concatenation and ending up with duplicate or unexpected row labels.

Summary

  • An empty DataFrame is easy to create, but the fill strategy should match the data volume.
  • Use .loc for small, simple additions and list accumulation for larger workloads.
  • Prefer building once or concatenating chunks instead of repeated append-like operations.
  • Define columns and dtypes early when schema correctness matters.
  • Watch indexes and types as the table grows so downstream code stays predictable.
