Count the Number of Columns in a PySpark DataFrame

Introduction

Counting columns in a PySpark DataFrame is simple once you know where Spark stores schema information. Unlike row counts, column counts do not require a distributed computation, because the schema is already available on the driver.

Use len(df.columns) for Top-Level Columns

The most common answer is also the simplest: use Python's len on the DataFrame's columns list.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Ada", 36), (2, "Linus", 31)],
    ["id", "name", "age"],
)

print(df.columns)       # ['id', 'name', 'age']
print(len(df.columns))  # 3

df.columns is just a Python list of top-level column names, so len(df.columns) is fast and clear.

len(df.schema) Works Too

Spark also stores the schema as a StructType, and its length gives the same top-level column count.

python
print(df.schema)
print(len(df.schema))  # 3

This is useful when you are already working with schema objects and want to inspect field types at the same time.

Understand Top-Level Versus Nested Fields

One subtle point is that both len(df.columns) and len(df.schema) count only top-level columns. If a column contains a nested struct, it still counts as one column.

python
from pyspark.sql import Row

df_nested = spark.createDataFrame([
    Row(id=1, person=Row(name="Ada", age=36))
])

print(df_nested.columns)       # ['id', 'person']
print(len(df_nested.columns))  # 2
df_nested.printSchema()

If your real question is "how many nested fields exist in the schema," you need a separate recursive schema walk. That is a different problem from counting DataFrame columns.

Why This Is Cheap

Developers sometimes worry that reading schema metadata will be expensive on a large cluster. For column count, it is not. Spark does not scan the data to answer this question; it reads the already-known schema definition.

That means this is cheap:

python
num_cols = len(df.columns)
print(f"top-level columns: {num_cols}")

By contrast, df.count() is an action: it launches a distributed job to count rows, which answers a different question at a much higher cost.

This distinction matters in large ETL jobs. Many performance problems come from accidentally triggering a row action when all you really needed was schema metadata.

Practical Uses in ETL Code

Column counts are often useful for defensive checks in pipelines:

python
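expected_columns = 3
actual_columns = len(df.columns)
if actual_columns != expected_columns:
    # Report both numbers so the failure is easy to diagnose from logs.
    raise ValueError(
        f"Unexpected schema width: expected {expected_columns}, got {actual_columns}"
    )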

You can also compare schemas before and after transformations:

python
# Bracket access avoids any clash between a column name and a DataFrame attribute.
df2 = df.withColumn("country_code", df["name"].substr(1, 2))

print(len(df.columns))   # 3
print(len(df2.columns))  # 4

This kind of check is helpful when joins, selects, or schema evolution may quietly add or remove fields.
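
A count alone cannot tell you which fields changed, and it misses the case where one column is dropped while another is added. The sketch below compares two column lists by name; diff_columns is a hypothetical helper, and the literal lists stand in for df.columns before and after a transformation.

```python
def diff_columns(before, after):
    """Return (added, removed) column names, preserving each list's order."""
    before_set, after_set = set(before), set(after)
    added = [c for c in after if c not in before_set]
    removed = [c for c in before if c not in after_set]
    return added, removed


# Stand-ins for df.columns before and after a transformation.
before = ["id", "name", "age"]
after = ["id", "name", "country_code"]

added, removed = diff_columns(before, after)
print(len(before) == len(after))  # True: the width is unchanged
print(added)                      # ['country_code']
print(removed)                    # ['age']
```

This comparison is case-sensitive, whereas Spark resolves column names case-insensitively by default, so you may want to normalize case before comparing.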

In production pipelines, column counts are often paired with schema snapshots or unit tests so a data contract change fails early instead of propagating silently.

That is one reason column counting is still useful even though it is simple. It serves as a lightweight guardrail around schema drift.

Common Pitfalls

  • Confusing column count with row count and calling df.count() by mistake.
  • Assuming nested struct fields are counted individually by len(df.columns).
  • Forgetting that select and join operations may change schema width.
  • Printing a schema object and assuming nested output means more top-level columns.
  • Using a column count as a proxy for schema quality without checking names and types too.

Summary

  • Use len(df.columns) for the top-level number of columns in a PySpark DataFrame.
  • len(df.schema) gives the same top-level answer because the schema stores the fields.
  • Column counting is cheap because it uses metadata, not a distributed scan.
  • Nested struct fields still count as one top-level column unless you inspect them recursively.
  • Combine column-count checks with schema-name and type validation in real ETL code.
