How to change dataframe column names in PySpark?

Introduction

Renaming columns is a common PySpark task when standardizing messy source schemas for analytics and machine learning pipelines. The right technique depends on whether you are renaming a few columns, normalizing an entire schema, or combining rename with transformations. A robust approach includes collision checks and schema validation so downstream jobs do not break unexpectedly.

Rename a Few Columns With `withColumnRenamed`

For small targeted changes, withColumnRenamed is explicit and readable.

python

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 88.5), (2, "Bob", 92.0)],
    ["id", "student name", "final score"],
)

renamed = (
    df.withColumnRenamed("student name", "student_name")
      .withColumnRenamed("final score", "final_score")
)

renamed.show()
renamed.printSchema()

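This style is ideal for one to three columns and quick refactors. Note that withColumnRenamed is a no-op when the old name is not in the schema, so a typo in the source name goes unnoticed unless you check the result.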
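When more than a couple of columns are involved, the chained calls can be driven from a dictionary instead. The following is a minimal sketch of that idea; `rename_columns` is a hypothetical helper name, not a PySpark API, and the missing-column check is an added assumption about what a pipeline would want.

```python
from functools import reduce
from typing import TYPE_CHECKING, Mapping

if TYPE_CHECKING:  # imported for type hints only, not needed at runtime
    from pyspark.sql import DataFrame


def rename_columns(df: "DataFrame", mapping: Mapping[str, str]) -> "DataFrame":
    """Apply one withColumnRenamed call per entry of an old -> new mapping.

    Hypothetical helper: raises if a source column is missing, because
    withColumnRenamed itself would leave the DataFrame unchanged.
    """
    missing = [old for old in mapping if old not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Each step returns a new DataFrame; reduce threads it through the mapping.
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        mapping.items(),
        df,
    )
```

With the sample DataFrame, `rename_columns(df, {"student name": "student_name", "final score": "final_score"})` should produce the same result as the chained version. Spark 3.4 and later also provide `DataFrame.withColumnsRenamed`, which accepts such a dictionary directly.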

Normalize Whole Schema With `toDF`

If most columns need normalization, generate the full output name list and use toDF.

python

import re


def normalize(name: str) -> str:
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9_]+", "_", name)
    name = re.sub(r"_+", "_", name)
    return name.strip("_")


new_cols = [normalize(c) for c in df.columns]
normalized = df.toDF(*new_cols)
normalized.printSchema()

This keeps the transformation concise and easy to apply across many data sources. toDF expects exactly one name per existing column, in the original column order.
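Before calling toDF, it can help to look at the old-to-new mapping and to reject names that normalize to nothing (a column called `%%%`, for instance, becomes an empty string under the rule above). A small sketch, assuming the same `normalize` function; `rename_plan` is a hypothetical helper introduced here for illustration.

```python
import re


def normalize(name: str) -> str:
    """Same rule as the article's normalize: lowercase, underscores only."""
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9_]+", "_", name)
    name = re.sub(r"_+", "_", name)
    return name.strip("_")


def rename_plan(columns: list[str]) -> list[tuple[str, str]]:
    """Pair each source name with its normalized name, rejecting empty results."""
    plan = [(c, normalize(c)) for c in columns]
    empty = [old for old, new in plan if not new]
    if empty:
        raise ValueError(f"columns normalize to an empty name: {empty}")
    return plan


# Preview the mapping for some sample source names.
plan = rename_plan(["id", "student name", " Final Score (%) "])
for old, new in plan:
    print(f"{old!r} -> {new!r}")
```

With a real DataFrame the plan would come from `rename_plan(df.columns)`, and the rename itself would be `df.toDF(*[new for _, new in plan])`.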

Detect Name Collisions Before Applying

Normalization can map different source names to the same output, for example A-B and A B both becoming a_b. Add collision checks before rename.

python

from collections import Counter


def assert_unique(names: list[str]) -> None:
    counts = Counter(names)
    duplicates = [n for n, c in counts.items() if c > 1]
    if duplicates:
        raise ValueError(f"duplicate names after normalization: {duplicates}")


candidate_cols = [normalize(c) for c in df.columns]
assert_unique(candidate_cols)
safe_df = df.toDF(*candidate_cols)

Failing fast here prevents duplicate column names, which otherwise surface later as ambiguous-reference errors in joins and selects.
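The duplicate check reports the colliding output name, but not which source columns produced it. A variant that groups source names by their normalized form makes the error easier to act on. This is a sketch under the same normalization rule; `find_collisions` is a hypothetical name.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Same rule as the article's normalize: lowercase, underscores only."""
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9_]+", "_", name)
    name = re.sub(r"_+", "_", name)
    return name.strip("_")


def find_collisions(columns: list[str]) -> dict[str, list[str]]:
    """Group source names by normalized name; keep groups with 2+ members."""
    groups: dict[str, list[str]] = defaultdict(list)
    for col in columns:
        groups[normalize(col)].append(col)
    return {new: olds for new, olds in groups.items() if len(olds) > 1}


# The article's example: "A-B" and "A B" both normalize to "a_b".
collisions = find_collisions(["A-B", "A B", "id"])
if collisions:
    print(f"colliding source columns: {collisions}")
```

In a pipeline, `find_collisions(df.columns)` could run before the rename and raise when the result is non-empty.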

Rename While Transforming With `select` Aliases

When you also cast data types or create derived fields, select with aliases is often the cleanest single-pass pattern.

python

from pyspark.sql import functions as F

curated = df.select(
    F.col("id").cast("long").alias("student_id"),
    F.col("student name").alias("student_name"),
    F.col("final score").cast("double").alias("final_score"),
    F.current_timestamp().alias("processed_at"),
)

curated.show(truncate=False)
curated.printSchema()

This is especially useful in silver- and gold-layer jobs, where the output schema is contract-driven.
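When the output schema is a contract, the rename-and-cast list can live in data rather than in hand-written select calls. The sketch below builds Spark SQL expression strings for `DataFrame.selectExpr` from such a list. The `SPEC` contents, `quote`, and `build_select_exprs` are illustrative assumptions, and `bigint` is used as the SQL name for the long type.

```python
from typing import Optional


def quote(name: str) -> str:
    """Backtick-quote an identifier for Spark SQL, escaping embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_select_exprs(spec: list[tuple[str, str, Optional[str]]]) -> list[str]:
    """Turn (source, target, cast_type) rows into selectExpr strings."""
    exprs = []
    for source, target, cast_type in spec:
        col = quote(source)
        if cast_type is not None:
            col = f"CAST({col} AS {cast_type})"
        exprs.append(f"{col} AS {quote(target)}")
    return exprs


# Hypothetical contract for the sample DataFrame used in this article.
SPEC = [
    ("id", "student_id", "bigint"),
    ("student name", "student_name", None),
    ("final score", "final_score", "double"),
]
exprs = build_select_exprs(SPEC)
for expr in exprs:
    print(expr)
```

With a real DataFrame, `curated = df.selectExpr(*exprs)` would apply the renames and casts in one pass. Derived columns such as `processed_at` would still need to be added separately.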

Nested Field Renaming Considerations

Nested struct fields cannot be renamed with the same simple APIs as top-level columns. You usually rebuild the struct with aliased children.

python

from pyspark.sql import functions as F

nested = spark.createDataFrame(
    [(1, ("Alice", 20))],
    "id INT, profile STRUCT<name: STRING, age: INT>",
)

rebuilt = nested.select(
    "id",
    F.struct(
        F.col("profile.name").alias("full_name"),
        F.col("profile.age").alias("age"),
    ).alias("profile"),
)

rebuilt.printSchema()

For complex nested schemas, keep expected schema snapshots in tests.
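One way to keep such a snapshot is to flatten the schema into dotted field paths and compare them with an expected list. The sketch below works on the dictionary returned by `StructType.jsonValue()`; `field_paths` is a hypothetical helper, and the sample dictionary is a hand-written stand-in for `rebuilt.schema.jsonValue()` so the example runs without a Spark session.

```python
def field_paths(schema_json: dict, prefix: str = "") -> list[str]:
    """List dotted field paths from a StructType.jsonValue() dictionary.

    Only struct nesting is followed; arrays and maps are treated as leaves.
    """
    paths = []
    for field in schema_json["fields"]:
        path = f"{prefix}{field['name']}"
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            paths.extend(field_paths(ftype, prefix=f"{path}."))
        else:
            paths.append(path)
    return paths


# Stand-in for rebuilt.schema.jsonValue() (assumed shape, written by hand).
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": True, "metadata": {}},
        {
            "name": "profile",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "full_name", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "age", "type": "integer", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

EXPECTED_PATHS = ["id", "profile.full_name", "profile.age"]
assert field_paths(sample_schema) == EXPECTED_PATHS
```

In a test suite, the expected list would be stored next to the job and compared against `field_paths(rebuilt.schema.jsonValue())`.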

Operational Best Practices

Adopt one naming policy for all ingestion jobs, for example lowercase with underscores, and keep it in a shared utility module.

Recommended safeguards:

schema assertions after rename
collision detection in CI
versioned contracts for major tables
clear migration notes when names change

Simple assertion example:

python

expected = {"student_id", "student_name", "final_score"}
assert expected.issubset(set(curated.columns))

These checks catch accidental schema drift early.
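The set-based assertion confirms that the names exist but not that the types survived the rename. A slightly stricter sketch compares `(name, type)` pairs, in the form returned by `DataFrame.dtypes`, against an expected mapping. `schema_problems` and `EXPECTED_TYPES` are hypothetical names, and extra columns are deliberately tolerated, matching the subset check above.

```python
def schema_problems(
    actual: list[tuple[str, str]], expected: dict[str, str]
) -> list[str]:
    """Compare (name, type) pairs against a contract; return readable problems."""
    actual_map = dict(actual)
    problems = []
    for name, dtype in expected.items():
        if name not in actual_map:
            problems.append(f"missing column: {name}")
        elif actual_map[name] != dtype:
            problems.append(f"{name}: expected {dtype}, got {actual_map[name]}")
    return problems


# Hypothetical contract; type names follow the strings used by DataFrame.dtypes.
EXPECTED_TYPES = {
    "student_id": "bigint",
    "student_name": "string",
    "final_score": "double",
}

# Stand-in for curated.dtypes so the sketch runs without a Spark session.
sample_dtypes = [
    ("student_id", "bigint"),
    ("student_name", "string"),
    ("final_score", "double"),
    ("processed_at", "timestamp"),
]
assert schema_problems(sample_dtypes, EXPECTED_TYPES) == []
```

In a job, the call would be `schema_problems(curated.dtypes, EXPECTED_TYPES)`, failing the run when the returned list is non-empty.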

Common Pitfalls

A common pitfall is assuming DataFrames mutate in place after rename calls. They are immutable, so every rename returns a new DataFrame that must be assigned to a variable. Another is mixing multiple ad hoc naming rules across pipelines, which creates inconsistent schemas. Teams often forget collision checks and produce duplicate names that fail later during joins. Renaming columns without updating dependent SQL and dashboards can break downstream consumers. Nested-field renaming is also frequently treated like top-level renaming, which does not work.

Summary

Use withColumnRenamed for small explicit changes.
Use toDF with generated names for broad schema normalization.
Always detect duplicate output names before applying renames.
Use select aliases when renaming and transforming together.
Rebuild nested structs for nested-field renaming.
Enforce naming policy and schema checks in shared pipeline utilities.
