How to Change DataFrame Column Names in PySpark

Introduction

Renaming columns is a common PySpark task when standardizing messy source schemas for analytics and machine learning pipelines. The right technique depends on whether you are renaming a few columns, normalizing an entire schema, or combining rename with transformations. A robust approach includes collision checks and schema validation so downstream jobs do not break unexpectedly.

Rename a Few Columns With withColumnRenamed

For small targeted changes, withColumnRenamed is explicit and readable.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 88.5), (2, "Bob", 92.0)],
    ["id", "student name", "final score"],
)

# Each call returns a new DataFrame; df itself is unchanged.
renamed = (
    df.withColumnRenamed("student name", "student_name")
      .withColumnRenamed("final score", "final_score")
)

renamed.show()
renamed.printSchema()

This style is ideal for one to three columns and quick refactors.
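One caveat: withColumnRenamed does nothing when the source column is absent, so a typo in the old name passes silently. A minimal sketch of a guard follows. The helper name rename_columns is hypothetical, not a PySpark API, and it assumes only that the object exposes .columns and .withColumnRenamed, as DataFrames do. Its membership check is case-sensitive, which is stricter than Spark's default column resolution.

```python
from typing import Any, Mapping


def rename_columns(df: Any, mapping: Mapping[str, str]) -> Any:
    """Apply old -> new renames, failing if a source column is missing.

    Hypothetical helper, not part of PySpark. It relies only on
    `df.columns` and `df.withColumnRenamed`.
    """
    missing = [old for old in mapping if old not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    for old, new in mapping.items():
        # Each call returns a new object; the input is never mutated.
        df = df.withColumnRenamed(old, new)
    return df
```

With the DataFrame from the example above, this would be called as rename_columns(df, {"student name": "student_name", "final score": "final_score"}).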

Normalize Whole Schema With toDF

If most columns need normalization, generate the full list of output names and pass it to toDF, which expects exactly one name per existing column, in order.

python
import re


def normalize(name: str) -> str:
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9_]+", "_", name)  # non-alphanumeric runs -> "_"
    name = re.sub(r"_+", "_", name)           # collapse repeated underscores
    return name.strip("_")


new_cols = [normalize(c) for c in df.columns]
normalized = df.toDF(*new_cols)
normalized.printSchema()

This keeps the transformation concise and easy to reuse across many data sources.
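One edge case: a source name made up entirely of symbols normalizes to an empty string. The sketch below repeats the normalize rules and adds a hypothetical wrapper, normalize_all, that substitutes a positional placeholder. The col_<index> fallback is an assumption for illustration, not a PySpark convention, and the result should still go through a collision check.

```python
import re


def normalize(name: str) -> str:
    """Same rules as above: lowercase, non-alphanumerics to underscores."""
    name = name.strip().lower()
    name = re.sub(r"[^a-z0-9_]+", "_", name)
    name = re.sub(r"_+", "_", name)
    return name.strip("_")


def normalize_all(names: list[str]) -> list[str]:
    """Normalize every name, substituting col_<index> for empty results.

    The col_<index> placeholder is an assumed policy; use whatever
    your naming convention prescribes.
    """
    result = []
    for index, name in enumerate(names):
        cleaned = normalize(name)
        result.append(cleaned if cleaned else f"col_{index}")
    return result


print(normalize_all(["  Final Score (%) ", "Student-Name", "###"]))
# ['final_score', 'student_name', 'col_2']
```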

Detect Name Collisions Before Applying

Normalization can map different source names to the same output name. For example, A-B and A B both become a_b. Check for collisions before applying the rename.

python
from collections import Counter


def assert_unique(names: list[str]) -> None:
    counts = Counter(names)
    duplicates = [n for n, c in counts.items() if c > 1]
    if duplicates:
        raise ValueError(f"duplicate names after normalization: {duplicates}")


candidate_cols = [normalize(c) for c in df.columns]
assert_unique(candidate_cols)  # raises before any rename is applied
safe_df = df.toDF(*candidate_cols)

Failing fast here prevents subtle downstream join and select errors.
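When a collision does occur, it helps to know which source columns caused it, not only the contested output name. A possible sketch is shown here. The find_collisions helper and the demo normalizer are hypothetical names introduced for illustration, and in practice you would pass your own normalize function.

```python
from collections import defaultdict
from typing import Callable


def find_collisions(
    names: list[str], normalizer: Callable[[str], str]
) -> dict[str, list[str]]:
    """Map each contested output name to the source names that produce it."""
    sources = defaultdict(list)
    for name in names:
        sources[normalizer(name)].append(name)
    return {out: srcs for out, srcs in sources.items() if len(srcs) > 1}


# Stand-in normalizer for the demo; pass your own normalize function instead.
def demo_normalizer(name: str) -> str:
    return name.strip().lower().replace("-", "_").replace(" ", "_")


print(find_collisions(["A-B", "A B", "id"], demo_normalizer))
# {'a_b': ['A-B', 'A B']}
```

Including the source names in the error message makes it easier to decide which column to rename by hand.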

Rename While Transforming With select Aliases

When you also cast data types or create derived fields, select with aliases is often the cleanest single-pass pattern.

python
from pyspark.sql import functions as F

# Rename, cast, and add a derived column in a single projection.
curated = df.select(
    F.col("id").cast("long").alias("student_id"),
    F.col("student name").alias("student_name"),
    F.col("final score").cast("double").alias("final_score"),
    F.current_timestamp().alias("processed_at"),
)

curated.show(truncate=False)
curated.printSchema()

This is especially useful in silver and gold data layer jobs where the output schema is contract-driven.
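When the output contract lives in configuration, the same projection can be generated from a list of (source, target, type) entries and passed to DataFrame.selectExpr. This is a sketch under stated assumptions: build_select_exprs and the spec layout are hypothetical, target names are assumed to be plain identifiers, and source names containing backticks are rejected instead of escaped.

```python
from typing import Optional

# (source column, output name, optional Spark SQL type to cast to)
ColumnSpec = tuple[str, str, Optional[str]]


def build_select_exprs(spec: list[ColumnSpec]) -> list[str]:
    """Turn a column spec into SQL expression strings for selectExpr."""
    exprs = []
    for source, target, sql_type in spec:
        if "`" in source:
            raise ValueError(f"backtick in column name not handled: {source!r}")
        quoted = f"`{source}`"  # backticks allow spaces in the source name
        body = f"CAST({quoted} AS {sql_type})" if sql_type else quoted
        exprs.append(f"{body} AS {target}")
    return exprs


spec = [
    ("id", "student_id", "BIGINT"),
    ("student name", "student_name", None),
    ("final score", "final_score", "DOUBLE"),
]
for expr in build_select_exprs(spec):
    print(expr)
# CAST(`id` AS BIGINT) AS student_id
# `student name` AS student_name
# CAST(`final score` AS DOUBLE) AS final_score
```

The resulting list would be applied as df.selectExpr(*build_select_exprs(spec)).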

Nested Field Renaming Considerations

Nested struct fields cannot be renamed with the same simple APIs as top-level columns. You usually rebuild the struct with aliased children.

python
from pyspark.sql import functions as F

nested = spark.createDataFrame(
    [(1, ("Alice", 20))],
    "id INT, profile STRUCT<name: STRING, age: INT>",
)

# Rebuild the struct, aliasing each child field to its new name.
rebuilt = nested.select(
    "id",
    F.struct(
        F.col("profile.name").alias("full_name"),
        F.col("profile.age").alias("age"),
    ).alias("profile"),
)

rebuilt.printSchema()

For complex nested schemas, keep expected schema snapshots in tests.
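One way to build such a snapshot is to flatten the schema into dotted field paths and compare the list against a stored expectation. The sketch below works on the dict layout produced by df.schema.jsonValue(). The field_paths helper is hypothetical, and the sample dict is a hand-written stand-in, not captured Spark output.

```python
def field_paths(schema: dict, prefix: str = "") -> list[str]:
    """List dotted field paths with types from a struct schema in JSON form.

    Only nested structs are expanded; arrays and maps are reported
    by their top-level type name.
    """
    paths = []
    for field in schema["fields"]:
        path = f"{prefix}{field['name']}"
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "struct":
            paths.extend(field_paths(field_type, prefix=f"{path}."))
        else:
            is_simple = isinstance(field_type, str)
            type_name = field_type if is_simple else field_type["type"]
            paths.append(f"{path}: {type_name}")
    return paths


def struct_field(name: str, field_type) -> dict:
    """Build one field entry in the JSON schema layout (demo helper)."""
    return {"name": name, "type": field_type, "nullable": True, "metadata": {}}


# Hand-written stand-in for a schema with a renamed nested field.
profile_type = {
    "type": "struct",
    "fields": [struct_field("full_name", "string"), struct_field("age", "integer")],
}
schema_json = {
    "type": "struct",
    "fields": [struct_field("id", "integer"), struct_field("profile", profile_type)],
}

expected_snapshot = ["id: integer", "profile.full_name: string", "profile.age: integer"]
assert field_paths(schema_json) == expected_snapshot
```

A test that compares dotted paths fails with a readable diff when a nested field is renamed or dropped.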

Operational Best Practices

Adopt one naming policy for all ingestion jobs, for example lowercase with underscores, and keep it in a shared utility module.

Recommended safeguards:

  • schema assertions after rename
  • collision detection in CI
  • versioned contracts for major tables
  • clear migration notes when names change

Simple assertion example:

python
expected = {"student_id", "student_name", "final_score"}
assert expected.issubset(set(curated.columns))

These checks catch accidental schema drift early.
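The subset check lets extra columns through unnoticed. A slightly stricter sketch reports both directions of drift. The column_drift helper is hypothetical, and the column list mirrors the curated example above.

```python
def column_drift(actual: list[str], expected: set[str]) -> dict[str, list[str]]:
    """Report columns missing from, or unexpected in, the actual schema."""
    actual_set = set(actual)
    return {
        "missing": sorted(expected - actual_set),
        "unexpected": sorted(actual_set - expected),
    }


drift = column_drift(
    ["student_id", "student_name", "final_score", "processed_at"],
    {"student_id", "student_name", "final_score"},
)
print(drift)
# {'missing': [], 'unexpected': ['processed_at']}
```

Whether an unexpected column should fail the job or only log a warning depends on how strict the table contract is.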

Common Pitfalls

A common pitfall is assuming that rename calls mutate a DataFrame in place. Every rename returns a new DataFrame, so the result must be assigned. Another is mixing ad hoc naming rules across pipelines, which creates inconsistent schemas. Teams often skip collision checks and produce duplicate names that fail later during joins. Renaming columns without updating dependent SQL and dashboards can break downstream consumers. Nested fields are also frequently treated like top-level columns, which does not work: the enclosing struct has to be rebuilt.

Summary

  • Use withColumnRenamed for small explicit changes.
  • Use toDF with generated names for broad schema normalization.
  • Always detect duplicate output names before applying renames.
  • Use select aliases when renaming and transforming together.
  • Rebuild nested structs for nested-field renaming.
  • Enforce naming policy and schema checks in shared pipeline utilities.
