How to add a constant column in a Spark DataFrame?

Introduction

Adding a constant column to a Spark DataFrame is a very common transformation when you need default values, flags, source labels, or load metadata on every row. The standard approach is to use withColumn together with lit, which creates a Spark column expression representing the same value for every record.

Use withColumn and lit

In Spark, you do not assign values row by row the way you might in pandas. You build a new DataFrame by describing column expressions. For a constant value, the main tool is lit.

In PySpark:

python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "Ana"), (2, "Sam")],
    ["id", "name"]
)

# lit wraps the Python string in a Column expression
result = df.withColumn("source", lit("import_2025"))
result.show()

This adds a new column named source with the same string value on every row.

The same idea works in Scala:

scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq((1, "Ana"), (2, "Sam")).toDF("id", "name")
val result = df.withColumn("source", lit("import_2025"))
result.show()

lit is the important piece. withColumn expects a column expression as its second argument, not a raw language value, and lit wraps the raw value in such an expression.

Choose the Constant Type Deliberately

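Spark infers the data type of the literal from the value you pass, so strings, numbers, and booleans work directly. lit(None) is also allowed, but it produces an untyped null column, so cast it to the type you need. For example: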

python
from pyspark.sql.functions import lit

df = df.withColumn("is_active", lit(True))   # boolean
df = df.withColumn("batch_id", lit(17))      # integer
df = df.withColumn("score", lit(99.5))       # double

If you need a specific type, cast it explicitly:

python
from pyspark.sql.functions import lit

df = df.withColumn("batch_id", lit("17").cast("int"))

Being explicit is useful when the downstream schema matters. For example, writing a constant "17" as text when the target table expects an integer is a common source of avoidable schema problems.

You Can Also Replace an Existing Column

withColumn does not only create new columns. If the name already exists, Spark replaces that column with the new expression:

python
from pyspark.sql.functions import lit

df = spark.createDataFrame(
    [(1, "pending"), (2, "pending")],
    ["id", "status"]
)

# "status" already exists, so its values are replaced
result = df.withColumn("status", lit("processed"))
result.show()

This is useful in normalization pipelines, but it also means you should choose column names carefully. An accidental name collision silently overwrites data you meant to keep.

Constants Can Be Part of Larger Expressions

Once you understand lit, you can combine it with other Spark expressions. For example, add a constant prefix or build conditional logic:

python
from pyspark.sql.functions import col, concat, lit, when

df = spark.createDataFrame(
    [(1, 120), (2, 40)],
    ["id", "amount"]
)

result = (
    # Cast id to string explicitly instead of relying on implicit coercion
    df.withColumn("batch_label", concat(lit("batch-"), col("id").cast("string")))
      .withColumn(
          "priority",
          when(col("amount") > 100, lit("high")).otherwise(lit("normal"))
      )
)

result.show()

This is still "constant column" territory, but it shows the real Spark model: constants are just one kind of column expression in a larger transformation pipeline.

SQL Style Is Also Available

If you prefer SQL expressions, you can add a constant with selectExpr:

python
result = df.selectExpr("*", "'import_2025' as source")
result.show()

This can be convenient in SQL-heavy codebases, but withColumn(..., lit(...)) is usually the clearest and most direct form.

Common Pitfalls

The biggest mistake is passing a raw value where Spark expects a column expression. lit exists specifically to bridge that gap.

Another issue is accidentally overwriting an existing column by reusing the same name in withColumn. Spark will replace the column instead of warning you.

Developers also sometimes forget to cast the literal to the type expected by downstream systems. A constant that looks correct in show() can still create schema mismatches later.

Finally, if you add many columns in a long chain, keep the transformation readable. A constant column is simple, but dozens of withColumn calls can make a pipeline harder to maintain if names and types are not chosen carefully.

Summary

  • The standard way to add a constant column in Spark is withColumn plus lit.
  • lit turns a raw language value into a Spark column expression.
  • You can add strings, numbers, booleans, nulls, and cast them when needed.
  • Reusing an existing column name replaces that column.
  • Constants can also be combined with other expressions to build richer transformation logic.
