python pandas remove duplicate columns

Introduction

In pandas, "duplicate columns" can mean two different things: columns with duplicate names, or columns with different names but identical data. The right removal strategy depends on which kind of duplicate you actually have.

Case 1: Duplicate Column Names

If the problem is repeated labels in df.columns, you can keep the first occurrence with a boolean mask.

python
import pandas as pd

# Two columns share the label "a"
df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])
print(df)

# duplicated() is True for every repeat after the first occurrence
clean = df.loc[:, ~df.columns.duplicated()]
print(clean)

df.columns.duplicated() marks repeated column labels after the first one. Negating the mask keeps only the first occurrence.

This is usually the correct answer when duplicate columns come from merges, bad CSV headers, or manual renaming mistakes.

Keeping The Last Duplicate Name Instead

If you want the last occurrence instead of the first, pass keep="last".

python
clean = df.loc[:, ~df.columns.duplicated(keep="last")]

That is useful when later columns should override earlier ones.
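
To make the difference concrete, the sketch below runs both settings on the same one-row frame used earlier. The variable names first_kept and last_kept are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

# keep="first" (the default) retains the leftmost "a", which holds 1
first_kept = df.loc[:, ~df.columns.duplicated(keep="first")]

# keep="last" retains the rightmost "a", which holds 3
last_kept = df.loc[:, ~df.columns.duplicated(keep="last")]

print(first_kept.columns.tolist(), first_kept.iloc[0].tolist())  # ['a', 'b'] [1, 2]
print(last_kept.columns.tolist(), last_kept.iloc[0].tolist())    # ['b', 'a'] [2, 3]
```

Surviving columns stay in their original positions, so with keep="last" the column order becomes b, a rather than a, b.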

Case 2: Different Names, Same Data

Sometimes the labels are unique, but two columns contain identical values. That is a different problem.

python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3],
    "y": [4, 5, 6],
    "z": [1, 2, 3],
})

Here x and z are data duplicates.

A simple way to remove duplicate-by-content columns is:

python
clean = df.T.drop_duplicates().T
print(clean)

Transposing turns columns into rows, drop_duplicates() removes repeated rows, and transposing back restores the original orientation.
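
One caveat: when the frame mixes dtypes, the transpose has to fall back to object dtype, and transposing back does not restore the original types. A possible variant, sketched below, uses the transposed frame only to build a mask and then selects from the original frame. The "label" column is a hypothetical addition to force mixed dtypes.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3],
    "label": ["a", "b", "c"],  # hypothetical string column, forces mixed dtypes
    "z": [1, 2, 3],
})

# Double transpose: the right columns survive, but their dtypes become object
round_trip = df.T.drop_duplicates().T
print(round_trip.dtypes)

# Mask variant: the transpose is only used to find duplicates,
# and the columns are then selected from the original frame
is_dup = df.T.duplicated().to_numpy()
clean = df.loc[:, ~is_dup]
print(clean.dtypes)
```

Both versions keep x and label, but only the mask variant leaves x with its original integer dtype. The transpose itself is still performed, so this addresses dtypes, not memory cost.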

Why These Two Cases Should Not Be Mixed

If you remove duplicate names when the real issue is duplicate content, you may keep redundant data. If you remove duplicate content when the real issue is label collisions, you may preserve a confusing schema.

So diagnose the problem first.

A quick check for duplicate labels is:

python
print(df.columns[df.columns.duplicated()])

A quick check for duplicate-by-content columns is to compare transposed rows or use pairwise equality if the frame is small.
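
Both checks can be sketched as follows, assuming unique column labels and a frame small enough that a transpose or a pairwise loop is acceptable. The names dup_by_content and equal_pairs are illustrative.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "z": [1, 2, 3]})

# Each row of df.T is one original column, so duplicated() flags
# columns whose values repeat an earlier column
dup_by_content = df.columns[df.T.duplicated().to_numpy()]
print(dup_by_content.tolist())  # ['z']

# Pairwise alternative: list every pair of columns with equal values
equal_pairs = [
    (a, b) for a, b in combinations(df.columns, 2) if df[a].equals(df[b])
]
print(equal_pairs)  # [('x', 'z')]
```

The pairwise version reports which column each duplicate matches, which the mask alone does not tell you.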

A Reusable Helper For Duplicate Names

python
import pandas as pd

def drop_duplicate_column_names(df: pd.DataFrame, keep: str = "first") -> pd.DataFrame:
    return df.loc[:, ~df.columns.duplicated(keep=keep)]

This keeps the intent obvious whenever the cleanup appears in multiple places.

A Reusable Helper For Duplicate Content

python
def drop_duplicate_column_values(df: pd.DataFrame) -> pd.DataFrame:
    return df.T.drop_duplicates().T

This is clear and compact, though it can be expensive on very large DataFrames because of the transpose, and on a frame with mixed dtypes the result comes back with object-dtype columns.

Performance Considerations

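For duplicate names, the df.columns.duplicated() approach inspects only the column labels, not the data, so its cost depends on the number of columns rather than the number of rows. It should be the default.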

For duplicate content, df.T.drop_duplicates().T is elegant but may be heavy for large or wide DataFrames. If performance matters, you may need a more specialized approach based on hashing or targeted comparison.

But for many real data-cleaning tasks, the transpose trick is perfectly fine.
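
For the cases where it is not, one possible shape for the hashing idea is sketched below. The helper drop_duplicate_columns_hashed is hypothetical, not a pandas function: it fingerprints each column with pd.util.hash_pandas_object, then confirms candidate matches with Series.equals so that a hash collision cannot drop a distinct column. Unlike the transpose approach, equals() also requires matching dtypes, so an integer column and a float column holding the same numbers are treated as different.

```python
import pandas as pd


def drop_duplicate_columns_hashed(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first of each group of columns with identical values and dtype."""
    buckets = {}  # fingerprint -> positions of kept columns with that fingerprint
    keep = []
    for pos in range(df.shape[1]):
        col = df.iloc[:, pos]
        # One uint64 hash per row, collapsed into a single fingerprint
        row_hashes = pd.util.hash_pandas_object(col, index=False)
        fingerprint = hash(row_hashes.to_numpy().tobytes())
        candidates = buckets.setdefault(fingerprint, [])
        # Confirm with equals() before treating the column as a duplicate
        if any(col.equals(df.iloc[:, k]) for k in candidates):
            continue
        candidates.append(pos)
        keep.append(pos)
    return df.iloc[:, keep]


df = pd.DataFrame({
    "x": [1, 2, 3],
    "y": [4.0, 5.0, 6.0],
    "z": [1, 2, 3],
})
clean = drop_duplicate_columns_hashed(df)
print(clean.columns.tolist())  # ['x', 'y']
print(clean.dtypes)
```

Because columns are selected by position from the original frame, no transpose is needed and dtypes are left untouched. Whether this is actually faster than the transpose on a given dataset is something to measure rather than assume.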

Duplicate Columns After Merges

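One very common source of near-duplicate columns is a merge. When both frames share a non-key column name, pandas keeps both copies and disambiguates them with suffixes, _x and _y by default. The suffixes argument lets you choose more descriptive ones.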

python
merged = left.merge(right, on="id", suffixes=("_left", "_right"))

In that situation, the right fix is often not generic deduplication. It is choosing which source column should survive and dropping the other explicitly. That is safer than removing columns blindly after the merge.
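
A minimal sketch of that explicit choice follows. The left and right frames, their columns, and the rule that the right-hand source wins are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical frames that share the key "id" and the non-key column "name"
left = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
right = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Robert"], "age": [30, 41]})

merged = left.merge(right, on="id", suffixes=("_left", "_right"))
print(merged.columns.tolist())  # ['id', 'name_left', 'name_right', 'age']

# Inspect the rows where the two sources disagree before deciding
print(merged[merged["name_left"] != merged["name_right"]])

# Assumed rule for this example: the right-hand source wins
result = (
    merged.drop(columns="name_left")
          .rename(columns={"name_right": "name"})
)
print(result.columns.tolist())  # ['id', 'name', 'age']
```

Note that the suffixed columns have different labels, so df.columns.duplicated() would not flag them, and a content-based check would keep both because the values differ in one row.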

Common Pitfalls

The most common mistake is not distinguishing duplicate names from duplicate data.

Another mistake is removing duplicate labels after a merge without checking whether the repeated columns actually contain different information.
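
A check along these lines can be run before dropping anything. The frame and the report dictionary are hypothetical: label "a" repeats with different data, while label "b" repeats with identical data.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 9, 2], [3, 4, 8, 4]],
    columns=["a", "b", "a", "b"],
)

report = {}  # label -> True if every copy of that column holds the same values
for label in df.columns[df.columns.duplicated()].unique():
    # Select all columns carrying this label, by position mask
    block = df.loc[:, df.columns == label]
    first = block.iloc[:, 0]
    report[label] = all(
        first.equals(block.iloc[:, k]) for k in range(1, block.shape[1])
    )

print(report)  # {'a': False, 'b': True}
```

Labels reported as False carry conflicting data, so dropping by name would silently discard one version of it.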

Developers also sometimes use the transpose-based content deduplication approach on very large frames without realizing the memory cost.

Finally, if you keep only the first duplicate label by default, make sure that really matches your data-cleaning rule. Sometimes the last occurrence is the one you intended to keep.

Summary

  • Duplicate column names and duplicate column contents are different problems.
  • Use df.loc[:, ~df.columns.duplicated()] to remove repeated column labels.
  • Use df.T.drop_duplicates().T to remove columns with identical values.
  • Choose whether to keep the first or last duplicate name explicitly.
  • Diagnose the duplicate type before applying the cleanup.
