python pandas remove duplicate columns
Introduction
In pandas, "duplicate columns" can mean two different things: columns with duplicate names, or columns with different names but identical data. The right removal strategy depends on which kind of duplicate you actually have.
Case 1: Duplicate Column Names
If the problem is repeated labels in df.columns, you can keep the first occurrence with a boolean mask: df.loc[:, ~df.columns.duplicated()].
df.columns.duplicated() marks repeated column labels after the first one. Negating the mask keeps only the first occurrence.
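The mask described above can be sketched on a tiny frame. The frame and its labels are made up for illustration.

```python
import pandas as pd

# Illustrative frame: the label "a" appears twice.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# duplicated() flags every repeat after the first occurrence of a label.
is_repeat = df.columns.duplicated()   # array([False, False,  True])

# Negate the mask to keep only the first occurrence of each label.
deduped = df.loc[:, ~is_repeat]
print(deduped.columns.tolist())       # ['a', 'b']
```

Selection is positional here, so the surviving "a" column is the leftmost one, with values 1 and 4.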
This is usually the correct answer when duplicate columns come from merges, bad CSV headers, or manual renaming mistakes.
Keeping The Last Duplicate Name Instead
If you want the last occurrence instead of the first, pass keep="last".
That is useful when later columns should override earlier ones.
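A minimal sketch of the keep="last" variant, using the same kind of made-up frame with a repeated "a" label:

```python
import pandas as pd

# Illustrative frame: the label "a" appears twice.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

# With keep="last", every occurrence except the final one is flagged.
is_repeat = df.columns.duplicated(keep="last")   # array([ True, False, False])

deduped_last = df.loc[:, ~is_repeat]
print(deduped_last.columns.tolist())             # ['b', 'a']
```

Note that the column order changes: the surviving "a" is the rightmost one, so it now comes after "b".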
Case 2: Different Names, Same Data
Sometimes the labels are unique, but two columns contain identical values. That is a different problem.
For example, in a frame with columns x, y, and z where x and z hold exactly the same values, x and z are data duplicates.
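A small frame of that shape might look like the following. The values are invented for illustration; the point is only that the labels are unique while x and z match element by element.

```python
import pandas as pd

# Illustrative frame: unique labels, but x and z carry the same values.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "z": [1, 2, 3]})

labels_unique = df.columns.is_unique      # True: no label collisions
x_equals_z = df["x"].equals(df["z"])      # True: same values, same dtype
x_equals_y = df["x"].equals(df["y"])      # False
```

A label-based mask would leave this frame unchanged, because no column name repeats.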
A simple way to remove duplicate-by-content columns is df.T.drop_duplicates().T.
Transposing turns columns into rows, drop_duplicates() removes repeated rows (keeping the first of each), and transposing back restores the original orientation.
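The transpose approach can be sketched as follows, on a made-up frame where x and z hold the same values:

```python
import pandas as pd

# Illustrative frame: x and z are identical by content.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "z": [1, 2, 3]})

# Columns become rows, repeated rows are dropped, then the frame is flipped back.
deduped = df.T.drop_duplicates().T
print(deduped.columns.tolist())   # ['x', 'y']
```

One caveat worth knowing: if the columns have mixed dtypes (say, integers and strings), the transpose upcasts everything to a common dtype, so the result may come back as object columns.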
Why These Two Cases Should Not Be Mixed
If you remove duplicate names when the real issue is duplicate content, you may keep redundant data. If you remove duplicate content when the real issue is label collisions, you may preserve a confusing schema.
So diagnose the problem first.
A quick check for duplicate labels is df.columns.duplicated().any().
A quick check for duplicate-by-content columns is df.T.duplicated().any(), which compares transposed rows; on a small frame you can also compare columns pairwise.
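Both checks can be run side by side. This is a minimal sketch on an invented frame that has one repeated label and one repeated set of values:

```python
import pandas as pd

# Illustrative frame: label "a" repeats, and column "c" repeats the values of the first "a".
df = pd.DataFrame([[1, 2, 1, 9], [3, 4, 3, 8]], columns=["a", "b", "c", "a"])

# Labels that repeat an earlier label.
dup_labels = df.columns[df.columns.duplicated()].tolist()

# Columns whose values repeat an earlier column (compared as transposed rows).
content_mask = df.T.duplicated().to_numpy()
dup_content = df.columns[content_mask].tolist()

print(dup_labels)    # ['a']
print(dup_content)   # ['c']
```

Here the two checks flag different columns, which is exactly why the diagnosis should come before the cleanup.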
A Reusable Helper For Duplicate Names
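One possible shape for such a helper is sketched below. The function name and the keep parameter are hypothetical choices for this example, not part of pandas.

```python
import pandas as pd

def drop_duplicate_column_names(df: pd.DataFrame, keep: str = "first") -> pd.DataFrame:
    """Return df with repeated column labels removed (hypothetical helper).

    keep is forwarded to Index.duplicated: "first" or "last".
    """
    return df.loc[:, ~df.columns.duplicated(keep=keep)]

# Usage on an illustrative frame with a repeated "a" label.
df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])
first_kept = drop_duplicate_column_names(df)
last_kept = drop_duplicate_column_names(df, keep="last")
```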
Wrapping the mask in a small named function keeps the intent obvious whenever the cleanup appears in multiple places.
A Reusable Helper For Duplicate Content
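A content-based helper could be as small as the sketch below. The function name is hypothetical, and the body is simply the transpose approach wrapped in a function.

```python
import pandas as pd

def drop_duplicate_column_content(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without columns whose values repeat an earlier column (hypothetical helper)."""
    # Transpose, drop repeated rows (first occurrence wins), transpose back.
    return df.T.drop_duplicates().T

# Usage on an illustrative frame where z repeats x.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6], "z": [1, 2, 3]})
result = drop_duplicate_column_content(df)
```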
The transpose-based helper is clear and compact, though it can be expensive on very large DataFrames because of the transpose.
Performance Considerations
For duplicate names, the df.columns.duplicated() approach is efficient and should be the default.
For duplicate content, df.T.drop_duplicates().T is elegant but may be heavy for large or wide DataFrames, and transposing a frame with mixed dtypes upcasts its values to a common dtype (often object). If performance matters, you may need a more specialized approach based on hashing or targeted comparison.
But for many real data-cleaning tasks, the transpose trick is perfectly fine.
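As one possible take on the hashing idea, the sketch below fingerprints each column and only runs a full comparison when two fingerprints collide. It is an illustration rather than a benchmarked solution: the function name is hypothetical, and it relies on pd.util.hash_pandas_object plus hashlib for the fingerprint. Unlike the transpose trick, it treats columns with different dtypes (for example 1 and 1.0) as different.

```python
import hashlib
import pandas as pd

def drop_duplicate_content_hashed(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns whose values repeat an earlier column, without transposing (hypothetical helper)."""
    kept_by_digest = {}   # fingerprint -> positions of kept columns sharing it
    keep_positions = []
    for pos in range(df.shape[1]):
        col = df.iloc[:, pos]
        # Hash the values row by row (ignoring the index), then condense to one digest.
        row_hashes = pd.util.hash_pandas_object(col, index=False).to_numpy()
        digest = hashlib.sha256(row_hashes.tobytes()).hexdigest()
        candidates = kept_by_digest.setdefault(digest, [])
        # Confirm with a real comparison so a hash collision cannot drop data.
        if any(col.equals(df.iloc[:, k]) for k in candidates):
            continue
        candidates.append(pos)
        keep_positions.append(pos)
    return df.iloc[:, keep_positions]

# Usage on an illustrative mixed-dtype frame.
df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"], "z": [1, 2, 3], "w": ["a", "b", "c"]})
result = drop_duplicate_content_hashed(df)
```

Because columns are selected by position rather than rebuilt from a transpose, the surviving columns keep their original dtypes.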
Duplicate Columns After Merges
One very common source of duplicate columns is a merge that preserves overlapping fields with suffixes such as _x and _y.
In that situation, the right fix is often not generic deduplication. It is choosing which source column should survive and dropping the other explicitly. That is safer than removing columns blindly after the merge.
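The explicit approach described above might look like this. The two frames, their column names, and the rule "the right-hand name wins" are all assumptions made for the example.

```python
import pandas as pd

# Illustrative inputs: both frames carry a "name" column.
left = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
right = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Robert"], "age": [30, 41]})

# Explicit suffixes make the origin of each overlapping column obvious.
merged = left.merge(right, on="id", suffixes=("_left", "_right"))
# merged columns: id, name_left, name_right, age

# Assumed rule for this example: the right-hand source is authoritative.
cleaned = merged.drop(columns="name_left").rename(columns={"name_right": "name"})
print(cleaned.columns.tolist())   # ['id', 'name', 'age']
```

A generic deduplication would not have helped here: the two name columns have different labels and different values, so neither a label mask nor a content check would flag them.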
Common Pitfalls
The most common mistake is not distinguishing duplicate names from duplicate data.
Another mistake is removing duplicate labels after a merge without checking whether the repeated columns actually contain different information.
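One way to guard against that is to check, for each repeated label, whether all columns sharing it actually agree before dropping any of them. The helper below is a hypothetical sketch of such a check, not a pandas function.

```python
import pandas as pd

def duplicate_labels_agree(df: pd.DataFrame) -> dict:
    """Map each repeated label to True if all its columns hold the same values (hypothetical helper)."""
    report = {}
    for label in df.columns[df.columns.duplicated()].unique():
        # All columns carrying this label, selected by a positional boolean mask.
        block = df.loc[:, df.columns == label]
        first = block.iloc[:, 0]
        report[label] = all(
            block.iloc[:, i].equals(first) for i in range(1, block.shape[1])
        )
    return report

# Illustrative frame: the two "a" columns match, the two "b" columns do not.
df = pd.DataFrame([[1, 2, 1, 2], [3, 4, 3, 5]], columns=["a", "b", "a", "b"])
report = duplicate_labels_agree(df)
print(report)   # {'a': True, 'b': False}
```

A False entry means that dropping by label would silently discard information, so those columns deserve a manual decision.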
Developers also sometimes use the transpose-based content deduplication approach on very large frames without realizing the memory cost.
Finally, if you keep only the first duplicate label by default, make sure that really matches your data-cleaning rule. Sometimes the last occurrence is the one you intended to keep.
Summary
- Duplicate column names and duplicate column contents are different problems.
- Use df.loc[:, ~df.columns.duplicated()] to remove repeated column labels.
- Use df.T.drop_duplicates().T to remove columns with identical values.
- Choose whether to keep the first or last duplicate name explicitly.
- Diagnose the duplicate type before applying the cleanup.

