Delete all Duplicate Rows except for One in MySQL?
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
Managing data duplicates is a critical task in database administration, as duplicates can lead to inaccurate data analysis and increased storage costs. A common problem when dealing with databases is the presence of duplicate rows. This article focuses on handling the deletion of duplicate rows in MySQL, keeping exactly one occurrence of each duplicate row.
Understanding Duplicate Rows in MySQL
Duplicate rows arise primarily from errors in data entry or improper database maintenance. In relational databases like MySQL, duplicates are typically considered as sets of rows that have identical values across all columns, or a subset of columns, based on a specified criterion.
Why Remove Duplicates?
- Data Integrity: Duplicates compromise data accuracy and reliability.
- Efficiency: Streamlining data improves database performance.
- Storage: Duplicates occupy unnecessary space, inflating storage requirements.
Identifying Duplicate Rows
Prior to removing duplicates, it's paramount to better understand what qualifies as a duplicate. Suppose you have the following table employees:
| id | name | department | salary |
| 1 | Alice | HR | 50000 |
| 2 | Bob | IT | 60000 |
| 3 | Alice | HR | 50000 |
| 4 | Charlie | IT | 70000 |
| 5 | Alice | HR | 50000 |
In this table, rows with id 1, 3, and 5 are duplicates if name, department, and salary are considered.
Approach to Deleting Duplicate Rows Except One
While MySQL lacks a direct method to delete duplicates, using subqueries and temporary tables provides a workaround. Here is a method using ROW_NUMBER() to accomplish this task:
Step-by-Step Deletion Process
- Identify and Keep One Occurrence: Use a common table expression (CTE) or a subquery with the
ROW_NUMBER()window function to assign a unique identifier to duplicate rows. - Delete Extra Duplicates: Delete those rows with a
ROW_NUMBER()greater than 1.
Example Query
Suppose you need to remove duplicate entries based on name, department, and salary. The following SQL commands help you achieve this:
Explanation
- CTE (Common Table Expression): The
ranked_employeesCTE generates row numbers (rn) for each group of duplicates as partitioned byname,department, andsalary. - Deletion: The outer
DELETEcommand eliminates rows that have arngreater than 1, thus ensuring that only one instance of each duplicate row remains.
Considerations and Best Practices
- Backup: Always back up your database before performing batch deletions.
- Test: Run these queries on a sample database or in a test environment prior to applying them to your production database.
- Indexing: Ensure that appropriate indexes are in place to enhance query performance, especially on large datasets.
Summary Table
| Step | Description |
| Identify Duplicates | Use CTE with ROW_NUMBER() to tag duplicate rows. |
| Keep One Occurrence | Delete records where ROW_NUMBER() is greater than 1. |
Additional Techniques
- Using Temporary Tables: You can utilize temporary tables to first insert distinct rows and truncate the original table before re-inserting the unique rows.
- Using Grouping: MySQL's
GROUP BYclause with theHAVINGkeyword can also identify duplicates but is less efficient thanROW_NUMBER()for deletion.
Conclusion
Handling duplicates in MySQL efficiently requires strategic planning and execution. Utilizing MySQL functions like ROW_NUMBER() within a CTE offers an effective approach to retain one duplicate and remove the rest. This approach ensures data integrity while optimizing database performance. Remember to adhere to best practices, including backing up your data and conducting thorough testing prior to running deletion queries on critical datasets.

