Delete all Duplicate Rows except for One in MySQL?

MySQL

SQL

database

duplicate-rows

data-cleaning

Delete all Duplicate Rows except for One in MySQL?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Managing data duplicates is a critical task in database administration, as duplicates can lead to inaccurate data analysis and increased storage costs. A common problem when dealing with databases is the presence of duplicate rows. This article focuses on handling the deletion of duplicate rows in MySQL, keeping exactly one occurrence of each duplicate row.

Understanding Duplicate Rows in MySQL

Duplicate rows arise primarily from errors in data entry or improper database maintenance. In relational databases like MySQL, duplicates are typically considered as sets of rows that have identical values across all columns, or a subset of columns, based on a specified criterion.

Why Remove Duplicates?

Data Integrity: Duplicates compromise data accuracy and reliability.
Efficiency: Streamlining data improves database performance.
Storage: Duplicates occupy unnecessary space, inflating storage requirements.

Identifying Duplicate Rows

Prior to removing duplicates, it's paramount to better understand what qualifies as a duplicate. Suppose you have the following table employees:

id	name	department	salary
1	Alice	HR	50000
2	Bob	IT	60000
3	Alice	HR	50000
4	Charlie	IT	70000
5	Alice	HR	50000

In this table, rows with id 1, 3, and 5 are duplicates if name, department, and salary are considered.

Approach to Deleting Duplicate Rows Except One

While MySQL lacks a direct method to delete duplicates, using subqueries and temporary tables provides a workaround. Here is a method using ROW_NUMBER() to accomplish this task:

Step-by-Step Deletion Process

Identify and Keep One Occurrence: Use a common table expression (CTE) or a subquery with the ROW_NUMBER() window function to assign a unique identifier to duplicate rows.
Delete Extra Duplicates: Delete those rows with a ROW_NUMBER() greater than 1.

Example Query

Suppose you need to remove duplicate entries based on name, department, and salary. The following SQL commands help you achieve this:

sql

1WITH ranked_employees AS (
2    SELECT *,
3           ROW_NUMBER() OVER (PARTITION BY name, department, salary ORDER BY id) as rn
4    FROM employees
5)
6DELETE FROM employees
7WHERE id IN (
8    SELECT id
9    FROM ranked_employees
10    WHERE rn > 1
11);

Explanation

CTE (Common Table Expression): The ranked_employees CTE generates row numbers (rn) for each group of duplicates as partitioned by name, department, and salary.
Deletion: The outer DELETE command eliminates rows that have a rn greater than 1, thus ensuring that only one instance of each duplicate row remains.

Considerations and Best Practices

Backup: Always back up your database before performing batch deletions.
Test: Run these queries on a sample database or in a test environment prior to applying them to your production database.
Indexing: Ensure that appropriate indexes are in place to enhance query performance, especially on large datasets.

Summary Table

Step	Description
Identify Duplicates	Use CTE with `ROW_NUMBER()` to tag duplicate rows.
Keep One Occurrence	Delete records where `ROW_NUMBER()` is greater than 1.

Additional Techniques

Using Temporary Tables: You can utilize temporary tables to first insert distinct rows and truncate the original table before re-inserting the unique rows.
Using Grouping: MySQL's GROUP BY clause with the HAVING keyword can also identify duplicates but is less efficient than ROW_NUMBER() for deletion.

Conclusion

Handling duplicates in MySQL efficiently requires strategic planning and execution. Utilizing MySQL functions like ROW_NUMBER() within a CTE offers an effective approach to retain one duplicate and remove the rest. This approach ensures data integrity while optimizing database performance. Remember to adhere to best practices, including backing up your data and conducting thorough testing prior to running deletion queries on critical datasets.