MySQL
SQL
database
duplicate-rows
data-cleaning

Delete all Duplicate Rows except for One in MySQL?

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Managing data duplicates is a critical task in database administration, as duplicates can lead to inaccurate data analysis and increased storage costs. A common problem when dealing with databases is the presence of duplicate rows. This article focuses on handling the deletion of duplicate rows in MySQL, keeping exactly one occurrence of each duplicate row.

Understanding Duplicate Rows in MySQL

Duplicate rows arise primarily from errors in data entry or improper database maintenance. In relational databases like MySQL, duplicates are typically considered as sets of rows that have identical values across all columns, or a subset of columns, based on a specified criterion.

Why Remove Duplicates?

  1. Data Integrity: Duplicates compromise data accuracy and reliability.
  2. Efficiency: Streamlining data improves database performance.
  3. Storage: Duplicates occupy unnecessary space, inflating storage requirements.

Identifying Duplicate Rows

Prior to removing duplicates, it's paramount to better understand what qualifies as a duplicate. Suppose you have the following table employees:

idnamedepartmentsalary
1AliceHR50000
2BobIT60000
3AliceHR50000
4CharlieIT70000
5AliceHR50000

In this table, rows with id 1, 3, and 5 are duplicates if name, department, and salary are considered.

Approach to Deleting Duplicate Rows Except One

While MySQL lacks a direct method to delete duplicates, using subqueries and temporary tables provides a workaround. Here is a method using ROW_NUMBER() to accomplish this task:

Step-by-Step Deletion Process

  1. Identify and Keep One Occurrence: Use a common table expression (CTE) or a subquery with the ROW_NUMBER() window function to assign a unique identifier to duplicate rows.
  2. Delete Extra Duplicates: Delete those rows with a ROW_NUMBER() greater than 1.

Example Query

Suppose you need to remove duplicate entries based on name, department, and salary. The following SQL commands help you achieve this:

sql
1WITH ranked_employees AS (
2    SELECT *,
3           ROW_NUMBER() OVER (PARTITION BY name, department, salary ORDER BY id) as rn
4    FROM employees
5)
6DELETE FROM employees
7WHERE id IN (
8    SELECT id
9    FROM ranked_employees
10    WHERE rn > 1
11);

Explanation

  • CTE (Common Table Expression): The ranked_employees CTE generates row numbers (rn) for each group of duplicates as partitioned by name, department, and salary.
  • Deletion: The outer DELETE command eliminates rows that have a rn greater than 1, thus ensuring that only one instance of each duplicate row remains.

Considerations and Best Practices

  • Backup: Always back up your database before performing batch deletions.
  • Test: Run these queries on a sample database or in a test environment prior to applying them to your production database.
  • Indexing: Ensure that appropriate indexes are in place to enhance query performance, especially on large datasets.

Summary Table

StepDescription
Identify DuplicatesUse CTE with ROW_NUMBER() to tag duplicate rows.
Keep One OccurrenceDelete records where ROW_NUMBER() is greater than 1.

Additional Techniques

  • Using Temporary Tables: You can utilize temporary tables to first insert distinct rows and truncate the original table before re-inserting the unique rows.
  • Using Grouping: MySQL's GROUP BY clause with the HAVING keyword can also identify duplicates but is less efficient than ROW_NUMBER() for deletion.

Conclusion

Handling duplicates in MySQL efficiently requires strategic planning and execution. Utilizing MySQL functions like ROW_NUMBER() within a CTE offers an effective approach to retain one duplicate and remove the rest. This approach ensures data integrity while optimizing database performance. Remember to adhere to best practices, including backing up your data and conducting thorough testing prior to running deletion queries on critical datasets.


Course illustration
Course illustration

All Rights Reserved.