Finding duplicate values in MySQL

MySQL

Database Management

Data Duplication

SQL Queries

Programming

Finding duplicate values in MySQL

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

In MySQL, "finding duplicates" can mean two different things: finding duplicate values in one column, or finding full duplicate rows based on a combination of columns. The SQL pattern is usually GROUP BY ... HAVING COUNT(*) > 1, followed by a second query if you need the actual rows.

Find Duplicate Values In One Column

If you want to know which email addresses appear more than once:

sql

1SELECT email, COUNT(*) AS occurrences
2FROM users
3GROUP BY email
4HAVING COUNT(*) > 1;

This does not return every duplicated row. It returns each duplicated value plus how many times it appears.

That distinction matters because many cleanup jobs need the row ids, not only the repeated value.

Find The Actual Duplicate Rows

Once you know the duplicated values, join them back to the table:

sql

1SELECT u.*
2FROM users u
3JOIN (
4    SELECT email
5    FROM users
6    GROUP BY email
7    HAVING COUNT(*) > 1
8) d ON u.email = d.email
9ORDER BY u.email, u.id;

This gives you the full rows for each duplicate email, which is much more useful for review, deletion, or repair.

Find Duplicates Across Multiple Columns

Sometimes the duplicate definition is a combination such as first name plus last name plus birth date:

sql

1SELECT first_name, last_name, birth_date, COUNT(*) AS occurrences
2FROM customers
3GROUP BY first_name, last_name, birth_date
4HAVING COUNT(*) > 1;

This is the same pattern, just with more grouping columns.

If your business logic says that combination should be unique, this query reveals where the data has drifted from that rule.

Find All But One Row In Each Duplicate Group

If the goal is cleanup, you often need "the extra rows" rather than all rows equally. A common pattern is to keep the smallest id and mark the others:

sql

1SELECT u.*
2FROM users u
3JOIN (
4    SELECT email, MIN(id) AS keep_id
5    FROM users
6    GROUP BY email
7    HAVING COUNT(*) > 1
8) d ON u.email = d.email
9WHERE u.id <> d.keep_id
10ORDER BY u.email, u.id;

This is a practical way to review which rows would be deleted if you want one surviving record per duplicate group.

Add Indexes When The Table Is Large

Duplicate detection on large tables can be slow without indexes on the grouped columns. If you often search duplicates by email, an index on email helps both grouping and join-back queries.

That does not magically remove the cost of scanning and grouping large datasets, but it often makes the work more predictable.

If duplicates are not supposed to exist at all, the long-term fix is usually a UNIQUE constraint or unique index after cleanup:

sql

ALTER TABLE users
ADD CONSTRAINT uq_users_email UNIQUE (email);

Without a database-level rule, duplicate-cleanup queries become a recurring maintenance job.

Common Pitfalls

One common mistake is running only the GROUP BY query and assuming it already returned the full duplicate rows.

Another issue is forgetting that NULL values behave differently in grouping and uniqueness scenarios, depending on the database rule you are enforcing.

A third problem is deleting duplicates immediately without first deciding which row should be kept and why.

Finally, teams sometimes clean duplicates once but never add a unique constraint, so the same data problem returns later.

Summary

Use GROUP BY ... HAVING COUNT(*) > 1 to identify duplicated values or column combinations.
Join the duplicate keys back to the base table when you need the full rows.
Define clearly whether a duplicate means one repeated column or a repeated combination of columns.
For cleanup, decide which row should survive before deleting anything.
If duplicates are invalid by design, add a unique constraint after cleaning the data.