Finding duplicate values in MySQL
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
In MySQL, "finding duplicates" can mean two different things: finding duplicate values in one column, or finding full duplicate rows based on a combination of columns. The SQL pattern is usually GROUP BY ... HAVING COUNT(*) > 1, followed by a second query if you need the actual rows.
Find Duplicate Values In One Column
If you want to know which email addresses appear more than once:
This does not return every duplicated row. It returns each duplicated value plus how many times it appears.
That distinction matters because many cleanup jobs need the row ids, not only the repeated value.
Find The Actual Duplicate Rows
Once you know the duplicated values, join them back to the table:
This gives you the full rows for each duplicate email, which is much more useful for review, deletion, or repair.
Find Duplicates Across Multiple Columns
Sometimes the duplicate definition is a combination such as first name plus last name plus birth date:
This is the same pattern, just with more grouping columns.
If your business logic says that combination should be unique, this query reveals where the data has drifted from that rule.
Find All But One Row In Each Duplicate Group
If the goal is cleanup, you often need "the extra rows" rather than all rows equally. A common pattern is to keep the smallest id and mark the others:
This is a practical way to review which rows would be deleted if you want one surviving record per duplicate group.
Add Indexes When The Table Is Large
Duplicate detection on large tables can be slow without indexes on the grouped columns. If you often search duplicates by email, an index on email helps both grouping and join-back queries.
That does not magically remove the cost of scanning and grouping large datasets, but it often makes the work more predictable.
If duplicates are not supposed to exist at all, the long-term fix is usually a UNIQUE constraint or unique index after cleanup:
Without a database-level rule, duplicate-cleanup queries become a recurring maintenance job.
Common Pitfalls
One common mistake is running only the GROUP BY query and assuming it already returned the full duplicate rows.
Another issue is forgetting that NULL values behave differently in grouping and uniqueness scenarios, depending on the database rule you are enforcing.
A third problem is deleting duplicates immediately without first deciding which row should be kept and why.
Finally, teams sometimes clean duplicates once but never add a unique constraint, so the same data problem returns later.
Summary
- Use
GROUP BY ... HAVING COUNT(*) > 1to identify duplicated values or column combinations. - Join the duplicate keys back to the base table when you need the full rows.
- Define clearly whether a duplicate means one repeated column or a repeated combination of columns.
- For cleanup, decide which row should survive before deleting anything.
- If duplicates are invalid by design, add a unique constraint after cleaning the data.

