What's faster, SELECT DISTINCT or GROUP BY in MySQL?

Introduction

SELECT DISTINCT and GROUP BY can both remove duplicates, so developers often ask which is faster in MySQL. In real systems, performance depends more on index design and execution plans than on keyword choice alone. Start with semantics, then validate speed with EXPLAIN ANALYZE (available from MySQL 8.0.18).

Semantics Should Drive the Choice

Use DISTINCT when you only need unique rows. Use GROUP BY when you need grouping with aggregate expressions.

```sql
-- Unique ids only
SELECT DISTINCT customer_id
FROM orders;

-- Grouped analytics
SELECT customer_id, SUM(total_amount) AS total_spend
FROM orders
GROUP BY customer_id;
```

Even if plans are similar, these queries express different intent. Clear intent improves maintainability.

Why Performance Is Often Similar

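For pure dedup cases, the optimizer often produces the same or very similar internal plans for both forms. Without a usable index, either form typically deduplicates through a temporary table or a sort, and the sort costs roughly O(n log n). With a suitable index, MySQL can read the values in index order and avoid those heavy temporary operations altogether. One historical difference: before MySQL 8.0, GROUP BY also sorted its result implicitly, which could make it slower than DISTINCT; MySQL 8.0 removed that implicit sort.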

```sql
CREATE INDEX idx_orders_customer_created
ON orders(customer_id, created_at);

SELECT DISTINCT customer_id
FROM orders
WHERE created_at >= '2026-01-01';

SELECT customer_id
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY customer_id;
```

Depending on data distribution, one may still win. That is why plan inspection is mandatory.

Measure With EXPLAIN ANALYZE

Use measured execution details instead of assumptions.

```sql
EXPLAIN ANALYZE
SELECT DISTINCT city
FROM customers;

EXPLAIN ANALYZE
SELECT city
FROM customers
GROUP BY city;
```

Compare:

  • rows examined
  • temporary table usage
  • filesort behavior
  • actual execution time

A query whose temporary table spills to disk can be far slower than one that stays in memory, even if the SQL looks almost identical.
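
One rough way to check for spills is to compare MySQL's session counters for internal temporary tables before and after the query. This is a minimal sketch, assuming all three statements run in the same session; the customers table and city column are taken from the example above.

```sql
-- Snapshot the session counters for internal temporary tables
-- (matches Created_tmp_tables and Created_tmp_disk_tables).
SHOW SESSION STATUS LIKE 'Created_tmp%tables';

-- Run the query under test.
SELECT DISTINCT city
FROM customers;

-- Snapshot again. An increase in Created_tmp_disk_tables suggests the
-- query's internal temporary table was converted to an on-disk table.
SHOW SESSION STATUS LIKE 'Created_tmp%tables';
```

Treat the counters as a hint rather than proof: depending on the MySQL version and temporary-table engine settings, some on-disk usage may not be counted, and the SHOW STATUS statement itself can affect Created_tmp_tables.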

Indexing Has Bigger Impact

In practice, index quality dominates this comparison.

Good rules:

  • put the columns used for dedup or grouping first in the index
  • include filter columns in the same index when possible
  • prefer covering indexes for read-heavy reports

```sql
CREATE INDEX idx_sales_region_status_created
ON sales(region, status, created_at);

SELECT DISTINCT region, status
FROM sales
WHERE created_at >= '2026-01-01';
```

Well-designed indexes frequently produce larger gains than switching between DISTINCT and GROUP BY.
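
To confirm that an index is actually being used as a covering index, the plain EXPLAIN output is usually enough. The sketch below reuses the sales query from above and assumes the idx_sales_region_status_created index exists; what the Extra column reports will depend on your data and MySQL version.

```sql
EXPLAIN FORMAT=TRADITIONAL
SELECT DISTINCT region, status
FROM sales
WHERE created_at >= '2026-01-01';
-- In the Extra column:
--   "Using index"     means the query is answered from the index alone.
--   "Using temporary" means deduplication needed a temporary table.
```

If "Using temporary" appears for one form but not the other, that is a more useful signal than the keyword itself.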

Determinism and SQL Mode Concerns

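With GROUP BY, selecting columns that are neither in the GROUP BY clause nor wrapped in an aggregate can return an arbitrary value from each group when ONLY_FULL_GROUP_BY is disabled. Keep ONLY_FULL_GROUP_BY enabled in production (it has been part of the default sql_mode since MySQL 5.7.5) so that such queries are rejected instead.

For deterministic output, especially in reporting APIs, add an explicit ORDER BY with a tie-breaking column. Neither DISTINCT nor GROUP BY should be relied on for row order.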
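
As an illustration, the sketch below checks the active mode and contrasts an ambiguous query with a deterministic rewrite. It assumes the orders table from the earlier examples, that customer_id is not a unique key there, and that "latest order per customer" is the intended meaning; the last_order_at alias is made up for the example.

```sql
-- Check whether ONLY_FULL_GROUP_BY is part of the current mode.
SELECT @@SESSION.sql_mode;

-- Rejected under ONLY_FULL_GROUP_BY (error 1055): created_at is neither
-- grouped nor aggregated, so its value per customer would be arbitrary.
-- SELECT customer_id, created_at
-- FROM orders
-- GROUP BY customer_id;

-- Deterministic alternative: say which value is wanted, then order
-- explicitly with customer_id as the tie-breaker.
SELECT customer_id, MAX(created_at) AS last_order_at
FROM orders
GROUP BY customer_id
ORDER BY last_order_at DESC, customer_id;
```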

Benchmarking Pattern for Production Decisions

Use realistic benchmarking:

  1. Test on production-like row counts and cardinality.
  2. Run each query multiple times for warm and cold cache behavior.
  3. Record p50 and p95 latency.
  4. Store plan output with benchmark results.

This prevents optimization decisions based on toy datasets.
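
For the latency numbers, one option is the Performance Schema statement digest table, sketched below under the assumption that you are on MySQL 8.0 with the Performance Schema enabled. It exposes a 95th percentile per normalized statement but no median, so p50 would still have to come from client-side timing. The LIKE filter on customers is only an example; timer values are in picoseconds.

```sql
-- Per-statement latency summary for queries touching the customers table.
SELECT DIGEST_TEXT,
       COUNT_STAR                     AS executions,
       ROUND(AVG_TIMER_WAIT / 1e9, 3) AS avg_ms,   -- picoseconds to ms
       ROUND(QUANTILE_95 / 1e9, 3)    AS p95_ms
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME = DATABASE()
  AND DIGEST_TEXT LIKE '%customers%'
ORDER BY SUM_TIMER_WAIT DESC;
```

Saving this output next to the EXPLAIN ANALYZE plan for each run keeps the latency figures and the plan that produced them together.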

When GROUP BY Is Clearly Better

If reporting requirements include aggregates or post-aggregation filters (HAVING), GROUP BY is the correct choice and usually the only natural way to express the query. Forcing a DISTINCT query into an aggregate workflow often leads to subqueries that are harder to maintain and sometimes slower. Choosing the right semantic form early makes later feature growth easier and keeps SQL intent obvious for reviewers.
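
The contrast can be sketched with the orders table from earlier. The threshold of 1000 is arbitrary, and the two queries are only meant to return the same rows under the assumption that customer_id is never NULL.

```sql
-- Aggregate plus post-aggregation filter: a natural fit for GROUP BY.
SELECT customer_id, SUM(total_amount) AS total_spend
FROM orders
GROUP BY customer_id
HAVING SUM(total_amount) >= 1000;

-- The same report forced through DISTINCT needs a correlated subquery
-- and repeats the aggregate in the filter.
SELECT DISTINCT o.customer_id,
       (SELECT SUM(i.total_amount)
        FROM orders AS i
        WHERE i.customer_id = o.customer_id) AS total_spend
FROM orders AS o
WHERE (SELECT SUM(i.total_amount)
       FROM orders AS i
       WHERE i.customer_id = o.customer_id) >= 1000;
```

The second form is longer, states the aggregate twice, and gives the optimizer more work to untangle, which is the maintenance cost the paragraph above describes.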

Common Pitfalls

  • Picking syntax by habit instead of query intent.
  • Judging speed without EXPLAIN ANALYZE evidence.
  • Ignoring index strategy and blaming SQL keywords.
  • Writing ambiguous GROUP BY queries under lax SQL modes.
  • Benchmarking tiny local tables and extrapolating to production traffic.

Summary

  • Use DISTINCT for pure uniqueness and GROUP BY for grouped analytics.
  • For pure dedup, the optimizer often produces similar plans, so neither form is inherently faster.
  • Real performance differences usually come from indexing and data shape.
  • Validate choices with EXPLAIN ANALYZE on realistic workloads.
  • Prioritize semantic clarity first, then tune measured bottlenecks.
