Distributed Query Execution: One Slow Shard Owns Your P99

May 16, 2026

A SELECT that touches a single row on a single node is a few microseconds of work. The same shape of query running across forty shards is a small distributed system in its own right. The plan looks innocent from the outside. Inside, six things have to happen in order.

The coordinator parses the SQL and builds a plan. It pushes down predicates so each shard receives a query that only scans rows belonging to its partition. Each shard runs the scan in parallel. If the query has a join on the partition key, the join also runs shard-local, and partial results flow back. If the join is on something other than the partition key, the planner inserts a shuffle: either broadcast the smaller side to every shard, or repartition both sides on the join key so matching rows meet on the same node. The coordinator merges, sorts, and aggregates the partial results, then returns the final answer to the client.

The cost model that catches people is the tail. Query latency is roughly the max of shard latency, not the average. If thirty-nine shards finish in 80 milliseconds and one finishes in 800, the user waits 800. A single hot shard, a single noisy neighbor, a single GC pause on one node, and your P99 is gone. Hedged requests help on read replicas. Skew in the partitioning function does not, and that is usually the real problem.

A team I reviewed had a reporting query that joined orders to users and grouped by region. Both tables were sharded by user_id, but the join was on user_id, so far so good. The reports ran at a consistent eight seconds. They concluded the system was bottlenecked on per-shard CPU and doubled the shard count, expecting roughly four-second queries. Instead the queries got slower. Two effects compounded. The data was already skewed: one shard owned 80% of the active users because of a marketing campaign, and that shard's runtime had not changed. And every additional shard added a network exchange and a larger broadcast for a separate lookup table the planner kept replicating. More shards meant more fanout cost on top of the same slow shard.

The fix was partition design, not capacity. They rekeyed the hot user segment across multiple synthetic shards using a salted partition key for that subset, and denormalized the small lookup table into the orders rows so the broadcast disappeared. Query time dropped to 600 milliseconds and stayed there under load. Sharding scales scans. Joins, aggregations, and skew are still your problem.

Key takeaway

A distributed query's latency equals the slowest shard's latency, not the average. Adding shards helps scans but multiplies network exchange cost for joins. Co-locate join keys at partition design time or you will pay forever at runtime.

Originally posted on LinkedIn. View original.