Sharding Strategies for Multi-Tenant and Large-Scale Systems

Topics Covered

Introduction to Sharding in Large-Scale Multi-Tenant Systems

What makes a good shard key

Three core strategies

Sharding vs. partitioning

When NOT to shard

Hash-Based Sharding (Partition by Hash)

The mod-N problem and consistent hashing

Virtual nodes for even distribution

DynamoDB's adaptive splitting

Write amplification in hash-sharded systems

Handling shard failures

When to use hash sharding

Scatter-gather: the cost of cross-shard queries

Hash sharding and secondary indexes

Shard key immutability

Co-location: keeping related data together

Routing layer architecture

Monitoring hash distribution quality

Consistent hashing: a deeper look

Choosing the number of shards

Data locality and network topology

Range-Based Sharding (Partition by Key Range)

Mitigating hot ranges

Rebalancing range shards

Merging underutilized shards

Range sharding and TTL-based expiry

When to use range sharding

Compound range keys

Range sharding and read amplification

Range sharding and write amplification

Hot shard detection and automatic remediation

Directory-Based Sharding (Lookup Service or Dynamic Partitioning)

Multi-tenant isolation

Noisy neighbor isolation

The directory as a critical dependency

Geographic placement

When to use directory sharding

Directory sharding implementation patterns

Tenant migration workflow

Tenant-aware connection pooling

Cost analysis: directory vs. hash vs. range

Hybrid Sharding Approaches and Real-World Comparisons

MongoDB: hashed range chunks

Cassandra and DynamoDB: consistent hashing with virtual partitions

Vitess: range sharding with online resharding

Bigtable and HBase: range tablets with automatic split/merge

CockroachDB and YugabyteDB: NewSQL sharding

How to discuss sharding in interviews

Comparison table

Cross-shard transactions

Decision framework

Anti-patterns to avoid

Operational runbook essentials

Testing sharded systems

Capacity planning for sharded systems

The cost of getting it wrong

Sharding and data warehousing

Immutable data and append-only sharding

Conclusion

Cross-shard operations: the hidden cost

Monitoring and operational readiness

Schema migrations across shards

Summary decision table

Why does every large-scale system eventually shard? Because vertical scaling hits a wall. You can double a server's RAM from 256 GB to 512 GB, but no amount of upgrading gets you to 64 TB in one machine. At some point, your dataset or write throughput exceeds what any single machine can handle, and the only option is to split the data across multiple machines. That split is sharding.

Sharding (horizontal partitioning) divides a dataset into shards, where each shard holds a disjoint subset of the data and runs on its own server. A router inspects the shard key in every request and directs it to the correct shard. Unlike replication (which copies the same data to multiple nodes for availability), sharding distributes different data to different nodes for capacity.

The shard key is the single most important decision in sharding. It determines how data is distributed, which queries can be served by one shard versus many, and how painful future rebalancing will be. A poor shard key creates hot shards, cross-shard queries, and operational nightmares. A good shard key matches your dominant access pattern and distributes load evenly.

What makes a good shard key

A good shard key has three properties:

  • High cardinality: the key should have many distinct values so data can spread across many shards. A boolean column (true/false) can never support more than 2 shards; a UUID offers effectively unlimited distinct values.
  • Even distribution: the key values should occur with roughly equal frequency. Sharding by country_code puts 40% of data on the "US" shard if 40% of users are American.
  • Query alignment: the key should appear in the WHERE clause of your most common queries. If 90% of queries filter by user_id, shard by user_id so those queries hit one shard. Sharding by a key that your queries never filter on means every query becomes a scatter-gather.

Three core strategies

Three strategies exist for mapping keys to shards, and each trades off differently between even distribution, query locality, and operational flexibility.

Hash sharding computes hash(key) mod N to assign rows. It spreads data evenly but destroys key ordering, making range queries expensive. Modern systems replace naive mod-N with consistent hashing to avoid rehash storms during scaling. Best for point-lookup-heavy workloads like user profiles, sessions, and shopping carts.

Range sharding assigns contiguous key ranges to each shard. It preserves ordering, making time-range and alphabetical queries efficient, but risks hotspots when new data concentrates in a single range (like the "latest" time shard). Best for time-series, log aggregation, and ordered-scan workloads.

Directory sharding uses a lookup table that maps each key (or tenant) to a shard. It gives you maximum control, including the ability to isolate large tenants on dedicated shards, but introduces a lookup service that becomes a critical dependency. Best for multi-tenant SaaS with varying tenant sizes and compliance requirements.

Sharding vs. partitioning

The terms "sharding" and "partitioning" are often used interchangeably, but there is a nuance. Partitioning is the general concept of dividing data into subsets. Sharding specifically means distributing those partitions across multiple machines. PostgreSQL supports table partitioning (splitting a table into sub-tables) within a single server for query performance, but this is not sharding because all partitions live on one machine. Sharding adds the network dimension: each shard is on a separate server, introducing network latency, partial failures, and distributed coordination challenges that single-server partitioning avoids.

When NOT to shard

Sharding is irreversible complexity. Before sharding, consider whether simpler solutions suffice:

  • Read replicas handle read-heavy workloads by distributing reads across multiple copies of the full dataset.
  • Connection pooling (PgBouncer, ProxySQL) handles connection exhaustion without splitting data.
  • Vertical scaling (bigger machine) handles many workloads up to 1-2 TB of data.
  • Table partitioning (within one server) handles query performance issues on large tables.
  • Archiving old data to cold storage reduces the working dataset size.

Only shard when your dataset exceeds single-machine capacity, your write throughput exceeds single-node limits, or you need tenant-level isolation that replication cannot provide.

Key Insight

Sharding is a memory and storage problem before it is a throughput problem. A single PostgreSQL instance handles 50K+ transactions per second, but it cannot store 10 TB of data in 256 GB of RAM. When someone says they need sharding, ask whether they have exhausted read replicas and connection pooling first, because sharding adds permanent operational complexity that simpler solutions avoid.

Hash-Based Sharding (Partition by Hash)

Hash sharding computes shard = hash(key) % N and sends each record to the resulting shard. A good hash function distributes keys uniformly, so no single shard gets disproportionate data or traffic. The router is stateless: given the key and the hash function, any node can compute the destination without consulting a directory.
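The stateless router is small enough to sketch directly. This is a minimal illustration, not any particular database's implementation; MD5 stands in for a production hash like MurmurHash3, and is used here only because Python's built-in hash() is randomized per process and cannot be shared across routers:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Stateless routing: any node computes the same destination.
    MD5 is a stand-in for a production hash such as MurmurHash3;
    Python's built-in hash() is randomized per process and must
    not be used for routing."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Same key, same shard, from any router, with no shared state:
assert shard_for("user:42", 4) == shard_for("user:42", 4)
assert 0 <= shard_for("user:42", 4) < 4
```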

[Figure: consistent hashing ring showing key-to-shard mapping and minimal key movement when adding a new node]

Why hash sharding works well for point lookups. When your dominant query is "get user by user_id" or "get session by session_id", hash sharding routes each request to exactly one shard with zero fan-out. The hash distributes users randomly across shards, preventing any single shard from accumulating all the traffic. This makes it the default choice for user-facing databases where the primary access pattern is "fetch one record by its primary key."

What you sacrifice: range locality. Hashing destroys key ordering. Users 100 through 200 scatter across different shards, so a query like "find all users who signed up this week" must hit every shard (scatter-gather). If range queries are your dominant pattern, hash sharding is the wrong choice. Scatter-gather queries also increase tail latency because the overall response time equals the slowest shard's response time.

Cross-shard queries and joins. Beyond range scans, hash sharding complicates any query that needs data from multiple keys. A join between a user table (sharded by user_id) and an orders table (sharded by order_id) may require fetching from different shards for each row. This is why co-location matters: if you shard both tables by user_id, a user's orders and profile land on the same shard, enabling local joins.

The mod-N problem and consistent hashing

Naive hash(key) % N works until you change N. Adding a 5th shard to a 4-shard cluster changes the modulus from 4 to 5, and hash(key) % 4 equals hash(key) % 5 for only about 20% of keys. The other 80% must migrate: a rehash storm that can take hours and degrades performance during the move.

Consistent hashing solves this by placing both shards and keys on a circular hash ring. Each key maps to the next shard clockwise on the ring. When you add a shard, it claims only the arc between itself and its predecessor. Only the keys in that arc migrate, roughly 1/N of the total. Every other key stays on its current shard.

The mathematics are straightforward: with N shards on the ring, adding one shard moves approximately 1/(N+1) of the total keys. For a 10-shard cluster, adding an 11th shard moves about 9% of keys, compared to about 91% with naive mod-N. This bounded migration cost is what makes consistent hashing practical for production systems that scale incrementally.
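The migration cost of naive mod-N is easy to verify empirically. A small sketch (MD5 as a stand-in hash, 100,000 synthetic keys) shows roughly 80% of keys changing shards when N goes from 4 to 5:

```python
import hashlib

def shard_for(key: str, n: int) -> int:
    # Stable hash; MD5 stands in for a production hash function.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"mod-N rehash: {moved / len(keys):.0%} of keys move")  # roughly 80%
# Consistent hashing would move only ~1/(N+1) of keys instead:
# 20% for a 4 -> 5 change, about 9% for 10 -> 11.
```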

Virtual nodes for even distribution

Raw consistent hashing can still produce uneven distribution when you have few physical nodes. Three nodes on a ring may end up with arcs covering 50%, 30%, and 20% of the key space. Virtual nodes (vnodes) fix this by placing multiple points on the ring per physical node. Cassandra historically defaulted to 256 vnodes per node (newer releases default to 16), which makes the distribution nearly uniform.

When a new physical node joins, it adds its vnodes at intervals around the ring, stealing small key ranges from every existing node rather than one large range from a single neighbor. This spreads the migration cost evenly and produces balanced load from the start.

Vnodes also handle heterogeneous hardware. A machine with 2x the RAM and CPU can own 2x the vnodes, receiving proportionally more data. Without vnodes, you would need to run two logical nodes on one physical machine, which is operationally messier.
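A toy ring with vnodes makes the smoothing effect concrete. This is an illustrative sketch with hypothetical node names and MD5 as the hash, not a production implementation:

```python
import bisect
import hashlib
from collections import Counter

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=256):
        # Each physical node contributes `vnodes` points to the ring.
        self._points = sorted(
            (_h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # Walk clockwise: the first point past hash(key), wrapping at the top.
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
counts = Counter(ring.node_for(f"key-{i}") for i in range(30_000))
# With 256 vnodes per node, each of the three nodes ends up with
# roughly a third of the keys.
```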

DynamoDB's adaptive splitting

DynamoDB takes hash sharding further with adaptive capacity. When a partition approaches its throughput limit or its 10 GB size limit, DynamoDB automatically splits it. If a single key is hot, the "split for heat" mechanism duplicates the partition so multiple nodes can serve that key's traffic. This is transparent to the application.

This matters because consistent hashing alone cannot solve the single-key hotspot problem. If one key receives 1000x the traffic of others, it still maps to one shard regardless of how evenly the ring distributes other keys. Adaptive splitting addresses this by allowing the infrastructure to respond to traffic patterns, not just key distribution.

DynamoDB also introduced on-demand capacity mode in 2018, which eliminates the need to provision throughput per table. The system automatically scales partitions based on actual traffic. This is built on top of the same adaptive splitting mechanism: as traffic increases, DynamoDB splits partitions to absorb the load, and as traffic decreases, it can merge underutilized partitions. The lesson for system design interviews: managed services increasingly automate sharding decisions that you would have to make manually with self-hosted databases.

Write amplification in hash-sharded systems

Hash sharding introduces write amplification in systems that maintain secondary indexes. When you insert a row into a hash-sharded table, the primary row goes to the shard determined by the hash of the shard key. But if the table has a global secondary index (GSI), the index entry must be written to a potentially different shard determined by the indexed column's hash. This means one logical write becomes two or more physical writes across different shards.

For example, inserting a user record sharded by user_id with a GSI on email requires writing the user row to the shard owning that user_id's hash position, plus writing the index entry to the shard owning that email's hash position. If the user table has three GSIs, one insert becomes four distributed writes. This is why DynamoDB limits tables to 20 GSIs and why Cassandra recommends minimizing secondary indexes on large tables.
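The fan-out count can be sketched directly. This is a hypothetical table layout with MD5 routing, only meant to show how one logical insert with three GSIs becomes four shard writes, each routed by a different column's hash:

```python
import hashlib

NUM_SHARDS = 8  # assumed cluster size

def shard_for(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % NUM_SHARDS

def writes_for_insert(row: dict, shard_key: str, gsi_columns: list):
    """One logical insert: the base row goes to the shard-key's shard,
    plus one index-entry write per GSI, routed by that column's hash."""
    writes = [("base", shard_for(row[shard_key]))]
    for col in gsi_columns:
        writes.append((f"gsi:{col}", shard_for(row[col])))
    return writes

row = {"user_id": "u-123", "email": "ada@example.com",
       "country": "DE", "plan": "pro"}
writes = writes_for_insert(row, "user_id", ["email", "country", "plan"])
assert len(writes) == 4  # one insert, four distributed writes
```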

Handling shard failures

When a shard fails (hardware failure, network partition, or software crash), the impact depends on whether the shard is replicated:

  • Replicated shard: the primary fails, and a replica is promoted to primary. During promotion (typically 10-30 seconds), writes to that shard queue or fail. Reads can continue from other replicas if configured for eventual consistency. MongoDB replica sets, Cassandra's replication, and DynamoDB's multi-AZ replication all handle this automatically.
  • Unreplicated shard: the shard is completely unavailable until recovery. All reads and writes to keys on that shard fail. Data may be lost if the failure is permanent and backups are stale. This is unacceptable for production systems, which is why every shard should have at least one replica.
  • Partial failure: the shard is reachable but slow (e.g., disk degradation). This is harder to detect than a full failure. The routing layer should implement timeouts and circuit breakers to prevent slow shards from degrading the entire system's latency.

The routing layer must distinguish between "shard is down" (fail over or return error) and "shard is slow" (timeout and retry or fail fast). Most production systems use a combination of health checks, timeouts, and circuit breakers to handle both cases.

Interview Tip

When an interviewer asks about hash sharding, always mention the mod-N rehash problem and consistent hashing as the solution. This demonstrates that you understand why naive approaches fail at scale. The follow-up question is usually about virtual nodes, so be ready to explain how vnodes distribute the migration cost evenly across all existing nodes.

When to use hash sharding

Hash sharding fits when your workload is dominated by point lookups and updates across many keys (user profiles, sessions, shopping carts) and you can tolerate scatter-gather for the occasional range query. It pairs well with multi-tenant systems where you want to prevent one large tenant from monopolizing a shard by hashing the tenant_id to spread tenants across all shards.

Common hash functions used in production include MurmurHash3 (Cassandra), MD5 (MongoDB hashed shard keys), and xxHash (high performance, used in various internal systems). The specific function matters less than its uniformity: a good hash function maps similar inputs to dissimilar outputs, preventing clustering.

Scatter-gather: the cost of cross-shard queries

When hash sharding cannot answer a query from a single shard, the router must scatter the query to all N shards and gather the results. The performance implications are significant:

  • Latency equals the slowest shard's response time, not the average. One slow shard delays the entire query.
  • Resource usage scales linearly with shard count. A 20-shard cluster uses 20x the connections and CPU for one scatter-gather query.
  • Tail latency compounds. If each shard has a 1% chance of a 500ms spike, a 20-shard scatter-gather has an 18% chance of hitting at least one spike.

This is why shard key selection must align with the dominant query pattern. If your most common query can be answered by one shard, you avoid scatter-gather for the majority of traffic.
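The tail-latency arithmetic above generalizes to any per-shard spike probability, assuming shards spike independently:

```python
def p_any_spike(p_shard: float, n_shards: int) -> float:
    """Chance that at least one of n_shards hits a latency spike,
    assuming independent shards."""
    return 1 - (1 - p_shard) ** n_shards

# 1% spike chance per shard, 20-shard scatter-gather:
assert round(p_any_spike(0.01, 20), 2) == 0.18
```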

Hash sharding and secondary indexes

Hash sharding complicates secondary indexes. If you shard users by user_id but need to query by email, the email index must either be local or global:

Local secondary indexes exist on each shard and index only that shard's data. A query by email must scatter-gather across all shards because the email could belong to a user on any shard. MongoDB uses local secondary indexes by default, meaning every non-shard-key query fans out to all shards.

Global secondary indexes (GSIs) maintain a single index across all shards, typically in a separate service or set of partitions. DynamoDB supports GSIs that are maintained asynchronously: when you write to the base table, DynamoDB propagates the change to the GSI partitions in the background. This means GSI reads may be slightly stale (eventually consistent) but avoid scatter-gather entirely.

The choice between local and global indexes is a consistency-vs-latency trade-off. Local indexes are always consistent with the shard's data but require scatter-gather. Global indexes avoid scatter-gather but may return stale results. For most applications, the minor staleness of a GSI is acceptable in exchange for eliminating fan-out on secondary lookups.

Shard key immutability

In most sharded systems, the shard key value of a record cannot change after insertion. If you shard by user_id, you cannot update a user's user_id without deleting the record from the old shard and inserting it on the new shard. MongoDB enforces this constraint explicitly: the shard key field is immutable. DynamoDB's partition key is also immutable. This constraint exists because changing a shard key would require the system to atomically move the record between shards, which is a distributed transaction for every update, destroying performance.

The practical implication: choose a shard key that is stable over the lifetime of the record. UUIDs, auto-incrementing IDs, and tenant IDs work well. Mutable fields like username, email, or status are poor shard keys because they change and would require cross-shard moves.

Co-location: keeping related data together

The most powerful optimization in hash sharding is co-location: sharding related tables by the same key so they land on the same shard. In an e-commerce system, if you shard both the users table and the orders table by user_id, then a user and all their orders live on the same shard. Queries like "fetch user 123 with their last 10 orders" become single-shard operations with local joins. Without co-location, the user might be on shard A while their orders scatter across shards B, C, and D.

Citus (PostgreSQL extension for sharding) makes co-location explicit: you declare which tables share a distribution key, and Citus ensures matching rows land on the same worker node. MongoDB achieves this by sharding multiple collections with the same shard key. The pattern works when your entities have a clear parent-child relationship (user-orders, tenant-data) but breaks down for many-to-many relationships where co-location is impossible.

Routing layer architecture

The routing layer sits between clients and shards and makes the shard selection for each request. Three common architectures:

Client-side routing: the application computes the shard directly. The client library knows the hash function (or caches the range map) and connects to the correct shard. This avoids a network hop but requires every client to stay synchronized with the shard topology. Cassandra uses client-side routing: the driver discovers the cluster topology and routes queries directly to the owning node.

Proxy-based routing: a dedicated proxy (mongos in MongoDB, Vitess VTGate, ProxySQL) receives all queries, determines the correct shard, and forwards the request. This centralizes routing logic and simplifies client configuration, but the proxy becomes a throughput bottleneck and single point of failure unless it is horizontally scaled.

Coordinator-based routing: the client connects to any node, and that node redirects to the correct shard if needed. DynamoDB uses this approach: the client connects to the DynamoDB service endpoint, and internal routing directs the request to the correct partition. Redis Cluster uses a variant where nodes respond with MOVED redirections, telling the client to retry against the correct node.

The routing layer must also handle shard failures gracefully. When a shard is unreachable, the router can either return an error (fail-fast), retry after a timeout, or route to a replica if the shard is replicated. Most production systems combine sharding with per-shard replication: each shard has a primary and one or more replicas, and the router fails over to a replica if the primary is down. This means the routing layer must track not just the shard topology but also the primary/replica status of each shard.

Monitoring hash distribution quality

Even with a good hash function, real-world key distributions can create imbalance. Monitor the coefficient of variation (standard deviation divided by mean) of per-shard data sizes. A CV below 0.1 indicates excellent distribution; above 0.3 indicates significant skew that may require investigation.

Common causes of skew in hash-sharded systems:

  • Low cardinality shard keys: if you hash a tenant_id and one tenant has 100x more data than others, hash sharding distributes tenants evenly but not data.
  • Hash collisions at scale: while unlikely with good hash functions, billions of keys can produce clusters of collisions on specific ring positions.
  • Uneven vnode placement: if vnodes are placed deterministically rather than randomly, certain ring arcs may end up with more vnodes from one physical node.

Track per-shard QPS alongside per-shard data size. Even distribution of data does not guarantee even distribution of queries. A shard holding inactive users has equal data but far fewer queries than a shard holding power users.
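The coefficient-of-variation check is a one-liner with Python's statistics module; the size numbers here are illustrative:

```python
from statistics import mean, pstdev

def size_cv(shard_sizes):
    """Coefficient of variation of per-shard sizes (or per-shard QPS)."""
    return pstdev(shard_sizes) / mean(shard_sizes)

balanced = [100, 102, 98, 101]   # GB per shard
skewed = [100, 100, 100, 260]    # one shard holds a whale tenant

assert size_cv(balanced) < 0.1   # excellent distribution
assert size_cv(skewed) > 0.3     # significant skew: investigate
```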

Consistent hashing: a deeper look

The consistent hashing ring works with any hash function that produces uniformly distributed output. The ring is conceptually a circle from 0 to 2^32 - 1 (or 2^64 - 1 for larger key spaces). Each shard is placed on the ring at position hash(shard_id), and each key is placed at position hash(key). The key is assigned to the first shard found by walking clockwise from the key's position.

When you remove a shard, its keys redistribute to the next shard clockwise. When you add a shard, it takes keys from the shard that previously owned the arc where the new shard is placed. In both cases, only the keys in the affected arc move. Every other key stays on its current shard. This property, bounded key movement on topology changes, is what makes consistent hashing essential for production sharding.

The ring also supports replication by assigning each key to the next N shards clockwise (where N is the replication factor). This means each key has N copies on N different shards, providing fault tolerance. Cassandra uses this approach: with replication factor 3, each key is stored on three consecutive nodes on the ring.
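Replica placement can be sketched by continuing the clockwise walk past the first owner, skipping repeat vnodes of the same physical node. Node names and vnode counts here are assumptions for illustration:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def replicas_for(key: str, points, rf=3):
    """Walk clockwise from hash(key), collecting the first `rf`
    distinct physical nodes; repeat vnodes of a node are skipped."""
    hashes = [h for h, _ in points]
    i = bisect.bisect(hashes, _h(key))
    owners = []
    while len(owners) < rf:
        node = points[i % len(points)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners

nodes = ["node-a", "node-b", "node-c", "node-d"]
points = sorted((_h(f"{n}#{v}"), n) for n in nodes for v in range(8))
owners = replicas_for("order:99", points)
assert len(owners) == 3 and len(set(owners)) == 3
```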

Choosing the number of shards

How many shards should you start with? This depends on your current bottleneck and growth trajectory:

  • Minimum viable sharding: start with the smallest number of shards that solves your immediate problem. If your database is 2 TB and each server has 512 GB of RAM, you need at least 4 shards (plus headroom for growth). Starting with 4-8 shards is typical.
  • Growth factor: plan for 2-3x growth before the next resharding event. If you expect to double in 18 months, start with enough shards to handle 2x your current size.
  • Power-of-two sizing: some systems (Redis Cluster, Vitess) work more efficiently with power-of-two shard counts because splitting a shard doubles the count. Starting with 4, 8, or 16 shards makes future splits cleaner.
  • Upper bound: more than 100-200 shards per cluster creates operational overhead (monitoring, migrations, connection management) that often outweighs the benefits. If you need more, consider multiple independent clusters.

The mistake is over-sharding early. 100 shards for 50 GB of data means each shard holds 500 MB, which is too small to justify the operational complexity. Under-sharding is easier to fix (add shards with consistent hashing) than over-sharding is to unwind.
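The sizing rules above can be condensed into a small helper. The growth factor and power-of-two rounding follow the guidelines in this section; treat it as a starting point for discussion, not a formula:

```python
import math

def initial_shard_count(data_gb: float, per_shard_gb: float,
                        growth_factor: float = 2.0) -> int:
    """Smallest power-of-two shard count that fits growth_factor x
    the current data, so future splits double cleanly."""
    needed = max(1.0, (data_gb * growth_factor) / per_shard_gb)
    return 2 ** math.ceil(math.log2(needed))

# 2 TB today on 512 GB nodes, sized for 2x growth:
assert initial_shard_count(2048, 512) == 8
# Over-sharding check: 50 GB does not justify a large shard count.
assert initial_shard_count(50, 512) == 1
```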

Data locality and network topology

In multi-datacenter deployments, shard placement must account for network topology. Placing a shard's primary in US-East and its replicas in US-West and EU-West means:

  • Local reads from any region hit a nearby replica (low latency).
  • Writes go to the primary in US-East, incurring cross-region latency from other regions.
  • During a US-East outage, a replica in US-West can be promoted to primary.

Some systems support rack-aware or zone-aware placement. Cassandra's snitch configuration tells the cluster about the network topology so it can place replicas in different racks or availability zones. This prevents a single rack failure from losing all copies of a key. DynamoDB automatically distributes replicas across availability zones within a region.

For sharding specifically, you want to avoid placing all shards in one availability zone. If that zone fails, the entire sharded database is unavailable. Distribute shards (and their replicas) across at least 3 availability zones for production workloads.

Range-Based Sharding (Partition by Key Range)

Range sharding assigns contiguous key ranges to each shard. Orders from October go to Shard A, November to Shard B, December to Shard C. The key insight: adjacent keys live on the same shard, so range queries like "all orders this month" hit a single shard with no fan-out.

[Figure: range sharding showing hot shard detection and automatic splitting into two balanced shards]

Why range sharding excels at scans. Time-series databases, event logs, and analytics workloads live on range queries. If your most common query is "give me all events between timestamp T1 and T2", range sharding serves it from one shard (or a few adjacent shards) instead of scatter-gathering across every shard in the cluster. This locality also benefits sequential reads: when the storage engine reads contiguous keys from one shard, it can serve them from sequential disk pages or a single memtable scan, which is dramatically faster than random reads across multiple machines.

The hotspot problem. Range sharding's biggest weakness is predictable: when all new data has keys at the end of the range (like monotonically increasing timestamps), the "latest" shard absorbs all writes while older shards sit idle. A time-series system ingesting 100K events per second funnels every write to the current hour's shard. This is not a theoretical concern: it is the single most common failure mode for range-sharded systems and the first thing an interviewer will ask about.

The metadata dependency. Unlike hash sharding, where the routing function is self-contained, range sharding requires a range-to-shard map that every router must consult. This map must be consistent across all routers: if one router thinks keys 1000-2000 live on shard A while another thinks they live on shard B, writes will split between the two shards and reads will return incomplete results. Production systems store this map in a replicated metadata service (MongoDB's config servers, Bigtable's master) and cache it aggressively in each router.

The range map is typically small (one entry per shard boundary) and changes infrequently (only during splits, merges, or migrations). This makes it an excellent candidate for aggressive caching with event-driven invalidation. MongoDB's config servers store chunk metadata and push updates to mongos routers. Bigtable's master server maintains the tablet-to-server mapping and notifies clients of changes.
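Because the map is just a sorted list of shard boundaries, a router's lookup is a binary search. A minimal sketch with an assumed monthly layout and hypothetical shard names:

```python
import bisect

# Sorted upper bounds of each shard's key range (assumed monthly layout).
# Real systems keep this map in a replicated metadata service and cache it
# aggressively in each router.
BOUNDS = ["2026-02-01", "2026-03-01", "2026-04-01"]
SHARDS = ["shard-jan", "shard-feb", "shard-mar"]

def shard_for(key: str) -> str:
    # The first boundary strictly greater than the key owns it.
    i = bisect.bisect_right(BOUNDS, key)
    if i == len(SHARDS):
        raise KeyError(f"no shard covers key {key!r}")
    return SHARDS[i]

assert shard_for("2026-01-15") == "shard-jan"
assert shard_for("2026-03-06") == "shard-mar"
```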

Mitigating hot ranges

Pre-splitting creates empty shards for future time ranges before data arrives. Instead of one shard for "2026", create 12 shards (one per month) or 365 (one per day) at provisioning time. When January fills up, February's shard is already waiting. The cost is operational complexity: you must predict your time boundaries and create shards proactively, and if your prediction is wrong (traffic spikes in week 3 of February), you still need to split.

Hash-prefixed range keys add a small hash prefix to the key. Instead of 2026-03-06T14:30:00, store hash(event_id)[0:2] + 2026-03-06T14:30:00. This spreads writes across shards within the same time window while preserving enough ordering for coarse-grained range queries. Bigtable's documentation recommends this pattern for write-heavy time-series workloads. The trade-off is that exact-time range scans now require scanning multiple shards (one per hash prefix), though coarse-grained queries (all events in March) remain efficient.
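A sketch of the key construction, using a numeric bucket prefix rather than raw hash characters (an assumed variant of the pattern; the bucket count is a tuning knob between write spread and scan fan-out):

```python
import hashlib

PREFIXES = 4  # hash buckets: more buckets = more write spread,
              # but more ranges to scan per time window

def prefixed_key(event_id: str, timestamp: str) -> str:
    """Prepend a small hash bucket so concurrent writes in the
    same time window land on different shards."""
    bucket = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % PREFIXES
    return f"{bucket:02d}#{timestamp}"

# A coarse-grained scan now covers one range per bucket:
march_6_ranges = [f"{b:02d}#2026-03-06" for b in range(PREFIXES)]
assert len(march_6_ranges) == PREFIXES
```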

Automatic splitting monitors shard size and throughput, splitting a hot shard at a midpoint when it exceeds thresholds. Bigtable splits tablets automatically when they grow past a configurable size. MongoDB's balancer splits chunks and migrates them in the background. The advantage over pre-splitting is that the system adapts to actual traffic patterns rather than predicted ones.

Rebalancing range shards

Range rebalancing is heavier than hash rebalancing because you move contiguous blocks of data rather than scattered keys. The typical pattern is:

  1. Detect the hot or oversized shard via monitoring (size, QPS, latency)
  2. Choose a split point at the key midpoint or at a natural boundary (midnight, month start)
  3. Copy the upper half to a new shard in the background
  4. Stream ongoing writes to both shards during migration (dual-write)
  5. Atomic cutover by updating the range-to-shard map
  6. Clean up the copied data from the source shard after a safety window

Real-world systems automate this: Bigtable splits online in seconds (data sits on shared storage, so splitting is mostly a metadata operation), MongoDB's balancer copies chunks in the background with an atomic metadata flip, and Vitess streams binlog to catch up before an atomic routing switch.
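Step 5 above, the atomic metadata flip, reduces to inserting one boundary into the range map. A sketch, representing the map as sorted (upper_bound, shard) pairs; names are illustrative:

```python
def split_shard(range_map, split_key, new_shard):
    """The atomic metadata flip in miniature. After the upper half
    has been copied and dual-writes have caught up, one boundary
    insert moves the upper half of the range to the new shard."""
    for i, (bound, shard) in enumerate(range_map):
        if split_key < bound:
            return (range_map[:i]
                    + [(split_key, shard), (bound, new_shard)]
                    + range_map[i + 1:])
    raise KeyError("split key falls outside every range")

old = [("m", "shard-1"), ("z", "shard-2")]
new = split_shard(old, "f", "shard-3")
# Keys below "f" stay on shard-1; keys in ["f", "m") move to shard-3.
assert new == [("f", "shard-1"), ("m", "shard-3"), ("z", "shard-2")]
```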

Merging underutilized shards

Splitting gets all the attention, but merging matters too. After a promotional event ends, the shards that handled peak traffic may be mostly empty. Keeping hundreds of barely-used shards wastes resources and complicates operations:

  • Each shard consumes a database connection, monitoring endpoint, and backup slot regardless of utilization.
  • More shards mean more scatter-gather targets for cross-shard queries.
  • Operational tasks (schema migrations, index builds) take longer with more shards.

Bigtable merges small adjacent tablets automatically. MongoDB supports manual chunk merging. The merge process is the reverse of splitting: copy data from one shard into its neighbor, update the range map, retire the empty shard. Merging is typically less urgent than splitting (empty shards waste money but do not cause outages), so most systems handle it during off-peak maintenance windows.

Range sharding and TTL-based expiry

Time-based range sharding pairs naturally with data retention policies. If you shard by month and your retention policy is 12 months, dropping an expired month's data is a single-shard operation: detach the shard, archive it to cold storage if needed, then delete. Compare this to hash sharding, where expired records are scattered across all shards and must be deleted row by row from each shard. This is why time-series databases (InfluxDB, TimescaleDB) use time-based range partitioning by default: the most common maintenance operation (dropping old data) becomes trivially cheap.

When to use range sharding

Range sharding fits when your dominant queries are time-range scans, ordered pagination, or geographic-prefix lookups. It is the natural choice for time-series data, log aggregation, and any workload where "give me the last N" is the most common query. Avoid it for workloads with monotonically increasing keys unless you pre-split or hash-prefix the keys.

A useful rule of thumb: if your most important query has a BETWEEN or greater-than/less-than clause on the shard key, range sharding saves you from scatter-gather. If your most important query has an equality clause (WHERE user_id = ?), hash sharding is simpler and avoids the hotspot risk entirely.

Compound range keys

You can combine dimensions in the range key to achieve both distribution and locality. For example, sharding IoT sensor data by (region, timestamp) keeps all readings from a region together (enabling regional range scans) while distributing across regions. Within each region-shard, data is ordered by timestamp for efficient time-range queries. The trade-off is that cross-region queries still require scatter-gather.

Another common pattern is (tenant_id, created_at) for multi-tenant systems with time-based queries. All of a tenant's data lives in one key range, and within that range, records are sorted by creation time. This enables efficient queries like "all orders for tenant X in March" without scatter-gather, while keeping tenant data isolated.

Bigtable's row key design is essentially compound range sharding. The recommended pattern is to structure row keys as tenant#timestamp#event_id, which creates per-tenant sorted ranges. All of a tenant's data is stored together (co-located), sorted by timestamp within the tenant's range, and uniquely identified by event_id. Scans like "all events for tenant X between T1 and T2" read a single contiguous range.
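A sketch of that key layout using plain Python strings (illustrative names, not a Bigtable client API):

```python
def row_key(tenant, ts_millis, event_id):
    # Zero-pad the timestamp so lexicographic order matches numeric order.
    return f"{tenant}#{ts_millis:013d}#{event_id}"

def scan_range(tenant, t1, t2):
    # Start/end keys for "all events for tenant between t1 and t2":
    # one contiguous range, so the scan never leaves the tenant's keys.
    return row_key(tenant, t1, ""), row_key(tenant, t2 + 1, "")

start, end = scan_range("acme", 1700000000000, 1700003600000)
keys = [row_key("acme", 1700001800000, "e1"),
        row_key("globex", 1700001800000, "e2")]
in_range = [k for k in keys if start <= k < end]   # only acme's event
```

Because the tenant is the key prefix, the range [start, end) contains acme's events and nothing from any other tenant.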

Range sharding and read amplification

Range sharding can cause read amplification when queries span multiple ranges. A query for "all orders from Q4 2025" might span 3 monthly shards (October, November, December). The router sends the query to all 3 shards, each shard scans its range, and the router merges the results. This is more efficient than scatter-gather across all shards (only 3 out of 12 shards are touched), but still involves 3 network round trips and 3 merge operations.

To minimize read amplification, align shard boundaries with common query boundaries. If most queries filter by month, make each shard hold exactly one month. If queries filter by week, use weekly shards. The goal is to make the most common query span exactly one shard.
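With monthly shards, computing the scatter set for a date range is a small enumeration (the 'YYYY-MM' shard naming here is illustrative):

```python
def monthly_shards(start, end):
    # Return the 'YYYY-MM' shard names touched by the inclusive month
    # range [start, end], where each bound is a (year, month) tuple.
    y, m = start
    out = []
    while (y, m) <= end:
        out.append(f"{y:04d}-{m:02d}")
        m += 1
        if m == 13:
            y, m = y + 1, 1
    return out

q4 = monthly_shards((2025, 10), (2025, 12))   # exactly 3 of the 12 shards
```

A Q4 query fans out to ['2025-10', '2025-11', '2025-12'] and touches nothing else.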

Range sharding and write amplification

Range sharding with ordered storage (like B-trees) can suffer write amplification during splits. When a range shard splits, the underlying B-tree must be physically divided, which may involve rewriting internal pages. LSM-tree-based storage engines (used by Cassandra, RocksDB, Bigtable) handle this more gracefully because splits can be done at the file level without rewriting existing data. This is one reason why LSM-based systems dominate the range-sharded database landscape.

Hot shard detection and automatic remediation

Detecting hot shards requires real-time monitoring of per-shard metrics. The typical approach:

  1. Collect per-shard QPS, latency percentiles, and CPU utilization at 10-second intervals.
  2. Compute skew as the ratio of the busiest shard's QPS to the average QPS. A ratio above 2.0 indicates a hot shard.
  3. Alert when skew exceeds the threshold for more than 5 minutes (avoiding false positives from transient spikes).
  4. Remediate by either splitting the hot shard (range sharding), adding a cache layer for the hot keys (hash sharding), or moving the hot tenant to a dedicated shard (directory sharding).

Some systems automate this loop entirely. DynamoDB's adaptive capacity detects hot partitions and splits them without operator intervention. Bigtable's master server monitors tablet load and triggers splits automatically. For systems without built-in auto-splitting, operators build custom monitoring dashboards and runbooks for manual remediation.
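The skew computation in step 2 is a few lines. This sketch uses the 2.0 threshold from the steps above; the per-shard QPS numbers are assumed to come from your metrics pipeline:

```python
def skew_ratio(qps_by_shard):
    # Ratio of the busiest shard's QPS to the cluster average (step 2).
    avg = sum(qps_by_shard.values()) / len(qps_by_shard)
    hottest = max(qps_by_shard, key=qps_by_shard.get)
    return hottest, qps_by_shard[hottest] / avg

shard, ratio = skew_ratio({"s1": 900, "s2": 1100, "s3": 5000})
hot = ratio > 2.0   # s3 runs at more than twice the cluster average
```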

Directory-Based Sharding (Lookup Service or Dynamic Partitioning)

Directory-based sharding replaces algorithmic routing with a lookup table. Instead of computing hash(key) % N or checking key ranges, the router queries a directory service: "Where does tenant_id=acme live?" The directory returns "shard 3", and the router forwards the request.

[Figure: directory-based sharding, showing a lookup table that maps tenant IDs to shard numbers and supports dynamic reassignment]

Why use a lookup table instead of a formula? Because formulas cannot express arbitrary placement decisions. With hash or range sharding, the shard assignment is deterministic: the key alone determines the shard. With directory sharding, an operator can place any key on any shard, enabling use cases that algorithmic sharding cannot support.
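The contrast is easy to see in a few lines. The dict below stands in for a replicated directory service, and crc32 is just one illustrative hash function:

```python
import zlib

def hash_route(key, n_shards):
    # Formula: the key alone determines the shard; no per-key control.
    return zlib.crc32(key.encode()) % n_shards

directory = {"acme": 3, "globex": 1}   # operator-controlled placement

def directory_route(tenant):
    # Lookup: any tenant can be placed on, or moved to, any shard.
    return directory[tenant]

directory["acme"] = 7   # moving a tenant is a single metadata update
```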

Multi-tenant isolation

Directory sharding's primary use case is tenant-level isolation in SaaS platforms. A large enterprise customer (generating 40% of your revenue) demands dedicated infrastructure for compliance or performance reasons. With directory sharding, you assign that tenant to a dedicated shard while smaller tenants share pooled shards. No formula change, no data migration for other tenants.

Moving a tenant between shards is a metadata update plus a data migration: copy the tenant's data to the new shard, dual-write during the transition, flip the directory entry, then clean up the source. The operation affects only one tenant and is invisible to everyone else.

Noisy neighbor isolation

In a shared shard, one tenant's heavy batch job can degrade performance for all co-located tenants. This is the "noisy neighbor" problem, and it is the primary driver for enterprise customers demanding dedicated shards. With directory sharding, you can move the noisy tenant to its own shard without affecting anyone else. You can also set per-shard resource limits (connection pools, IOPS quotas) to contain the blast radius of any single tenant's workload.

The directory as a critical dependency

The lookup table itself becomes infrastructure you must protect. Every request starts with a directory lookup, so the directory's availability and latency directly affect your entire system. Strategies to protect it:

  • Cache aggressively. Shard assignments change rarely (days or weeks between moves). Cache the directory in each application server's memory with a TTL of minutes. A cache miss costs one directory lookup, not one per request.
  • Replicate the directory. Store it in a replicated database (PostgreSQL with streaming replication, etcd, or ZooKeeper). Reads go to local replicas, writes go to the leader.
  • Fail-open with stale data. If the directory is unreachable, serve from the stale cache. A tenant that moved shards 5 minutes ago will get stale reads temporarily, which is better than total unavailability.
  • Version the directory. Include a version number in each directory entry. When a tenant's shard assignment changes, increment the version. Application servers can detect stale cache entries by comparing their cached version against the latest, enabling fast cache invalidation without TTL-based polling.
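A sketch of the versioned-entry idea from the last bullet, with an authority dict standing in for the replicated directory (in production the latest version would come from a cheap heartbeat or change feed, not a full lookup per request):

```python
class DirectoryCache:
    def __init__(self, authority):
        self.authority = authority   # source of truth: tenant -> (shard, version)
        self.cache = {}

    def lookup(self, tenant):
        shard, version = self.cache.get(tenant, (None, -1))
        latest = self.authority[tenant][1]
        if version < latest:                     # stale or missing: refresh
            shard, version = self.authority[tenant]
            self.cache[tenant] = (shard, version)
        return shard

auth = {"acme": (3, 1)}
dc = DirectoryCache(auth)
dc.lookup("acme")           # caches (shard 3, version 1)
auth["acme"] = (7, 2)       # tenant moved; version bumped
shard = dc.lookup("acme")   # version mismatch forces a refresh
```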

Geographic placement

Directory sharding naturally supports geographic data placement. A European tenant's data can be assigned to a shard in the EU region for GDPR compliance, while a US tenant's data stays on US shards. With hash or range sharding, you cannot control which geographic region holds a specific tenant's data without changing the hash function or range boundaries for everyone.

This is particularly important for regulations like GDPR (EU), CCPA (California), and PIPL (China) that require personal data to stay within specific geographic boundaries. The directory approach handles this cleanly:

 
tenant_id=acme-eu  -> shard_id=eu-west-1-shard-3
tenant_id=acme-us  -> shard_id=us-east-1-shard-7
tenant_id=acme-cn  -> shard_id=cn-north-1-shard-1

Each entry explicitly controls where data resides, and an auditor can verify compliance by inspecting the directory. Hash sharding cannot provide this guarantee because the hash function determines placement, not the operator.

When to use directory sharding

Directory sharding fits when you need per-tenant control (isolation, compliance, geographic placement), when tenant sizes vary by orders of magnitude (one tenant has 100x the data of others), or when you need to move tenants between shards without algorithmic changes. The operational cost is maintaining the directory service as critical infrastructure. In practice, many SaaS platforms start with hash or range sharding for simplicity and migrate to directory sharding when they sign their first enterprise customer who demands dedicated infrastructure.

Directory sharding implementation patterns

The directory can be implemented at several granularities:

  • Per-key mapping: every individual key maps to a shard. Maximum flexibility, but the directory size equals the dataset size. Only practical for low-cardinality keys like tenant_id (thousands of entries, not billions).
  • Per-range mapping: the directory maps key ranges to shards, similar to range sharding but with manual control over boundaries. The directory is small (one entry per range) but loses per-key placement flexibility.
  • Per-tenant mapping: the most common pattern in SaaS. Each tenant maps to a shard, and all data for that tenant lives on its assigned shard. The directory size equals the number of tenants, typically thousands to millions of entries, easily cacheable.

For the per-tenant pattern, the directory table is simple:

CREATE TABLE shard_directory (
  tenant_id   VARCHAR(64) PRIMARY KEY,
  shard_id    INT NOT NULL,
  updated_at  TIMESTAMP DEFAULT NOW()
);

At 50,000 tenants with 100 bytes per row, the entire directory fits in 5 MB of memory. This is small enough to cache in every application server, making the directory lookup effectively free for the vast majority of requests.

Tenant migration workflow

Moving a tenant between shards is a multi-step process that must maintain data integrity:

  1. Mark the tenant as migrating in the directory. During this phase, reads can come from either shard.
  2. Copy existing data from the source shard to the destination shard using a background job.
  3. Enable dual-write so new writes go to both shards. This ensures no data is lost during the copy phase.
  4. Verify consistency by comparing record counts and checksums between the source and destination.
  5. Atomic cutover by updating the directory entry to point to the new shard. New reads and writes go to the destination.
  6. Drain the source by keeping the old data for a safety window (e.g., 24-48 hours) in case rollback is needed.
  7. Clean up by deleting the tenant's data from the source shard after the safety window.

The entire process should be automated with rollback capability at each step. A failed verification in step 4 should automatically revert to the source shard without manual intervention.
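The copy/verify/cutover core of steps 2, 4, and 5 can be sketched with dict-backed stores (the dual-write of step 3 would wrap the application's write path during the copy and is omitted here):

```python
def migrate_tenant(tenant, directory, shards, dst):
    src = directory[tenant]
    shards[dst].update(shards[src])            # step 2: background copy
    if not (shards[src].items() <= shards[dst].items()):
        return False                           # step 4 failed: keep source
    directory[tenant] = dst                    # step 5: atomic cutover
    return True                                # steps 6-7 (drain) come later

directory = {"acme": "s1"}
shards = {"s1": {"order:1": "widget"}, "s2": {}}
ok = migrate_tenant("acme", directory, shards, "s2")
```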

Tenant-aware connection pooling

With directory sharding, each application server may need connections to many different shards. If you have 20 shards and 50 application servers, each application server needs a connection pool to all 20 shards, totaling 1,000 connection pools across the fleet. This can exhaust database connection limits quickly.

The solution is tenant-aware lazy connection pooling: each application server only opens connections to shards it actually needs. If server A handles tenants assigned to shards 3 and 7, it only maintains connection pools to those two shards. The directory lookup determines which shard a request needs, and the connection pool for that shard is created on first use and cached. This dramatically reduces the total number of connections compared to every server connecting to every shard.

For even better connection efficiency, use a connection proxy like PgBouncer (PostgreSQL) or ProxySQL (MySQL) on each shard that multiplexes many client connections into fewer database connections. This allows hundreds of application servers to share a small pool of actual database connections per shard.
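A lazy pool map is only a few lines; connect below is a stand-in for whatever driver or PgBouncer endpoint you actually use:

```python
class LazyShardPools:
    def __init__(self, connect):
        self.connect = connect
        self.pools = {}               # shard_id -> pool, created on first use

    def pool_for(self, shard_id):
        if shard_id not in self.pools:
            self.pools[shard_id] = self.connect(shard_id)
        return self.pools[shard_id]

pools = LazyShardPools(connect=lambda sid: f"pool:{sid}")
pools.pool_for(3); pools.pool_for(3); pools.pool_for(7)
open_pools = len(pools.pools)         # 2, not one per shard in the cluster
```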

Cost analysis: directory vs. hash vs. range

  • Infrastructure. Hash: N shard servers. Range: N shard servers. Directory: N shard servers plus a directory service (replicated DB or etcd).
  • Engineering effort. Hash: low (plug in a hash function). Range: medium (monitor splits, manage the range map). Directory: high (build and maintain the directory, caching, failover).
  • Operational burden. Hash: low (consistent hashing handles growth). Range: medium (split/merge operations, hot shard detection). Directory: high (tenant migrations, directory availability).
  • Query routing. Hash: free (computed in-process). Range: cheap (cached range map lookup). Directory: cheap (cached directory lookup, but the directory must be available).
  • Rebalancing. Hash: automatic with vnodes. Range: semi-automatic (split/merge with manual triggers). Directory: manual (operator-initiated tenant moves).

For startups, hash sharding with consistent hashing provides the best cost-to-complexity ratio. For enterprise SaaS platforms with compliance requirements, directory sharding's higher cost is justified by the tenant isolation it enables.

Hybrid Sharding Approaches and Real-World Comparisons

Production systems rarely use pure hash, range, or directory sharding in isolation. Most combine strategies to address multiple access patterns, and the specific combination reveals the system's priorities. Understanding how real databases implement sharding helps you make informed choices and gives you concrete examples to reference in interviews.

MongoDB: hashed range chunks

MongoDB shards collections using either range or hashed shard keys. With a hashed shard key, MongoDB hashes the key value and then range-partitions the hash space into chunks. The config server tracks chunk boundaries, and the balancer automatically migrates chunks between shards when sizes diverge. This hybrid gives you hash-like distribution with range-like chunk management. The trade-off: range queries on the original key still require scatter-gather because the hash destroys ordering.

MongoDB also supports zone sharding (formerly tag-aware sharding), which adds a directory-like overlay. You tag shards with zones (e.g., "US-East", "EU-West") and associate key ranges with zones. This lets you pin data to specific geographic regions while still using hash or range distribution within each zone. It is the closest MongoDB gets to directory-based tenant isolation.

Cassandra and DynamoDB: consistent hashing with virtual partitions

Cassandra places each partition key on a consistent hash ring with 256 virtual nodes per physical node. The partition key determines the token (ring position), and the row lands on the first node clockwise. Adding a node means the new node's vnodes steal small ranges from every existing node, producing balanced migration. DynamoDB operates similarly but hides the ring from operators, automatically splitting partitions on size or throughput thresholds.
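A generic ring-with-vnodes sketch (far fewer vnodes than Cassandra's 256 for readability, and MD5 just as an illustrative token function; this is not either system's implementation):

```python
import bisect
import hashlib

def token(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=8):
    # Each physical node owns `vnodes` positions scattered around the ring.
    return sorted((token(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))

def owner(ring, key):
    # The row lands on the first vnode clockwise from the key's token.
    tokens = [t for t, _ in ring]
    return ring[bisect.bisect_right(tokens, token(key)) % len(ring)][1]

ring = build_ring(["n1", "n2", "n3"])
node = owner(ring, "tenant:acme")   # deterministic placement on the ring
```

Adding a fourth node inserts its vnodes between existing positions, so it takes over small slices from every node rather than half of one node's range.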

Both systems support compound partition keys. In Cassandra, you can define PRIMARY KEY ((tenant_id, date), event_id) so that all events for a tenant on a given date land on the same partition. This is a form of co-location: the compound key controls which data is stored together, enabling efficient queries within one partition while distributing across partitions.

Vitess: range sharding with online resharding

Vitess (MySQL sharding middleware) uses range-based keyspaces but supports online resharding. To split a shard, Vitess creates target shards, copies data in the background, streams the MySQL binlog to catch up with real-time changes, then performs an atomic routing switch. No downtime, but temporary extra IO during the copy phase. The shard key is often a hash of the primary key, combining hash distribution with range management.

Bigtable and HBase: range tablets with automatic split/merge

Bigtable stores sorted rows in range tablets. The master server monitors tablet size and splits large tablets automatically. Because data sits on shared storage (GFS/Colossus), splitting is a metadata operation that takes seconds. HBase follows the same model with HDFS as the storage layer. Both systems excel at sequential scans of sorted data, making them natural fits for time-series, analytics, and log processing workloads.

The key architectural insight of Bigtable is the separation of storage and compute. Tablet servers handle queries but do not own the data files. This means:

  • Splitting a tablet just updates metadata (which tablet server owns which key range). No data is copied.
  • A failed tablet server can be replaced immediately because another server can pick up its tablets by reading from shared storage.
  • Compaction and garbage collection happen on the storage layer, not on the tablet servers.

This architecture is why Bigtable (and its open-source cousin HBase) can split and merge tablets in seconds, while traditional sharded databases take minutes to hours for the same operation.

CockroachDB and YugabyteDB: NewSQL sharding

NewSQL databases like CockroachDB and YugabyteDB combine range-based sharding with distributed transactions. CockroachDB automatically splits ranges when they grow past 512 MB and rebalances them across nodes. It supports distributed transactions across ranges using a variant of 2PC with a parallel commit optimization. YugabyteDB offers both hash and range sharding for different tables within the same cluster. These systems aim to provide the scalability of sharding with the ACID guarantees of a single-node database, though at the cost of higher write latency from distributed consensus.

How to discuss sharding in interviews

When an interviewer asks about sharding, structure your response around these points:

  1. Start with why: explain what bottleneck sharding solves (dataset size, write throughput, tenant isolation). Show that you would not shard prematurely.
  2. Choose the shard key: analyze the dominant access pattern and pick a shard key that minimizes cross-shard queries. Explain the trade-offs.
  3. Pick the strategy: hash for point lookups, range for scans, directory for tenant isolation. Explain why this strategy fits the access pattern.
  4. Address rebalancing: describe how you would add capacity without downtime. Mention consistent hashing, vnodes, or online resharding depending on the strategy.
  5. Acknowledge trade-offs: cross-shard queries, distributed transactions, operational complexity. Show that you understand the costs.

Interviewers value candidates who can articulate trade-offs over candidates who simply know the mechanics. Saying "hash sharding destroys range locality, which means our analytics queries would need a separate read-optimized store" shows deeper understanding than "we use consistent hashing."

[Figure: virtual node rebalancing, showing how a new physical node steals small ranges from all existing nodes on the hash ring]

Comparison table

  • MongoDB. Strategy: hashed range chunks. Rebalancing: auto balancer migrates chunks. Range queries: scatter-gather on hashed keys. Isolation: zones for geographic placement.
  • Cassandra. Strategy: consistent hash + vnodes. Rebalancing: vnodes steal ranges on join. Range queries: partition-key lookups only. Isolation: virtual datacenters.
  • DynamoDB. Strategy: hash + auto-split. Rebalancing: transparent split on heat/size. Range queries: partition + sort key scans. Isolation: per-table capacity.
  • Vitess. Strategy: range (often on a hashed key). Rebalancing: online resharding via binlog. Range queries: efficient on unhashed keys. Isolation: keyspace isolation.
  • Bigtable. Strategy: range tablets. Rebalancing: auto split/merge on size. Range queries: efficient row-range scans. Isolation: per-table clustering.

Common Pitfall

Do not pick a sharding strategy based on the technology alone. MongoDB supports both range and hashed shard keys, Vitess supports hashing the range key, and DynamoDB auto-manages everything. The right question is not 'which database?' but 'what is my dominant access pattern?' Point lookups favor hash, range scans favor range, tenant isolation favors directory. The database is a tool that implements your strategy, not a strategy itself.

Cross-shard transactions

Sharding breaks single-node ACID transactions. When a transfer moves money from an account on shard A to an account on shard B, you need a distributed transaction. The two main approaches:

Two-phase commit (2PC): A coordinator asks both shards to prepare, and if both agree, commits. If either fails, both abort. 2PC guarantees atomicity but adds latency (two round trips) and reduces availability (either shard's failure blocks the transaction). Google Spanner uses 2PC with TrueTime for globally consistent transactions, but this requires specialized hardware (GPS and atomic clocks) that most systems do not have.

Saga pattern: Execute the transaction as a sequence of local transactions with compensating actions. Debit shard A, then credit shard B. If the credit fails, run a compensating transaction to refund shard A. Sagas trade atomicity for availability: there is a brief window where shard A is debited but shard B is not yet credited. For many use cases (order processing, inventory updates), this temporary inconsistency is acceptable.

The practical lesson: minimize cross-shard transactions through shard key design. Every cross-shard operation is slower, more complex, and more failure-prone than a single-shard operation. If your shard key co-locates data that participates in the same transactions, you avoid distributed coordination entirely.
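The transfer saga can be sketched with dict-backed shards; the KeyError here stands in for any failure of the second local transaction:

```python
def transfer(shards, src, dst, acct_a, acct_b, amount):
    shards[src][acct_a] -= amount          # local transaction on shard A
    try:
        shards[dst][acct_b] += amount      # local transaction on shard B
    except KeyError:
        shards[src][acct_a] += amount      # compensating refund on shard A
        return False
    return True

shards = {"A": {"alice": 100}, "B": {"bob": 50}}
ok = transfer(shards, "A", "B", "alice", "bob", 30)      # succeeds: 70 / 80
bad = transfer(shards, "A", "B", "alice", "ghost", 30)   # refunded: still 70
```

Note the window inside the try block: shard A is already debited before shard B is credited, which is exactly the temporary inconsistency sagas accept.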

Decision framework

When evaluating which strategy to use, start with these questions:

  1. What is the dominant query pattern? If 90% of queries are point lookups by ID, hash sharding eliminates fan-out. If 90% are time-range scans, range sharding avoids scatter-gather.
  2. How skewed is the data? If one tenant or key generates 100x the traffic of others, directory sharding (or split-for-heat) handles the imbalance. Hash sharding assumes uniform distribution.
  3. How often will you rebalance? If you add nodes monthly, consistent hashing with vnodes minimizes migration. If you rebalance annually, even manual range splitting is tolerable.
  4. Do you need tenant isolation? If compliance or SLA requirements demand dedicated infrastructure for specific tenants, directory sharding is the only strategy that supports arbitrary placement.
  5. What cross-shard operations exist? If you need transactions or joins across entities, consider whether the shard key can co-locate related data. If not, plan for saga patterns or separate read-optimized stores.

Anti-patterns to avoid

  • Sharding by a low-cardinality key like country or status. You get at most a handful of shards, and one shard dominates. A status field with values (active, inactive, suspended) gives you 3 shards with wildly different sizes.
  • Changing the shard key after production data exists. This is a full data migration under live traffic. Get the shard key right before launch. If you must change it, expect the migration to take weeks and require a dedicated engineering team.
  • Ignoring cross-shard query cost. If 50% of your queries scatter-gather, you have effectively no sharding benefit for those queries and pay the overhead anyway. Measure your scatter-gather percentage before and after sharding.
  • Over-sharding. Starting with 100 shards for 10 GB of data wastes resources and adds complexity. Start with the minimum number of shards that addresses your bottleneck (often 4-8) and grow incrementally.
  • Assuming even distribution means even load. Equal data per shard does not mean equal queries per shard. A shard holding celebrity users receives far more reads than a shard holding inactive accounts. Monitor both storage and traffic metrics per shard.
  • Neglecting the routing layer. The routing layer is a critical component that must be highly available, horizontally scalable, and consistent with the shard topology. A routing layer outage takes down the entire sharded system.

Operational runbook essentials

Every sharded system needs documented procedures for these common operations:

  • Adding a shard: how to provision the new shard, update the topology, trigger data migration, and verify balance.
  • Removing a shard: how to drain data from the shard, verify it is empty, and decommission the hardware.
  • Splitting a hot shard: how to detect the hotspot, choose the split point, execute the split, and verify even distribution.
  • Handling a shard failure: how to fail over to replicas, detect data loss, and restore from backup if needed.
  • Running schema migrations: how to apply DDL changes across all shards without downtime (typically rolling migrations with backward-compatible schemas).
  • Performing backups: how to create consistent cross-shard backups for point-in-time recovery.

Without these runbooks, operational incidents on a sharded system take 3-5x longer to resolve because the on-call engineer must figure out the procedure under pressure.

Testing sharded systems

Testing a sharded system requires strategies that single-database testing does not:

  • Shard boundary testing: verify that records near shard boundaries (e.g., the key at the exact split point) route correctly. Off-by-one errors in range boundaries cause data to land on the wrong shard.
  • Cross-shard query testing: verify that scatter-gather queries return complete results. A common bug is missing results from one shard due to a timeout or connection error.
  • Rebalancing testing: simulate adding and removing shards in a staging environment. Verify that data migrates correctly and no records are lost or duplicated.
  • Failure testing: kill individual shards and verify that the system degrades gracefully (returns errors for affected data) rather than failing entirely.
  • Load testing per shard: generate traffic that specifically targets one shard to validate per-shard capacity limits and auto-scaling behavior.
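The boundary case from the first bullet deserves a dedicated unit test. A sketch against a toy range router, assuming the convention that a split point belongs to the right-hand shard:

```python
import bisect

def route(split_points, shards, key):
    # split_points are the lower bounds of every shard after the first.
    return shards[bisect.bisect_right(split_points, key)]

splits, shards = ["m"], ["s1", "s2"]    # ["", "m") -> s1, ["m", ...) -> s2

def test_boundaries():
    assert route(splits, shards, "l") == "s1"
    assert route(splits, shards, "m") == "s2"   # the exact split point
    assert route(splits, shards, "z") == "s2"

test_boundaries()
```

Using bisect_left instead of bisect_right here would send the split-point key to s1, the classic off-by-one the text warns about.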

Many teams run a staging environment with the same number of shards as production but with smaller instances. This catches sharding-specific bugs (routing errors, cross-shard join failures, migration issues) that cannot be found on a single-node development database.

Capacity planning for sharded systems

Capacity planning for a sharded system requires tracking per-shard growth rates:

  • Data growth rate: how fast is each shard's storage growing? If shard 3 grows at 50 GB/month and the server has 500 GB remaining, you have 10 months before it needs splitting.
  • Query growth rate: how fast is per-shard QPS increasing? A shard handling 5K QPS today growing at 20% month-over-month will hit 10K QPS in 4 months.
  • Connection growth rate: as the application fleet scales, each shard receives more connections. Monitor connection utilization against the database's max_connections setting.
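The first two bullets are simple arithmetic; a sketch that turns them into headroom-in-months, assuming linear growth for storage and compound growth for QPS:

```python
import math

def months_of_storage(headroom_gb, growth_gb_per_month):
    # Linear growth: 500 GB remaining at 50 GB/month leaves 10 months.
    return headroom_gb / growth_gb_per_month

def months_to_qps(target, current, monthly_rate):
    # Compound growth: solve current * (1 + r)^n = target for n.
    return math.log(target / current) / math.log(1 + monthly_rate)

storage_months = months_of_storage(500, 50)       # 10.0
qps_months = months_to_qps(10_000, 5_000, 0.20)   # ~3.8, i.e. about 4 months
```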

Plan resharding events 2-3 months before projected capacity limits. Resharding under pressure (when a shard is already at capacity) is risky because the migration itself adds load to an already-stressed shard. Pre-emptive resharding during low-traffic periods (late night, weekends) is safer and gives you a rollback window.

The cost of getting it wrong

The consequences of poor sharding decisions compound over time:

  • Wrong shard key: every query that does not include the shard key becomes a scatter-gather, and changing the shard key requires a full data migration. This is the most expensive mistake because it affects every query path.
  • Too few shards: if you start with 2 shards and need 8, you must split 3 times. Each split is a migration event with associated risk and downtime potential.
  • Too many shards: 100 shards for 10 GB means each shard holds 100 MB. The overhead of managing 100 connection pools, monitoring endpoints, and backup jobs dwarfs the benefit of distributing such a small dataset.
  • No co-location: if related data (users and orders) lives on different shards, every join becomes a cross-shard operation. Retrofitting co-location requires migrating one or both tables.
  • No monitoring: without per-shard metrics, you discover hot shards when users complain, not when automated alerts fire. By then, the shard may be severely degraded.

The best defense against these mistakes is a thorough analysis of access patterns before choosing the shard key and strategy. Spend a week analyzing query logs, not an hour in a whiteboard session. The shard key decision is permanent and affects every aspect of the system's operation, performance, and growth trajectory.

Sharding and data warehousing

Sharded OLTP databases are optimized for transactional workloads: point lookups, small updates, and short transactions. Analytics queries (aggregations across millions of rows, multi-table joins, ad-hoc exploration) perform poorly on sharded systems because they scatter-gather across every shard.

The standard solution is a dual-store architecture: the sharded OLTP database handles real-time transactions, and a separate data warehouse (BigQuery, Redshift, Snowflake, ClickHouse) handles analytics. A CDC (Change Data Capture) pipeline streams changes from the sharded database to the warehouse in near real-time. The warehouse is not sharded in the same way; it uses columnar storage and MPP (massively parallel processing) to handle large scans efficiently.

This separation is not optional for large-scale systems. Attempting to run analytics on a sharded OLTP database causes scatter-gather storms that degrade transactional performance for all users. Keep OLTP and analytics workloads on separate infrastructure with separate scaling models.

Immutable data and append-only sharding

Some workloads consist entirely of immutable data: event logs, audit trails, sensor readings, financial transactions. For these workloads, sharding is simpler because:

  • No updates: data is written once and never modified, eliminating cross-shard update coordination.
  • No deletes (or only TTL-based expiry): data grows monotonically, making capacity planning predictable.
  • Append-only writes: new data always goes to the "latest" shard (or a small set of active shards), simplifying the write path.
  • Historical data is cold: older shards receive only reads, not writes, so they can be served from cheaper storage or smaller instances.

Time-based range sharding is the natural fit for immutable data. Each time period gets its own shard, old shards become read-only and can be compressed or moved to cold storage, and the "latest" shard handles all writes. This pattern is used by InfluxDB, TimescaleDB, and most log aggregation systems.

Sharding is not a feature you turn on. It is a permanent architectural decision that shapes your data model, query patterns, operational runbook, and failure modes for the life of the system. The three strategies (hash, range, and directory) are not competing options but complementary tools that trade off distribution, locality, and control.

Hash sharding gives you even distribution and stateless routing at the cost of range locality. Use it when point lookups dominate and you can tolerate scatter-gather for occasional range queries. Always pair it with consistent hashing and virtual nodes to avoid rehash storms during scaling.

Range sharding preserves key ordering for efficient range scans at the cost of potential hotspots. Use it for time-series, log aggregation, and ordered pagination workloads. Mitigate hotspots with pre-splitting, hash-prefixed keys, or automatic splitting.

Directory sharding gives you arbitrary placement control at the cost of maintaining a lookup service as critical infrastructure. Use it when tenant isolation, compliance requirements, or extreme data skew demand per-key placement decisions.

In practice, most production systems combine strategies. MongoDB hashes keys then range-manages chunks. Cassandra uses consistent hashing with vnodes. Vitess range-shards on hashed keys with online resharding. The right approach starts with your access pattern, not with a technology choice.

The shard key decision is the one you will live with longest. Choose it by analyzing your dominant query pattern, data distribution, and growth trajectory, because changing the shard key after data is distributed is the most expensive migration you can do.

Cross-shard operations: the hidden cost

Sharding introduces coordination overhead for any operation that spans multiple shards:

  • Cross-shard transactions require two-phase commit (2PC) or saga patterns, both slower and more complex than single-shard transactions.
  • Cross-shard joins require scatter-gather, where the router queries all shards, collects results, and merges them in memory.
  • Aggregation queries (COUNT, SUM, AVG) across the full dataset must scatter-gather, with each shard computing partial results that the coordinator merges.
  • Schema migrations must be coordinated across all shards, often requiring rolling updates with backward-compatible schemas.
  • Backups become per-shard operations that must be coordinated for point-in-time consistency across the cluster.

The shard key determines which operations stay single-shard (fast) and which become cross-shard (slow). This is why shard key selection matters more than any other sharding decision.
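The scatter-gather aggregation above has a subtlety worth making concrete: the partial results each shard returns must be mergeable. For AVG, each shard must return (sum, count), not its local average. A minimal sketch:

```python
def shard_partial(rows):
    """Each shard returns (sum, count) -- enough to rebuild a global AVG."""
    return (sum(rows), len(rows))

def merge_avg(partials):
    """Coordinator merges per-shard partials into the true global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else 0.0

shards = [[10, 20], [30, 40, 50]]          # rows living on two shards
partials = [shard_partial(rows) for rows in shards]
assert merge_avg(partials) == 30.0
# Averaging the per-shard averages (15 and 40) would give 27.5 -- wrong,
# because the shards hold different row counts.
```

The same principle applies to any distributive or algebraic aggregate: SUM and COUNT merge trivially, AVG needs (sum, count), and percentiles need sketches or full value sets, which is why cross-shard p99 queries are so much more expensive.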

Monitoring and operational readiness

A sharded system requires monitoring that a single-node database does not:

  • Storage size per shard: detect imbalance before it causes performance degradation.
  • Query latency per shard (p50, p95, p99): identify hot shards that need splitting.
  • Write throughput per shard: detect write hotspots from skewed access patterns.
  • Cross-shard query percentage: if more than 20% of queries scatter-gather, the shard key may not match the actual access pattern.
  • Replication lag per shard: each shard may have its own replicas with independent lag.
  • Connection count per shard: detect connection pool exhaustion on individual shards.

Alert on skew: if one shard's p99 latency is 3x the cluster average, it is either hot or oversized. These metrics tell you when to split, merge, or rebalance before users notice degradation.
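The skew alert reduces to a simple check over per-shard metrics. A sketch, with the 3x threshold from above as a parameter:

```python
def skewed_shards(p99_by_shard: dict, factor: float = 3.0) -> list:
    """Flag shards whose p99 latency exceeds `factor` x the cluster average."""
    avg = sum(p99_by_shard.values()) / len(p99_by_shard)
    return [s for s, p99 in sorted(p99_by_shard.items()) if p99 > factor * avg]

metrics = {"shard-1": 12.0, "shard-2": 14.0, "shard-3": 11.0, "shard-4": 120.0}
# Cluster average is 39.25 ms; 3x is 117.75 ms, so only shard-4 trips.
assert skewed_shards(metrics) == ["shard-4"]
```

One caveat: a single extreme outlier inflates the average it is compared against, so very hot shards can mask moderately hot ones; comparing against the median of the other shards is a common refinement.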

Schema migrations across shards

Schema migrations on a sharded database are one of the most operationally challenging tasks. On a single database, ALTER TABLE ADD COLUMN is a single operation. On 20 shards, it is 20 operations that must be coordinated. The standard approach is rolling migrations:

  1. Deploy application code that handles both the old and new schema (backward-compatible).
  2. Apply the migration to shards one at a time, with a pause between each to verify no errors.
  3. Once all shards are migrated, deploy the final application code that uses the new schema exclusively.
  4. Clean up any backward-compatibility code.

This process takes hours or days instead of minutes, but it avoids downtime. Tools like gh-ost (GitHub's online schema-migration tool for MySQL) and pt-online-schema-change (Percona) automate the per-shard migration with minimal locking. Vitess supports online DDL across its sharded keyspaces. The key principle: never apply a breaking schema change to all shards simultaneously, because a failure on one shard during migration must not cascade to others.
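The one-shard-at-a-time loop with a halt-on-failure guard can be sketched as follows. The callables (`apply_ddl`, `healthy`, `pause`) are hypothetical stand-ins for your migration tool, health check, and verification wait:

```python
def rolling_migrate(shards, apply_ddl, healthy, pause=lambda: None):
    """Apply a backward-compatible DDL shard by shard, stopping at the
    first unhealthy shard so a failure never cascades to the rest."""
    migrated = []
    for shard in shards:
        apply_ddl(shard)          # e.g. invoke gh-ost against this shard
        pause()                   # e.g. sleep and check dashboards
        if not healthy(shard):
            return migrated, shard  # halt: remaining shards untouched
        migrated.append(shard)
    return migrated, None

# Simulated run where shard s2 fails its post-migration health check:
applied = []
migrated, failed = rolling_migrate(
    ["s1", "s2", "s3"],
    apply_ddl=applied.append,
    healthy=lambda s: s != "s2",
)
assert applied == ["s1", "s2"]     # s3 was never touched
assert (migrated, failed) == (["s1"], "s2")
```

Because the application already tolerates both schemas (step 1), a halted migration leaves the cluster in a mixed but fully functional state while you investigate the failed shard.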

Summary decision table

Criterion                Hash                  Range                            Directory
Point lookups            Single shard          Single shard (if key in range)   Single shard (after lookup)
Range scans              Scatter-gather        Single shard                     Scatter-gather
Even distribution        By default            Requires monitoring              Manual control
Rebalancing cost         Low (vnodes)          Medium (split/merge)             Low (metadata update)
Tenant isolation         Not supported         Not supported                    Full control
Routing overhead         None (computed)       Map lookup (cached)              Directory lookup (cached)
Operational complexity   Low                   Medium                           High (directory service)