
Token Bucket Rate Limiting: Bursts, Refills, and the Redis Trap

March 15, 2026

Token bucket is the rate limiter running underneath many of the API gateways you have used. The mental model is two numbers:

Bucket capacity B, the maximum burst size.
Refill rate R, the long-run average requests per second.

Tokens accumulate in the bucket at rate R until it holds B tokens. Each incoming request consumes one token. When the bucket is empty, you either reject the request with 429 Too Many Requests or queue it. That is the whole algorithm.

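Why this shape and not the naive alternative? A fixed-window counter (allow N per second, reset every second) fails at boundaries. A client can send N requests at second 0.999 and another N at 1.001. That is 2N requests inside a 2ms span, double the limit for any one-second window that straddles the boundary. Token bucket closes that gap because tokens refill continuously instead of resetting at a boundary: no window of length t ever admits more than B + R*t requests.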
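The boundary problem is easy to reproduce. The sketch below is a hypothetical fixed-window counter (the class name and the limit of 100 per second are made up for illustration) fed exactly the traffic pattern described above.

```python
from collections import defaultdict


class FixedWindowCounter:
    """Allow `limit` requests per window; the count resets at each boundary."""

    def __init__(self, limit, window=1.0):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # window index -> requests accepted

    def allow(self, now):
        index = int(now // self.window)
        if self.counts[index] < self.limit:
            self.counts[index] += 1
            return True
        return False


limiter = FixedWindowCounter(limit=100)
before = sum(limiter.allow(0.999) for _ in range(100))  # end of window 0
after = sum(limiter.allow(1.001) for _ in range(100))   # start of window 1
accepted = before + after
print(accepted)  # 200 requests accepted within 2ms, against a limit of 100/s
```

Both batches are accepted because each lands in a different window, even though they are 2ms apart.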

Token bucket also tolerates natural bursts. Real user traffic is bursty: someone opens an app, fires off five API calls in 200ms, then sits idle for 30 seconds. A leaky bucket, which smooths output to a fixed rate, would queue those five calls and release them one at a time, so each call waits behind the ones ahead of it. Token bucket lets the burst through and only rate-limits sustained abuse.
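To put rough numbers on that latency, here is a small sketch of a leaky bucket used as a queue. The function name, the arrival times, and the drain rate of 2 requests per second are assumptions chosen for illustration, not measurements.

```python
def leaky_bucket_delays(arrivals, rate):
    """Queueing delay per request when output is paced at `rate` per second."""
    interval = 1.0 / rate
    next_free = 0.0  # earliest time the next request may leave the queue
    delays = []
    for t in sorted(arrivals):
        depart = max(t, next_free)
        delays.append(depart - t)
        next_free = depart + interval
    return delays


# Five calls in 200ms, drained at an assumed 2 requests per second.
burst = [0.00, 0.05, 0.10, 0.15, 0.20]
delays = leaky_bucket_delays(burst, rate=2.0)
print([round(d, 2) for d in delays])  # [0.0, 0.45, 0.9, 1.35, 1.8]
```

The first call leaves immediately and the fifth waits 1.8 seconds. A token bucket with B of at least 5 would admit all five with no queueing delay.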

Implementation is where things get interesting. In a single process, you store (tokens, last_refill_timestamp) and update on each request. Trivial.
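A minimal single-process sketch of that (tokens, last_refill_timestamp) state, in Python. The class name, the allow method, and the injectable now argument (there so the example is deterministic) are illustrative choices, not taken from any particular gateway.

```python
import time
from typing import Optional


class TokenBucket:
    """Token bucket with capacity B and refill rate R tokens per second."""

    def __init__(self, capacity: float, rate: float, now: Optional[float] = None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)  # start full so an idle client can burst
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Refill lazily for the elapsed time, then try to spend one token."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(capacity=5, rate=2, now=0.0)
burst = [bucket.allow(now=0.0) for _ in range(6)]  # five pass, the sixth fails
later = bucket.allow(now=0.5)                      # 0.5s * 2/s = one new token
```

There is no background refill thread: the refill is computed from the elapsed time on each call, which is why two fields are enough.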

Across a fleet of 100 API servers, you need a shared view, or different servers will let through different fractions of the same client's traffic. The textbook answer is Redis with a Lua script that atomically reads the bucket, computes refill since last access, decrements, and returns the result. One round trip, no race conditions.
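A sketch of what such a script might look like, wrapped in Python. The key layout (a hash with tokens and ts fields), the expiry policy, and the take_token helper are assumptions for illustration. The client is assumed to be a redis-py client, whose eval(script, numkeys, *keys_and_args) method sends the script to Redis. The multi-field HSET form needs Redis 4.0 or later.

```python
import time

# Redis runs a Lua script atomically, so the read, refill, and decrement
# below cannot interleave with another request for the same key.
TOKEN_BUCKET_LUA = """
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])

local state  = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(state[1]) or capacity  -- a new bucket starts full
local ts     = tonumber(state[2]) or now

-- max(0, ...) guards against a caller whose clock is slightly behind
tokens = math.min(capacity, tokens + math.max(0, now - ts) * rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tostring(tokens), 'ts', tostring(now))
-- drop idle buckets once they would have refilled completely anyway
redis.call('EXPIRE', KEYS[1], math.ceil(capacity / rate) + 1)
return allowed
"""


def take_token(client, key, capacity, rate, now=None):
    """Return True if the request is allowed. One round trip to Redis."""
    now = time.time() if now is None else now
    return client.eval(TOKEN_BUCKET_LUA, 1, key, capacity, rate, now) == 1
```

Passing the timestamp from the caller means clock skew between gateway servers leaks into the refill calculation. That is a trade-off of this sketch, not a recommendation.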

Here is where the textbook breaks. At 500K RPS across the fleet, every request now crosses the network to a single Redis instance. That Redis is your new single point of failure. CPU pegs. Latency climbs. The thing you added to protect the backend becomes the thing taking it down.

The production failure I worked through: Redis hit 95 percent CPU during a marketing campaign. P99 on the rate limit call went from 1ms to 80ms. The gateway's per-request latency budget had no headroom for that, so requests timed out and clients got 5xx responses. The downstream services we were protecting sat at 20 percent CPU, completely fine.

The fix is hybrid. Each gateway pod keeps a local bucket sized for its share of traffic plus the burst. Once a second, the pod reconciles with Redis: pushes its local consumption, pulls global state, and adjusts. You accept some over-limit slop during the reconciliation window in exchange for surviving a Redis outage without taking down the API.
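One way the hybrid could look, as a rough sketch only. The reconciliation rule here (stop admitting requests while the global balance is at or below zero) is an assumption, since the exact adjustment policy is a design choice. SharedBucket is an in-memory stand-in for the state that would live in Redis, and all names are hypothetical.

```python
class SharedBucket:
    """In-memory stand-in for the global bucket that would live in Redis."""

    def __init__(self, capacity, rate, now=0.0):
        self.capacity, self.rate = capacity, rate
        self.tokens, self.ts = float(capacity), now

    def report(self, consumed, now):
        """Refill, subtract one pod's consumption, return the global balance."""
        elapsed = max(0.0, now - self.ts)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.ts = now
        self.tokens -= consumed  # may go negative: the fleet overspent
        return self.tokens


class HybridLimiter:
    """Local token bucket on the hot path, reconciled with shared state."""

    def __init__(self, shared, pods, local_burst, now=0.0):
        self.shared = shared
        self.rate = shared.rate / pods  # this pod's share of R
        self.capacity = float(local_burst)
        self.tokens, self.ts = self.capacity, now
        self.consumed = 0       # tokens spent since the last successful sync
        self.throttled = False  # True while the fleet is over its budget

    def allow(self, now):
        """Decide from local state only; no network call per request."""
        if self.throttled:
            return False
        elapsed = max(0.0, now - self.ts)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.ts = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            self.consumed += 1
            return True
        return False

    def reconcile(self, now):
        """Call about once a second: push consumption, pull the balance."""
        try:
            balance = self.shared.report(self.consumed, now)
        except Exception:
            return  # store unreachable: keep serving from the local bucket
        self.consumed = 0
        self.throttled = balance <= 0
```

With two pods that each allow a local burst of 10 against a global B of 10, the fleet can admit 20 requests before the first reconciliation. That is the over-limit slop. If report raises, the pod keeps its unsent count and carries on with its local share, so a Redis outage degrades accuracy instead of availability.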

Key takeaway

Token bucket allows bursts up to B and a steady rate of R. Centralizing it in Redis makes that Redis your new single point of failure. Local buckets with periodic reconciliation trade exactness for survivability.

Originally posted on LinkedIn.