Rate Limiting: Concepts, Algorithms, and Best Practices

Topics Covered

Introduction to Rate Limiting and Its Importance

What Rate Limiting Protects Against

How Rate Limiting Works (High Level)

Rate Limit Response Headers

Rate Limiting Algorithms: Token Bucket vs Leaky Bucket

Token Bucket

Leaky Bucket

Comparison

Concept of Quotas and Time Windowing

Fixed Window

Sliding Window Log

Sliding Window Counter (Hybrid)

Comparison

Fairness Mechanisms: Equitable Distribution and Preventing Abuse

Per-Key Rate Limiting

Tiered Rate Limits

Weighted Rate Limiting

Preventing Abuse Patterns

Implementation Considerations: Local vs Distributed and Tools

Local (In-Memory) Rate Limiting

Distributed Rate Limiting with Redis

Race Conditions in Distributed Rate Limiting

Production Tools

Real-World Examples of Rate Limiting

API Gateway Rate Limiting (Stripe, GitHub, Twitter)

DDoS Mitigation (Cloudflare, AWS Shield)

Microservice-to-Microservice Rate Limiting

Rate Limiting in Message Queues

Why do you need rate limiting? Because without it, a single misbehaving client can take down your entire service. Rate limiting controls how many requests a client can make within a time window, protecting your infrastructure from overload, abuse, and noisy-neighbor problems.

[Figure: API gateway rate limiting request flow showing allow and reject paths]

What Rate Limiting Protects Against

  • Denial-of-service attacks — A single client sending millions of requests can saturate your servers. Rate limiting caps each client's throughput so no single source can overwhelm the system.
  • Noisy neighbors — In multi-tenant systems, one tenant's traffic spike can degrade performance for everyone else. Per-tenant rate limits isolate blast radius.
  • Cascading failures — Without rate limits, a traffic spike propagates from the API layer to the database, cache, and downstream services. Rate limiting at the edge prevents the spike from reaching fragile backends.
  • Cost control — Cloud services charge per request. A runaway script or misconfigured client can generate enormous bills overnight. Rate limits act as a financial circuit breaker.

How Rate Limiting Works (High Level)

A rate limiter sits between clients and your service — typically in an API gateway, reverse proxy, or middleware. For each incoming request, it:

  1. Identifies the client (by API key, IP address, user ID, or token)
  2. Checks the client's current request count against their allowed limit
  3. If under the limit: forwards the request and increments the counter
  4. If over the limit: returns HTTP 429 Too Many Requests with a Retry-After header
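The four steps above can be sketched as a single check. This is a minimal fixed-window sketch; `check_and_increment` and the in-memory store are illustrative names, not from any framework:

```python
import time

# Hypothetical in-memory store and limits; names are illustrative
_counters = {}                 # client_key -> (window_start, count)
LIMIT, WINDOW = 100, 60        # 100 requests per 60-second window

def check_and_increment(client_key, now=None):
    """Steps 2-4 above: check the counter, then increment or reject."""
    now = time.time() if now is None else now
    window = int(now // WINDOW)
    start, count = _counters.get(client_key, (window, 0))
    if start != window:        # a new window has begun: reset the count
        start, count = window, 0
    if count >= LIMIT:         # step 4: over the limit -> 429
        return False, 429
    _counters[client_key] = (start, count + 1)
    return True, 200           # step 3: forward and count the request
```

A real gateway would perform step 1 (extracting the API key, IP, or user ID) before calling a check like this.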

Rate Limit Response Headers

Well-designed APIs communicate rate limit state to clients via headers:

 
http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1709726400

  • X-RateLimit-Limit — Maximum requests allowed in the current window
  • X-RateLimit-Remaining — Requests remaining before throttling
  • X-RateLimit-Reset — Unix timestamp when the window resets
  • Retry-After — Seconds (or date) the client should wait before retrying

These headers let well-behaved clients self-throttle before hitting 429s, reducing wasted traffic for both sides.
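A well-behaved client can act on these headers directly. A minimal sketch, where `send` and `respectful_request` are illustrative names and the `Retry-After` value is assumed to be in its seconds form (it may also be an HTTP date):

```python
import time

def respectful_request(send, max_sleep=60):
    """Issue a request via `send` (any callable returning (status, headers))
    and back off once using Retry-After when a 429 comes back."""
    status, headers = send()
    if status == 429:
        # Assumes the seconds form of Retry-After; cap the wait defensively
        wait = min(int(headers.get("Retry-After", "1")), max_sleep)
        time.sleep(wait)
        status, headers = send()
    return status
```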

Key Insight

Rate limiting is not just about blocking bad actors. Its primary value in production is protecting your own infrastructure from your own users. A popular feature launch, a viral moment, or a misconfigured batch job from a legitimate customer can generate more traffic than any attacker. Rate limits ensure graceful degradation instead of cascading failure.

The two most fundamental rate limiting algorithms are the token bucket and the leaky bucket. Both use a "bucket" metaphor but behave differently under bursty traffic — and that difference determines which one fits your use case.

[Figure: Token bucket algorithm showing tokens filling and draining with burst]

Token Bucket

The token bucket works like a prepaid balance. Tokens are added at a fixed rate (the refill rate R). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity C that limits how many tokens can accumulate — this determines the burst size.

python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # Maximum tokens (burst size)
        self.refill_rate = refill_rate  # Tokens added per second
        self.tokens = capacity          # Start full
        self.last_refill = time.time()

    def allow(self):
        now = time.time()
        # Add tokens for elapsed time
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True   # Request allowed
        return False      # Rate limited

# 100 requests/min with burst of 20
bucket = TokenBucket(capacity=20, refill_rate=100/60)

The key insight: token bucket allows bursts up to capacity C, then enforces the average rate R. A client can send 20 requests instantly (consuming all tokens), then must wait for tokens to refill at roughly 1.67/second. This burst-then-steady behavior matches how real users interact with APIs — they often send a batch of requests, then go quiet.
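To see the burst-then-steady behavior concretely, here is a small variant of the class above that takes the clock as a parameter, so the timing is explicit rather than wall-clock:

```python
class TokenBucketSim:
    """Same logic as TokenBucket above, but the caller supplies the clock."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = 0.0

    def allow(self, now):
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucketSim(capacity=20, refill_rate=100 / 60)
burst = sum(bucket.allow(now=0.0) for _ in range(25))  # 25 requests at t=0
# Only the first 20 succeed; afterwards the bucket refills at ~1.67 tokens/sec,
# so the next request is admitted roughly 0.6 seconds later.
```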

Leaky Bucket

The leaky bucket works like a queue with a fixed drain rate. Requests enter the bucket (queue). The bucket processes requests at a constant rate R. If the queue is full (capacity C), new requests are dropped.

python
import time
from collections import deque

class LeakyBucket:
    def __init__(self, capacity, drain_rate):
        self.capacity = capacity      # Maximum queue size
        self.drain_rate = drain_rate  # Requests processed per second
        self.queue = deque()
        self.last_drain = time.time()

    def allow(self):
        now = time.time()
        # Drain requests processed since the last check
        elapsed = now - self.last_drain
        drain_count = int(elapsed * self.drain_rate)
        for _ in range(min(drain_count, len(self.queue))):
            self.queue.popleft()
        if drain_count > 0:
            # Advance by whole drained intervals so fractional time isn't lost
            self.last_drain += drain_count / self.drain_rate

        if len(self.queue) < self.capacity:
            self.queue.append(now)
            return True   # Request queued (allowed)
        return False      # Queue full (rejected)

The leaky bucket produces a perfectly smooth output rate regardless of input burstiness. This is ideal for backends that cannot tolerate traffic spikes — legacy databases, payment processors, or external APIs with strict rate limits of their own.

Comparison

Property        | Token Bucket                    | Leaky Bucket
----------------|---------------------------------|---------------------------------
Burst behavior  | Allows bursts up to capacity C  | No bursts — constant drain rate
Output rate     | Variable (bursty then steady)   | Constant (smooth)
Queue behavior  | No queue (reject immediately)   | Queue up to capacity, then reject
Best for        | User-facing APIs (better UX)    | Backend protection (predictable load)
Implementation  | Counter + timestamp             | Queue + drain timer

Interview Tip

In interviews, lead with the trade-off. Token bucket is for user-facing APIs where burst tolerance improves UX (a page load triggers 10 requests at once). Leaky bucket is for protecting rate-sensitive backends (a payment gateway that processes exactly 50 transactions per second). Most production API gateways use token bucket because user experience matters more than perfectly smooth traffic.

Token bucket and leaky bucket define how requests flow, but they do not answer the question that product teams actually ask: "How many requests can customer X make per minute/hour/day?" Quotas and time windows map business limits to technical enforcement.

[Figure: Sliding window versus fixed window rate limiting comparison]

Fixed Window

The simplest approach: divide time into fixed intervals (e.g., 1-minute windows). Count requests per window. Reset the counter when the window changes.

python
import time

class FixedWindowCounter:
    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window_sec = window_sec
        self.count = 0
        self.window_start = int(time.time() / window_sec) * window_sec

    def allow(self):
        now = time.time()
        current_window = int(now / self.window_sec) * self.window_sec

        if current_window != self.window_start:
            # New window -- reset counter
            self.window_start = current_window
            self.count = 0

        if self.count < self.limit:
            self.count += 1
            return True
        return False

# 100 requests per minute
limiter = FixedWindowCounter(limit=100, window_sec=60)

The boundary problem: A client sends 100 requests at second 59 of window 1, then 100 more at second 0 of window 2. They sent 200 requests in 2 seconds — twice the intended rate — because the counter reset at the boundary.
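The attack can be reproduced with a clock-injected variant of the counter above (an explicit `now` parameter replaces `time.time()` so the boundary timing is deterministic):

```python
class FixedWindowSim:
    """Same logic as FixedWindowCounter above, with an explicit clock."""
    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window_sec = window_sec
        self.count = 0
        self.window_start = None

    def allow(self, now):
        current = int(now / self.window_sec) * self.window_sec
        if current != self.window_start:
            self.window_start, self.count = current, 0  # window boundary: reset
        if self.count < self.limit:
            self.count += 1
            return True
        return False

limiter = FixedWindowSim(limit=100, window_sec=60)
late = sum(limiter.allow(now=59.0) for _ in range(100))   # second 59, window 1
early = sum(limiter.allow(now=60.0) for _ in range(100))  # second 0, window 2
# All 200 requests are accepted within about one second of wall-clock time
```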

Sliding Window Log

Instead of fixed boundaries, track the timestamp of every request. To check the limit, count how many timestamps fall within the last N seconds:

python
import time

class SlidingWindowLog:
    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window_sec = window_sec
        self.timestamps = []

    def allow(self):
        now = time.time()
        cutoff = now - self.window_sec

        # Remove expired timestamps
        self.timestamps = [ts for ts in self.timestamps if ts > cutoff]

        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False

This eliminates the boundary problem completely — the window slides with each request. The trade-off is memory: storing a timestamp per request for high-volume APIs (100,000 requests/minute) consumes significant memory per client.

Sliding Window Counter (Hybrid)

A practical compromise: keep fixed window counters but interpolate between adjacent windows to approximate a sliding window.

python
import time

class SlidingWindowCounter:
    def __init__(self, limit, window_sec):
        self.limit = limit
        self.window_sec = window_sec
        self.prev_count = 0
        self.curr_count = 0
        self.curr_window = 0

    def allow(self):
        now = time.time()
        window = int(now / self.window_sec)
        position = (now % self.window_sec) / self.window_sec

        if window != self.curr_window:
            # If more than one window elapsed, the previous window saw no traffic
            self.prev_count = self.curr_count if window == self.curr_window + 1 else 0
            self.curr_count = 0
            self.curr_window = window

        # Weighted sum: previous window's contribution decreases over time
        estimated = self.prev_count * (1 - position) + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False

This uses two counters (current and previous window) and weights the previous window by how much of it overlaps with the sliding window. Memory cost is constant per client (two integers) regardless of request volume.
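A worked example of the weighted sum, 15 seconds into a one-minute window:

```python
# Limit 100/min. The previous window saw 100 requests; we are 15 seconds
# (25%) into the current window, which has seen none yet.
prev_count, curr_count = 100, 0
position = 15 / 60                               # fraction of window elapsed
estimated = prev_count * (1 - position) + curr_count
# 75% of the sliding window still overlaps the previous window, so the
# estimate is 75 requests: only about 25 more would be admitted right now.
```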

Comparison

Algorithm              | Memory          | Accuracy                    | Boundary Problem
-----------------------|-----------------|-----------------------------|-----------------
Fixed window           | O(1) per client | Allows 2x burst at boundary | Yes
Sliding window log     | O(N) per client | Exact                       | No
Sliding window counter | O(1) per client | Approximate (within 1%)     | Minimal

Common Pitfall

The fixed window boundary problem is not theoretical. If your API allows 1000 requests per minute with fixed windows, a client can send 2000 requests in 2 seconds by timing their burst across the window reset. Sliding window counter eliminates this with negligible overhead — use it as the default over fixed windows.

A global rate limit (e.g., "the API handles 10,000 requests per second") does not guarantee fairness. Without per-client limits, one heavy user can consume the entire quota while others get nothing. Fairness mechanisms ensure equitable access across all clients.

Per-Key Rate Limiting

The most common fairness mechanism: assign each client an independent rate limit identified by a key. The key can be:

  • API key — Identifies a registered application. Best for B2B APIs.
  • User ID — Identifies an authenticated user. Best for user-facing products.
  • IP address — Identifies a network source. Useful for unauthenticated traffic but unreliable behind NATs (many users share one IP).
  • Composite key — Combines multiple identifiers (e.g., user_id + endpoint) for granular control.
python
def rate_limit_middleware(request):
    # Extract rate limit key
    key = f"ratelimit:{request.api_key}:{request.endpoint}"

    # Check per-key limit
    if not rate_limiter.allow(key):
        return Response(status=429, headers={
            "Retry-After": str(rate_limiter.retry_after(key)),
            "X-RateLimit-Limit": str(rate_limiter.limit_for(key)),
            "X-RateLimit-Remaining": str(rate_limiter.remaining(key)),
        })

    return forward_to_backend(request)

Tiered Rate Limits

Different clients deserve different limits based on their plan, trust level, or business agreement:

Tier       | Rate Limit | Burst Capacity | Use Case
-----------|------------|----------------|---------------------------
Free       | 60/min     | 10             | Evaluation, hobby projects
Pro        | 600/min    | 100            | Production applications
Enterprise | 6,000/min  | 1,000          | High-volume integrations
Internal   | 60,000/min | 10,000         | Service-to-service calls

Tiered limits align cost with usage. Free-tier users get enough to evaluate the API. Enterprise customers pay for higher throughput. Internal services get generous limits since they are trusted and performance-critical.
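A tier table like the one above typically lives in configuration and resolves to token bucket parameters at request time. A sketch, where the `TIERS` dict and `limits_for` helper are illustrative names:

```python
# Hypothetical tier table mirroring the one above; names are illustrative
TIERS = {
    "free":       {"limit_per_min": 60,     "burst": 10},
    "pro":        {"limit_per_min": 600,    "burst": 100},
    "enterprise": {"limit_per_min": 6_000,  "burst": 1_000},
    "internal":   {"limit_per_min": 60_000, "burst": 10_000},
}

def limits_for(tier):
    """Resolve a client's tier to (bucket capacity, refill per second)."""
    cfg = TIERS.get(tier, TIERS["free"])   # unknown tiers fall back to free
    return cfg["burst"], cfg["limit_per_min"] / 60
```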

Weighted Rate Limiting

Not all requests are equal. A search query that scans millions of rows is more expensive than a simple key lookup. Weighted rate limiting charges different "costs" per request type:

python
ENDPOINT_WEIGHTS = {
    "/api/users/{id}": 1,       # Simple lookup
    "/api/search": 5,           # Expensive query
    "/api/export": 20,          # Very expensive bulk operation
    "/api/healthcheck": 0,      # Free (monitoring)
}

def weighted_rate_limit(request, bucket):
    weight = ENDPOINT_WEIGHTS.get(request.path, 1)
    if bucket.tokens >= weight:
        bucket.tokens -= weight
        return True
    return False

This prevents a client from consuming their entire quota on cheap endpoints and then sending one expensive query that overwhelms the backend. The weights reflect actual resource cost.

Preventing Abuse Patterns

Beyond basic rate limiting, watch for:

  • Distributed abuse — An attacker uses thousands of IP addresses to stay under per-IP limits. Detect by monitoring aggregate traffic patterns, not just per-key rates.
  • Credential stuffing — Rapid login attempts from rotating IPs. Rate limit per target account, not per source IP.
  • Scraping — Sequential access patterns across many resources. Detect via access pattern analysis, not just rate.
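For credential stuffing in particular, the limiter key must be the target account rather than the source IP. A minimal sketch; the five-attempt threshold and helper names are illustrative:

```python
from collections import defaultdict

MAX_ATTEMPTS, WINDOW_SEC = 5, 300   # 5 failed logins per account per 5 minutes
_failures = defaultdict(list)       # account -> timestamps of failed logins

def login_allowed(account, now):
    """Keyed on the *target account*, so rotating source IPs don't help."""
    cutoff = now - WINDOW_SEC
    _failures[account] = [t for t in _failures[account] if t > cutoff]
    return len(_failures[account]) < MAX_ATTEMPTS

def record_failure(account, now):
    _failures[account].append(now)
```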

Rate limiting in a single-server environment is straightforward — an in-memory counter works. But production systems run behind multiple servers, each receiving a fraction of traffic. Without coordination, each server enforces its own limit independently, allowing clients to exceed the global limit by distributing requests across servers.

[Figure: Distributed rate limiting with Redis atomic operations across multiple API servers]

Local (In-Memory) Rate Limiting

Each server maintains its own counters:

python
# Simple in-memory rate limiter per server
from collections import defaultdict
import time

class LocalRateLimiter:
    def __init__(self, limit_per_server, window_sec):
        self.limit = limit_per_server
        self.window = window_sec
        self.counters = defaultdict(lambda: {"count": 0, "start": 0})

    def allow(self, key):
        now = time.time()
        entry = self.counters[key]
        if now - entry["start"] > self.window:
            entry["count"] = 0
            entry["start"] = now
        if entry["count"] < self.limit:
            entry["count"] += 1
            return True
        return False

Pros: Zero network latency, no external dependency, works when Redis is down.

Cons: If you have 10 servers each allowing 100 requests/sec, the actual global limit is 1,000/sec. You must divide the intended global limit by the number of servers, which changes as you scale.
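The division problem in concrete terms (a minimal sketch; under autoscaling the server count is a moving target):

```python
def per_server_limit(global_limit, server_count):
    """Naive split: each server enforces an equal share of the global limit."""
    return global_limit // server_count

# Intended global limit: 1000 req/s. With 10 servers each enforces 100 req/s.
# Scale out to 20 servers without reconfiguring, and the fleet now admits
# 20 * 100 = 2000 req/s; every server must be updated to 50 to compensate.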

Distributed Rate Limiting with Redis

The standard production approach: use Redis as a centralized counter store. All servers check the same counter for each client.

python
import redis
import time

r = redis.Redis()

def distributed_rate_limit(key, limit, window_sec):
    """Fixed window counter using Redis."""
    pipe = r.pipeline()
    now = time.time()
    window_key = f"rl:{key}:{int(now / window_sec)}"

    pipe.incr(window_key)
    pipe.expire(window_key, window_sec * 2)  # Keep for 2 windows
    result = pipe.execute()

    current_count = result[0]
    return current_count <= limit

Redis INCR is atomic — no race conditions even under high concurrency. The EXPIRE ensures keys are automatically cleaned up. Using a pipeline reduces round trips from 2 to 1.

Race Conditions in Distributed Rate Limiting

Even with Redis, race conditions lurk:

Check-then-set race: Two servers read the counter (99), both see it under the limit (100), both increment to 100. The actual count is 101. Solution: use atomic operations. Redis INCR returns the new value atomically — check the return value, not a separate GET.

python
def atomic_rate_limit(key, limit, window_sec):
    """Race-condition-free rate limiting with atomic INCR."""
    window_key = f"rl:{key}:{int(time.time() / window_sec)}"
    count = r.incr(window_key)
    if count == 1:
        r.expire(window_key, window_sec + 1)
    return count <= limit

Lua script approach for complex logic that needs to be atomic:

lua
-- Redis Lua script: atomic fixed window check
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then
    redis.call("EXPIRE", key, window)
end
if current > limit then
    return 0  -- rejected
end
return 1  -- allowed

Lua scripts execute atomically in Redis — no other command can interleave between the INCR and EXPIRE.

Production Tools

Tool            | Type          | Rate Limit Support
----------------|---------------|-------------------------------------------
Kong            | API Gateway   | Token bucket, sliding window, Redis-backed
Envoy           | Proxy/Sidecar | Local + global (via rate limit service)
AWS API Gateway | Managed       | Token bucket per API key/stage
Nginx           | Reverse proxy | Leaky bucket (limit_req module)
Istio           | Service mesh  | Local + Redis-backed distributed

Key Insight

The check-then-set race condition is the most common bug in distributed rate limiting. Two servers read the counter, both see 99 (under limit 100), both increment, and the actual count becomes 101. Always use atomic operations (Redis INCR) and check the return value rather than reading the counter first and then incrementing separately.

Rate limiting theory maps cleanly to real production systems. Understanding how major platforms implement rate limiting reveals patterns you can apply to your own designs.

API Gateway Rate Limiting (Stripe, GitHub, Twitter)

Most public APIs implement multi-layer rate limiting:

Stripe uses a token bucket per API key with separate limits per endpoint category. Read operations (retrieving a payment) have higher limits than write operations (creating a charge). This reflects the actual resource cost — reads are cheap, writes require database transactions and downstream processing.

GitHub provides 5,000 requests per hour for authenticated users and 60 per hour for unauthenticated requests. They use a sliding window and communicate state via X-RateLimit-* headers on every response. When you hit the limit, the X-RateLimit-Reset header tells you the exact Unix timestamp when your quota refreshes.

DDoS Mitigation (Cloudflare, AWS Shield)

DDoS protection uses rate limiting at multiple layers:

  • Layer 3/4: Rate limit by source IP at the network edge. Drop packets exceeding a threshold before they reach the application.
  • Layer 7: Rate limit by HTTP request patterns. Distinguish between a legitimate user loading a page (10 requests) and a bot scraping (1000 requests/sec).
  • Adaptive limits: Normal limits during peace time; dynamically tighter limits during detected attacks. Machine learning identifies anomalous patterns and triggers stricter enforcement.

Microservice-to-Microservice Rate Limiting

Internal services need rate limiting too. A misconfigured service or retry loop can overwhelm a downstream dependency:

  • Service mesh rate limiting (Istio/Envoy): Apply rate limits at the sidecar proxy level, transparent to the application code. Limits are configured centrally and enforced at every service boundary.
  • Circuit breaker + rate limit: When a downstream service starts returning errors, the circuit breaker trips and subsequent calls fail fast. Rate limiting prevents the retry storm that typically follows.
  • Priority-based admission: Critical requests (user-facing checkout) get higher priority than background tasks (analytics batch job). Under load, low-priority requests are throttled first.
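Priority-based admission can be sketched as a shared token pool that keeps a reserve only critical requests may use (the class name and thresholds here are illustrative, not from any service mesh):

```python
class PriorityLimiter:
    """Shared token pool; background traffic may not dip into the reserve."""
    def __init__(self, capacity, reserve_for_critical):
        self.tokens = capacity
        self.reserve = reserve_for_critical

    def allow(self, critical):
        # Critical requests may drain the pool to zero; background
        # requests are refused once only the reserve remains.
        floor = 0 if critical else self.reserve
        if self.tokens - 1 >= floor:
            self.tokens -= 1
            return True
        return False

limiter = PriorityLimiter(capacity=10, reserve_for_critical=3)
```

With 10 tokens and a reserve of 3, background work is admitted until 3 tokens remain; a user-facing checkout request can still proceed after that.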

Rate Limiting in Message Queues

Kafka and RabbitMQ implement producer and consumer rate limiting to prevent queue saturation:

  • Producer throttling: Kafka brokers can reject produce requests when the broker is overloaded (quota per client-id). This prevents a runaway producer from filling all partitions.
  • Consumer rate limiting: Limit how fast a consumer pulls messages to avoid overwhelming its downstream processing. If the consumer writes to a database, the consume rate should match the database's write capacity.
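Consumer-side pacing can be sketched as a pull loop that spaces message handling to a fixed rate. This is a sketch with illustrative callables; real deployments would lean on the client library's own flow-control settings (e.g. Kafka consumer configuration or RabbitMQ prefetch):

```python
import time

def consume(pull, handle, max_per_sec):
    """Pull-and-process loop that never exceeds max_per_sec messages/sec.
    `pull` returns the next message, or None once the queue is drained."""
    interval = 1.0 / max_per_sec
    next_slot = time.monotonic()
    while True:
        msg = pull()
        if msg is None:
            return
        now = time.monotonic()
        if now < next_slot:
            time.sleep(next_slot - now)   # wait for the next allowed slot
        handle(msg)
        next_slot = max(next_slot, now) + interval
```

Setting `max_per_sec` to the downstream database's sustained write rate keeps the consumer from outrunning it.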