Rate Limiting: Concepts, Algorithms, and Best Practices
Why do you need rate limiting? Because without it, a single misbehaving client can take down your entire service. Rate limiting controls how many requests a client can make within a time window, protecting your infrastructure from overload, abuse, and noisy-neighbor problems.

What Rate Limiting Protects Against
- Denial-of-service attacks — A single client sending millions of requests can saturate your servers. Rate limiting caps each client's throughput so no single source can overwhelm the system.
- Noisy neighbors — In multi-tenant systems, one tenant's traffic spike can degrade performance for everyone else. Per-tenant rate limits isolate blast radius.
- Cascading failures — Without rate limits, a traffic spike propagates from the API layer to the database, cache, and downstream services. Rate limiting at the edge prevents the spike from reaching fragile backends.
- Cost control — Cloud services charge per request. A runaway script or misconfigured client can generate enormous bills overnight. Rate limits act as a financial circuit breaker.
How Rate Limiting Works (High Level)
A rate limiter sits between clients and your service — typically in an API gateway, reverse proxy, or middleware. For each incoming request, it:
- Identifies the client (by API key, IP address, user ID, or token)
- Checks the client's current request count against their allowed limit
- If under the limit: forwards the request and increments the counter
- If over the limit: returns HTTP 429 Too Many Requests with a Retry-After header
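The per-request flow above can be sketched as a small piece of middleware. This is a minimal illustration, assuming a fixed 60-second window and a limit of 100 requests; the window choice and the in-memory counter store are placeholders, not a production design:

```python
import time

# Hypothetical limits for illustration.
WINDOW_SECONDS = 60
LIMIT = 100

# client_id -> (window_start, count)
_counters = {}

def handle_request(client_id, now=None):
    """Identify, check, and count a request; return (status, headers)."""
    now = now if now is not None else time.time()
    window_start, count = _counters.get(client_id, (now, 0))
    if now - window_start >= WINDOW_SECONDS:
        window_start, count = now, 0          # window expired: reset
    if count >= LIMIT:
        retry_after = int(window_start + WINDOW_SECONDS - now) + 1
        return 429, {"Retry-After": str(retry_after)}
    _counters[client_id] = (window_start, count + 1)
    return 200, {"X-RateLimit-Remaining": str(LIMIT - count - 1)}
```

In a real deployment this logic lives in the gateway or proxy rather than application code, and the counters live in shared storage rather than a process-local dict.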
Rate Limit Response Headers
Well-designed APIs communicate rate limit state to clients via headers:
- X-RateLimit-Limit — Maximum requests allowed in the current window
- X-RateLimit-Remaining — Requests remaining before throttling
- X-RateLimit-Reset — Unix timestamp when the window resets
- Retry-After — Seconds (or a date) the client should wait before retrying
These headers let well-behaved clients self-throttle before hitting 429s, reducing wasted traffic for both sides.
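A client can turn those headers into a pacing policy. A minimal sketch, assuming the X-RateLimit-* naming above; the even-spreading strategy is one reasonable choice, not a standard:

```python
import time

def throttle_delay(headers, now=None):
    """Seconds a well-behaved client should wait before its next request.

    Spreads the remaining quota evenly across the time left in the
    window, so the client paces itself instead of slamming into a 429.
    """
    now = now if now is not None else time.time()
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    time_left = max(reset_at - now, 0.0)
    if remaining <= 0:
        return time_left              # quota exhausted: wait for the reset
    return time_left / remaining      # otherwise pace evenly
```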
Rate limiting is not just about blocking bad actors. Its primary value in production is protecting your own infrastructure from your own users. A popular feature launch, a viral moment, or a misconfigured batch job from a legitimate customer can generate more traffic than any attacker. Rate limits ensure graceful degradation instead of cascading failure.
The two most fundamental rate limiting algorithms are the token bucket and the leaky bucket. Both use a "bucket" metaphor but behave differently under bursty traffic — and that difference determines which one fits your use case.

Token Bucket
The token bucket works like a prepaid balance. Tokens are added at a fixed rate (the refill rate R). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity C that limits how many tokens can accumulate — this determines the burst size.
The key insight: the token bucket allows bursts up to capacity C, then enforces the average rate R. With C = 20 and R = 100 tokens per minute, a client can send 20 requests instantly (consuming all tokens), then must wait for tokens to refill at roughly 1.67/second. This burst-then-steady behavior matches how real users interact with APIs — they often send a batch of requests, then go quiet.
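A token bucket is a few lines of state. This sketch tracks fractional tokens and refills lazily on each check; the capacity and rate in the comments are the example numbers above:

```python
import time

class TokenBucket:
    """Token bucket with capacity C and refill rate R tokens/second.

    Tokens accumulate continuously up to C; each request consumes one.
    """
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)   # start full: burst allowed immediately
        self.last = None

    def allow(self, now=None):
        now = now if now is not None else time.monotonic()
        if self.last is not None:
            # Refill for the elapsed time, clamped to capacity.
            elapsed = now - self.last
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Refilling lazily (on each `allow` call) avoids a background timer and makes the limiter purely request-driven.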
Leaky Bucket
The leaky bucket works like a queue with a fixed drain rate. Requests enter the bucket (queue). The bucket processes requests at a constant rate R. If the queue is full (capacity C), new requests are dropped.
The leaky bucket produces a perfectly smooth output rate regardless of input burstiness. This is ideal for backends that cannot tolerate traffic spikes — legacy databases, payment processors, or external APIs with strict rate limits of their own.
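The leaky bucket can be sketched the same way, tracking the queue depth and draining it lazily at the constant rate. This is a rate-limiting variant that drops on overflow rather than an actual queue of request objects:

```python
class LeakyBucket:
    """Leaky bucket: a bounded queue drained at a constant rate per second.

    offer() returns True if the request fits in the queue,
    False if the queue is full and the request is dropped.
    """
    def __init__(self, capacity, drain_rate):
        self.capacity = capacity
        self.drain_rate = drain_rate
        self.level = 0.0    # current queue depth
        self.last = None

    def offer(self, now):
        if self.last is not None:
            # Drain at the constant rate since the last arrival.
            self.level = max(0.0, self.level - (now - self.last) * self.drain_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False
```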
Comparison
| Property | Token Bucket | Leaky Bucket |
| --- | --- | --- |
| Burst behavior | Allows bursts up to capacity C | No bursts — constant drain rate |
| Output rate | Variable (bursty then steady) | Constant (smooth) |
| Queue behavior | No queue (reject immediately) | Queue up to capacity, then reject |
| Best for | User-facing APIs (better UX) | Backend protection (predictable load) |
| Implementation | Counter + timestamp | Queue + drain timer |
In interviews, lead with the trade-off. Token bucket is for user-facing APIs where burst tolerance improves UX (a page load triggers 10 requests at once). Leaky bucket is for protecting rate-sensitive backends (a payment gateway that processes exactly 50 transactions per second). Most production API gateways use token bucket because user experience matters more than perfectly smooth traffic.
Token bucket and leaky bucket define how requests flow, but they do not answer the question that product teams actually ask: "How many requests can customer X make per minute/hour/day?" Quotas and time windows map business limits to technical enforcement.

Fixed Window
The simplest approach: divide time into fixed intervals (e.g., 1-minute windows). Count requests per window. Reset the counter when the window changes.
The boundary problem: A client sends 100 requests at second 59 of window 1, then 100 more at second 0 of window 2. They sent 200 requests in 2 seconds — twice the intended rate — because the counter reset at the boundary.
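A fixed-window counter is only a handful of lines, which is exactly why it is so common despite the boundary problem. A minimal sketch:

```python
class FixedWindowLimiter:
    """Fixed-window counter: at most `limit` requests per interval."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window = None
        self.count = 0

    def allow(self, now):
        window = int(now // self.window_seconds)  # which window is `now` in?
        if window != self.window:
            self.window, self.count = window, 0   # boundary crossed: reset
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Straddling the boundary (a burst at second 59 and another at second 60) admits two full windows' worth of traffic back to back, exactly as described above.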
Sliding Window Log
Instead of fixed boundaries, track the timestamp of every request. To check the limit, count how many timestamps fall within the last N seconds:
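A sketch of the log approach, keeping timestamps in a deque and evicting anything older than the window on each check:

```python
from collections import deque

class SlidingWindowLog:
    """Sliding-window log: one timestamp per admitted request."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()    # timestamps of admitted requests, oldest first

    def allow(self, now):
        # Evict timestamps that have fallen out of the sliding window.
        while self.log and self.log[0] <= now - self.window_seconds:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```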
This eliminates the boundary problem completely — the window slides with each request. The trade-off is memory: storing a timestamp per request for high-volume APIs (100,000 requests/minute) consumes significant memory per client.
Sliding Window Counter (Hybrid)
A practical compromise: keep fixed window counters but interpolate between adjacent windows to approximate a sliding window.
This uses two counters (current and previous window) and weights the previous window by how much of it overlaps with the sliding window. Memory cost is constant per client (two integers) regardless of request volume.
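The interpolation can be sketched directly: the estimated count is `previous * (1 - elapsed_fraction) + current`, where `elapsed_fraction` is how far we are into the current window.

```python
class SlidingWindowCounter:
    """Two fixed counters; the previous window is weighted by overlap."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window = None      # index of the current fixed window
        self.current = 0
        self.previous = 0

    def allow(self, now):
        window = int(now // self.window_seconds)
        if self.window is None:
            self.window = window
        elif window == self.window + 1:
            # Rolled into the next window: current becomes previous.
            self.previous, self.current = self.current, 0
            self.window = window
        elif window > self.window + 1:
            # Skipped at least one whole window: nothing overlaps.
            self.previous, self.current = 0, 0
            self.window = window
        # Fraction of the previous window still inside the sliding window.
        elapsed = (now % self.window_seconds) / self.window_seconds
        estimated = self.previous * (1 - elapsed) + self.current
        if estimated < self.limit:
            self.current += 1
            return True
        return False
```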
Comparison
| Algorithm | Memory | Accuracy | Boundary Problem |
| --- | --- | --- | --- |
| Fixed window | O(1) per client | Allows 2x burst at boundary | Yes |
| Sliding window log | O(N) per client | Exact | No |
| Sliding window counter | O(1) per client | Approximate (within 1%) | Minimal |
The fixed window boundary problem is not theoretical. If your API allows 1000 requests per minute with fixed windows, a client can send 2000 requests in 2 seconds by timing their burst across the window reset. Sliding window counter eliminates this with negligible overhead — use it as the default over fixed windows.
A global rate limit (e.g., "the API handles 10,000 requests per second") does not guarantee fairness. Without per-client limits, one heavy user can consume the entire quota while others get nothing. Fairness mechanisms ensure equitable access across all clients.
Per-Key Rate Limiting
The most common fairness mechanism: assign each client an independent rate limit identified by a key. The key can be:
- API key — Identifies a registered application. Best for B2B APIs.
- User ID — Identifies an authenticated user. Best for user-facing products.
- IP address — Identifies a network source. Useful for unauthenticated traffic but unreliable behind NATs (many users share one IP).
- Composite key — Combines multiple identifiers (e.g., user_id + endpoint) for granular control.
Tiered Rate Limits
Different clients deserve different limits based on their plan, trust level, or business agreement:
| Tier | Rate Limit | Burst Capacity | Use Case |
| --- | --- | --- | --- |
| Free | 60/min | 10 | Evaluation, hobby projects |
| Pro | 600/min | 100 | Production applications |
| Enterprise | 6,000/min | 1,000 | High-volume integrations |
| Internal | 60,000/min | 10,000 | Service-to-service calls |
Tiered limits align cost with usage. Free-tier users get enough to evaluate the API. Enterprise customers pay for higher throughput. Internal services get generous limits since they are trusted and performance-critical.
Weighted Rate Limiting
Not all requests are equal. A search query that scans millions of rows is more expensive than a simple key lookup. Weighted rate limiting charges different "costs" per request type:
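A sketch of cost-based charging against a shared budget; the endpoints and weights here are illustrative, and in practice they would come from measured resource cost:

```python
# Illustrative cost table: expensive endpoints consume more of the budget.
COSTS = {
    "GET /item": 1,       # simple key lookup
    "GET /search": 10,    # scans many rows
    "POST /report": 25,   # heavy aggregation
}

class WeightedLimiter:
    """Charge a per-request cost against a fixed budget per window."""
    def __init__(self, budget):
        self.budget = budget
        self.spent = 0

    def allow(self, endpoint):
        cost = COSTS.get(endpoint, 1)   # unknown endpoints cost 1
        if self.spent + cost > self.budget:
            return False
        self.spent += cost
        return True
```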
This prevents a client from consuming their entire quota on cheap endpoints and then sending one expensive query that overwhelms the backend. The weights reflect actual resource cost.
Preventing Abuse Patterns
Beyond basic rate limiting, watch for:
- Distributed abuse — An attacker uses thousands of IP addresses to stay under per-IP limits. Detect by monitoring aggregate traffic patterns, not just per-key rates.
- Credential stuffing — Rapid login attempts from rotating IPs. Rate limit per target account, not per source IP.
- Scraping — Sequential access patterns across many resources. Detect via access pattern analysis, not just rate.
Rate limiting in a single-server environment is straightforward — an in-memory counter works. But production systems run behind multiple servers, each receiving a fraction of traffic. Without coordination, each server enforces its own limit independently, allowing clients to exceed the global limit by distributing requests across servers.

Local (In-Memory) Rate Limiting
Each server maintains its own counters:
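A sketch of the local approach, including the division of the global limit that makes it awkward to operate — the fleet size is hard-coded configuration that must change whenever you scale:

```python
GLOBAL_LIMIT = 1000   # intended requests/sec across the whole fleet
NUM_SERVERS = 10      # must be updated whenever the fleet scales

# Each server enforces only its share of the global limit.
PER_SERVER_LIMIT = GLOBAL_LIMIT // NUM_SERVERS   # 100/sec per server

class LocalCounter:
    """Per-server, per-second counter; no coordination between servers."""
    def __init__(self, limit):
        self.limit = limit
        self.second = None
        self.count = 0

    def allow(self, now):
        second = int(now)
        if second != self.second:
            self.second, self.count = second, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```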
Pros: Zero network latency, no external dependency, works when Redis is down.
Cons: If you have 10 servers each allowing 100 requests/sec, the actual global limit is 1,000/sec. You must divide the intended global limit by the number of servers, and that divisor changes as you scale.
Distributed Rate Limiting with Redis
The standard production approach: use Redis as a centralized counter store. All servers check the same counter for each client.
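A sketch of the per-request check, written against the redis-py command interface (`incr`, `expire`). The `FakeRedis` stub below stands in for a real client so the logic can run locally; a production version would also pipeline the two commands into one round trip and set the TTL atomically (see the Lua approach below):

```python
def allow_request(redis_client, client_id, limit=100, window_seconds=60):
    """Fixed-window check against a shared Redis counter.

    Relies on INCR being atomic: the returned count is exact even
    when many servers call this concurrently for the same key.
    """
    key = f"ratelimit:{client_id}"
    count = redis_client.incr(key)                 # atomic, returns new value
    if count == 1:
        redis_client.expire(key, window_seconds)   # first hit starts the window
    return count <= limit

class FakeRedis:
    """Minimal in-memory stand-in for a redis-py client (testing only)."""
    def __init__(self):
        self.store = {}
    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
    def expire(self, key, seconds):
        pass   # TTL behavior omitted in the stand-in
```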
Redis INCR is atomic — no race conditions even under high concurrency. The EXPIRE ensures keys are automatically cleaned up. Using a pipeline reduces round trips from 2 to 1.
Race Conditions in Distributed Rate Limiting
Even with Redis, race conditions lurk:
Check-then-set race: Two servers read the counter (99), both see it under the limit (100), and both increment. The counter ends at 101, and both requests were admitted. Solution: use atomic operations. Redis INCR returns the new value atomically — check the return value, not a separate GET.
Lua script approach for complex logic that needs to be atomic:
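A sketch of such a script, held as a Python string. It combines the increment, the TTL, and the limit check into one atomic unit; the key name, TTL, and limit are passed in as arguments. The commented-out usage assumes redis-py's `register_script` and a running Redis server:

```python
# Atomic "increment, set TTL on first hit, check limit" in one script.
# KEYS[1] = counter key, ARGV[1] = window TTL seconds, ARGV[2] = limit.
RATE_LIMIT_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
if count > tonumber(ARGV[2]) then
    return 0
end
return 1
"""

# Hypothetical usage with redis-py (register once, call per request):
#   script = redis_client.register_script(RATE_LIMIT_LUA)
#   allowed = script(keys=["ratelimit:alice"], args=[60, 100]) == 1
```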
Lua scripts execute atomically in Redis — no other command can interleave between the INCR and EXPIRE.
Production Tools
| Tool | Type | Rate Limit Support |
| --- | --- | --- |
| Kong | API Gateway | Token bucket, sliding window, Redis-backed |
| Envoy | Proxy/Sidecar | Local + global (via rate limit service) |
| AWS API Gateway | Managed | Token bucket per API key/stage |
| Nginx | Reverse proxy | Leaky bucket (limit_req module) |
| Istio | Service mesh | Local + Redis-backed distributed |
The check-then-set race condition is the most common bug in distributed rate limiting. Two servers read the counter, both see 99 (under limit 100), both increment, and the actual count becomes 101. Always use atomic operations (Redis INCR) and check the return value rather than reading the counter first and then incrementing separately.
Rate limiting theory maps cleanly to real production systems. Understanding how major platforms implement rate limiting reveals patterns you can apply to your own designs.
API Gateway Rate Limiting (Stripe, GitHub, Twitter)
Most public APIs implement multi-layer rate limiting:
Stripe uses a token bucket per API key with separate limits per endpoint category. Read operations (retrieving a payment) have higher limits than write operations (creating a charge). This reflects the actual resource cost — reads are cheap, writes require database transactions and downstream processing.
GitHub provides 5,000 requests per hour for authenticated users and 60 per hour for unauthenticated requests. They use a sliding window and communicate state via X-RateLimit-* headers on every response. When you hit the limit, the X-RateLimit-Reset header tells you the exact Unix timestamp when your quota refreshes.
DDoS Mitigation (Cloudflare, AWS Shield)
DDoS protection uses rate limiting at multiple layers:
- Layer 3/4: Rate limit by source IP at the network edge. Drop packets exceeding a threshold before they reach the application.
- Layer 7: Rate limit by HTTP request patterns. Distinguish between a legitimate user loading a page (10 requests) and a bot scraping (1000 requests/sec).
- Adaptive limits: Normal limits during peace time; dynamically tighter limits during detected attacks. Machine learning identifies anomalous patterns and triggers stricter enforcement.
Microservice-to-Microservice Rate Limiting
Internal services need rate limiting too. A misconfigured service or retry loop can overwhelm a downstream dependency:
- Service mesh rate limiting (Istio/Envoy): Apply rate limits at the sidecar proxy level, transparent to the application code. Limits are configured centrally and enforced at every service boundary.
- Circuit breaker + rate limit: When a downstream service starts returning errors, the circuit breaker trips and subsequent calls fail fast. Rate limiting prevents the retry storm that typically follows.
- Priority-based admission: Critical requests (user-facing checkout) get higher priority than background tasks (analytics batch job). Under load, low-priority requests are throttled first.
Rate Limiting in Message Queues
Kafka and RabbitMQ implement producer and consumer rate limiting to prevent queue saturation:
- Producer throttling: Kafka brokers can reject produce requests when the broker is overloaded (quota per client-id). This prevents a runaway producer from filling all partitions.
- Consumer rate limiting: Limit how fast a consumer pulls messages to avoid overwhelming its downstream processing. If the consumer writes to a database, the consume rate should match the database's write capacity.