1. Requirements

Functional Requirements

Support rate limiting based on:
- User / API key / IP / Org
- Per API endpoint
Admins can:
- Create / update / delete / view rate-limit rules
System should return:
- X-RateLimit-Limit
- X-RateLimit-Remaining
- X-RateLimit-Reset
- Retry-After (on rejected requests)
Decision system determines whether to allow or reject each request
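
As a rough illustration of the headers listed above, the sketch below builds them for a single decision. It assumes X-RateLimit-Reset carries Unix epoch seconds (matching the reset_time style used later in the Decision API) and that Retry-After is only sent on rejections; the function name and parameters are hypothetical.

```python
import math


def rate_limit_headers(limit: int, remaining: int, reset_epoch: int,
                       now_epoch: float, allowed: bool) -> dict[str, str]:
    """Build the rate-limit response headers for one allow/reject decision."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        # Assumption: reset is expressed as Unix epoch seconds.
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if not allowed:
        # Round up so a client that waits exactly this long is not early.
        wait = max(1, math.ceil(reset_epoch - now_epoch))
        headers["Retry-After"] = str(wait)
    return headers
```

A rejected request 9.5 seconds before the reset would get Retry-After: 10, while an allowed request carries only the three X-RateLimit-* headers.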

Non-Functional Requirements

Latency: < 10ms decision time
Throughput: ~1M QPS
Scalability: Horizontally scalable
Availability: Highly available (no SPOF)
Consistency: Eventual consistency acceptable

2. Estimations

Traffic

~1M requests/sec (all pass through rate limiter)

Storage

Redis (counters)

1M users × 1000 rules × ~32 bytes ≈ 32 GB

PostgreSQL (rules)

1000 rules × ~256 bytes ≈ 256 KB per user (≈ 256 GB in the worst case where all 1M users carry their own full rule set)

👉 Storage is manageable; throughput & latency are the real challenges
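
The storage figures above can be checked with a few lines of arithmetic. This sketch uses decimal units (1 GB = 10^9 bytes); the constant names are illustrative, and the total for rules assumes the worst case where every user has a full set of 1000 rules.

```python
USERS = 1_000_000
RULES_PER_USER = 1_000
COUNTER_BYTES = 32    # approximate size of one Redis counter entry
RULE_BYTES = 256      # approximate size of one rule row

redis_bytes = USERS * RULES_PER_USER * COUNTER_BYTES
rules_bytes_per_user = RULES_PER_USER * RULE_BYTES
# Worst case only: every user owns a full, distinct rule set.
rules_bytes_total = USERS * rules_bytes_per_user

print(redis_bytes / 1e9)            # 32.0  (GB of counters)
print(rules_bytes_per_user / 1e3)   # 256.0 (KB of rules per user)
print(rules_bytes_total / 1e9)      # 256.0 (GB of rules, worst case)
```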

Latency Budget

Backend API: ~100ms
Rate limiter budget: <10ms

3. API Design

3.1 Rule Management APIs

POST /rules
PUT /rules/{id}
DELETE /rules/{id}
GET /rules/{id}
GET /rules?scope=...

3.2 Decision API (Internal / Fallback)

POST /shouldAllow

Request:

{
  "scope": "user",
  "scope_value": "123",
  "api": "/payments"
}

Response:

{
  "allowed": true,
  "remaining": 120,
  "reset_time": 1710000000
}

👉 Even though the sidecar makes decisions on the request path, this API is useful for:

Debugging
Fallback scenarios
External integrations
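
To make the request and response shapes concrete, here is a minimal sketch of a handler for POST /shouldAllow. It is not tied to any web framework; the dataclass names, the VALID_SCOPES set, and the idea of passing the counter state in as arguments are assumptions made to keep the example self-contained.

```python
import json
from dataclasses import asdict, dataclass

# Assumption: the scopes mirror the functional requirements.
VALID_SCOPES = {"user", "api_key", "ip", "org"}


@dataclass
class DecisionRequest:
    scope: str
    scope_value: str
    api: str


@dataclass
class DecisionResponse:
    allowed: bool
    remaining: int
    reset_time: int


def parse_request(body: str) -> DecisionRequest:
    """Parse and validate the JSON request body."""
    data = json.loads(body)
    req = DecisionRequest(data["scope"], str(data["scope_value"]), data["api"])
    if req.scope not in VALID_SCOPES:
        raise ValueError(f"unsupported scope: {req.scope!r}")
    return req


def should_allow(body: str, remaining_before: int, reset_time: int) -> str:
    """Hypothetical handler; a real one would look the counter up itself."""
    parse_request(body)  # validation only in this sketch
    allowed = remaining_before > 0
    remaining = remaining_before - 1 if allowed else 0
    return json.dumps(asdict(DecisionResponse(allowed, remaining, reset_time)))
```

Calling should_allow with the sample request and 121 tokens left yields the sample response shown above (allowed, 120 remaining).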

4. Data Storage & Design

4.1 PostgreSQL (Persistent Rules)

Rules Table:

rule_id (UUID PK)
scope_type (user/ip/org/api_key)
scope_value
api_endpoint
algorithm (token_bucket, sliding_window)
max_requests
window_seconds
burst_size
version

Constraint:

UNIQUE(scope_type, scope_value, api_endpoint)
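
The table and constraint above could look roughly like the schema below. To keep the sketch runnable without a database server it uses Python's built-in sqlite3 as a stand-in for PostgreSQL; the column types, NOT NULL choices, and CHECK constraints are assumptions, and in PostgreSQL rule_id would be a native UUID column.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE rules (
    rule_id        TEXT PRIMARY KEY,  -- UUID in PostgreSQL
    scope_type     TEXT NOT NULL
                   CHECK (scope_type IN ('user', 'ip', 'org', 'api_key')),
    scope_value    TEXT NOT NULL,
    api_endpoint   TEXT NOT NULL,
    algorithm      TEXT NOT NULL
                   CHECK (algorithm IN ('token_bucket', 'sliding_window')),
    max_requests   INTEGER NOT NULL,
    window_seconds INTEGER NOT NULL,
    burst_size     INTEGER NOT NULL,
    version        INTEGER NOT NULL DEFAULT 1,
    UNIQUE (scope_type, scope_value, api_endpoint)
);
"""


def insert_rule(conn, scope_type, scope_value, api_endpoint,
                max_requests, window_seconds, burst_size):
    """Insert a token-bucket rule and return its generated id."""
    rule_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO rules (rule_id, scope_type, scope_value, api_endpoint,"
        " algorithm, max_requests, window_seconds, burst_size)"
        " VALUES (?, ?, ?, ?, 'token_bucket', ?, ?, ?)",
        (rule_id, scope_type, scope_value, api_endpoint,
         max_requests, window_seconds, burst_size),
    )
    return rule_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
insert_rule(conn, "user", "123", "/payments", 100, 60, 20)
try:
    # Same (scope_type, scope_value, api_endpoint): the constraint rejects it.
    insert_rule(conn, "user", "123", "/payments", 500, 60, 50)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The UNIQUE constraint is what forces a second rule for the same scope and endpoint to go through an update (and a version bump) rather than a second insert.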

4.2 Redis Cluster (Distributed Counters)

Key:

rl:{hash(scope_value)}:{api}:{shard}

Value:

tokens
last_refill_timestamp

👉 The trailing shard suffix spreads one logical counter across several keys, so no single key becomes a bottleneck
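
A small sketch of the key and value layout described above. The choice of SHA-256 truncated to 16 hex characters for hash(scope_value) is an assumption, as are the names counter_key and BucketState; the braces in the key template are treated as placeholders, not as literal Redis Cluster hash tags.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class BucketState:
    """Value stored under each counter key."""
    tokens: float
    last_refill_timestamp: float


def counter_key(scope_value: str, api: str, shard: int) -> str:
    """Build rl:{hash(scope_value)}:{api}:{shard}."""
    # Assumption: a truncated SHA-256 digest stands in for hash().
    digest = hashlib.sha256(scope_value.encode("utf-8")).hexdigest()[:16]
    return f"rl:{digest}:{api}:{shard}"
```

Hashing the scope value keeps key length bounded and avoids putting raw identifiers such as IPs or API keys into key names.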

5. High-Level Architecture (HLD)

The system is divided into Data Plane (request path) and Control Plane (configuration path).

5.1 Data Plane (Request Flow)

Client sends request to API Gateway
API Gateway forwards request to RL-Sidecar
RL-Sidecar:
- Fetches rule from local cache
- Checks local token buffer
If buffer empty:
- Fetches token batch from Redis Cluster
Decision:
- If tokens available → forward to backend
- Else → return 429
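
The request flow above can be sketched as a single function. Everything here is a simplified stand-in: rule_lookup plays the local rule cache, local_buffer the token buffer, and FakeStore the Redis Cluster. Treating a missing rule as "not limited" is an assumption, not something the design states.

```python
from typing import Callable, Optional


class FakeStore:
    """In-memory stand-in for the shared Redis counter."""

    def __init__(self, tokens: int):
        self.tokens = tokens

    def fetch(self, key: str, n: int) -> int:
        granted = min(n, self.tokens)
        self.tokens -= granted
        return granted


def handle_request(rule_lookup: Callable[[str], Optional[dict]],
                   local_buffer: dict,
                   fetch_batch: Callable[[str, int], int],
                   key: str,
                   batch_size: int = 100) -> int:
    """Return the HTTP status the sidecar would produce: 200 or 429."""
    rule = rule_lookup(key)
    if rule is None:
        return 200  # assumption: no rule means the request is not limited
    if local_buffer.get(key, 0) == 0:
        # Buffer empty: ask the shared store for up to batch_size tokens.
        # A real sidecar would likely back off after an empty grant.
        local_buffer[key] = fetch_batch(key, batch_size)
    if local_buffer[key] > 0:
        local_buffer[key] -= 1
        return 200  # forward to backend
    return 429


store = FakeStore(tokens=3)
rules = {"rl:user123:/payments": {"max_requests": 3}}
buffer: dict = {}
statuses = [
    handle_request(rules.get, buffer, store.fetch,
                   "rl:user123:/payments", batch_size=2)
    for _ in range(4)
]
print(statuses)  # [200, 200, 200, 429]
```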

5.2 Control Plane (Rule Flow)

Admin updates rule via RL-Modify Service
Rule stored in PostgreSQL
Event published to Kafka / PubSub
RL-Sidecars consume event and update local cache

5.3 Burst Traffic Handling (Gateway + RL)

API Gateway absorbs initial surge using:
- Connection limits
- Request queueing
RL-Sidecar uses Token Bucket:
- Allows bursts up to capacity
- Enforces steady rate after burst

👉 Ensures:

No sudden backend overload
Smooth traffic shaping

5.4 Stateless Scaling of RL-Sidecars

RL-Sidecars are stateless workers
They do NOT store global counters

They only maintain:

Local rule cache (replicated)
Local token buffer (temporary)

Scaling Behavior:

Each API Gateway pod has its own sidecar
Scaling API Gateway → automatically scales RL capacity

Consistency:

Redis acts as shared global state

5.5 Hot Key Skew Handling

Problem:

Popular API/user → single Redis key → hotspot

Solution:

Key sharding:

rl:{user}:{api}:shard1
rl:{user}:{api}:shard2

Requests distributed across shards
Per-shard budgets must add up to the global limit, so sharding does not raise the effective limit
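
One possible reading of the sharding scheme above is sketched below: each request picks a shard at random, and the global limit is split so the shard budgets sum exactly to it. The random shard choice and the even split are assumptions; the design does not say how requests are distributed or how shards are aggregated.

```python
import random


def shard_key(user: str, api: str, num_shards: int, rng: random.Random) -> str:
    """Pick one of shard1..shardN for this request."""
    shard = rng.randrange(num_shards) + 1
    return f"rl:{user}:{api}:shard{shard}"


def per_shard_limits(global_limit: int, num_shards: int) -> list[int]:
    """Split the global limit so the shard limits sum exactly to it."""
    base, extra = divmod(global_limit, num_shards)
    # The first `extra` shards absorb the remainder, one token each.
    return [base + (1 if i < extra else 0) for i in range(num_shards)]


print(per_shard_limits(100, 3))  # [34, 33, 33]
```

With a split like this, a request can be rejected by an exhausted shard while another shard still has tokens, so more shards trade some accuracy for less contention.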

5.6 Degraded Mode Operation

Case 1: Redis Slow

Use local token buffer temporarily
Reduce dependency on Redis

Case 2: Redis Unavailable

Circuit breaker activates
Strategy:
- Critical APIs → fail-close
- Non-critical APIs → fail-open
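
The fail-close / fail-open split for Case 2 might be expressed as a small policy function. How an API is classified as critical is not specified in this design; the CRITICAL_PREFIXES tuple below is purely hypothetical.

```python
from enum import Enum


class FailMode(Enum):
    OPEN = "fail-open"    # allow traffic when the limiter cannot decide
    CLOSE = "fail-close"  # reject traffic when the limiter cannot decide


# Hypothetical classification; a real system would load this from config.
CRITICAL_PREFIXES = ("/payments",)


def fail_mode(api: str) -> FailMode:
    """Choose the failure policy for an endpoint."""
    if api.startswith(CRITICAL_PREFIXES):
        return FailMode.CLOSE
    return FailMode.OPEN


def allow_when_redis_unavailable(api: str) -> bool:
    """Decision taken once the circuit breaker has opened."""
    return fail_mode(api) is FailMode.OPEN
```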

Case 3: Rule Propagation Delay

Sidecars continue using cached rules
Eventual consistency maintained

6. Detailed Breakdown

6.1 Decision Engine (RL-Sidecar)

Runs alongside API Gateway
Uses:
- Local rule cache
- Local token buffer
Avoids network calls for most requests

6.2 Rate Limiting Algorithm

Token Bucket

Allows burst traffic up to capacity
Smooth rate limiting after burst

Formula:

tokens = min(capacity, tokens + rate × Δt)
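
A minimal single-process sketch of the formula above. Time is passed in explicitly so the behavior is deterministic; the class and parameter names are illustrative, with rate in tokens per second.

```python
class TokenBucket:
    """Token bucket: tokens = min(capacity, tokens + rate * dt)."""

    def __init__(self, capacity: float, rate: float, now: float):
        self.capacity = capacity
        self.rate = rate              # tokens added per second
        self.tokens = capacity        # start full so an initial burst passes
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, never going above capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + self.rate * elapsed)
        self.last_refill = max(self.last_refill, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, rate=1.0, now=0.0)
results = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0),
           bucket.allow(0.5), bucket.allow(1.0)]
print(results)  # [True, True, False, False, True]
```

The first two requests use the burst capacity, the next two are rejected because less than one token has refilled, and the request at t = 1.0 passes once a full token is back.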

6.3 Local Token Buffering (Critical Optimization)

Instead of:

1 Redis call per request ❌

We use:

Batch token fetch (e.g., 1000 tokens)

👉 Benefits:

Reduces Redis QPS
Improves latency
Handles bursts efficiently
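
The effect on Redis QPS can be illustrated with a tiny simulation that counts round trips to the shared store for a given batch size. CountingStore is an in-memory stand-in that always grants the full batch, which isolates the call-count effect from any limit enforcement.

```python
class CountingStore:
    """Stand-in for Redis that only counts how often it is called."""

    def __init__(self):
        self.round_trips = 0

    def take(self, n: int) -> int:
        self.round_trips += 1
        return n  # always grant the full batch in this simulation


def simulate(requests: int, batch_size: int) -> int:
    """Serve `requests` from a local buffer; return store round trips."""
    store = CountingStore()
    buffered = 0
    for _ in range(requests):
        if buffered == 0:
            buffered = store.take(batch_size)
        buffered -= 1
    return store.round_trips


print(simulate(10_000, batch_size=1))     # 10000
print(simulate(10_000, batch_size=1000))  # 10
```

The trade-off is that tokens sitting in one sidecar's buffer are unavailable to the others, so large batches spread over many sidecars can reject requests while unused tokens remain elsewhere.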

6.4 Concurrency Handling

Use Redis Lua scripts
Atomic:
- Read tokens
- Update tokens
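
Below is one way the atomic read-and-update might look. TOKEN_BUCKET_LUA is a sketch of a script that refills the bucket and grants up to a requested batch in one step; it assumes a Redis version where calling TIME before a write is permitted in scripts (effects replication, the default since Redis 5) and a hash with the fields from section 4.2. With redis-py such a script could be loaded through register_script, but that part is not shown or exercised here. InMemoryBucketStore mirrors the same logic in pure Python, with a lock standing in for Redis executing the script atomically.

```python
import threading

# Sketch only: KEYS[1] = counter key; ARGV = capacity, rate, requested.
TOKEN_BUCKET_LUA = """
local capacity  = tonumber(ARGV[1])
local rate      = tonumber(ARGV[2])
local requested = tonumber(ARGV[3])
local t   = redis.call('TIME')  -- server time as the single clock
local now = tonumber(t[1]) + tonumber(t[2]) / 1000000
local s   = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill_timestamp')
local tokens = tonumber(s[1]) or capacity
local last   = tonumber(s[2]) or now
tokens = math.min(capacity, tokens + rate * math.max(0, now - last))
local granted = math.min(requested, math.floor(tokens))
redis.call('HSET', KEYS[1], 'tokens', tokens - granted,
           'last_refill_timestamp', now)
return granted
"""


class InMemoryBucketStore:
    """Pure-Python mirror of the script, for illustration and testing."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}  # key -> (tokens, last_refill_timestamp)

    def take(self, key: str, capacity: float, rate: float,
             requested: int, now: float) -> int:
        # The lock plays the role of Redis running the script atomically.
        with self._lock:
            tokens, last = self._state.get(key, (capacity, now))
            tokens = min(capacity, tokens + rate * max(0.0, now - last))
            granted = min(requested, int(tokens))
            self._state[key] = (tokens - granted, now)
            return granted
```

Because the read, refill, and decrement happen in one step, two sidecars asking for batches at the same moment cannot both be granted the same tokens.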

6.5 Hot Key Problem

Problem:

Popular APIs → same Redis key

Solution:

Key sharding:

rl:user123:/payments:shard1
rl:user123:/payments:shard2

6.6 Rule Propagation

Use streaming:
- Kafka / PubSub

Flow:

RL-Modify → publish event
Sidecars consume → update cache

7. Additional Considerations (Production Readiness)

7.1 Burst Traffic Handling

Token bucket allows burst up to capacity
Gateway-level protections:
- Connection limits
- Queue limits

7.2 Stateless Scaling

RL-Sidecars are stateless:
- No global state stored locally
Scale horizontally with API Gateway
Redis is the single source of truth for counters

7.3 Cache Miss & Partial Failure Handling

Rule Cache Miss

Fetch from the rule store (PostgreSQL, per section 4.1) / fall back to default limits

Redis Latency / Failure

Use local token buffer temporarily
Apply stricter limits

Full Redis Failure

Circuit breaker activated
Strategy:
- Critical APIs → fail-close
- Non-critical APIs → fail-open

Sidecar Restart

Warm cache via:
- Kafka replay OR
- Snapshot

7.4 Configuration Change Rollout

Each rule has a version

Flow:

Admin updates rule → new version
Event published to Kafka
Sidecars update asynchronously

Consistency:

Eventual consistency
Old + new rules coexist briefly
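
On the sidecar side, the version field can be used to make rule updates safe against duplicate or out-of-order events. The sketch below assumes each event carries the full rule including its version; RuleCache and the event shape are hypothetical, and rule deletion is left out.

```python
class RuleCache:
    """Local rule cache that only moves forward in version."""

    def __init__(self):
        self._rules = {}

    @staticmethod
    def _key(rule: dict) -> tuple:
        return (rule["scope_type"], rule["scope_value"], rule["api_endpoint"])

    def apply(self, event: dict) -> bool:
        """Apply a rule event; return False if it is stale or a duplicate."""
        key = self._key(event)
        current = self._rules.get(key)
        if current is not None and current["version"] >= event["version"]:
            return False
        self._rules[key] = event
        return True

    def get(self, scope_type: str, scope_value: str, api_endpoint: str):
        return self._rules.get((scope_type, scope_value, api_endpoint))


cache = RuleCache()
v2 = {"scope_type": "user", "scope_value": "123",
      "api_endpoint": "/payments", "max_requests": 200, "version": 2}
v1 = dict(v2, max_requests=100, version=1)
applied = [cache.apply(v2), cache.apply(v1), cache.apply(v2)]
print(applied)  # [True, False, False]
```

A late-arriving version 1 event is ignored, so a sidecar never rolls back to an older rule even if events are replayed.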

7.5 Clock Skew & Time Consistency

Problem:

Distributed systems → inconsistent clocks

Solution:

Use Redis server time as source of truth
Avoid fixed window algorithms
Prefer:
- Token bucket
- Sliding window

7.6 Observability

Track:

QPS
Rejection rate
Redis latency
Token exhaustion rate

Tools:

Prometheus
Grafana

8. Error Handling & Exception Scenarios

8.1 Redis Failures

Timeout / connection failure:
- Retry with backoff, bounded by the <10ms decision budget
- Fallback to local buffer
Persistent failure:
- Circuit breaker triggers

8.2 Kafka / Propagation Failure

Sidecars continue using last known rules
Retry consumption
No immediate impact on request handling; only rule updates are delayed

8.3 Data Inconsistency

Temporary inconsistencies allowed
Eventually resolved via:
- Kafka propagation
- Redis updates

8.4 Unexpected Traffic Spikes

Gateway absorbs spike via queueing
RL enforces limits strictly

8.5 Sidecar Failure

Restart sidecar
Reload:
- Rules from Kafka
- Tokens from Redis

Final Summary

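Sidecar-based design keeps most decisions local and within the <10ms budget
Redis cluster + key sharding let the counter store scale horizontally
Local token buffering replaces per-request Redis calls with batch fetches
Streaming-based rule propagation keeps sidecar rule caches eventually consistent
Key sharding relieves hot keys; Lua scripts keep counter updates atomic
Local buffers, circuit breakers, and fail-open / fail-close policies keep the limiter available when Redis or Kafka degrade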