Design An API Rate Limiter - System Design

System requirements

Functional:

per user based rate limiter
the limit can be configured dynamically without deployment
A distributed rate limiter rather than a local one on a single machine

Non-Functional:

high availability b/c if the server goes down the db might be under extreme pressure

latency should be very minimum, p50 should be single digit ms

Capacity estimation

300M DAU, each user request 10 times on average per day

average qps: 35000

if we use ttl to be 1hr, there will be ~125M visiting traffic during the 1hr time window. The max number of entries is ~125M by assuming each user have one entry.

user_id represented by uuid, 16 bytes per uuid, 8bytes for timestamp

24 bytes * 125M = 3000M bytes = 3GB for memory usage

API design

shouldThrottle(userID uuid)

Database design

Since the latency is very high priority, database might not meet our requirements.

Thus the solution is to use a cache of Redis + persistence option(SSD preferred).

cache schema

key: user_id

value: [epoch timestamp of each allowed traffic, ...]

High-level design

client -> apigateway -> rate limit service -> redis(with persistence option)

Request flows

client will also visit the company wise api gateway

the client side traffic visiting the desired backend application first which needs the rate limit functionalities.

the backend application has the dependency on the rate limit service that call `shouldThrottle` api.

If the api returns true, the backend application return error with specific error code.

if the api returns false, continue executing the normal backend application logic.

Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...

Trade offs/Tech choices

if the cache is unavailable, there are two options:

let the traffic go through
fail fast on all the requests

I don't think there is a perfect answer b/c:

most of the downstream services, db can their own way of handling hot key as protection
by rejecting requests, the downstream services will be autoscaled down drastically. and when the cache is back online and read to take traffic, all downstream services is already under scale even without the retry storm

In this case, we can just use a dynamic config that allows % of traffic going through.

Failure scenarios/bottlenecks

cache might be out of capacity(most likely memory) if the volume space is too large, when we need to consider LRU cache.

When the cache is down, the disaster recovery requires most craftsmanships. Since there might be retry storm, we might always need to over provision a replacement cluster. And then later gradually scale it back down.

Depending on the use case, there might also be a scenario of having too many negative cache entries which can build up the cache volume pretty quickly.

Future improvements

can regionalize the rate limiter, since users in EU only need to to access all of the backend applications in their own zones.

add a bloom filter to handling negative caching