My Solution for Design an API Rate Limiter with Score: 8/10
by dynamo7055
System requirements
Functional:
- The rate limiter should limit the number of requests a client can send to the API servers.
- The limiting rules should be flexible for different use cases, e.g., keyed by IP address, user ID, etc.
- The rate limiter should work in a distributed environment.
- The user should be notified when a request is throttled.
- A throttled request should be either discarded or forwarded to a queue for later processing, depending on the use case.
Non-Functional:
- High availability: the rate limiter must be highly available, since every incoming request depends on it
- Low latency: the rate limiter should add minimal latency when deciding whether to limit a request
- High fault tolerance: if the rate limiter fails, it should not take down the entire system
- Monitoring, alerting, and logging
Rate Limiting Algorithms:
We have the following rate limiting algorithms.
- Token Bucket
- Tokens are added to a bucket at a predefined rate. Each request consumes one token; when the bucket runs out of tokens, the request is throttled.
- The bucket size and refill rate can be tuned per scenario (see the sketch below).
- + easy to implement
- + handles bursts of requests
- - the bucket size and refill rate can be hard to tune
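A minimal single-process sketch of the token bucket in Python; the class name and parameter values are illustrative, not part of the original design:

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity            # maximum tokens the bucket holds
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Add the tokens accrued since the last check, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1                # one request consumes one token
            return True
        return False                        # bucket empty: throttle

# e.g., TokenBucket(capacity=10, refill_rate=2) allows a burst of 10
# requests and a sustained rate of 2 requests/second.
```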
- Leaking Bucket
- Requests are added to a queue, and a consumer processes them from the queue at a fixed rate (see the sketch below).
- + memory efficient, as the queue size is fixed
- - cannot handle bursts of requests
- - hard to tune the queue size and processing rate
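A minimal sketch of the leaking bucket, assuming a consumer drains the queue on a fixed schedule:

```python
from collections import deque

class LeakyBucket:
    def __init__(self, queue_size: int):
        self.queue_size = queue_size
        self.queue = deque()                # bounded queue of pending requests

    def allow(self, request) -> bool:
        if len(self.queue) >= self.queue_size:
            return False                    # queue full: throttle
        self.queue.append(request)
        return True

    def leak(self):
        # Called by the consumer on a fixed schedule (e.g., every 100 ms)
        # so requests drain at a constant rate; returns None when idle.
        return self.queue.popleft() if self.queue else None
```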
- Fixed Window Counter
- Each user gets a counter per fixed time window. The rate limiter increments the counter when it receives a request; if the counter has already reached the limit, the request is throttled (see the sketch below).
- + easy to implement
- + memory efficient
- - a traffic spike around a window boundary can let through up to twice the allowed quota (a burst at the end of one window plus a burst at the start of the next)
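A minimal in-memory sketch of the fixed window counter; a real deployment would keep the counters in Redis, as discussed later:

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)    # (user_id, window_index) -> count

    def allow(self, user_id: str) -> bool:
        window = int(time.time() // self.window_seconds)
        if self.counters[(user_id, window)] >= self.limit:
            return False                    # quota for this window exhausted
        self.counters[(user_id, window)] += 1
        return True
```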
- Sliding Window Log
- The timestamp of each request is logged. When a request arrives, count the requests received in the past time window and decide whether to throttle (see the sketch below).
- + accurate throttling
- - not memory efficient, since one timestamp is stored per request
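A minimal sketch of the sliding window log for a single client; variants differ on whether throttled requests are also logged:

```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.log = deque()                  # timestamps of accepted requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have fallen outside the sliding window.
        while self.log and self.log[0] <= now - self.window_seconds:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False                    # window already at quota
        self.log.append(now)
        return True
```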
- Sliding Window Counter
- When a request arrives, say 30% of the way into the current window, estimate the rolling count as: requests in the current window so far + requests in the previous window × 70% (the fraction of the sliding window that still overlaps the previous window). If this weighted count exceeds the allowed quota, the request is throttled (see the sketch below).
- + memory efficient
- + acceptable accuracy (between sliding window log and fixed window counter)
- - more complicated to implement
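A minimal sketch of the sliding window counter for a single key, using the weighted estimate described above:

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_index = 0
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.time()
        index = int(now // self.window_seconds)
        if index != self.window_index:
            # Roll the windows over; if more than one window has passed,
            # the previous window's count is effectively 0.
            self.previous_count = self.current_count if index == self.window_index + 1 else 0
            self.current_count = 0
            self.window_index = index
        # Fraction of the sliding window overlapping the previous window,
        # e.g., 70% when we are 30% of the way into the current window.
        overlap = 1.0 - (now % self.window_seconds) / self.window_seconds
        estimated = self.previous_count * overlap + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```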
Comparing the pros and cons above, we choose the token bucket algorithm for our rate limiter, as it is easy to implement and can handle burst traffic. In practice, the algorithm can be chosen dynamically to fit different use cases.
Database design
userId: String
ipAddress: String
timestamp: DateTime
serviceName: String
High-level design
The client (web browser / mobile app) sends a request to the rate limiter middleware.
The rate limiter middleware decides whether the request should be limited:
if limited, the request is discarded;
if not limited, the request is passed to the API servers.
Flow Steps:
- When a client sends a request to a server, the request first goes to the API gateway.
- Rate limiting rules are stored on disk as config files. Workers periodically pull the rules from disk and load them into a rules cache.
- The API gateway fetches the counter from the in-memory store (Redis) and the rule from the rules cache, then decides whether the request is passed or throttled.
- If the request is passed, it is forwarded to the API servers.
- If the request is throttled, the gateway returns a 429 (Too Many Requests) response to the client. In some cases, the throttled request can instead be pushed to a queue for later processing. A sketch of the gateway-side decision follows.
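A minimal sketch of the gateway-side decision, assuming the counter has already been incremented in Redis and the limit comes from the rules cache; the `X-Ratelimit-Remaining` header follows common convention rather than a formal standard, and the `Retry-After` value is illustrative:

```python
def decide(counter: int, limit: int) -> tuple[int, dict]:
    # counter: this client's request count in the current window (from Redis)
    # limit:   allowed requests per window (from the rules cache)
    if counter > limit:
        # Throttled: 429 plus headers telling the client when to retry.
        return 429, {"Retry-After": "60", "X-Ratelimit-Remaining": "0"}
    return 200, {"X-Ratelimit-Remaining": str(limit - counter)}
```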
Rate Limiting Rule:
- The rate limiting rules can be stored on disk as config files (a sketch of loading such a rule follows the example). For example:
```yaml
domain: auth
descriptor:
  key: auth_type
  value: login
  rate_limit:
    unit: minute
    request_per_unit: 5
```
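A sketch of how a worker might load such a rule into the rules cache, assuming PyYAML and the nesting shown above:

```python
import yaml  # PyYAML, assumed available

def load_rule(path: str) -> dict:
    with open(path) as f:
        rule = yaml.safe_load(f)
    d = rule["descriptor"]
    # Key the cache by (domain, descriptor key, descriptor value) so the
    # gateway can look up the applicable limit in O(1) per request.
    return {(rule["domain"], d["key"], d["value"]): d["rate_limit"]}
```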
Rate Limiting in Distributed System:
Two problems:
- Race Condition:
- In a highly concurrent environment, two API gateways may read the same counter value from the in-memory cache, each increment it by 1, and write it back, so one increment is lost and the counter undercounts.
- Solution:
- add a lock on the counter
- - decreases performance
- use a Lua script or the sorted set data structure in Redis to make the read-increment-write atomic (see the sketch below)
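A minimal sketch of the Lua approach using redis-py: the script runs atomically inside Redis, so concurrent gateways cannot interleave between the INCR and EXPIRE. Key names and parameter values are illustrative:

```python
import time
import redis

r = redis.Redis()  # assumes a reachable Redis instance

# Runs atomically in Redis: increment the window counter and set its
# expiry on first use, with no gap another client can race into.
LUA_INCR_WITH_EXPIRE = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""

def is_allowed(user_id: str, limit: int = 5, window_seconds: int = 60) -> bool:
    key = f"rate:{user_id}:{int(time.time() // window_seconds)}"
    count = r.eval(LUA_INCR_WITH_EXPIRE, 1, key, window_seconds)
    return count <= limit
```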
- Synchronization:
- A client may send requests to different rate limiter servers. If the usage data is not synchronized between the servers, the client could send more requests than the rate limiting rule allows.
- The solution is a centralized data store that all servers read from and write to:
- client1 -> server1 -> centralized data store
- client2 -> server2 -> centralized data store
- Performance Optimization:
- In a distributed environment, users should be routed to the closest data center to reduce request latency.
Monitoring, Alert and Logging
- Monitoring:
- whether the rate limiting algorithm is effective
- whether the rate limiting rules are effective (neither too strict nor too loose)
- Export the data as metrics, visualize them on a dashboard, and alert so that action can be taken when the system is unhealthy.
- Logging helps to quickly track down issues.
Trade offs/Tech choices
- co-located with servers vs distributed system
- co-located with servers:
- + simple
- - not independently scalable due to coupling: we must scale the whole service just to scale the rate limiter
- distributed system:
- + decoupled: the servers and the rate limiter scale independently
- - may introduce additional network latency
- Where to store the counter? Disk vs Memory
- Disk:
- + persistent (though persistence is not necessary here)
- - high latency
- In memory:
- + fast, and supports a time-based expiration strategy
- Redis is a good choice, with INCR and EXPIRE (see the sketch below):
- INCR: increments the stored counter by 1
- EXPIRE: sets a timeout on the counter; when the timeout expires, the counter is automatically deleted
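A minimal sketch of the INCR/EXPIRE pattern with redis-py; note the two commands are separate round trips here, which is exactly the race the Lua script above closes:

```python
import time
import redis

r = redis.Redis()  # assumes a reachable Redis instance

def is_allowed(user_id: str, limit: int = 5, window_seconds: int = 60) -> bool:
    # One counter per user per fixed window; the key format is illustrative.
    key = f"rate:{user_id}:{int(time.time() // window_seconds)}"
    count = r.incr(key)                 # INCR creates the key at 1 if missing
    if count == 1:
        r.expire(key, window_seconds)   # EXPIRE clears the counter after the window
    return count <= limit
```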
Failure scenarios/bottlenecks
Future improvements
Hard rate limiting vs soft rate limiting:
- Hard: the number of requests can never exceed the configured quota.
- Soft: if the server load is light and a client has reached its limit, we can allow some extra quota for that client to improve overall throughput (see the sketch below).
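A hypothetical soft-limit policy sketch; the 20% overflow factor and the 0.5 load threshold are illustrative assumptions, not part of the original design:

```python
def is_allowed_soft(count: int, limit: int, server_load: float) -> bool:
    # Hard limiting would stop here: never exceed the quota.
    if count <= limit:
        return True
    # Soft limiting (assumed policy): grant up to 20% extra quota
    # while the measured server load stays below 50%.
    return server_load < 0.5 and count <= limit * 1.2
```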