Design An API Rate Limiter - System Design

System requirements

Functional:

When a request is received, determine whether or not to throttle
Throttle based on rules for each user, organization, IP Address, and operation requested.

Non-Functional:

Highly available, scale to millions of DAU.
Low latency, must not reduce main API's latency more than necessary
Reliable, no data loss in case of temporary failures

Capacity estimation

QPS Difficult to estimate without a set use case, let's say 100 billion QPD
1x10^11 / 10^5 = 1x10^6 = 1,000,000 QPS
More important is the storage estimation since we have to store a rule for each user/IP
Let's say 10 million DAU. Each user needs a map entry for it's UUID one for their IP.
key: 1 char = 1 byte, IP address <= 12 chars, userID = 64 bits = 76 bytes / key
val: Let's say we only have 10 APIs to make a rule / threshold for. a 32 bit integer should be good enough for a threshold value. Threshold represents number of allowed requests in a window or tokens in a token bucket = 32 bytes / key / operation = 320 bytes / val
Each rule is about 350 bytes, 350 x 10 ^7 = 3.5GB

API design

isThrottled(userID, operation)
return (bool): True if user has exhausted their threshold

Database design

We can use sharding by partitioning on the userID, but we may still need to increase availability. We can also replicate these shards to ensure high availability and lower latency.
Replication incurs an issue with consistency. We would trade off higher latency for a strongly consistent relationship between the replicas, but we probably don't need strong consistency. We can instead use a custom tunable eventual consistency algorithm like the gossip protocol for shard replication (which isn't offered by memcached). Each node "gossips" new updates to other replicas (fanout number is tunable) by independently notifying a subset of nodes every unit time defined by the gossip interval. These two values can tune the convergence time to avoid most underthrottling cases caused by a stale read on the token count.
Since the total size of the DB is only 3.5 GB with 10 million DAU (could be larger with more operations or DAU) it is a good candidate for in-memory storage in some sort of key value store, especially since we are trying to minimize latency. Two common choices are memcached and redis, of which I would choose memcached in this case since it can be more performant than redis in basic use cases. We also have no use for redis' rich functionality enabled by it's rich data format. Also, new versions of redis are no longer open source while memcached is, which could be an obstacle to using redis. One good reason to use redis would be the fact that it supports shard replication which we are using in this design, but we want to implement a custom eventual consistency algorithm to minimize latency.

High-level design

We assume the client has reached the web tier of some product that we wish to track throttling for. This service's web tier will call isThrottled(userID, operation) before allowing the request to be processed.
The throttle service will read from the appropriate memcached node, any replica that belongs to the partition based on userID. We can use consistent hashing for the mapping, which allows for faster inserts / deletes and rehashing.
The token bucket value is incremented by one and compared with the threshold.
If the user is over the threshold for this window, return True for is isThrottled and the web tier will return a 429 to the customer with appropriate retry headers (retry-after)
If the user is below threshold, the web tier allows the request through and eventually returns a 200 to the user.
Token bucket algo for tracking current allotment available to any user
Let's go with the token bucket algo for it's simplicity and since none of the other requirements are stated.
We might not need to log much here, the web tier would likely handle access logging and would include throttle decisions in it's logging. Let's have a monitoring service as well to determine when a failure occurs (lots / not enough throttling, throttle service returning 500 errors) and page the on-call engineers.

Request flows

Web tier recieves a request from userID to do operation
web tier makes a isThrottled(userID, operation) request
The throttle service host hashes the userID to find the memcached shard
Some replica of the user's shard will receive the request, and return the token count as well as increment.
Replica will send out it's new token count to the other replicas, who will take the new request.
The throttle service will compare the threshold to the current token count and return T/F to the webtier based on it

Detailed component design

As for the algorithm itself, we have a few options. The token bucket is tried and true, simple to implement, and resistant to microthrottling - where a customer is overthrottled due to bursty traffic on the edge of a window that otherwise is techincally within the threshold. If we wanted to be more space efficient we could use a sliding window (because no need to store refill rate), but this would have issues with accuracy when bursty traffic is present. We can use an append only log with timestamps to ensure extreme accuracy (due to the timestamps) but incur much more memory usage. If accuracy doesn't matter to us very much, we could use the rolling window log algorithm which is an approximation. It uses probability to determine the current window's usage based on extrapolation of the previous window's usage.
Token Bucket: Each token bucket consists of two integers, the current count and the bucket limit or threshold. When an operation is accessed we decrement the current count, and every bucket has a constant fill rate equal to the allowed frequency of userID to do operation. memcache nodes are a single thread so no race conditions to worry about on a single machine. The gossip protocol avoids locking by essentially allowing for a small percentage of stale reads. The idea is these numbers are typically much larger than 1 so in the case an update was missed it doesn't change the behavior of the system too much.
Memcached sharding and replication for reduced latency:
We can reshard, making the hash range for each shard lower and lower, making the number of shards theoretically infinte. This covers scaling, but introduces a SPOF for any given shard without replication.
Memcached doesn't support shard replication, but we are implementing our own in order to be as fast as possible. the gossip protocol usually operates by sending updates (list of users and the number of requests recieved). We need to make sure

Trade offs/Tech choices

Token bucket: Chosen for simplicity with potential for high latency, high burst tolerance, and good latency.
Memcached: Chosen over redis for latency purposes, and other things described above.

Failure scenarios/bottlenecks

The throttle service would be a SPOF, so we should replicate it as it is stateless.
We can use CH or RR load balancing algorithms to distribute the traffic
We can have active-active or active-passive fail over between these, where requests are retried on different nodes if they fail the first time.
If a node fails to upload a heart beat after x seconds, replace it with a follower / fellow node and bring up a new node.
Similar to the throttle service, the in memory DB nodes may fail intermittedly, so we should have similar fail over strategies implemented among the replicas of a shard.
If more permanent failures occur, it should be evident in the logs which can be used to alarm the oncall engineers since this would be a more permanent failure without an automatic fix.

Future improvements

Solutions discussed above

System requirements

Functional:

When a request is received, determine whether or not to throttle
Throttle based on rules for each user, organization, IP Address, and operation requested.

Non-Functional:

Highly available, scale to millions of DAU.
Low latency, must not reduce main API's latency more than necessary
Reliable, no data loss in case of temporary failures

Capacity estimation

QPS Difficult to estimate without a set use case, let's say 100 billion QPD
1x10^11 / 10^5 = 1x10^6 = 1,000,000 QPS
More important is the storage estimation since we have to store a rule for each user/IP
Let's say 10 million DAU. Each user needs a map entry for it's UUID one for their IP.
key: 1 char = 1 byte, IP address <= 12 chars, userID = 64 bits = 76 bytes / key
val: Let's say we only have 10 APIs to make a rule / threshold for. a 32 bit integer should be good enough for a threshold value. Threshold represents number of allowed requests in a window or tokens in a token bucket = 32 bytes / key / operation = 320 bytes / val
Each rule is about 350 bytes, 350 x 10 ^7 = 3.5GB

API design

isThrottled(userID, operation)
return (bool): True if user has exhausted their threshold

Database design

We can use sharding by partitioning on the userID, but we may still need to increase availability. We can also replicate these shards to ensure high availability and lower latency.
Replication incurs an issue with consistency. We would trade off higher latency for a strongly consistent relationship between the replicas, but we probably don't need strong consistency. We can instead use a custom tunable eventual consistency algorithm like the gossip protocol for shard replication (which isn't offered by memcached). Each node "gossips" new updates to other replicas (fanout number is tunable) by independently notifying a subset of nodes every unit time defined by the gossip interval. These two values can tune the convergence time to avoid most underthrottling cases caused by a stale read on the token count.
Since the total size of the DB is only 3.5 GB with 10 million DAU (could be larger with more operations or DAU) it is a good candidate for in-memory storage in some sort of key value store, especially since we are trying to minimize latency. Two common choices are memcached and redis, of which I would choose memcached in this case since it can be more performant than redis in basic use cases. We also have no use for redis' rich functionality enabled by it's rich data format. Also, new versions of redis are no longer open source while memcached is, which could be an obstacle to using redis. One good reason to use redis would be the fact that it supports shard replication which we are using in this design, but we want to implement a custom eventual consistency algorithm to minimize latency.

High-level design

We assume the client has reached the web tier of some product that we wish to track throttling for. This service's web tier will call isThrottled(userID, operation) before allowing the request to be processed.
The throttle service will read from the appropriate memcached node, any replica that belongs to the partition based on userID. We can use consistent hashing for the mapping, which allows for faster inserts / deletes and rehashing.
The token bucket value is incremented by one and compared with the threshold.
If the user is over the threshold for this window, return True for is isThrottled and the web tier will return a 429 to the customer with appropriate retry headers (retry-after)
If the user is below threshold, the web tier allows the request through and eventually returns a 200 to the user.
Token bucket algo for tracking current allotment available to any user
Let's go with the token bucket algo for it's simplicity and since none of the other requirements are stated.
We might not need to log much here, the web tier would likely handle access logging and would include throttle decisions in it's logging. Let's have a monitoring service as well to determine when a failure occurs (lots / not enough throttling, throttle service returning 500 errors) and page the on-call engineers.

Request flows

Web tier recieves a request from userID to do operation
web tier makes a isThrottled(userID, operation) request
The throttle service host hashes the userID to find the memcached shard
Some replica of the user's shard will receive the request, and return the token count as well as increment.
Replica will send out it's new token count to the other replicas, who will take the new request.
The throttle service will compare the threshold to the current token count and return T/F to the webtier based on it

Detailed component design

As for the algorithm itself, we have a few options. The token bucket is tried and true, simple to implement, and resistant to microthrottling - where a customer is overthrottled due to bursty traffic on the edge of a window that otherwise is techincally within the threshold. If we wanted to be more space efficient we could use a sliding window (because no need to store refill rate), but this would have issues with accuracy when bursty traffic is present. We can use an append only log with timestamps to ensure extreme accuracy (due to the timestamps) but incur much more memory usage. If accuracy doesn't matter to us very much, we could use the rolling window log algorithm which is an approximation. It uses probability to determine the current window's usage based on extrapolation of the previous window's usage.
Token Bucket: Each token bucket consists of two integers, the current count and the bucket limit or threshold. When an operation is accessed we decrement the current count, and every bucket has a constant fill rate equal to the allowed frequency of userID to do operation. memcache nodes are a single thread so no race conditions to worry about on a single machine. The gossip protocol avoids locking by essentially allowing for a small percentage of stale reads. The idea is these numbers are typically much larger than 1 so in the case an update was missed it doesn't change the behavior of the system too much.
Memcached sharding and replication for reduced latency:
We can reshard, making the hash range for each shard lower and lower, making the number of shards theoretically infinte. This covers scaling, but introduces a SPOF for any given shard without replication.
Memcached doesn't support shard replication, but we are implementing our own in order to be as fast as possible. the gossip protocol usually operates by sending updates (list of users and the number of requests recieved). We need to make sure

Trade offs/Tech choices

Token bucket: Chosen for simplicity with potential for high latency, high burst tolerance, and good latency.
Memcached: Chosen over redis for latency purposes, and other things described above.

Failure scenarios/bottlenecks

The throttle service would be a SPOF, so we should replicate it as it is stateless.
We can use CH or RR load balancing algorithms to distribute the traffic
We can have active-active or active-passive fail over between these, where requests are retried on different nodes if they fail the first time.
If a node fails to upload a heart beat after x seconds, replace it with a follower / fellow node and bring up a new node.
Similar to the throttle service, the in memory DB nodes may fail intermittedly, so we should have similar fail over strategies implemented among the replicas of a shard.
If more permanent failures occur, it should be evident in the logs which can be used to alarm the oncall engineers since this would be a more permanent failure without an automatic fix.

Future improvements

Solutions discussed above