System requirements
Functional:
MVP:
- Design an IP block service that blocks malicious IPs. A black-box service exposes an API endpoint that checks whether an IP should be blocked.
- We trust the verdict of the black-box service. Calls to that service are expensive.
- Blocked IPs will not be unblocked - i.e. no TTL for saved items
- Should be able to handle both IPv4 and IPv6
Non MVP:
- It can only block malicious IPs, not malicious users, since users may use a proxy or a botnet to work around IP-based blocking
- Support multi-datacenter/cross-region IP blocking
- Support unblocking IPs. In that case we will need an invalidation path and may encounter some consistency issues, i.e. there will be some delay before an unblock operation takes effect (which should be ok)
Non-Functional:
Availability: our service is the entry point, so if it is down, downstream services are impacted.
Scalability: read-heavy system with a lot of look-ups.
Low latency.
Consistency: since a blocked IP will not be unblocked, consistency should not be a big concern. It is an eventually consistent system.
Capacity estimation
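1 billion individual IP visits per day
1 000 000 000 / 100 000 = 10 000 = 10k QPS (rounding the 86 400 seconds in a day up to 100 000)
5% are malicious IPs
1 000 000 000 x 5% = 50 000 000 = 50 million malicious IPs need to be stored
each IP record: 300 bytes
50 000 000 x 300 bytes = 15 billion bytes = 15 GB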
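The arithmetic above can be checked with a few lines of Python. This is only a sanity check of the stated figures; the constant and function names are illustrative, and the exact QPS (using 86 400 seconds per day) comes out slightly higher than the rounded 10k.

```python
# Back-of-envelope capacity estimate using the figures stated above.
VISITS_PER_DAY = 1_000_000_000   # individual IP visits per day
SECONDS_PER_DAY = 86_400         # exact; the estimate rounds this to 100 000
MALICIOUS_PERCENT = 5            # share of IPs assumed to be malicious
BYTES_PER_RECORD = 300           # assumed size of one stored record


def estimate():
    """Return (rounded QPS, exact QPS, blocked IP count, storage in GB)."""
    qps_rounded = VISITS_PER_DAY // 100_000
    qps_exact = VISITS_PER_DAY / SECONDS_PER_DAY
    blocked_ips = VISITS_PER_DAY * MALICIOUS_PERCENT // 100
    storage_gb = blocked_ips * BYTES_PER_RECORD / 1e9
    return qps_rounded, qps_exact, blocked_ips, storage_gb


if __name__ == "__main__":
    rounded, exact, blocked, gb = estimate()
    print(f"QPS ~{rounded} (exact {exact:.0f}), {blocked} IPs, {gb:.0f} GB")
```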
API design
check_access(ip_address)
returns granted/blocked
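A minimal sketch of what this endpoint could look like in Python. The Access enum and the is_blocked callable are assumptions added for illustration (the design above only names check_access and its two results). Parsing the input with the standard ipaddress module also normalizes equivalent spellings of the same address.

```python
import ipaddress
from enum import Enum
from typing import Callable


class Access(Enum):
    GRANTED = "granted"
    BLOCKED = "blocked"


def check_access(ip_address: str, is_blocked: Callable[[str], bool]) -> Access:
    """Return BLOCKED or GRANTED for one client IP.

    is_blocked is a hypothetical lookup (cache/DB/validation service)
    that takes the canonical text form of the address.
    """
    # Raises ValueError for malformed input; str() yields the canonical
    # form, so "2001:0DB8:0:0:0:0:0:1" and "2001:db8::1" map to one key.
    canonical = str(ipaddress.ip_address(ip_address))
    return Access.BLOCKED if is_blocked(canonical) else Access.GRANTED
```

For example, with a set of blocked addresses, check_access("203.0.113.7", blocked.__contains__) returns Access.BLOCKED if that address is in the set.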
Database design
KV DB
Blocked_ip table
IP_address (pk)
created_timestamp
Shard by ip_address
We can store IPv4 and IPv6 in the same column
Some databases, like Cassandra, have native support for IP values (the inet type)
If not supported, we can save the address as a varbinary.
Another option is to store IPv4 and IPv6 in different tables, but that would add complexity and more moving pieces, so we choose to store them in the same column of the same table.
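As a rough illustration of the varbinary option, the sketch below packs either address family into bytes with Python's ipaddress module (4 bytes for IPv4, 16 for IPv6, so the length tells the families apart) and derives a shard number from that key. The function names and the choice of SHA-256 for sharding are assumptions, not part of the design above.

```python
import hashlib
import ipaddress


def ip_to_key(text: str) -> bytes:
    """Pack an IPv4/IPv6 address into its binary form (4 or 16 bytes)."""
    return ipaddress.ip_address(text).packed


def key_to_ip(key: bytes) -> str:
    """Recover the canonical text form from a packed key."""
    return str(ipaddress.ip_address(key))


def shard_for(key: bytes, num_shards: int) -> int:
    """Pick a shard by hashing the packed address (illustrative scheme)."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Hashing the key rather than using the raw address bytes spreads neighbouring addresses (for example one attacker subnet) across shards.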
Another option is to load the data into memory on each instance, since the data fits into the memory of a standard instance. We can use an LRU strategy to keep 1 day of malicious IPs in memory, and use a database to store historical malicious IPs.
Downside: duplicate calls to the external service for the same IP. A new instance needs to rebuild its in-memory table from scratch.
Upside: avoids the need for central storage and avoids the hot-spot issue (each instance holds the data itself).
Duplicate calls are a big concern with this approach because the external call is expensive, so we will go with the centralized data storage approach.
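For reference, the per-instance LRU alternative that was considered could be sketched as below. This is an illustration of the rejected option only; the class name and the fixed capacity (standing in for "1 day of malicious IPs") are assumptions.

```python
from collections import OrderedDict


class LruBlockedIps:
    """Bounded in-memory set of blocked IPs with least-recently-used eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()  # ip -> True, oldest entry first

    def add(self, ip: str) -> None:
        self._items[ip] = True
        self._items.move_to_end(ip)          # mark as most recently used
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

    def contains(self, ip: str) -> bool:
        if ip in self._items:
            self._items.move_to_end(ip)      # a hit refreshes recency
            return True
        return False
```

An evicted IP would trigger another call to the external service on its next visit, which is the duplicate-call cost noted above.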
High-level design
We choose to implement the IP blocker as a service separate from the API gateway because this gives us more flexibility (e.g. replacing the existing API gateway), reduces the load on the API gateway, and separates the development/deployment processes.
When seeing an IP for the first time, the access control service calls the IP validation service API to check the IP. If the result is granted, we grant access; otherwise we deny access. We save the query result into the ip_access table.
Next time the same IP visits, we first look it up in the database. If the IP exists, we check its access status; otherwise, we call the IP validation service again to check the IP.
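The look-up flow described above can be sketched as follows. This is a simplified model under stated assumptions: a plain dict stands in for the ip_access table, validate is a hypothetical wrapper around the expensive validation API that returns True when the IP should be blocked, and the call counter exists only to show that repeat visits skip the external call.

```python
from typing import Callable, Dict


class AccessControl:
    def __init__(self, validate: Callable[[str], bool], store: Dict[str, bool]):
        self.validate = validate   # hypothetical: True means "block this IP"
        self.store = store         # stand-in for ip_access: ip -> blocked?
        self.external_calls = 0    # how often the expensive API was used

    def check_access(self, ip: str) -> str:
        blocked = self.store.get(ip)
        if blocked is None:
            # First visit: ask the validation service and persist the verdict.
            self.external_calls += 1
            blocked = self.validate(ip)
            self.store[ip] = blocked
        return "blocked" if blocked else "granted"
```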
To speed up the DB look-up, we can introduce a cache in front of the DB. In our case, the access pattern should be frequent access within a recent time window, so we can use a write-back cache: write data to the cache first, then let the cache sync the data to the DB. The other choice is a read-through cache, i.e. we write to the DB and load into the cache on a miss. Given the recent-access pattern, a write-back cache should be better. One downside is possible data loss (the cache crashes before writing to the DB). That should be ok here: if the IP is not in the DB, we call the API again to validate the IP.
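A toy model of the write-back behaviour, assuming a dict as the DB and an explicit flush() in place of a background sync; all names are illustrative. It shows why losing unflushed entries is tolerable: a later look-up simply misses and falls back to the validation service.

```python
class WriteBackCache:
    """Writes land in the cache first and reach the DB only on flush()."""

    def __init__(self, db: dict):
        self.db = db          # stand-in for the persistent store
        self.cache = {}       # ip -> blocked?
        self.dirty = set()    # keys written to the cache but not yet to the DB

    def put(self, ip: str, blocked: bool) -> None:
        self.cache[ip] = blocked
        self.dirty.add(ip)

    def get(self, ip: str):
        if ip in self.cache:
            return self.cache[ip]
        value = self.db.get(ip)       # miss: fall back to the DB
        if value is not None:
            self.cache[ip] = value    # populate the cache for next time
        return value                  # None means "unknown, call the API"

    def flush(self) -> int:
        """Sync dirty entries to the DB; returns how many were written."""
        count = len(self.dirty)
        for ip in self.dirty:
            self.db[ip] = self.cache[ip]
        self.dirty.clear()
        return count
```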
To handle large traffic, we can add more instances. For caching, we can distribute requests based on IP address. If spam attackers make some IPs super hot, we have some options:
1) load those IPs' info into each instance, so there is no need to look up the distributed cache / DB
2) implement rate limiting on the client side
3) use a shadow ban, i.e. return 200 to the client
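Option 1) could look roughly like the sketch below: each instance counts remote look-ups for blocked IPs and pins an IP locally once it crosses a threshold. The class, the threshold, and the lookup callable are assumptions for illustration. Pinning is safe in the MVP because blocked IPs are never unblocked; a real version would also need to bound the counter.

```python
from collections import Counter
from typing import Callable


class HotIpGuard:
    def __init__(self, lookup: Callable[[str], bool], hot_threshold: int = 3):
        self.lookup = lookup                # hypothetical cache/DB query
        self.hot_threshold = hot_threshold  # remote hits before pinning locally
        self.hits = Counter()
        self.local_blocked = set()          # hot blocked IPs held in-process
        self.remote_lookups = 0

    def is_blocked(self, ip: str) -> bool:
        if ip in self.local_blocked:
            return True                     # served without a remote look-up
        self.remote_lookups += 1
        blocked = self.lookup(ip)
        if blocked:
            self.hits[ip] += 1
            if self.hits[ip] >= self.hot_threshold:
                self.local_blocked.add(ip)  # promote to the local set
                del self.hits[ip]
        return blocked
```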
We let DNS route each client to the closest data center. For data sync among data centers, we can
1) let each data center maintain its own blocked IP table
2) broadcast each write to multiple data centers.
Since traffic is routed by region, 1) should be the better choice here.
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?