Design an IP Blocking Service - System Design

System requirements

Functional:

MVP:

Design an IP blocking service that blocks malicious IPs. A black-box service exposes an API endpoint that tells us whether an IP should be blocked.
We trust the black-box service's verdict. Each call to that service is expensive.
Blocked IPs are never unblocked, i.e. saved items have no TTL.
The service must handle both IPv4 and IPv6.

Non MVP:

Blocking malicious users is out of scope: the service only blocks malicious IPs, and users can work around IP blocking with a proxy or a botnet.
Support multi-datacenter / cross-region IP blocking.
Support unblocking IPs. That requires an invalidation path and may introduce consistency issues, i.e. an unblock operation takes effect only after some delay (which should be acceptable).

Non-Functional:

Availability: our service is the entry point; if it is down, every downstream service is affected.

Scalability: the system is read-heavy, with a large volume of lookups.

Low latency: every incoming request passes through the check.

Consistency: since a blocked IP is never unblocked, consistency is not a big concern. The system is eventually consistent.

Capacity estimation

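1 billion individual IP visits per day

1,000,000,000 / 100,000 ≈ 10k QPS (a day has 86,400 seconds, rounded to 100,000 for easy arithmetic)

5% are malicious IPs

1,000,000,000 × 5% = 50,000,000 = 50 million malicious IPs need to be stored

Each IP record takes 300 bytes

50,000,000 × 300 bytes = 15 billion bytes = 15 GB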
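
The estimate above can be re-derived in a few lines. This is only a back-of-envelope sketch: every constant is an assumption taken from these notes, and the variable names are made up for illustration.

```python
# Back-of-envelope capacity numbers; all inputs are assumptions from the notes.
DAILY_IP_VISITS = 1_000_000_000   # individual IP visits per day
SECONDS_PER_DAY = 86_400          # the notes round this to 100,000
MALICIOUS_PERCENT = 5             # share of IPs assumed to be malicious
BYTES_PER_RECORD = 300            # address, timestamp, and storage overhead

average_qps = DAILY_IP_VISITS / SECONDS_PER_DAY
rounded_qps = DAILY_IP_VISITS // 100_000
blocked_ips = DAILY_IP_VISITS * MALICIOUS_PERCENT // 100
storage_gb = blocked_ips * BYTES_PER_RECORD / 1e9

print(f"average QPS (86,400 s/day): {average_qps:,.0f}")
print(f"average QPS (rounded day):  {rounded_qps:,}")
print(f"blocked IPs to store:       {blocked_ips:,}")
print(f"storage needed:             {storage_gb:.0f} GB")
```

These are daily averages; peak traffic would be higher, and the notes do not state a peak factor.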

API design

check_access(ip_address)

returns granted/blocked
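
A minimal sketch of what this endpoint's contract could look like. The `AccessDecision` enum and the extra `blocked` parameter (an in-memory set standing in for the real storage) are assumptions made for illustration; the notes only specify `check_access(ip_address)` and its two results.

```python
import ipaddress
from enum import Enum


class AccessDecision(Enum):
    GRANTED = "granted"
    BLOCKED = "blocked"


def check_access(ip_address: str, blocked: set) -> AccessDecision:
    """Return BLOCKED if the IP is in the blocked set, else GRANTED.

    `blocked` is a hypothetical stand-in for the storage layer and is
    expected to hold canonical address strings.
    Raises ValueError if `ip_address` is not a valid IPv4/IPv6 address.
    """
    # Canonicalize so that equivalent spellings of one address compare equal.
    canonical = str(ipaddress.ip_address(ip_address))
    if canonical in blocked:
        return AccessDecision.BLOCKED
    return AccessDecision.GRANTED


blocked_ips = {"203.0.113.9", "2001:db8::1"}
print(check_access("198.51.100.7", blocked_ips).value)
print(check_access("2001:0DB8:0:0:0:0:0:1", blocked_ips).value)
```

Rejecting malformed input with an error, rather than silently granting or blocking, is a design choice of this sketch and not something the notes prescribe.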

Database design

Key-value database

blocked_ip table:

ip_address (PK)

created_timestamp

Sharded by ip_address
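
Sharding by ip_address could look roughly like the sketch below. The function name `shard_for` and the hash-then-modulo scheme are assumptions for illustration; a distributed store usually does its own partitioning, and plain modulo remaps most keys whenever the shard count changes, which is why consistent hashing is often preferred in practice.

```python
import hashlib
import ipaddress


def shard_for(ip_text: str, num_shards: int) -> int:
    """Map an IPv4/IPv6 address to a shard index in [0, num_shards)."""
    # Hash the packed bytes so that different spellings of the same
    # address always land on the same shard.
    packed = ipaddress.ip_address(ip_text).packed
    digest = hashlib.sha256(packed).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


for ip in ("192.0.2.1", "2001:db8::1"):
    print(ip, "-> shard", shard_for(ip, 16))
```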

We can store IPv4 and IPv6 in the same column

Some databases, such as Cassandra, have native support for IP values (the inet type)

If that is not supported, we can save the address as a varbinary.
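
As a sketch of the varbinary option, Python's standard `ipaddress` module can turn either address family into a fixed binary key. The helper names `to_storage_key` and `from_storage_key` are hypothetical.

```python
import ipaddress


def to_storage_key(ip_text: str) -> bytes:
    """Encode an address as 4 bytes (IPv4) or 16 bytes (IPv6)."""
    return ipaddress.ip_address(ip_text).packed


def from_storage_key(key: bytes) -> str:
    """Decode a 4- or 16-byte key back to its canonical text form."""
    return str(ipaddress.ip_address(key))


v4_key = to_storage_key("192.0.2.1")
v6_key = to_storage_key("2001:0DB8:0:0:0:0:0:1")
print(len(v4_key), from_storage_key(v4_key))
print(len(v6_key), from_storage_key(v6_key))
```

The key length tells the two families apart, so a single column can hold both without an extra type field.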

Another option is to store IPv4 and IPv6 in different tables, but that would add complexity and more moving pieces, so we choose to store them in the same column of the same table.

Another option is to load the data into memory on each instance, since the data fits into the memory of a standard instance. We can use an LRU strategy to keep one day of malicious IPs in memory, and use a database to store historical malicious IPs.

Downside: duplicate calls to the external service for the same IP. A new instance needs to rebuild the in-memory table from scratch.

Upside: no need for central storage, and no hot-spot issue (as each instance holds the info itself).

Duplicate calls are a big concern for this approach, so we go with the centralized data storage approach.
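
For reference, the per-instance LRU table considered above could be sketched as follows. The class name `LruBlocklist` and its entry-count capacity are assumptions; the notes bound the table by time (one day) rather than by size.

```python
from collections import OrderedDict


class LruBlocklist:
    """In-memory set of blocked IPs that evicts the least recently used."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._entries = OrderedDict()  # ordered oldest -> most recently used

    def add(self, ip: str) -> None:
        self._entries[ip] = True
        self._entries.move_to_end(ip)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the least recently used

    def contains(self, ip: str) -> bool:
        if ip not in self._entries:
            return False
        self._entries.move_to_end(ip)  # a hit refreshes recency
        return True


table = LruBlocklist(capacity=2)
table.add("203.0.113.1")
table.add("203.0.113.2")
table.contains("203.0.113.1")   # refresh .1, so .2 becomes the oldest
table.add("203.0.113.3")        # evicts .2
print(table.contains("203.0.113.2"))
```

An evicted IP would trigger a fresh call to the external service on its next visit, which is exactly the duplicate-call cost noted above.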

High-level design

We implement the IP blocker as a service separate from the API gateway because this gives us more flexibility (e.g. replacing the existing API gateway), reduces the load on the API gateway, and decouples the development/deployment process.

When it sees an IP for the first time, the access control service calls the IP validation service's API to check the IP. If the result is granted, we grant access; otherwise we deny access. We save the query result in the ip_access table.

The next time the same IP visits, we first look it up in the database. If the IP exists, we check its access status; otherwise we call the IP validation service again to check the IP.
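
The lookup-then-validate flow described above can be sketched like this. `AccessControlService`, the `validate_ip` callable (standing in for the expensive black-box API), and the dict standing in for the ip_access table are all hypothetical names for illustration.

```python
class AccessControlService:
    """Checks the local store first and calls the validator only on a miss."""

    def __init__(self, validate_ip, store=None):
        # validate_ip(ip) -> True if the IP should be blocked (expensive call).
        self._validate_ip = validate_ip
        # A dict stands in for the ip_access table.
        self._store = store if store is not None else {}

    def check_access(self, ip: str) -> str:
        status = self._store.get(ip)
        if status is None:
            # First sighting: ask the validation service and remember the answer.
            status = "blocked" if self._validate_ip(ip) else "granted"
            self._store[ip] = status
        return status


calls = []


def fake_validator(ip: str) -> bool:
    # Test double: records each call and blocks a single hard-coded address.
    calls.append(ip)
    return ip == "203.0.113.9"


service = AccessControlService(fake_validator)
print(service.check_access("203.0.113.9"))
print(service.check_access("203.0.113.9"))
print("validator calls:", len(calls))
```

The second lookup is served from the store, so the expensive service is called once per distinct IP.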

To speed up DB lookups, we can put a cache in front of the DB. In our case, an IP that was seen recently is likely to be seen again soon, so a write-back cache fits: write data to the cache first, and let the cache sync it to the DB afterwards. The other choice is a read-through cache, i.e. we write to the DB and load the entry into the cache on a miss. Given the recency-driven access pattern, the write-back cache should be the better choice. One downside is possible data loss (the cache crashes before writing to the DB). That is acceptable here: if the IP is not in the DB, we simply call the API to validate the IP again.
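
A toy illustration of the write-back behaviour, assuming plain dicts for both the cache and the DB and an explicit `flush()` in place of the background sync a real cache would run. `WriteBackCache` is a hypothetical name.

```python
class WriteBackCache:
    """Writes land in the cache first; flush() later persists them to the DB."""

    def __init__(self, db: dict):
        self._db = db          # dict standing in for the database
        self._cache = {}
        self._dirty = set()    # keys written to the cache but not yet persisted

    def put(self, ip: str, status: str) -> None:
        self._cache[ip] = status
        self._dirty.add(ip)

    def get(self, ip: str):
        if ip in self._cache:
            return self._cache[ip]
        status = self._db.get(ip)
        if status is not None:
            self._cache[ip] = status   # warm the cache on a DB hit
        return status

    def flush(self) -> int:
        """Persist all dirty entries and return how many were written."""
        flushed = len(self._dirty)
        for ip in self._dirty:
            self._db[ip] = self._cache[ip]
        self._dirty.clear()
        return flushed


db = {}
cache = WriteBackCache(db)
cache.put("198.51.100.7", "blocked")
print("in DB before flush:", "198.51.100.7" in db)
print("flushed entries:", cache.flush())
print("in DB after flush:", "198.51.100.7" in db)
```

If the process dies between `put` and `flush`, the entry is lost, which matches the data-loss case discussed above; the next visit from that IP simply triggers another validation call.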

To handle large traffic, we can add more instances. For caching, we can distribute requests across cache nodes based on IP address. If a spam attacker makes some IPs extremely hot, we have some options:

1) Load the info for those IPs into each service instance, so there is no need to look up the distributed cache / DB

2) Apply rate limiting to the offending client IPs

3) Use a shadow ban: return 200 to the client without actually serving the request
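
Option 2 could be implemented in several ways; the sketch below uses a token bucket, which is one common choice and not something these notes prescribe. The class name, the parameters, and the explicit `now` argument (passed in so the example is deterministic) are assumptions.

```python
class TokenBucket:
    """Allows up to `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int, now: float = 0.0):
        self._rate = rate_per_sec
        self._burst = burst
        self._tokens = float(burst)
        self._updated = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at `burst`.
        elapsed = max(0.0, now - self._updated)
        self._tokens = min(self._burst, self._tokens + elapsed * self._rate)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


# One bucket per client IP; timestamps are in seconds.
bucket = TokenBucket(rate_per_sec=1.0, burst=2, now=0.0)
results = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
print(results)
```

In a real service the buckets would be keyed by IP and `now` would come from a monotonic clock such as `time.monotonic()`.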

We let DNS route each client to the closest data center. To sync data among data centers, we can:

1) Let each data center maintain its own blocked IP table

2) Broadcast each write to all data centers

Since traffic is routed by region, 1) should be the better choice here.
