System requirements
Functional:
The core functionlity we will focus on for dropbox is file upload, get file and delete file.
Non-Functional:
We care less about latency and more about availability in this system. To have a more available system, using systems like zookeeper to make sure our load balancers are working can be a plus. Creating multiple database and S3 bucket copies.
Capacity estimation
Assuming we have 200 million total users and about 20 million active users daily. If on average each user is using about 10 gigabytes of data in their lifetyle. We will have 200 million * 10gb which is 200 * 1000 * 1000 * 10gb = 2000 petabytes of total data. All this data would have to be cloned in other data centers for availability purposes, so it can actually be 8000 petabytes of total data.
Lets say each user uploads twice per month and reads data 10 times per month. And each file is on average 500mb. Then we have 20 million * 0.5gb = 20 * 1000 * 1000 * 0.5 = 10 petabytes of data per read/write. So daily we are reading 100 petabytes of data and writing 20 petabytes.
API design
We want to have a POST request https://drive/user_id/create
which will return 201 success and 400 fail. indicating that the upload has began. From there we probably want to create a websocket connection that will allow us to get a notification once the upload is complete. If a file is over 5 gigabytes, we can return a 413 error to the user.
We want to have a GET request https://drive/video_id which will return 200 with { video_url: string, user_id }
We want to have a DELETE request https://drive/video_id which will return 200 success or 404 not found if the video_id did no exist.
Database design
Video schema could be
{
video_id primary,
created_at,
video_url,
name,
description,
users: []
}
and our actual folder structure
{
_id,
name:
type:
children: []
}
In our example here we want to use content editable storage so that the data gets broken up into smaller bits maybe 4mb before we store them in our s3 bucket. We then create an id based on this content and store that in our nosql database for reference to the data. We also store which users will have access to that data block. Therefore our s3 bucket here will not write to a cdn naturally and users have to access the file through the server.
We will create read copies of our nosql database so that we reduce latency on the reads at least.
We are using nosql instead of sql database because it will allow us to store the folder structure graph easily as its just a JSON object. Also since we are looking for primarily a key value pair and don't necessarily need any joins, this can be useful.
High-level design
We will have a load balancer to distribute the server load. But assuming we need multiple load balancers and we want to make sure we have high availability of data, we will need something like zookeeper which will check the heart beat and distribute traffic to the correct load balancer.
In our example here we want to use content editable storage so that the data gets broken up into smaller bits maybe 4mb before we store them in our s3 bucket. We then create an id based on this content and store that in our nosql database for reference to the data. We also store which users will have access to that data block. Breaking data into smaller chunks means that if any data already exists in our S3 bucket, a new user will not have to upload that chunk and this way we can share storage globally.
Request flows
Client hits the load balancer which will distribute the load to the server. We also have zookeeper which checks if the load balancer is currently working, and if not it sends a notification so that the config is updated in real time excluding said load balancer and routing traffic through a load balancer that is working.
The server then breaks the video into 4b chunks lets say and uses content editable storage to create an id. As the file is being uploaded we push a task to our message queue, and once the file finishes, we pop the queue to then write the id in our no sql database, alongside updating the users array with this user.
Detailed component design
For content addressable storage we can use something like sha256 to actually create the hash.
For how the folders are stored in hierarchy we can use adjacency list type graphs stored in our no sql database
We will have a load balancer to distribute the server load. But assuming we need multiple load balancers and we want to make sure we have high availability of data, we will need something like zookeeper which will check the heart beat and distribute traffic to the correct load balancer.
Trade offs/Tech choices
We are using S3 bucket instead of file storage. Although file storage may seem intuitive for something like dropbox, it will be harder to scale as we want to avoid storing duplicate data.
Since our current system is avoiding the usage of CDNS since the data needs to be only viewed by said user, we are sacrificing latency for security.
Using nosql database is trading consistency for latency, but we are primarily using it so that we can easily store the folder heirarchy represented in this case by a graph
Failure scenarios/bottlenecks
We have SPOF. We need many S3 buckets to make sure the data is available. This can create more latency on the system, so these S3 buckets need to be replicated in the background without affecting read/writes to the original database.
To handle large load, we can create background workers that upload to our S3 bucket and push to message queue, from there we can also push to our sql databases after the file uploads. We also want to handle multi file uploads so that everything is being uploaded in one shot. By having many sql databases as well we can write to different copies and then make sure they are eventually consistent. This means that some data might be replicated but unlikely as that would mean 2 users would have to upload the same file chunk at the same time. But this will reduce overall latency so its worth the risk.
We also need some garbage collection so that when data is deleted, altough we remove it from our nosql database, we want to make sure we only remove it from s3 bucket if its not used in our nosql database at all. So we can have. background worker that handles this.
Future improvements
We have SPOF. We need many S3 buckets to make sure the data is available. This can create more latency on the system, so these S3 buckets need to be replicated in the background without affecting read/writes to the original database.
To handle large load, we can create background workers that upload to our S3 bucket and push to message queue, from there we can also push to our sql databases after the file uploads. We also want to handle multi file uploads so that everything is being uploaded in one shot. By having many sql databases as well we can write to different copies and then make sure they are eventually consistent. This means that some data might be replicated but unlikely as that would mean 2 users would have to upload the same file chunk at the same time. But this will reduce overall latency so its worth the risk.
We also need some garbage collection so that when data is deleted, altough we remove it from our nosql database, we want to make sure we only remove it from s3 bucket if its not used in our nosql database at all. So we can have. background worker that handles this.