System requirements
Functional:
- User can upload files to this system.
- User can update existing files.
- User can organize files in directories.
- User can share a file or directory with other users.
- File sharing options are read-only or read-write.
- User can edit a file while offline from the server.
There are many other requirements such as search, preview, collaboration, and so on. But the above is plenty for an interview session.
[Generally speaking, you would like to keep the requirements scope small. You only have 35 - 50 min in an interview. If you have a lot of requirements, you'd risk running out of time. The central theme of this problem is how to upload and store big files and sync. You'd like to focus on that. Suggest a smaller set and get the interviewer's buy-in. If they want you to expand the scope, they would ask.]
Non-Functional:
- Data durability. Once a file is uploaded, it should never be lost.
- Availability
- Scalability. Many users.
- Strong Consistency. All clients should see the same files.
- Security. Files can be accessed only by the owner and by users with explicit sharing permissions. By default, a file is not shared with anybody other than the owner.
- Response time: reasonable. We would like to show a home directory quickly (e.g. sub-second), but if upload and download take 1 - 10s depending on file size, that would be acceptable.
[Consistency is a key requirement for this problem. For a file system, the consistency requirement is high. For example, when a file is deleted, it has to be deleted for all clients. Eventual consistency, e.g., "at some point the file will be deleted for all clients", would not do. This would affect the database choice later.]
Capacity estimation
200M DAU
Each person reads twice a day, writes once a day.
Average file size: 1MB
20M new files generated per day
20TB new files per day.
File storage size would be 14.6PB in two years.
256B for file metadata (owner, version, etc.) per file.
20M * 0.25KB = 5GB / day
3.6TB in two years.
[It is important to focus on the numbers that affect the architecture. In this example, the file data size is obviously very large. This would affect the data store choice. You could spend a lot of time discussing other things like network bandwidth, response time, server capacities, but we would recommend not. As soon as you see the huge number, you'd know it will present a big enough challenge.]
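The arithmetic above can be checked with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, decimal units are assumed (1 TB = 10^12 bytes), and the metadata size is rounded from 256 B to 0.25 KB as in the estimate.

```python
# Back-of-the-envelope capacity estimate using the figures from the text.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 TB = 10**12 bytes).
NEW_FILES_PER_DAY = 20_000_000
AVG_FILE_SIZE_BYTES = 1_000_000     # 1 MB average file
METADATA_BYTES_PER_FILE = 250       # ~0.25 KB, rounded from 256 B
DAYS_IN_TWO_YEARS = 2 * 365


def estimate(days: int = DAYS_IN_TWO_YEARS) -> dict:
    """Return daily and cumulative storage needs for files and metadata."""
    file_bytes_per_day = NEW_FILES_PER_DAY * AVG_FILE_SIZE_BYTES
    meta_bytes_per_day = NEW_FILES_PER_DAY * METADATA_BYTES_PER_FILE
    return {
        "file_tb_per_day": file_bytes_per_day / 1e12,
        "file_pb_total": file_bytes_per_day * days / 1e15,
        "meta_gb_per_day": meta_bytes_per_day / 1e9,
        "meta_tb_total": meta_bytes_per_day * days / 1e12,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:.2f}")
```

The result reproduces 20 TB of new file data per day and 14.6 PB in two years, against 5 GB of metadata per day and roughly 3.65 TB in two years. The three-orders-of-magnitude gap between the two is what drives the separate data store choices later.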
API design
We will introduce the concept of chunking in the API. Chunking, i.e., splitting a large file into fixed-size blocks (e.g. 4MB), has many advantages:
- It allows parallel uploading & downloading of a large file. Modern browsers support this.
- It allows us to operate on a smaller block, instead of a larger file, for important operations like error checking and resending.
- It allows a partial update (i.e. update the changed chunk, instead of uploading the whole file again.)
[Chunking is a key idea here because it is so beneficial. It may not be a great deep dive topic, as there are not many tradeoffs you can discuss. You can still talk about it briefly.]
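A minimal sketch of fixed-size chunking and partial update, assuming the 4MB block size mentioned above. The function names are hypothetical, and chunks are compared byte-for-byte here for simplicity; a real client would more likely compare per-chunk checksums.

```python
from typing import Iterator, List

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, the example block size from the text


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def changed_chunk_ids(old: bytes, new: bytes, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Return indexes of chunks in `new` that differ from, or do not exist in, `old`.

    Only these chunks need to be re-uploaded. Chunks that were removed by
    truncating the file are not reported by this sketch.
    """
    old_chunks = list(split_into_chunks(old, chunk_size))
    changed = []
    for index, chunk in enumerate(split_into_chunks(new, chunk_size)):
        if index >= len(old_chunks) or old_chunks[index] != chunk:
            changed.append(index)
    return changed
```

One limitation worth knowing: with fixed-size blocks, inserting bytes near the start of a file shifts every later chunk boundary, so all following chunks look changed. In-place edits and appends benefit the most from partial updates.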
- upload(user_ID, file_content, file_name, parent_directory_ID) -> returns file_ID in JSON along with some metadata. This initiates an upload process. It also returns the URL to which upload_chunk() should send chunks.
- update(user_ID, file_ID, file_content) -> replaces the whole file
- upload_chunk(user_ID, file_ID, chunk_ID, chunk_content) -> uploads a specific chunk of a file
- update_chunk(user_ID, file_ID, chunk_ID, chunk_content) -> replaces a specific chunk of a file
- download(user_ID, file_ID)
- mkdir(user_ID, parent_directory_ID, directory_name)
- preview(user_ID, file_ID) # Nice-to-have
- share(file_ID, [user_IDs], sharing_type) -> directory_ID can be used in place of file_ID. group_IDs can be used in place of user_IDs. Sharing_type has several options: read only, read-write, can upload (but not download), can comment (but not update), etc. # Nice-to-have
All APIs return HTTP error codes.
Database design
[Mid-level deep-dive topic. Identifying multiple data types, discussing their properties, and choosing the right data store types, shows a good understanding of data.]
There are two kinds of data to store:
- File contents (in chunks), and
- File metadata.
It is important to consider the two separately, because they have different characteristics.
File contents are unstructured, and they are huge. A blob store (e.g. Amazon S3) would be a suitable choice.
File metadata are structured, and they require strong consistency. For example, file deletion can be implemented by setting the file state to "deleted" in the metadata DB. This should be seen by all clients. Eventual consistency would not do. Since the size is in a manageable range (a few TB over two years), we can choose an RDB.
Metadata would have:
- File name
- File version
- Location (parent directory ID)
- List of file chunks (e.g. chunk IDs in S3)
- Sharing options
- Groups of users
- What kind of access rights (e.g. read/write, read-only, comment only ...) each group has
- Owner
- Created time
- Last edited time
- File checksum
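As a rough illustration, the metadata above could map to relational tables like the following. This is a sketch using Python's built-in sqlite3 module as a stand-in for the RDB; the table names, column names, and the 'uploading' / 'ready' / 'deleted' state values are assumptions made for this example, not a prescribed schema.

```python
import sqlite3
from typing import List

# Hypothetical schema: one row per file, one row per chunk, one row per share.
SCHEMA = """
CREATE TABLE files (
    file_id       INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    parent_dir_id INTEGER,
    owner_id      INTEGER NOT NULL,
    version       INTEGER NOT NULL DEFAULT 1,
    state         TEXT NOT NULL DEFAULT 'ready',  -- 'uploading', 'ready', 'deleted'
    checksum      TEXT,
    created_at    TEXT DEFAULT CURRENT_TIMESTAMP,
    edited_at     TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE file_chunks (
    file_id   INTEGER NOT NULL REFERENCES files(file_id),
    chunk_idx INTEGER NOT NULL,
    blob_key  TEXT NOT NULL,                      -- location in the blob store
    PRIMARY KEY (file_id, chunk_idx)
);
CREATE TABLE shares (
    file_id  INTEGER NOT NULL REFERENCES files(file_id),
    group_id INTEGER NOT NULL,
    access   TEXT NOT NULL CHECK (access IN ('read-only', 'read-write')),
    PRIMARY KEY (file_id, group_id)
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def soft_delete(conn: sqlite3.Connection, file_id: int) -> None:
    """Delete a file by flipping its state inside one committed transaction."""
    with conn:
        conn.execute("UPDATE files SET state = 'deleted' WHERE file_id = ?", (file_id,))


def list_directory(conn: sqlite3.Connection, parent_dir_id: int) -> List[str]:
    """Return the names of visible (non-deleted, fully uploaded) files."""
    rows = conn.execute(
        "SELECT name FROM files WHERE parent_dir_id = ? AND state = 'ready' ORDER BY name",
        (parent_dir_id,),
    )
    return [row[0] for row in rows]
```

Because deletion is a single-row update in a transactional store, every client that lists the directory after the commit sees the same result, which is the consistency property the design relies on.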
High-level design
It is interesting to discuss whether we should have a CDN or not.
The majority of files probably do not benefit from a CDN, but some files have CDN-friendly properties: they are written once and read many times. For example, a famous person uploads an announcement and shares it with many users. The CDN would cache this file, and the many users who read it would access the copy in the CDN instead of hitting the API Gateway. This saves resources in all the backend components we are building. So let's have a CDN.
Chunking
Request flows
Let's discuss the first important flow, Upload.
- Client initiates uploading process by calling upload().
- File Upload Service writes metadata (e.g. file ID, name, location) to RDB.
- File Upload Service returns the URL of Chunk Uploader for file uploading to the client.
- Client chunks the file, and sends the chunks to Chunk Uploader.
- Chunk Uploader saves the chunks to Blob Store.
- After Chunk Uploader saves all the chunks, it marks the file as ready in Metadata DB.
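The upload flow above can be sketched as a small in-memory simulation. Everything here is illustrative: the class name, the dictionary-backed stores, the URL format, and the num_chunks parameter (which tells the server how many chunks to expect) are assumptions that do not appear in the API list.

```python
import itertools
from typing import Dict


class UploadSystem:
    """In-memory stand-in for File Upload Service, Chunk Uploader, and both stores."""

    def __init__(self) -> None:
        self.metadata: Dict[int, dict] = {}  # stands in for the metadata RDB
        self.blobs: Dict[str, bytes] = {}    # stands in for the blob store
        self._ids = itertools.count(1)

    def upload(self, user_id: int, file_name: str,
               parent_directory_id: int, num_chunks: int) -> dict:
        """Steps 1-3: record metadata, then tell the client where to send chunks."""
        file_id = next(self._ids)
        self.metadata[file_id] = {
            "owner": user_id,
            "name": file_name,
            "parent": parent_directory_id,
            "num_chunks": num_chunks,
            "chunks": {},              # chunk_id -> blob key
            "state": "uploading",
        }
        # Hypothetical URL shape, for illustration only.
        return {"file_id": file_id, "chunk_upload_url": f"/files/{file_id}/chunks"}

    def upload_chunk(self, user_id: int, file_id: int,
                     chunk_id: int, chunk_content: bytes) -> None:
        """Steps 4-6: store one chunk; mark the file ready once all have arrived."""
        meta = self.metadata[file_id]
        if meta["owner"] != user_id:
            raise PermissionError("only the owner may upload in this sketch")
        blob_key = f"{file_id}/{chunk_id}"
        self.blobs[blob_key] = chunk_content
        meta["chunks"][chunk_id] = blob_key
        if len(meta["chunks"]) == meta["num_chunks"]:
            meta["state"] = "ready"

    def download(self, user_id: int, file_id: int) -> bytes:
        """Reassemble a ready file from its chunks, in chunk order."""
        meta = self.metadata[file_id]
        if meta["owner"] != user_id:
            raise PermissionError("sharing is not modeled in this sketch")
        if meta["state"] != "ready":
            raise RuntimeError("file is not fully uploaded yet")
        return b"".join(self.blobs[meta["chunks"][i]] for i in range(meta["num_chunks"]))
```

Chunks may arrive in any order, for example from parallel uploads. The file only becomes visible as ready once the last expected chunk has been stored.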
Detailed component design
[Senior level deep dive topic. Offline & syncing is very interesting!]
Offline Sync
Clients, especially mobile apps, will go offline often. We need to make sure the user has as smooth an experience as possible.
(1) Clients should cache files (file metadata and chunks). Files and chunks should be versioned.
This allows users to keep working while offline (say, in transit without network access).
But what happens when the client connects back?
(2) Synchronization
When the client reconnects to the network after some time, it must check whether the file has been updated, either by the client itself (in which case the client has the newer file) or by another client (in which case the server has the newer file). In the worst case, the client and the server both have new, diverged versions.
When the client detects that it is back online, it connects to the server and checks:
- Have any of the chunks been updated on the client?
- Have any of the chunks been updated on the server?
If either client or server has updated a file, while the other party has not, then the party with the newer version should send the updated chunks to the other party.
If both sides updated the document, this is a conflict, and it has to be resolved.
As a general purpose file system, our service does not have domain-specific knowledge of the format and the content of the files. It would be challenging to provide an automatic conflict resolution algorithm.
Therefore, by default it should let the user know that a conflict has happened (e.g. by making the file name appear in bright red on the client), and give the user an option to choose which version to keep: the one on the client or the one on the server.
We should have a Sync Service that handles this reconnect behavior.
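The reconnect decision described above can be sketched as a small function. It assumes the client remembers the server version it last synced with (base_version) and whether it has local edits since then; these names and the SyncAction enum are hypothetical, and the same check could be applied per chunk instead of per file.

```python
from enum import Enum


class SyncAction(Enum):
    NONE = "in_sync"           # nothing changed on either side
    PUSH = "client_uploads"    # only the client changed: send chunks to the server
    PULL = "client_downloads"  # only the server changed: fetch chunks from the server
    CONFLICT = "conflict"      # both changed: ask the user which version to keep


def decide_sync(base_version: int, server_version: int, locally_modified: bool) -> SyncAction:
    """Decide what to do for one file when the client comes back online.

    base_version:     server version the client last synced with
    server_version:   version currently stored on the server
    locally_modified: True if the client edited the file while offline
    """
    server_changed = server_version != base_version
    if locally_modified and server_changed:
        return SyncAction.CONFLICT
    if locally_modified:
        return SyncAction.PUSH
    if server_changed:
        return SyncAction.PULL
    return SyncAction.NONE
```

For example, a file last synced at version 3 that was edited offline while the server moved to version 4 yields SyncAction.CONFLICT, which is the case that has to be surfaced to the user.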
Trade offs/Tech choices
An interesting trade-off is data robustness vs data privacy.
Privacy laws require that users can delete their data completely.
However, to be able to recover data from a disaster, it's important that the files are copied for backup in multiple places.
A reliable metadata store (e.g. in RDB) should remember where all the copied chunks are located, including backups. This way, when a user decides to delete the files, all the copies (copied chunks) are deleted.
Failure scenarios/bottlenecks
[Mid level deep dive topic]
Error Check
Network errors along the upload path might corrupt the chunks. The chunks can get corrupted even after being stored. It is important to make sure files are uploaded correctly, and stay correct.
- Before chunks are uploaded, the client should compute checksums (e.g. SHA1) of the chunks, and send them to the server along with the chunks themselves.
- An async job, triggered during file upload, should read the chunks from Blob Store, calculate their checksums, and make sure they match the checksums sent by the client in the previous step.
- Periodically, another async job should recalculate the checksums and make sure the data stays intact. This is particularly important when data moves (e.g. recovered from a backup copy, or moved to a different data center).
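The checks above could look roughly like this. The sketch uses SHA-256 from Python's hashlib rather than the SHA1 given as an example in the text, models the blob store as a plain dictionary, and uses hypothetical function names.

```python
import hashlib
from typing import Dict, List


def chunk_checksum(chunk: bytes) -> str:
    """Checksum computed by the client before upload (SHA-256 in this sketch)."""
    return hashlib.sha256(chunk).hexdigest()


def find_corrupted(blob_store: Dict[int, bytes], expected: Dict[int, str]) -> List[int]:
    """Return IDs of chunks that are missing or whose checksum no longer matches.

    blob_store: chunk_id -> stored bytes (stand-in for the real blob store)
    expected:   chunk_id -> checksum sent by the client at upload time
    The same function serves both the post-upload job and the periodic job.
    """
    corrupted = []
    for chunk_id in sorted(expected):
        stored = blob_store.get(chunk_id)
        if stored is None or chunk_checksum(stored) != expected[chunk_id]:
            corrupted.append(chunk_id)
    return corrupted
```

Because the check works per chunk, a failure only requires resending or restoring the affected chunks, not the whole file.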
Data Duplication
It is also important to replicate data for robustness.
Each uploaded file should be replicated to three places: one copy in the same data center for quick recovery, one copy in the same region, and one copy in a data center in another region in case of a disaster. All of them should be stored as chunks with version numbers, so that only the lost or corrupted chunks need to be brought back to the production system.
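A sketch of chunk-level recovery from replicas, under the assumption that each copy can be modeled as a dictionary of chunks and that the metadata store holds the expected checksum per chunk. The function names and the nearest-first replica ordering are illustrative only.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def repair_chunks(primary: Dict[int, bytes],
                  replicas: List[Dict[int, bytes]],
                  expected: Dict[int, str]) -> List[int]:
    """Restore lost or corrupted chunks in `primary` from the replicas.

    Replicas are tried in order (e.g. same data center, same region, other
    region). A replica chunk is only used if its checksum matches `expected`.
    Returns the IDs of chunks that no replica could restore.
    """
    unrecoverable = []
    for chunk_id, digest in sorted(expected.items()):
        current = primary.get(chunk_id)
        if current is not None and sha256_hex(current) == digest:
            continue  # chunk is intact, nothing to do
        for replica in replicas:
            candidate = replica.get(chunk_id)
            if candidate is not None and sha256_hex(candidate) == digest:
                primary[chunk_id] = candidate
                break
        else:
            unrecoverable.append(chunk_id)
    return unrecoverable
```

Only the damaged chunks are copied back. Intact chunks in the production copy are left untouched, which keeps recovery traffic proportional to the damage rather than to the file size.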
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?