System requirements
Functional:
- Should have: ability to upload, download files, view list of files
- Should have: ability to sync files across devices
- Nice to have: Tracking usage limit
- Nice to have: hot/warm/cold storage tiers
- Nice to have: ability to share file with other users
Non-Functional:
- Eventual consistency - fine if file is available on other synced devices after several minutes
- Durability - must not lose file once uploaded successfully
- Highly available
- Security - mustn't be able to access someone else's files
- Scalable to handle increasing number of users, storage needs
Capacity estimation
Say 10B users, 5GB each, so 50B GB (about 50 EB) of storage.
Say activity is 10K reads and 5K writes a day. Assume reads outnumber writes in general.
Say 5 devices are allowed per user, so syncing is a challenge too.
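The storage figure can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the user count and per-user quota come from the estimate above, while the replication factor of 3 and the decimal units (1 EB = 10^9 GB) are assumptions added for illustration.

```python
# Back-of-envelope storage estimate.
USERS = 10_000_000_000        # 10B users (from the estimate above)
GB_PER_USER = 5               # 5 GB per user (from the estimate above)
REPLICATION_FACTOR = 3        # assumption: the notes do not state a replica count
GB_PER_EB = 1_000_000_000     # decimal units: 1 EB = 10^9 GB


def raw_storage_gb(users: int, gb_per_user: int) -> int:
    """Total user data in GB, before replication."""
    return users * gb_per_user


def to_exabytes(gigabytes: int) -> float:
    """Convert GB to EB using decimal units."""
    return gigabytes / GB_PER_EB


raw_gb = raw_storage_gb(USERS, GB_PER_USER)
raw_eb = to_exabytes(raw_gb)
replicated_eb = raw_eb * REPLICATION_FACTOR

print(f"raw: {raw_gb:,} GB = {raw_eb} EB; with replication: {replicated_eb} EB")
```

Replication multiplies the raw number directly, so the replica count is worth settling early.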
API design
- PutFile(userId, fileName, fileContents, folderAbsolutePath)
- GetFile(userId, fileAbsolutePath)
- ListContents(userId, pathPrefix)
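To make the three calls concrete, here is a minimal in-memory sketch. The class name InMemoryFileStore, the snake_case method names, and the return types are assumptions for illustration; a real service would sit behind the API gateway and write to blob storage rather than a dict.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryFileStore:
    """Toy stand-in for the file service; keys are (user_id, absolute_path)."""
    _files: dict = field(default_factory=dict)

    def put_file(self, user_id: str, file_name: str,
                 file_contents: bytes, folder_absolute_path: str) -> str:
        """Store the file and return its absolute path."""
        path = folder_absolute_path.rstrip("/") + "/" + file_name
        self._files[(user_id, path)] = file_contents
        return path

    def get_file(self, user_id: str, file_absolute_path: str) -> bytes:
        """Return the contents; raises KeyError if this user has no such file."""
        return self._files[(user_id, file_absolute_path)]

    def list_contents(self, user_id: str, path_prefix: str) -> list:
        """Return this user's file paths that start with path_prefix, sorted."""
        return sorted(
            path for (owner, path) in self._files
            if owner == user_id and path.startswith(path_prefix)
        )


store = InMemoryFileStore()
store.put_file("u1", "a.txt", b"hello", "/docs")
store.put_file("u1", "b.txt", b"world", "/docs/")
store.put_file("u2", "c.txt", b"other", "/docs")
```

Because every lookup is keyed by user id, one user cannot read another user's file through this interface, which lines up with the security requirement.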
Database design
Files are blobs, so their contents are stored on spinning disk or SSD. Disk would be cheaper. Blobs are replicated for durability. The file metadata DB will mainly hold: name of file, absolute path of file, creation date, update date, version, user id, storage tier, shared user list (ACL)/permissions, and sync status.
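The metadata record could be sketched as a small data class. The field names follow the list above; the enum values, the ACL shape (shared user id mapped to a permission string), and the two helper methods are assumptions for illustration, not a fixed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class StorageTier(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


class SyncStatus(Enum):
    PENDING = "pending"   # assumption: at least one device has not yet synced
    SYNCED = "synced"


@dataclass
class FileMetadata:
    user_id: str
    name: str
    absolute_path: str
    created_at: datetime
    updated_at: datetime
    version: int = 1
    storage_tier: StorageTier = StorageTier.HOT
    acl: dict = field(default_factory=dict)   # shared user id -> permission
    sync_status: SyncStatus = SyncStatus.PENDING

    def can_read(self, requester_id: str) -> bool:
        """The owner and any user listed in the ACL may read the file."""
        return requester_id == self.user_id or requester_id in self.acl

    def record_update(self, now: datetime) -> None:
        """Bump the version and mark the file as needing a sync."""
        self.version += 1
        self.updated_at = now
        self.sync_status = SyncStatus.PENDING


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
meta = FileMetadata("alice", "a.txt", "/docs/a.txt", created_at=t0, updated_at=t0)
meta.acl["bob"] = "read"
```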
High-level design
Components:
- API gateway to receive requests from the client and route them to the reader/writer, load balance among hosts, and throttle misbehaving clients
- FileUploader service that writes to S3 or disk. May want to split large files into parts.
- FileSyncer service that handles user device connections and syncs files. Also used for manual download.
- FileMetadata manager for CRUD to file metadata DB
Request flows
Upload: the client connects to the API gateway, which sends the request to the file uploader, which in turn writes to the file metadata service and writes the file to disk. The file uploader has change data capture, which exposes writes as an event stream. The syncer listens to this event stream to determine whether there is a new file that should be synced.
Sync: the client device connects to the API gateway, which sends the request to the file syncer to register the user's device as present. The device long-polls for new files: the service keeps the connection open until a timeout or until there is something to send to the user, and then the client opens a connection again.
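The long-polling behaviour on the syncer side can be sketched with a blocking queue per device. DeviceChannel and its method names are hypothetical; the sketch only shows the "hold the request until there is data or a timeout" logic, not the HTTP layer or the event stream consumer.

```python
import queue


class DeviceChannel:
    """Per-device mailbox of changed file paths (hypothetical name)."""

    def __init__(self) -> None:
        self._pending: queue.Queue = queue.Queue()

    def notify(self, file_path: str) -> None:
        """Called by the syncer when the event stream reports a new file."""
        self._pending.put(file_path)

    def long_poll(self, timeout_seconds: float) -> list:
        """Block until at least one change arrives or the timeout expires.

        Returns every change queued so far, or an empty list on timeout,
        after which the client is expected to open a new request.
        """
        try:
            first = self._pending.get(timeout=timeout_seconds)
        except queue.Empty:
            return []
        changes = [first]
        while True:
            try:
                changes.append(self._pending.get_nowait())
            except queue.Empty:
                return changes


channel = DeviceChannel()
```

An empty response tells the client only that the timeout elapsed, so it reconnects immediately.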
Detailed component design
- Disk is partitioned to cope with the high storage requirement. The partition is found from the file name and absolute path. We can use consistent hashing: disks are placed on a circle, and each disk owns a range of the key space (say, an alphabetical range of paths).
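A minimal consistent-hashing sketch for picking a disk from a file's absolute path. Note that this version hashes the path onto the ring rather than using alphabetical ranges, and it uses virtual nodes to even out load; both choices, along with the HashRing name and the disk labels, are assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to disks; each disk appears at many points on the ring."""

    def __init__(self, disks, vnodes: int = 100) -> None:
        if not disks:
            raise ValueError("at least one disk is required")
        ring = []
        for disk in disks:
            for i in range(vnodes):
                ring.append((self._hash(f"{disk}#{i}"), disk))
        ring.sort()
        self._ring = ring
        self._positions = [position for position, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        """Stable 64-bit position on the ring derived from SHA-256."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def disk_for(self, absolute_path: str) -> str:
        """Walk clockwise from the key's position to the next disk point."""
        index = bisect.bisect_right(self._positions, self._hash(absolute_path))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["disk-a", "disk-b", "disk-c"])
bigger_ring = HashRing(["disk-a", "disk-b", "disk-c", "disk-d"])
```

The reason to prefer this over a plain modulo on the hash is that adding a disk only moves keys onto the new disk; everything else stays where it was.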
Trade offs/Tech choices
- Used disk instead of SSD as it's cheaper. We are also assuming mostly large files, so sequential reads are fast enough. Also replicated for durability.
Failure scenarios/bottlenecks
- Large files and partial upload/download failures - client can retry on failure. Use multipart upload. Can also chunk files prior to storing and store info on the parts to help with download.
- Syncing across devices - using long polling to help with this.
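The chunking idea from the first bullet can be sketched as follows: split the file, record a manifest of parts, retry only the parts that are missing, and verify each part on download. The 4 MiB chunk size, the manifest fields, and the function names are assumptions for illustration.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumption: 4 MiB parts


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Cut the file contents into fixed-size parts (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def build_manifest(chunks: list) -> list:
    """Describe each part so uploads can resume and downloads can be verified."""
    return [
        {"index": i, "size": len(chunk), "sha256": hashlib.sha256(chunk).hexdigest()}
        for i, chunk in enumerate(chunks)
    ]


def missing_parts(manifest: list, uploaded_indexes: set) -> list:
    """Indexes the client still has to (re)send after a partial failure."""
    return [part["index"] for part in manifest if part["index"] not in uploaded_indexes]


def reassemble(manifest: list, chunks_by_index: dict) -> bytes:
    """Join parts in manifest order, rejecting any part whose checksum differs."""
    out = bytearray()
    for part in manifest:
        chunk = chunks_by_index[part["index"]]
        if hashlib.sha256(chunk).hexdigest() != part["sha256"]:
            raise ValueError(f"part {part['index']} failed checksum verification")
        out += chunk
    return bytes(out)


data = bytes(range(256)) * 10            # 2560 bytes of sample content
chunks = split_into_chunks(data, chunk_size=1000)
manifest = build_manifest(chunks)
```

With a manifest like this, a failed upload resumes from missing_parts instead of starting over.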