System requirements
Functional:
design a remote file management system
- user can upload files
- support multiple file types: text, media, etc.
- user can view uploaded files
- once uploaded, files can be viewed from multiple clients
- the home page displays the root directory of the user's file system
- user can share files with other users
- users must be authenticated & authorized
- shared users can be given different permissions (view only/comment/edit)
- user gets notified when a file is updated by another user
- optional: support real-time co-editing feature on files
Non-Functional:
- scalability, availability
- durability: uploaded files are never lost
- eventual consistency when viewing files
- low latency when reading/updating files
- resolve conflicting writes
Capacity estimation
10 million users, with room to grow.
assume 10% DAU (1 million), each averaging 1 upload and 10 views per day
-> ~10 uploads per second, ~100 read QPS
Data volume: assume each user stores 1GB on average -> 10PB in total. With 1 million new files per day at an average of 10MB each -> 10TB added per day -> ~3.65PB per year
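The estimates above can be sanity-checked with a few lines of arithmetic. This is only a sketch under the assumptions already stated (10% DAU, 1 upload and 10 views per active user per day, 1GB stored per user, 10MB per new file); the variable names are illustrative and decimal units (1PB = 10^15 bytes) are assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15  # decimal units (assumption)

total_users = 10_000_000
dau = total_users * 10 // 100            # 10% daily active users
uploads_per_day = dau * 1                # 1 upload per active user
views_per_day = dau * 10                 # 10 views per active user

upload_qps = uploads_per_day / SECONDS_PER_DAY   # about 11.6, i.e. ~10/s
read_qps = views_per_day / SECONDS_PER_DAY       # about 115.7, i.e. ~100/s

total_storage_pb = total_users * 1 * GB / PB               # 1GB per user
daily_growth_tb = uploads_per_day * 10 * MB / TB           # 10MB per new file
yearly_growth_pb = daily_growth_tb * 365 * TB / PB

print(f"uploads/s: {upload_qps:.1f}, reads/s: {read_qps:.1f}")
print(f"total: {total_storage_pb} PB, growth: {daily_growth_tb} TB/day, "
      f"{yearly_growth_pb:.2f} PB/year")
```

These are averages; peak traffic is typically a multiple of them, so the services should be provisioned with headroom.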
API design
thrift style API design
user wants to see all files/dirs under a dir
listFiles {
i64 requestUserId
String path // the directory path
}
response {
list<FileOrDirectory>
}
Union FileOrDirectory {
File file
Directory directory
}
File {
i64 creatorId
String creatorName
timestamp createdAt
timestamp lastUpdatedAt
i64 lastUpdatedBy
String name
String thumbnailUrl
}
Directory {
String path
String name
i64 creatorId
String creatorName
...
}
/////////////////////
uploadFile {
String localPath
String onlinePath
i64 userId
}
response {
boolean isSuccess
}
//////////////////
Database design
user account data: user profile related data (name, picture, etc)
file <> user permission data: when a file is shared with a user, we add a row with the file id, the user id, and the permission granted
file metadata: file name, creator, created/last-edited timestamps, and directory info. This is what renders the home page (root directory) and the file/directory listing before the user opens a specific file
file storage: S3 like storage, storing the actual file data
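A minimal sketch of how the permission rows could be checked before serving a request. The `Permission` ordering (edit implies comment implies view), the `PermissionTable` class, and the rule that the creator always has full access are assumptions made for illustration; an in-memory dict stands in for the real table.

```python
from enum import IntEnum


class Permission(IntEnum):
    # Assumed ordering: each level implies the ones below it.
    VIEW = 1
    COMMENT = 2
    EDIT = 3


class PermissionTable:
    """In-memory stand-in for the file <> user permission table."""

    def __init__(self):
        self._owners = {}   # file_id -> owner user_id
        self._grants = {}   # (file_id, user_id) -> Permission

    def create_file(self, file_id, owner_id):
        self._owners[file_id] = owner_id

    def share(self, file_id, user_id, permission):
        # One row per (file, user); sharing again overwrites the old level.
        self._grants[(file_id, user_id)] = permission

    def is_allowed(self, file_id, user_id, required):
        if self._owners.get(file_id) == user_id:
            return True  # assumption: the owner can do everything
        granted = self._grants.get((file_id, user_id))
        return granted is not None and granted >= required


table = PermissionTable()
table.create_file(file_id=100, owner_id=1)
table.share(file_id=100, user_id=2, permission=Permission.COMMENT)
print(table.is_allowed(100, 2, Permission.VIEW))  # True
print(table.is_allowed(100, 2, Permission.EDIT))  # False
```

Keying the rows by (file id, user id) keeps the check to a single lookup per request.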
High-level design
flowchart TD
B["client"];
D["file uploader"];
n1["load balancer"];
n2["file management service"];
n3[("file metadata")];
n4[("file <> user permission data")];
n5["user account service"];
n6[("user account data")];
n7[("file storage")];
n8["notification service"];
B --> n1;
n1 --> n2;
n2 --> n3;
n2 --> n4;
n1 --> n5;
n5 --> n6;
B --> D;
D --> n7;
D --> n2;
D --> n8;
n2 --> n8;
n8 --> B;
Request flows
Detailed component design
file uploader + file storage: chunk each file into small pieces
- subsequent updates only re-upload the pieces that changed, so the whole file need not be rewritten; this matters most for large files
- an upload is split into one job per piece; if a job fails, only that piece is retried
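The chunking idea can be sketched as follows, assuming fixed-size chunks identified by a SHA-256 hash. `CHUNK_SIZE`, `chunk_hashes`, `changed_chunks`, `upload_chunks`, and the `send` callback are hypothetical names; the tiny chunk size is for the demo only, and a real uploader would likely use chunks of a few MB.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; demo value only (assumption)


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Hash each fixed-size chunk so changed chunks can be detected cheaply."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old_hashes, new_hashes):
    """Indices of chunks that are new or differ from the stored version."""
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]


def upload_chunks(data, indices, send, chunk_size=CHUNK_SIZE, max_attempts=3):
    """Upload only the given chunks, retrying each failed chunk on its own."""
    for i in indices:
        piece = data[i * chunk_size:(i + 1) * chunk_size]
        for attempt in range(1, max_attempts + 1):
            try:
                send(i, piece)
                break
            except IOError:
                if attempt == max_attempts:
                    raise  # give up on this chunk after the last attempt


old = b"aaaabbbbcccc"
new = b"aaaaXbbbcccc"
to_upload = changed_chunks(chunk_hashes(old), chunk_hashes(new))
upload_chunks(new, to_upload, send=lambda i, piece: print("uploading chunk", i))
```

One caveat with fixed-size chunks: inserting bytes in the middle of a file shifts every later chunk, so all of them look changed. In-place edits and appends are handled well.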
file metadata: uploads are infrequent (~10 per second), so write traffic is relatively low and a relational DB can handle it
- shard by user id; cache the data for further read performance
- searching by file name/directory for a given user stays cheap because we don't expect any single user to have too many files
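A rough sketch of sharding metadata by user id, assuming a hash-based shard choice and 16 shards. `NUM_SHARDS`, `shard_for_user`, and `ShardedMetadataStore` are illustrative names, and plain dicts stand in for the relational shards.

```python
import hashlib

NUM_SHARDS = 16  # assumption; chosen only for the example


def shard_for_user(user_id, num_shards=NUM_SHARDS):
    """Map a user id to a shard; hashing spreads sequential ids evenly."""
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedMetadataStore:
    """Each shard is a dict standing in for one relational database."""

    def __init__(self, num_shards=NUM_SHARDS):
        self._shards = [dict() for _ in range(num_shards)]

    def _shard(self, user_id):
        return self._shards[shard_for_user(user_id, len(self._shards))]

    def add_file(self, user_id, directory, name):
        self._shard(user_id).setdefault((user_id, directory), []).append(name)

    def list_files(self, user_id, directory):
        # All of a user's rows live on one shard, so listing touches one DB.
        return sorted(self._shard(user_id).get((user_id, directory), []))


store = ShardedMetadataStore()
store.add_file(42, "/", "resume.pdf")
store.add_file(42, "/", "photo.jpg")
print(store.list_files(42, "/"))
```

Because every query is scoped to a single user, listing or searching never has to fan out across shards. The trade-off is that changing the shard count later moves users between shards.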
notification service:
- when a file is edited by someone else, send an email/push notification to the owner
- updates are sent to the notification service via a message queue, so notification delivery is async from the online editing flow
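The async hand-off can be illustrated with Python's standard `queue.Queue` and a worker thread standing in for a real message broker. The event fields (`file_name`, `owner_id`, `editor_id`) and the `deliver` callback are assumptions for the sketch, not a defined schema.

```python
import queue
import threading


def notification_worker(events, deliver):
    """Drain edit events and notify the file owner; None is a shutdown signal."""
    while True:
        event = events.get()
        try:
            if event is None:
                return
            # Only notify when someone other than the owner made the edit.
            if event["editor_id"] != event["owner_id"]:
                deliver(
                    event["owner_id"],
                    f"{event['file_name']} was edited by user {event['editor_id']}",
                )
        finally:
            events.task_done()


def run_demo():
    events = queue.Queue()
    sent = []
    worker = threading.Thread(
        target=notification_worker,
        args=(events, lambda owner_id, message: sent.append((owner_id, message))),
    )
    worker.start()
    # The editing flow only enqueues an event and returns immediately.
    events.put({"file_name": "report.txt", "owner_id": 1, "editor_id": 2})
    events.put({"file_name": "notes.txt", "owner_id": 1, "editor_id": 1})
    events.put(None)  # stop the worker once the queue is drained
    worker.join()
    return sent


print(run_demo())
```

If the notification side is slow or down, events simply wait in the queue and the editing path is unaffected.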
Trade offs/Tech choices
Failure scenarios/bottlenecks
improve durability: replicate uploaded files across multiple AZs
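One way to picture multi-AZ replication is a write that is only acknowledged once enough AZs have stored the data. This is a sketch, not how any particular storage service works: the `replicate` function, the per-AZ writer callbacks, and the `min_acks=2` threshold are all assumptions.

```python
def replicate(key, data, az_writers, min_acks=2):
    """Write to every AZ; succeed only if at least min_acks writes landed."""
    acked = []
    for az, write in az_writers.items():
        try:
            write(key, data)
            acked.append(az)
        except OSError:
            pass  # this AZ is unavailable; the others may still succeed
    if len(acked) < min_acks:
        raise RuntimeError(f"only {len(acked)} of {len(az_writers)} AZs acked {key}")
    return acked


def make_writer(store):
    """Build a writer backed by a dict that stands in for one AZ's storage."""
    def write(key, data):
        store[key] = data
    return write


def unavailable(key, data):
    raise OSError("AZ unavailable")


az1, az3 = {}, {}
acked = replicate(
    "file-1/chunk-0",
    b"hello",
    {"az-1": make_writer(az1), "az-2": unavailable, "az-3": make_writer(az3)},
)
print(acked)
```

An AZ that missed a write would need to be repaired in the background; that part is not shown here.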
Future improvements