System requirements


Functional:

Design a remote file management system.

  • users can upload files
  • multiple file types are supported: text, media, etc.
  • users can view their uploaded files
  • once uploaded, a file can be viewed from multiple clients
  • the home page displays the root directory of the user's file system
  • users can share files with other users
  • users must be authenticated and authorized
  • a user a file is shared with can hold different permissions (view only/comment/edit)
  • a user is notified when one of their files is updated by another user
  • optional: support real-time co-editing of files



Non-Functional:

  • scalability and availability
  • durability: uploaded files are never lost
  • eventual consistency when viewing files
  • low latency when reading/updating files
  • conflicting writes are resolved




Capacity estimation

10 million users, with room to grow.

Assume 10% DAU (1 million), each averaging 1 upload and 10 views per day.

-> 1M uploads / 86,400 s ≈ 12 uploads per second (~10), and 10M views / 86,400 s ≈ 116 read QPS (~100)


Data volume: assume each user stores 1 GB on average -> 10 PB in total. About 1 million new files arrive per day; at an average of 10 MB each -> 10 TB added per day -> ~3.65 PB per year.
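
The arithmetic above can be checked with a short script. This is only a back-of-envelope sketch: the variable names are made up for illustration, and decimal units (1 PB = 1,000,000 GB) are assumed.

```python
# Back-of-envelope capacity estimate using the assumptions stated above.
SECONDS_PER_DAY = 24 * 60 * 60

total_users = 10_000_000
dau = total_users * 10 // 100            # 10% daily active users
uploads_per_day = dau * 1                # 1 upload per active user per day
views_per_day = dau * 10                 # 10 views per active user per day

upload_qps = uploads_per_day / SECONDS_PER_DAY
read_qps = views_per_day / SECONDS_PER_DAY

# Storage in decimal units (1 PB = 1,000,000 GB; 1 TB = 1,000,000 MB).
total_storage_pb = total_users * 1 / 1_000_000        # 1 GB per user
daily_growth_tb = uploads_per_day * 10 / 1_000_000    # 10 MB per new file
yearly_growth_pb = daily_growth_tb * 365 / 1_000

print(f"upload qps ~ {upload_qps:.1f}, read qps ~ {read_qps:.1f}")
print(f"stored: {total_storage_pb:.0f} PB, "
      f"growth: {daily_growth_tb:.0f} TB/day, {yearly_growth_pb:.2f} PB/year")
```

These are averages; peak traffic would likely be a multiple of them, so the rounded figures of ~10 uploads per second and ~100 read QPS are a floor rather than a provisioning target.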




API design

Thrift-style API design

Use case: a user wants to see all files/directories under a directory

// listFiles request
struct ListFilesRequest {
  1: i64 requestUserId
  2: string path  // the directory path
}

// listFiles response
struct ListFilesResponse {
  1: list<FileOrDirectory> itemList
}


union FileOrDirectory {
  1: File file
  2: Directory directory
}


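// Thrift has no built-in timestamp type; i64 epoch milliseconds is an assumption.
typedef i64 timestamp

struct File {
  1: i64 creatorId
  2: string creatorName
  3: timestamp createdAt
  4: timestamp lastUpdatedAt
  5: i64 lastUpdatedBy
  6: string name
  7: string thumbnailUrl
}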


struct Directory {
  1: string path
  2: string name
  3: i64 creatorId
  4: string creatorName
  // ...
}


/////////////////////

// uploadFile request
struct UploadFileRequest {
  1: string localPath
  2: string onlinePath
  3: i64 userId
}

// uploadFile response
struct UploadFileResponse {
  1: bool isSuccess
}


//////////////////



Database design

user account data: user profile data (name, picture, etc.)


file <> user permission data: when a file is shared with a user, we add a row containing the file id, the user id, and the permission granted


file metadata: file name, creator, creation and last-edit timestamps, and directory info. This is what we read to display the home page (root directory) and the file/directory list before the user opens a specific file


file storage: S3-like object storage holding the actual file data
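
To make the file <> user permission rows described above concrete, here is a minimal in-memory sketch. The names `Permission`, `FilePermissionRow` and `PermissionTable` are hypothetical, and the rule that a higher permission implies the lower ones (edit ⊇ comment ⊇ view) is an assumption the design does not state explicitly.

```python
from dataclasses import dataclass
from enum import IntEnum


class Permission(IntEnum):
    # Assumption: ordered so that a higher value implies every lower permission.
    VIEW = 1
    COMMENT = 2
    EDIT = 3


@dataclass(frozen=True)
class FilePermissionRow:
    """One row of the permission table: who may do what to which file."""
    file_id: int
    user_id: int
    permission: Permission


class PermissionTable:
    """In-memory stand-in for the file <> user permission table."""

    def __init__(self) -> None:
        self._rows = {}  # (file_id, user_id) -> FilePermissionRow

    def share(self, file_id: int, user_id: int, permission: Permission) -> None:
        # Sharing adds (or overwrites) the single row for this (file, user) pair.
        self._rows[(file_id, user_id)] = FilePermissionRow(file_id, user_id, permission)

    def is_allowed(self, file_id: int, user_id: int, needed: Permission) -> bool:
        row = self._rows.get((file_id, user_id))
        return row is not None and row.permission >= needed


table = PermissionTable()
table.share(file_id=1, user_id=42, permission=Permission.COMMENT)
print(table.is_allowed(1, 42, Permission.VIEW))   # comment access implies view
print(table.is_allowed(1, 42, Permission.EDIT))   # but not edit
```

In a relational store the same idea would be a table keyed on (file id, user id). The file owner's access is not modeled here and would presumably come from the creator field in the file metadata.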




High-level design

flowchart TD

B["client"];

D["file uploader"];

n1["load balancer"];

n2["file management service"];

n3[("file metadata")];

n4[("file &lt;&gt; user permission data")];

n5["user account service"];

n6[("user account data")];

n7[("file storage")];

n8["notification service"];

B --> n1;

n1 --> n2;

n2 --> n3;

n2 --> n4;

n1 --> n5;

n5 --> n6;

B --> D;

D --> n7;

D --> n2;

D --> n8;

n2 --> n8;

n8 --> B;








Request flows







Detailed component design

file uploader + file storage: the uploader can split each file into small chunks

  • subsequent updates only re-upload the chunks that changed rather than the whole file, which matters most for large files
  • an upload can be split into one job per chunk; if one job fails, only that chunk needs to be retried
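
The chunking idea above can be sketched as follows. This is an illustrative example rather than the actual uploader: the 4 MiB chunk size, the use of SHA-256 digests to detect changes, and the function names are all assumptions.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB; an assumed value, not from the design


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks and return one SHA-256 digest per chunk."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(old_hashes: list, new_hashes: list) -> list:
    """Return the indices of chunks that are new or whose content changed."""
    return [
        index
        for index, digest in enumerate(new_hashes)
        if index >= len(old_hashes) or old_hashes[index] != digest
    ]


# Tiny demo with a 10-byte chunk size so the effect is visible.
old_version = b"a" * 10 + b"b" * 10
new_version = b"a" * 10 + b"c" * 10 + b"d" * 3
changed = chunks_to_upload(chunk_hashes(old_version, 10), chunk_hashes(new_version, 10))
print(changed)  # indices of the chunks that must be re-uploaded
```

Each index returned can become its own upload job, so a failed job is retried without touching the others. With fixed-size chunks, inserting bytes in the middle of a file shifts every later chunk; content-defined chunking is one way to limit that, at the cost of extra complexity.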


file metadata: upload volume is low (~10 per second), so write traffic is relatively low and a relational db can support it

  • sharded by user id; the data can also be cached for further performance improvement
  • searching by file name/directory for a given user stays cheap because all of that user's rows live on one shard and we don't expect users to have too many files
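
A minimal sketch of sharding metadata by user id, as described above. The shard count, the modulo routing rule, and the `ShardedMetadataStore` class are assumptions made for illustration; each shard is a plain dictionary standing in for a relational database.

```python
from collections import defaultdict

NUM_SHARDS = 16  # assumed shard count


def shard_for_user(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Route a user to a shard; simple modulo routing is an assumption."""
    return user_id % num_shards


class ShardedMetadataStore:
    """Each shard is a dict standing in for one relational database."""

    def __init__(self, num_shards: int = NUM_SHARDS) -> None:
        self._num_shards = num_shards
        self._shards = [defaultdict(list) for _ in range(num_shards)]

    def add_file(self, user_id: int, directory: str, name: str) -> None:
        shard = self._shards[shard_for_user(user_id, self._num_shards)]
        shard[(user_id, directory)].append(name)

    def list_dir(self, user_id: int, directory: str) -> list:
        # Only one shard is consulted because all of a user's rows live together.
        shard = self._shards[shard_for_user(user_id, self._num_shards)]
        return sorted(shard.get((user_id, directory), []))


store = ShardedMetadataStore()
store.add_file(42, "/", "notes.txt")
store.add_file(42, "/", "cat.png")
print(store.list_dir(42, "/"))
```

Modulo routing is the simplest option but forces data movement whenever the shard count changes; consistent hashing or a lookup table are common alternatives if shards need to be added later.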


notification service:

  • when a file is edited by another user, send an email/push notification to the owner
  • updates can be sent to the notification service via a message queue, so notification delivery is asynchronous and does not block the online editing flow
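
The asynchronous hand-off described above can be sketched with an in-process queue. This is a toy stand-in: a real deployment would use a durable message queue, and `FileUpdatedEvent`, `notification_worker`, and the "sending" step (which only records a string) are hypothetical.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class FileUpdatedEvent:
    file_id: int
    owner_id: int
    editor_id: int


def notification_worker(events: queue.Queue, sent: list) -> None:
    """Consume events until a None sentinel arrives; 'sending' just records a message."""
    while True:
        event = events.get()
        if event is None:
            break
        if event.editor_id != event.owner_id:  # only notify on edits by other people
            sent.append(
                f"notify user {event.owner_id}: "
                f"file {event.file_id} edited by user {event.editor_id}"
            )


events = queue.Queue()
sent = []
worker = threading.Thread(target=notification_worker, args=(events, sent))
worker.start()

# The editing flow only enqueues an event and returns immediately.
events.put(FileUpdatedEvent(file_id=1, owner_id=10, editor_id=20))
events.put(FileUpdatedEvent(file_id=2, owner_id=10, editor_id=10))  # self-edit, skipped
events.put(None)  # sentinel that stops the worker
worker.join()
print(sent)
```

Because the producer never waits on delivery, a slow email or push provider delays notifications but does not slow down edits.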




Trade offs/Tech choices







Failure scenarios/bottlenecks

improve durability: replicate uploaded files across multiple availability zones (AZs)





Future improvements
