System requirements
Functional:
MVP features:
Users can paste text on the website
Our system will store the text and return a randomly generated URL to the user
Users can access the text content by using that URL
Users can choose to expire the paste after some period (default: never expires)
We will put a size limit of 1 MB per paste
Non-MVP features:
user registration
id generation
Non-Functional:
Durability - once uploaded, the text should be persisted until the expiration time
Availability - services should be highly available
Scalability - services should be able to handle high traffic and support long-term storage
Capacity estimation
1 Million DAU
1:10 write-to-read ratio
QPS:
write: 1_000_000 / 86_400 => ~1_000_000 / 100_000 => ~10 QPS (assuming one paste per user per day)
read: 100 QPS
Peak QPS * 5 => read 500 QPS, write 50 QPS
Storage:
Files:
1_000_000 * 1MB => 1 TB / day
365 TB / year => 365 * 5 => ~1825 TB for 5 years
assume 20% of files expire within one year
~1460 TB retained after 5 years
replica * 3 => ~4380 TB
DB:
1_000_000 * 1 KB => 1 GB / day
Bandwidth:
write: 10 MB / sec
read: 100 MB / sec
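The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming one paste per user per day, every paste at the full 1 MB limit, and decimal units (1 TB = 1_000_000 MB); the function name `estimate` and its parameters are hypothetical. It uses exact division, so the results come out somewhat higher than hand-rounded figures.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau=1_000_000, reads_per_write=10, peak_factor=5,
             paste_mb=1.0, metadata_kb=1.0, years=5,
             expired_fraction=0.20, replicas=3):
    """Back-of-the-envelope numbers for the paste service (all inputs are assumptions)."""
    writes_per_day = dau  # assumption: one paste per active user per day
    write_qps = writes_per_day / SECONDS_PER_DAY
    read_qps = write_qps * reads_per_write

    # Worst case: every paste is as large as the size limit allows.
    blob_tb_per_day = writes_per_day * paste_mb / 1_000_000  # MB -> TB
    blob_tb_total = blob_tb_per_day * 365 * years
    blob_tb_retained = blob_tb_total * (1 - expired_fraction)

    return {
        "write_qps": write_qps,
        "read_qps": read_qps,
        "peak_write_qps": write_qps * peak_factor,
        "peak_read_qps": read_qps * peak_factor,
        "blob_tb_per_day": blob_tb_per_day,
        "blob_tb_retained": blob_tb_retained,
        "blob_tb_replicated": blob_tb_retained * replicas,
        "metadata_gb_per_day": writes_per_day * metadata_kb / 1_000_000,  # KB -> GB
        "write_mb_per_sec": write_qps * paste_mb,
        "read_mb_per_sec": read_qps * paste_mb,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.1f}")
```

Because real pastes are usually far smaller than 1 MB, the storage and bandwidth figures are upper bounds rather than expected values.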
API design
https://pastebin.com/doc_api#2
create_paste(paste_data, expire_timestamp)
returns: a URL containing the paste_id
get_paste(paste_id)
returns text or the path to the storage that contains the text (depends on the design)
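To make the contract of the two calls concrete, here is a minimal in-memory sketch. It assumes the variant where get_paste returns the text itself, enforces the 1 MB limit from the requirements, and treats expire_timestamp as Unix seconds. The host `paste.example`, the module-level `_store` dict, and the `now` parameter (added for testability) are all hypothetical.

```python
import secrets
import time
from typing import Dict, Optional, Tuple

MAX_PASTE_BYTES = 1_000_000          # 1 MB limit from the requirements
BASE_URL = "https://paste.example/"  # hypothetical host

# paste_id -> (text, expire_timestamp or None); stands in for blob + metadata stores
_store: Dict[str, Tuple[str, Optional[float]]] = {}


def create_paste(paste_data: str, expire_timestamp: Optional[float] = None) -> str:
    """Store the text and return a URL that contains the new paste_id."""
    if len(paste_data.encode("utf-8")) > MAX_PASTE_BYTES:
        raise ValueError("paste exceeds the 1 MB limit")
    paste_id = secrets.token_urlsafe(6)  # 6 random bytes -> 8 URL-safe characters
    while paste_id in _store:            # regenerate on the (unlikely) collision
        paste_id = secrets.token_urlsafe(6)
    _store[paste_id] = (paste_data, expire_timestamp)
    return BASE_URL + paste_id


def get_paste(paste_id: str, now: Optional[float] = None) -> Optional[str]:
    """Return the text, or None if the paste is unknown or has expired."""
    entry = _store.get(paste_id)
    if entry is None:
        return None
    text, expire_timestamp = entry
    if now is None:
        now = time.time()
    if expire_timestamp is not None and now >= expire_timestamp:
        return None
    return text
```

A caller would extract the paste_id from the returned URL, for example `url[len(BASE_URL):]`, and pass it to get_paste.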
Database design
We only need one table, which stores metadata; let's call it the metadata table.
The metadata table has the following columns:
hash_key
blob_path
creation_date
expiration_date (null if the paste never expires)
hash_key is the primary key.
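One way to picture a row of this table is as a small record type. The sketch below is an illustration only: the class name `PasteMetadata`, the use of timezone-aware datetimes, and the ISO-8601 serialization in `to_item` are assumptions, not something the design prescribes.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PasteMetadata:
    hash_key: str                               # primary key
    blob_path: str                              # where the text lives in the blob store
    creation_date: datetime
    expiration_date: Optional[datetime] = None  # None means the paste never expires

    def is_expired(self, now: datetime) -> bool:
        return self.expiration_date is not None and now >= self.expiration_date

    def to_item(self) -> dict:
        """Flatten to plain strings, as a key-value store would typically hold them."""
        item = asdict(self)
        item["creation_date"] = self.creation_date.isoformat()
        item["expiration_date"] = (
            self.expiration_date.isoformat() if self.expiration_date else None
        )
        return item


if __name__ == "__main__":
    row = PasteMetadata("abc123", "pastes/abc123",
                        datetime(2024, 1, 1, tzinfo=timezone.utc))
    print(row.to_item())
```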
We don't require strong consistency here, so we can use a key-value NoSQL database such as Cassandra or DynamoDB, since these databases are designed to support high-volume traffic.
In the read path, we can have a read-through cache to handle reads for hot keys
High-level design
Upload path
Users paste text on the web, and click submit, which triggers the create_paste API
The request first goes through the API gateway / LB for rate limiting, analytics checks, and protocol translation
Then the request goes to the upload service. The upload service uploads the text to a blob store (like S3), gets a hash key for the file from the key generation service, writes the record to the metadata store, and then returns the URL containing the hash key to the user.
In this approach, we let users upload text to our server first, and then our server uploads the text to the blob store
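The server-side upload can be sketched as follows. This is a hedged illustration, not the design's actual key generation service: it assumes 8-character random base62 keys (62**8 is roughly 2.2e14 possibilities) with a retry on collision, and it uses plain dicts in place of the blob store and the metadata store. The sketch allocates the key before writing the blob so the key can name the blob path; the names `generate_key`, `allocate_key`, and `upload` are hypothetical.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # base62
KEY_LENGTH = 8  # assumption: short enough for URLs, large enough to make collisions rare


def generate_key(length: int = KEY_LENGTH) -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def allocate_key(exists, max_attempts: int = 5) -> str:
    """Draw random keys until one is unused; `exists` is a lookup callback."""
    for _ in range(max_attempts):
        key = generate_key()
        if not exists(key):
            return key
    raise RuntimeError("could not allocate a unique key")


def upload(text: str, blob_store: dict, metadata: dict) -> str:
    """Write the blob first, then the metadata row, and return the hash key."""
    key = allocate_key(lambda k: k in metadata)
    blob_path = f"pastes/{key}"
    blob_store[blob_path] = text                # stand-in for a blob store PUT
    metadata[key] = {"blob_path": blob_path}    # stand-in for a metadata store write
    return key
```

Writing the blob before the metadata row means a failure between the two steps leaves an orphaned blob rather than a metadata row that points at nothing.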
Another option, if we want to support large files in the future (say a 1 GB log file), is to ask S3 for a pre-signed URL first, return the pre-signed URL and the hash key to the client, and let the web client upload the text file directly to that S3 path. For that, we would probably need a status column in our metadata table, and the client would need to send a callback after the upload finishes.
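The status column and callback can be illustrated with a small state machine. This is a sketch under assumptions: the two states `PENDING` and `COMPLETE`, the function names, and the placeholder URL are all hypothetical, and no real pre-signed URL is generated here.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"    # row written, client has not confirmed the upload
    COMPLETE = "complete"  # client callback received


metadata = {}  # hash_key -> row; stand-in for the metadata table


def begin_upload(hash_key: str) -> str:
    """Record a pending row and hand back a URL for the client to upload to."""
    blob_path = f"pastes/{hash_key}"
    metadata[hash_key] = {"blob_path": blob_path, "status": Status.PENDING}
    # Placeholder only: a real service would ask the blob store to sign this URL.
    return f"https://blob.example/{blob_path}?signature=PLACEHOLDER"


def complete_upload(hash_key: str) -> None:
    """Callback from the client once the direct upload has finished."""
    metadata[hash_key]["status"] = Status.COMPLETE


def is_readable(hash_key: str) -> bool:
    """Only completed uploads should be served on the read path."""
    row = metadata.get(hash_key)
    return row is not None and row["status"] is Status.COMPLETE
```

Rows that stay pending for a long time indicate abandoned uploads and could be cleaned up by a background job.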
Read path
When a user opens a URL containing the hash key, the client calls the get_paste API on our read service. The read service looks up the metadata to check whether the hash key is valid. If it is, we fetch the text from the blob path stored alongside the hash key in the table, then return the content to the user.
Similarly, for large files in the future, we can ask S3 for a pre-signed URL and let the web client read directly from S3.
We can use a read-through cache here to absorb reads for popular hash keys.
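A read-through cache sits in front of the store and loads a value on a miss, so callers only ever talk to the cache. The sketch below is a minimal single-process version with LRU eviction; the class name, the `loader` callback, and the default capacity are assumptions, and a production setup would more likely use a shared cache service.

```python
from collections import OrderedDict
from typing import Callable, Optional


class ReadThroughCache:
    def __init__(self, loader: Callable[[str], Optional[str]], capacity: int = 1000):
        self._loader = loader          # called on a miss, e.g. metadata + blob fetch
        self._capacity = capacity
        self._entries: "OrderedDict[str, str]" = OrderedDict()
        self.misses = 0

    def get(self, hash_key: str) -> Optional[str]:
        if hash_key in self._entries:
            self._entries.move_to_end(hash_key)   # mark as most recently used
            return self._entries[hash_key]
        self.misses += 1
        value = self._loader(hash_key)
        if value is not None:                     # do not cache unknown keys
            self._entries[hash_key] = value
            if len(self._entries) > self._capacity:
                self._entries.popitem(last=False)  # evict least recently used
        return value


if __name__ == "__main__":
    backing = {"a": "first paste", "b": "second paste"}
    cache = ReadThroughCache(backing.get, capacity=1)
    cache.get("a")
    cache.get("a")
    print(cache.misses)
```

For pastes with an expiration date, cached entries would also need a TTL no longer than the time remaining, otherwise the cache could keep serving an expired paste.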
For pastes with an expiration date, we can have the blob store expire the file as well (in S3, for example, through lifecycle expiration rules).
For the record in the metadata table, we can either set a TTL when writing the entry or take a passive (lazy) approach: when a read comes in, we first check whether the entry has expired and delete it if so. On its own, the passive approach can leave many expired entries in the table, because entries that are never read again are never deleted.
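The passive approach, plus a periodic sweep to pick up entries that are never read again, might look like the sketch below. The dict-based `table`, the numeric timestamps, and both function names are assumptions for illustration; a database with native TTL support would make the sweep unnecessary.

```python
from typing import Optional


def _expired(row: dict, now: float) -> bool:
    expiration = row["expiration_date"]
    return expiration is not None and now >= expiration


def read_with_lazy_expiry(table: dict, hash_key: str, now: float) -> Optional[dict]:
    """Passive approach: check expiry on read and delete the row if it is stale."""
    row = table.get(hash_key)
    if row is None:
        return None
    if _expired(row, now):
        del table[hash_key]
        return None
    return row


def sweep_expired(table: dict, now: float) -> int:
    """Background cleanup for rows that nobody reads; returns how many were removed."""
    stale = [key for key, row in table.items() if _expired(row, now)]
    for key in stale:
        del table[key]
    return len(stale)
```

Running the sweep on a schedule bounds how long expired rows can linger, at the cost of a full scan in this simplified form.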
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?