Design Pastebin - System Design

System requirements

Functional:

MVP features:

Users can paste text on the website

Our system will store the text and return a randomly generated URL to the user

Users can access the text content by using that URL

Users can choose an expiration time for the paste (default: never expires)

We will put a size limit of 1 MB per paste

Non-MVP features:

user registration

id generation

Non-Functional:

Durability - once uploaded, the text should be persisted until the expiration time

Availability - services should be highly available

Scalability - services should be able to handle high traffic and support long-term storage

Capacity estimation

1 Million DAU

1:10 write-to-read ratio

QPS:

write: 1_000_000 / 86_400 => roughly 1_000_000 / 100_000 => ~10 QPS (assuming one paste per user per day)

read: 100 QPS

Peak QPS (5x average) => read 500 QPS, write 50 QPS
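
The QPS arithmetic can be reproduced with a short script. This is a minimal sketch that assumes each daily active user creates one paste per day (implied by the write estimate rather than stated); the constant names are illustrative.

```python
# Back-of-the-envelope QPS estimate.
# Assumption: each daily active user creates one paste per day.
DAU = 1_000_000
SECONDS_PER_DAY = 86_400
READS_PER_WRITE = 10   # 1:10 write-to-read ratio
PEAK_FACTOR = 5        # peak traffic assumed to be 5x the average

write_qps = DAU / SECONDS_PER_DAY        # about 11.6, rounded down to 10 in the notes
read_qps = write_qps * READS_PER_WRITE   # about 116, rounded down to 100 in the notes

print(f"average write QPS: {write_qps:.1f}")
print(f"average read QPS:  {read_qps:.1f}")
print(f"peak write QPS:    {write_qps * PEAK_FACTOR:.1f}")
print(f"peak read QPS:     {read_qps * PEAK_FACTOR:.1f}")
```

The exact values are slightly higher than the rounded ones, which is fine for sizing purposes.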

Storage:

Files:

1_000_000 * 1MB => 1 TB / day

365 TB / year => 365 * 5 => ~1825 TB for 5 years

Assume 20% of files expire within one year

~1460 TB for 5 years

replica * 3 => ~4400 TB

DB:

1_000_000 * 1 KB => 1 GB / day => ~365 GB / year => ~1.8 TB for 5 years
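
The storage estimate can be checked the same way. The sketch below computes the figures without intermediate rounding, so the results may differ somewhat from hand-rounded numbers. It assumes the worst case where every paste hits the 1 MB limit, decimal units, and that the 20% expiry applies to the 5-year total.

```python
# Storage estimate, computed without intermediate rounding.
# Assumptions: every paste is 1 MB (worst case), 1 TB = 1_000_000 MB,
# and 20% of all stored files have expired by the end of the period.
PASTES_PER_DAY = 1_000_000
PASTE_SIZE_MB = 1
METADATA_SIZE_KB = 1
YEARS = 5
EXPIRED_FRACTION = 0.20
REPLICAS = 3

files_tb_per_day = PASTES_PER_DAY * PASTE_SIZE_MB / 1_000_000
files_tb_total = files_tb_per_day * 365 * YEARS
files_tb_live = files_tb_total * (1 - EXPIRED_FRACTION)
files_tb_replicated = files_tb_live * REPLICAS

metadata_gb_per_day = PASTES_PER_DAY * METADATA_SIZE_KB / 1_000_000
metadata_tb_total = metadata_gb_per_day * 365 * YEARS / 1_000

print(f"files, 5 years:          {files_tb_total:.0f} TB")
print(f"files after expiry:      {files_tb_live:.0f} TB")
print(f"files with 3 replicas:   {files_tb_replicated:.0f} TB")
print(f"metadata, 5 years:       {metadata_tb_total:.3f} TB")
```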

Bandwidth:

write: 10 MB / sec

read: 100 MB / sec
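
A small sketch of the bandwidth arithmetic, assuming every request carries a full 1 MB paste and reusing the rounded QPS figures. The peak figure in Gbit/s is an extra derived value, not a number from the notes.

```python
# Bandwidth estimate from the rounded QPS figures.
# Assumption: every request carries a full 1 MB paste (worst case).
PASTE_SIZE_MB = 1
WRITE_QPS = 10
READ_QPS = 100
PEAK_FACTOR = 5  # same peak multiplier as in the QPS estimate

write_mb_per_s = WRITE_QPS * PASTE_SIZE_MB
read_mb_per_s = READ_QPS * PASTE_SIZE_MB
# 1 MB/s = 8 Mbit/s; divide by 1_000 to express the result in Gbit/s.
peak_read_gbit_per_s = read_mb_per_s * PEAK_FACTOR * 8 / 1_000

print(f"write: {write_mb_per_s} MB/s, read: {read_mb_per_s} MB/s")
print(f"peak read: {peak_read_gbit_per_s} Gbit/s")
```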

API design

https://pastebin.com/doc_api#2

create_paste(paste_data, expire_timestamp)

returns: a URL that contains the paste_id

get_paste(paste_id)

returns the text, or the path to the storage location that contains the text (depending on the design)
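
To make the two calls concrete, here is a toy in-memory version. It is only a sketch: the host name, the class name, and the choice to return None for missing or expired pastes are assumptions, and the data argument is spelled paste_data.

```python
import secrets
import time
from typing import Dict, Optional, Tuple

BASE_URL = "https://paste.example.com/"  # hypothetical host


class InMemoryPasteApi:
    """Toy implementation of the two API calls, backed by a dict."""

    def __init__(self) -> None:
        # paste_id -> (text, expire_timestamp or None)
        self._pastes: Dict[str, Tuple[str, Optional[float]]] = {}

    def create_paste(self, paste_data: str,
                     expire_timestamp: Optional[float] = None) -> str:
        paste_id = secrets.token_urlsafe(6)  # 6 random bytes -> 8 URL-safe chars
        self._pastes[paste_id] = (paste_data, expire_timestamp)
        return BASE_URL + paste_id

    def get_paste(self, paste_id: str) -> Optional[str]:
        entry = self._pastes.get(paste_id)
        if entry is None:
            return None
        text, expire_timestamp = entry
        if expire_timestamp is not None and expire_timestamp <= time.time():
            return None  # an expired paste behaves as if it does not exist
        return text
```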

Database design

We need only one table, which we will call the metadata table

The metadata table has the following columns:

hash_key

blob_path

creation_date

expiration_date (null if the paste never expires)

hash_key is the primary key as well as the partition key
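
As a sketch, one row of the metadata table could be modeled like this. The column names come from the list above; the class name, the types, and the is_expired helper are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class PasteMetadata:
    hash_key: str            # primary key and partition key
    blob_path: str           # location of the text in the blob store
    creation_date: datetime
    expiration_date: Optional[datetime] = None  # None: the paste never expires

    def is_expired(self, now: datetime) -> bool:
        """Return True once the expiration date has been reached."""
        return self.expiration_date is not None and self.expiration_date <= now
```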

We don't require strong consistency here, so we can use a key-value NoSQL database such as Cassandra or DynamoDB; these databases are designed to handle high-volume traffic.

User registration is not an MVP requirement, but if we do need to support it, we will need a user table and a user_paste table, and we will need to add a user_id column to the metadata table

In the read path, we can have a read-through cache to handle reads for hot keys

High-level design

Upload path

The user pastes text on the web page and clicks submit, which triggers the create_paste API

The request first goes through the API gateway / LB for rate limiting, analytics checks, and protocol translation
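
The notes do not say which rate-limiting algorithm the gateway uses. One common option is a token bucket; the sketch below illustrates it under that assumption, with hypothetical names and an injectable clock so the behavior is easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter: refills at a fixed rate up to a maximum burst size."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return False when the caller is over the limit."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add the tokens earned since the last call, capped at the bucket size.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_sec)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

In practice the gateway would keep one bucket per client (for example, per IP address or API key), which is again an assumption rather than something the notes specify.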

Then the request goes to the upload service. The upload service uploads the text to a blob store (such as S3), gets a hash key for the file from the key generation service, writes the record to the metadata store, and then returns the URL containing the hash key to the user.

In this approach, we let users upload text to our server first, and then our server uploads the text to the blob store
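
The server-side upload flow can be sketched as follows. Plain dicts stand in for the blob store and the metadata store, and a local function stands in for the key generation service; all names, the 8-character base62 key, and the host are assumptions. The sketch obtains the key first so that it can double as the blob path, which is a choice of this example.

```python
import secrets
import string
import time
from typing import Dict, Optional

ALPHABET = string.ascii_letters + string.digits  # base62 characters


def generate_key(length: int = 8) -> str:
    """Stand-in for the key generation service: a random base62 string."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


class UploadService:
    def __init__(self, blob_store: Dict[str, bytes],
                 metadata_store: Dict[str, dict]) -> None:
        self.blob_store = blob_store          # stand-in for S3
        self.metadata_store = metadata_store  # stand-in for the metadata table

    def create_paste(self, paste_data: str,
                     expire_timestamp: Optional[float] = None) -> str:
        # 1. Get a key, retrying in the unlikely case of a collision.
        hash_key = generate_key()
        while hash_key in self.metadata_store:
            hash_key = generate_key()
        # 2. Upload the text to the blob store.
        blob_path = f"pastes/{hash_key}"
        self.blob_store[blob_path] = paste_data.encode("utf-8")
        # 3. Record the metadata row.
        self.metadata_store[hash_key] = {
            "blob_path": blob_path,
            "creation_date": time.time(),
            "expiration_date": expire_timestamp,
        }
        # 4. Return the URL that contains the hash key.
        return f"https://paste.example.com/{hash_key}"
```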

Another option, if we want to support large files in the future (say, a 1 GB log file), is to ask S3 for a pre-signed URL first, return the pre-signed URL and the hash key to the client, and let the web client upload the file directly to the S3 path. For that, we would probably need a status column in the metadata table and would require the client to send a callback after the upload finishes.
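
A rough sketch of the status column and the callback in the direct-upload variant. The presign_put callable is a hypothetical stand-in for the blob store's pre-signed URL call, and the two status values are assumptions.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class UploadStatus(Enum):
    PENDING = "pending"      # URL handed out, upload not yet confirmed
    UPLOADED = "uploaded"    # client callback received


class DirectUploadService:
    def __init__(self, presign_put: Callable[[str], str]) -> None:
        # presign_put is hypothetical: it maps a blob path to an upload URL.
        self._presign_put = presign_put
        self.metadata: Dict[str, dict] = {}

    def start_upload(self, hash_key: str) -> Tuple[str, str]:
        """Create a pending metadata row and return (hash_key, upload URL)."""
        blob_path = f"pastes/{hash_key}"
        self.metadata[hash_key] = {"blob_path": blob_path,
                                   "status": UploadStatus.PENDING}
        return hash_key, self._presign_put(blob_path)

    def complete_upload(self, hash_key: str) -> None:
        """Callback sent by the client once the direct upload has finished."""
        self.metadata[hash_key]["status"] = UploadStatus.UPLOADED

    def is_readable(self, hash_key: str) -> bool:
        """Only pastes whose upload was confirmed should be served."""
        entry = self.metadata.get(hash_key)
        return entry is not None and entry["status"] is UploadStatus.UPLOADED
```

Rows that stay in the pending state for a long time would need cleanup, since the client may never send the callback.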

Read path

When a user opens a URL containing the hash key, the client sends a get_paste request to our read service. The read service looks up the metadata to check whether the hash key is valid. If it is, we fetch the text from the blob path stored with the hash key in the table and return the content to the user.
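
The read path can be sketched like this, again with dicts standing in for the metadata table and the blob store. Treating an expired paste as invalid and returning None are assumptions of the sketch.

```python
import time
from typing import Dict, Optional


def get_paste(hash_key: str,
              metadata_store: Dict[str, dict],
              blob_store: Dict[str, bytes],
              now: Optional[float] = None) -> Optional[str]:
    """Return the paste text, or None if the key is unknown or expired."""
    now = time.time() if now is None else now
    entry = metadata_store.get(hash_key)
    if entry is None:
        return None  # unknown hash key
    expiration = entry["expiration_date"]
    if expiration is not None and expiration <= now:
        return None  # expired paste
    # Fetch the text from the path stored next to the hash key.
    return blob_store[entry["blob_path"]].decode("utf-8")
```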

Similarly, for large files in the future, we can ask S3 for a pre-signed URL and let the web client read directly from S3.

We can use a read-through cache here to serve popular hash keys
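
A minimal read-through cache might look like the sketch below. The notes do not name an eviction policy, so least-recently-used (LRU) eviction and the class and parameter names are assumptions.

```python
from collections import OrderedDict
from typing import Callable, Optional


class ReadThroughCache:
    """On a miss, load the value through `loader` and remember it (LRU eviction)."""

    def __init__(self, loader: Callable[[str], Optional[str]],
                 capacity: int = 1024) -> None:
        self._loader = loader        # e.g. the read service's lookup function
        self._capacity = capacity
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, hash_key: str) -> Optional[str]:
        if hash_key in self._entries:
            self._entries.move_to_end(hash_key)  # mark as most recently used
            return self._entries[hash_key]
        value = self._loader(hash_key)           # cache miss: read through
        if value is not None:
            self._entries[hash_key] = value
            if len(self._entries) > self._capacity:
                self._entries.popitem(last=False)  # evict least recently used
        return value
```

Cached entries for pastes with an expiration date would also need a cache TTL no longer than the time remaining, which this sketch leaves out.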

For popular pastes, we can cache the content in a CDN so that reads are served from a CDN node closer to the user instead of from S3

For pastes with an expiration date, we can set an expiration on the object when storing it in S3 (for example, with a lifecycle rule).

For the record in the metadata table, we can either set a TTL when writing the entry or take a passive (lazy) approach: when a read comes in, we first check whether the entry has expired and, if so, delete it. The passive approach could leave many expired entries in the table, because entries that are never read again are never deleted.
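
The passive approach can be sketched as a check on every read. The second function, a periodic sweep, is one possible way to bound the number of leftover expired entries; it is an assumption of this sketch and not something the notes propose. A dict stands in for the metadata table.

```python
from typing import Dict, Optional


def _is_expired(entry: dict, now: float) -> bool:
    expiration = entry["expiration_date"]
    return expiration is not None and expiration <= now


def read_with_lazy_delete(hash_key: str,
                          metadata_store: Dict[str, dict],
                          now: float) -> Optional[dict]:
    """Return the entry, deleting it first if it has already expired."""
    entry = metadata_store.get(hash_key)
    if entry is None:
        return None
    if _is_expired(entry, now):
        del metadata_store[hash_key]   # passive cleanup on read
        return None
    return entry


def sweep_expired(metadata_store: Dict[str, dict], now: float) -> int:
    """Hypothetical background job: delete all expired entries, return the count."""
    expired = [k for k, e in metadata_store.items() if _is_expired(e, now)]
    for k in expired:
        del metadata_store[k]
    return len(expired)
```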

The upload service validates the expiration time, along with the other inputs, before storing the record in the metadata table.
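
A sketch of what that validation could look like. The 1 MB limit comes from the requirements; interpreting it as 1_000_000 bytes of UTF-8, rejecting empty pastes, and the exception name are assumptions.

```python
from typing import Optional

# Assumption: "1 MB" is taken as 1_000_000 bytes of UTF-8 encoded text.
MAX_PASTE_BYTES = 1_000_000


class ValidationError(ValueError):
    """Raised when a create_paste request is rejected."""


def validate_paste(paste_data: str,
                   expire_timestamp: Optional[float],
                   now: float) -> None:
    """Raise ValidationError if the request should not be stored."""
    if not paste_data:
        raise ValidationError("paste is empty")
    if len(paste_data.encode("utf-8")) > MAX_PASTE_BYTES:
        raise ValidationError("paste exceeds the 1 MB limit")
    if expire_timestamp is not None and expire_timestamp <= now:
        raise ValidationError("expiration must be in the future")
```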
