Design Pastebin - System Design

System requirements

Functional:

Upload a text file and receive a link
Store the text file at the link for 30 days

Non-Functional:

Scale to millions of users / texts per day
Consistency across horizontally scaled nodes
Durability / Fail over preventing data loss
Links are unique

Capacity estimation

**assuming 1 million DAU, 1:1 W/R ration, Peak QPS = 2 * QPS **

2 million pastes / day = 2*10^9 / 10^5 = 2*10^4

= 20,000 QPS (writes)

2 mill * 30 days = 60 million / month

**assuming 1KB file limit**

60 * 10^9 * 1KB = 60 * 10^12 = 60TB storage requirement / month

**assuming average of 5 paste reads / day**

5 million reads / day = 5 * 10^9 / 10^5 = 50,000 QPS

API design

POST /api/v1/uploadText

content-type: application/json

args: body

ret: UUID

GET /api/v1/getText?uuid=UUID

content-type: plain/text

args: UUID

ret: uploaded text (browser will render into the page)

Database design

Database is a key value store where the key is unique to the paste and the value is the text stored at that link.

60TB storage requirement isn't out of the realm of possibility for a single machine, so single leader replication could be a good fit. 20,000QPS = 1KB * 2 * 10^4 = 20MB is almost too high for one write node but is theoretically possible on high-end servers. If we needed to scale we could use a weakly consistent sharding scheme, partitioned on user_id or even grographical location if user_id doesn't exist. we can use read replicas or hash indexes to speed up reads. LSM or B-tree based indexes would be a waste as we don't need to do range queries.

That being said let's choose a document based data base, the data won't benefit from relations. Also noSQL are easier to set up sharding when needed. Let's pick cassandra for it's feature rich offerings including a consistent hashing type algorithm to distribute load evenly among commodity hardware (through the use of virtual nodes and other methods) this provides support for heterogeneity which reduces costs and mitigates the hot shard / celebrity problem.

High-level design

Write:

We need to allow the customer to upload their content. Besides uploading the file using HTTP, we also need a unique identifier to serve as the key. There are multiple ways of going about this: we can use a hash function + linear probing or buckets for collisions, unsuring uniqueness of the keys. We can also use the snowflake UUID method, where we combine information about the request to form a composite key of sorts which together uniquely identify any paste. This often includes an offset to a predetermined start time (since we are only storing for 30 days this can be T-30 days), as well as either the user_id or perhaps a checksum / hash of the content. If we use the hash of the content we can stop users from duplicating texts which would be good.

Read:

Mostly just checks if the content is caches, returns it if so, checks if the UUID is present in the DB, returns the value or an error.

Request flows

Write:

User's browser makes a POST request to our upload endpoint with the body of the request containing the text.
The load balancer will direct the request to a stateless web server, which can be easily scaled and recovered from failures since there's no state. It can also serve requests for any shard since we don't need any information about the customer from our database yet.
Proxy asks for a UUID, which will fail if the UUID already exists due to the customer duplicating a paste, and the upload service will return a 4xx to the customer, or perhaps a redirect to the page that already exists, which would be more graceful and is the better choice.
The proxy / web server that receives the request will determine which shard the user / their geographical area should be on. One downside to using geographical area is that there's a chance we need cross-shard reads if someone moves outside their area which would increase latency.
The proxy makes a call to the appropriate DB node and that node ACKs back to the upload service when the node was successfully written (we should retry a few times if not or use a queue that doesn't pop a value until it is read or exhausts the retry mechanism). When the status of the write is determined the proxy could respond with the text, but it may save on network to just return the UUID, and have the customer redirected to the "read page" which just makes a get request with the UUID to return the text which it renders. This way we only send the text over the network if the client actually wants to access it after writing, beyond just a successful confirmation.

Read:

Client GETs paste.com/UUID
Load balancer directs requests to the appropriate read service node, which now that I think of it can be combined with the upload service web tier. The combined stateless web tier can determine what calls to make for write / read just fine as a single horizontally scaled unit.
The web tier will determine this is a read request, and will check the cache then the database if the cache misses.
If the UUID key is found in the databse (not expired), we return the body of text associated with it, which the browswer will render into a page for the customer to view.

Detailed component design

Above

Trade offs/Tech choices

Above

Failure scenarios/bottlenecks

Above

Future improvements

Above