Design Pastebin - System Design

System Requirements

Functional -

Users should be able to paste their text and get a paste link in return
Users should be able to set the expiration time for the pasted content, including indefinite time
We are not storing user data since we want them to stay anonymous

Non-Functional -

Availability - We want our service to be available at all times
Consistency - We are okay with the user not seeing their pasted text on the provided URL instantly. We are aiming for eventual consistency
Fault Tolerant - System should be fault tolerant
Latency - Latency should be fairly low
Durability - We can never lose any text given by the user

Capacity Estimation

We assume that each paste is 1 MB in size
We assume 1 million pastes are generated every day, so we generate 10^6 * 10^6 = 10^12 bytes = 1 TB of data every day
Over 10 years, we would need 1 TB * 365 * 10 = 3,650 TB of storage, roughly 4 PB
Factoring in replication and caching (an overhead factor of about 5), we would need roughly 20 PB of storage over 10 years
We serve 5 million requests per day, read and write combined
Queries per second, QPS = 5 * 10^6 / 10^5 = 50 queries served per second (rounding the 86,400 seconds in a day up to 10^5)
Peak queries per second would be 5 * QPS = 250
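
The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the overhead factor of 5 is inferred from the 4 PB to 20 PB step and the constant names are illustrative. It uses the exact 86,400 seconds per day, so the average comes out slightly above the rounded figure of 50.

```python
# Back-of-the-envelope capacity numbers, using the assumptions stated above.
PASTE_SIZE_BYTES = 10**6          # 1 MB per paste
PASTES_PER_DAY = 10**6            # 1 million pastes per day
REQUESTS_PER_DAY = 5 * 10**6      # reads and writes combined
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 (the estimate rounds this to 10^5)
OVERHEAD_FACTOR = 5               # assumed multiplier for replication and caching
PEAK_FACTOR = 5                   # peak traffic relative to the average

daily_bytes = PASTE_SIZE_BYTES * PASTES_PER_DAY
ten_year_tb = daily_bytes * 365 * 10 / 10**12
total_pb = ten_year_tb * OVERHEAD_FACTOR / 1000
avg_qps = REQUESTS_PER_DAY / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_FACTOR

print(f"daily storage:   {daily_bytes / 10**12:.0f} TB")
print(f"10-year storage: {ten_year_tb:.0f} TB raw, {total_pb:.2f} PB with overhead")
print(f"average QPS:     {avg_qps:.1f}, peak QPS: {peak_qps:.1f}")
```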

API Design

We would primarily need two APIs.

Create Paste Link
1. POST
2. /paste/create
3. Takes its arguments as a JSON body containing the content of the paste, the expiration date, the title of the content, etc.
4. Returns a 4XX status code in case of an error, 2XX otherwise
Get Pasted Content From Link
1. GET
2. /paste/fetch/:link
3. Takes in the link as the argument and returns the pasted content for the link
4. Returns 2XX in case of success, 4XX in case of errors
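
To make the two endpoints concrete, here is a minimal in-memory sketch of their handlers. The field name expires_in_seconds, the specific status codes (201, 200, 400, 404), and the random link generation are assumptions for illustration, not part of the design above; a real service would sit behind an HTTP framework and use the storage described later.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Paste:
    content: str
    title: str
    expires_at: Optional[float]  # None means the paste never expires


_PASTES: Dict[str, Paste] = {}  # in-memory stand-in for the real storage


def create_paste(body: dict, now: Optional[float] = None) -> Tuple[int, dict]:
    """Handle POST /paste/create. Returns (status_code, response_body)."""
    now = time.time() if now is None else now
    content = body.get("content")
    if not isinstance(content, str) or not content:
        return 400, {"error": "content is required"}
    ttl = body.get("expires_in_seconds")  # hypothetical field; None = indefinite
    if ttl is not None and (not isinstance(ttl, (int, float)) or ttl <= 0):
        return 400, {"error": "expires_in_seconds must be positive"}
    link = secrets.token_urlsafe(6)  # placeholder link generation
    expires_at = None if ttl is None else now + ttl
    _PASTES[link] = Paste(content, body.get("title", ""), expires_at)
    return 201, {"link": link}


def fetch_paste(link: str, now: Optional[float] = None) -> Tuple[int, dict]:
    """Handle GET /paste/fetch/:link. Returns (status_code, response_body)."""
    now = time.time() if now is None else now
    paste = _PASTES.get(link)
    if paste is None or (paste.expires_at is not None and paste.expires_at <= now):
        return 404, {"error": "paste not found or expired"}
    return 200, {"title": paste.title, "content": paste.content}
```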

Database Design

We are not storing user data since we assume they want to remain anonymous
We can use a relational database to auto generate an ID that will serve as our pasteID
1. paste
  1. pasteID Primary Key
  2. pasteLink Secondary Key
  3. createdAt
  4. expiryDate Index
  5. objectID Index
We use an object store to store our content
In our relational database along with the pasteID we store the object ID that our object store returns to map the paste to the pasted content
To quickly search through our expired links, we add an index to the expiryDate field
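
The paste table and its indexes could look roughly like the following. This is a sketch that uses Python's built-in sqlite3 module as a runnable stand-in for the relational database; the column types, the index names, and the helper functions insert_paste and expired_paste_ids are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE paste (
    pasteID    INTEGER PRIMARY KEY AUTOINCREMENT,
    pasteLink  TEXT NOT NULL UNIQUE,
    createdAt  INTEGER NOT NULL,   -- Unix seconds
    expiryDate INTEGER,            -- NULL means the paste never expires
    objectID   TEXT NOT NULL       -- key returned by the object store
);
CREATE INDEX idx_paste_expiry ON paste (expiryDate);
CREATE INDEX idx_paste_object ON paste (objectID);
""")


def insert_paste(link, created_at, expiry, object_id):
    """Insert one row and return the auto-generated pasteID."""
    cur = conn.execute(
        "INSERT INTO paste (pasteLink, createdAt, expiryDate, objectID) "
        "VALUES (?, ?, ?, ?)",
        (link, created_at, expiry, object_id),
    )
    return cur.lastrowid


def expired_paste_ids(now):
    """Return IDs of pastes whose expiry has passed, oldest expiry first."""
    rows = conn.execute(
        "SELECT pasteID FROM paste "
        "WHERE expiryDate IS NOT NULL AND expiryDate <= ? "
        "ORDER BY expiryDate",
        (now,),
    )
    return [row[0] for row in rows]
```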
Alternatively, we can use a Redis sorted set that stores each paste ID scored by its expiration date. We keep checking the first (earliest-expiring) element of the set; if it has expired, we delete the paste from our paste table and mark the object as deleted in our object store
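
The sorted-set approach can be sketched without a Redis server by using a min-heap ordered by expiration time, which gives the same "check the earliest element first" behaviour. The class and method names below are hypothetical; with Redis itself, the equivalent operations would be ZADD to insert and a range query by score followed by ZREM to remove expired members.

```python
import heapq
from typing import List, Tuple


class ExpiryQueue:
    """In-memory stand-in for a sorted set of (expiration time, paste ID)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int]] = []

    def add(self, paste_id: int, expires_at: float) -> None:
        heapq.heappush(self._heap, (expires_at, paste_id))

    def pop_expired(self, now: float) -> List[int]:
        """Remove and return every paste ID whose expiration time has passed."""
        expired = []
        # The heap root is always the earliest expiration, so we can stop
        # as soon as the root is still in the future.
        while self._heap and self._heap[0][0] <= now:
            _, paste_id = heapq.heappop(self._heap)
            expired.append(paste_id)
        return expired
```

A cleanup worker would call pop_expired periodically, delete the returned IDs from the paste table, and mark the matching objects as deleted in the object store.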

High-Level design

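To prevent abuse of our system, we impose rate limiting per IP address using a bucketing strategy (for example, a token bucket). This limits how much load any single client can generate and helps mitigate, though it cannot fully prevent, distributed attacks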
We allow a high number of reads but a lower number of writes. Also, we let the users know that they are being rate limited along with the time when they can make their next request
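
One common bucketing strategy is a token bucket per IP address. The sketch below is a single-process illustration with hypothetical names and limits; it returns both the decision and the number of seconds until the next request would be accepted, which is what we would report back to a rate-limited user.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Bucket:
    tokens: float
    updated_at: float


class TokenBucketLimiter:
    def __init__(self, capacity: int, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets: Dict[str, Bucket] = {}

    def allow(self, ip: str, now: float) -> Tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for a request from this IP."""
        bucket = self._buckets.setdefault(ip, Bucket(float(self.capacity), now))
        # Refill tokens for the time elapsed since the last request, up to capacity.
        elapsed = now - bucket.updated_at
        bucket.tokens = min(
            self.capacity, bucket.tokens + elapsed * self.refill_per_second
        )
        bucket.updated_at = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True, 0.0
        return False, (1.0 - bucket.tokens) / self.refill_per_second


# Assumed limits: generous for reads, stricter for writes.
READ_LIMITER = TokenBucketLimiter(capacity=100, refill_per_second=10.0)
WRITE_LIMITER = TokenBucketLimiter(capacity=5, refill_per_second=0.5)
```

In a multi-server deployment the bucket state would need to live in a shared store rather than in process memory.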
All read and write requests go through a load balancing layer, where the load balancer assigns each request to a server using one of several available schemes, such as round robin or lowest average response time
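
The two schemes mentioned above can be sketched as follows. The class names are hypothetical, and the smoothing factor used for the running average response time is an assumption; production load balancers typically also account for health checks and in-flight connections.

```python
import itertools
from typing import Dict, List


class RoundRobinBalancer:
    """Hands out servers in a fixed rotating order."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._cycle)


class LeastResponseTimeBalancer:
    """Picks the server with the lowest running average response time."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._avg_ms: Dict[str, float] = {server: 0.0 for server in servers}

    def record(self, server: str, response_ms: float, alpha: float = 0.2) -> None:
        # Exponential moving average so recent responses weigh more.
        self._avg_ms[server] = (1 - alpha) * self._avg_ms[server] + alpha * response_ms

    def pick(self) -> str:
        return min(self._avg_ms, key=self._avg_ms.get)
```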
Our database uses single-leader replication: the leader handles all writes, and its followers act as read replicas
We use S3 as our object store; it handles scaling and replication on its own for us
To shard our database, we can use consistent hashing on the pasteID to decide which shard stores each paste. The hash only selects the shard; it does not generate the pasteID
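
A consistent hash ring that maps a pasteID to a shard might look like the sketch below. The shard names, the number of virtual nodes, and the choice of SHA-256 as the hash function are assumptions for illustration.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, shards: List[str], vnodes: int = 100) -> None:
        if not shards:
            raise ValueError("at least one shard is required")
        # Each shard is placed on the ring many times (virtual nodes)
        # so that keys spread more evenly across shards.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, paste_id: int) -> str:
        """Return the shard owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(str(paste_id)))
        return self._ring[idx % len(self._ring)][1]
```

The benefit over a plain modulo is that adding a shard only moves the keys that now belong to the new shard; keys never move between the existing shards.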
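To avoid collisions between paste links, we can derive each link by Base64-encoding the auto-generated pasteID with a URL-safe alphabet. Because the encoding is reversible, distinct IDs always yield distinct links (provided the IDs themselves are unique across shards), and because the IDs increase over time, so do the encoded values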
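
A numeric pasteID can be turned into a short link by writing it in base 64 with a URL-safe alphabet. The sketch below uses hypothetical function names; the alphabet is the standard URL-safe Base64 character set, applied to the integer value rather than to raw bytes.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"


def encode_id(paste_id: int) -> str:
    """Encode a non-negative integer ID as a short URL-safe string."""
    if paste_id < 0:
        raise ValueError("paste_id must be non-negative")
    if paste_id == 0:
        return ALPHABET[0]
    digits = []
    while paste_id:
        paste_id, remainder = divmod(paste_id, 64)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_link(link: str) -> int:
    """Inverse of encode_id; raises ValueError on characters outside ALPHABET."""
    value = 0
    for ch in link:
        value = value * 64 + ALPHABET.index(ch)
    return value
```

Since every ID decodes back to itself, two different IDs can never produce the same link. One trade-off worth noting is that sequential IDs make links guessable.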
We also add a caching layer and assume that 20% of our reads will be handled by the cache. Inside the cache, we use the Least Recently Used (LRU) eviction policy to evict pastes. For our cache, we can use Redis
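
The LRU policy can be illustrated with a small in-process cache. This is only a sketch of the eviction behaviour with a hypothetical class name; Redis implements an approximate LRU that is enabled through its maxmemory-policy setting rather than through application code.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, link: str) -> Optional[str]:
        if link not in self._items:
            return None
        # Mark the entry as most recently used.
        self._items.move_to_end(link)
        return self._items[link]

    def put(self, link: str, content: str) -> None:
        self._items[link] = content
        self._items.move_to_end(link)
        if len(self._items) > self.capacity:
            # Evict the least recently used entry (the oldest in the order).
            self._items.popitem(last=False)
```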
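Since we are using MySQL, its InnoDB storage engine uses row-level shared and exclusive locks to address write conflicts: a transaction can take an exclusive lock to write a row only when no other transaction holds a lock on that row. Plain reads are served from a consistent snapshot by default and do not block writers
A transaction that attempts a locking read on a row while it is being written must wait, because the writer holds an exclusive lock on that row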
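To avoid losing our pastes if a failure occurs while our database is being written to, we can use a Write Ahead Log (WAL): every change is appended to a durable log before it is applied, so the database can be rebuilt by replaying the log after a failure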
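
The write-ahead idea can be shown with a toy key-value store that appends each change to a log file before applying it in memory and replays the log on startup. The class name and the JSON-lines log format are assumptions for illustration; a real database's log also handles checkpoints, truncation, and partially written records.

```python
import json
import os
from typing import Dict


class WalStore:
    def __init__(self, log_path: str) -> None:
        self.log_path = log_path
        self.data: Dict[str, str] = {}
        self._replay()

    def _replay(self) -> None:
        """Rebuild the in-memory state from the log, oldest record first."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key: str, value: str) -> None:
        record = json.dumps({"key": key, "value": value})
        # Write and flush the log record to disk BEFORE applying the change.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value
```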
All components are horizontally scaled

Request Flows

Explained in the high level design above

Detailed Component Design

Explained in the high level design above

Trade Offs/Tech Choices

S3 is ideal for putting in the pasted content. It can scale horizontally and handle replication on its own

Failure Scenarios/Bottlenecks

Our data store runs out of space
Our database goes down
Too many requests can overload our service

Future Improvements

We can use a Write Ahead Log (WAL) to protect from data store failures
Schedule a job to run routinely to get rid of expired paste links
Support not just pasting text, but also media like photos, files and videos
