System Requirements


Functional -

  1. Users should be able to paste their text and get a paste link in return
  2. Users should be able to set the expiration time for the pasted content, including indefinite time
  3. We are not storing user data since we want them to stay anonymous


Non-Functional -

  1. Availability - We want our service to be available at all times
  2. Consistency - We are okay with the user not seeing their pasted text on the provided URL instantly. We are aiming for eventual consistency
  3. Fault Tolerance - The system should keep serving requests when individual components fail
  4. Latency - Latency should be fairly low
  5. Durability - We can never lose any text given by the user


Capacity Estimation

  1. We assume that each paste is 1 MB in size
  2. We assume 1 million pastes are generated every day, so we generate 10^6 * 10^6 = 10^12 bytes = 1 TB of data every day
  3. Over 10 years, we would need 1 TB * 365 * 10 = 3,650 TB, or roughly 4 PB of storage
  4. Factoring in replication and caching, we would need roughly 20 PB of storage over 10 years
  5. We serve 5 million requests per day, read and write combined
  6. Queries per second, QPS = 5 * 10^6 / 10^5 = 50 queries served per second (rounding the 86,400 seconds in a day up to 10^5)
  7. Peak queries per second would be 5 * QPS = 250
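
The arithmetic above can be checked with a short script. This is a minimal sketch that only restates the assumptions from the list; the constant and function names are made up for illustration.

```python
# Back-of-the-envelope capacity estimate using the assumptions listed above.
PASTE_SIZE_BYTES = 10**6        # 1 MB per paste
PASTES_PER_DAY = 10**6          # 1 million pastes per day
REQUESTS_PER_DAY = 5 * 10**6    # reads and writes combined
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400; the text rounds this to 10^5
PEAK_FACTOR = 5                 # peak traffic is assumed to be 5x the average


def estimate() -> dict:
    """Return daily storage, 10-year storage, and average/peak QPS."""
    daily_bytes = PASTE_SIZE_BYTES * PASTES_PER_DAY
    ten_year_bytes = daily_bytes * 365 * 10
    avg_qps = REQUESTS_PER_DAY / SECONDS_PER_DAY
    return {
        "daily_tb": daily_bytes / 10**12,
        "ten_year_pb": ten_year_bytes / 10**15,
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * PEAK_FACTOR,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:.2f}")
```

Using the exact 86,400 seconds per day gives slightly higher figures (about 58 QPS on average and about 290 at peak) than the rounded 50 and 250, and the 10-year raw storage comes out at 3.65 PB, which the list rounds to 4 PB.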


API Design


We would primarily need two APIs.

  1. Create Paste Link
    1. POST
    2. /paste/create
    3. Takes its arguments as a JSON body, such as the content of the paste, the expiration date, and the title of the content
    4. Returns a 4XX status code in case of an error, 2XX otherwise
  2. Get Pasted Content From Link
    1. GET
    2. /paste/fetch/:link
    3. Takes in the link as the argument and returns the pasted content for the link
    4. Returns 2XX in case of success, 4XX in case of errors
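
As a rough illustration of the two endpoints, the handlers below return a status code and a JSON-like body. This is a sketch under stated assumptions, not a finished service: the in-memory dictionary stands in for the real storage, and the field names (content, title, expiresInSeconds), the specific status codes, and the random link generation are all hypothetical choices that the design above does not fix.

```python
import secrets
import time

# Hypothetical in-memory stand-in for the database and object store.
_PASTES: dict[str, dict] = {}


def create_paste(body: dict) -> tuple[int, dict]:
    """Handle POST /paste/create. Returns (status_code, response_body)."""
    content = body.get("content")
    if not isinstance(content, str) or not content:
        return 400, {"error": "content is required"}

    ttl = body.get("expiresInSeconds")  # None means the paste never expires
    if ttl is not None:
        if isinstance(ttl, bool) or not isinstance(ttl, int) or ttl <= 0:
            return 400, {"error": "expiresInSeconds must be a positive integer"}

    # Random link for illustration only; a real service needs a collision-free scheme.
    link = secrets.token_urlsafe(6)
    _PASTES[link] = {
        "content": content,
        "title": body.get("title", ""),
        "expiresAt": None if ttl is None else time.time() + ttl,
    }
    return 201, {"link": link}


def fetch_paste(link: str) -> tuple[int, dict]:
    """Handle GET /paste/fetch/:link. Returns (status_code, response_body)."""
    paste = _PASTES.get(link)
    if paste is None:
        return 404, {"error": "paste not found"}
    if paste["expiresAt"] is not None and paste["expiresAt"] <= time.time():
        return 410, {"error": "paste has expired"}
    return 200, {"content": paste["content"], "title": paste["title"]}
```

For example, `create_paste({"content": "hello"})` returns a 2XX status with a link, and passing that link to `fetch_paste` returns the stored content; a request without content gets a 4XX status.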


Database Design


  1. We are not storing user data since we assume they want to remain anonymous
  2. We can use a relational database to auto generate an ID that will serve as our pasteID
    1. paste
      1. pasteID Primary Key
      2. pasteLink Secondary Key
      3. createdAt
      4. expiryDate Index
      5. objectID Index
  3. We use an object store to store our content
  4. In our relational database, along with the pasteID, we store the object ID that our object store returns, which maps the paste to the pasted content
  5. To quickly search for expired links, we add an index on the expiryDate field
  6. Alternatively, we can use a Redis sorted set that stores the expiration date of each link along with its paste ID. We keep checking the first element of the set; if that link has expired, we delete the paste from our paste table and mark the object as deleted in our object store
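
The paste table and the expiry lookup can be sketched as follows. To keep the example self-contained it uses Python's built-in sqlite3 module as a stand-in for the relational database, so the column types are assumptions; the helper names (open_db, insert_paste, delete_expired) are hypothetical. A NULL expiryDate is used here to mean that the paste never expires.

```python
import sqlite3

# Columns follow the paste table described above; types are assumed.
SCHEMA = """
CREATE TABLE paste (
    pasteID    INTEGER PRIMARY KEY AUTOINCREMENT,
    pasteLink  TEXT UNIQUE,
    createdAt  INTEGER NOT NULL,
    expiryDate INTEGER,
    objectID   TEXT NOT NULL
);
CREATE INDEX idx_paste_expiry ON paste (expiryDate);
CREATE INDEX idx_paste_object ON paste (objectID);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the paste table and its indexes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def insert_paste(conn, link, object_id, created_at, expiry_date=None) -> int:
    """Insert one paste row and return the auto-generated pasteID."""
    cur = conn.execute(
        "INSERT INTO paste (pasteLink, createdAt, expiryDate, objectID) "
        "VALUES (?, ?, ?, ?)",
        (link, created_at, expiry_date, object_id),
    )
    conn.commit()
    return cur.lastrowid


def delete_expired(conn, now) -> list:
    """Delete expired rows and return the object IDs to mark as deleted."""
    rows = conn.execute(
        "SELECT pasteID, objectID FROM paste "
        "WHERE expiryDate IS NOT NULL AND expiryDate <= ? ORDER BY pasteID",
        (now,),
    ).fetchall()
    conn.executemany("DELETE FROM paste WHERE pasteID = ?", [(r[0],) for r in rows])
    conn.commit()
    return [r[1] for r in rows]
```

The range condition on expiryDate is what the index is for: the cleanup query can find expired rows without scanning the whole table.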



High-Level Design


  1. To prevent abuse of our system, we impose rate limiting based on IP address using a bucketing strategy. Limiting each address separately helps mitigate distributed attacks
  2. We allow a high number of reads but a lower number of writes. We also let users know when they are being rate limited, along with the time at which they can make their next request
  3. All read and write requests go through a load balancing layer where the load balancer assigns servers based on one of multiple available schemes, like round robin, average response time etc.
  4. Our database uses single leader replication and all of its followers act as replicas
  5. S3 handles scaling and replication on its own for us
  6. To shard our database, we can use consistent hashing on the pasteID to decide which shard stores each paste
  7. To avoid collisions between links, we can Base64-encode the auto-generated pasteID, which is unique and increases with time. The encoding alone does not prevent collisions; uniqueness comes from the underlying ID
  8. We also add a caching layer and assume that 20% of our reads will be handled by the cache. Inside the cache, we use the Least Recently Used (LRU) eviction policy to evict pastes. For our cache, we can use Redis
  9. Since we are using MySQL, write conflicts are handled by two-phase locking with shared and exclusive row locks: a transaction can write to a row only when no other transaction holds a lock on that row
  10. A transaction that attempts a locking read on a row while it is being written must wait, because the writer holds an exclusive lock on that row
  11. To avoid losing pastes if the database fails in the middle of a write, we can use a Write Ahead Log (WAL) that can rebuild our database after a failure
  12. All components are horizontally scaled
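
The rate limiting in items 1 and 2 could be sketched as a token bucket per IP address. This assumes that the "bucketing strategy" means a token bucket, which the design does not state explicitly; the class names and the limits at the bottom are hypothetical. The caller passes the current time so the behavior is easy to reason about.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float      # tokens currently available
    updated_at: float  # time of the last refill, in seconds


class RateLimiter:
    """Token bucket per IP: each request spends one token, tokens refill over time."""

    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._buckets = {}  # ip -> Bucket

    def check(self, ip: str, now: float) -> tuple:
        """Return (allowed, retry_after_seconds) for a request from this IP."""
        bucket = self._buckets.setdefault(ip, Bucket(float(self.capacity), now))
        elapsed = max(0.0, now - bucket.updated_at)
        bucket.tokens = min(
            float(self.capacity), bucket.tokens + elapsed * self.refill_per_second
        )
        bucket.updated_at = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True, 0.0
        # Tell the client how long to wait until one full token is available.
        return False, (1.0 - bucket.tokens) / self.refill_per_second


# Hypothetical limits: reads get a larger bucket and faster refill than writes.
read_limiter = RateLimiter(capacity=100, refill_per_second=10.0)
write_limiter = RateLimiter(capacity=10, refill_per_second=1.0)
```

The second element of the return value is what the service would report back to a limited user as the time until their next allowed request.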


Request Flows


Explained in the high level design above
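
As a rough sketch of the flows, a write stores the content in the object store and records the object ID in the database, while a read checks the cache first and falls back to the database and object store on a miss. Everything here is an in-memory stand-in: the dictionaries replace the database and object store, the small LRU class replaces Redis, and all names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny Least Recently Used cache, standing in for the Redis caching layer."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


class PasteService:
    def __init__(self, cache_capacity: int = 2):
        self.cache = LRUCache(cache_capacity)
        self.db = {}            # pasteLink -> objectID (relational database stand-in)
        self.object_store = {}  # objectID -> content (object store stand-in)
        self.store_reads = 0    # counts reads that reached the object store

    def write(self, link: str, content: str) -> None:
        """Write flow: store the content, then record its object ID."""
        object_id = f"obj-{len(self.object_store) + 1}"
        self.object_store[object_id] = content
        self.db[link] = object_id

    def read(self, link: str):
        """Read flow: cache first, then database and object store on a miss."""
        cached = self.cache.get(link)
        if cached is not None:
            return cached
        object_id = self.db.get(link)
        if object_id is None:
            return None
        self.store_reads += 1
        content = self.object_store[object_id]
        self.cache.put(link, content)
        return content
```

Reading the same link twice reaches the object store only once; the second read is served from the cache until the entry is evicted.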



Detailed Component Design


Explained in the high level design above
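
One component worth sketching is link generation. The Base64 conversion mentioned in the high-level design can be applied to the numeric pasteID, so that distinct IDs always produce distinct links. The sketch below assumes URL-safe Base64 over the ID's minimal big-endian byte representation; the function names are hypothetical.

```python
import base64


def encode_paste_id(paste_id: int) -> str:
    """Encode a non-negative pasteID as a short URL-safe Base64 string."""
    if paste_id < 0:
        raise ValueError("paste_id must be non-negative")
    # Minimal big-endian encoding, so each integer maps to exactly one byte string.
    length = max(1, (paste_id.bit_length() + 7) // 8)
    raw = paste_id.to_bytes(length, "big")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_paste_link(link: str) -> int:
    """Recover the pasteID from a link produced by encode_paste_id."""
    padded = link + "=" * (-len(link) % 4)  # restore the stripped padding
    return int.from_bytes(base64.urlsafe_b64decode(padded), "big")
```

Because the mapping is reversible, a read request can decode the link back to the pasteID and look the row up by primary key. Note that the encoded strings are unique but do not sort in the same order as the IDs.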



Trade Offs/Tech Choices

  1. S3 is a good fit for storing the pasted content. It can scale horizontally and handle replication on its own



Failure Scenarios/Bottlenecks

  1. Our data store runs out of space
  2. Our database goes down
  3. Too many requests can overload our service



Future Improvements

  1. We can use a Write Ahead Log (WAL) to protect from data store failures
  2. Schedule a job to run routinely to get rid of expired paste links
  3. Support not just pasting text, but also media like photos, files and videos
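
The routine cleanup job in item 2 could poll a structure ordered by expiry time, in the spirit of the Redis sorted set mentioned in the database design. The sketch below uses Python's heapq as an in-process stand-in, not Redis itself, and the class and method names are hypothetical.

```python
import heapq


class ExpiryQueue:
    """Min-heap of (expires_at, paste_id), so the earliest expiry is always first."""

    def __init__(self):
        self._heap = []

    def add(self, paste_id: int, expires_at: float) -> None:
        heapq.heappush(self._heap, (expires_at, paste_id))

    def pop_expired(self, now: float) -> list:
        """Remove and return the IDs of all pastes that have expired by `now`."""
        expired = []
        # Only the first element needs checking: if it has not expired, nothing has.
        while self._heap and self._heap[0][0] <= now:
            _, paste_id = heapq.heappop(self._heap)
            expired.append(paste_id)
        return expired
```

A scheduled job would call `pop_expired` with the current time, delete the returned paste IDs from the paste table, and mark the corresponding objects as deleted in the object store.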