System requirements


Functional:

  • Users can store texts
  • Texts can be shared via a unique link
  • Texts have a defined lifetime


Non-Functional:

  • Displaying the text using the link must be fast (< 250ms)
  • Strong consistency is not required (for instance, a new link may take a short while to become available to all users)
  • The system must be highly available, though not to the level of a financial service (99.9% uptime is acceptable)


Capacity estimation

  • 1M pasted texts per day
  • 10KB per text
  • ~10GB of new data each day
  • ~4TB per year (10GB × 365 ≈ 3.65TB)
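
The figures above can be checked with a short calculation. This is only a sketch: it assumes decimal units (1 KB = 1,000 bytes), a 365-day year and evenly spread traffic, and the average write rate it derives is an extra number that is not part of the original estimate.

```python
# Back-of-the-envelope storage estimate (decimal units: 1 KB = 1,000 bytes).
PASTES_PER_DAY = 1_000_000
BYTES_PER_PASTE = 10 * 1000  # 10 KB per text, from the estimate above

daily_bytes = PASTES_PER_DAY * BYTES_PER_PASTE
yearly_bytes = daily_bytes * 365

daily_gb = daily_bytes / 1000**3
yearly_tb = yearly_bytes / 1000**4

# Assumption: traffic is spread evenly over the day (real traffic will peak).
avg_writes_per_second = PASTES_PER_DAY / 86_400

print(f"{daily_gb:.0f} GB/day, {yearly_tb:.2f} TB/year, "
      f"{avg_writes_per_second:.1f} writes/s on average")
```

The yearly figure comes out at 3.65 TB, which the estimate rounds up to ~4TB.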


API design

v1/paste

input: text

returns: link


v1/view

input: unique ID

returns: text
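
A possible shape for these two endpoints, written as plain functions rather than real HTTP handlers. Everything beyond the input and return values is an assumption made for illustration: the host name, the link format, the HTTP verbs mentioned in the docstrings, and the in-memory dictionary standing in for the database.

```python
import secrets
from dataclasses import dataclass
from typing import Optional

BASE_URL = "https://paste.example"  # hypothetical host, not part of the design

_store = {}  # stand-in for the database: unique ID -> text


@dataclass
class PasteResponse:
    link: str


def paste(text: str) -> PasteResponse:
    """v1/paste (assumed POST): store the text and return a shareable link."""
    unique_id = secrets.token_urlsafe(8)
    _store[unique_id] = text
    return PasteResponse(link=f"{BASE_URL}/v1/view/{unique_id}")


def view(unique_id: str) -> Optional[str]:
    """v1/view (assumed GET): return the text, or None if unknown or purged."""
    return _store.get(unique_id)
```

A caller would pass the last path segment of the returned link to `view` to read the text back.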


Database design

PastedText table

  • ID INT (index)
  • UniqueID TEXT (index)
  • Content TEXT
  • CreationTimestamp DATETIME
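
The table can be tried out locally as follows. SQLite is used here only because it ships with Python, not as a recommendation for the production store. The NOT NULL and UNIQUE constraints and the two helper functions are assumptions added for the sketch; UNIQUE also gives UniqueID its index.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE PastedText (
        ID                INTEGER PRIMARY KEY,   -- indexed as the primary key
        UniqueID          TEXT NOT NULL UNIQUE,  -- UNIQUE creates the lookup index
        Content           TEXT NOT NULL,
        CreationTimestamp DATETIME NOT NULL
    )
    """
)


def insert_paste(unique_id, content):
    """Hypothetical helper: store one pasted text with the current UTC time."""
    conn.execute(
        "INSERT INTO PastedText (UniqueID, Content, CreationTimestamp) "
        "VALUES (?, ?, ?)",
        (unique_id, content, datetime.now(timezone.utc).isoformat()),
    )


def find_content(unique_id):
    """Hypothetical helper: look a text up by the ID used in its link."""
    row = conn.execute(
        "SELECT Content FROM PastedText WHERE UniqueID = ?", (unique_id,)
    ).fetchone()
    return row[0] if row else None
```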



High-level design

  • The main entry point is the API servers
  • Writes are handled asynchronously through a queue system. This system is in charge of writing the new text to the database and replicating it to the CDN
  • Purging old pasted text is handled by a dedicated service.




Request flows

  • The user calls the API to paste a text
  • The application server generates a new unique ID for the shareable link
  • The text is then written to the database
  • The user wants to view a text
  • The application server first checks whether the text is in the cache
  • If not, it fetches the text from the database
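
The view flow is essentially a cache-aside read. A minimal sketch, assuming two dictionaries as stand-ins for the cache and the database. The class name, the `db_reads` counter and the step that fills the cache after a miss are assumptions for illustration; the flow above only says that the database is read on a miss.

```python
class PasteService:
    """Toy application server with a cache in front of the database."""

    def __init__(self):
        self.database = {}  # stand-in for the real database
        self.cache = {}     # stand-in for the real cache
        self.db_reads = 0   # counts how often the database is hit

    def paste(self, unique_id, text):
        # Write path: the text is stored in the database under its unique ID.
        self.database[unique_id] = text

    def view(self, unique_id):
        # Read path, step 1: check the cache first.
        if unique_id in self.cache:
            return self.cache[unique_id]
        # Step 2: on a miss, fetch from the database.
        self.db_reads += 1
        text = self.database.get(unique_id)
        # Assumption: fill the cache so the next read skips the database.
        if text is not None:
            self.cache[unique_id] = text
        return text
```

With this arrangement only the first view of a given text reaches the database.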


Detailed component design

  • One key point is the generation of a unique ID for the link sharing.
    • We should try to avoid generating an ID that was already used in the past, even if the associated text has been purged.
    • One possible solution to generate the unique ID: unique_id = hash(content + timestamp + serverid)
    • Another solution would be to generate a random string (using a different seed for each server). In case of collision we could simply generate another random string.
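
Both ID generation options can be sketched in a few lines. The assumptions here are SHA-256 as the hash, a 16-character truncation, an 8-character alphanumeric random ID, and a `used_ids` set standing in for the lookup of IDs that were ever issued. The random variant uses Python's `secrets` module (OS entropy) instead of a per-server seed, which is a substitution for the seeding idea and not the same technique.

```python
import hashlib
import secrets
import string
import time

ALPHABET = string.ascii_letters + string.digits
ID_LENGTH = 8  # assumed length, not specified in the design


def hashed_id(content, server_id, timestamp_ns=None):
    """Option 1: unique_id = hash(content + timestamp + serverid)."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    payload = f"{content}|{timestamp_ns}|{server_id}".encode("utf-8")
    # Truncating the digest shortens the link but raises the collision risk.
    return hashlib.sha256(payload).hexdigest()[:16]


def random_id(used_ids):
    """Option 2: draw random strings until one has never been used."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(ID_LENGTH))
        if candidate not in used_ids:
            used_ids.add(candidate)
            return candidate
```

To keep purged IDs from being reissued, `used_ids` would have to remember IDs even after their texts are deleted.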


  • The purge service runs a scheduled task at regular intervals, outside periods of peak activity. It uses the CreationTimestamp of each pasted text to decide whether it must be deleted.
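
One pass of the purge task could look like the sketch below. The 30-day lifetime, the function name and the dictionary used as a stand-in for the table are assumptions; scheduling the pass outside peak hours is left to whatever job scheduler is in place.

```python
from datetime import datetime, timedelta, timezone

LIFETIME = timedelta(days=30)  # assumed lifetime, not specified in the design


def purge_expired(pastes, now):
    """Delete every paste older than LIFETIME and return the purged IDs.

    `pastes` maps a unique ID to a row dict holding a CreationTimestamp.
    """
    cutoff = now - LIFETIME
    expired = [
        unique_id
        for unique_id, row in pastes.items()
        if row["CreationTimestamp"] < cutoff
    ]
    for unique_id in expired:
        del pastes[unique_id]
    return expired
```

Passing `now` in explicitly, for example `datetime.now(timezone.utc)`, keeps the pass easy to test with fixed timestamps.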


Trade-offs/Tech choices

A lot of data is written each day, so we need high write throughput

Strong consistency is not required in this context

A NoSQL database should be used so we can scale horizontally more easily; the access pattern is a simple lookup by unique ID



Failure scenarios/bottlenecks

There can be peaks of activity when users try to write a lot of data:

  • We can handle the writing asynchronously in order to prevent this bottleneck
  • A queue system like Kafka or another technology can be used for this purpose


If a database node is unavailable, the text is not lost: it stays in the queue until it can be written

The API server could write the text synchronously as a fallback if the queue system isn't available
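
The queue-first write with a synchronous fallback can be sketched as follows. The in-memory `WriteQueue` is a toy stand-in for Kafka or a similar system, and every name in the block is hypothetical.

```python
from collections import deque


class QueueUnavailable(Exception):
    """Raised by the toy queue when it cannot accept messages."""


class WriteQueue:
    """In-memory stand-in for the real queue system."""

    def __init__(self, available=True):
        self.available = available
        self._items = deque()

    def publish(self, item):
        if not self.available:
            raise QueueUnavailable()
        self._items.append(item)

    def drain_into(self, database):
        # Consumer side: write every queued text to the database.
        while self._items:
            unique_id, text = self._items.popleft()
            database[unique_id] = text


def submit_paste(unique_id, text, write_queue, database):
    """Prefer the asynchronous path; write directly if the queue is down."""
    try:
        write_queue.publish((unique_id, text))
        return "queued"
    except QueueUnavailable:
        database[unique_id] = text
        return "written synchronously"
```

While the queue is healthy the text only reaches the database once the consumer drains it, which is acceptable here because strong consistency is not required.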


Future improvements

We can improve this design by adding analytics and logging

Some texts could be pasted several times, so we could add a way to store identical content just once
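
One way to store identical content once is to key the content by its hash and let each link point at that hash. This is a sketch under assumptions: SHA-256 as the content hash, two in-memory dictionaries as storage, and hypothetical class and method names. Purging would also need to check that no other link still references a blob before deleting it, which is not shown.

```python
import hashlib


class DedupStore:
    """Stores each distinct content once; links reference it by hash."""

    def __init__(self):
        self.blobs = {}  # content hash -> content
        self.links = {}  # unique ID -> content hash

    def put(self, unique_id, content):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        # setdefault keeps the existing blob if this content was seen before.
        self.blobs.setdefault(digest, content)
        self.links[unique_id] = digest

    def get(self, unique_id):
        digest = self.links.get(unique_id)
        return None if digest is None else self.blobs[digest]
```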