System requirements
Functional:
- Users can store texts
- Texts can be shared via a unique link
- Texts have a defined lifetime
Non-Functional:
- Displaying the text using the link must be fast (< 250ms)
- Strong consistency is not required (for instance, a new link can become available to all users after some delay)
- The system must be highly available but not as much as a financial service would be (99.9% uptime is acceptable)
Capacity estimation
- 1M pasted texts per day
- 10KB per text
- ~10GB of new data each day
- ~4TB per year
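The arithmetic behind these estimates can be checked with a short script. It is only a sketch: it assumes decimal units (1 GB = 10^9 bytes) and a write rate spread evenly over the day, neither of which is stated above.

```python
# Back-of-the-envelope capacity check (decimal units assumed).
PASTES_PER_DAY = 1_000_000
BYTES_PER_PASTE = 10_000  # 10 KB per text

bytes_per_day = PASTES_PER_DAY * BYTES_PER_PASTE
gb_per_day = bytes_per_day / 1e9
tb_per_year = bytes_per_day * 365 / 1e12
# Average only; real traffic will have peaks above this value.
writes_per_second = PASTES_PER_DAY / 86_400

print(f"{gb_per_day:.0f} GB/day, {tb_per_year:.2f} TB/year, "
      f"{writes_per_second:.1f} writes/s on average")
```

This gives 10 GB per day and 3.65 TB per year, which rounds to the ~4TB figure, and an average of about 12 writes per second.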
API design
v1/paste
input: text
returns: link
v1/view
input: unique ID
returns: text
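The two endpoints can be sketched as plain functions, leaving out the HTTP layer. Everything here is illustrative: the names paste, view and BASE_URL, the link format, and the in-memory dictionary standing in for the database are assumptions, not part of the design above.

```python
import secrets
from typing import Optional

BASE_URL = "https://paste.example/v1/view/"  # hypothetical link format
_store: dict[str, str] = {}  # stand-in for the real database


def paste(text: str) -> str:
    """v1/paste: store the text and return a shareable link."""
    unique_id = secrets.token_urlsafe(8)
    _store[unique_id] = text
    return BASE_URL + unique_id


def view(unique_id: str) -> Optional[str]:
    """v1/view: return the text for a unique ID, or None if unknown."""
    return _store.get(unique_id)


link = paste("hello")
print(view(link.rsplit("/", 1)[-1]))  # hello
```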
Database design
PastedText table
- ID INT (index)
- UniqueID TEXT (index)
- Content TEXT
- CreationTimestamp DATETIME
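As an illustration only, the table can be written down with the sqlite3 module from the Python standard library. SQLite is used here because it runs in memory without setup, not because it is the intended store. The NOT NULL and UNIQUE constraints are assumptions added for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE PastedText (
        ID INTEGER PRIMARY KEY,            -- indexed implicitly
        UniqueID TEXT NOT NULL UNIQUE,     -- UNIQUE also creates an index
        Content TEXT NOT NULL,
        CreationTimestamp DATETIME NOT NULL
    )
    """
)
conn.execute(
    "INSERT INTO PastedText (UniqueID, Content, CreationTimestamp) "
    "VALUES (?, ?, datetime('now'))",
    ("abc123", "hello"),
)
# Lookup by UniqueID is the query behind v1/view.
row = conn.execute(
    "SELECT Content FROM PastedText WHERE UniqueID = ?", ("abc123",)
).fetchone()
print(row[0])  # hello
```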
High-level design
- The main entry point is the API servers
- Writes are handled asynchronously through a queue system. This system is in charge of writing the new text to the database and replicating it to the CDN
- Purging old pasted text is handled by a dedicated service.
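The dedicated purge service can be sketched as a function that a scheduler calls at regular intervals. This is a minimal sketch under assumptions: the 30-day LIFETIME is a placeholder because no retention period is given here, and a dictionary mapping IDs to creation timestamps stands in for the database.

```python
from datetime import datetime, timedelta, timezone

LIFETIME = timedelta(days=30)  # assumed retention period


def purge_expired(pastes: dict[str, datetime], now: datetime) -> list[str]:
    """Delete entries older than LIFETIME and return the purged IDs."""
    expired = [uid for uid, created in pastes.items() if now - created > LIFETIME]
    for uid in expired:
        del pastes[uid]
    return expired


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
pastes = {
    "old": now - timedelta(days=45),
    "new": now - timedelta(days=2),
}
print(purge_expired(pastes, now))  # ['old']
```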
Request flows
- User calls the API to paste text
- Application server generates a new unique ID for the shared link
- Text is then written to the database through the write queue
- User wants to view a text
- Application server first checks whether the text is in the cache
- If not, it fetches it from the database
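The view flow corresponds to the cache-aside pattern. A minimal sketch, with two dictionaries standing in for the cache and the database (both names, and the step that fills the cache after a miss, are assumptions):

```python
from typing import Optional

cache: dict[str, str] = {}
database: dict[str, str] = {"abc123": "hello"}  # stand-in for the real store


def view_text(unique_id: str) -> Optional[str]:
    """Return the text from the cache, falling back to the database."""
    text = cache.get(unique_id)
    if text is None:
        text = database.get(unique_id)
        if text is not None:
            cache[unique_id] = text  # populate the cache for later reads
    return text


print(view_text("abc123"))  # cache miss: read from the database
print("abc123" in cache)    # the entry is now cached
```

Unknown IDs are not cached in this sketch, so repeated lookups of a missing ID would still reach the database.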
Detailed component design
- One key point is the generation of a unique ID for the shared link.
- We should try to avoid generating an ID that was already used in the past, even if the associated text has been purged.
- One possible solution to generate the unique ID: unique_id = hash(content + timestamp + serverid)
- Another solution would be to generate a random string (using a different seed for each server). In case of a collision, we could simply generate another random string.
- The purge service runs a scheduled task at regular intervals, outside periods of peak activity. It uses the CreationTimestamp of each pasted text to decide whether it must be deleted.
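The two ID generation options above can be sketched as follows. Several choices are assumptions made for illustration: SHA-256 as the hash, an 8-character ID, and the is_taken callback that would query the store of used IDs. The random option uses the secrets module, which draws from the operating system's entropy source, so it swaps out the per-server seed mentioned above.

```python
import hashlib
import secrets
import string
import time
from typing import Callable

ALPHABET = string.ascii_letters + string.digits
ID_LENGTH = 8  # assumed length; 62**8 is about 2.2e14 possible IDs


def hashed_id(content: str, server_id: str) -> str:
    """Option 1: hash(content + timestamp + server ID), truncated."""
    payload = f"{content}{time.time_ns()}{server_id}".encode("utf-8")
    # Truncating the hex digest keeps links short but makes collisions
    # possible, so a collision check is still needed in practice.
    return hashlib.sha256(payload).hexdigest()[:ID_LENGTH]


def random_id(is_taken: Callable[[str], bool]) -> str:
    """Option 2: random string, regenerated on collision."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(ID_LENGTH))
        if not is_taken(candidate):
            return candidate


used_ids = {"aaaaaaaa"}  # should include purged IDs to avoid reuse
new_id = random_id(lambda uid: uid in used_ids)
print(len(new_id), len(hashed_id("hello", "server-1")))  # 8 8
```

To honor the no-reuse goal, the set of used IDs would have to outlive the purged texts, for example by keeping the ID row and deleting only the content.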
Trade offs/Tech choices
A lot of data is written each day, so we need high write throughput
Strong consistency is not required in this context
A NoSQL database should be used so we can scale better
Failure scenarios/bottlenecks
There can be peaks of activity when users try to write a lot of data:
- We can handle the writing asynchronously in order to prevent this bottleneck
- A queue system like Kafka or another technology can be used for this purpose
If any database node is not available, the text is not lost thanks to the queue
The API server could write the text synchronously as a fallback if the queue system isn't available
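The asynchronous write path and its synchronous fallback can be sketched with the standard library. Here queue.Queue and a worker thread stand in for Kafka and its consumer, and the queue_available flag is a hypothetical stand-in for a real health check.

```python
import queue
import threading

database = {}                # stand-in for the real store
write_queue = queue.Queue()  # stand-in for Kafka or a similar system


def writer() -> None:
    """Consume queued pastes and persist them."""
    while True:
        item = write_queue.get()
        if item is None:  # sentinel: stop the worker
            write_queue.task_done()
            break
        unique_id, text = item
        database[unique_id] = text
        write_queue.task_done()


def submit(unique_id: str, text: str, queue_available: bool = True) -> None:
    """Enqueue the write, or write synchronously if the queue is down."""
    if queue_available:
        write_queue.put((unique_id, text))
    else:
        database[unique_id] = text  # synchronous fallback


worker = threading.Thread(target=writer, daemon=True)
worker.start()
submit("a1", "queued write")
submit("b2", "direct write", queue_available=False)
write_queue.join()     # wait until the queued write is persisted
write_queue.put(None)  # ask the worker to stop
worker.join()
print(sorted(database))  # ['a1', 'b2']
```

An in-process queue loses its contents if the process dies, so the durability described above depends on a persistent queue system.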
Future improvements
We can improve this design with analytics and logs
Some texts could be pasted several times, so we could add a way to store them just once
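One possible way to store duplicate texts once is content-addressed storage: the content is keyed by its hash, and each link points to that hash. This is a sketch under assumptions: SHA-256 as the hash, and the dictionaries contents and links as hypothetical stand-ins for two tables.

```python
import hashlib

contents: dict[str, str] = {}  # content hash -> text, stored once
links: dict[str, str] = {}     # unique ID -> content hash


def store(unique_id: str, text: str) -> None:
    """Store the text once per distinct content and point the link at it."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    contents.setdefault(digest, text)
    links[unique_id] = digest


store("id1", "same text")
store("id2", "same text")
print(len(links), len(contents))  # 2 1
```

With this layout the purge service could only delete a content entry once no remaining link refers to it.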