System requirements


Functional:


  1. Allow users to input a large text and store it in the system.
  2. Generate a unique URL for each text input that users can access (the user can also name the text).
  3. Automatically delete the stored text after 30 days.
  4. Support retrieving the stored input text by the generated URL.
  5. Allow users to delete the paste they have stored.


Non-Functional:


  1. The system should handle a large number of concurrent users and store a significant amount of text data efficiently.
  2. Text storage and retrieval operations should have low latency to provide a seamless user experience. P99 latency should be under 300 ms.
  3. The system should be highly available.
  4. Generated paste URLs should be hard to guess, to prevent unauthorized access to the pasted content.
  5. Once content has been stored, it should not be lost unless it was deleted or has expired, and content delivery should be reliable.


API design



// The API to post the text to store. This API should be rate limited per IP, for example by allowing 5 pastes/s for an IP.

POST v1/text/

paste: String // the text to store

language: enum

response: 201 created (textId)


GET v1/text/{textId}

response: 200 -

paste: string // text to return

language: enum

404 - paste not found

400 - bad request


DELETE v1/text/{textId}

response: 204

404 - The content doesn't exist

403 - The user does not have permission to delete the content
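
The per-IP limit described for the POST API could be enforced with a token bucket. The following is a minimal sketch under stated assumptions, not a definitive implementation: the class name IpRateLimiter, its parameters, and the in-memory dictionary are hypothetical and chosen for illustration only. With several servers behind a load balancer, the bucket state would have to live in shared storage instead.

```python
import time


class IpRateLimiter:
    """Token bucket per client IP (hypothetical helper, in-memory only)."""

    def __init__(self, rate_per_sec=5.0, burst=5, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens refilled per second
        self.burst = burst         # maximum tokens one IP can hold
        self.clock = clock         # injectable so the limiter can be tested
        self.buckets = {}          # ip -> (tokens, time of last update)

    def allow(self, ip):
        """Return True if the request may proceed, False if it is rejected."""
        now = self.clock()
        tokens, last = self.buckets.get(ip, (float(self.burst), now))
        # Refill in proportion to the elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

A server would call allow(client_ip) before handling POST v1/text/ and reject the request when it returns False. Which status code to return in that case is not specified above; 429 is a common choice.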


High-level design



The system should consist of a client, a server, and a database to serve the basic requests. To remove expired texts from the database, a separate cron job should run daily.


When a paste is read through its generated URL, the client sends a request to the server, and the server retrieves the corresponding record from the database. Once retrieved, the paste and its metadata are returned to the client in the response format described above.


A load balancer should be placed in front of the servers. It is in charge of making sure that no single server is overloaded and that traffic is distributed evenly across the servers.


A cache should be introduced to store frequently accessed records, especially pastes that are read very often. If a paste is deleted, it should also be deleted from the cache. The cache can use a least recently used (LRU) eviction strategy. If there is a partial outage in which the database is unavailable, the server can provide degraded service by serving records from the cache for as long as the cache is available.
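
The cache behaviour described above (LRU eviction, read-through on a miss, invalidation on delete) can be sketched as follows. This is an illustrative sketch only: LruCache, read_paste, and delete_paste are hypothetical names, and a plain dict stands in for the database.

```python
from collections import OrderedDict


class LruCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # drop the least recently used

    def delete(self, key):
        self.items.pop(key, None)


def read_paste(text_id, cache, db):
    """Try the cache first, fall back to the database, then fill the cache."""
    paste = cache.get(text_id)
    if paste is not None:
        return paste
    paste = db.get(text_id)                  # db is a plain dict in this sketch
    if paste is not None:
        cache.put(text_id, paste)
    return paste


def delete_paste(text_id, cache, db):
    """Remove the paste from the database and invalidate the cached copy."""
    db.pop(text_id, None)
    cache.delete(text_id)
```

In a real deployment the cache would be a shared service rather than a per-process object, so that a delete handled by one server invalidates the entry for all servers.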



Detailed component design



  1. The server. Generating URLs for pastes: once a client posts a paste, we generate a URL of the form /pastes/{user_id}/{unique_id}. The unique ID can use the characters [0-9], [a-z], and [A-Z], which gives an alphabet of 62 characters. To keep URLs unique, the ID should initially be at least 3 characters long (62^3 = 238,328 possible IDs per user); as more users join and use the service, we can move to longer IDs. The ID should be a random string. If a collision is found when inserting it into the database, we regenerate the string. The whole process is: client posts a paste -> server generates the URL -> server stores the paste and its metadata in the database -> server returns the generated URL to the client. Because URLs are namespaced by user ID, collisions should be infrequent and should not add significant delay.
  2. Database. Initially the database is small and fits on a single machine. However, to maintain high availability we should replicate it. We can start with one redundant replica in a master-slave configuration: all writes go to the master, which synchronizes its data to the slaves, and the slaves can serve reads.
    1. Once the user base and the database have grown large enough, the database should be sharded. Because the number of users is expected to keep growing, consistent hashing should be used to distribute the stored content across shards.
  3. Similarly, the server should be replicated so that it can serve all requests. As more users join the platform, more machines should be added to the server layer.
  4. As mentioned above, a cache can be added to improve the system's read performance, for example using Redis. Every time a read request arrives, the server should first try to fetch the paste from the cache. If that fails, the server falls back to fetching the record from the database and stores it in the cache before returning it.
  5. Expired pastes should be marked as EXPIRED in the database and evicted from the cache. When a user attempts to read an expired paste, we should return 404.
  6. Rate limiting should be implemented per IP, so that a single user cannot abuse the system.
  7. Each paste should have a hash value computed from its content. By hashing an incoming paste, we can tell whether an identical paste already exists and avoid creating duplicate records.
  8. A lock-on-miss strategy can be used to protect the database from a thundering herd. With it, a process that gets a cache miss must acquire a lock before reading from the database. If it cannot acquire the lock, it waits for a while and then retries the cache.
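
The lock-on-miss strategy from item 8 can be sketched within a single process using threads. This is only an illustration under assumptions: LockOnMissReader and its parameters are hypothetical, the cache is a plain dict, and with multiple servers the lock would need to be held in a shared store rather than in process memory.

```python
import threading
import time


class LockOnMissReader:
    """Cache reader in which only the lock holder loads a missing key."""

    def __init__(self, load_from_db, retry_delay=0.01, max_retries=100):
        self.load_from_db = load_from_db   # callable: key -> value
        self.retry_delay = retry_delay     # seconds to wait before retrying
        self.max_retries = max_retries
        self.cache = {}
        self.key_locks = {}                # key -> threading.Lock
        self.guard = threading.Lock()      # protects key_locks

    def _lock_for(self, key):
        with self.guard:
            return self.key_locks.setdefault(key, threading.Lock())

    def get(self, key):
        for _ in range(self.max_retries):
            if key in self.cache:
                return self.cache[key]
            lock = self._lock_for(key)
            if lock.acquire(blocking=False):
                try:
                    # Re-check: another thread may have filled the cache
                    # between our miss and our acquiring the lock.
                    if key not in self.cache:
                        self.cache[key] = self.load_from_db(key)
                    return self.cache[key]
                finally:
                    lock.release()
            # Another thread is loading this key: wait, then retry the cache.
            time.sleep(self.retry_delay)
        raise TimeoutError(f"gave up waiting for {key!r}")
```

With this arrangement, many concurrent readers of the same uncached paste result in a single database read. The others find the value in the cache on a later retry.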



Database design



Table: Text

Id (varchar): id of text

ownerId (varchar): customerId of the creator of the text.

textToStore (varchar): the text to store

uri: URI of the text

insertTime (DateTime)

expireTime (DateTime)
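
To make the table concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in. Everything beyond the column names is an assumption for illustration: the SQLite column types, the 8-character ID length, storing timestamps as fixed-width UTC strings, and the helper names open_db, random_id, store_paste, and delete_expired. It also shows the regenerate-on-collision insert and the daily cleanup described earlier. The sketch deletes expired rows; marking them EXPIRED instead would be an UPDATE.

```python
import secrets
import sqlite3
import string
from datetime import datetime, timedelta, timezone

ALPHABET = string.digits + string.ascii_letters   # 62 characters


def open_db():
    """Create an in-memory stand-in for the Text table (types are assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE Text (
               id          TEXT PRIMARY KEY,
               ownerId     TEXT NOT NULL,
               textToStore TEXT NOT NULL,
               uri         TEXT NOT NULL,
               insertTime  TEXT NOT NULL,
               expireTime  TEXT NOT NULL
           )"""
    )
    return conn


def stamp(moment):
    # Fixed-width UTC strings sort in chronological order.
    return moment.isoformat(timespec="microseconds")


def random_id(length=8):
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def store_paste(conn, owner_id, text, now=None, ttl_days=30, max_attempts=5):
    """Insert a paste, regenerating the id if the primary key collides."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(days=ttl_days)
    for _ in range(max_attempts):
        text_id = random_id()
        uri = f"/pastes/{owner_id}/{text_id}"
        try:
            with conn:   # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO Text VALUES (?, ?, ?, ?, ?, ?)",
                    (text_id, owner_id, text, uri, stamp(now), stamp(expires)),
                )
            return text_id
        except sqlite3.IntegrityError:
            continue     # id already taken: try a fresh random string
    raise RuntimeError("could not generate a unique id")


def delete_expired(conn, now=None):
    """Cleanup job: delete rows whose expireTime has passed; return the count."""
    now = now or datetime.now(timezone.utc)
    with conn:
        cur = conn.execute(
            "DELETE FROM Text WHERE expireTime <= ?", (stamp(now),)
        )
    return cur.rowcount
```

An index on expireTime would keep the cleanup query cheap as the table grows.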