Requirements


Functional Requirements:

  • A user can create a new pastebin page. When the user clicks the save or share button, any text on the page is stored, and the page URL can be shared with multiple users.
    • A user can set an expiration time for a pastebin. The system MUST delete the pastebin data once it expires.
    • A user can neither update nor delete a pastebin.
  • Many users can read the pastebin once its URL has been created.


Estimation

  • 1 million new pastes per day. The average paste is 10 KB, and the maximum is 1 MB.
    • Around 10 GB of data is stored daily (1 million × 10 KB). If a peak day is twice a normal day, that is 20 GB.
      • That is roughly 3-5 TB of data per year (10 GB × 365 ≈ 3.65 TB at the average rate). Assuming the service lasts 10 years, we need 30-50 TB of storage.
    • Around 12 pastes per second on average (1 million / 86,400 s); peak might be 120 pastes per second.
  • The read-to-write ratio is 10:1, so reads should average about 120 RPS, with peaks of up to about 1,000 RPS.
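
The arithmetic behind these estimates can be checked with a short script. This is only a back-of-envelope sketch: the constant names are made up for illustration, and it assumes decimal units (1 KB = 1,000 bytes) and that the 1 million pastes are spread evenly over a day.

```python
# Back-of-envelope capacity estimate. Inputs are the figures from the
# estimation list; decimal units and an even daily spread are assumptions.
PASTES_PER_DAY = 1_000_000
AVG_PASTE_BYTES = 10 * 1000        # 10 KB average paste
READ_WRITE_RATIO = 10              # 10 reads for every write
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

daily_gb = PASTES_PER_DAY * AVG_PASTE_BYTES / 1e9   # data written per day
yearly_tb = daily_gb * 365 / 1000                   # data written per year
ten_year_tb = yearly_tb * 10                        # 10-year retention
write_qps = PASTES_PER_DAY / SECONDS_PER_DAY        # average writes per second
read_qps = write_qps * READ_WRITE_RATIO             # average reads per second

print(f"daily: {daily_gb:.1f} GB, yearly: {yearly_tb:.2f} TB, 10y: {ten_year_tb:.1f} TB")
print(f"writes: {write_qps:.1f}/s, reads: {read_qps:.1f}/s")
```

The average rates come out slightly under the rounded figures in the list (about 11.6 writes and 116 reads per second), which is expected since the list rounds up for headroom.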

Non-Functional Requirements:

  • The datastore should be able to handle 20 GB of new data daily and store around 3 TB a year.
  • Latency should be low: about 100 ms for a write and about 10 ms for a read.
  • The system should handle a large number of read requests and scale horizontally when needed.
  • The system should be highly available, possibly up to 99.99%, and able to fail over to other regions.
  • The system should be reliable. Data should be persistent and last for at least 10 years.
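
To make the availability target concrete, it helps to convert the percentage into a downtime budget. The helper below is an illustrative sketch (the function name is hypothetical) that assumes a 365-day year.

```python
def downtime_budget_minutes(availability: float, days: float = 365) -> float:
    """Return the allowed downtime in minutes for a given availability fraction."""
    return (1 - availability) * days * 24 * 60


# 99.99% availability leaves a little under an hour of downtime per year.
print(f"{downtime_budget_minutes(0.9999):.2f} minutes per year")
```

A budget this small is one reason the requirement pairs high availability with the ability to fall back to another region.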


API Design

  • POST /api/v1/pastebin
    • request:
      • string data: pastebin text data, up to 1 MB
      • optional expire_time: pastebin expiration time; if not set, the data never expires.
    • response:
      • string id: pastebin unique id
  • GET /api/v1/pastebin/:id
    • params:
      • string id: pastebin unique id
    • response:
      • data: pastebin data
      • metadata: pastebin metadata (such as created_at, expired_at, etc)
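
The two endpoints can be sketched as plain Python methods to show the request validation and the response shape. This is a minimal in-memory stand-in, not the real server: `PasteService`, `Paste`, the use of `secrets.token_urlsafe` for IDs, and the lazy deletion of expired pastes on read are all assumptions made for illustration.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional

MAX_PASTE_BYTES = 1_000_000  # 1 MB limit from the API description


@dataclass(frozen=True)
class Paste:
    id: str
    data: str
    created_at: float
    expired_at: Optional[float]  # None means the paste never expires


class PasteService:
    """Hypothetical in-memory stand-in for the server behind both endpoints."""

    def __init__(self) -> None:
        self._store: Dict[str, Paste] = {}

    def create(self, data: str, expire_time: Optional[float] = None) -> str:
        """POST /api/v1/pastebin: validate, store, and return the new id."""
        if len(data.encode("utf-8")) > MAX_PASTE_BYTES:
            raise ValueError("paste exceeds 1 MB")
        paste_id = secrets.token_urlsafe(8)  # assumed id scheme, not from the design
        self._store[paste_id] = Paste(paste_id, data, time.time(), expire_time)
        return paste_id

    def get(self, paste_id: str, now: Optional[float] = None) -> Optional[Paste]:
        """GET /api/v1/pastebin/:id: return the paste, or None if missing or expired."""
        now = time.time() if now is None else now
        paste = self._store.get(paste_id)
        if paste is None:
            return None
        if paste.expired_at is not None and paste.expired_at <= now:
            del self._store[paste_id]  # lazy deletion when an expired paste is read
            return None
        return paste
```

Note that there is deliberately no update or delete method, matching the functional requirements. Lazy deletion alone would not satisfy the "MUST delete" requirement for pastes that are never read again, so a real system would likely also need a background sweep or a datastore-level TTL.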


High-Level Design

At a high level, the workflow is simple but effective, as shown in the diagram. There are two flows:

  • Create pastebin:
    • The client sends a request to our API gateway, which handles several tasks:
      • Authentication
      • Rate limit checking
      • TLS termination
      • Load balancing between servers
    • The API gateway calls the server to create the pastebin. The server stores the pastebin persistently in the database and responds to the user quickly.
  • Get pastebin:
    • The client sends a request to a CDN node close to the user. If the CDN already has the pastebin, it returns it directly, which reduces both the load on our system and the response time.
    • If the CDN does not have the content, it forwards the request to our system, where it reaches the API gateway.
    • The API gateway again handles several tasks:
      • Rate limit checking
      • TLS termination
      • Load balancing between servers
    • The API gateway calls the Get pastebin API on our server.
    • The server will:
      • Check the cache to see whether the pastebin exists.
        • Check whether the key exists using a Bloom filter, which is very effective at confirming that a key does not exist.
        • Check whether the key was requested previously; if so, respond from the cache and avoid calling the database.
      • If the key might exist but is not in the cache, fetch it from the database.
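
The server-side read path described above (Bloom filter, then cache, then database) can be sketched as follows. This is a simplified illustration under stated assumptions: `BloomFilter` and `PasteReader` are hypothetical names, plain dictionaries stand in for the cache and the database, and the filter size and hash count are arbitrary rather than tuned for the estimated paste volume.

```python
import hashlib
from typing import Dict, Iterator, Optional


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class PasteReader:
    """Hypothetical read path: Bloom filter, then cache, then database."""

    def __init__(self, database: Dict[str, str]) -> None:
        self.database = database          # stands in for the persistent database
        self.cache: Dict[str, str] = {}   # stands in for a cache such as an in-memory store
        self.bloom = BloomFilter()
        self.db_reads = 0                 # counts how often the database is hit
        for key in database:
            self.bloom.add(key)

    def get(self, paste_id: str) -> Optional[str]:
        if not self.bloom.might_contain(paste_id):
            return None                   # definitely absent: skip cache and database
        if paste_id in self.cache:
            return self.cache[paste_id]   # previously requested: no database call
        self.db_reads += 1                # key might exist: fall through to the database
        data = self.database.get(paste_id)
        if data is not None:
            self.cache[paste_id] = data
        return data
```

Because pastes cannot be updated or deleted by users, the filter only needs `add`; expired pastes would still pass the filter and then miss in the database, which costs an extra lookup but never returns wrong data.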


Detailed Component Design

Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.