Requirements
Functional Requirements:
- A user can create a new pastebin page. When the user clicks the save or share button, any text on the page is stored, and the page URL can be shared with multiple users.
- A user can set an expiration time for a pastebin. The system MUST delete the pastebin data once it expires.
- A user can neither update nor delete a pastebin.
- Many users can read the pastebin after the URL is created.
Estimation
- 1 million new pastes per day. The average pastebin is 10KB, and the maximum is 1MB.
- Around 10GB of data will be stored daily (1 million × 10KB). If we assume a peak day is twice a normal day, that is 20GB.
- At the normal daily rate this is about 3.65 TB per year, so we plan for 3-5 TB per year. Assuming the service lasts 10 years, we need 30-50 TB of storage.
- Around 12 pastes per second on average (1 million / 86,400 seconds); the peak might be 120 pastes per second.
- The read-write ratio is 10:1, so reads should average around 120 RPS, but the peak can reach 1000 RPS.
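The arithmetic behind these estimates can be checked with a short script. This is only a back-of-envelope sketch: it assumes the 1 million pastes arrive per day, uses decimal units (1 KB = 1000 bytes), and the constant names are illustrative.

```python
# Back-of-envelope capacity estimate using the assumptions listed above.
PASTES_PER_DAY = 1_000_000
AVG_PASTE_BYTES = 10 * 1000        # 10 KB average paste (decimal units assumed)
READ_WRITE_RATIO = 10              # 10 reads for every write
SECONDS_PER_DAY = 24 * 60 * 60
RETENTION_YEARS = 10

daily_bytes = PASTES_PER_DAY * AVG_PASTE_BYTES
yearly_bytes = daily_bytes * 365
total_bytes = yearly_bytes * RETENTION_YEARS

write_qps = PASTES_PER_DAY / SECONDS_PER_DAY
read_qps = write_qps * READ_WRITE_RATIO

print(f"daily storage : {daily_bytes / 1e9:.1f} GB")
print(f"yearly storage: {yearly_bytes / 1e12:.2f} TB")
print(f"10-year total : {total_bytes / 1e12:.1f} TB")
print(f"average writes: {write_qps:.1f} per second")
print(f"average reads : {read_qps:.1f} per second")
```

Under these assumptions the averages come out to roughly 10 GB per day, 3.65 TB per year, 36.5 TB over 10 years, about 11.6 writes per second, and about 116 reads per second, which matches the rounded figures in the list.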
Non-Functional Requirements:
- The datastore should be able to handle 20GB of new data daily and store around 3 TB a year.
- Latency should be low: about 100ms for a write and around 10ms for a read.
- The system should handle a large number of read requests and scale horizontally when needed.
- The system should ensure high availability, possibly up to 99.99%, and be able to fall back to other regions.
- The system should be reliable. Data should be persistent and last for at least 10 years.
API Design
- POST /api/v1/pastebin
  - requests:
    - string data: pastebin text data, up to 1MB
    - optional expire_time: pastebin expiration time; if not set, the data never expires.
  - responses:
    - string ID: pastebin unique id
- GET /api/v1/pastebin/:id
  - params:
    - string id: pastebin unique id
  - responses:
    - data: pastebin data
    - metadata: pastebin metadata (such as created_at, expired_at, etc.)
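To make the two endpoints concrete, here is a minimal in-memory sketch of the logic a server might run behind them. Several details are assumptions rather than part of the design: `PasteService` and `Paste` are hypothetical names, `expire_time` is treated as an absolute Unix timestamp, the random ID scheme is a placeholder, and the status codes (201, 404, 413) are illustrative choices.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

MAX_PASTE_BYTES = 1_000_000  # 1MB limit from the API description (decimal MB assumed)


@dataclass(frozen=True)
class Paste:
    data: str
    created_at: float
    expired_at: Optional[float]  # None means the paste never expires


class PasteService:
    """In-memory stand-in for the server behind the two endpoints."""

    def __init__(self) -> None:
        self._store: dict = {}  # paste id -> Paste

    def create(self, data: str, expire_time: Optional[float] = None,
               now: Optional[float] = None) -> dict:
        """Logic for POST /api/v1/pastebin."""
        if len(data.encode("utf-8")) > MAX_PASTE_BYTES:
            return {"status": 413, "error": "paste exceeds 1MB"}
        now = time.time() if now is None else now
        paste_id = secrets.token_urlsafe(6)  # placeholder ID scheme (assumption)
        self._store[paste_id] = Paste(data, now, expire_time)
        return {"status": 201, "ID": paste_id}

    def get(self, paste_id: str, now: Optional[float] = None) -> dict:
        """Logic for GET /api/v1/pastebin/:id."""
        now = time.time() if now is None else now
        paste = self._store.get(paste_id)
        expired = (paste is not None and paste.expired_at is not None
                   and paste.expired_at <= now)
        if paste is None or expired:
            self._store.pop(paste_id, None)  # lazily drop expired data
            return {"status": 404, "error": "not found"}
        return {
            "status": 200,
            "data": paste.data,
            "metadata": {"created_at": paste.created_at,
                         "expired_at": paste.expired_at},
        }
```

There is deliberately no update or delete method, since users can neither update nor delete a pastebin. The lazy removal on read is only one way to honor expiration; a real system would likely also need a background cleanup so that unread expired data is deleted.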
High-Level Design
At a high level, the design has a simple but effective workflow, as drawn in the diagram. There are two flows:
- Create pastebin:
  - The client sends the request to our API gateway, which handles:
    - Authentication
    - Rate limit checking
    - TLS termination
    - Load balancing between servers
  - The API gateway then calls the server to create the pastebin. The pastebin is stored persistently in the database, and the server responds to the user quickly.
- Get pastebin:
  - The client sends the request to a CDN close to the user. If the CDN already has the pastebin, it returns it directly, which reduces both the load on our system and the response time.
  - If the CDN does not have the content, the request is forwarded to our system and reaches the API gateway.
  - The API gateway again handles:
    - Rate limit checking
    - TLS termination
    - Load balancing between servers
  - The API gateway calls the Get pastebin API on our server.
  - The server will:
    - Check the cache for the pastebin.
      - Check whether the key exists using a Bloom filter, which is very effective at confirming that a key does not exist.
      - Check whether the key was previously requested; if so, respond from the cache and avoid calling the DB.
    - If the key might exist but is not in the cache, fetch it from the database.
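The server-side read path can be sketched as follows. This is a simplified illustration, not a production design: `BloomFilter` and `PasteReader` are hypothetical names, plain dictionaries stand in for the cache and the database, and the filter size and hash count are arbitrary. The `db_reads` counter exists only to show when the database is actually touched.

```python
import hashlib
from typing import Iterator, Optional


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class PasteReader:
    def __init__(self, database: dict) -> None:
        self.database = database   # stand-in for the real database
        self.cache: dict = {}      # stand-in for the cache tier
        self.bloom = BloomFilter()
        self.db_reads = 0          # counts database lookups, for illustration
        for key in database:
            self.bloom.add(key)

    def get(self, paste_id: str) -> Optional[str]:
        if not self.bloom.might_contain(paste_id):
            return None                      # definitely absent: skip cache and DB
        if paste_id in self.cache:
            return self.cache[paste_id]      # previously requested: skip the DB
        self.db_reads += 1
        data = self.database.get(paste_id)   # can still miss on a false positive
        if data is not None:
            self.cache[paste_id] = data
        return data


reader = PasteReader({"abc123": "hello"})
first = reader.get("abc123")    # goes to the database and fills the cache
second = reader.get("abc123")   # served from the cache
missing = reader.get("nope")    # normally rejected by the Bloom filter
```

Because a Bloom filter never reports a stored key as absent, a negative answer can safely short-circuit the request. A positive answer only means the key might exist, so the database lookup must still tolerate a miss.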
Detailed Component Design
Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.