System requirements


Functional:

  • Store user-submitted text (limit of 50 pastes per user)
  • Each paste must be under 10k characters
  • User-settable expiration for each paste (default 30 days; max 365 days)
  • Paste access is settable (public, shared, private)



Non-Functional:

  • Low latency
  • Eventual consistency




Capacity estimation

  • 10k DAU
  • 2 uploads / user / day
  • 20k uploads / day
  • Assume 1k characters avg per paste
  • 20k uploads * 1k characters * 1B / char = 20MB uploaded / day
  • 20k uploads / ~100k seconds per day = 0.2 write QPS
  • Read:write ratio = 10:1
  • 2 read QPS


  • 20MB * 365 ~= 7.3GB per year
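
The estimate can be checked with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, and the 1 byte per character figure assumes mostly ASCII text.

```python
# Back-of-the-envelope capacity check using the figures listed above.
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400, often rounded to ~100k

dau = 10_000
uploads_per_user_per_day = 2
avg_paste_bytes = 1_000          # 1k characters at 1 byte each (ASCII assumption)
read_write_ratio = 10

uploads_per_day = dau * uploads_per_user_per_day
bytes_per_day = uploads_per_day * avg_paste_bytes
write_qps = uploads_per_day / SECONDS_PER_DAY
read_qps = write_qps * read_write_ratio
bytes_per_year = bytes_per_day * 365

print(f"uploads/day : {uploads_per_day:,}")
print(f"storage/day : {bytes_per_day / 1e6:.0f} MB")
print(f"write QPS   : {write_qps:.2f}")
print(f"read QPS    : {read_qps:.1f}")
print(f"storage/year: {bytes_per_year / 1e9:.1f} GB")
```

With these inputs, 20k uploads of 1kB each come to about 20MB per day and about 7.3GB per year.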


API design

Upload a paste to pastebin

POST /v1/pastes/

params: {
  username
  text
  access
  expiration (in days)
}

returns: {
  url
}
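
The limits from the functional requirements could be enforced when the upload request arrives. The following is a sketch only: `CreatePasteRequest` and `validate` are hypothetical names, and the default access level, the requirement that text be non-empty, and the 1-day minimum expiration are assumptions that the requirements do not state.

```python
from dataclasses import dataclass

MAX_TEXT_CHARS = 10_000            # "text must be under 10k characters"
DEFAULT_EXPIRATION_DAYS = 30
MAX_EXPIRATION_DAYS = 365
ACCESS_LEVELS = {"public", "shared", "private"}


@dataclass(frozen=True)
class CreatePasteRequest:
    username: str
    text: str
    access: str = "public"         # assumed default; the spec does not name one
    expiration: int = DEFAULT_EXPIRATION_DAYS  # in days


def validate(req: CreatePasteRequest) -> list:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    if not req.username:
        errors.append("username is required")
    if not req.text:
        errors.append("text is required")  # assumption: empty pastes are rejected
    elif len(req.text) >= MAX_TEXT_CHARS:
        errors.append(f"text must be under {MAX_TEXT_CHARS} characters")
    if req.access not in ACCESS_LEVELS:
        errors.append("access must be one of: " + ", ".join(sorted(ACCESS_LEVELS)))
    if not 1 <= req.expiration <= MAX_EXPIRATION_DAYS:
        errors.append(f"expiration must be between 1 and {MAX_EXPIRATION_DAYS} days")
    return errors
```

A request that omits access and expiration falls back to the defaults, so `validate(CreatePasteRequest("alice", "hello"))` returns an empty list.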


Get a paste

Just follow the URL returned when the paste was uploaded.




Database design

Pastes:

  • At roughly 20MB of new paste data per day (20k uploads of ~1kB each), a full year of pastes comes to about 7.3GB. Since the maximum expiration time is one year, that roughly bounds the amount of live data, and it fits in the memory of a single in-memory store like Redis. Redis is also useful here because keys can be given a time-to-live, so a paste expires automatically at the expiration time set by the user. Pastes are keyed by their unique URL, which is a short base 62 string. A paste object in the cache stores the following metadata:
  • User ID of its author
  • Access policy
  • Created At (timestamp)
  • Body (actual paste text)
  • Expiration policy
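
One way to picture the cached paste object and the time-to-live derived from it is sketched below. The `Paste` class and its field names are illustrative rather than a fixed schema, and storing the expiration policy as a number of days is an assumption.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta, timezone


@dataclass
class Paste:
    author_id: int
    access: str                # "public", "shared", or "private"
    created_at: datetime
    body: str
    expiration_days: int = 30  # assumed representation of the expiration policy

    def expires_at(self) -> datetime:
        return self.created_at + timedelta(days=self.expiration_days)

    def ttl_seconds(self, now: datetime) -> int:
        """Remaining lifetime in whole seconds, usable as a cache TTL; never negative."""
        return max(0, int((self.expires_at() - now).total_seconds()))

    def to_json(self) -> str:
        """Serialize to a JSON string that can be stored as the cache value."""
        record = asdict(self)
        record["created_at"] = self.created_at.isoformat()
        return json.dumps(record)


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
paste = Paste(author_id=42, access="public", created_at=created, body="hello")
ttl = paste.ttl_seconds(now=created)  # 30 days expressed in seconds
```

With Redis, the serialized value would typically be written with a SET that carries an expiry of `ttl` seconds, so the key disappears on its own when the paste expires.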


Users:

  • User data won't be modified or read nearly as frequently as the pastes, so a relational DB should be acceptable for Users. A row in the relational DB should store at least the following info for a given user:
  • User ID
  • Username
  • email
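
A possible shape for the Users table is sketched below, using an in-memory SQLite database as a stand-in for whichever relational DB is chosen. The column types and the uniqueness constraints on username and email are assumptions.

```python
import sqlite3

# In-memory SQLite stands in for the relational User DB in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE,
        email    TEXT NOT NULL UNIQUE
    )
    """
)
conn.execute(
    "INSERT INTO users (username, email) VALUES (?, ?)",
    ("alice", "alice@example.com"),
)
row = conn.execute(
    "SELECT user_id, username, email FROM users WHERE username = ?",
    ("alice",),
).fetchone()
print(row)
```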




High-level design

Functional Blocks:

  • Web Server: Handles requests, authentication, authorization, Unique URL generation
  • Paste DB: Stores pastes
  • User DB: Stores User info






Request flows

Creating a new paste:

  • Client uploads a new paste
  • Load Balancer routes request to web server
  • Web server generates a new unique URL for the paste
  • Web server persists the paste to the Paste DB, keyed by its URL
  • The web server returns the new URL to the client


Reading a paste:

  • The client follows the paste URL
  • The GET request is routed through the load balancer to the Web Server.
  • The Web Server looks up the paste in the Paste DB with the URL as the key.
  • The Web Server checks the paste's access policy and, if the requester is allowed to see it, returns the paste body to the client.
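
The two flows can be sketched with a plain dictionary standing in for the Paste DB. The function names, the 8-character key length, and the access rules are assumptions: in particular, this sketch treats "shared" like "public" (readable by anyone holding the URL), which the requirements do not define.

```python
import secrets
import string
from typing import Optional

ALPHABET = string.digits + string.ascii_letters  # 62 characters
paste_db = {}  # stand-in for the Paste DB, keyed by the paste's URL key


def create_paste(author: str, text: str, access: str = "public") -> str:
    """Generate a unique key, persist the paste, and return the key."""
    while True:
        key = "".join(secrets.choice(ALPHABET) for _ in range(8))
        if key not in paste_db:  # retry on the (unlikely) collision
            break
    paste_db[key] = {"author": author, "access": access, "body": text}
    return key


def read_paste(key: str, requester: Optional[str] = None) -> Optional[str]:
    """Return the paste body, or None if it is missing or hidden from the requester."""
    paste = paste_db.get(key)
    if paste is None:
        return None
    if paste["access"] == "private" and requester != paste["author"]:
        return None
    return paste["body"]


key = create_paste("alice", "hello", access="private")
print(read_paste(key, requester="alice"))  # the author can read a private paste
print(read_paste(key, requester="bob"))    # other users get None
```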



Detailed component design

  • Fault tolerance: This system likely fits easily onto one or two physical machines. However, this constitutes a single point of failure. In order to prevent data loss, this system should be replicated over multiple availability zones.
  • DB Design: The entire paste DB fits easily into a single Redis instance. However, Redis keeps its data in memory, and depending on how persistence is configured, a crash can lose recent writes. All writes to the Redis cache should therefore also be asynchronously forwarded to a persistent DB. A key-value store or even a simple append-only log should work well here, since this DB is mostly written to and is only read when reloading the cache after an outage. To facilitate replicating the system into multiple datacenters, the writes to one DB instance should be asynchronously forwarded to its replicas.
  • Unique URL generation: Since the system is now distributed geographically, we need to ensure that two pastes are never given the same URL. To achieve this, each URL should contain the unique ID of the web server that generated it, for example by appending the server ID to the end of the URL. Each server then only has to guarantee uniqueness among the URLs it generates itself.
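
A minimal sketch of the URL scheme from the last point follows, assuming each web server is assigned a small integer ID and keeps a local counter. `UrlKeyGenerator` is a hypothetical name, and the fixed two-character server suffix is an assumption made so that keys from different servers can never coincide.

```python
import itertools
import string

ALPHABET = string.digits + string.ascii_letters  # 62 symbols
SERVER_ID_WIDTH = 2  # fixed-width suffix: room for 62**2 = 3844 servers (assumed)


def base62(n: int) -> str:
    """Encode a non-negative integer as a base 62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class UrlKeyGenerator:
    def __init__(self, server_id: int):
        if not 0 <= server_id < 62 ** SERVER_ID_WIDTH:
            raise ValueError("server_id out of range")
        # Left-pad so the suffix always has the same width.
        self._suffix = base62(server_id).rjust(SERVER_ID_WIDTH, ALPHABET[0])
        self._counter = itertools.count()

    def next_key(self) -> str:
        """Return a locally unique counter value followed by the server suffix."""
        return base62(next(self._counter)) + self._suffix
```

Because the suffix has a fixed width, two keys can only be equal if both the counter value and the server ID match. In this sketch the counter lives in memory, so a real server would need to persist it (or otherwise avoid reusing values) across restarts. Sequential keys are also easy to guess, which may matter for non-public pastes.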



Trade offs/Tech choices

  • Monolithic architecture: This system has very few components, and the Web Server handles all of the business logic. Although this concentrates failure risk in a single service, it avoids the complexity that comes with separating functionality into different services.




Failure scenarios/bottlenecks

  • Web Server crashes: This can be handled by keeping two or three servers online in each availability zone, or by routing requests to a different availability zone if the local one is unresponsive.
  • Redis cache crashes: Paste reads and writes will need to be routed to another datacenter until the cache is back online. Once it is, the cache can be reloaded from the local Paste DB. The Paste DB should be eventually consistent with its replicas.
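
Reloading the cache from the persistent store could look like the sketch below, assuming the Paste DB is a simple append-only log of JSON lines. `PasteLog` is a hypothetical name, a plain dictionary stands in for Redis, and the appends are done inline here although the design forwards them asynchronously.

```python
import json
import os
import tempfile


class PasteLog:
    """Append-only JSON-lines log standing in for the persistent Paste DB."""

    def __init__(self, path: str):
        self.path = path

    def append(self, key: str, paste: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "paste": paste}) + "\n")

    def replay(self) -> dict:
        """Rebuild the cache contents; later entries for a key win."""
        cache = {}
        if not os.path.exists(self.path):
            return cache
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                cache[entry["key"]] = entry["paste"]
        return cache


log = PasteLog(os.path.join(tempfile.mkdtemp(), "pastes.log"))
cache = {}
for key, body in [("abc01", "first"), ("def01", "second")]:
    cache[key] = {"body": body}
    log.append(key, cache[key])  # inline here; asynchronous in the design

cache.clear()                    # simulate the cache crashing
cache = log.replay()             # reload from the local log
```

A fuller version would also skip pastes that have already expired during replay and recompute the remaining time-to-live for the rest.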




Future improvements

  • Integrate a pay-per-use system where users pay a certain rate every time they upload a new paste. This could allow us to relax our limit of 50 pastes per user. If this led to a significant increase in demand, we would likely need to scale our DB architecture, perhaps by sharding on a hash of the paste URL.
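
Sharding on a hash of the paste URL could be as simple as the sketch below. `shard_for` is a hypothetical helper, and SHA-256 is just one reasonable choice of stable hash.

```python
import hashlib


def shard_for(url_key: str, num_shards: int) -> int:
    """Map a paste's URL key to a shard index in [0, num_shards)."""
    digest = hashlib.sha256(url_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("abc01", 4))
```

A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings and would send the same key to different shards on different servers. Plain modulo placement moves most keys whenever the shard count changes, so a scheme such as consistent hashing may be worth considering if resharding is expected.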