System requirements


Functional:

By given long URL, if already has shortening URL, return it, or else generate the unique shorten URL, and store in the db.


By given shorten URL, return the original long URL.


Optional: User-defined shorten URL, expiration time setting.


Non-Functional:

1000 QPS, 300 TPS

High availability.

Low latency. P95 1s




Capacity estimation

One pair = 200Bytes

200 * 300 * 60 * 60 *24 = 5.2GB / day * 365 = 2TB / year ~= 20TB / 10 years



API design

POST createShorturl {origin_url(String) } : return short_url(String) HTTP_STATUS 200


GET getOriginurl {short_url(String)} : return origin_url(String)

HTTP_STATUS 302 (redirect)


GET getOriginurl {short_url(String)} : return error

HTTP_STATUS 404 (not found)


Database design

{

short_url: String (search index)

origin_url: String (search index)

}


Use NoSQL Database


High-level design

Please see the high level diagram






Request flows

When client initiates a request, it is first directed to the load balancer, load balancer will assign the request to different servers by round-robin to balance the load between URL servers.


If it's a createShorturl request:

  1. First visit cache, check if the origin_url already exist. If hit cache, return the short_url directly.
  2. If not hit in cache, search in the Database. If the origin_url already in the database, return the short_url directly.
  3. If origin_url is seen the first time, Hash it to form the short_url, store the {short_url, origin_url} in DB. Return the generated short_url.


If it's a getOriginurl request:

  1. First visit cache, if hit cache, return the origin_url directly.
  2. If not hit in cache, search in the Database and return the origin_url.
  3. If not in the database, throw error and return 404 as response.





Detailed component design

Use SHA 256 to hash the (given URL + fixed buffer string), and get the first 7 digits. If hashed value already exists, using the (hashed result + fixed buffer string) to hash again untill we get a unique value.


We can use both uppercase, lowercase and digits for the hash function, so we can generate (26+26+10)^7 = 3500B unique URLs.




Trade offs/Tech choices

Database SQL vs NoSQL - I choose NoSQL based on following reasons:

High scalability, can do horizontal scaling.

The data is just a pair of URLs, no stroing relation with other data.

Supports fast read and write.

More affordable for large volumn data.


Cache Stratergy:

  1. Cache aside: If there's a cache miss, get data from DB and add an entry for the cache, then return the result.
  2. Pros: Very fast if we hit the cache, but it depends on how the invalidation mechanism goes.
  3. Cons: A cache miss will introduce extra steps, which costs more time.
  4. Write through: Whenever a new entry coming in, it writes into the cache and database simultanously.
  5. Pros: High consistency.
  6. Cons: Slower than cache aside.
  7. Write back: write to cache first, then async write to DB.
  8. Pros: high write speed for write-heavy system.
  9. Cons: Data might be inconsistency, and if cache server down before sending the data to DB, it will cause data loss.

Based on the pros and cons, I'd like to use cache aside combine with a suitable invalidation mechanism in this system.


Cache invalidation mechanism:

  1. Least-recently use
  2. Least-frequently use

It kind of depends on the user habit, if the analytics shows that some of the short_urls are very often to request, we might want to apply Least-frequently use. If the analytics shows that user prefer to request recently generated urls, then we might want to apply LRU.



Failure scenarios/bottlenecks

Single load balancer could be the single point of failure



Future improvements

To robust the system we could introduce more load balancers.