My Solution for Designing a Simple URL Shortening Service: A TinyURL Approach with Score: 6/10
by mystic7375
System requirements
Functional:
given a long URL, return a shorter URL
given the short URL, redirect to the original web page
Non-Functional:
support a large volume of URL storage
support high read traffic
high availability
low latency
Capacity estimation
generate 10 new URLs/s, 1 day is: 10 * 86,400 = 864k (~800k)
1 year, new added is ~800k * 365 ≈ 300M
read-to-write ratio is 100:1, so read traffic is 1k/s
suppose we run 10 years: ~3B rows
if we choose base62 codes, then 62^5 ≈ 916M, which barely covers 10 years of growth and will run into collisions/exhaustion as time goes on, so a length of 6 (62^6 ≈ 56.8B) is better to leave headroom for the future
storage needed: ~3B * 1KB ≈ 3TB
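Redoing the arithmetic exactly (a quick sanity check on the rounded figures above):

```python
writes_per_sec = 10
per_day = writes_per_sec * 86_400        # 864,000 ≈ 800k new URLs/day
per_year = per_day * 365                 # ≈ 315 million/year
ten_years = per_year * 10                # ≈ 3.2 billion rows

reads_per_sec = writes_per_sec * 100     # 100:1 read:write → 1k reads/s

# 5 base62 characters (≈916M codes) are too few for 10 years;
# 6 characters (≈56.8B codes) leave ample headroom.
assert 62 ** 5 < ten_years < 62 ** 6

storage_bytes = ten_years * 1024         # 1 KB per row → ≈ 3.2 TB
```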
API design
1: generate(long_url, optional expiration_time)
returns a shorter URL; the expiration time is optional
2: get(short_url)
returns the long URL and redirects to the original page, or returns an error saying the URL is expired or does not exist
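A minimal in-memory sketch of the two endpoints (the dict, counter, and `sho.rt` domain are illustrative stand-ins for the real DB, ID service, and hostname):

```python
import time

_db = {}        # short_code -> (long_url, expires_at or None); stand-in for the KV store
_counter = 0    # stand-in for the globally incrementing ID service

def generate(long_url, expiration_time=None):
    """Allocate the next ID, derive a short code, and store the mapping."""
    global _counter
    _counter += 1
    code = format(_counter, "x")   # real design: base62(counter)
    _db[code] = (long_url, expiration_time)
    return "https://sho.rt/" + code

def get(short_url):
    """Resolve a short URL; the server would answer with a 301/302 redirect."""
    code = short_url.rsplit("/", 1)[-1]
    entry = _db.get(code)
    if entry is None:
        raise KeyError("url does not exist")
    long_url, expires = entry
    if expires is not None and expires < time.time():
        raise KeyError("url is expired")
    return long_url
```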
Database design
consider NoSQL (key-value storage), since it is just a mapping from one string to another; partition the DB on the short URL for scalability and fast lookups, and replicate each partition for high availability (to avoid single-node failure)
High-level design
generate query -> API gateway -> LB -> server -> generate a base62 code to represent the URL -> store the mapping
the returned URL uses a counter-based strategy, which needs a globally incrementing ID
query short URL -> API gateway -> LB -> server (with cache) -> read from DB and return
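The counter-to-short-code step can be sketched as a standard base62 encoding (the alphabet order here is a free choice):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n):
    """Encode a non-negative counter value as a base62 short code."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))
```

Because the counter only ever moves forward, every code is unique by construction; no collision check against the DB is needed.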
Request flows
Detailed component design
counter-based strategy: use ZooKeeper (ZK) to provide a sequence-increment service; its consensus protocol avoids ID conflicts, and it is strongly fault tolerant against single-point failure.
since the read-to-write ratio is very high, introduce a cache to offload database pressure; a clustered cache such as Redis Cluster is better, usually implemented with consistent hashing to decide which cache server to query
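A minimal sketch of consistent hashing for picking a cache server, assuming MD5 as the ring hash and virtual nodes for a more even key distribution (class and node names are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map each key to a cache node on a hash ring; virtual nodes
    smooth out the distribution across physical servers."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next virtual node."""
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

When a node is added or removed, only the keys in its ring segments move, instead of nearly all keys as with naive `hash(key) % N` sharding.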
the DB should scale horizontally: partition it into multiple subsets and use a range-based method to determine which partition to query; each partition is further replicated to multiple regions to increase availability. prioritize eventual consistency over strong consistency, as a little delay is acceptable (i.e. prioritize availability over consistency)
Trade offs/Tech choices
while ZK is good, it cannot support very high traffic on its own; therefore, splitting the ID space across multiple range servers is better, e.g. one server assigns IDs 1 - 1000, another assigns 1001 - 2000, and when a server exhausts its range it requests a new one from ZK. alternatively, the assigning server can use a SQL transaction for simplicity while still avoiding conflicts.
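The range-leasing idea can be sketched like this; `make_local_coordinator` is a stand-in for the ZK (or SQL-transaction) service that hands out blocks:

```python
import threading

class RangeAllocator:
    """Each app server leases a block of IDs from a central coordinator
    and hands them out locally, so the coordinator is contacted only
    once per BLOCK ids rather than once per request."""
    BLOCK = 1000

    def __init__(self, lease_block):
        self._lease_block = lease_block  # callable(size) -> start of new block
        self._lock = threading.Lock()
        self._next = 0
        self._limit = 0                  # exclusive upper bound of current block

    def next_id(self):
        with self._lock:
            if self._next >= self._limit:            # block exhausted
                start = self._lease_block(self.BLOCK)
                self._next, self._limit = start, start + self.BLOCK
            nid = self._next
            self._next += 1
            return nid

def make_local_coordinator():
    """In-process stand-in for the central block-issuing service."""
    state = {"next": 1}
    def lease(size):
        start = state["next"]
        state["next"] += size
        return start
    return lease
```

Usage: `alloc = RangeAllocator(make_local_coordinator())`, then `alloc.next_id()` per shorten request; IDs come out strictly increasing within each server.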
Failure scenarios/bottlenecks
Future improvements
for hot links, use a CDN to accelerate delivery: cached copies live on edge servers (POPs) closer to the user, which is faster and reduces the burden on our servers
do URL validation to avoid redirecting to malicious websites
consider expiration: when looking up the mapping, if it is expired, return null; also set up a background service to periodically clean up expired items in the DB
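The lazy-expiry read path plus the periodic cleanup job can be sketched as follows, using a dict as a stand-in for the KV store:

```python
import time

store = {}  # short_code -> (long_url, expires_at or None)

def put(code, url, ttl=None):
    """Store a mapping; ttl (seconds) is optional."""
    store[code] = (url, time.time() + ttl if ttl is not None else None)

def get(code):
    """Lazy expiry: an expired row is treated as missing on read."""
    entry = store.get(code)
    if entry is None:
        return None
    url, expires = entry
    if expires is not None and expires < time.time():
        return None
    return url

def cleanup():
    """Background sweep: delete expired rows so storage doesn't grow unbounded."""
    now = time.time()
    for code in [c for c, (_, exp) in store.items() if exp is not None and exp < now]:
        del store[code]
```

In production this sweep would run as a scheduled job against each DB partition; reads stay correct even between sweeps because `get` checks expiry itself.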