POST api/v1/shorten/:longUrl
-> 201 Created
GET api/v1/long/:shortUrl
-> 301 Moved Permanently
Here we need to focus mainly on two aspects: shortening the URL and retrieving the original URL.
We assume 10k writes per second on average, with a peak of 100k.
Then
10,000 * 3600 * 24 = 864 million URLs/day
about 26 billion/month
about 315 billion/year
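The arithmetic above can be reproduced with a short script. This is only a sketch of the estimate: the constant names are illustrative, and the 30-day month and 365-day year are assumptions made here.

```python
# Back-of-the-envelope write volume, assuming a steady 10k writes/second.
WRITES_PER_SECOND = 10_000
SECONDS_PER_DAY = 3600 * 24

urls_per_day = WRITES_PER_SECOND * SECONDS_PER_DAY
urls_per_month = urls_per_day * 30    # assumption: 30-day month
urls_per_year = urls_per_day * 365    # assumption: 365-day year

print(f"{urls_per_day:,} URLs/day")
print(f"{urls_per_month:,} URLs/month")
print(f"{urls_per_year:,} URLs/year")
```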
URL shortener:
We will use 58 characters, leaving out the confusing ones (such as 0, O, I and l). We need room for roughly 1T URLs. Taking 8-character short URLs gives 58^8, which is about 128T combinations, more than sufficient.
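As a quick check of the keyspace, the sketch below assumes the commonly used base 58 alphabet (digits and letters without 0, O, I and l). The exact character set is an assumption, since the notes only say that confusing characters are dropped.

```python
# Assumed base 58 alphabet: alphanumerics without 0, O, I and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
CODE_LENGTH = 8
REQUIRED_URLS = 10**12  # roughly 1T URLs

keyspace = len(ALPHABET) ** CODE_LENGTH
headroom = keyspace / REQUIRED_URLS

print(f"alphabet size: {len(ALPHABET)}")
print(f"keyspace: {keyspace:,} codes")
print(f"headroom over 1T URLs: about {headroom:.0f}x")
```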
Storage and DB:
Let's assume a split between temporary and permanent URLs: around 250B temporary URLs per year with a maximum lifetime of 1 year, and 75B permanent URLs per year with a maximum lifetime of 10 years.
Assume 500 bytes per URL.
So at any one time we need to store
250B + 75B * 10 = 250B + 750B = 1T URLs,
which is 500 TB.
We assume 5 TB of storage per DB node, so 100 DB nodes are needed.
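The storage estimate can be written out as follows. This is a sketch of the numbers given above; the constant names are illustrative and decimal units (1 TB = 10^12 bytes) are assumed.

```python
# Steady-state storage estimate from the assumptions in the notes.
TEMPORARY_URLS_PER_YEAR = 250 * 10**9   # expire after 1 year
PERMANENT_URLS_PER_YEAR = 75 * 10**9    # kept for up to 10 years
PERMANENT_RETENTION_YEARS = 10
BYTES_PER_URL = 500
NODE_CAPACITY_BYTES = 5 * 10**12        # 5 TB per DB node (decimal units assumed)

stored_urls = (TEMPORARY_URLS_PER_YEAR
               + PERMANENT_URLS_PER_YEAR * PERMANENT_RETENTION_YEARS)
total_bytes = stored_urls * BYTES_PER_URL
db_nodes = -(-total_bytes // NODE_CAPACITY_BYTES)  # ceiling division

print(f"stored URLs: {stored_urls:,}")
print(f"total storage: {total_bytes / 10**12:.0f} TB")
print(f"DB nodes needed: {db_nodes}")
```

Replication and index overhead are not included, so the real node count would be higher.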
Based on the number of visits and the creation time, we will build a ranking list and use it for caching. We assume the most-visited URLs form a hot set of about 50 GB, i.e. roughly 100M URLs or 0.01% of the total (a full 1% would be 5 TB). This can be handled by a 4 x 16 GB sharded Redis cluster.
We assume 100k reads per second.
Unique Id Generator
There are generally two approaches to URL shortening: 1. URL encoding and 2. key generation with a duplicate check in the DB.
As we already said, we have 58 characters, so we will go with base 58 encoding.
We need a unique decimal number, which we then encode in base 58. For scale we may need multiple unique ID generators. For simplicity, let's assume we have 10 unique ID generators, where unique ID % 10 = ID generator number. Each generator advances its own counter step by step (in increments of 10, so the remainder is preserved).
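A minimal sketch of this scheme is shown below, assuming in-memory counters and the common base 58 alphabet. `IdGenerator` and `encode_base58` are hypothetical names, and a real generator would also need persistence and thread safety.

```python
# Assumed base 58 alphabet: alphanumerics without 0, O, I and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
NUM_GENERATORS = 10


class IdGenerator:
    """Emits IDs such that id % num_generators == generator_num."""

    def __init__(self, generator_num: int, num_generators: int = NUM_GENERATORS):
        if not 0 <= generator_num < num_generators:
            raise ValueError("generator_num out of range")
        self.generator_num = generator_num
        self._step = num_generators
        self._next = generator_num

    def next_id(self) -> int:
        value = self._next
        self._next += self._step  # stepping by the generator count keeps the remainder
        return value


def encode_base58(number: int) -> str:
    """Convert a non-negative integer to its base 58 representation."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, 58)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


generator = IdGenerator(generator_num=3)
for _ in range(3):
    unique_id = generator.next_id()
    print(unique_id, encode_base58(unique_id))
```

Because generators never share a remainder, two generators cannot emit the same ID without any coordination between them.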
We can map these unique ID generator nodes to DB nodes, because on a fetch we can decode the short URL back to its ID and tell which DB nodes hold the data.
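The lookup path could be sketched like this: decode the short code back to its numeric ID, then take the remainder to find the generator and therefore the DB node group. The function names and the one-to-one generator-to-node-group mapping are assumptions for illustration.

```python
# Assumed base 58 alphabet: alphanumerics without 0, O, I and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
CHAR_TO_VALUE = {ch: i for i, ch in enumerate(ALPHABET)}
NUM_GENERATORS = 10


def decode_base58(short_code: str) -> int:
    """Convert a base 58 short code back to its numeric ID."""
    number = 0
    for ch in short_code:
        # Raises KeyError for characters outside the alphabet.
        number = number * 58 + CHAR_TO_VALUE[ch]
    return number


def db_node_group_for(short_code: str) -> int:
    """Return the generator number, used here as the DB node group index."""
    return decode_base58(short_code) % NUM_GENERATORS


print(decode_base58("21"), db_node_group_for("21"))
```

Routing this way needs no lookup table, but changing the number of generators changes the mapping for existing codes.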
Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.
Here we need to think about the CAP theorem. I think we can choose eventual consistency and focus on the other two properties: availability and partition tolerance.
That is why we can use consistent hashing among the servers.
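For reference, a minimal consistent hashing ring might look like the sketch below. The class name, the use of SHA-256, and the 100 virtual nodes per server are all assumptions made for illustration, not part of the design above.

```python
import bisect
import hashlib


def _ring_position(key: str) -> int:
    """Map a string to a stable position on the ring."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self._vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owners = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each server is placed on the ring at several virtual positions.
        for i in range(self._vnodes):
            point = _ring_position(f"{node}#{i}")
            self._owners[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owners[p] != node]
        self._owners = {p: n for p, n in self._owners.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._points, _ring_position(key))
        return self._owners[self._points[idx % len(self._points)]]


ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get_node("abc12345"))
```

When a server is removed, only the keys it owned move to other servers; all other keys keep their placement.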
As latency is an important factor here, we can use a CDN based on geographical location, since there is a high chance that a created URL will mostly be visited from the same geographic zone.
URL collision
A URL collision is unlikely here, because we use a counter and create the unique identifier by base conversion. For scalability, each unique ID counter starts from a different range and is bounded by min and max values.
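The range-based variant can be sketched as below, assuming in-memory counters and equal-sized ranges. `RangeCounter`, `RangeExhausted` and `split_ranges` are hypothetical names introduced for illustration.

```python
class RangeExhausted(Exception):
    """Raised when a counter has used every ID in its range."""


class RangeCounter:
    """Hands out IDs from a private [range_min, range_max] interval."""

    def __init__(self, range_min: int, range_max: int):
        if range_min > range_max:
            raise ValueError("range_min must not exceed range_max")
        self.range_min = range_min
        self.range_max = range_max
        self._next = range_min

    def next_id(self) -> int:
        if self._next > self.range_max:
            raise RangeExhausted(f"range {self.range_min}-{self.range_max} is used up")
        value = self._next
        self._next += 1
        return value


def split_ranges(total_ids: int, num_counters: int):
    """Divide [0, total_ids) into equal, non-overlapping ranges."""
    size = total_ids // num_counters
    return [RangeCounter(i * size, (i + 1) * size - 1) for i in range(num_counters)]


counters = split_ranges(total_ids=58**8, num_counters=10)
print([c.next_id() for c in counters])
```

Since the ranges never overlap, two counters cannot produce the same ID, so no duplicate check against the DB is needed.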
Cache and Storage
We will use a sharded Redis cache layer. Based on the 10% most frequently read URLs, we will cache values with a TTL of 1 day. Popular redirects will also be stored in the CDN for faster delivery. On a cache miss, the request will be served from one of multiple sharded read DBs. For the DB we can use a SQL-based store with a simple table, as the data has no particular relations.
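The read path follows a cache-aside pattern, sketched below. To keep the example self-contained, an in-memory `TTLCache` stands in for Redis and a plain dict stands in for the sharded DB; both, along with the `resolve` function, are assumptions for illustration.

```python
import time

CACHE_TTL_SECONDS = 24 * 3600  # TTL of 1 day


class TTLCache:
    """In-memory stand-in for the Redis cache layer."""

    def __init__(self, ttl_seconds: float = CACHE_TTL_SECONDS, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # short_code -> (long_url, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: evict lazily on read
            return None
        return value

    def set(self, key, value) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)


def resolve(short_code: str, cache: TTLCache, db: dict):
    """Return the long URL, trying the cache first and the DB on a miss."""
    long_url = cache.get(short_code)
    if long_url is not None:
        return long_url
    long_url = db.get(short_code)  # dict stands in for the sharded read DB
    if long_url is not None:
        cache.set(short_code, long_url)
    return long_url


db = {"abc12345": "https://example.com/some/long/path"}
cache = TTLCache()
print(resolve("abc12345", cache, db))  # first call reads the DB and fills the cache
print(resolve("abc12345", cache, db))  # second call is served from the cache
```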
CDN, Rate limiting & Load balancing
We can use geography-based data for load balancing and the CDN, as requests for a given short URL are more likely to come from the same geographical area.
URL cleanup
As we said earlier, we will use a counter. For the temporary URL zone, the counter will be reset after 1 year and the corresponding entries will be cleaned up.
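One way the cleanup step could look is sketched below, assuming each temporary record stores its creation time. The record layout and the `purge_expired` function are hypothetical. The sweep would have to finish before the reset counter starts reusing IDs.

```python
from datetime import datetime, timedelta, timezone

TEMPORARY_LIFETIME = timedelta(days=365)  # temporary URLs live for 1 year


def purge_expired(records: dict, now: datetime) -> list:
    """Delete temporary records older than the lifetime; return removed codes.

    records maps short_code -> (long_url, created_at).
    """
    expired = [
        code
        for code, (_, created_at) in records.items()
        if now - created_at >= TEMPORARY_LIFETIME
    ]
    for code in expired:
        del records[code]
    return expired


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
records = {
    "old00001": ("https://example.com/a", now - timedelta(days=400)),
    "new00001": ("https://example.com/b", now - timedelta(days=10)),
}
print(purge_expired(records, now))
```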