POST api/v1/shorten/:longUrl
-> 201 Created
GET api/v1/long/:shortUrl
-> 301 Moved Permannet
Here we need to focus on two aspects mainly shorten the url and get the url
We assume 10k writes per second and peak 100k
Then
10,000 * 3600 * 24 = 864 millon url/day
26 billon/month
315 billion/ year
URL shortener:
We will take 58 char left the confusing. We know we need roughly 1T url per year. So taking 58^8 , 8 char shorten url which is 128 T url combination sufficient enough
Storage and DB:
Lets assume temporary and permanent url split. Assume that around 250b urls max length 1 year and 75b max 10 year
Assuming 500 byte per url
So for a single year we need storage of,
250 + 750 = 1T url storage ,
which is 500 TB
We assume 5TB of of per db node so 100 db node needed
Based on number of visit and time created we will create a ranking list and used caching. We assume 1% url mostly visited . Which would be 50 gb . Can be handle with 4*16 shared redis cluster
We assume 100k read per second
Unique Id Generator
There can be generally two ideas for url shortening thing 1. Url encoding and 2. Key generation and check for duplicate in the db
As we already said we will have 58 characters so we will go for 58 base encoding.
Counter and Fault Tolerance
We need to have a unique number in decimal and just go for 58 base encoding. For scale we may need multiple unique id generator . For simplicity lets assume we have 10 unique id generator that can have the id of unique id % 10 = id generator num. It will be increase one by one periodically
We need counter nodes to be fault tolerance as it is crucial and possibility of collision. So we will use Quorum consensus system like zookeper that will asign ranges to counter nodes and if any of the counter node down it can assign a new one
We can map these unique id generator nodes to db nodes . Because we can decode short url from fetch and can tell which db nodes the data in
Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.
Here we need to think of CAP theorem . I think we can choose eventual consistency and focus on other two things
Thats why we can use consistent hashing among the servers
As latency is a important factor here we can use CDN as geographical location . As there is high chance created url will mostly visit from that geo graphic zone
The Entry (Write Path) -> Short Url Creation
The Read path -> Redirect from shorten url
Url collision
Url collision will likely not happen here as we are using counter and the creating unique identifier on base conversation. For scalability every unique id counter will start from different range and with min max functionality.
Cache and Storage
We will use redis shared cahed layer . Based ob 10% frequent read we will cache the value with a ttl of 1 day. It will be stored in CDN for faster delivery. In terms of cache miss it will be served from multiple read shareded db.
For db here are no particular relation so there is no dependency to use any SQL based thing . Its better to use no SQL things like Casandra or DynamoDB here. We will use ScyllaDB because its fast and horizontally scalable . It use ring based consistent hashing. And for horizontally scaling the db we can use sharding. About the sharding technique as we using counter we can do range base sharding like 1-100000 in node A, 100001- 200000 in node B etc
CDN, Rate limiting & Load balancing
We can use geological based data for load balancing and cdn . As it is more likely to have the request for same geo logical place for a url shortener
Url cleanup
As we say earlier there are some temporary url as well , So we will use a field ttl to clean them up. For the temporary url the counter range will also be different so that we can reset the counter also
Limitation and Future Scope
We know that here we use incremental based counter and then based conversation that gives us some advantage in a distributed system but there are some limitation as well. Like incremental based id can easily be predictable so security concern is there.
In future like we devide temporary url and permanent url. We can have a separate system like secure url . The api can be like api/v1/secure/shorten/:longUrl
The thing is for the secure url things instead of using incremental counter we can use random unique id -> base conversation -> check for duplicate and go on. We can also explore some hashing based techniques there