Designing A Simple Url Shortening Service A TinyURL Approach - System Design

Requirements

Functional Requirements:

Create a short URL for a given long URL.
Return the long URL associated with a given short URL.

Non-Functional Requirements:

Latency and Reliability

API Design

POST api/v1/shorten/:longUrl

-> 201 Created

GET api/v1/long/:shortUrl

-> 301 Moved Permannet

High-Level Design

Here we need to focus on two aspects mainly shorten the url and get the url

We assume 10k writes per second and peak 100k

Then

10,000 * 3600 * 24 = 864 millon url/day

26 billon/month

315 billion/ year

URL shortener:

We will take 58 char left the confusing. We know we need roughly 1T url per year. So taking 58^8 , 8 char shorten url which is 128 T url combination sufficient enough

Storage and DB:

Lets assume temporary and permanent url split. Assume that around 250b urls max length 1 year and 75b max 10 year

Assuming 500 byte per url

So for a single year we need storage of,

250 + 750 = 1T url storage ,

which is 500 TB

We assume 5TB of of per db node so 100 db node needed

Based on number of visit and time created we will create a ranking list and used caching. We assume 1% url mostly visited . Which would be 50 gb . Can be handle with 4*16 shared redis cluster

We assume 100k read per second

Unique Id Generator

There can be generally two ideas for url shortening thing 1. Url encoding and 2. Key generation and check for duplicate in the db

As we already said we will have 58 characters so we will go for 58 base encoding.

Counter and Fault Tolerance

We need to have a unique number in decimal and just go for 58 base encoding. For scale we may need multiple unique id generator . For simplicity lets assume we have 10 unique id generator that can have the id of unique id % 10 = id generator num. It will be increase one by one periodically

We need counter nodes to be fault tolerance as it is crucial and possibility of collision. So we will use Quorum consensus system like zookeper that will asign ranges to counter nodes and if any of the counter node down it can assign a new one

We can map these unique id generator nodes to db nodes . Because we can decode short url from fetch and can tell which db nodes the data in

Detailed Component Design

Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.

Here we need to think of CAP theorem . I think we can choose eventual consistency and focus on other two things

Thats why we can use consistent hashing among the servers

As latency is a important factor here we can use CDN as geographical location . As there is high chance created url will mostly visit from that geo graphic zone

The Entry (Write Path) -> Short Url Creation

The user will POST api/v1/shorten/:longUrl and then hit rate limiter -> api gateway and the layer 7 load balancer
From the load balancer with the quorum consensus based counter(range fixed) -> unique id generated -> based converted -> write to that ranged based shared db
Return 201 created

The Read path -> Redirect from shorten url

The user will GET api/v1/long/:shortUrl . It will then check DNS and CDN based edge caching if the url is already there
Hit the rate limiter & load balancer -> api gateway. With decoding the short url we already know which range is this. So which node to look on Then the redis caching would be check
If cache miss then it will return from that ranged based shareded db entry
It will return 301 permanent redirect and also by the url visit it will adjust the popularity counter that help to cache the most popular url by day

Url collision

Url collision will likely not happen here as we are using counter and the creating unique identifier on base conversation. For scalability every unique id counter will start from different range and with min max functionality.

Cache and Storage

We will use redis shared cahed layer . Based ob 10% frequent read we will cache the value with a ttl of 1 day. It will be stored in CDN for faster delivery. In terms of cache miss it will be served from multiple read shareded db.

For db here are no particular relation so there is no dependency to use any SQL based thing . Its better to use no SQL things like Casandra or DynamoDB here. We will use ScyllaDB because its fast and horizontally scalable . It use ring based consistent hashing. And for horizontally scaling the db we can use sharding. About the sharding technique as we using counter we can do range base sharding like 1-100000 in node A, 100001- 200000 in node B etc

CDN, Rate limiting & Load balancing

We can use geological based data for load balancing and cdn . As it is more likely to have the request for same geo logical place for a url shortener

Url cleanup

As we say earlier there are some temporary url as well , So we will use a field ttl to clean them up. For the temporary url the counter range will also be different so that we can reset the counter also

Limitation and Future Scope

We know that here we use incremental based counter and then based conversation that gives us some advantage in a distributed system but there are some limitation as well. Like incremental based id can easily be predictable so security concern is there.

In future like we devide temporary url and permanent url. We can have a separate system like secure url . The api can be like api/v1/secure/shorten/:longUrl

The thing is for the secure url things instead of using incremental counter we can use random unique id -> base conversation -> check for duplicate and go on. We can also explore some hashing based techniques there