My Solution for Designing a Simple URL Shortening Service: A TinyURL Approach
by luminous_vibe
System requirements
Functional:
1) Shorten URL
- 8 characters long
- each character is hex, so 4 bits per character (16^8 ≈ 4.3 billion possible slugs)
- how to do this?
- first approach will be hashing the link provided
- md5, sha256 are hashing algorithms we can use
- md5 is faster but weaker, so collisions on the truncated slug are more likely
- to handle collisions we retry (e.g. re-hash with a salt or counter appended) until we find a free slug
2) Redirect Shortened URL
- Redirect to the long URL if the short URL exists in our system
- Error if not
- Permanent vs. temporary redirect:
- a 301 (permanent) http code is beneficial since browsers and intermediate caches can remember the mapping
- cached clients don't hit our servers on every repeat visit, which reduces load, so we go with 301 here
- We can add a cache in front of the database so we don't hit the DB on every redirect
Non-Functional:
1) Scalable
- Able to scale out as load increases and scale back in when it drops (elastic)
- Able to handle large amounts of traffic (~1M DAU)
2) Availability
- Aiming for highly available
3) Consistency
- We can accept eventual consistency
4) Durability
- We don't want to lose data that has been committed to the system
Capacity estimation
We should expect around 1 million daily active users.
- 100 redirects per day each
- each creates ~1 new link per day
For traffic:
- Redirects:
- 1e6 users x 100 redirects each means 1e8 redirects per day
- ~1e5 seconds per day, so 1e8 / 1e5 = 1e3, roughly 1,000 TPS for redirects
- very doable
- Creation:
- 1e6 daily active users x 1 new link each means 1e6 new links per day
- 1e6 / 1e5 is 1e1 which means ~10 TPS for creating new links
For throughput:
- Redirect:
- let's say we get 1,000 TPS, so throughput is 1,000 times (size of the slug + size of the real link)
- so let's say 1,000 TPS * 10KB ≈ 10MB per second, which is not bad
- Creation
- minimal
- 10 TPS * 10KB ≈ 100KB per second, well under 1MB per sec
Storage:
- 1e6 links per day, lets say 10KB per row
- 1e6 * 1e4 B ~ 1e10 B per day ~ 1e7KB per day ~ 1e4 MB per day ~10 GB per day
- ~300 GB per month, ~3.6 TB per year
- over 10 years that is roughly 36 TB; call it ~40 TB with some headroom
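As a sanity check on the arithmetic above, here is a small back-of-envelope script; the DAU, per-user rates, 10KB row size, and the rounded 1e5 seconds per day are the same assumptions used above:

```python
# Back-of-envelope capacity check for the numbers above.
DAU = 1_000_000            # daily active users (assumption above)
REDIRECTS_PER_USER = 100   # redirects per user per day
CREATES_PER_USER = 1       # new links per user per day
ROW_SIZE_BYTES = 10_000    # ~10KB per stored row (assumption above)
SECONDS_PER_DAY = 100_000  # rounded, ~1e5

redirects_per_day = DAU * REDIRECTS_PER_USER           # 1e8
redirect_tps = redirects_per_day / SECONDS_PER_DAY     # ~1,000 TPS
create_tps = DAU * CREATES_PER_USER / SECONDS_PER_DAY  # ~10 TPS

storage_per_day_gb = DAU * ROW_SIZE_BYTES / 1e9             # ~10 GB/day
storage_10_years_tb = storage_per_day_gb * 365 * 10 / 1e3   # ~36.5 TB

print(f"redirects: {redirect_tps:,.0f} TPS, creates: {create_tps:,.0f} TPS")
print(f"storage: {storage_per_day_gb:.0f} GB/day, ~{storage_10_years_tb:.0f} TB over 10 years")
```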
API design
generate_url
- parameters:
- API Key
- long_url
- returns
- short_url
- 200 HTTP code if able to generate
redirect_url
- parameters
- API Key
- short url
- returns:
- 301 as permanent redirect
- 404 if not present in our system
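The write-up above doesn't fix a framework, so as one way to make the API concrete, here is a minimal Flask sketch; the route paths, the in-memory store, and the placeholder slug function are illustrative assumptions (API-key validation is omitted):

```python
# Minimal API sketch, assuming Flask and an in-memory store.
# The store stands in for the sharded url_table described later;
# API-key validation is omitted for brevity.
import base64
import hashlib

from flask import Flask, abort, jsonify, redirect, request

app = Flask(__name__)
url_store = {}  # short slug -> long URL (placeholder for the database)

def make_slug(long_url: str) -> str:
    # Placeholder: md5 -> URL-safe Base64 -> first 8 chars (detailed later).
    digest = hashlib.md5(long_url.encode()).digest()
    return base64.urlsafe_b64encode(digest)[:8].decode()

@app.post("/shorten")            # generate_url
def generate_url():
    long_url = request.json["long_url"]
    slug = make_slug(long_url)
    url_store[slug] = long_url
    return jsonify({"short_url": f"https://tiny.example/{slug}"}), 200

@app.get("/<slug>")              # redirect_url
def redirect_url(slug: str):
    long_url = url_store.get(slug)
    if long_url is None:
        abort(404)               # not present in our system
    return redirect(long_url, code=301)  # permanent redirect
```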
Database design
Defining the data model early on clarifies how data will flow among the different components of the system.
user_table
- user_id - primary key
- creation_date
- user_name
url_table
- url_id - primary key
- creation_date
- user_id - foreign key
- short_url - unique (with an index), since redirect lookups go through this column
- long_url
- time_to_live
analytics_table
- short_url
- long_url
- user_agent
- user_ip
- referrer
- http_code
- success
- device_type
- network_type
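To make the relational part of the schema concrete, here is a sketch of the user and URL tables as DDL driven from Python's sqlite3 module; SQLite stands in for whatever relational engine is chosen, and the column types and TTL representation are assumptions (the analytics table lives in Cassandra, covered in the detailed design below):

```python
# Sketch of the relational tables, using sqlite3 as a stand-in engine.
# Column types and the TTL representation are illustrative assumptions.
import sqlite3

conn = sqlite3.connect("tinyurl.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS user_table (
    user_id       INTEGER PRIMARY KEY,
    creation_date TEXT NOT NULL,
    user_name     TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS url_table (
    url_id        INTEGER PRIMARY KEY,
    creation_date TEXT NOT NULL,
    user_id       INTEGER NOT NULL REFERENCES user_table(user_id),
    short_url     TEXT NOT NULL UNIQUE,  -- unique constraint drives collision retries
    long_url      TEXT NOT NULL,
    time_to_live  INTEGER                -- e.g. seconds until the link expires
);
""")
conn.commit()
```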
High-level design
Components
- Load balancer
- Redirect Service
- Creation Service
- Database
- Users
- URLs
- Analytics
- Caches (for the read flow)
Request flows
Redirect flow:
1) the user's request hits the load balancer first, which decides which backend server handles it
2) we are routed to the redirect service
3) now we have the request in the redirect service
4) we look up the short URL key in the cache
5) if it's not in the cache, we query the database for the long URL
6) if it is found, reply to the HTTP request with a 301 and the long URL (otherwise a 404)
7) when the answer came from the DB, update the cache with the key-value pair short_url: long_url
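Here is a minimal sketch of steps 4-7, assuming the redis-py client for the cache and a small in-memory dict standing in for the read-replica query; the TTL and key layout are illustrative assumptions:

```python
# Cache-aside lookup for the redirect path (steps 4-7 above).
# Assumes redis-py; FAKE_DB stands in for a read-replica query on url_table.
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
FAKE_DB = {"abc12345": "https://example.com/some/very/long/path"}

def db_lookup(short_url):
    # Placeholder for: SELECT long_url FROM url_table WHERE short_url = ?
    return FAKE_DB.get(short_url)

def resolve(short_url):
    long_url = cache.get(short_url)               # step 4: try the cache first
    if long_url is not None:
        return long_url
    long_url = db_lookup(short_url)               # step 5: fall back to the DB
    if long_url is None:
        return None                               # caller replies with a 404
    cache.set(short_url, long_url, ex=24 * 3600)  # step 7: populate the cache (1-day TTL)
    return long_url                               # caller replies with a 301
```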
Creation flow:
1) user hits load balancer
2) we are sent to the creation service
3) here we generate a short url for the long url
4) let's say we hash the long URL (e.g. md5), then encode it with a URL-safe alphabet (Base62, or URL-safe Base64) so the slug contains only URL-safe characters, and truncate to 8 characters if it's longer than that
5) we will try to insert this into the database; if the insert fails because of the unique constraint on the short URL, we retry with a new slug (see the sketch below)
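And a minimal sketch of steps 4-5, assuming md5 plus URL-safe Base64 for the slug and an in-memory dict standing in for the INSERT guarded by the unique constraint:

```python
# Slug generation with collision retry (steps 4-5 above).
# md5 + URL-safe Base64 and the in-memory _store are illustrative assumptions.
import base64
import hashlib

SLUG_LEN = 8
_store = {}  # stand-in for url_table with a UNIQUE constraint on short_url

def make_slug(long_url, attempt=0):
    # Salt with the attempt number so a retry produces a different slug.
    digest = hashlib.md5(f"{long_url}:{attempt}".encode()).digest()
    return base64.urlsafe_b64encode(digest)[:SLUG_LEN].decode()

def try_insert(slug, long_url):
    # Placeholder for the INSERT; returns False if the slug is already taken.
    if slug in _store and _store[slug] != long_url:
        return False
    _store[slug] = long_url
    return True

def shorten(long_url, max_attempts=5):
    for attempt in range(max_attempts):
        slug = make_slug(long_url, attempt)
        if try_insert(slug, long_url):  # step 5: retry on collision
            return slug
    raise RuntimeError("could not find a free slug")
```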
Detailed component design
Bottlenecks will come from the read path of our system, meaning the cache and the database will absorb the bulk of the traffic.
Both services are stateless, so it is easy to scale them by adding more servers.
1) Database consideration
- We will use a relational db for our users and url tables
- we can shard the database using the hash of the short url as the shard key
- each shard will have a primary and two or three read replicas (secondaries)
- the redirect service will target these read replicas
- so the read load is spread across all these nodes instead of landing on a single node
- for our analytics table we will use a database that can handle a lot of writes, like Cassandra
- we choose Cassandra since it can absorb huge write volumes
- writes are cheap because it appends to a commit log (a WAL) and buffers rows in memtables that are flushed to SSTables (an LSM-tree structure)
- i.e. it is optimized for writes (a small write sketch follows below)
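As a rough illustration of the analytics write path, here is a hedged sketch using the Python cassandra-driver; the keyspace name, the event_time clustering column, and the partitioning choice are assumptions, not part of the original design:

```python
# Sketch of an analytics write, assuming the Python cassandra-driver.
# Keyspace/table names, event_time column, and partitioning are assumptions.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS tinyurl
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS tinyurl.analytics_table (
        short_url    text,
        event_time   timestamp,
        long_url     text,
        user_agent   text,
        user_ip      text,
        referrer     text,
        http_code    int,
        success      boolean,
        device_type  text,
        network_type text,
        PRIMARY KEY ((short_url), event_time)  -- partition by slug, cluster by time
    )
""")

session.execute(
    "INSERT INTO tinyurl.analytics_table "
    "(short_url, event_time, long_url, user_agent, user_ip, referrer, "
    " http_code, success, device_type, network_type) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
    ("abc12345", datetime.now(timezone.utc), "https://example.com/long/path",
     "Mozilla/5.0", "203.0.113.7", "https://news.example", 301, True,
     "mobile", "wifi"),
)
```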
2) Cache consideration
- in memory cache like Redis for the redirect service
- we will deploy Redis in a distributed manner
- we will have multiple nodes, splitting keys using consistent hashing (sketched right after this sub-list)
- use refresh-ahead so hot keys are refreshed before they expire
- in the Creation Service we write through, so the cache is updated right away
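Here is the consistent-hashing sketch referenced above: a minimal hash ring with virtual nodes that maps a short URL to a cache node. The node names and vnode count are illustrative assumptions:

```python
# Minimal consistent-hash ring with virtual nodes, as used to split cache keys.
# Node names and the vnode count are illustrative assumptions.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # list of (hash, node), sorted by hash
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        """Return the node owning `key`: first vnode clockwise from its hash."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
print(ring.get_node("abc12345"))  # which cache node holds this short URL
```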
3) Load balancing
- A few load balancing algorithms are available: round robin, least connections, least load, least response time
- we choose round robin for simplicity
Trade offs/Tech choices
1) We use Cassandra as our analytics database since it can handle heavy write traffic. Being write-optimized (LSM-based), it suits our append-heavy analytics better than a relational database, whose B-tree indexes make writes comparatively slower.
2) Since we shard, we have to keep in mind that cross-shard joins are hard. Since we don't really need to join any data, sharding is a good choice for us.
3) Since we use read replicas, we need to keep replication lag in mind. Because we update the cache on creation and we only promised eventual consistency, the lag should be covered.
4) Since we have many nodes/shards/replicas, we need a way to manage this system. A tool like Kubernetes can be used to make sure our nodes are up and running.
Failure scenarios/bottlenecks
1) The load balancer is a bottleneck and could fail. We will run a passive standby load balancer; traffic fails over to it if the main load balancer goes down.
2) Nodes in the database and caches could fail. Since we are using Kubernetes, we rely on it to bring them back up.
3) Servers running the services could fail. Again, we rely on Kubernetes (or other orchestration software) to restart the services.
4) We rely on Kubernetes itself, and it could fail too, so its control plane also needs to be run in a highly available way.
5) When a node in the Redis cluster fails, we could see a spike in cache misses. Since we use consistent hashing with virtual nodes, only the keys owned by the failed node are affected, and they are redistributed across the remaining nodes.
6) Some database nodes could get too much traffic/data if the sharding is not done well - a hot spot. We have to monitor shard load so we can improve the key distribution; splitting the hot shard should be enough to relieve the issue temporarily.
Future improvements
1) Adding more availability zones/regions (e.g. east and west coast) and data centres on multiple continents to get closer to users. This reduces the load on any single region since traffic is balanced among them, and it also improves availability: if a data centre or availability zone goes down, we can redirect traffic to another region.
2) Improved algorithms for generating shortened URLs. Currently we hash the long URL ourselves, so the slugs carry no ordering information and it is hard to order them by creation time. We could instead use a Snowflake-style ID, which encodes timestamp (plus machine and sequence) bits into the generated value, giving us roughly time-ordered IDs (see the sketch at the end of this section).
3) Using Terraform to manage the infrastructure. With so many moving parts, infrastructure-as-code makes the system easier to manage and reproduce.
4) Using Redis Sentinel (or Redis Cluster's built-in failover) to manage our cache nodes can help us handle failure events.
5) We also need to discuss how our databases handle partition and failover events: in a primary/secondary setup, how a secondary is promoted to primary, what voting mechanism picks the new primary, and how data is replicated among nodes, so that we can recover gracefully from backups/replicas.
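As a sketch of the Snowflake-style idea from point 2 above, here is a minimal 64-bit ID generator; the custom epoch, bit widths, and machine ID are assumptions, and a production version would also need to handle clock skew:

```python
# Minimal Snowflake-style ID generator: 41 timestamp bits, 10 machine bits,
# 12 sequence bits. Epoch, bit widths, and machine_id are illustrative choices.
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch (assumption)

class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        self.machine_id = machine_id & 0x3FF  # 10 bits
        self.sequence = 0
        self.last_ms = -1

    def next_id(self) -> int:
        now_ms = int(time.time() * 1000)
        if now_ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits
            if self.sequence == 0:                       # sequence exhausted this ms
                while now_ms <= self.last_ms:            # spin until the next ms
                    now_ms = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return ((now_ms - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=1)
print(gen.next_id())  # roughly time-ordered; could be Base62-encoded into a slug
```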