My Solution for Designing a Simple URL Shortening Service: A TinyURL Approach
by luminous_vibe
System requirements
Functional:
1) Shorten URL
- 8 characters long
- each character is hex, so 4 bits per character (16^8 ≈ 4.3 billion possible slugs)
- how to do this?
- first approach will be hashing the link provided
- md5, sha256 are hashing algorithms we can use
- md5 is faster but weaker, so collisions on the truncated slug are more likely
- to handle collisions we retry (e.g. re-hash with a salt or counter appended) until we find a free slug
2) Redirect Shortened URL
- Redirect to the long URL if the short URL exists in our system
- Error if not
- Permanent vs. temporary redirect:
- a 301 (permanent) http code is beneficial since browsers and intermediate caches can remember the mapping
- cached clients don't hit our servers on every repeat visit, which reduces load, so we go with 301 here
- We can add a cache in front of the database so we don't hit the DB on every redirect
Non-Functional:
1) Scalable
- Able to scale out as load increases and scale back in when it drops (elastic)
- Able to handle large amounts of traffic (~1M DAU)
2) Availability
- Aiming for highly available
3) Consistency
- We can accept eventual consistency
4) Durability
- We don't want to lose data that has been committed to the system
Capacity estimation
We should expect around 1 million daily active users.
- 100 redirects per day each
- each creates ~1 new link per day
For traffic:
- Redirects:
- 1e6 users x 100 redirects each means 1e8 redirects per day
- ~1e5 seconds per day, so 1e8 / 1e5 = 1e3, roughly 1,000 TPS for redirects
- very doable
- Creation:
- 1e6 daily active users x 1 new link each means 1e6 new links per day
- 1e6 / 1e5 is 1e1 which means ~10 TPS for creating new links
For throughput:
- Redirect:
- let's say we get 1,000 TPS, so throughput is 1,000 times (size of the slug + size of the real link)
- so let's say 1,000 TPS * 10KB ≈ 10MB per second, which is not bad
- Creation
- minimal
- 10 TPS * 10KB ≈ 100KB per second, well under 1MB per sec
Storage:
- 1e6 links per day, lets say 10KB per row
- 1e6 * 1e4 B ~ 1e10 B per day ~ 1e7KB per day ~ 1e4 MB per day ~10 GB per day
- ~300 GB per month, ~3.6 TB per year
- over 10 years that is roughly 36 TB; call it ~40 TB with some headroom
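As a sanity check on the arithmetic above, here is a small back-of-envelope script; the DAU, per-user rates, 10KB row size, and the rounded 1e5 seconds per day are the same assumptions used above:

```python
# Back-of-envelope capacity check for the numbers above.
DAU = 1_000_000            # daily active users (assumption above)
REDIRECTS_PER_USER = 100   # redirects per user per day
CREATES_PER_USER = 1       # new links per user per day
ROW_SIZE_BYTES = 10_000    # ~10KB per stored row (assumption above)
SECONDS_PER_DAY = 100_000  # rounded, ~1e5

redirects_per_day = DAU * REDIRECTS_PER_USER           # 1e8
redirect_tps = redirects_per_day / SECONDS_PER_DAY     # ~1,000 TPS
create_tps = DAU * CREATES_PER_USER / SECONDS_PER_DAY  # ~10 TPS

storage_per_day_gb = DAU * ROW_SIZE_BYTES / 1e9             # ~10 GB/day
storage_10_years_tb = storage_per_day_gb * 365 * 10 / 1e3   # ~36.5 TB

print(f"redirects: {redirect_tps:,.0f} TPS, creates: {create_tps:,.0f} TPS")
print(f"storage: {storage_per_day_gb:.0f} GB/day, ~{storage_10_years_tb:.0f} TB over 10 years")
```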
API design
generate_url
- parameters:
- API Key
- long_url
- returns
- short_url
- 200 HTTP code if able to generate
redirect_url
- parameters
- API Key
- short url
- returns:
- 301 as permanent redirect
- 404 if not present in our system
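The write-up above doesn't fix a framework, so as one way to make the API concrete, here is a minimal Flask sketch; the route paths, the in-memory store, and the placeholder slug function are illustrative assumptions (API-key validation is omitted):

```python
# Minimal API sketch, assuming Flask and an in-memory store.
# The store stands in for the sharded url_table described later;
# API-key validation is omitted for brevity.
import base64
import hashlib

from flask import Flask, abort, jsonify, redirect, request

app = Flask(__name__)
url_store = {}  # short slug -> long URL (placeholder for the database)

def make_slug(long_url: str) -> str:
    # Placeholder: md5 -> URL-safe Base64 -> first 8 chars (detailed later).
    digest = hashlib.md5(long_url.encode()).digest()
    return base64.urlsafe_b64encode(digest)[:8].decode()

@app.post("/shorten")            # generate_url
def generate_url():
    long_url = request.json["long_url"]
    slug = make_slug(long_url)
    url_store[slug] = long_url
    return jsonify({"short_url": f"https://tiny.example/{slug}"}), 200

@app.get("/<slug>")              # redirect_url
def redirect_url(slug: str):
    long_url = url_store.get(slug)
    if long_url is None:
        abort(404)               # not present in our system
    return redirect(long_url, code=301)  # permanent redirect
```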
Database design
Defining the data model early on clarifies how data will flow among the different components of the system.
user_table
- user_id - primary key
- creation_date
- user_name
url_table
- url_id - primary key
- creation_date
- user_id - foreign key
- short_url - unique (with an index), since redirect lookups go through this column
- long_url
- time_to_live
analytics_table
- short_url
- long_url
- user_agent
- user_ip
- referrer
- http_code
- success
- device_type
- network_type
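To make the relational part of the schema concrete, here is a sketch of the user and URL tables as DDL driven from Python's sqlite3 module; SQLite stands in for whatever relational engine is chosen, and the column types and TTL representation are assumptions (the analytics table lives in Cassandra, covered in the detailed design below):

```python
# Sketch of the relational tables, using sqlite3 as a stand-in engine.
# Column types and the TTL representation are illustrative assumptions.
import sqlite3

conn = sqlite3.connect("tinyurl.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS user_table (
    user_id       INTEGER PRIMARY KEY,
    creation_date TEXT NOT NULL,
    user_name     TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS url_table (
    url_id        INTEGER PRIMARY KEY,
    creation_date TEXT NOT NULL,
    user_id       INTEGER NOT NULL REFERENCES user_table(user_id),
    short_url     TEXT NOT NULL UNIQUE,  -- unique constraint drives collision retries
    long_url      TEXT NOT NULL,
    time_to_live  INTEGER                -- e.g. seconds until the link expires
);
""")
conn.commit()
```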
High-level design
Components
- Load balancer
- Redirect Service
- Creation Service
- Database
- Users
- URLs
- Analytics
- Caches (for the read flow)
Request flows
Redirect flow:
1) the user's request hits the load balancer first, which decides which backend server handles it
2) we are routed to the redirect service
3) now we have the request in the redirect service
4) we look up the short URL key in the cache
5) if it's not in the cache, we query the database for the long URL
6) if it is found, reply to the HTTP request with a 301 and the long URL (otherwise a 404)
7) when the answer came from the DB, update the cache with the key-value pair short_url: long_url
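Here is a minimal sketch of steps 4-7, assuming the redis-py client for the cache and a small in-memory dict standing in for the read-replica query; the TTL and key layout are illustrative assumptions:

```python
# Cache-aside lookup for the redirect path (steps 4-7 above).
# Assumes redis-py; FAKE_DB stands in for a read-replica query on url_table.
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
FAKE_DB = {"abc12345": "https://example.com/some/very/long/path"}

def db_lookup(short_url):
    # Placeholder for: SELECT long_url FROM url_table WHERE short_url = ?
    return FAKE_DB.get(short_url)

def resolve(short_url):
    long_url = cache.get(short_url)               # step 4: try the cache first
    if long_url is not None:
        return long_url
    long_url = db_lookup(short_url)               # step 5: fall back to the DB
    if long_url is None:
        return None                               # caller replies with a 404
    cache.set(short_url, long_url, ex=24 * 3600)  # step 7: populate the cache (1-day TTL)
    return long_url                               # caller replies with a 301
```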
Creation flow:
1) user hits load balancer
2) we are sent to the creation service
3) here we generate a short url for the long url
4) let's say we hash the long URL (e.g. md5), then encode it with a URL-safe alphabet (Base62, or URL-safe Base64) so the slug contains only URL-safe characters, and truncate to 8 characters if it's longer than that
5) we will try to insert this into the database; if the insert fails because of the unique constraint on the short URL, we retry with a new slug (see the sketch below)
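And a minimal sketch of steps 4-5, assuming md5 plus URL-safe Base64 for the slug and an in-memory dict standing in for the INSERT guarded by the unique constraint:

```python
# Slug generation with collision retry (steps 4-5 above).
# md5 + URL-safe Base64 and the in-memory _store are illustrative assumptions.
import base64
import hashlib

SLUG_LEN = 8
_store = {}  # stand-in for url_table with a UNIQUE constraint on short_url

def make_slug(long_url, attempt=0):
    # Salt with the attempt number so a retry produces a different slug.
    digest = hashlib.md5(f"{long_url}:{attempt}".encode()).digest()
    return base64.urlsafe_b64encode(digest)[:SLUG_LEN].decode()

def try_insert(slug, long_url):
    # Placeholder for the INSERT; returns False if the slug is already taken.
    if slug in _store and _store[slug] != long_url:
        return False
    _store[slug] = long_url
    return True

def shorten(long_url, max_attempts=5):
    for attempt in range(max_attempts):
        slug = make_slug(long_url, attempt)
        if try_insert(slug, long_url):  # step 5: retry on collision
            return slug
    raise RuntimeError("could not find a free slug")
```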
Detailed component design
Bottlenecks will come from the read path of our system, meaning the cache and the database will absorb the bulk of the traffic.
Both services are stateless, so it is easy to scale them by adding more servers.
1) Database consideration
- We will use a relational db for our users and url tables
- we can shard the database using the hash of the short url as the shard key
- each shard will have a primary and two or three read replicas (secondaries)
- the redirect service will target these read replicas
- so the read load is spread across all these nodes instead of landing on a single node
- for our analytics table we will use a database that can handle a lot of writes, like Cassandra
- we choose Cassandra since it can absorb huge write volumes
- writes are cheap because it appends to a commit log (a WAL) and buffers rows in memtables that are flushed to SSTables (an LSM-tree structure)
- i.e. it is optimized for writes (a small write sketch follows below)
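As a rough illustration of the analytics write path, here is a hedged sketch using the Python cassandra-driver; the keyspace name, the event_time clustering column, and the partitioning choice are assumptions, not part of the original design:

```python
# Sketch of an analytics write, assuming the Python cassandra-driver.
# Keyspace/table names, event_time column, and partitioning are assumptions.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS tinyurl
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS tinyurl.analytics_table (
        short_url    text,
        event_time   timestamp,
        long_url     text,
        user_agent   text,
        user_ip      text,
        referrer     text,
        http_code    int,
        success      boolean,
        device_type  text,
        network_type text,
        PRIMARY KEY ((short_url), event_time)  -- partition by slug, cluster by time
    )
""")

session.execute(
    "INSERT INTO tinyurl.analytics_table "
    "(short_url, event_time, long_url, user_agent, user_ip, referrer, "
    " http_code, success, device_type, network_type) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
    ("abc12345", datetime.now(timezone.utc), "https://example.com/long/path",
     "Mozilla/5.0", "203.0.113.7", "https://news.example", 301, True,
     "mobile", "wifi"),
)
```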
2) Cache consideration
- in memory cache like Redis for the redirect service
- we will deploy Redis in a distributed manner
- we will have multiple nodes, splitting keys using consistent hashing (sketched right after this sub-list)
- use refresh-ahead so hot keys are refreshed before they expire
- in the Creation Service we write through, so the cache is updated right away
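Here is the consistent-hashing sketch referenced above: a minimal hash ring with virtual nodes that maps a short URL to a cache node. The node names and vnode count are illustrative assumptions:

```python
# Minimal consistent-hash ring with virtual nodes, as used to split cache keys.
# Node names and the vnode count are illustrative assumptions.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []  # list of (hash, node), sorted by hash
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        """Return the node owning `key`: first vnode clockwise from its hash."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
print(ring.get_node("abc12345"))  # which cache node holds this short URL
```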
3) Load balancing
- A few load balancing algorithms are available: round robin, least connections, least load, least response time
- we choose round robin for simplicity
Trade offs/Tech choices
1) We use Cassandra as our analytics database since it can handle heavy write traffic. Being write-optimized (LSM-based), it suits our append-heavy analytics better than a relational database, whose B-tree indexes make writes comparatively slower.
2) Since we shard, we have to keep in mind that cross-shard joins are hard. Since we don't really need to join any data, sharding is a good choice for us.
3) Since we use read replicas, we need to keep replication lag in mind. Because we update the cache on creation and we only promised eventual consistency, the lag should be covered.
4) Since we have many nodes/shards/replicas, we need a way to manage this system. A tool like Kubernetes can be used to make sure our nodes are up and running.
Failure scenarios/bottlenecks
1) The load balancer is a bottleneck and could fail. We will run a passive standby load balancer; traffic fails over to it if the main load balancer goes down.
2) Nodes in the database and caches could fail. Since we are using Kubernetes, we rely on it to bring them back up.
3) Servers running the services could fail. Again, we rely on Kubernetes (or other orchestration software) to restart the services.
4) We rely on Kubernetes itself, and it could fail too, so its control plane also needs to be run in a highly available way.
5) When a node in the Redis cluster fails, we could see a spike in cache misses. Since we use consistent hashing with virtual nodes, only the keys owned by the failed node are affected, and they are redistributed across the remaining nodes.
6) Some database nodes could get too much traffic/data if the sharding is not done well - a hot spot. We have to monitor shard load so we can improve the key distribution; splitting the hot shard should be enough to relieve the issue temporarily.
Future improvements
1) Adding more availability zones/regions (e.g. east and west coast) and data centres on multiple continents to get closer to users. This reduces the load on any single region since traffic is balanced among them, and it also improves availability: if a data centre or availability zone goes down, we can redirect traffic to another region.
2) Improved algorithms for generating shortened URLs. Currently we hash the long URL ourselves, so the slugs carry no ordering information and it is hard to order them by creation time. We could instead use a Snowflake-style ID, which encodes timestamp (plus machine and sequence) bits into the generated value, giving us roughly time-ordered IDs (see the sketch at the end of this section).
3) Using Terraform to manage the infrastructure. With so many moving parts, infrastructure-as-code makes the system easier to manage and reproduce.
4) Using Redis Sentinel (or Redis Cluster's built-in failover) to manage our cache nodes can help us handle failure events.
5) We also need to discuss how our databases handle partition and failover events: in a primary/secondary setup, how a secondary is promoted to primary, what voting mechanism picks the new primary, and how data is replicated among nodes, so that we can recover gracefully from backups/replicas.
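As a sketch of the Snowflake-style idea from point 2 above, here is a minimal 64-bit ID generator; the custom epoch, bit widths, and machine ID are assumptions, and a production version would also need to handle clock skew:

```python
# Minimal Snowflake-style ID generator: 41 timestamp bits, 10 machine bits,
# 12 sequence bits. Epoch, bit widths, and machine_id are illustrative choices.
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch (assumption)

class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        self.machine_id = machine_id & 0x3FF  # 10 bits
        self.sequence = 0
        self.last_ms = -1

    def next_id(self) -> int:
        now_ms = int(time.time() * 1000)
        if now_ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits
            if self.sequence == 0:                       # sequence exhausted this ms
                while now_ms <= self.last_ms:            # spin until the next ms
                    now_ms = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return ((now_ms - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeGenerator(machine_id=1)
print(gen.next_id())  # roughly time-ordered; could be Base62-encoded into a slug
```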