System requirements
Functional:
- Given a URL, shorten it to a short one.
- The shorten URL will direct users to the correct site.
- The shortened URL expires after a year. The system saves the url mappings up to a year.
- The expiration period of one year for shortened URLs could be parameterized to allow for customization by users or for different use cases.
- we can have a background thread for cleaning up expired URLs in database, each url has a expiry timestamp, the thread periodically scans all records and delete expired items in bulk during off hours.
- User can provide a customized alias for their urls if they want to.
- the alias should have a length limit to be consistent with system generated ones.
- for customized alias, if there is a conflict with system generated ones, we should let user pick another alias.
Non-Functional:
- Reliable, same URL should generate same shortened one.
- fault-tolerant, the system should be highly available to tolerate downtimes, can be achieved through redundancy, replication, database replication such as single leader or multi leader or leaderless replication.
- highly scalable to handle more requests
- low latency, system should be able to return the shortened url quickly. Replication and partitioning should help.
- logging and monitoring of system metrics, CPU, memory utilization.
- url mappings should be kept in a secure manner, both long and short URL should be hashed for security reasons.
- data analytics on how frequently short URLs are being accessed, and user locations so we can redirect short urls to locations that are closer to end user.
Capacity estimation
1 million URLs per day.
each URL is about 100bytes.
1million * 100bytes = 100MB per day
shortened URL is about 10bytes
1million * 10bytes = 10MB
timestamp is 4bytes
metadata is about 100bytes
200Bytes * 1million = 200MB per day
200MB * 400 = 80GB per year
peak traffic : 5x, 400GB per year
API design
POST /api/shorten
request body {"longURL":"www.example.com", "alias": "my_alias"}
using hash algorithm sha-256 to shorten the url
Response: { "shortURL": "https://short.url/abc123" }
if provided alias has been seen in the db, return 400 invalid request and ask user to provide new one.
GET /originURL?shortURL="https://short.url/abc123"
Response: { "originURL": "www.example.com" }
redirectUser(shortURL)
GET /r/abc123
fetch originURL and do a 301 redirect
Database design
Schema:
shortURL, longURL, alias, expiry_timestamp, creation_timestamp, clicked_count
key value store like redis with TTL as expiry timestamp.
shortURL is primary key.
use SET data structure to store all the aliases.
once we invalidate an expired entry, we also remove the alias from the set.
other options:
do we need to optimize for read or write?
read can be optimized by using a cache.
for write throughput, we can use something like Cassandra, leaderless replication, fast writes. LSM tree structure
High-level design
URL Shortening Service
User enters longURL, the request gets distributed through a load balancer to one of the URL Shortening Service, the service will use a SHA-256 hashing algorithm to hash the longURL into a short one, if the user provided an alias, we would check against our Set to see if it has been used.
If not, we save it into our DB.
If we have a lot of traffic, it is better to use a message broker in between load balancer and the url shortening service, so that we don't overload our servers, the servers can process the requests at their own speed.
What type of message broker should we use?
option 1: in memory message broker like RabbitMQ, no order guarantee. Not as durable, more throughput (fanout), multiple competing consumers trying to process the messages.
option 2: LSM based , kafka, durable, replayability, order guarantee.
I think Kafka would work here if we want to process the requests in order. We can partition the messages based on hash value of longURL, have one consumer per partition.
URL retrieval service
user enters the shortURL in browser, URL retrieval service first gets the longURL by checking the cache, if not exist in cache, get it from the db directly then update the cache.
once we retrieved the longURL, we do a redirect to the origin. We can put our longURL in CDN so it is closer to the user.
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
for cache, a distributed key-value store like Redis should work well, the key is the shortURL, we can also utilize the Set data structure in Redis to keep track of unique alias. We can also set TTL for each record to help us expire the urls easily.
Redis is distributed, replicated, and scalable.
We can use consistent hashing to make sure same requests always go to the same Redis partitions.
Message queue
Kafka uses LSM tree and SSTables under the hood, which means the writes are fast, since we write to memory first and then disk.
The data is persisted so it provides durability. Consumers are aware of their last pulled offsets, so if for some reason the message did not get processed successfully, consumers can process them again by pulling from the last offset. The operations we perform (shorten the url) is idempotent, meaning no matter how many times we process this message, the result remains the same.
Kafka also guarantees order, this will make sure requests that came in first will be processed first.
We can partition Kafka by hash range value of longURLs , and have one consumer group per partition for higher throughput.
Database
Since the URL shortening service writes to the db directly, if there are lots of requests coming in, a database that can provide higher write throughput should help.
- relational db : this is not structured data, plus writes are slower in relational db given the b-tree index. so not a good fit.
- nosql db: faster writes due to LSM tree, something like Cassandra should work.
- we can also store longURLs in object storage like Amazon S3.
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
caching strategies
- cache aside: read and write are separate, easy for client to implement. But cache can be stale (solve with TTL), round trip to update cache if cache miss. can use LRU for eviction.
- write through: synchronous process, cache is most up to date, but there can be delays with writing into db. data in cache may not be frequently accessed at all.
- write back: asynchronous, data might be lost when saving into db.
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Data loss.
Security vulnerabilities
system overload
single point of failure
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?
database replication
load balancer
rate limiting
distributed caching
horizontal scaling