System requirements


Functional:

  • Given a URL, shorten it to a short one.
  • The shorten URL will direct users to the correct site.
  • The shortened URL expires after a year. The system saves the url mappings up to a year.
  • The expiration period of one year for shortened URLs could be parameterized to allow for customization by users or for different use cases.
  • we can have a background thread for cleaning up expired URLs in database, each url has a expiry timestamp, the thread periodically scans all records and delete expired items in bulk during off hours.
  • User can provide a customized alias for their urls if they want to.
  • the alias should have a length limit to be consistent with system generated ones.
  • for customized alias, if there is a conflict with system generated ones, we should let user pick another alias.


Non-Functional:

  • Reliable, same URL should generate same shortened one.
  • fault-tolerant, the system should be highly available to tolerate downtimes, can be achieved through redundancy, replication, database replication such as single leader or multi leader or leaderless replication.
  • highly scalable to handle more requests
  • low latency, system should be able to return the shortened url quickly. Replication and partitioning should help.
  • logging and monitoring of system metrics, CPU, memory utilization.
  • url mappings should be kept in a secure manner, both long and short URL should be hashed for security reasons.
  • data analytics on how frequently short URLs are being accessed, and user locations so we can redirect short urls to locations that are closer to end user.



Capacity estimation

1 million URLs per day.

each URL is about 100bytes.

1million * 100bytes = 100MB per day

shortened URL is about 10bytes

1million * 10bytes = 10MB

timestamp is 4bytes

metadata is about 100bytes

200Bytes * 1million = 200MB per day

200MB * 400 = 80GB per year


peak traffic : 5x, 400GB per year


API design

 POST /api/shorten

request body {"longURL":"www.example.com", "alias": "my_alias"}

using hash algorithm sha-256 to shorten the url

Response: { "shortURL": "https://short.url/abc123" }


if provided alias has been seen in the db, return 400 invalid request and ask user to provide new one.



GET /originURL?shortURL="https://short.url/abc123"

Response: { "originURL": "www.example.com" }


redirectUser(shortURL)

GET /r/abc123

fetch originURL and do a 301 redirect




Database design

Schema:

shortURL, longURL, alias, expiry_timestamp, creation_timestamp, clicked_count


key value store like redis with TTL as expiry timestamp.

shortURL is primary key.



use SET data structure to store all the aliases.


once we invalidate an expired entry, we also remove the alias from the set.


other options:

do we need to optimize for read or write?

read can be optimized by using a cache.

for write throughput, we can use something like Cassandra, leaderless replication, fast writes. LSM tree structure






High-level design

URL Shortening Service

User enters longURL, the request gets distributed through a load balancer to one of the URL Shortening Service, the service will use a SHA-256 hashing algorithm to hash the longURL into a short one, if the user provided an alias, we would check against our Set to see if it has been used.

If not, we save it into our DB.




If we have a lot of traffic, it is better to use a message broker in between load balancer and the url shortening service, so that we don't overload our servers, the servers can process the requests at their own speed.


What type of message broker should we use?

option 1: in memory message broker like RabbitMQ, no order guarantee. Not as durable, more throughput (fanout), multiple competing consumers trying to process the messages.


option 2: LSM based , kafka, durable, replayability, order guarantee.


I think Kafka would work here if we want to process the requests in order. We can partition the messages based on hash value of longURL, have one consumer per partition.



URL retrieval service

user enters the shortURL in browser, URL retrieval service first gets the longURL by checking the cache, if not exist in cache, get it from the db directly then update the cache.


once we retrieved the longURL, we do a redirect to the origin. We can put our longURL in CDN so it is closer to the user.





Request flows

Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...






Detailed component design

Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...


for cache, a distributed key-value store like Redis should work well, the key is the shortURL, we can also utilize the Set data structure in Redis to keep track of unique alias. We can also set TTL for each record to help us expire the urls easily.

Redis is distributed, replicated, and scalable.

We can use consistent hashing to make sure same requests always go to the same Redis partitions.


Message queue

Kafka uses LSM tree and SSTables under the hood, which means the writes are fast, since we write to memory first and then disk.

The data is persisted so it provides durability. Consumers are aware of their last pulled offsets, so if for some reason the message did not get processed successfully, consumers can process them again by pulling from the last offset. The operations we perform (shorten the url) is idempotent, meaning no matter how many times we process this message, the result remains the same.

Kafka also guarantees order, this will make sure requests that came in first will be processed first.

We can partition Kafka by hash range value of longURLs , and have one consumer group per partition for higher throughput.


Database

Since the URL shortening service writes to the db directly, if there are lots of requests coming in, a database that can provide higher write throughput should help.

  • relational db : this is not structured data, plus writes are slower in relational db given the b-tree index. so not a good fit.
  • nosql db: faster writes due to LSM tree, something like Cassandra should work.
  • we can also store longURLs in object storage like Amazon S3.


Trade offs/Tech choices

Explain any trade offs you have made and why you made certain tech choices...

caching strategies

  • cache aside: read and write are separate, easy for client to implement. But cache can be stale (solve with TTL), round trip to update cache if cache miss. can use LRU for eviction.
  • write through: synchronous process, cache is most up to date, but there can be delays with writing into db. data in cache may not be frequently accessed at all.
  • write back: asynchronous, data might be lost when saving into db.



Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.


Data loss.

Security vulnerabilities

system overload

single point of failure



Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?

database replication

load balancer

rate limiting

distributed caching

horizontal scaling