System requirements


Functional:

Each short URL should be unique

Given a long URL, the system should return a short URL

Given a short URL, the system should redirect to the original URL




Non-Functional:

Highly available

Looks like an AP system: availability matters more than consistency.

Count the number of times each URL is accessed




Capacity estimation

Per URL: 200 bytes, plus 100 bytes of user metadata

Replicas: 3

Total per URL with replication = 300 * 3 = 900, so ~1000 bytes per day

~1000 * 365 = ~400,000 bytes per year

A million URLs

10^6 * 4 * 10^5 = 4 * 10^11 bytes per year

5 years of storage = 20 * 10^11 = 2 * 10^12 bytes = 2 TB
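
The arithmetic above can be checked with a short script. This is only a sketch: it assumes "a million urls" means a million new URLs per day (the notes do not say so explicitly), and it skips the rounding of 900 to ~1000 and 365,000 to ~400,000, so it lands a little under the 2 TB figure.

```python
# Back-of-envelope storage estimate using the figures from the notes.
URL_BYTES = 200
METADATA_BYTES = 100
REPLICAS = 3
URLS_PER_DAY = 1_000_000  # assumption: "a million urls" read as new URLs per day
DAYS_PER_YEAR = 365
YEARS = 5

# 300 bytes per URL, stored three times.
bytes_per_url = (URL_BYTES + METADATA_BYTES) * REPLICAS
bytes_per_year = bytes_per_url * URLS_PER_DAY * DAYS_PER_YEAR
total_bytes = bytes_per_year * YEARS


def to_terabytes(num_bytes: int) -> float:
    """Convert bytes to decimal terabytes (10^12 bytes)."""
    return num_bytes / 10**12


if __name__ == "__main__":
    print(f"per URL (replicated): {bytes_per_url} bytes")
    print(f"per year: {bytes_per_year:.3e} bytes")
    print(f"{YEARS} years: {to_terabytes(total_bytes):.2f} TB")
```

Without rounding, the total is about 1.64 TB, so 2 TB is a reasonable upper bound for planning.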




API design



shortenUrl(longurl, userID, timestamp)

getUrl(shortUrl)
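
A minimal sketch of what these two calls might accept and return. The class and field names (`InMemoryUrlService`, `UrlRecord`, `access_count`) are hypothetical, a dict stands in for the database, and the short code is a placeholder rather than a real encoding. `get_url` also bumps a counter, since the requirements ask for access counts.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UrlRecord:
    long_url: str
    user_id: str
    created_at: float
    access_count: int = 0


class InMemoryUrlService:
    """Hypothetical stand-in for the real service; a dict plays the DB."""

    def __init__(self) -> None:
        self._records: Dict[str, UrlRecord] = {}
        self._next_id = 1

    def shorten_url(self, long_url: str, user_id: str, timestamp: float) -> str:
        """Store the long URL and return its short code."""
        short_url = f"u{self._next_id}"  # placeholder code, not a real encoding
        self._next_id += 1
        self._records[short_url] = UrlRecord(long_url, user_id, timestamp)
        return short_url

    def get_url(self, short_url: str) -> Optional[str]:
        """Return the long URL, or None if the short code is unknown."""
        record = self._records.get(short_url)
        if record is None:
            return None
        record.access_count += 1  # requirement: count accesses per URL
        return record.long_url

    def access_count(self, short_url: str) -> int:
        """Return how many times the short code has been resolved."""
        record = self._records.get(short_url)
        return 0 if record is None else record.access_count
```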





Database design



It's definitely a read-heavy workload.

We need a key-value store.

To optimize storage, we can store the short URL's domain in a separate column.

The issue with NoSQL, though, is that it might not be consistent and might cause conflicts, so we can either go with SQL or stay with NoSQL, assuming the shortening algorithm always gives a unique URL.



High-level design




Database - use MongoDB

Short URL generator - hashing algorithm plus a base 58 encoder

Load balancer

Cache

Replication

Rate limiting



Request flows



Overall, a shortening request first comes to the short URL generator, which shortens the URL using a sequencer ID and encodes that ID in base 58.
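
The encoding step can be sketched as below. The alphabet is the commonly used Bitcoin-style base 58 set (no `0`, `O`, `I`, or `l`); the notes do not say which alphabet is intended, so treat that choice as an assumption. The sequencer ID is taken to be a non-negative integer.

```python
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(number: int) -> str:
    """Encode a non-negative integer (e.g. a sequencer ID) in base 58."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return BASE58_ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, 58)
        digits.append(BASE58_ALPHABET[remainder])
    # Digits were produced least-significant first.
    return "".join(reversed(digits))


def base58_decode(code: str) -> int:
    """Decode a base 58 string back to the integer it represents."""
    number = 0
    for char in code:
        # str.index raises ValueError for characters outside the alphabet.
        number = number * 58 + BASE58_ALPHABET.index(char)
    return number
```

Because each sequencer ID is unique, the encoded short codes are unique too, with no collision check needed on this path.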


When a redirection request comes in, it first goes to the load balancer, which routes it to an API server. The API server searches the cache first and, if the short URL is not found there, the DB. Once it has the long URL, it redirects the client to it.
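
A minimal sketch of the lookup on the API server, with plain dicts standing in for the cache and the DB. The function names are hypothetical, and the 302 status is an assumption: a temporary redirect keeps clients coming back through the service, which helps with access counting, but the notes do not specify a status code.

```python
from typing import Dict, MutableMapping, Optional, Tuple


def lookup_long_url(
    short_code: str,
    cache: MutableMapping[str, str],
    db: MutableMapping[str, str],
) -> Optional[str]:
    """Check the cache first, then the DB; fill the cache on a DB hit."""
    long_url = cache.get(short_code)
    if long_url is not None:
        return long_url
    long_url = db.get(short_code)
    if long_url is not None:
        cache[short_code] = long_url  # so the next read skips the DB
    return long_url


def handle_redirect(
    short_code: str,
    cache: MutableMapping[str, str],
    db: MutableMapping[str, str],
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for a redirect request."""
    long_url = lookup_long_url(short_code, cache, db)
    if long_url is None:
        return 404, {}
    return 302, {"Location": long_url}
```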


Detailed component design



The main component to focus on here is the URL shortener.

When a shortening request comes in, the URL is first hashed and the hash is then encoded in base 58. Base 58 encoding is for readability, since it leaves out look-alike characters such as 0, O, I, and l. But hashing can cause collisions.

So we need a very strong hashing algorithm, such as MurmurHash or one of Google's hash functions, to make collisions unlikely.

Also, to avoid collisions, we can check the DB to see whether the hash already exists. This might slow us down, so we shard the DB with a range function and can also use Bloom filters.
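
A Bloom filter lets the shortener skip the DB check for most new codes: if the filter says a code is absent, it is definitely absent; if it says present, the DB still has to confirm. Below is a minimal sketch. The sizing (2^20 bits, 4 hash functions) and the use of salted SHA-256 to derive bit positions are illustrative assumptions, not tuned values.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Small Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self._bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means check the DB."""
        return all(
            self._bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

On the write path, a `might_contain` result of `False` means the code can be stored without a DB read; only a `True` result falls back to the slower lookup.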


Scaling the DB and caching are important here.


Trade offs/Tech choices


We used MongoDB because it is NoSQL, gives high throughput, and offers locking mechanisms to avoid collisions.


1. Algorithms and Data Structures for Unique Short URL Generation

  • UUID Generation: Instead of relying solely on hashing, using Version 4 UUIDs (random) can virtually eliminate collision risks due to their vast space. However, UUIDs are typically long; thus, a custom encoding to a shorter base, such as base-58 or base-62, is necessary to maintain the URL shortening service's purpose.
  • Consistent Hashing: Especially useful in distributed systems, consistent hashing can help in evenly distributing URLs across multiple nodes, reducing hotspots and improving load balancing. It also simplifies adding or removing nodes from the system without significant rehashing.
  • Distributed Unique ID Generation: Techniques such as Twitter’s Snowflake algorithm can generate unique IDs in a distributed environment without coordination between nodes, ensuring uniqueness and high availability. Snowflake IDs are composed of a timestamp, a node ID, and a sequence number, which together guarantee a unique identifier.
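
A Snowflake-style generator, as described in the last bullet, can be sketched as follows. The bit layout (41-bit timestamp, 10-bit node ID, 12-bit sequence) follows the commonly described Snowflake layout; the epoch value and the class name are assumptions made for this example.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # assumption: arbitrary custom epoch in ms
NODE_BITS = 10
SEQUENCE_BITS = 12
MAX_NODE_ID = (1 << NODE_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """IDs are unique per node as long as the clock does not go backwards."""

    def __init__(self, node_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id <= MAX_NODE_ID:
            raise ValueError("node_id out of range")
        self.node_id = node_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond; wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (NODE_BITS + SEQUENCE_BITS))
                | (self.node_id << SEQUENCE_BITS)
                | self._sequence
            )
```

The resulting integer can then be encoded in base 58 or base 62 to produce the short code. Each node needs a distinct `node_id`, which is the one piece of coordination this scheme still requires.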

2. Detailed Component Design and Scalability

  • Caching Strategies: Implement an LRU (Least Recently Used) cache to store frequently accessed URLs and their corresponding short codes. For distributed caching, consider using Redis or Memcached, which can handle high read/write speeds and scale horizontally.
  • Load Balancing Mechanisms: Utilize a combination of DNS round-robin and dynamic load balancers (like NGINX or HAProxy) that can monitor the health and traffic load of servers and distribute requests accordingly. Implementing SSL termination at the load balancer level can also offload encryption tasks from web servers, improving performance.
  • Replication Techniques: For database replication, use a master-slave configuration where writes are directed to the master database and reads are distributed among multiple slave databases. This can be coupled with sharding to distribute the data across different databases based on a shard key, such as the hash of the URL.
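
The LRU policy from the caching bullet can be sketched in-process with `collections.OrderedDict`. This is only an illustration of the eviction behaviour; a distributed cache such as Redis or Memcached would handle eviction itself, and the class name here is hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, short_code: str) -> Optional[str]:
        if short_code not in self._items:
            return None
        # Mark as most recently used.
        self._items.move_to_end(short_code)
        return self._items[short_code]

    def put(self, short_code: str, long_url: str) -> None:
        if short_code in self._items:
            self._items.move_to_end(short_code)
        self._items[short_code] = long_url
        if len(self._items) > self.capacity:
            # The first entry is the least recently used one.
            self._items.popitem(last=False)
```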

3. Database Choice: Trade-offs Between NoSQL and SQL

  • Consistency vs. Availability: NoSQL databases, like MongoDB, offer high availability and scalability, fitting well with the AP (Availability and Partition Tolerance) requirements of a URL shortening service. However, they may sacrifice consistency (eventual consistency) which could be critical depending on the application's requirements.
  • SQL Databases: SQL databases, like MySQL or PostgreSQL, provide strong consistency and relational data integrity. They're suitable for applications where transactional integrity (ACID properties) is crucial. However, they might face scalability challenges in a distributed environment compared to NoSQL solutions.
  • Trade-off Justification: The choice between NoSQL and SQL databases should be based on the system's specific requirements for consistency, scalability, and availability. For a URL shortening service, where high availability and the ability to handle large volumes of data are paramount, a NoSQL database like MongoDB might be preferred. However, if transactional integrity and relational data modeling are more critical, an SQL database could be the better choice.

4. Recommendations for Improvement

  • Algorithms/Data Structures: Use UUID generation to ensure unique short URLs without collisions, and use consistent hashing to distribute them evenly across nodes.
  • System Components: Provide detailed explanations of the roles and interactions of caching, load balancing, and database replication in the system's architecture, focusing on scalability and fault tolerance.
  • Database Trade-offs: Offer a comprehensive analysis of the trade-offs involved in the database selection process, considering the specific needs of the URL shortening service for consistency, scalability, and availability.




Failure scenarios/bottlenecks

Try to discuss as many failure scenarios/bottlenecks as possible.






Future improvements

What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?