Requirements
Functional Requirements:
- Create a short URL for a given long URL.
- Return the long URL associated with a given short URL.
- Links must be able to support analytics.
- URL should not be easily reversed engineered such as incrementing or UUID where it is time dependent
Non-Functional Requirements:
- High availability
- Low latency
- Horizontal scalability
- Assuming a medium-scale system, with about 1B users, and 10% of which are active, which is 100M users, each creating about 3 request/minute on average, we can expect 300 request/minute, or roughly 5 request/second. Considering the longevity of the system for 100 years, we can expect 5*60*60*365*100 = 657M possibly unique URL.
API Design
Define the APIs expected from the system. This is your chance to analyze and define the read and write paths so that you can come up with the high-level design...
Domain name: trnc.io
Available endpoints:
POST trnc.io/api/v1
{
url: 'https://codemia.io"
}
returns {url: 'trnc.io/api/v1/cmio'}
GET trnc.io/api/v1/cmio
302 redirect to https://codemia.io
GET trnc.io/api/v1/analytics
(internal analytics use)
High-Level Design
Describe the overall system architecture. Identify the main components needed to solve the problem end-to-end. Use the diagramming tool to create a block diagram.
API gateway:
- Pros: manage multiple service. Authentication. Rate limiting. Load balancing.
- Cons: single point of failure, additional service to manage
Services:
- Scalable, each with connection to DB
MongoDB:
- Write only to main mongoDB
- Read from both main mongoDB and secondary mongoDB
Detailed Component Design
Deep dive into 2-3 key components. Explain how they work, how they scale, discuss tradeoffs, capacity, and any relevant algorithms or data structures.
- Services: split shorten-url service and retrieve url service. Service will likely be heavily read. So separating these services will allow us to scale them independently.
- Hashing algorithm: Precompute 1B unique keys and store than in the DB instead of using incremental ids as that is predictable.
- Other potential implementation:
- Incremental ID + random increments
- Pros: easy to implement
- Cons: Predictable
- Hashing the url
- Pros: easy to implement and deterministic
- Cons: does not fulfil the requirement of analytics where we need unique shortened urls to help keep track
- random generation
- Pros: simple implementation
- Cons: collision, but trade off alright since collision is rare
- Incremental ID + random increments
- Other potential implementation:
- Database:
- Choice: NoSQL db like mongoDB
- Strategy: 1 primary, 1 secondary backup
- Based on our defined non-functional requirement regarding the requests, 1 primary noSQL db will be able to support the request. However, we still choose to have a secondary DB to allow for high availability in case of failure. Seek for simplicity instead.
- Other potential implementation:
- Consistent hashing
- Pros: if we needed multiple databases, then we are able to shard evenly
- Cons: Extra complexity. Unbalanced databases (solved with virtual servers)
- Cache only db
- Pros: lightning fast, support twice as fast as noSql
- Cons: mem only, lose data
- Consistent hashing
- URL length: There are 26 characters in the english alphabet, multiply by two for capitalised letters, add an additional 10 for numbers and we have 56 characters. To determine the length of the url, we can reverse engineer by using log(657000000)/log(56) ~ 6 characters long, which can create 30,840,979,456 requests. Approximately 4690 years. By using 6 characters length, we are able to support the storage required for a base62 encoding.
- Status code: 302 to force client to send request to our servers so that we can support analytics. Downside is that we will receive more requests, but we will be able to support this with our design.