Detailed Component Design
API Gateway
- Description: Handles incoming requests, and retrieves data from caching service for popular queries, or routes to Service Layer
- APIs
- POST /api/v1/shorten
- Description: Create a short URL for a given long URL
- Request Body:
{
"longUrl": "https://example.com/some/long/url"
}
- GET /api/v1/redirect/{shortUrlId}
- Description: Redirect users to the long URL associated with the given short URL
- Response: 302 Redirect to the long URL
Caching Service
- Description: Responsible for storing popular queries (retrieving longUrl redirects)
- APIs
- GET /{shortUrl}
- Output:
{
"longUrl": "https://example.com/some/long/url"
}
- The underlying storing system should utilize Time-To-Live expiration strategy and cache purging to remove stale data
- Eviction policy: Can be Least Recently used or Least Frequently used
Load Balancer
- Description: Route requests to different instances of Shortener Service based on different conditions like geographical nearest, numbers of instances spawned, etc.
Shortener Service
- Description: Responsible for creating and retrieving short URLs
- Responsibilities
- Generate unique short URLs
- Map short URLs to long URLs
- APIs
- POST /shorten
- Input:
{
"longUrl": "https://example.com/some/long/url"
}
- Output:
- Design: It uses the aforementioned Identifier Generation Algorithm to generate shortUrl and store URL mappings to the Database.
- GET /{shortUrl}
- Output:
{
"longUrl": "https://example.com/some/long/url"
}
- Design: It retrieves the data entry from Database using shortUrl as index. Returns 404 for shortUrl not found.
- Identifier Generation: We need to ensure the generated short URLs are unique and avoid collisions.
- Sequential IDs: This approach generates identifiers in a linear fashion (e.g., 1, 2, 3, ...). While simple, it can lead to predictable URLs that are easy to guess, which is a security concern. Additionally, it may create bottlenecks in high-concurrency environments. This is not recommended.
- Random Strings: Generating random alphanumeric strings (e.g., using UUIDs or a base-62 encoding) can help create unique identifiers that are hard to guess. This method reduces the likelihood of collisions but requires a mechanism to ensure uniqueness, which can add complexity.
- Hashing: Hashing the long URL using algorithms like SHA-256 and then truncating the result can also create unique identifiers. However, this may lead to collisions, necessitating a check against existing identifiers.
- Custom Patterns: If we want to allow users to specify custom URLs, it requires additional logic to validate and ensure uniqueness across the identifier space.
- I recommend random strings here. In case a collision happens, we can keep generating random strings, until there is no collision. This ensures the uniqueness of the generated identifier and avoid security concern from sequential IDs.
- Non-Functional Requirements
- Scalability: Can scale horizontally by adding more instances, or vertically by upgrading existing instances
- Performance: Response time should be under 100ms for URL shortening
- Reliability: Redundant instances behind a load balancer
- Generator Outrages: In case the generator service is unavailable, we could implement a fallback mechanism like maintaining a local cache of recently generated IDs or use a secondary ID generation service.
- Split-Brain: This occurs when network partitions occur, causing instances to operate independently without knowledge of each other and generate colliding IDs. We could incorporate a shard-based approach where each node is responsible for a specific range of IDs, reducing the risk of collisions.
- Consistency: Read operations are consistent. If there is a collision during write operation, where Data Layer returns a collision error, we implement retry mechanism to generate new identifier.
Database
- Description: Stores URL mappings and analytics data
- Data modeling:
- URL mapping table
- Columns: id (primary key), short_url, long_url, created_at, updated_at
- Indexing: Create an index on short_url for fast lookups
- Analytic data
- User metrics
- Popular shortUrl queries
- User actions (URL creation vs. URL retrieval)
- Geographic data
- Performance metrics
- Response time
- Error rates
- A/B testing data
- Non-functional requirements
- Scalability: Implement database sharding to distribute data across different instances
- Performance: Recommend using non-relational database like NoSQL to store data, since queries are simple and require low latency and high scalability
- Reliability: Database replicas can help recover when one instance is down
- Consistency: Read operations can be concurrent. Write operations can be concurrent among different database shards.