Designing A Simple Url Shortening Service A TinyURL Approach - System Design

Requirements

Functional Requirements:

Create a short URL for a given long URL.
Return the long URL associated with a given short URL.

Non-Functional Requirements:

Scalability: As your service grows, it should be able to handle increased loads by adding more servers rather than upgrading existing ones.
Performance: Users expect quick redirects when accessing shortened URLs.
Reliability: The service should remain operational and accessible at all times.

API Design

API Overview
1. Purpose: Provide a service to shorten long URLs and redirect users from short URLs to their corresponding long URLs.
2. Versioning: Use versioning in the URL (e.g., /api/v1/shorten).
Endpoints
1. POST /api/v1/shorten
  1. Description: Create a short URL for a given long URL
  2. Request Body:

{
  "longUrl": "https://example.com/some/long/url"
}

1. GET /api/v1/redirect/{shortUrlId}
  1. Description: Redirect users to the long URL associated with the given short URL
  2. Response: 302 Redirect to the long URL
Data Modeling
1. URL Mapping Table:
  1. Columns: id (primary key), short_url, long_url, created_at, updated_at
  2. Indexing: Create an index on short_url for fast lookups
Caching Layer
1. Use caching (e.g., Redis) to store frequently accessed short URL mappings to reduce latency.
Error Handling
1. 400 Bad Request: Invalid long URL format.
2. 404 Not Found: Short URL not found.
Rate Limiting
1. Implement rate limiting per IP to prevent abuse.

High-Level Design

Architectures
1. API Layer
  1. API Gateway: Handles incoming requests and routes them to service layer
  2. Caching service: It temporarily stores frequently accessed data, such as the mappings between short URLs and their corresponding long URLs. By doing so, it reduces the need to repeatedly query the underlying database, which can be a bottleneck, especially during high traffic periods.
2. Service Layer
  1. Load Balancer: Route requests to different instances of Shortener Service to prevent overload
  2. Shortener Service: Manages URL creation and retrieval
3. Data Layer
  1. Database: Stores URL mappings and analytics data
Data Flows
1. URL creation: User submits shortening requests with a long URL to API Gateway, which routes the request to Shortener Service, which generates a short URL, stores the mapping to Database, and returns the short URL to User
2. URL Redirection: User submits redirecting requests with a short URL to API Gateway, which routes the request to Shortener Service, which retrieves the long URL from Database, and redirects the user
Identifier Generation: We need to ensure the generated short URLs are unique and avoid collisions.
1. Sequential IDs: This approach generates identifiers in a linear fashion (e.g., 1, 2, 3, ...). While simple, it can lead to predictable URLs that are easy to guess, which is a security concern. Additionally, it may create bottlenecks in high-concurrency environments.
2. Random Strings: Generating random alphanumeric strings (e.g., using UUIDs or a base-62 encoding) can help create unique identifiers that are hard to guess. This method reduces the likelihood of collisions but requires a mechanism to ensure uniqueness, which can add complexity.
3. Hashing: Hashing the long URL using algorithms like SHA-256 and then truncating the result can also create unique identifiers. However, this may lead to collisions, necessitating a check against existing identifiers.
4. Custom Patterns: If we want to allow users to specify custom URLs, it requires additional logic to validate and ensure uniqueness across the identifier space.
5. I recommend random strings here. In case a collision happens, we can keep generating random strings, until there is no collision. This ensures the uniqueness of the generated identifier and avoid security concern from sequential IDs.

Database sharding
1. Distribute data across multiple database instances by short URL prefix

Detailed Component Design

API Gateway

Description: Handles incoming requests, and retrieves data from caching service for popular queries, or routes to Service Layer
APIs
1. POST /api/v1/shorten
  1. Description: Create a short URL for a given long URL
  2. Request Body:

{
  "longUrl": "https://example.com/some/long/url"
}

1. GET /api/v1/redirect/{shortUrlId}
  1. Description: Redirect users to the long URL associated with the given short URL
  2. Response: 302 Redirect to the long URL

Caching Service

Description: Responsible for storing popular queries (retrieving longUrl redirects)
APIs
1. GET /{shortUrl}
  1. Output:

{
  "longUrl": "https://example.com/some/long/url"
}

1. 1. The underlying storing system should utilize Time-To-Live expiration strategy and cache purging to remove stale data
  2. Eviction policy: Can be Least Recently used or Least Frequently used

Load Balancer

Description: Route requests to different instances of Shortener Service based on different conditions like geographical nearest, numbers of instances spawned, etc.

Shortener Service

Description: Responsible for creating and retrieving short URLs
Responsibilities
1. Generate unique short URLs
2. Map short URLs to long URLs
APIs
1. POST /shorten
  1. Input:

{
  "longUrl": "https://example.com/some/long/url"
}

1. 1. Output:

{
  "shortUrl": "abc123"
}

1. 1. Design: It uses the aforementioned Identifier Generation Algorithm to generate shortUrl and store URL mappings to the Database.
2. GET /{shortUrl}
  1. Output:

{
  "longUrl": "https://example.com/some/long/url"
}

1. 1. Design: It retrieves the data entry from Database using shortUrl as index. Returns 404 for shortUrl not found.
Identifier Generation: We need to ensure the generated short URLs are unique and avoid collisions.
1. Sequential IDs: This approach generates identifiers in a linear fashion (e.g., 1, 2, 3, ...). While simple, it can lead to predictable URLs that are easy to guess, which is a security concern. Additionally, it may create bottlenecks in high-concurrency environments. This is not recommended.
2. Random Strings: Generating random alphanumeric strings (e.g., using UUIDs or a base-62 encoding) can help create unique identifiers that are hard to guess. This method reduces the likelihood of collisions but requires a mechanism to ensure uniqueness, which can add complexity.
3. Hashing: Hashing the long URL using algorithms like SHA-256 and then truncating the result can also create unique identifiers. However, this may lead to collisions, necessitating a check against existing identifiers.
4. Custom Patterns: If we want to allow users to specify custom URLs, it requires additional logic to validate and ensure uniqueness across the identifier space.
5. I recommend random strings here. In case a collision happens, we can keep generating random strings, until there is no collision. This ensures the uniqueness of the generated identifier and avoid security concern from sequential IDs.
Non-Functional Requirements
1. Scalability: Can scale horizontally by adding more instances, or vertically by upgrading existing instances
2. Performance: Response time should be under 100ms for URL shortening
3. Reliability: Redundant instances behind a load balancer
  1. Generator Outrages: In case the generator service is unavailable, we could implement a fallback mechanism like maintaining a local cache of recently generated IDs or use a secondary ID generation service.
  2. Split-Brain: This occurs when network partitions occur, causing instances to operate independently without knowledge of each other and generate colliding IDs. We could incorporate a shard-based approach where each node is responsible for a specific range of IDs, reducing the risk of collisions.
4. Consistency: Read operations are consistent. If there is a collision during write operation, where Data Layer returns a collision error, we implement retry mechanism to generate new identifier.

Database

Description: Stores URL mappings and analytics data
Data modeling:
1. URL mapping table
  1. Columns: id (primary key), short_url, long_url, created_at, updated_at
  2. Indexing: Create an index on short_url for fast lookups
2. Analytic data
  1. User metrics
    1. Popular shortUrl queries
    2. User actions (URL creation vs. URL retrieval)
    3. Geographic data
  2. Performance metrics
    1. Response time
    2. Error rates
  3. A/B testing data
Non-functional requirements
1. Scalability: Implement database sharding to distribute data across different instances
2. Performance: Recommend using non-relational database like NoSQL to store data, since queries are simple and require low latency and high scalability
3. Reliability: Database replicas can help recover when one instance is down
4. Consistency: Read operations can be concurrent. Write operations can be concurrent among different database shards.