My Solution for Designing a Simple URL Shortening Service: A TinyURL Approach
by solstice3420
System requirements
Functional:
1) Short URL generation: define the mechanism for generating a short URL from a long one
2) Redirection: redirect users from a short URL to the original long URL
3) Read URL details: let users create an account so they can see all URLs they have created
4) Allow users to update and delete URLs in their account
5) URL expiration management: allow users to set an expiration time; default TTL for URLs is 5 years
6) Store basic analytics (e.g., URL click counts)
7) Premium users can get shorter URLs
8) Error Handling: Implement functions for error handling that return meaningful messages for common issues, such as expired URLs, accessing URLs that do not exist, or unauthorized access.
Non-Functional:
1) Performance: Specify response time targets for different functionalities, like maintaining a response time under 10ms for redirection and slightly longer for URL creation.
2) Scalability: Detail the expected growth in traffic and how the system would scale to accommodate more users and URL requests, e.g., horizontal scaling of the database and caching layers.
3) Security: Emphasize the need for secure data storage, encryption of sensitive data (e.g., user credentials), and protection against common web vulnerabilities (like XSS and SQL Injection).
Short URLs must also be unguessable (non-sequential, hard-to-predict codes).
4) Availability: Specify the expected uptime percentage (e.g., 99.99% uptime) and strategies for achieving high availability, such as redundancy and failover mechanisms.
5) Data Consistency: Ensure strong consistency, particularly for critical operations like creating and redirecting URLs, so that changes are immediately visible.
Capacity estimation
**Assumptions**
- shortening-to-redirection request ratio is 1:100
- 200 million new shorten-URL requests per month
- a URL shortening entry requires 500 bytes of database storage
- each entry has a maximum expiry time of five years, unless explicitly deleted
- 100 million DAU
**Storage:**
200 mln/month * 12 month * 5 years = 12 bln URL shorten requests
12 bln * 500B = 6 TB
**Queries per second:**
Redirects per month: 200 million × 100 = 20 billion
Seconds in a month: 30.42 days × 24 hours × 60 mins × 60 secs ≈ 2,628,288 secs
QPS (creations/shortenings): 200 million / 2,628,288 secs ≈ 76 URLs/sec
QPS (redirects): 100 × 76 URLs/sec = 7,600 URLs/sec
**Bandwidth:**
Incoming bandwidth: 76 URLs/sec × 500 B × 8 bits ≈ 304 Kbps
Outgoing bandwidth: 7,600 URLs/sec × 500 B × 8 bits ≈ 30,400 Kbps
Total bandwidth ≈ 304 Kbps + 30,400 Kbps ≈ 30.7 Mbps
**Memory estimation:**
Following the 80/20 rule, 20 percent of URLs generate roughly 80 percent of redirection traffic, so caching that hot 20 percent covers most requests.
Redirects per day: 7,600 × 3,600 secs × 24 hrs ≈ 0.66 billion
Cache size: 0.2 × 0.66 billion × 500 B ≈ 66 GB
**Servers needed:**
Assuming a single server handles ~64,000 RPS and a worst-case peak of 100 million RPS:
100 million RPS / 64,000 RPS per server ≈ 1.6K servers
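The arithmetic above can be sanity-checked with a short script (the inputs are the assumptions from the estimate, not measurements):

```python
# Back-of-envelope capacity estimation, using the assumptions above.
NEW_URLS_PER_MONTH = 200_000_000
READ_WRITE_RATIO = 100                          # 100 redirects per shortening
ENTRY_SIZE_BYTES = 500
RETENTION_YEARS = 5
SECONDS_PER_MONTH = int(30.42 * 24 * 60 * 60)   # ~2,628,288

def storage_tb() -> float:
    """Total storage for 5 years of entries, in TB."""
    total_entries = NEW_URLS_PER_MONTH * 12 * RETENTION_YEARS
    return total_entries * ENTRY_SIZE_BYTES / 1e12

def write_qps() -> int:
    return NEW_URLS_PER_MONTH // SECONDS_PER_MONTH

def read_qps() -> int:
    return write_qps() * READ_WRITE_RATIO

def cache_gb() -> float:
    """80/20 rule: cache the hottest 20% of a day's redirected URLs."""
    daily_redirects = read_qps() * 60 * 60 * 24
    return 0.2 * daily_redirects * ENTRY_SIZE_BYTES / 1e9

print(storage_tb())   # ~6.0 TB
print(write_qps())    # ~76 creations/sec
print(read_qps())     # ~7,600 redirects/sec
print(cache_gb())     # ~66 GB
```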
## API design
- Shorten url: POST "/urls/shorten", body: {long_url, user_id, ttl}, response: {short_url, ttl}, status_codes: [201, 400, 429]
- Redirect url: GET "/:short_url", status_codes: [302, 404, 429]
- Create user account: POST "/sign-up", body: {firstname, lastname, email, password }, response: {id, firstname, lastname, email}, status_codes: [201, 400, 429]
- Get user settings: GET "/users/:id", response: {id, firstname, lastname, email}, status_codes: [200, 404, 429]
- Sign-in a user: POST "/sign-in", body: {email, password }, response: {id, firstname, lastname, email, jwt, refresh_token }, status_codes: [200, 400, 401, 429]
- Get user's urls: GET "/users/:id/urls", response: { urls: [{id, url}] }, status_codes: [200, 403, 404, 429]
- Update url for the user: PUT "/users/:user_id/urls/:url_id", body: {updated_url}, response: {updated_url}, status_codes: [200, 403, 404, 429]
- Delete url for user: DELETE "/users/:user_id/urls/:url_id", status_codes: [204, 403, 404, 429]
**Rate limiter:**
Anonymous users:
- shorten requests: 10-50/min
- redirect requests: 100-300/min
Signed-in users:
- shorten requests: 100-500/min
- redirect requests: 10,000-30,000/min
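A minimal sketch of how such per-user limits could be enforced with a token bucket (the class and the limit values are illustrative, not a prescribed implementation):

```python
import time

class TokenBucket:
    """Token-bucket limiter: `rate` tokens are added per second, up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then try to spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example: anonymous user, ~50 shorten requests per minute
anon_shorten = TokenBucket(rate=50 / 60, capacity=50)
```

In production this state would live in a shared store (e.g., Redis) keyed by user or IP, so all gateway instances see the same counters.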
Database design
**Schema Design:**
Urls.
id: SERIAL PRIMARY KEY
short_url: String
long_url: String
user_id: FOREIGN KEY, NULLABLE
created_at: ISO DateTime
updated_at: ISO DateTime
expired_at: ISO DateTime
Users
id: SERIAL PRIMARY KEY
firstname: String, NOT NULL
lastname: String, NOT NULL
email: UNIQUE String, NOT NULL
password: String, NOT NULL
**Indexing:**
1) B-tree index on short_url, so the redirect path can look up long_url quickly
2) B-tree index on user_id, to fetch all URLs belonging to a user
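A sketch of this schema and the indexes, using SQLite as a stand-in for the production database (table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    firstname  TEXT NOT NULL,
    lastname   TEXT NOT NULL,
    email      TEXT NOT NULL UNIQUE,
    password   TEXT NOT NULL           -- store a salted hash, never plaintext
);

CREATE TABLE urls (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    short_url   TEXT NOT NULL UNIQUE,
    long_url    TEXT NOT NULL,
    user_id     INTEGER REFERENCES users(id),   -- NULL for anonymous users
    created_at  TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at  TEXT,
    expired_at  TEXT
);

-- Point lookups for the redirect path (the UNIQUE constraint above
-- already creates an equivalent index in SQLite; shown for clarity)
CREATE INDEX idx_urls_short ON urls (short_url);
-- Listing all URLs that belong to a user
CREATE INDEX idx_urls_user ON urls (user_id);
""")
```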
**Scaling Strategy:**
1) reads scaling:
- Use Replica Sets – Replicate data across multiple nodes.
- Distribute Read Traffic – Route read queries to secondary nodes (eventually consistent reads).
2) writes scaling:
- Writes go to the Primary Node in a replica set.
- Sharding enables write scaling by distributing writes across shards
- Use Write Concern – Tune write consistency vs. performance - here we don't need data for reads right away, so we have enough time to update all replicas.
**Data Retention Policies:**
The URL data is stored for 5 years by default. After that, a worker task running overnight removes all expired records. One month and again one week before expiration, the owner can be notified so they have a chance to extend the TTL.
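The cleanup worker could look like this sketch, deleting in small batches so locks stay short (SQLite stands in for the real database; the batch size is an assumption):

```python
import sqlite3

BATCH_SIZE = 10_000  # small batches avoid long-held locks on a busy table

def purge_expired(conn: sqlite3.Connection) -> int:
    """Nightly worker: remove expired URL rows in batches; returns rows deleted."""
    deleted = 0
    while True:
        cur = conn.execute(
            "DELETE FROM urls WHERE id IN ("
            "  SELECT id FROM urls"
            "  WHERE expired_at IS NOT NULL AND expired_at < datetime('now')"
            "  LIMIT ?)",
            (BATCH_SIZE,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return deleted
        deleted += cur.rowcount
```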
**Backup and Recovery:**
A. Backups
1) Full database backup (daily)
2) Incremental backups (real-time or hourly)
B. Backup Retention Policy
Short-term: keep daily backups for 7-14 days.
Long-term: keep weekly backups for 3-6 months.
C. Backup Storage Strategy
Object storage (e.g., Amazon S3).
With at least 3 replicas in the system, failovers are straightforward to handle.
D. Recovery Strategy
- Replica Set Failover (Automatic): if the primary fails, a secondary becomes the new primary.
- Full DB Restore (Manual): deploy a fresh MongoDB instance and restore from the latest backup.
- URL Cache Recovery (Optional): if using Redis for caching, enable Redis persistence (RDB or AOF).
**Security:**
Ensure that sensitive data (e.g., user information) is stored securely and that proper access controls are in place to protect the database.
Monitoring and Optimization:
Key Metrics to Monitor:
CPU and Memory Usage: Ensure your system has enough resources to handle requests.
Disk Space and I/O: Monitor disk usage and make sure that your database isn’t growing uncontrollably.
Slow Queries: Monitor slow queries and optimize them by creating indexes or revising queries.
Replica Lag: If using a replica set, track replication lag.
High-level design
### Key Components:
**API Gateway**
Handles:
- routing incoming requests to the appropriate service
- user authentication
- response caching
- hiding the application's internals behind a private VPC
- load balancing
- SSL/TLS termination
- rate limiting
**Shortening Service:**
- Generates short URLs and maps them to long URLs
**Mapping Service(Redirection service):**
- Handles redirection requests from short URLs to long URLs
**Database**
- stores the mappings and any associated metadata (creation time, expiration, etc.)
- stores user account data
**Cache:** (e.g., Redis)
- For speeding up repeated lookup requests for popular short URLs
### Performance Considerations:
- Add caching mechanisms such as a CDN for the most frequently accessed URLs in the redirect flow.
Frequently accessed short URLs can be cached at the edge using a Content Delivery Network (CDN). This minimizes latency by serving redirection responses from a location closer to the user, reducing round-trip times and lessening the load on backend services.
- Plan for load balancing to distribute requests across multiple instances of services:
1) Incoming Requests: A load balancer can distribute incoming client requests across multiple API gateway instances.
2) Service Instances: Within the system, use additional load balancing to distribute tasks among multiple instances of the URL shortening and mapping services.
### Scalability and Reliability:
**Stateless Design for Horizontal Scaling**
Stateless Components:
Ensure that key components (e.g., API Gateways, Shortening Service, Mapping Service) are stateless.
This means they do not store user session or request-specific data locally.
Instead, any stateful information is stored in centralized systems such as databases or distributed caches.
Stateless services allow you to easily add more instances to handle increased loads.
**External State Management:**
Offload state management to external systems (like databases or caching layers).
This keeps your service instances light and makes scaling horizontally as simple as spinning up more identical stateless nodes behind a load balancer.
**Redundancy and Failover Strategies**
Multiple Service Instances:
Deploy multiple instances of each service component.
This not only balances the load but also ensures that if one instance fails, others can continue serving requests.
**Load Balancing:**
Implement load balancers at various layers:
- Client-Facing Load Balancers: Distribute incoming traffic across multiple API gateway instances.
- Internal Load Balancers: Manage traffic distribution among service instances (e.g., for the Shortening Service or Mapping Service).
**Database Replication:**
Read Replicas/Multi-Master: Utilize database replicas to distribute read operations and provide redundancy.
In the event of a primary database failure, a replica can take over, reducing downtime.
**Failover Mechanisms:**
Set up automated failover strategies so that if the primary database becomes unavailable, the system switches to a standby replica seamlessly.
**Distributed Caching:**
Use a highly available distributed caching solution (e.g., Redis clusters) that supports replication. This ensures that even if one cache node fails, the cache remains available to serve frequent lookup requests.
**Autoscaling:**
Implement autoscaling policies based on metrics such as CPU utilization, memory usage, or request rates. This allows the system to automatically add or remove service instances in response to load changes.
**Health Checks and Monitoring:**
Regular Health Checks: Ensure that load balancers and orchestration tools perform continuous health checks to quickly detect and remove unhealthy instances.
Proactive Monitoring: Use monitoring tools to track performance metrics and set up alerts for anomalies. This proactive approach helps in quick detection of issues before they escalate.
**Resilience Patterns**
**Circuit Breakers:**
Implement circuit breaker patterns to handle failures gracefully. This prevents cascading failures by temporarily blocking calls to failing components and allowing the system time to recover.
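A minimal circuit-breaker sketch (thresholds and timings are illustrative): after a run of consecutive failures it fails fast, then allows a trial call once the reset window has passed:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; retries after `reset_after` seconds."""
    def __init__(self, threshold: int = 5, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                # success closes the circuit
        return result
```

Libraries such as resilience4j (JVM) or pybreaker provide production-grade versions of this pattern.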
**Service Discovery:**
Use a service discovery mechanism to dynamically manage the available service instances. This enables components to automatically locate healthy nodes, further enhancing system resilience.
Request flows
**Creating a Short URL:**
Client sends a request to POST /shorten with the long URL.
API Gateway forwards the request to the Shortening Service.
Shortening Service generates a short URL, stores it in the Database, and updates the Cache.
Returns the short URL to the client.
**Redirecting a Short URL:**
Client sends a request to GET /{shortUrl}.
API Gateway forwards the request to the Mapping Service.
Mapping Service checks the Cache for the short URL.
If found in Cache, return the long URL; if not, query the Database, update Cache, and return long URL.
Redirect the client to the long URL.
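The redirect flow above is the classic cache-aside pattern; a minimal sketch with plain dicts standing in for Redis and the database:

```python
# Hypothetical stand-ins: `cache` plays the role of Redis, `db` the URL table.
cache = {}
db = {"abc123": "https://example.com/some/very/long/path"}

def resolve(short_url: str):
    # 1. Check the cache first
    if short_url in cache:
        return cache[short_url]
    # 2. Cache miss: fall back to the database
    long_url = db.get(short_url)
    if long_url is None:
        return None              # -> respond 404
    # 3. Populate the cache for subsequent requests
    cache[short_url] = long_url
    return long_url              # -> respond 302 with Location: long_url
```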
Detailed component design
The Mapping Service takes the short URL and redirects the client to the corresponding long URL.
The service uses the HTTP 302 (temporary) redirect rather than 301: browsers cache 301 responses and would skip the server on repeat visits, which would lose the click analytics.
How It Works:
When a request comes in with a short URL, it first checks the Cache to see if the mapping exists.
If the mapping is found in the Cache, it returns the long URL immediately.
If not found (cache miss), it queries the Database for the mapping, caches the result, and returns the long URL.
Scalability:
Caching: Utilizing caching mechanisms significantly improves redirect response times and reduces the load on the Database.
Replication: You can scale read capacity by adding read replicas of the Database.
Algorithms/Data Structures:
Cache: A key-value store (like Redis) uses an in-memory data structure for fast access.
LRU Cache: Implement the Least Recently Used (LRU) eviction policy to efficiently manage cache memory.
1) What to Cache?
Common Caching Use Cases:
- Short URL → Long URL Mapping (to avoid frequent DB queries).
- Rate Limit Data (to prevent abuse with minimal lookup overhead).
- API Responses (for analytics dashboards or reports).
2) Types of Caching
- In-Memory Caching (Fastest, but Limited Size)
Use Cases: Store frequently accessed short URLs.
Technology: Redis, Memcached.
Pros: Extremely fast, reduces DB hits.
Cons: Limited by memory, risk of eviction.
- Distributed Caching (Scalable, Fault-Tolerant)
Use Cases: Multi-instance applications (cloud-based).
Technology: Redis Cluster, AWS ElastiCache, Google Cloud Memorystore.
Pros: Scales with demand, supports replication.
Cons: Requires setup/management.
3) Cache Expiration & Eviction Policies
- Expiration Policies (TTL)
Short URLs: Set a TTL of 24 hours for frequently accessed links.
Rate Limit Data: TTL of 1 minute to enforce request limits.
API Responses: Cache for 5–10 minutes (if data doesn’t change frequently).
- Eviction Policies
LRU (Least Recently Used): Remove the least accessed URLs first.
LFU (Least Frequently Used): Removes URLs with the lowest access count.
TTL-Based Eviction: Automatically deletes expired keys.
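A compact sketch of LRU eviction using an ordered dictionary (capacity and types are illustrative; in practice Redis handles this natively via its `maxmemory-policy` setting):

```python
from collections import OrderedDict

class LRUCache:
    """On overflow, evict the least recently used entry."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # drop the least recently used
```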
Caching Technology Recommendations: Redis (Best for URL Mapping)
Supports TTL-based eviction.
Can store short URL → long URL mappings with minimal overhead.
Summary:
- Redis for fast short URL lookups.
- TTLs to avoid stale cache data.
- Implement LRU eviction to prevent memory overuse.
- Cache rate limit checks for efficient abuse prevention.
Trade offs/Tech choices
1. SQL vs. NoSQL Database
Trade-off:
Scalability:
SQL: Vertical scaling (sharding required for horizontal scaling)
NoSQL: Naturally supports horizontal scaling (partitioning, sharding).
Consistency:
SQL: Strong consistency (ACID transactions).
NoSQL: Eventual consistency (default in NoSQL systems), but some NoSQL databases offer strong consistency at the cost of performance.
Read Performance:
SQL: Optimized for structured queries (indexed searches, joins).
NoSQL: Faster for simple key-value lookups but can suffer from stale data in eventually consistent models.
Write Performance:
SQL: Slower due to transaction overhead.
NoSQL: High throughput due to distributed architecture, but potential latency due to replication.
Flexibility:
SQL: Schema-bound, requires migrations for structure changes.
NoSQL: Schema-less, allows dynamic updates to records without migrations
Choice: NoSQL Database (e.g., MongoDB or DynamoDB)
Rationale: In a URL shortening service, the schema is relatively simple (short URL, long URL mapping), and there is a requirement for high write throughput (especially during peak usage). NoSQL databases can efficiently handle large numbers of concurrent users.
2. Uniform Random String vs. Base Conversion for Short URL Generation
Trade-off:
Random String Generation: Simplicity and ease of implementation, but potential for collisions (needing additional checks).
Base Conversion: More complex but guarantees uniqueness by mapping incremental IDs to short strings.
Choice: Random String Generation with Collision Checking
Rationale: This approach allows more straightforward URL shortening without needing a dedicated mapping algorithm. Although it requires additional checks for collisions, the complexity is manageable for a service starting with moderate traffic, and the benefits of simplicity weigh in favor of this choice.
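A sketch of the chosen approach: random base62 codes with a collision check (the in-memory set stands in for a uniqueness check against the database):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # base62: a-z, A-Z, 0-9
CODE_LENGTH = 7                                  # 62**7 ≈ 3.5 trillion codes

existing = set()  # stand-in for a DB uniqueness check / unique index

def generate_short_code() -> str:
    """Random, unguessable base62 code; retry on the (rare) collision."""
    while True:
        code = "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
        if code not in existing:   # in production: rely on a unique-index insert
            existing.add(code)
            return code
```

Using `secrets` rather than `random` also satisfies the non-functional requirement that short URLs be unguessable.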
3. Caching Strategy:
What to Cache?
Trade-off: Caching API responses improves performance but risks serving stale data if updates are frequent.
Trade-off: Storing rate limit data in cache reduces database lookups but introduces potential inconsistency if the cache is evicted unexpectedly.
Types of Caching
**In-Memory Caching:**
Trade-off: Extremely fast but constrained by memory limits, leading to possible evictions.
**Distributed Caching:**
Trade-off: More scalable and fault-tolerant but requires additional setup and management overhead.
**Cache Expiration & Eviction Policies**
Trade-off: TTL-based expiration prevents stale data but requires careful tuning to avoid unnecessary misses.
Trade-off: LRU/LFU policies manage memory effectively but may evict useful data if access patterns are miscalculated.
**Cache Monitoring**
Trade-off: Constant monitoring improves reliability but adds operational complexity.
Trade-off: Setting strict eviction policies can reduce memory usage but may lead to increased cache misses.
4. API Gateway
Trade-off:
Self-Managed API Gateway: Gives complete control over the infrastructure but adds operational overhead.
Managed API Gateway (e.g., AWS API Gateway): Reduces operational burden but may involve vendor lock-in and costs.
Choice: Managed API Gateway
Rationale: A managed service simplifies scaling, security, and maintenance while allowing development teams to focus on core functionality. The trade-off of being dependent on a specific cloud provider is outweighed by reduced operational complexity.
5. CDN for Serving Redirects
Trade-off:
Using a CDN: Increases cost and complexity but significantly improves performance and scalability for high-volume URLs.
Serving All Redirects from Origin: Simplifies architecture and reduces cost but risks slower response times and higher server load.
Choice: Use of CDN
Rationale: Deploying a CDN improves the user experience through faster redirection to long URLs, especially for popular links that could go viral. Given the potential for high traffic, the benefits of faster response times and reduced load on the API Gateway and Mapping Service justify the costs.
6. "dedicated unique ID service" or checking "existing URLs in a data store"
The case ultimately depends on your specific requirements for performance, flexibility, and architecture complexity.
If minimizing latency and ensuring absolute uniqueness with less operational complexity is a priority, the unique ID service is preferable.
If maintaining flexibility, managing existing entries, and avoiding the complexity of a separate services architecture are more critical, then using a data store to check for existing short URLs would be the better choice.
Failure scenarios/bottlenecks
### Failure Scenarios
a. **Service Outages**
Potential Issue: Any of the microservices (Shortening Service, Mapping Service, or API Gateway) could fail due to unplanned outages, service crashes, or network issues.
Mitigation Strategies:
- Redundancy: Run multiple instances of each service to provide high availability.
- Health Checks: Implement health checks to detect failures and reroute traffic to healthy instances.
- Service Monitoring and Alerts: Use monitoring tools to track service health and set up alerts for failures.
b. **Database Failures**
Potential Issue: The database could become unavailable due to server issues, network partitioning, or heavy load leading to performance degradation.
Mitigation Strategies:
- Database Replication: Use read replicas to distribute read load and ensure redundancy.
- Automated Backups and Recovery: Implement automated backups to avoid data loss and enable recovery in case of failure.
- Caching: Utilize cache (e.g., Redis) for high-frequency reads to reduce database load.
c. **Cache Failures**
Potential Issue: If the caching layer (e.g., Redis) fails or becomes unavailable, it can significantly slow down redirection requests.
Mitigation Strategies:
Graceful Degradation: Fallback to querying the database directly if the cache is unavailable, though this may impact performance.
Replication: Use cluster setups for Redis, allowing for high availability and failover in case of failure.
### Bottlenecks
a. **High Traffic Loads**
Potential Issue: If the service experiences a sudden spike in traffic due to a viral link, it might overwhelm the API Gateway or the Mapping Service.
Mitigation Strategies:
Load Balancing: Distribute incoming requests across multiple service instances to avoid overload on any single instance.
Rate Limiting: Implement rate limiting to control the number of requests a user can make within a specific timeframe, preventing abuse and service degradation.
b. **Database Write Bottlenecks**
Potential Issue: The database may become a bottleneck during high-volume short URL creation requests, leading to delayed responses.
Mitigation Strategies:
Message Queue: Introduce a message queue to handle short URL creation requests asynchronously. The Shortening Service can enqueue requests, while background workers process them for database writes.
Batch Writes: Optimizing writes to the database by batching multiple insert operations can significantly reduce write load.
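The queue-plus-batching idea can be sketched as follows: the Shortening Service enqueues mappings, and a background worker drains them in batches (the dict stands in for the URL table; the sentinel and batch size are illustrative):

```python
import queue

write_queue = queue.Queue()
db = {}   # hypothetical stand-in for the URL table

def writer_worker(batch_size: int = 100) -> None:
    """Drain the queue and write mappings in batches; a None sentinel stops it."""
    batch = []
    while True:
        item = write_queue.get()
        if item is None:            # sentinel: flush remainder and stop
            break
        batch.append(item)
        if len(batch) >= batch_size:
            db.update(batch)        # one batched write instead of N inserts
            batch.clear()
    db.update(batch)
```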
c. **Network Latency**
Potential Issue: Increased network latency can occur between services, especially in a microservices architecture.
Mitigation Strategies:
Local Caching: Cache frequently accessed mappings locally within the services to minimize cross-service calls.
Geographic Distribution: Deploy services closer to your user base to reduce network round-trip times.
Future improvements
Implementing User Authentication and Custom Short URLs:
**Improvement: Allow users to create accounts and manage their short URLs. Users can have the option to create custom aliases for their short URLs.**
Benefits: Personalization improves user engagement, and authenticated users can manage their links better, including analytics and statistics on link performance.
Link Expiration and Deletion:
**Improvement: Introduce features for users to set expiration dates for short URLs or allow for manual deletion.**
Benefits: Users can maintain cleaner link management, which could help in compliance with data privacy regulations or user preferences.
Analytics Dashboard:
**Improvement: Provide an analytics dashboard that shows users statistics such as click-through rates, geographic distribution of users, referrer statistics, and device usage.**
Benefits: These insights can help users make informed decisions about their links, maximizing engagement by optimizing their sharing strategies.
Enhanced Security Features:
**Improvement: Introduce features like link verification, which would check for malicious content, and implement additional security protocols to protect against abuses (e.g., phishing links).**
Benefits: Improves trust in the service, and assures users that the links are safe to click.
Multi-Regional Deployment:
**Improvement: Deploy the service across multiple geographic regions to reduce latency for users in those areas.**
Benefits: Improves response times and ensures high availability, even during a traffic spike.
Automated Scaling:
**Improvement: Implement auto-scaling solutions for services (both backend and frontend) based on traffic load, utilizing cloud features if deployed on platforms like AWS, GCP, or Azure.**
Benefits: Reduces manual intervention needed for scaling, ensuring the service remains responsive during high demand.
Advanced Caching Strategies:
**Improvement: Integrate more advanced caching strategies, such as tiered caching (local caches, remote caches, and CDN), to further reduce database load and improve performance.**
Benefits: This will enhance response times and improve overall application performance.
Mitigating Failure Scenarios
Improving Service Failures:
**Implement health check mechanisms and circuit breakers to quickly detect when a service is down and reroute traffic to alternative service instances. Use service mesh technologies for better resilience and monitoring.**
Database Failures:
**Adopt multi-region database replication techniques to ensure data is available even if one database instance fails. Implement automatic failover strategies if the primary database goes down.**
Load Management:
**Introduce a dynamic load balancer that automatically adjusts based on current traffic, ensuring even distribution of requests across service instances and avoiding bottlenecks.**
For potential spikes, utilize a queue-based architecture where requests are queued and processed as resources become available.
Cache Failures:
**Use cache fallback mechanisms to prevent the application from failing when the cache is down. If a cache cannot be reached, the system should still function by querying the database directly.**
Advanced Monitoring and Alerts:
**Implement detailed logging and monitoring capabilities with real-time alerts. This could include application performance monitoring (APM) tools to quickly detect anomalies and respond effectively before they lead to significant failures.**