Tackling System Design Interview Problems
Database Design
Choosing the right database is one of the most critical steps in system design, as it directly impacts scalability, performance, and maintainability. The choice begins with capacity estimation and understanding the system's use cases. For small-scale systems, relational databases like PostgreSQL or MySQL often suffice due to their robust query capabilities and strong consistency. However, for larger-scale systems handling significant data volumes (e.g., 1 TB to 100 TB) or high write throughput, distributed NoSQL databases like Cassandra, DynamoDB, or Bigtable may be necessary.
Equally important is data modeling, which determines how data is structured and accessed. Data modeling can help identify optimal partitioning keys, normalization, denormalization opportunities, and trade-offs between performance and storage. The level of detail invested in data modeling should be proportional to its impact on the design.
Database Choice
The primary data model for a URL shortener is a mapping between short and long URLs. This system has:
- Simple Queries: Most operations involve basic lookups or inserts.
- Strong Consistency Requirements: Once a mapping is written, it must be immediately visible to all readers.
- Read-Heavy Traffic: Far more redirects (reads) occur than URL generations (writes).
A key-value store like DynamoDB is a natural fit. DynamoDB provides horizontal scalability, high performance, and optional strong consistency.
Data Model
URL Mapping
| Field | Description |
| short_url | Primary key (8 bytes) |
| long_url | Indexed (100 bytes) |
| created_time | Timestamp (8 bytes) |
| created_by | User ID (optional, 20 bytes, indexed) |
| expiration_time | Timestamp (8 bytes) |
Additional secondary indexes can support features like:
- Listing all short URLs created by a user.
- Expiring short URLs based on expiration_time.
Initial Insights
- Caching: We could use Redis to cache frequently accessed short-long mappings, reducing database load.
- URL Generation:
- We could use Base-62 encoding to generate compact, user-friendly short URLs.
- Implement collision handling for cases where the same short URL is generated more than once.
Database Choice
Twitter’s data characteristics include:
- Small Row Sizes: Individual tweets are small (e.g., ~512 bytes).
- High Write Volume: Billions of tweets are generated daily.
- Relational Query Needs: Queries like “list tweets by a user” require relationships.
- Eventual Consistency: Strong consistency is not strictly necessary for tweets.
Given these requirements:
- LSM-Based Databases (e.g., Cassandra): Ideal for write-heavy workloads and horizontal scaling.
- B-Tree-Based Databases (e.g., MySQL, MongoDB): Better for relational queries but may face scalability challenges with high write volumes.
A compromise might be a document-oriented database like MongoDB. MongoDB is more horizontally scalable than relational databases, supports eventual consistency to enhance scalability, and offers better relational query capabilities than Cassandra.
Data Models
Tweet
| Column | Description |
| tweet_ID | Primary key |
| created_by | User ID (indexed) |
| posted_time | Timestamp (indexed) |
| content | Text content of the tweet |
| number_of_likes | Counter for likes |
| hashtags | List of hashtags associated with the tweet |
| users_mentioned | List of user IDs mentioned in the tweet |
User
| Field | Description |
| user_ID | Primary key |
| Indexed | |
| name | Full name |
Hashtag
| Field | Description |
| hashtag_ID | Primary key |
| Indexed | |
| name | Full name |
| nickname | Display name |
The Hashtag model uses denormalization to optimize the core functionality: mapping hashtags to associated tweets. In a relational database, this would require a foreign key relationship and query through the tweets table, whereas MongoDB’s schema flexibility enables faster lookups at the cost of managing consistency (e.g., removing tweet IDs from the hashtag list when tweets are deleted).
Follows Relationship
The Follows model is normalized for efficient querying:
| Field | Description |
| follower | User ID of the follower |
| followed | User ID of the user being followed |
| timestamp | Time of follow action |
This separation allows efficient queries for both:
- Listing users a given user is following.
- Listing followers of a given user.
It can be tempting to list every possible field and model exhaustively. However, this approach often wastes valuable time on details that may not significantly impact the final design. Instead, focus on gaining insights into how the data should be partitioned and normalized/denormalized to optimize queries for the system's specific needs. The goal is to understand the data flow and access patterns, enabling you to make design decisions that enhance performance and scalability without getting bogged down in unnecessary details.