Database Design

Choosing the right database is one of the most critical steps in system design, as it directly impacts scalability, performance, and maintainability. The choice begins with capacity estimation and understanding the system's use cases. For small-scale systems, relational databases like PostgreSQL or MySQL often suffice due to their robust query capabilities and strong consistency. However, for larger-scale systems handling significant data volumes (e.g., 1 TB to 100 TB) or high write throughput, distributed NoSQL databases like Cassandra, DynamoDB, or Bigtable may be necessary.

Equally important is data modeling, which determines how data is structured and accessed. Data modeling can help identify optimal partitioning keys, normalization, denormalization opportunities, and trade-offs between performance and storage. The level of detail invested in data modeling should be proportional to its impact on the design.

Short URL Service

Database Choice

The primary data model for a URL shortener is a mapping between short and long URLs. This system has:

Simple Queries: Most operations involve basic lookups or inserts.
Strong Consistency Requirements: Once a mapping is written, it must be immediately visible to all readers.
Read-Heavy Traffic: Far more redirects (reads) occur than URL generations (writes).

A key-value store like DynamoDB is a natural fit. DynamoDB provides horizontal scalability, high performance, and optional strong consistency.

Data Model

URL Mapping

Field	Description
short_url	Primary key (8 bytes)
long_url	Indexed (100 bytes)
created_time	Timestamp (8 bytes)
created_by	User ID (optional, 20 bytes, indexed)
expiration_time	Timestamp (8 bytes)

Additional secondary indexes can support features like:

Listing all short URLs created by a user.
Expiring short URLs based on expiration_time.

Initial Insights

Caching: We could use Redis to cache frequently accessed short-long mappings, reducing database load.
URL Generation:
- We could use Base-62 encoding to generate compact, user-friendly short URLs.
- Implement collision handling for cases where the same short URL is generated more than once.

Twitter

Database Choice

Twitter’s data characteristics include:

Small Row Sizes: Individual tweets are small (e.g., ~512 bytes).
High Write Volume: Billions of tweets are generated daily.
Relational Query Needs: Queries like “list tweets by a user” require relationships.
Eventual Consistency: Strong consistency is not strictly necessary for tweets.

Given these requirements:

LSM-Based Databases (e.g., Cassandra): Ideal for write-heavy workloads and horizontal scaling.
B-Tree-Based Databases (e.g., MySQL, MongoDB): Better for relational queries but may face scalability challenges with high write volumes.

A compromise might be a document-oriented database like MongoDB. MongoDB is more horizontally scalable than relational databases, supports eventual consistency to enhance scalability, and offers better relational query capabilities than Cassandra.

Data Models

Tweet

Column	Description
tweet_ID	Primary key
created_by	User ID (indexed)
posted_time	Timestamp (indexed)
content	Text content of the tweet
number_of_likes	Counter for likes
hashtags	List of hashtags associated with the tweet
users_mentioned	List of user IDs mentioned in the tweet

User

Field	Description
user_ID	Primary key
email	Indexed
name	Full name

Hashtag

Field	Description
hashtag_ID	Primary key
email	Indexed
name	Full name
nickname	Display name

The Hashtag model uses denormalization to optimize the core functionality: mapping hashtags to associated tweets. In a relational database, this would require a foreign key relationship and query through the tweets table, whereas MongoDB’s schema flexibility enables faster lookups at the cost of managing consistency (e.g., removing tweet IDs from the hashtag list when tweets are deleted).

Follows Relationship

The Follows model is normalized for efficient querying:

Field	Description
follower	User ID of the follower
followed	User ID of the user being followed
timestamp	Time of follow action

This separation allows efficient queries for both:

Listing users a given user is following.
Listing followers of a given user.

Conclusion

It can be tempting to list every possible field and model exhaustively. However, this approach often wastes valuable time on details that may not significantly impact the final design. Instead, focus on gaining insights into how the data should be partitioned and normalized/denormalized to optimize queries for the system's specific needs. The goal is to understand the data flow and access patterns, enabling you to make design decisions that enhance performance and scalability without getting bogged down in unnecessary details.

Done

Course

Tackling System Design Interview Problems

Lessons

Database Design

Introduction

Short URL Service

Database Choice

Data Model

Twitter

Database Choice

Data Models

Conclusion

1/8