Database Design

Topics Covered

Introduction

Short URL Service

Database Choice

Data Model

Twitter

Database Choice

Data Models

Conclusion

Choosing the right database is one of the most critical steps in system design, as it directly impacts scalability, performance, and maintainability. The choice begins with capacity estimation and understanding the system's use cases. For small-scale systems, relational databases like PostgreSQL or MySQL often suffice due to their robust query capabilities and strong consistency. However, for larger-scale systems handling significant data volumes (e.g., 1 TB to 100 TB) or high write throughput, distributed NoSQL databases like Cassandra, DynamoDB, or Bigtable may be necessary.

Equally important is data modeling, which determines how data is structured and accessed. Data modeling can help identify optimal partitioning keys, normalization, denormalization opportunities, and trade-offs between performance and storage. The level of detail invested in data modeling should be proportional to its impact on the design.

Database Choice

The primary data model for a URL shortener is a mapping between short and long URLs. This system has:

  • Simple Queries: Most operations involve basic lookups or inserts.
  • Strong Consistency Requirements: Once a mapping is written, it must be immediately visible to all readers.
  • Read-Heavy Traffic: Far more redirects (reads) occur than URL generations (writes).

A key-value store like DynamoDB is a natural fit. DynamoDB provides horizontal scalability, high performance, and optional strong consistency.

Data Model

URL Mapping

FieldDescription
short_urlPrimary key (8 bytes)
long_urlIndexed (100 bytes)
created_timeTimestamp (8 bytes)
created_byUser ID (optional, 20 bytes, indexed)
expiration_timeTimestamp (8 bytes)

Additional secondary indexes can support features like:

  • Listing all short URLs created by a user.
  • Expiring short URLs based on expiration_time.

Initial Insights

  • Caching: We could use Redis to cache frequently accessed short-long mappings, reducing database load.
  • URL Generation:
    • We could use Base-62 encoding to generate compact, user-friendly short URLs.
    • Implement collision handling for cases where the same short URL is generated more than once.

Database Choice

Twitter’s data characteristics include:

  • Small Row Sizes: Individual tweets are small (e.g., ~512 bytes).
  • High Write Volume: Billions of tweets are generated daily.
  • Relational Query Needs: Queries like “list tweets by a user” require relationships.
  • Eventual Consistency: Strong consistency is not strictly necessary for tweets.

Given these requirements:

  1. LSM-Based Databases (e.g., Cassandra): Ideal for write-heavy workloads and horizontal scaling.
  2. B-Tree-Based Databases (e.g., MySQL, MongoDB): Better for relational queries but may face scalability challenges with high write volumes.

A compromise might be a document-oriented database like MongoDB. MongoDB is more horizontally scalable than relational databases, supports eventual consistency to enhance scalability, and offers better relational query capabilities than Cassandra.

Data Models

Tweet

ColumnDescription
tweet_IDPrimary key
created_byUser ID (indexed)
posted_timeTimestamp (indexed)
contentText content of the tweet
number_of_likesCounter for likes
hashtagsList of hashtags associated with the tweet
users_mentionedList of user IDs mentioned in the tweet

User

FieldDescription
user_IDPrimary key
emailIndexed
nameFull name

Hashtag

FieldDescription
hashtag_IDPrimary key
emailIndexed
nameFull name
nicknameDisplay name

The Hashtag model uses denormalization to optimize the core functionality: mapping hashtags to associated tweets. In a relational database, this would require a foreign key relationship and query through the tweets table, whereas MongoDB’s schema flexibility enables faster lookups at the cost of managing consistency (e.g., removing tweet IDs from the hashtag list when tweets are deleted).

Follows Relationship

The Follows model is normalized for efficient querying:

FieldDescription
followerUser ID of the follower
followedUser ID of the user being followed
timestampTime of follow action

This separation allows efficient queries for both:

  • Listing users a given user is following.
  • Listing followers of a given user.

It can be tempting to list every possible field and model exhaustively. However, this approach often wastes valuable time on details that may not significantly impact the final design. Instead, focus on gaining insights into how the data should be partitioned and normalized/denormalized to optimize queries for the system's specific needs. The goal is to understand the data flow and access patterns, enabling you to make design decisions that enhance performance and scalability without getting bogged down in unnecessary details.