Capacity Estimation

Topics Covered

  • Introduction
  • Short URL Service
      • Traffic Assumptions
      • Storage Requirements
      • Insights for Design
  • Twitter
      • Traffic Assumptions
      • Storage Requirements
      • Insights for Design
  • Conclusion

Introduction

Capacity estimation can guide decisions about scalability, database design, and caching, but only if it contributes meaningfully to your design. Avoid wasting time on unnecessary or overly detailed estimations.

Instead, focus on areas where capacity metrics directly inform decisions. For example:

  • Database Scaling: Use capacity estimation to decide the number of shards or replicas needed based on query volume and storage growth.
  • Caching Requirements: Estimate traffic volumes and read-to-write ratios to determine cache size and placement.
  • Message Queue Configuration: Estimate events per second to configure tools like Kafka or RabbitMQ.
  • Load Balancer Sizing: Estimate peak RPS to configure load balancers effectively.

The key is to use capacity estimation strategically, focusing on areas that influence design decisions or address identified bottlenecks. Let’s explore capacity estimation through examples for the Short URL Service and Twitter.

Short URL Service

Traffic Assumptions

  • 200 RPS for creating short URLs.
  • 20,000 RPS for redirecting short URLs to long URLs.
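The two rates above already imply a read-to-write ratio and daily request volumes. A minimal sketch of that arithmetic follows; the variable names are illustrative, and the rates are treated as constant averages over the day, which is a simplifying assumption.

```python
SECONDS_PER_DAY = 60 * 60 * 24

write_rps = 200     # short URL creations per second (from the assumptions above)
read_rps = 20_000   # redirects per second (from the assumptions above)

# Reads per write: a heavily read-dominated workload favors caching and replicas.
read_write_ratio = read_rps // write_rps

# Daily volumes, assuming the rates hold as averages over a full day.
writes_per_day = write_rps * SECONDS_PER_DAY
reads_per_day = read_rps * SECONDS_PER_DAY

print(f"read:write ratio  = {read_write_ratio}:1")
print(f"new URLs per day  = {writes_per_day:,}")
print(f"redirects per day = {reads_per_day:,}")
```

A 100:1 ratio is the number that matters for design here: it signals that the read path (redirect lookups) deserves most of the optimization effort.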

Storage Requirements

Assume each URL mapping includes the following fields:

  • Long URL (100 bytes)
  • Short URL (8 bytes)
  • Creation time (8 bytes)
  • Created by user ID (20 bytes)
  • Expiration time (8 bytes)

Each entry totals 144 bytes, which we round up to 256 bytes for simplicity.

  • Daily Storage: 200 × 60 × 60 × 24 × 256 bytes ≈ 4.4 GB/day
  • Five-Year Storage: 4.4 GB/day × 365 × 5 ≈ 8 TB

Adding buffer for growth, we estimate 15 TB of total storage over five years.
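The storage arithmetic for the Short URL Service can be reproduced with a short script. This is a sketch under the assumptions stated in this section (200 writes per second, 256 bytes per record, decimal units where 1 GB = 10^9 bytes); the function and field names are illustrative.

```python
SECONDS_PER_DAY = 60 * 60 * 24

# Field sizes in bytes, as listed in this section.
FIELD_SIZES = {
    "long_url": 100,
    "short_url": 8,
    "created_at": 8,
    "created_by_user_id": 20,
    "expires_at": 8,
}
raw_record_bytes = sum(FIELD_SIZES.values())  # 144 bytes before rounding up
RECORD_BYTES = 256                            # rounded-up size used for estimates


def storage_estimate(write_rps: int, record_bytes: int, years: int) -> tuple[int, int]:
    """Return (bytes per day, total bytes over the given number of years)."""
    daily_bytes = write_rps * SECONDS_PER_DAY * record_bytes
    total_bytes = daily_bytes * 365 * years
    return daily_bytes, total_bytes


daily_bytes, five_year_bytes = storage_estimate(200, RECORD_BYTES, 5)
print(f"raw record size  : {raw_record_bytes} bytes")
print(f"daily storage    : {daily_bytes / 1e9:.1f} GB/day")
print(f"five-year storage: {five_year_bytes / 1e12:.1f} TB")
```

Keeping the raw and rounded record sizes separate makes it easy to see how much of the final figure comes from rounding rather than from actual data.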

Insights for Design

  • A single relational database like MySQL or PostgreSQL could handle this scale with read replicas for redirect lookups.
  • If traffic grows beyond expectations, a distributed NoSQL database like DynamoDB or Cassandra may be required.
  • Caching frequently accessed redirects in Redis could significantly reduce database load.
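To illustrate how caching and read replicas interact, the sketch below estimates how many replicas would be needed to serve the redirect traffic that misses the cache. The cache hit percentages and the per-replica query capacity are hypothetical placeholders, not figures from this section; replace them with measured values for a real system.

```python
import math


def replicas_needed(read_rps: int, cache_hit_percent: int, per_replica_qps: int) -> int:
    """Estimate read replicas required for traffic that misses the cache.

    cache_hit_percent and per_replica_qps are assumptions supplied by the
    caller; neither is derived from the traffic numbers themselves.
    """
    miss_rps = read_rps * (100 - cache_hit_percent) / 100
    return max(1, math.ceil(miss_rps / per_replica_qps))


READ_RPS = 20_000          # redirect rate from the traffic assumptions
PER_REPLICA_QPS = 2_000    # hypothetical capacity of one read replica

for hit_percent in (0, 80, 90):  # hypothetical cache hit rates
    count = replicas_needed(READ_RPS, hit_percent, PER_REPLICA_QPS)
    print(f"cache hit rate {hit_percent:>2}% -> {count} replica(s)")
```

Under these placeholder numbers, a cache in front of the database changes the replica count by an order of magnitude, which is the kind of result that justifies doing the estimate at all.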

Twitter

Traffic Assumptions

  • 500M DAU (Daily Active Users)
  • Each user tweets twice per day, resulting in 1B tweets/day.
  • Each user views 100 tweets/day on average.
  • At peak, 20% of DAU (~100M users) interact with the system simultaneously.
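The assumptions above can be converted into daily totals and average request rates. The sketch below does only that conversion; it assumes traffic is spread evenly across the day, so real peak rates would be higher than these averages. Variable names are illustrative.

```python
SECONDS_PER_DAY = 60 * 60 * 24

dau = 500_000_000            # daily active users
tweets_per_user = 2          # tweets posted per user per day
views_per_user = 100         # tweets viewed per user per day
peak_percent = 20            # share of DAU active at the same time at peak

tweets_per_day = dau * tweets_per_user
views_per_day = dau * views_per_user
peak_concurrent_users = dau * peak_percent // 100

# Average rates only; these are not peak values.
avg_write_rps = tweets_per_day / SECONDS_PER_DAY
avg_read_rps = views_per_day / SECONDS_PER_DAY

print(f"tweets per day        = {tweets_per_day:,}")
print(f"tweet views per day   = {views_per_day:,}")
print(f"average write rate    = {avg_write_rps:,.0f} tweets/s")
print(f"average read rate     = {avg_read_rps:,.0f} views/s")
print(f"peak concurrent users = {peak_concurrent_users:,}")
```

As with the Short URL Service, reads outnumber writes (here by 50 to 1), which again points toward caching and read-optimized storage.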

Storage Requirements

Each tweet includes:

  • Tweet ID (8 bytes)
  • Created by (user ID, 20 bytes)
  • Posted time (8 bytes)
  • Content (140 bytes, rounded up to 256 bytes with metadata)
  • Media link (optional, average 100 bytes)
  • Number of likes (8 bytes)
  • Hashtags (variable, average 64 bytes)
  • Users mentioned (variable, average 64 bytes)

We assume each tweet requires 512 bytes on average.

  • Daily Storage: 1 billion tweets/day × 512 bytes = 512 GB/day
  • Two-Year Storage: 512 GB/day × 365 × 2 ≈ 374 TB

Adding buffer for growth, we estimate 512 TB of storage for two years.
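The tweet storage arithmetic can be checked the same way. This sketch assumes 1 billion tweets per day at 512 bytes each and uses decimal units (1 TB = 10^12 bytes); the names are illustrative.

```python
TWEETS_PER_DAY = 1_000_000_000
TWEET_BYTES = 512           # assumed average size per tweet
YEARS = 2
BUDGET_TB = 512             # storage estimate including growth buffer

daily_bytes = TWEETS_PER_DAY * TWEET_BYTES
two_year_bytes = daily_bytes * 365 * YEARS

daily_gb = daily_bytes / 1e9
two_year_tb = two_year_bytes / 1e12
headroom = BUDGET_TB / two_year_tb   # how far the budget exceeds the raw estimate

print(f"daily storage    : {daily_gb:.0f} GB/day")
print(f"two-year storage : {two_year_tb:.2f} TB")
print(f"budget headroom  : {headroom:.2f}x")
```

In decimal units the two-year total comes to about 374 TB, so a 512 TB budget leaves a bit more than a third of extra capacity for growth.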

Insights for Design

  • At roughly 512 TB, the data set is too large for a single database node to hold comfortably, so a distributed database is required.
  • Tweets should be partitioned based on user ID to balance storage across nodes.
  • Cache popular tweets in Redis to optimize read performance for frequently viewed timelines.
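One simple way to partition tweets by user ID is to hash the ID and take the result modulo the number of shards. The sketch below shows that approach together with a rough shard count. The per-shard capacity of 4 TB is a hypothetical figure chosen for illustration, and `shard_for_user` is an illustrative helper, not part of any particular database's API.

```python
import hashlib
import math

TOTAL_STORAGE_TB = 512     # two-year estimate including buffer
SHARD_CAPACITY_TB = 4      # hypothetical usable capacity per shard

NUM_SHARDS = math.ceil(TOTAL_STORAGE_TB / SHARD_CAPACITY_TB)


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    is randomized per process for strings and would not be stable.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(f"shards needed: {NUM_SHARDS}")
for uid in ("user-1", "user-2", "user-3"):
    print(f"{uid} -> shard {shard_for_user(uid)}")
```

Plain modulo hashing remaps most keys whenever the shard count changes, so systems that expect to add shards often use consistent hashing or a fixed number of virtual partitions instead.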

Conclusion

It's not always necessary to perform capacity estimation at the beginning of the design process. Sometimes, it makes more sense to do it when the need arises, depending on the stage of your design. For example, if you're addressing a scalability challenge and deciding whether to introduce a new component or optimize an existing one, capacity estimation can provide the data needed to make an informed decision.