System requirements


Functional:

Users can:

  • CRUD tweets
  • share tweets
  • track updates of other users (newsfeed)
  • like tweets

User following, registration, auth, etc. are implied and out of scope for the purposes of this discussion



Non-Functional:

  • System should have low latency on reads (< 100ms)
  • Scale to support 300M active users generating a median of 5 tweets/day (1.5B daily), ~10 * 1.5B newsfeed requests/ day
  • Tweets are read heavy and system to be able to support high read throughput (10:1) (100:1 or even higher depending on whether we're referring to requests or tweets as the unit here)
  • Tweets can have spikes in "likes," meaning the system needs to handle high write throughput for popular tweets




Capacity estimation

Feed cache:

300M Active Users * 500 max tweets in cache/user * (200 chars/tweet + 8 char prof pic url + 15 char username) = 3 * 10^2*10^6*5*10^2 * 250=15*25*10^11 = (250+125)*10^11=375*10^11 = 37.5TB (in reality less, because we can also set cache entries to expire after a week, meaning most feed caches won't reach max capacity)




API design

POST /api/v1/tweet -> tweet id

{

text

sharing?: {tweet id}

}


PUT /api/v1/tweet/{tweet id}

{

text

}


PUT /api/v1/like/{tweet id}

{

like: bool

}


DELETE /api/v1/tweet/{tweet id}


GET /api/v1/feed/{next id} -> list of tweets along with tweet metadata (likes, user info)





Database design

Core entities:

  • Users
  • Tweets
  • Likes


Tweets table:

ID

Text

User ID

Shared Tweet ID


Likes table

Tweet ID

User ID





High-level design

Creating, Updating, Deleting tweets:

The tweet service will be a basic CRUD app. The postgres DB will be sharded by tweet UUID via consistent hashing.


Liking tweets:

"likes" can be routed to a Kafka queue to handle "hot" tweets. Spark streaming workloads can process likes in mini batches, partitioned by tweet ID. Spark will query the tweet service to see if the tweet ID exists, and if it does, write to the likes DB. The postgres likes DB will be sharded by tweet ID via consistent hashing. There's an unlikely, but still possible, race condition where Spark could check if a tweet exists, get a "yes" response, but then immediately after the tweet could be deleted while the likes db for that tweet is being written to. This should be relatively infrequent though, wasting a negligible amount of space. This situation could also be handled by running a periodic cleanup job.


Generating Feed:

The "feed" service returns a pre-populated, cached feed of up to 500 of the most recent tweets a user follows (self auto-followed), ordered by creation time. The service will store the 500 tweets in a Redis KV store and return 20 tweets at a time to a user as a paginated result set. Each result returned to the user will include an ID which will be used to fetch the next batch of 20 from the redis cache (if such a batch exists).


The feed cache itself will be populated in the background by Apache Flink workers that will consume tweets updates from CDC, identify followers of the tweet by querying the "follower" service, and push to their respective feed caches along with user and like info associated with each tweet queried from the user and likes services respectively. The cache content itself will use a Redis sorted set where tweets are sorted by the order the Flink worker pushed them to the cache, via a monotonic counter maintained on Flink corresponding to the order tweet updates were received








Request flows

CRUD Tweets:

  1. Client calls CRUD APIs
  2. Call passes through LB -> API GW -> Tweet Service
  3. Tweet Service processes API request and returns appropriate HTTP response to client.
  4. In background, Flink detects update via CDC and updates feed caches accordingly


Like Tweet:

  1. Client calls like API
  2. Call passes through LB -> API GW -> Kafka, then client returns 200 response
  3. Worker consumes Kafka event(s), and processes them in batches


Read Feed:

  1. Client calls feed API, with offset (unless initial query)
  2. Call passes through LB -> API GW -> Feed service
  3. Feed service queries Redis cache with offset (if applicable), and returns results along with next offset (if there are more results)



Detailed component design

Discussed in detail above.





Trade offs/Tech choices

Discussed in detail above.



Failure scenarios/bottlenecks

Discussed in detail above.





Future improvements

Discussed in detail above.