Design Twitter - System Design

System requirements

Functional:

List functional requirements for the system...

Non-Functional:

List non-functional requirements for the system...

Capacity estimation

QPS read : 50k

QPS write : 30k

Extra QPS for spiky traffic : 10k

size of one record = 40 bytes

TTL = 2 years

Total capacity for 2 years = 40 * 30000 * 3600 * 24 * 365 * 2 ≈ 76 TB (about 38 TB per year)
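The storage arithmetic can be checked with a few lines of Python. This is only a sketch using the figures listed above (40-byte records, 30k writes per second, 2-year TTL); it assumes decimal units (1 TB = 10**12 bytes) and ignores replication and index overhead, which the notes do not quantify.

```python
# Raw storage estimate from the capacity figures above.
RECORD_BYTES = 40
WRITE_QPS = 30_000
SECONDS_PER_YEAR = 3600 * 24 * 365
TTL_YEARS = 2

bytes_per_year = RECORD_BYTES * WRITE_QPS * SECONDS_PER_YEAR
total_bytes = bytes_per_year * TTL_YEARS

print(f"per year: {bytes_per_year / 1e12:.1f} TB")  # per year: 37.8 TB
print(f"over TTL: {total_bytes / 1e12:.1f} TB")     # over TTL: 75.7 TB
```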

Instance size estimation

QPS : 90k (50k read + 30k write + 10k spike headroom)

2k QPS per cluster

90k / 2k = 45 clusters

45 * 3 * 6 = 810 cores
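The sizing numbers work out as follows. In this sketch, reading the "3" as nodes per cluster and the "6" as cores per node is an assumption; the notes only give the product.

```python
import math

READ_QPS, WRITE_QPS, SPIKE_QPS = 50_000, 30_000, 10_000
QPS_PER_CLUSTER = 2_000
NODES_PER_CLUSTER = 3  # assumption: interpretation of the "3" in 45*3*6
CORES_PER_NODE = 6     # assumption: interpretation of the "6" in 45*3*6

total_qps = READ_QPS + WRITE_QPS + SPIKE_QPS
clusters = math.ceil(total_qps / QPS_PER_CLUSTER)
cores = clusters * NODES_PER_CLUSTER * CORES_PER_NODE

print(total_qps, clusters, cores)  # 90000 45 810
```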

Replication : cross-DC replication

Consistency : we need local read-after-write consistency, so we can use a regional read-after-write strategy. This is eventually consistent, not strongly consistent.

Hot shards : we will need to monitor load distribution at the shard and partition level. If the load is unevenly distributed, we may benefit from caching the most frequently requested URLs. Caching has its own disadvantages, so we will also need to monitor the cache hit ratio.
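One simple way to turn this monitoring into a check is sketched below. The function names and the "2x the mean" threshold are illustrative assumptions, not part of the design; a real deployment would pick a threshold from observed traffic.

```python
from statistics import mean


def hot_shards(qps_by_shard: dict[str, float], threshold: float = 2.0) -> list[str]:
    """Return shards whose load exceeds `threshold` times the mean load."""
    if not qps_by_shard:
        return []
    avg = mean(qps_by_shard.values())
    if avg <= 0:
        return []
    return sorted(s for s, qps in qps_by_shard.items() if qps > threshold * avg)


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0
```

For example, with shard loads of 100, 100, 100 and 900 QPS the mean is 300, so only the 900 QPS shard is flagged.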

API design

UpsertUserActivity(

entity_uuid uuid.UUID (required),

event_type string (required),

event_timestamp datetime,

tweetID int (required)

) (error)

GetUserActivity(

entity_uuid uuid.UUID (required),

event_type string (optional)

) (activities, error)
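To make the two calls concrete, here is a minimal in-memory sketch in Python. The class and method names are hypothetical, the dict stands in for the real database, and defaulting a missing event_timestamp to the current time is an assumption.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Activity:
    entity_uuid: uuid.UUID
    event_type: str
    event_timestamp: datetime
    tweet_id: int


class InMemoryUserActivityStore:
    """Hypothetical stand-in for the storage behind the userActivity service."""

    def __init__(self) -> None:
        # Keyed by (entity_uuid, event_type, timestamp), so a repeated
        # write with the same key overwrites the earlier row (upsert).
        self._rows: dict[tuple[uuid.UUID, str, datetime], Activity] = {}

    def upsert_user_activity(
        self,
        entity_uuid: uuid.UUID,
        event_type: str,
        tweet_id: int,
        event_timestamp: Optional[datetime] = None,
    ) -> None:
        if not event_type:
            raise ValueError("event_type is required")
        # Assumption: a missing timestamp defaults to "now" in UTC.
        ts = event_timestamp or datetime.now(timezone.utc)
        key = (entity_uuid, event_type, ts)
        self._rows[key] = Activity(entity_uuid, event_type, ts, tweet_id)

    def get_user_activity(
        self, entity_uuid: uuid.UUID, event_type: Optional[str] = None
    ) -> list[Activity]:
        rows = [
            a
            for a in self._rows.values()
            if a.entity_uuid == entity_uuid
            and (event_type is None or a.event_type == event_type)
        ]
        # Newest first, which is the order a timeline would want.
        return sorted(rows, key=lambda a: a.event_timestamp, reverse=True)
```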

Database design

schema

CREATE TABLE user_activity(

entity_uuid uuid.UUID (8),

event_type string (8),

tweetID long long int (8),

timestamp DateTime (8)

) PRIMARY KEY ((entity_uuid), event_type, timestamp)

Local index on (event_type, timestamp)

Local index on timestamp

Max number of rows per partition : 10 events per day * 365 days * 2 years = 7,300 < 10k, which is within the db limit

Redis cache :

The key will be sessionID and the value will be the timeline feed

High-level design

Clients interact with the presentation service.

The presentation service talks to the authentication service, the friends service, and the userActivity service, which is the closest to the database layer.

The presentation service exposes:

addUser
addFriend
addTweet
addLike
RetrieveTimeline

The userActivity service has 2 endpoints, one that writes data to the db and one that reads it.

For RetrieveTimeline, we cache the timeline in a Redis cache.

Request flows

Detailed component design

For the database :

Our choice of schema limits the max number of rows per partition, which is why it was chosen, but a single timeline request then makes 10 parallel calls to the db. We will add a cacheFront and monitor the cache hit ratio.
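The parallel fan-out for one timeline request could look roughly like the sketch below. It assumes one db call per partition key and that each call returns (timestamp, tweetID) rows sorted newest first; `build_timeline` and `fetch` are hypothetical names, and `fetch` stands in for the real db read.

```python
import heapq
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Row = tuple[int, str]  # (timestamp, tweetID); shape is an assumption


def build_timeline(
    partition_keys: Sequence[str],
    fetch: Callable[[str], list[Row]],
    limit: int = 20,
) -> list[Row]:
    """Read all partitions in parallel and merge them newest first."""
    workers = max(1, len(partition_keys))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = list(pool.map(fetch, partition_keys))
    # heapq.merge with reverse=True expects each input sorted newest first.
    merged = heapq.merge(*per_partition, key=lambda row: row[0], reverse=True)
    return list(itertools.islice(merged, limit))
```

With 10 keys this issues the 10 reads concurrently, so the request latency is close to the slowest single read rather than the sum of all ten.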

For the cache :

We use a Redis cache with cross-DC replication, lazy fallback, and a TTL of 1 minute. Failure scenarios to consider :

The TTL is kept short (1 minute) so that a timeline request hits the backend at least once a minute and we can show near-real-time updates.

If Redis is down, a failover is in progress, or data has not been replicated yet, we query the db on a Redis miss. For stale data, the worst case is that we serve an outdated timeline.
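The read path with lazy fallback can be sketched as follows. To stay self-contained, this uses a small in-memory TTL cache instead of a real Redis client; `TTLCache`, `get_timeline`, and `load_from_db` are hypothetical names, and treating any cache error as a miss is an assumption about how failures are handled.

```python
import time
from typing import Callable, Optional

TIMELINE_TTL_SECONDS = 60  # the 1-minute TTL from the notes


class TTLCache:
    """In-memory stand-in for Redis; entries expire after their TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: dict[str, tuple[list, float]] = {}

    def get(self, key: str) -> Optional[list]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: list, ttl: float) -> None:
        self._data[key] = (value, self._clock() + ttl)


def get_timeline(session_id: str, cache, load_from_db: Callable[[str], list]) -> list:
    """Serve from cache when possible; fall back to the db on miss or error."""
    try:
        cached = cache.get(session_id)
    except Exception:
        cached = None  # cache down or failing over: treat as a miss
    if cached is not None:
        return cached
    timeline = load_from_db(session_id)
    try:
        cache.set(session_id, timeline, TIMELINE_TTL_SECONDS)
    except Exception:
        pass  # a failed cache write must not fail the request
    return timeline
```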

Trade offs/Tech choices

For schema : We could compact the schema and store the list of friends for every user in a single row. Then each timeline request needs only a single read call to the database. The issue is that an individual row would exceed 1 MB, which is not good practice.

For Redis : We could also use Redis to cache the most active users' data so that we do not have to hit the database. I chose cacheFront for this purpose because it gives a cleaner design, easier database integration, and one less component, and it makes the overall design more scalable. The compromise is latency, since we make 2 extra calls: one to userActivity and another to the cacheFront. We could also do both.

Failure scenarios/bottlenecks

If any component is down (the presentation service, the userActivity service, or the database), we cannot serve requests.

In case of sudden bursts of traffic, we will not serve all requests; we rate-limit at the service level so that the database is not affected.
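Service-level rate limiting could be done with a token bucket, sketched below. The notes do not name an algorithm, so the token bucket, the `TokenBucket` class, and its parameters are assumptions chosen for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_sec
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never above the bucket capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A request for which `allow()` returns False would be rejected before it reaches the userActivity service, which keeps the burst away from the database.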

Future improvements

We can explore ways to reduce the read latency.

We can handle spikes in traffic better.