System requirements
Functional:
Core functionality
- User should be able to login using handle/email and password
- User should be able to login using OAuth
- User should be able to follow other users
- User should be able to see the tweets of users they follow
- User should be able to favorite the tweets of other users
- User should be able to retweet the tweets of other users
- User should not be able to update a tweet
- User should be able to create a tweet (up to 140 characters of text, with optional media) in a secure way, where only they can create a tweet for their account.
- User should be able to delete a tweet
- User should get notifications when users they follow create new tweets
- User should get notifications when they are followed
- User should be able to comment on a tweet
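The create-tweet rules above (140-character text, optional media, only the account owner may post) could be enforced with a check like the one below. This is a minimal sketch: `validate_tweet`, `MAX_TWEET_LENGTH`, the argument names, and the rule that a tweet needs either text or media are all assumptions made for illustration.

```python
from typing import List, Optional

MAX_TWEET_LENGTH = 140  # limit taken from the requirement above


def validate_tweet(author_id: int, session_user_id: int, text: str,
                   media_ids: Optional[List[int]] = None) -> List[str]:
    """Return a list of validation errors; an empty list means the tweet is acceptable."""
    errors = []
    # Only the authenticated owner may post on behalf of the account.
    if author_id != session_user_id:
        errors.append("only the account owner can create a tweet for this account")
    # Assumption: a tweet needs some text or at least one media item.
    if not text.strip() and not media_ids:
        errors.append("tweet must contain text or media")
    if len(text) > MAX_TWEET_LENGTH:
        errors.append(f"text exceeds {MAX_TWEET_LENGTH} characters")
    return errors
```

Returning a list of errors rather than raising on the first one lets the API report every problem in a single response.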
Non-Core functionality
- User should be able to block users
- User should be able to update their profile information
- Handle, Name, Profile Image
Non-Functional:
Scaling and Performance
- System should be able to handle peak loads of hundreds of thousands of requests per second
- Response time for API requests should be under 200ms
- System should avoid bottlenecks when creating and reading the user feed, as this is the main experience
- System should use a scalable database design that allows fast fetching of data when needed.
- System should use cache to avoid hitting the database for reads
- System should allow for asynchronous population of tweets to avoid bottlenecks on writes
- System should consider fault tolerance, replication and disaster recovery.
Data Storage
- Data storage should be scalable with replication and sharding.
- Due to the nature of the relationships between users and tweets, a relational database like Postgres or MySQL will allow for better querying.
- Applications should use separate read and write nodes to avoid bottlenecks.
- System should save media to an Object Store
- Sharding for Tweets table should be based on the user_id as the key
- Sharding for Followers table should be based on the follower_id as the key
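The sharding keys above can be mapped to a shard with a stable hash. The sketch below assumes a fixed shard count and simple hash-modulo routing; `NUM_SHARDS` and `shard_for` are hypothetical names, and the design above does not specify a routing scheme.

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count, for illustration only


def shard_for(key: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a sharding key (user_id or follower_id) to a shard index.

    A cryptographic hash is used so the result is stable across processes
    and spreads sequential ids evenly over the shards.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

With plain modulo routing, changing the shard count moves most keys to a different shard; consistent hashing is a common alternative when shards are added often.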
Real-Time Features
- When new tweets are created, they should be fetchable from the user's feed in near real time. WebSockets or WebRTC can be used to notify the user of new tweets.
Security Considerations
- System should be secure and use authentication to verify that tweets are being created by the correct user before they are processed, using auth tokens or JWT.
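The token check described above could look roughly like the following. This is not a full JWT implementation, only a sketch of the same idea (a signed payload with an expiry) using the standard library; `SECRET_KEY`, `issue_token`, and `verify_token` are hypothetical names, and a production system would more likely rely on an established JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET_KEY = b"change-me"  # hypothetical; load from a secrets manager in practice


def _sign(body: str) -> str:
    return hmac.new(SECRET_KEY, body.encode("utf-8"), hashlib.sha256).hexdigest()


def issue_token(user_id: int, ttl_seconds: int = 3600,
                now: Optional[float] = None) -> str:
    """Create a signed token carrying the user id and an expiry time."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": now + ttl_seconds})
    body = base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")
    return f"{body}.{_sign(body)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[int]:
    """Return the user id if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        body, signature = token.rsplit(".", 1)
    except ValueError:
        return None  # no separator, so not a token we issued
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(signature.encode("utf-8"), _sign(body).encode("utf-8")):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] < now:
        return None
    return claims["sub"]
```

The create-tweet handler would then compare the id returned by `verify_token` with the account the tweet is being posted to before publishing the event.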
Cache
- System should use a CDN to deliver media quickly in the user's regions
- System should use a cache like Redis or Memcached to store application data in memory, with an LRU eviction policy to make the most of the available capacity
- Sharding in the same way as the database
- Tweets by user_id (key)
- Followers by follower_id (key)
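To make the LRU eviction mentioned above concrete, here is a small in-process sketch. `LRUCache` is a hypothetical class for illustration; in the design itself eviction would be handled by the cache server (Redis, for example, offers an approximated LRU through its `maxmemory-policy allkeys-lru` setting).

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry
```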
Monitoring & Alerting
- Alerts on non-200 requests (40x, 50x errors)
- Alerts on Cache capacity
- Dashboard
- Monitor throughput vs errors
- Monitor CPU usage
- Monitor Memory Usage
Capacity estimation
- Target user base of 10 million users, with 2,000,000 to 3,000,000 active daily users (ADU)
- Average of 20 read requests per second (minimum)
- At peak, request volume could be roughly 3x the ADU (about 9m)
- 15-25% of those users will post on a daily basis
- ~300,000 tweets per avg day created
- ~ 1m tweets per day at peak
- ~3.5 tweets created per second on average (300,000 / 86,400 seconds), ~12 per second at peak
- An average tweet with 140 characters and metadata could come to about 1 KB per tweet
- ~0.3 GB per day, or about 110 GB per year
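The arithmetic behind these estimates can be reproduced in a few lines. The sketch below uses the lower bound of 2,000,000 active daily users with 15% posting once per day; the one-tweet-per-poster figure and the variable names are assumptions made for the calculation.

```python
SECONDS_PER_DAY = 86_400

daily_active_users = 2_000_000    # lower bound of the ADU range
posting_percent = 15              # lower bound of the 15-25% range
tweets_per_poster = 1             # assumption: one tweet per posting user per day
peak_tweets_per_day = 1_000_000   # peak-day figure from the estimate above
bytes_per_tweet = 1_000           # ~1 KB of text plus metadata

tweets_per_day = daily_active_users * posting_percent // 100 * tweets_per_poster
avg_tweets_per_second = tweets_per_day / SECONDS_PER_DAY
peak_tweets_per_second = peak_tweets_per_day / SECONDS_PER_DAY

storage_gb_per_day = tweets_per_day * bytes_per_tweet / 1e9
storage_gb_per_year = storage_gb_per_day * 365

print(f"{tweets_per_day:,} tweets/day")
print(f"{avg_tweets_per_second:.1f} tweets/s average, {peak_tweets_per_second:.1f} tweets/s at peak")
print(f"{storage_gb_per_day:.1f} GB/day, {storage_gb_per_year:.1f} GB/year")
```

These figures cover tweet text and metadata only; media stored in the object store would need a separate estimate.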
API design
Login/Authentication
- /login
- /logout
- /oauth-callback -> redirect
Tweet Service (CRUD)
- /create
- /delete/:id
- /follow/:id
- /:user_id/favorite/:tweet_id
Feed Service
- /feed
Notification Service
- /:user_id/notifications
- /:user_id/notifications/mark_as_read/:notification_id
User Service
- /profile/:user_id/update
Database design
User
- user_id PK
- password (stored as a salted hash, not reversibly encrypted)
- created_at
- updated_at
- oauth_provider
- profile_id FK
Profile
- profile_id PK
- user_id FK
- description
- handle
- image_url
- hero_image_url
- created_at
- updated_at
Tweet
- tweet_id PK
- user_id FK
- content
- media_id
- aggregated_favorites
- aggregated_retweets
- aggregated_comments
- original_tweet_id FK (self-reference to Tweet, set for retweets)
- created_at
- updated_at
- deleted_at
Comments
- comment_id PK
- tweet_id FK
- user_id FK
- parent_comment_id FK (self-reference to Comments, set for replies)
- content
- aggregated_likes
- aggregated_favorites
- created_at
- updated_at
- deleted_at
Media
- media_id PK
- media_url
- media_type
- tweet_id FK
- created_at
- updated_at
- deleted_at
Notifications
- notification_id PK
- user_id FK
- follower_id
- tweet_id
- created_at
- updated_at
- read_at
Followers
- follower_id FK references user(user_id)
- following_id FK references user(user_id)
- PK (follower_id, following_id), so that one user can follow many accounts
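A reduced version of the Tweet and Followers tables, together with the feed query they are meant to support, might look like this. The sketch uses an in-memory SQLite database as a stand-in for MySQL or Postgres, keeps only a few columns, and names the table `users` because `user` is a reserved word in some databases; the composite primary key on `followers` and the helper names are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (user_id INTEGER PRIMARY KEY);
CREATE TABLE tweet (
    tweet_id   INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    content    TEXT NOT NULL,
    created_at TEXT NOT NULL,
    deleted_at TEXT
);
CREATE TABLE followers (
    follower_id  INTEGER NOT NULL REFERENCES users(user_id),
    following_id INTEGER NOT NULL REFERENCES users(user_id),
    PRIMARY KEY (follower_id, following_id)
);
"""

# Newest non-deleted tweets from everyone the given user follows.
FEED_QUERY = """
SELECT t.tweet_id, t.user_id, t.content
FROM tweet AS t
JOIN followers AS f ON f.following_id = t.user_id
WHERE f.follower_id = ? AND t.deleted_at IS NULL
ORDER BY t.created_at DESC
LIMIT ?
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the reduced schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def fetch_feed(conn: sqlite3.Connection, user_id: int, limit: int = 20) -> list:
    """Return (tweet_id, user_id, content) rows for the user's feed."""
    return conn.execute(FEED_QUERY, (user_id, limit)).fetchall()
```

Filtering on `deleted_at IS NULL` treats deletion as a soft delete, which matches the `deleted_at` columns in the tables above.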
High-level design
Read
- Client -> API Gateway
- Client -> CDN (Media Store)
- API Gateway -> ALB
- ALB -> App Servers
- App Servers -> (Cache Hit) -> Feed Cache (Redis w/ LRU)
- App Servers -> (Cache Miss) -> MySQL (Read Only)
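The cache hit and cache miss branches of the read path follow the cache-aside pattern, sketched below. A plain dictionary stands in for the Redis feed cache and `load_from_db` stands in for a query against the MySQL read replica; both, along with the function name `read_feed`, are placeholders for illustration.

```python
from typing import Callable, Dict, List


def read_feed(user_id: int,
              cache: Dict[int, List[int]],
              load_from_db: Callable[[int], List[int]]) -> List[int]:
    """Return the tweet ids in a user's feed, consulting the cache first."""
    feed = cache.get(user_id)
    if feed is not None:
        return feed                   # cache hit: no database access
    feed = load_from_db(user_id)      # cache miss: fall back to the read replica
    cache[user_id] = feed             # populate the cache for later reads
    return feed
```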
Create
- Client -> API Gateway
- API Gateway -> ALB
- ALB -> App Servers
- App Servers -> Pub/Sub
- Pub/Sub -> Spark Job
- Spark Job -> (Generate Feed Cache) -> Feed Cache (Redis w/ LRU)
- Spark Job -> (Update Database) -> MySQL (Write)
- Spark Job -> Notification Service
- Notification Service -> WebSocket
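The asynchronous part of the create path (the job that consumes the Pub/Sub event) amounts to a fan-out on write: store the tweet, push its id onto each follower's cached feed, and queue a notification. The sketch below keeps everything in memory to show the control flow only; `FanOutWorker`, `FEED_LIMIT`, and the dictionaries standing in for MySQL, Redis, and the notification service are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Set, Tuple

FEED_LIMIT = 800  # assumed cap on cached feed entries per user


class FanOutWorker:
    """Consumes tweet-created events and updates the stores in one pass."""

    def __init__(self, followers_of: Dict[int, Set[int]],
                 feed_limit: int = FEED_LIMIT) -> None:
        self.followers_of = followers_of           # author_id -> follower ids
        self.tweets: Dict[int, dict] = {}          # stand-in for the write database
        self.feeds: Dict[int, Deque[int]] = defaultdict(
            lambda: deque(maxlen=feed_limit))      # stand-in for the feed cache
        self.notifications: List[Tuple[int, int]] = []  # (follower_id, tweet_id)

    def handle_tweet_created(self, tweet_id: int, author_id: int, content: str) -> None:
        # 1. Persist the tweet.
        self.tweets[tweet_id] = {"user_id": author_id, "content": content}
        # 2. Push the tweet id to the front of each follower's feed; the deque
        #    drops the oldest entry once the cap is reached.
        # 3. Queue a notification per follower for delivery over the WebSocket.
        for follower_id in sorted(self.followers_of.get(author_id, ())):
            self.feeds[follower_id].appendleft(tweet_id)
            self.notifications.append((follower_id, tweet_id))
```

Fan-out on write keeps feed reads cheap, at the cost of one cache write per follower for every tweet, which becomes expensive for accounts with very large follower counts.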
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?