My Solution for Design Twitter with Score: 9/10
by cipher_lotus88
Requirements
- User Registration and Authentication
- Tweet posting and retweeting
- Following users for tweets
- Timeline of relevant tweets from users or favourites
- Favouriting tweets
- Searching tweets, users, or topics
- Notifications of favourites, retweets and followers
Traffic estimation
Twitter receives on average 6,000 tweets per second; the maximum ever per second was near 150,000 in 2013. Twitter has over 450 million active users.
The duration of a single request, I would estimate, is on average 150 ms.
API design
Users
Users are created, queried, updated, and deleted through the cognito service using lambdas.
Searching
Search requests for tweets and users are sent to elastic search, which has been indexed from the database.
Write-only API
Write operations include posting tweets, retweets, favorites, notifications, following, editing, and deleting.
Lambdas from a load balancer will handle requests to write data entities.
Write operations asynchronously get put in a SQS messaging queue.
The SQS lambda will ensure write requests are sequenced and inserted into a DynamoDB.
Average Concurrency = (6000/tweets per second) * (0.1 seconds a tweet) = 600 concurrent executions
600 concurrent lambda execution environments is below the default of 1000, which leaves room before scaling for peaks.
Error handling goes to cloudwatch monitors.
An SNS queue will publish the result or error.
Read-only API
Read operations include user profiles, timelines, newsfeeds, settings, and analytics.
An Elixir Phoenix server on EC2 will serve read-only API requests from the load balancer.
The server leverages the fault tolerance and scalability of the Erlang VM to provide vertical scaling.
Autoscaling groups will handle horizontal scaling. Cloudwatch metrics and alarms for EC2 will provide scale-out and scale-in policies.
Dynamodb will serve as the data store for the read-only queries.
Error handling goes to cloudwatch monitors.
Real-time API
Real-time operations include notifications and data changes.
Real-time data will reuse the Elixir Phoenix websockets on the read-only server.
The real-time server will provide endpoints for the load balancer.
The real-time server will listen to an SNS service for various topics and send them to connected clients.
The Erlang VM will handle vertical scaling and fault tolerance.
The number of current connections will be reported to cloudwatch monitors for autoscaling.
Errors will be sent to CloudWatch.
When a user is connected to the real-time server, new tweets appear in their timeline automatically, and notifications are delivered instantly.
When a user is not connected to the real-time server, they must manually refresh.
Database design
Data models: Tweet, Retweet, Favorites, Users, Followers, and Notifications (see diagrams)
Storage will occur in DynamoDB instances to leverage built-in features like caching, storage optimization, and scaling.
Media storage will occur in an S3 bucket.
High-level architecture
AWS will be the primary cloud provider on cost-saving plans. Services used: Cognito, Application Load Balancer, Lambda, SNS, SQS, DynamoDB, DAX, ElasticSearch, Comprehend, Route53, EC2, and IAM.
Route 53 will be used for domain names, certificates, and CDNs.
AWS Load Balancer will be used as an access point to regional services.
DynamoDB will be the data store, with DAX for caching and streams for events.
ElasticSearch and Comprehend provide extra search capabilities.
Cloudwatch metrics and alarms will be used to optimize autoscaling, throttling, and rate-limiting at peak and quiet times.
System services will be under a private VPN.
EC2 instances will provide read-only and real-time functionality.
Lambdas will provide write-only and event-handling functionality.
Geographically, AWS regions will allow availability, and Route 53 CDNs will be distributed to the best region.
Request flows
Client to Server
1) User interface creates http post request
2) The client resolves TLS and the DNS from Route 53
3) The client sends the request
4) The application load balancer for the clients region receives the request
5) The load balancer forwards to the appropriate service
Post Tweet
1) The request is sent to a lambda
2) The lambda forwards the request as an SQS Event for saving
3) The lambda publishes the data change event over SNS
4) The lambda responds success to the client
Post Tweet Success
1) A lambda receives the tweet SQS event
2) any media files are saved to S3 storage
3) the data model is created and inserted into DynamoDB
Post Tweet AI
1) A lambda receives a data change from a DynamoDB Stream
2) Comprehend analyzes the tweet
3) The comprehend lambda publishes the result back to DynamoDB
Post Tweet Search
1) A lambda receives a data change from a DynamoDB Stream
2) The content is indexed and sent to ElasticSearch
Component design
Authorization
AWS Cognito will handle user registrations, logins, logouts, 2FA, and session timeouts via Lambda functions.
The load balancer integrates with cognito to provide user authentication to services automatically.
Accounts in cognito will be associated to a User table in DynamoDB by id.
Error handling will go to CloudWatch for monitoring.
An SNS queue will publish new users.
Caching
Dynamodb Accelerator (DAX) will provide query caching for user-specific queries.
This will allow efficient, complex queries for timelines and retweets.
The accelerator will also automatically handle cache updates with dynamodb changes.
Search
Elastic Search will provide search functionality for DynamoDB.
The load balancer will route the search endpoints to elastic search.
Streaming will be enabled on DynamoDB tables for data entities that should be searchable.
A DynamoDB Stream Lambda function will transform DynamoDB data streams for Elastic Search and keep the index updated.
Errors will be sent to CloudWatch.
AI
Each tweet will be analyzed for information that is used for timeline's and searching.
A DynamodDB stream event will notify a lambda to perform analysis.
Trade offs/Tech choices
AWS Lambda
The tweet lambdas will be asynchronous using a message queue, meaning the requests per second will be within expectations but can still scale when needed.
Elixir Phoenix
The Erlang virtual machine that Elixir uses is proven to be highly fault-tolerant and scalable on single servers (whatsapp).
The syntax of Elixir is quite like that of Ruby, which is good for a transfer of skills.
Elixir being a pure functional language, avoids bugs related to state that are often not necessary in the context of an idempotent API.
Read vs Write API
Delegating write requests to separate infrastructure allows for better handling of requests per second and the duration of requests.
Delegating read requests to a separate infrastructure allows for greater data integrity and availability.
DynamoDB NoSQL database
Chosen for built-in scaling, caching, and good performance on writes.
Failure scenarios
Post Tweet Error
1) CloudWatch logs are updated
2) The error is published to an SNS service
3) The client's real-time connection displays the error and actions to resolve it
Future improvements
S3 to ELB Storage
Storing media files in S3 provides simplicity, but it might become cheaper to store them in EBS and provide access with links through the read-only API.