System requirements


Functional:

  1. Tweet a message (possibly containing text and images).
  2. View the news feed (from the users who are actually followed).
  3. Follow or unfollow other users.
  4. Get suggestions on new topics and famous users to follow.


Non-Functional:

  1. Availability: the system should be highly available, as millions of users will visit it daily.
  2. Latency: 100 ms maximum response time for each API.
  3. Horizontal scalability to handle very large data volumes.
  4. Consistency: the data displayed should be consistent.
  5. Security: an authentication system should be deployed to protect user login and personal details.
  6. Monitoring: real-time monitoring should be deployed so that developers are notified whenever a node goes down or there is a performance issue.



Capacity estimation

We are estimating data for approximately 1 billion users.

For each user, we need to store the following information:

user_id, name, dob, profile picture.

Assuming up to 100 characters (100 bytes) for each of the three string fields and 8 bytes for user_id, each record takes 308 bytes in total. For 1 billion users this comes to about 308 GB, or roughly 0.3 TB.


Apart from this, for each user we need to store the list of users they follow. This forms a directed graph, for which we can use an adjacency list. Assuming we only need user_id in the list and each user follows at most 1000 people, that is up to 8 KB per user, or about 8 TB for 1 billion users.
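The multiplication behind these two estimates can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the field sizes and user counts are the assumptions stated above, decimal units (1 TB = 10^12 bytes) are assumed, and the variable names are made up for illustration.

```python
# Back-of-the-envelope storage estimate. Every input is an assumption
# taken from the text above, not a measured value.
USERS = 1_000_000_000      # 1 billion users
USER_ID_BYTES = 8          # size of one user_id
STRING_BYTES = 100         # assumed size of each string field
STRING_FIELDS = 3          # name, dob, profile picture
MAX_FOLLOWS = 1_000        # upper bound on followed users per user

GB = 10**9                 # decimal units
TB = 10**12

# Profile table: one fixed-size record per user.
profile_bytes_per_user = USER_ID_BYTES + STRING_FIELDS * STRING_BYTES
profile_total = USERS * profile_bytes_per_user

# Follow graph: adjacency list holding one user_id per followed user.
follow_bytes_per_user = MAX_FOLLOWS * USER_ID_BYTES
follow_total = USERS * follow_bytes_per_user

print(profile_bytes_per_user)   # 308
print(profile_total / GB)       # 308.0
print(follow_total / TB)        # 8.0
```

Real storage will be larger once indexes, replication, and per-record overhead are included; the sketch only covers the raw payload.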

Apart from this, there should be a database of tweets. Each tweet can have the following information:

user_id, post_time, (HTML) text, image

For images, we can store their paths in the database with the actual images kept in a file system, or we can store everything in the file system in a structured manner.

I am also considering an Amazon S3 bucket, with Amazon Athena and AWS Glue for managing this data. It can be about 1000 TB initially.

The data size can rise to roughly 0.7 GB (under 1 GB) per user after one year, assuming the user posts about 2 MB of images and text each day, which can be up to


API design

POST /tweets - the user can compose and share a new tweet.

GET /home_feed - the user can see their home page with the latest news feed.

POST /follow - the user can follow any other user.

DELETE /unfollow - the user can unfollow any other user.

GET /suggestions - the user can get suggestions of people to follow whom they may already know of, such as celebrities.
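To make the intended behaviour of these endpoints concrete, here is a minimal in-memory sketch. It is not a web server: each method stands in for one endpoint, and the class name `TwitterApi`, the method names, and the logical clock used for post_time are all assumptions made for illustration. GET /suggestions is left out because it depends on the analytics data.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Tweet:
    user_id: int
    post_time: int   # logical timestamp; a larger value means a newer tweet
    text: str


@dataclass
class TwitterApi:
    """Hypothetical in-memory stand-in for the API endpoints."""
    tweets: list = field(default_factory=list)
    following: dict = field(default_factory=dict)   # user_id -> set of followed ids
    _clock: count = field(default_factory=count)    # stand-in for real timestamps

    def post_tweet(self, user_id, text):             # POST /tweets
        tweet = Tweet(user_id, next(self._clock), text)
        self.tweets.append(tweet)
        return tweet

    def follow(self, user_id, target_id):            # POST /follow
        self.following.setdefault(user_id, set()).add(target_id)

    def unfollow(self, user_id, target_id):          # DELETE /unfollow
        self.following.get(user_id, set()).discard(target_id)

    def home_feed(self, user_id, limit=10):          # GET /home_feed
        # Collect tweets from followed users only, newest first.
        followed = self.following.get(user_id, set())
        feed = [t for t in self.tweets if t.user_id in followed]
        feed.sort(key=lambda t: t.post_time, reverse=True)
        return feed[:limit]


api = TwitterApi()
api.follow(1, 2)
api.post_tweet(2, "hello")
api.post_tweet(3, "not followed")
api.post_tweet(2, "second")
print([t.text for t in api.home_feed(1)])   # ['second', 'hello']
```

This builds the feed at read time by scanning all tweets, which is only reasonable for a sketch; a real system would need an index or a precomputed feed.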


Database design

To store user data, a distributed NoSQL database such as DynamoDB or MongoDB can be used for horizontal scaling and fast retrieval.

The information should be stored with the following schema:

user_id, user_name, password, name, dob, profile picture
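The schema can be sketched as a record type. One assumption goes beyond the schema as written: the password field is stored as a salted hash rather than plain text, which is the usual practice for an authentication system. The names `UserRecord`, `make_user`, and `verify_password`, the round count, and the sample values are hypothetical.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass

PBKDF2_ROUNDS = 100_000   # illustrative value, not a tuned recommendation


@dataclass
class UserRecord:
    user_id: int
    user_name: str
    password_hash: bytes    # assumption: only a salted hash is stored
    password_salt: bytes
    name: str
    dob: str
    profile_picture: str    # path to the image, not the image itself


def hash_password(password: str, salt: bytes) -> bytes:
    # PBKDF2-HMAC-SHA256 from the standard library.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, PBKDF2_ROUNDS)


def make_user(user_id, user_name, password, name, dob, profile_picture):
    salt = os.urandom(16)   # fresh random salt per user
    return UserRecord(user_id, user_name, hash_password(password, salt),
                      salt, name, dob, profile_picture)


def verify_password(user: UserRecord, password: str) -> bool:
    # Constant-time comparison avoids leaking how many bytes matched.
    candidate = hash_password(password, user.password_salt)
    return hmac.compare_digest(user.password_hash, candidate)


user = make_user(1, "alice", "s3cret", "Alice", "1990-01-01", "/img/1.png")
print(verify_password(user, "s3cret"))   # True
print(verify_password(user, "wrong"))    # False
```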

To store the user follow graph, we can use Neo4j, a graph database.

Each node will contain the user_id of the corresponding user.
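The follow graph is directed: an edge from A to B means A follows B. The sketch below uses in-memory adjacency sets in place of a graph database to show the operations the design needs. The class name `FollowGraph` and the friends-of-friends ranking used for suggestions are assumptions for illustration, not something the design prescribes.

```python
from collections import Counter, defaultdict


class FollowGraph:
    """Hypothetical in-memory directed graph keyed by user_id."""

    def __init__(self):
        self.following = defaultdict(set)   # user_id -> users they follow
        self.followers = defaultdict(set)   # user_id -> users who follow them

    def follow(self, src, dst):
        if src == dst:
            return                          # ignore self-follows
        self.following[src].add(dst)
        self.followers[dst].add(src)

    def unfollow(self, src, dst):
        self.following[src].discard(dst)
        self.followers[dst].discard(src)

    def suggest(self, user_id, limit=3):
        # Score each candidate by how many of the user's followees follow them.
        mine = self.following.get(user_id, set())
        scores = Counter()
        for friend in mine:
            for candidate in self.following.get(friend, set()):
                if candidate != user_id and candidate not in mine:
                    scores[candidate] += 1
        # Highest score first; ties broken by user_id for a stable order.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [uid for uid, _ in ranked[:limit]]


g = FollowGraph()
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]:
    g.follow(src, dst)
print(g.suggest(1))             # [4, 5]
print(sorted(g.followers[4]))   # [2, 3]
```

Keeping both directions makes "who follows me" as cheap as "whom do I follow", at the cost of storing each edge twice.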

To store posts, Amazon S3 can be used along with AWS Glue and Amazon Athena for structuring CRUD operations.


High-level design

Since multiple services are needed, we can go with a microservices architecture.

The required services are:

  1. User authentication
  2. Post sharing
  3. Follow/unfollow users
  4. User suggestions

At a high level, a client issues the above requests to a server, which talks to the database and responds to the client.




Request flows

There will be a load balancer between the client and the API gateway. Load balancing should be applied at the geographic, region, and rack levels.

The API gateway will be a RESTful server distributed globally.

The gateway should route each request to the server responsible for that operation. The servers themselves should also be distributed for fault tolerance.
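The routing step can be sketched as a lookup table from (method, path) to a service, plus a round-robin choice among that service's replicas. The service names, the replica names, and the decision to serve the home feed from the post-sharing service are assumptions made for this sketch.

```python
from itertools import cycle

# Assumed mapping of API endpoints to backend services.
ROUTES = {
    ("POST", "/tweets"): "post-sharing",
    ("GET", "/home_feed"): "post-sharing",
    ("POST", "/follow"): "follow",
    ("DELETE", "/unfollow"): "follow",
    ("GET", "/suggestions"): "user-suggestions",
}

# Hypothetical replica names; each service runs on several nodes.
REPLICAS = {
    "post-sharing": ["post-sharing-0", "post-sharing-1"],
    "follow": ["follow-0", "follow-1"],
    "user-suggestions": ["user-suggestions-0"],
}


class Gateway:
    def __init__(self, routes, replicas):
        self.routes = routes
        # One endless round-robin iterator per service.
        self._next_replica = {svc: cycle(nodes) for svc, nodes in replicas.items()}

    def route(self, method, path):
        """Return the replica that should handle the request, or None if unknown."""
        service = self.routes.get((method, path))
        if service is None:
            return None
        return next(self._next_replica[service])


gateway = Gateway(ROUTES, REPLICAS)
print(gateway.route("POST", "/follow"))       # follow-0
print(gateway.route("DELETE", "/unfollow"))   # follow-1
print(gateway.route("GET", "/unknown"))       # None
```

Round-robin ignores node health and load; a production gateway would combine it with health checks so that failed replicas are skipped.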

Each server will only talk to its own database, which is also distributed.

For faster retrieval, a cache should sit between each server and its database; Redis can be deployed for this.
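One common way to place a cache between a server and its database is the cache-aside pattern: read from the cache first, fall back to the database on a miss, then populate the cache. The sketch below uses a small in-process class with expiry as a stand-in for Redis, so it runs without a server; `TtlCache`, `read_through`, and `load_user` are hypothetical names.

```python
import time


class TtlCache:
    """In-process stand-in for a Redis-like cache with per-key expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}   # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]   # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)


def read_through(key, cache, load_from_db):
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value)
    return value


db_reads = []

def load_user(user_id):
    # Hypothetical database read; records each call so misses can be counted.
    db_reads.append(user_id)
    return {"user_id": user_id, "name": f"user-{user_id}"}


cache = TtlCache(ttl_seconds=60)
read_through(42, cache, load_user)
read_through(42, cache, load_user)
print(len(db_reads))   # 1
```

The expiry bounds how stale a cached record can be, which matters for the consistency requirement; writes would also need to update or invalidate the cached entry.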


There should be a distributed analytics service that records the number of likes, views, user engagement, and so on. A real-time streaming platform such as Kafka can be used for this, along with Amazon EMR for parallel processing. The analytics data should be stored in a separate database that the server can query for user-specific suggestions.
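The aggregation the analytics service performs can be illustrated on a small batch of events. In this sketch a plain list stands in for the event stream, and the event fields (`tweet_id`, `type`) and function names are assumptions; a real pipeline would run the same counting continuously over the stream.

```python
from collections import Counter


def aggregate_engagement(events):
    """Count events per (tweet_id, event_type) pair."""
    counts = Counter()
    for event in events:
        counts[(event["tweet_id"], event["type"])] += 1
    return counts


def top_tweets(counts, limit=2):
    """Rank tweets by total engagement across all event types."""
    totals = Counter()
    for (tweet_id, _event_type), n in counts.items():
        totals[tweet_id] += n
    # Highest total first; ties broken by tweet_id for a stable order.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tweet_id for tweet_id, _ in ranked[:limit]]


# A list stands in for the event stream in this sketch.
events = [
    {"tweet_id": 10, "type": "like"},
    {"tweet_id": 10, "type": "view"},
    {"tweet_id": 10, "type": "view"},
    {"tweet_id": 11, "type": "view"},
    {"tweet_id": 12, "type": "like"},
    {"tweet_id": 12, "type": "like"},
]
counts = aggregate_engagement(events)
print(counts[(10, "view")])   # 2
print(top_tweets(counts))     # [10, 12]
```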


For faster load times, a geographically distributed CDN must be placed in front of the system, with a mechanism for caching content from the databases.


For monitoring, the servers must use Apache ZooKeeper and send data to a separate distributed system, which should contain the logic for collecting different metrics and for sending alerts whenever a system goes down.
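The "node goes down" check usually comes down to heartbeats: each node reports in periodically, and a node that stays silent for longer than a timeout is treated as down. The sketch below shows only that logic and does not use the ZooKeeper API; the class name, the timeout value, and the explicit `now` arguments are assumptions that keep the example deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical liveness tracker based on heartbeat timestamps."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}   # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def down_nodes(self, now):
        # A node is considered down when its last heartbeat is older than the timeout.
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.heartbeat("node-a", now=0)
monitor.heartbeat("node-b", now=0)
monitor.heartbeat("node-a", now=8)
print(monitor.down_nodes(now=12))   # ['node-b']
```

An alerting component would poll `down_nodes` on a schedule and notify developers when the list is non-empty.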


Detailed component design

For scalability, the user database must be sharded.
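A simple way to shard by user is to hash the user_id and take the result modulo the number of shards. The sketch below assumes hash-based sharding on user_id with a fixed shard count; the design does not specify a sharding key or scheme, so both are assumptions, and `shard_for` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def shard_for(user_id: int, num_shards: int) -> int:
    """Map a user_id to a shard index in the range [0, num_shards)."""
    # A stable hash is used instead of Python's built-in hash() so the
    # mapping is identical across processes and restarts.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 8   # assumed shard count
placement = Counter(shard_for(uid, NUM_SHARDS) for uid in range(10_000))
print(sorted(placement))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Modulo sharding moves most keys when the shard count changes; consistent hashing is a common alternative when shards are added or removed often.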

The CDN must hold the static HTML, CSS, and JavaScript for the actual website, along with SVGs and images.

the





Trade offs/Tech choices

Since we are using DynamoDB, we might have lower availability; however, this will be mitigated through caching.

Also, using Amazon S3 for data management can compromise availability, depending on the SLA of the service.


Failure scenarios/bottlenecks



Future improvements