My Solution for Design Twitter Search with Score: 9/10
by utopia4715
System requirements
Functional:
- User can tweet
- Search tweets by text. The search term can be one word or multiple words; multiple words are combined with AND semantics.
- User can follow other users. Search result should prioritize tweets from the users being followed.
For now, I will ignore other features such as comments. I'll come back to them if I have time.
Non-Functional:
- Availability - It is a global service and should always be available.
- Response time - Each search should return results within 500 ms.
- Scalability - It's a massive service. Let's assume 500M DAU and 250M searches submitted every day.
- Fault-tolerance - In a distributed system, many things will fail: storage for the search index may fail, the network may experience errors. Tweets must not be lost, so they should be backed up safely. The search index can be regenerated, but at great cost, so it should be backed up too.
- Privacy - Which user is searching for which terms should not be revealed.
- Consistency - I think it is acceptable to relax the consistency requirement compared to ACID properties. For example, a user on the other side of the globe from a tweet's author may not see that tweet in search results for several minutes. That is acceptable.
Edge case to consider:
- A famous person tweets. We'd like to make sure this tweet shows up in everyone's search after a short amount of time.
Capacity estimation
500M DAU
250M searches daily
On average, ~3K search requests / s; assuming peak traffic is ~10x the average, ~30K search requests / s at peak
1B tweets per day
Each tweet takes 1KB of storage, including content and metadata.
1TB data / day
0.7PB in 2 years. With user growth and some excess capacity, we should plan for 1PB in 2 years.
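The numbers above can be sanity-checked with quick arithmetic. A minimal sketch, assuming peak traffic is roughly 10x the daily average (an assumption, not from the requirements):

```python
# Back-of-envelope check of the capacity numbers above.
SECONDS_PER_DAY = 86_400

searches_per_day = 250_000_000
tweets_per_day = 1_000_000_000
bytes_per_tweet = 1_000  # 1 KB including content and metadata

avg_search_qps = searches_per_day / SECONDS_PER_DAY   # ~2.9K/s
peak_search_qps = avg_search_qps * 10                 # ~29K/s (assumed 10x)
avg_write_qps = tweets_per_day / SECONDS_PER_DAY      # ~11.6K/s

storage_per_day = tweets_per_day * bytes_per_tweet    # 1 TB/day
storage_2_years = storage_per_day * 730               # ~0.73 PB

print(round(avg_search_qps), round(peak_search_qps), storage_2_years)
```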
API design
tweet(user_ID, text)
Application calls this when a user tweets.
follow(follower_ID, followed_user_ID)
User (with follower_ID) follows another user (followed_user_ID). I think it is OK for this API to take only one followed user, since it is rare for a user to want to follow many users at the same time.
search(user_ID, term, start, number, criteria)
User searches for the search term in "term". The term can be single word or multi words. In case of multi words, it would be considered with "and" logic.
In addition to the term, this API can specify the start offset and the number of tweets to return. For example, even if a search could ultimately return 1,000 tweets, the application might request only the first ten (or the next ten) so it can show results quickly.
This API can also take search criteria, e.g., prioritizing tweets from followed users, tweets from popular people, or newer tweets.
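To make the parameters concrete, here is a hypothetical sketch of the search request shape in Python; the names (`SearchRequest`, `SearchCriteria`) are illustrative, not part of any real API:

```python
from dataclasses import dataclass
from enum import Enum

class SearchCriteria(Enum):
    FOLLOWED_FIRST = "followed_first"  # prioritize followed users
    POPULAR_FIRST = "popular_first"    # prioritize popular authors
    NEWEST_FIRST = "newest_first"      # prioritize recent tweets

@dataclass
class SearchRequest:
    user_id: str
    term: str                  # one or more words, AND semantics
    start: int = 0             # pagination offset
    number: int = 10           # page size
    criteria: SearchCriteria = SearchCriteria.FOLLOWED_FIRST

req = SearchRequest(user_id="u1", term="system design")
print(req.term.split())  # each word must match: ['system', 'design']
```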
Database design
The main thing we need to worry about storing is tweets. Due to the size of the data (1PB in 2 years), I think a NoSQL DB is more suitable than a relational DB for storing tweets. NoSQL DBs scale better horizontally. An RDB has an advantage over NoSQL in strong consistency (e.g., ACID properties), but that is not necessary for this application.
Among NoSQL DBs, a wide-column DB such as Cassandra seems to be a good fit. It is horizontally scalable and optimized for write throughput; with 1B tweets a day, this system requires high write throughput.
The argument for a document DB like MongoDB would be its schema flexibility and secondary indexes, which would make future application enhancements easier. But I lean toward a wide-column DB because the write-throughput requirement is so high. In the real world, I would probably run experiments and compare write throughput.
Assuming the search index is 50% of the original data, it would be 500TB in 2 years. Search services such as Elasticsearch and Solr store the inverted index in their own storage systems. I would use that mechanism, instead of an external DB, to store the index.
Both the original tweets and the search index must be partitioned and backed up. Primary thoughts:
- Geo-located CDNs and data centers should be used to provide short response times for users.
- Geo location should guide which data is stored where.
- Generally speaking, data should have two backup copies: one in the same data center, another in a different geographic region.
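The "two backup copies" rule can be sketched as a placement function. Region and data-center names below are made up for illustration:

```python
# Pick two backup locations for a piece of data: one other data center in
# the same region, plus one data center in a different region.
REGIONS = {
    "us-east": ["us-east-dc1", "us-east-dc2"],
    "eu-west": ["eu-west-dc1"],
    "ap-ne":   ["ap-ne-dc1"],
}

def backup_placement(primary_region, primary_dc):
    # Prefer a different DC in the same region; a single-DC region has to
    # reuse its only DC (a degraded case worth flagging in real systems).
    same_region_dc = next(
        (dc for dc in REGIONS[primary_region] if dc != primary_dc),
        REGIONS[primary_region][0],
    )
    # First region that differs from the primary hosts the remote copy.
    other_region = next(r for r in REGIONS if r != primary_region)
    return [same_region_dc, REGIONS[other_region][0]]

print(backup_placement("us-east", "us-east-dc1"))
```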
Primary data model is Tweets:
Tweet:
- tweet_ID (primary key)
- user_ID (reference to the user who wrote this tweet)
- content (text content of the tweet)
- timestamp (when was it tweeted)
- geo location
- language
- users_mentioned (users mentioned in this tweet)
- hashtags (hashtags mentioned in this tweet)
- media links (links to images and videos)
Some of this information, e.g., geo location and language, is helpful in deciding where a tweet should be stored. For example, a Japanese-language tweet made in Japan should be stored primarily in Japan (with backups elsewhere).
user_ID can be used for sharding.
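A minimal sketch of the Tweet record and a user_ID-based shard function. The toy hash below stands in for consistent hashing, which a real deployment (e.g., Cassandra's token ring) would use:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tweet:
    tweet_id: str
    user_id: str                # author of the tweet
    content: str
    timestamp: int
    geo_location: str = ""      # helps decide storage placement
    language: str = ""
    users_mentioned: List[str] = field(default_factory=list)
    hashtags: List[str] = field(default_factory=list)
    media_links: List[str] = field(default_factory=list)

def shard_for_user(user_id: str, num_shards: int) -> int:
    # Stable toy hash: the same user always maps to the same shard.
    return sum(user_id.encode()) % num_shards

t = Tweet(tweet_id="t1", user_id="alice", content="hello", timestamp=0)
print(shard_for_user(t.user_id, 16))
```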
High-level design
Frequently accessed content, such as tweets from popular users, images, and videos, is cached in CDNs located close to large user populations for quick access.
The API Gateway rate-limits requests to prevent denial-of-service attacks and to smooth out surges of requests beyond the expected level.
I split the app server into a Tweet Service and a Search Service because the two are very different. The former takes client requests and interacts with the database. The Search Service takes search requests from the Tweet Service; it can be a third-party solution such as Solr or Elasticsearch.
When the Tweet Service receives a tweet, it stores the message in the DB. At the same time, it publishes a message to a message queue such as Kafka. The Indexer pulls this message, reads the tweet text, and updates the index. The Indexer also extracts hashtags and mentioned users.
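The write path can be sketched end to end with in-memory stand-ins for the DB and the queue (Kafka and Cassandra in the real design):

```python
from collections import deque

db = {}          # stand-in for the tweet store (Cassandra in the design)
queue = deque()  # stand-in for the message queue (Kafka in the design)
index = {}       # inverted index: token -> set of tweet IDs

def tweet_service_post(tweet_id, text):
    db[tweet_id] = text     # 1. persist the tweet
    queue.append(tweet_id)  # 2. notify the Indexer via the queue

def indexer_drain():
    while queue:
        tid = queue.popleft()
        text = db[tid]      # 3. read the tweet back from the DB
        for token in text.lower().split():
            # Hashtags and mentions are indexed like any other token here.
            index.setdefault(token, set()).add(tid)

tweet_service_post("t1", "Hello #distsys from @alice")
indexer_drain()
print(sorted(index["#distsys"]))  # ['t1']
```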
Cache is used to speed up access throughout services.
Request flows
Detailed component design
Let's dig deeper into search.
Tweet search is implemented by two distinct processes:
The first is the Indexer, a streaming process. A message in the message queue notifies the Indexer that a new tweet has arrived. When it receives the message, it reads the tweet from the database and creates an inverted index (i.e., a hash map from search words to documents). I assume the Indexer does this via a Lucene or Solr library. It then adds the new entries to the Search Index.
Search Index is a sharded data store. Solr has its own native data store, so we will use it to store the index (instead of an external database).
The second part is when a user searches for terms. The Tweet Service forwards the search query to the Search Service, most likely a third-party search service like Solr or Elasticsearch, which uses the inverted index stored in the Search Index.
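Multi-word AND semantics reduce to intersecting the posting sets of each query term. A toy sketch (real systems delegate this to Lucene/Solr/Elasticsearch):

```python
# Toy inverted index: term -> set of matching tweet IDs.
inverted_index = {
    "rust": {"t1", "t2", "t3"},
    "async": {"t2", "t3"},
    "tokio": {"t3"},
}

def search_and(query: str) -> set:
    # A multi-word query matches only tweets containing every word.
    postings = [inverted_index.get(w.lower(), set()) for w in query.split()]
    if not postings:
        return set()
    return set.intersection(*postings)

print(sorted(search_and("rust async")))  # ['t2', 't3']
```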
For scalability and response time, Search Index should be sharded. We should be able to use languages and geographic locations as high level shard keys.
Beyond that, a simple approach of using tweet ID as a shard key might be fine. Each shard would be responsible for a range of tweet IDs. This would probably be better than sharding by user IDs, because some users are a lot more active than other users.
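Range-based routing by tweet ID can be sketched with a sorted list of shard boundaries (the values are illustrative; real systems split and rebalance ranges as shards grow):

```python
import bisect

# Each boundary is exclusive: shard 0 owns IDs below 1,000,000, shard 1
# owns [1,000,000, 2,000,000), and the last shard owns everything above.
SHARD_BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]  # shards 0..3

def shard_for_tweet(tweet_id: int) -> int:
    # bisect_right finds how many boundaries the ID is at or beyond,
    # which is exactly the index of the owning shard.
    return bisect.bisect_right(SHARD_BOUNDARIES, tweet_id)

print(shard_for_tweet(500))        # 0
print(shard_for_tweet(2_500_000))  # 2
```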
For data placement, geo location would be key. Generally speaking, tweets and media files should be stored and served from the region where the requests for them originate.
A notable exception would be tweets from famous people. An international star's tweets are viewed from all across the world. Therefore, it is important for the system (here, the Indexer's job) to detect such a tweet and ensure it (including images and videos) is stored in the databases and CDNs of all geo locations.
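One simple way the Indexer could detect the celebrity case is a follower-count threshold; the threshold value and region names below are assumptions for illustration:

```python
# If the author is above the follower threshold, replicate the tweet to
# every region; otherwise keep it in the author's home region only.
FOLLOWER_THRESHOLD = 1_000_000  # assumed cutoff for "famous"
ALL_REGIONS = ["us-east", "eu-west", "ap-ne"]

def target_regions(author_followers: int, home_region: str):
    if author_followers >= FOLLOWER_THRESHOLD:
        return ALL_REGIONS       # celebrity: global fan-out
    return [home_region]         # normal case: stay local

print(target_regions(50_000_000, "us-east"))  # all regions
print(target_regions(1_200, "ap-ne"))         # ['ap-ne']
```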
Trade offs/Tech choices
Failure scenarios/bottlenecks
Future improvements