My Solution for Design Typeahead Suggestion with Score: 9/10

by radiant_alchemy815

System requirements


Functional:

  1. Matching is supported at the beginning of the search query (prefix matching).
  2. 5 autocomplete suggestions should be returned.
  3. No spell check is supported for now; it can be implemented later.
  4. Search queries are in English and use lowercase alphabetic characters.
  5. The 5 suggestions are sorted by popularity.



Non-Functional:

  1. Highly available (always available for search).
  2. Scalable.
  3. Low latency.




Capacity estimation

  • Let's assume:
  • 10 million DAU.
  • An average person performs 10 searches/day.
  • 1 search query string = 10 bytes of data:
  • Assume we use ASCII character encoding.
  • 1 char = 1 byte.
  • 1 word = 5 chars on average.
  • 1 query = 2 words on average.
  • 2 words * 5 bytes = 10 bytes/query.
  • On average, 10 requests are sent for each query (one per keystroke).
  • e.g., "mango" -> 5 requests:

search?q=m

search?q=ma

search?q=man

search?q=mang

search?q=mango

  • No. of users = 10,000,000
  • Queries/user/day = 10
  • Total queries per day = 10,000,000 * 10 = 100,000,000
  • Chars per query = 10
  • Seconds/day = 24 * 3600 = 86,400

QPS = 10,000,000 * 10 / 86,400 ≈ 1,160 queries/second (counting one request per keystroke, ~10 per query, that is ~11,600 requests/second)

  • Assuming 10% of daily queries are new (only new queries add storage).
  • New queries/day = 100,000,000 * 10% = 10,000,000
  • 1 GB = 10^9 bytes
  • Storage added = 10,000,000 * 10 bytes / 10^9 = 0.1 GB (100 MB)
  • This much data gets added to storage daily.

TRAFFIC: mixed workload (both read- and write-heavy)


API design

  1. GET /GetTopSuggestion(prefix). Returns the top suggestions to be displayed in the UI, as an array in a JSON object.
  2. POST /IncreaseFrequency(search_term). Records a search_term in the DB, inserting it or incrementing its frequency. A minimal sketch of both endpoints follows.
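
As a minimal sketch of these two endpoints, assuming an Express-style Node.js server (route names, the port, and the in-memory suggestionStore are illustrative, not the actual implementation):

const express = require("express");
const app = express();
app.use(express.json());

// In-memory stand-in for the trie cache / trie DB described later.
const suggestionStore = new Map(); // prefix -> [{term, freq}, ...]

// GET /suggestions?prefix=ma -> top 5 suggestions as a JSON array.
app.get("/suggestions", (req, res) => {
  const prefix = (req.query.prefix || "").toLowerCase();
  const entries = suggestionStore.get(prefix) || [];
  res.json({ suggestions: entries.slice(0, 5).map((e) => e.term) });
});

// POST /frequency {"search_term": "mango"} -> record a completed search.
app.post("/frequency", (req, res) => {
  // In the full design this write lands in the analytics log,
  // not directly in the serving trie.
  res.status(202).json({ accepted: req.body.search_term });
});

app.listen(3000);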


Database design

We can use a key-value store, since it is optimized for this access pattern: retrieval is an O(1) key lookup. A relational DB would be inefficient here, because returning the top K elements for a prefix requires a filtered ORDER BY ... LIMIT K query rather than a constant-time lookup.


High-level design

Let's break down the system:

  • Trie data structure. Fetching the top 5 search queries from a relational database is not efficient, so a trie is used instead. Each node can cache the most frequent terms in its subtree:
(root)
└── b                  {bee:25, best:20, be:15}
    ├── be:15
    │   ├── bee:25     {bee:25, beer:10}
    │   │   └── beer:10
    │   └── bes:5
    │       └── best:20
    └── bu
        └── buy

(Numbers are query frequencies; the sets in braces are the top suggestions cached at that node.)

a. We can limit the maximum length of the search query a user can type, since users rarely type long queries. This reduces trie traversal from O(prefix length) to O(1) (a small constant bound).

b. We can cache the top k search queries at each node, so retrieval is O(1).

So to search, the steps are:

  • Find the node for the prefix: O(1), since the prefix length is bounded.
  • Return the top k queries: O(1), since they are cached at the node.

So the algorithm returns the top k queries in O(1) overall, as sketched below.
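
A minimal sketch of such a trie with per-node top-k caching (illustrative JavaScript; in the real system the caches are precomputed offline when the trie is built):

// Trie where every node caches its top-k (term, freq) pairs,
// so a lookup is a bounded-length walk plus an O(1) cache read.
class TrieNode {
  constructor() {
    this.children = new Map(); // char -> TrieNode
    this.topK = [];            // [{term, freq}] sorted by freq, descending
  }
}

class Trie {
  constructor(k = 5) {
    this.root = new TrieNode();
    this.k = k;
  }

  // Insert a term, refreshing the cached top-k on every node along its path.
  insert(term, freq) {
    let node = this.root;
    for (const ch of term) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch);
      node.topK = node.topK
        .filter((e) => e.term !== term) // replace any stale entry
        .concat([{ term, freq }])
        .sort((a, b) => b.freq - a.freq)
        .slice(0, this.k);
    }
  }

  // Bounded walk to the prefix node, then an O(1) read of the cached list.
  topSuggestions(prefix) {
    let node = this.root;
    for (const ch of prefix) {
      node = node.children.get(ch);
      if (!node) return [];
    }
    return node.topK.map((e) => e.term);
  }
}

const trie = new Trie();
trie.insert("bee", 25);
trie.insert("best", 20);
trie.insert("be", 15);
trie.insert("beer", 10);
console.log(trie.topSuggestions("be")); // ["bee", "best", "be", "beer"]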


  • Data gathering service. It gathers user input queries, then aggregates and processes them.

Updating the trie on every search would slow down the query service. Top suggestions, such as Google keywords, usually don't change much, so it is unnecessary to rebuild the trie on every query.

The data used to build our trie comes from an analytics/logging service.


  • Query service. Given a query prefix, return the 5 most frequently searched terms.

a. We can use AJAX requests to fetch results. The main benefit is that sending a request and receiving a response does not refresh the entire web page.

b. Browser caching: recently fetched suggestions can be cached in the browser and reused for repeated prefixes.

c. We can use data sampling, since logging each and every query is intensive work. Instead, we can randomly log 1 out of every N requests in the system, for example as shown below.

// Sample a subset of search queries to log instead of logging all of them.
const searchQueries = [
  "weather forecast",
  "best restaurants near me",
  "javascript tutorial",
  "latest tech news",
  "how to cook pasta",
  "vacation destinations 2023",
  "movie reviews",
];

// Pick sampleSize elements uniformly at random, without replacement.
function getRandomSample(arr, sampleSize) {
  const result = [];
  const arrCopy = [...arr];
  for (let i = 0; i < sampleSize; i++) {
    const randomIndex = Math.floor(Math.random() * arrCopy.length);
    const selectedElement = arrCopy.splice(randomIndex, 1)[0];
    result.push(selectedElement);
  }
  return result;
}

const sampleSize = 3;
const randomSample = getRandomSample(searchQueries, sampleSize);
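
The snippet above samples from a batch of already-collected queries; to log 1 out of every N live requests, a per-request sampler is simpler still (a sketch; N is a tuning parameter):

// Returns true with probability 1/n, i.e. logs ~1 out of every n requests.
function shouldLog(n) {
  return Math.random() < 1 / n;
}

if (shouldLog(100)) {
  // write this query to the analytics log
}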


  • Trie Operations:


Create: The trie is created by workers using aggregated logs from the analytics service.

Update: The trie is rebuilt weekly; the newly built trie replaces the old one.

Delete: A filter can be added in front of the trie cache to remove violent or abusive search terms (see the sketch below). Asynchronously, this data is also removed from the database, so the trie built in the next cycle is clean.
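
A minimal sketch of such a filter layer, assuming a simple in-memory blocklist (the set contents and names are illustrative):

// Blocklist filter placed between the trie cache and the client.
const blockedTerms = new Set(["some-abusive-term"]); // illustrative

function filterSuggestions(suggestions) {
  // Blocked terms are hidden immediately; their removal from the
  // trie DB happens asynchronously before the next weekly rebuild.
  return suggestions.filter((term) => !blockedTerms.has(term));
}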


  • Scale the storage:

We can build shard logic from patterns in historical query data and maintain a lookup DB that records where each row is stored. For example, if most queries start with the letter 'b' and the rest fall under 'c', 'd', and 'e', we can build 2 shards: one dedicated to 'b' and one for 'c' through 'e'. A lookup sketch follows.
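
A sketch of the resulting lookup, assuming the two-shard split above (the mapping table is illustrative and would live in the lookup DB):

// Maps a prefix's first letter to the shard holding its rows.
const shardByFirstLetter = { b: "shard-1", c: "shard-2", d: "shard-2", e: "shard-2" };

function shardFor(prefix) {
  return shardByFirstLetter[prefix[0]] || "shard-2"; // default shard
}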





Request flows

  1. A search query is sent to the load balancer (LB).
  2. The load balancer routes the request to the API servers.
  3. The API servers get trie data from the trie cache and form the autocomplete suggestions for the client.
  4. If the data is not in the trie cache, it is fetched from the trie DB, added to the cache, and returned to the API server to form the result.




Detailed component design

AnalyticsLogs: It stores raw data about search queries; new records are simply appended (append-only).


Aggregators: Since the analytics logs will be large, we need to aggregate them so they can be processed by our system.

Here we don't need to aggregate data very often, so we can do it once a week. We assume the trie is rebuilt weekly.


Aggregated Data:

Query     Time (week start)     Freq
tree      2024-06-06            15000
mango     2024-06-06            150
summer    2024-06-06            2000
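
A sketch of the weekly aggregation step, assuming raw log entries shaped like { term, timestamp } (names are illustrative):

// Roll raw analytics log entries up into per-term weekly counts.
function aggregateWeekly(logEntries) {
  const counts = new Map();
  for (const { term } of logEntries) {
    counts.set(term, (counts.get(term) || 0) + 1);
  }
  return counts; // term -> frequency for the week
}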


Workers:

These servers perform asynchronous jobs at regular intervals, building the trie data structure and storing it in the trie DB.


Trie Cache:

This is a distributed cache that keeps the trie in memory for fast reads.


Trie DB:

We can use a key-value store to represent the trie.

Key: a prefix in the trie

Value: the suggestion data cached at that node (terms and their frequencies)

Key     Value
b       [bee:25, best:20, be:15]
be      [bee:25, best:20, be:15]
bee     [bee:25]
best    [best:20]

e.g., Amazon DynamoDB.
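
As a sketch, each node's cached top suggestions map directly onto these rows; a plain object stands in for the store here (illustrative, not the actual DynamoDB API):

const trieRows = {
  b:    [{ term: "bee", freq: 25 }, { term: "best", freq: 20 }, { term: "be", freq: 15 }],
  be:   [{ term: "bee", freq: 25 }, { term: "best", freq: 20 }, { term: "be", freq: 15 }],
  bee:  [{ term: "bee", freq: 25 }],
  best: [{ term: "best", freq: 20 }],
};

// A query-service read is then a single key lookup:
const top5 = (trieRows["be"] || []).slice(0, 5).map((r) => r.term);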



Trade offs/Tech choices

No real-time update of search terms. We currently rebuild the trie weekly to avoid the load, but if, like Twitter, we needed search suggestions in near real time, we could rebuild more frequently.


We also apply random sampling to our data, since storing every searched term and building the trie from all of them is not efficient.


We can also filter our search data further using a decay factor that lowers the popularity score of a search term over time.

Example scenario (linear decay):

Initial score: suppose a search term starts with a popularity score of 100.
Decay rate: the score decreases by 10 units per hour.

After 0 hours: score = 100
After 1 hour: score = 100 - 10 = 90
After 2 hours: score = 100 - 2*10 = 80
After 3 hours: score = 100 - 3*10 = 70

We can then define a threshold below which a search term is removed from the popularity list.
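
A sketch of this linear decay, with the threshold as a tunable parameter (the values are the ones from the example above):

// score(t) = initialScore - decayPerHour * hoursElapsed, floored at 0.
function decayedScore(initialScore, hoursElapsed, decayPerHour = 10) {
  return Math.max(0, initialScore - decayPerHour * hoursElapsed);
}

const THRESHOLD = 50; // illustrative cutoff
const keep = decayedScore(100, 3) >= THRESHOLD; // 70 >= 50 -> keep the term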


Failure scenarios/bottlenecks

Loss of network: we handle this by caching suggestion data on the client side as well, so previously fetched suggestions can still be shown.




Future improvements

  1. Support multi-language search.