My Solution for Design Yelp or Nearby Friends

by nectar4678

Let’s design a Yelp-like service where users can search for nearby places such as restaurants, theaters, or shopping malls, and can also add and view reviews of those places.


System requirements

Our proximity server will contain data about different locations, allowing users to search for them based on their current whereabouts. When users make a query, the server will provide a list of nearby places corresponding to their search criteria. For instance, if a user searches for nearby "restaurants," the server will fetch a list of restaurants in close proximity to their current location.


Functional Requirements

  1. The core function of a proximity server is search: given latitude and longitude coordinates, users can explore nearby places within a designated radius.
  2. Users can add new places or modify existing ones, adding or updating basic attributes such as images and a concise description.
  3. Users can rate places (one to five stars) and leave reviews, including written feedback and photographs.

Non-Functional Requirements

  1. We can anticipate this application being read-heavy, since search requests will vastly outnumber additions of new places.
  2. The system should be extremely reliable, with real-time search capabilities and low latency.


Capacity Estimation

With 100K queries per second today, the total number of places in the system is estimated at 200 million. We should design the system for at least five years of growth; assuming 20% annual growth, that means:

  1. Our system should be capable of handling roughly 500 million places (200M × 1.2⁵ ≈ 498M).
  2. A load of roughly 250K search requests per second (100K × 1.2⁵ ≈ 249K) should be no problem for our system.
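As a sanity check, the growth figures can be re-derived with a few lines of arithmetic (the current totals and the 20% growth rate are the estimates stated above):

```python
# Back-of-the-envelope check of the 5-year capacity estimates.
places_today = 200_000_000   # current number of places (estimate above)
qps_today = 100_000          # current queries per second
growth = 1.20                # assumed 20% annual growth
years = 5

places_future = places_today * growth ** years
qps_future = qps_today * growth ** years

print(f"{places_future / 1e6:.0f}M places")  # 498M places -> design for ~500M
print(f"{qps_future / 1e3:.0f}K QPS")        # 249K QPS -> design for ~250K
```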

High Level Design

Basically, we have three high-level APIs for this service:


Add place: this API creates a new place and returns the newly created place along with its place_id.


    REQUEST URL: api/i1/addplace

    REQUEST BODY:
    {
        name: "Old town cafe",
        category: "Restaurant",
        description: "Good local place to hangout",
        photos: ["url1", "url2"],
        latitude: 98.735249,
        longitude: 76.300408
    }

    RESPONSE BODY:
    {
        id: <place_id>,
        name: "Old town cafe",
        category: "Restaurant",
        description: "Good local place to hangout",
        photos: ["url1", "url2"],
        latitude: 98.735249,
        longitude: 76.300408
    }


Add review: this API adds a user's review of a place.


    REQUEST URL: api/i1/review

    REQUEST BODY:
    {
        place_id: <place_id>,
        user_id: <user_id>,
        rating: 5,
        review: "Coffee was great here.",
        photos: ["url3", "url4"]
    }

    RESPONSE BODY:
    {
        id: <review_id>,
        place_id: <place_id>,
        user_id: <user_id>,
        rating: 5,
        review: "Coffee was great here.",
        photos: ["url3", "url4"]
    }


Search: this API handles user search queries and returns information about matching places.


Note that the search endpoint needs to support pagination, as the result set can grow large for a given filter.

    REQUEST URL: api/i1/search

    REQUEST PARAMS:
    {
        keyword: "pizza",
        radius: "12",
        category_filter: "italian",
        min_rating: 4,
        max_results: 500
    }

    RESPONSE: a JSON document containing a list of popular places matching the search query


Here is how the request flow looks for search:


Database schemas

Each dataset given above must be stored and indexed (places, reviews, etc.). Users want to see results in real-time while searching for local places; therefore, the indexing must be read-efficient to query this enormous database. We don’t need to worry about frequent data updates because the position of a place doesn’t change too often. In contrast, if we want to construct a service where items, such as people or cabs, change their location regularly, we might develop a radically different design. 



SQL Based storage

Places, reviews, and user details can be stored in an SQL database and indexed to optimize search by latitude and longitude. To further improve search efficiency, we can use 2D grids to divide the world map into smaller squares. For example, if we assume the earth's surface area is 100 million square kilometers and the search radius is fixed at 10 kilometers, a 10-kilometer grid size gives us 1 million squares (100M km² ÷ 100 km² per square). By fixing the grid size to the query radius, we can limit the search to the target grid and its eight neighbouring grids.
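As a sketch of how a place could map to a static grid, here is a hypothetical id function. The cell size of 0.09 degrees (~10 km at the equator) and all names are purely illustrative; a real system would use projected coordinates, and the resulting cell count differs from the rough estimate in the text.

```python
import math

# Hypothetical mapping from (lat, lng) to a static grid id.
CELL_DEG = 0.09                    # ~10 km at the equator (illustrative)
COLS = math.ceil(360 / CELL_DEG)   # cells per row of the world map

def grid_id(lat: float, lng: float) -> int:
    row = int((lat + 90) / CELL_DEG)   # shift latitude into [0, 180)
    col = int((lng + 180) / CELL_DEG)  # shift longitude into [0, 360)
    return row * COLS + col

# Nearby points share a grid id; distant points do not.
print(grid_id(9.93, 76.26) == grid_id(9.94, 76.27))   # True
print(grid_id(9.93, 76.26) == grid_id(40.71, -74.0))  # False
```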


Every place with a location will belong to a specific grid. Each grid will have a unique id that can be indexed and stored in the places table. Let’s see how our basic search flow works :) 


Because our grids are statically created with the grid size equal to the search radius, we can compute the grid id for any location, along with the ids of its eight neighbouring grids. This narrows the search query's execution scope to just nine grids instead of the brute-force strategy of scanning the entire map, substantially reducing the query's overall runtime.
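The nine-grid lookup can be sketched with a hypothetical helper (for simplicity it ignores wrap-around at the map edges):

```python
def neighbor_grid_ids(row: int, col: int, cols: int) -> list[int]:
    """Ids of the target grid plus its eight neighbours (no edge wrap-around)."""
    return [(row + dr) * cols + (col + dc)
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)]

ids = neighbor_grid_ids(1110, 2847, 4000)
print(len(ids))  # 9

# The search then narrows to these grids, e.g. (illustrative SQL):
#   SELECT * FROM places WHERE grid_id IN (...nine ids...) AND category = 'Restaurant';
```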


We can make this even faster by keeping each grid id and its list of place ids in memory. As noted above, we will have about 1 million grids; with each grid id taking 5 bytes and each place id 10 bytes, and with 100 million places in total, the memory required to cache grid and place ids is roughly 1 GB: (1M × 5 B) + (100M × 10 B) ≈ 1 GB.


Drawback with this approach

  1. Queries for popular areas can be slow because places are not evenly distributed across grids: some grids (dense city centres) end up heavily populated, while others (coastal regions, islands) are nearly empty.
  2. An alternative is to adjust grid sizes dynamically by capping the number of places per grid. However, this strategy is challenging to implement and adds complexity to the system.

However, there is an interesting data structure called Quadtree which can help us. Learn more about Quadtrees here: https://www.geeksforgeeks.org/quad-tree/


QuadTrees

Quadtrees are trees used to efficiently store data of points on a two-dimensional space. In this tree, each node has at most four children. We can construct a quadtree from a two-dimensional area using the following steps:


  • Divide the current two-dimensional space into four boxes.
  • If a box contains one or more points, create a child node for it, storing the box's region of the two-dimensional space.
  • If a box contains no points, do not create a child for it.
  • Recurse for each of the children.
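The steps above can be sketched as a small class. This is a minimal illustration rather than the article's implementation: all names are made up, the bucket capacity is 4 (instead of 100) to keep the demo small, and boxes are half-open so every point lands in exactly one child.

```python
# A minimal quadtree sketch of the construction steps above.
class QuadTree:
    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []      # only leaf nodes hold points
        self.children = None  # set to four subtrees once this node splits

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False      # point lies outside this box
        if self.children is None:
            self.points.append((x, y))
            if len(self.points) > self.capacity:
                self._split()
            return True
        return any(c.insert(x, y) for c in self.children)

    def _split(self):
        # Divide this box into four quadrants and push points down.
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, y0, mx, my, self.capacity),
                         QuadTree(mx, y0, x1, my, self.capacity),
                         QuadTree(x0, my, mx, y1, self.capacity),
                         QuadTree(mx, my, x1, y1, self.capacity)]
        for p in self.points:
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1, found=None):
        # Collect points inside the query box, pruning disjoint subtrees.
        found = [] if found is None else found
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 >= x1 or qy1 < y0 or qy0 >= y1:
            return found
        if self.children is None:
            found.extend(p for p in self.points
                         if qx0 <= p[0] <= qx1 and qy0 <= p[1] <= qy1)
        else:
            for c in self.children:
                c.query(qx0, qy0, qx1, qy1, found)
        return found

tree = QuadTree(0, 0, 100, 100)
for p in [(10, 10), (12, 12), (80, 80), (81, 81), (50, 50)]:
    tree.insert(*p)
print(tree.query(0, 0, 20, 20))  # [(10, 10), (12, 12)]
```

Inserting a fifth point exceeds the capacity, which triggers the split described above; the box query then prunes any quadrant that does not overlap the search area.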



Initially, all of the locations are stored in a single root node. However, since our system is designed to handle a scale of 100 million locations over five years, the root node will not be able to hold them all. The root node is therefore split recursively until no node contains more than 100 locations, giving a quadtree whose leaf nodes store all of the locations.


Let’s see how our basic search flow works in this case

To search for a location in our quadtree, we start at the root node and work our way down the tree until we reach the appropriate leaf node. This is because locations are only stored in leaf nodes.


Our quadtree creation approach ensures that locations in neighbouring nodes are geographically close to each other. As a result, when searching for nearby locations, we also consider the data in neighbouring nodes.


To improve search performance, we can cache the quadtree. With a total of 1 million nodes (100 million locations divided by a bucket size of 100), and assuming a 5-byte node id with four child pointers of 10 bytes each (45 B per node), plus 10 bytes each for a location's id, latitude, and longitude, the total storage requirement is roughly 3 GB: (3 × 10 B × 100M locations) + (45 B × 1M nodes) ≈ 3 GB.
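Under those byte-size assumptions, the cache estimate can be re-derived directly:

```python
# Re-deriving the quadtree cache size under the stated assumptions.
locations = 100_000_000
per_location_bytes = 3 * 10        # 10 B each for id, latitude, longitude
nodes = locations // 100           # leaf bucket size of 100 -> ~1M nodes
per_node_bytes = 5 + 4 * 10        # 5 B node id + four 10 B child pointers

total = locations * per_location_bytes + nodes * per_node_bytes
print(total)  # 3045000000 bytes, i.e. roughly 3 GB
```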



Database Sharding

Given the scale of our system, we cannot rely on a single server to handle all of the traffic. This could create a single point of failure, which is unacceptable in today's distributed systems. One solution to this issue is to partition the quadtree data using a method such as region-based sharding or sharding based on place id.


  1. Region-based sharding: This approach partitions the data by geographic region. Because some regions are far more densely populated than others, it tends to produce a non-uniform distribution of data across shards.
  2. Sharding based on place_id:  An alternative approach is to shard the data based on the place id. We can use a hash function to calculate the hash of each place id as we build the quadtree and map each place id to a specific server where the location information is stored. This can help to evenly distribute the data across servers.


The second method, sharding based on place_id, appears simpler and more effective. Even if we end up with multiple quadtrees, this is not a major issue because the data will be uniformly distributed across all servers, ensuring the system can handle the expected traffic and maintain high availability.
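A minimal sketch of place_id-based shard selection might look like this. The shard count and all names are assumptions for illustration, not details from the article:

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count; not specified in the article

def shard_for(place_id: str) -> int:
    """Map a place_id to a shard by hashing it."""
    digest = hashlib.md5(place_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("place-42") == shard_for("place-42"))  # True: mapping is stable
```

Note that plain modulo hashing remaps most keys whenever NUM_SHARDS changes; if shards are added or removed often, a consistent-hashing ring is the usual remedy.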



Data Replication

In a distributed system of this scale, it is essential to have the ability to operate correctly even in the event of failures. To ensure high availability, we cannot rely on a single machine as this creates a single point of failure.


One solution is to use a master-slave architecture, where only the master accepts writes and slaves serve reads. If the master server goes down, a slave can be promoted to master and take over writes. There may be a small delay of a few milliseconds before newly changed data propagates, resulting in eventual consistency, but this should not be a significant issue for this application.



To further improve fault tolerance, we can also have a replica of the quadtree index server. If a quadtree server fails, it can be quickly rebuilt by querying the index server instead of the database, which can help to reduce the load on the database.


Load Balancing

To improve the efficiency of our system, we can use load balancers to keep the load in check.

A simple round robin technique can evenly distribute incoming requests among the backend servers. This type of load balancer is easy to set up and does not add significant overhead. It also has the advantage of automatically removing a server from the rotation if it goes down, which helps to prevent traffic from being sent to an unavailable server. However, the round robin method does not take server load into account, so a more intelligent load balancer that regularly polls the backend servers and adjusts traffic based on their load may be necessary in some cases.
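A toy balancer with the round-robin and health-removal behaviour described above might look like this (a sketch only; real load balancers poll backend health asynchronously, and the server names are placeholders):

```python
import itertools

# Minimal round-robin load balancer that skips unhealthy backends.
class RoundRobin:
    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        # Called by a (hypothetical) health checker when a backend fails.
        self.healthy.discard(server)

    def next_server(self):
        # Advance the rotation, skipping servers marked unhealthy.
        for _ in range(len(self.servers)):
            s = next(self._cycle)
            if s in self.healthy:
                return s
        raise RuntimeError("no healthy backends")

lb = RoundRobin(["s1", "s2", "s3"])
lb.mark_down("s2")
print([lb.next_server() for _ in range(4)])  # ['s1', 's3', 's1', 's3']
```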