Design Uber Backend

by alchemy1135

System requirements


Functional:

  • User Registration: Users should be able to register as passengers or drivers, providing necessary personal information such as name, contact details, and payment information.
  • Matching Algorithm: Develop an algorithm to efficiently match passengers with nearby drivers based on factors like location, availability, and ride preferences.
  • Ride Request: Passengers should be able to request rides by specifying their current location, destination, and preferred vehicle type (e.g., sedan, SUV).
  • Real-time Tracking: Passengers should be able to track the location of their assigned driver in real-time until reaching the destination.
  • Payment Processing: Passengers should be able to make secure payments for their trips. The platform should support various payment methods such as credit/debit cards, digital wallets, and cash.
  • Rating System: Passengers and drivers should be able to provide feedback on ride experiences, helping maintain service quality and accountability.
  • Notifications: Users should receive timely notifications regarding ride status updates, driver arrival, payment confirmation, and other relevant information.
  • Messaging: Passengers and drivers should be able to message each other when the trip has been booked.  


Non-Functional:

  • Performance: Ensure the platform can handle a high volume of concurrent users and requests efficiently, with minimal latency in matching, tracking, and payment processing.
  • Scalability: Design the system to scale horizontally to accommodate increasing user demand and geographical expansion without compromising performance.
  • Reliability: Maintain high system availability to ensure users can access the platform and request rides reliably at any time, with minimal downtime or service interruptions.
  • Security: Implement robust security measures to protect user data, prevent unauthorized access, and ensure the integrity and confidentiality of all transactions and communications.


Capacity estimation

Consider the following assumptions:

  • Total user base of 5 million users and 200,000 drivers
  • 1 million rides daily
  • Each user performs 5 actions on average


With the above assumptions, let's answer the questions below.


How many requests do we need to handle every day?

Since we have 1 million daily active users and each user makes 5 requests:

Total requests = daily active users * number of requests per user

Total requests = 1 million * 5 requests

Total requests = 5 million requests


How many Requests per Second (RPS)?

Total number of requests daily = 5 million requests (assuming each user performs 5 actions daily)

Therefore, requests per second (RPS) can be calculated as follows:

RPS = Total requests per day / Number of seconds in a day

RPS = 5,000,000 / 86,400 seconds (24 hours * 60 minutes * 60 seconds)

RPS ≈ 57.87

So, the system needs to handle approximately 58 requests per second on average.


How much storage do we need per day?

Each request (message) is assumed to be about 500 bytes.

Total storage needed daily can be calculated as:

Daily storage = Total daily messages * Average message size

Daily storage = 5 million messages * 500 bytes per message

Daily storage = 2,500,000,000 bytes

Converting bytes to gigabytes (GB):

1 GB = 1,073,741,824 bytes

Therefore, Daily storage ≈ 2.33 GB

So, the system needs approximately 2.33 gigabytes of storage every day to accommodate the messages generated.
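
These estimates can be reproduced with a few lines of Python; the inputs below are exactly the assumptions stated above:

```python
# Back-of-the-envelope capacity estimation using the assumptions above.
DAILY_ACTIVE_USERS = 1_000_000   # assumed 1 million daily active users
REQUESTS_PER_USER = 5            # assumed 5 actions per user per day
MESSAGE_SIZE_BYTES = 500         # assumed average request/message size
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

total_requests = DAILY_ACTIVE_USERS * REQUESTS_PER_USER
rps = total_requests / SECONDS_PER_DAY
daily_storage_gb = total_requests * MESSAGE_SIZE_BYTES / 2**30

print(f"Total daily requests: {total_requests:,}")   # 5,000,000
print(f"Average RPS: {rps:.2f}")                     # ~57.87
print(f"Daily storage: {daily_storage_gb:.2f} GB")   # ~2.33 GB
```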


API design

  1. User Registration API
      • Description: Allows users to register as passengers or drivers on the platform.
      • Input: User details (name, contact info, payment info) and role (passenger/driver).
      • Output: Success/failure status indicating whether the registration was successful.
  2. Ride Request API
      • Description: Enables passengers to request rides by providing their current location, destination, and preferred vehicle type.
      • Input: Passenger location, destination, preferred vehicle type.
      • Output: Confirmation of ride request with details such as driver assignment and estimated time of arrival.
  3. Driver Availability API
      • Description: Allows drivers to update their availability status (online/offline) on the platform.
      • Input: Driver ID and availability status.
      • Output: Confirmation of availability status update.
  4. Real-time Tracking API
      • Description: Provides real-time location tracking of assigned drivers to passengers.
      • Input: Ride ID or driver ID.
      • Output: Continuous updates of the driver's current location until reaching the destination.
  5. Payment Processing API
      • Description: Facilitates secure payment transactions between passengers and drivers.
      • Input: Payment details (amount, method, transaction ID).
      • Output: Confirmation of payment completion and transaction ID.
  6. Rating and Feedback API
      • Description: Allows passengers and drivers to rate and provide feedback on each ride.
      • Input: Ride ID, user ID, rating, and optional feedback.
      • Output: Confirmation of rating submission and any applicable incentives or penalties.
  7. Notification API
      • Description: Sends timely notifications to users regarding ride status updates, driver arrival, and payment confirmation.
      • Input: User ID, notification type, and relevant ride/payment details.
      • Output: Delivery status of the notification.
  8. Ride Fare Estimation API
      • Description: Provides an estimated fare for a ride based on the distance between the pickup and drop-off locations and the selected vehicle type.
      • Input: Pickup location, drop-off location, preferred vehicle type.
      • Output: Estimated fare for the ride.
  9. Chat API
      • Description: Initiates a chat conversation between a passenger and the assigned driver for communication purposes.
      • Input: Ride ID or user IDs of the passenger and driver.
      • Output: Confirmation of the chat initiation and a unique identifier for the chat session.
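
As a rough illustration, two of these endpoints (ride request and fare estimation) might look like the following FastAPI sketch. The route paths, field names, and pricing constants are illustrative assumptions, not Uber's actual API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RideRequest(BaseModel):
    passenger_id: str
    pickup_lat: float
    pickup_lon: float
    dropoff_lat: float
    dropoff_lon: float
    vehicle_type: str = "sedan"  # e.g., sedan, SUV

class RideResponse(BaseModel):
    ride_id: str
    driver_id: str
    eta_minutes: int

@app.post("/rides", response_model=RideResponse)
def request_ride(req: RideRequest) -> RideResponse:
    # In a real system this would call the matching service;
    # here we return a placeholder assignment.
    return RideResponse(ride_id="ride-123", driver_id="driver-456", eta_minutes=4)

@app.get("/fares/estimate")
def estimate_fare(pickup_lat: float, pickup_lon: float,
                  dropoff_lat: float, dropoff_lon: float,
                  vehicle_type: str = "sedan") -> dict:
    # Hypothetical flat-rate pricing for illustration only; the distance
    # would come from the geolocation service, not a hard-coded value.
    base_fare, per_km, distance_km = 2.50, 1.20, 5.0
    return {"estimated_fare": base_fare + per_km * distance_km}
```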


Database design





Database choices

  • User Data:
      • Database Type: SQL (Relational Database)
      • Reasoning: SQL databases provide strong consistency and are suitable for managing structured user data such as profiles, contact details, and payment information.
      • CAP Focus: Consistency Focused
  • Rides and Bookings:
      • Database Type: SQL or NoSQL (depends on scale and complexity)
      • Reasoning: SQL databases can handle transactional data effectively, ensuring data consistency for ride bookings and history. However, as the scale grows, NoSQL databases like Cassandra or MongoDB may be more suitable for accommodating high write throughput and flexible data models.
      • CAP Focus: Balanced (SQL), Availability Focused (NoSQL)
  • Payments:
      • Database Type: SQL (Relational Database)
      • Reasoning: SQL databases are well-suited for managing financial transactions due to their ACID compliance and strong consistency guarantees, ensuring data integrity for payment processing.
      • CAP Focus: Consistency Focused
  • Drivers:
      • Database Type: NoSQL
      • Reasoning: NoSQL databases may be preferred for handling real-time driver availability and location updates at scale.
      • CAP Focus: Availability Focused (NoSQL)
  • Ratings:
      • Database Type: NoSQL
      • Reasoning: NoSQL databases can handle unstructured or semi-structured feedback data more flexibly.
      • CAP Focus: Availability Focused (NoSQL)
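
To make the relational side concrete, here is a minimal sketch of the SQL-backed tables using SQLAlchemy. Table and column names are illustrative assumptions, not a finalized schema:

```python
from sqlalchemy import (Column, DateTime, ForeignKey, Integer, Numeric,
                        String, create_engine)
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    role = Column(String, nullable=False)          # "passenger" or "driver"
    contact = Column(String)

class Ride(Base):
    __tablename__ = "rides"
    id = Column(Integer, primary_key=True)
    passenger_id = Column(Integer, ForeignKey("users.id"), nullable=False)
    driver_id = Column(Integer, ForeignKey("users.id"))
    status = Column(String, default="requested")   # requested/assigned/completed
    requested_at = Column(DateTime)

class Payment(Base):
    __tablename__ = "payments"
    id = Column(Integer, primary_key=True)
    ride_id = Column(Integer, ForeignKey("rides.id"), nullable=False)
    amount = Column(Numeric(10, 2), nullable=False)
    method = Column(String)                        # card/wallet/cash

# Create the tables against a local SQLite database for demonstration.
engine = create_engine("sqlite:///uber_demo.db")
Base.metadata.create_all(engine)
```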


Data Partitioning and Sharding

Given the distributed nature and high volume of data in a ride-sharing platform like Uber, a suitable partitioning strategy is horizontal partitioning, or "sharding", based on geographical regions, where data related to users, drivers, rides, and bookings is partitioned across different geographic regions or cities.

This strategy ensures that data is distributed evenly, optimizing query performance and scalability, while also aligning with the natural segmentation of the platform's operations based on geographical locations.

The consistent hashing algorithm can be used for sharding, as it allows for efficient and balanced distribution of data across shards while minimizing data movement when the number of shards changes or nodes are added or removed from the system. This algorithm ensures that data remains evenly distributed even as the system scales, contributing to better load balancing and fault tolerance.
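
A minimal consistent-hash ring in Python shows the idea; the MD5 hash, virtual-node count, and shard names are arbitrary choices for this sketch:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys (e.g., user IDs or geohash prefixes) to shards."""

    def __init__(self, shards, vnodes=100):
        # Virtual nodes smooth out the distribution across shards.
        self._ring = []  # sorted list of (hash, shard) pairs
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_shard(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or past the key's hash.
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["us-west", "us-east", "eu-central"])
print(ring.get_shard("rider:42"))  # deterministic shard assignment
```

Adding or removing a shard only remaps the keys adjacent to its virtual nodes, which is exactly the "minimal data movement" property described above.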

During peak load times or surge periods, we will scale the system horizontally by adding more nodes, which distributes the workload and data across multiple servers. This approach is preferable to vertical scaling as it provides better scalability, fault tolerance, and flexibility to handle increasing data volume and user demand.


Fault tolerance and replication

We will need server replicas in case the Driver Location or Notification servers die; a secondary server can take over when a primary server fails. We can also keep data in persistent storage such as solid-state drives (SSDs), which provide fast input and output, and use it to recover data quickly if both the primary and secondary servers die.


Read/Write Separation:

Implementing read/write separation is beneficial for a ride-sharing platform like Uber. By separating read and write operations, it allows the system to optimize for performance and scalability. Read operations, such as retrieving ride details, user profiles, or driver information, are typically more frequent than write operations, such as updating ride statuses or booking new rides. Separating these operations enables the system to distribute the workload more effectively, scale read-heavy components independently, and improve overall system performance by reducing contention on the database resources. Additionally, it helps to ensure better fault tolerance and availability since read operations can still be served even if the write components experience issues or downtime.
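
As an illustration, a small router can direct writes to the primary and spread reads across replicas; the connection strings below are placeholders:

```python
import itertools

class DatabaseRouter:
    """Routes writes to the primary and reads round-robin across replicas."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        self.replicas = itertools.cycle(replicas) if replicas else None

    def for_write(self) -> str:
        return self.primary

    def for_read(self) -> str:
        # Serve reads from replicas; fall back to the primary if none exist.
        return next(self.replicas) if self.replicas else self.primary

router = DatabaseRouter(
    primary="postgres://primary:5432/uber",
    replicas=["postgres://replica1:5432/uber", "postgres://replica2:5432/uber"],
)
print(router.for_read())   # replica1, then replica2, then replica1, ...
print(router.for_write())  # always the primary
```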


High-level design


  1. Frontend Application:
      • Mobile and web interfaces for passengers and drivers to interact with the platform.
      • Features include ride booking, tracking, payment processing, and communication.
  2. Backend Services:
      • User Service: Handles user registration, authentication, and profile management.
      • Ride Service: Manages ride requests, matching, tracking, and ride history.
      • Driver Service: Manages driver registration, availability, and communication.
      • Payment Service: Processes payments, manages transactions, and handles financial data.
      • Rating Service: Manages rating and feedback for rides and drivers.
      • Notification Service: Sends real-time notifications to users regarding ride status updates.
  3. Database Layer:
      • SQL Database: Stores structured data such as user profiles, ride details, payments, and ratings.
      • NoSQL Database: Stores unstructured or semi-structured data such as driver availability, real-time tracking updates, and notifications.
  4. Geolocation Service:
      • Handles location-based operations, such as determining driver availability, calculating ride distances, and estimating arrival times.
  5. Payment Gateway Integration:
      • Integrates with third-party payment gateways to facilitate secure payment transactions between passengers and drivers.
  6. Messaging Service:
      • Facilitates real-time communication between passengers and drivers, enabling chat functionality and ride updates.
  7. Analytics and Monitoring:
      • Tracks key metrics such as ride volume, user engagement, payment success rates, and driver performance.
      • Monitors system health, performance, and security to ensure uptime and reliability.
  8. External Integrations:
      • Integrates with external services such as mapping and navigation APIs for route optimization, identity verification services for driver background checks, and regulatory compliance services.
  9. Load Balancers and Caching Layers:
      • Distributes incoming traffic across multiple backend servers to ensure scalability and fault tolerance.
      • Caches frequently accessed data to reduce database load and improve response times.
  10. Security Layer:
      • Implements robust security measures such as encryption, authentication, authorization, and data protection to safeguard user information and transactions.



Request flows

The diagram below shows the flow when a user books a ride.




Detailed component design


Location Tracking

Efficiently sending and receiving live location data from clients (customers and drivers) to the backend can be achieved using either a pull model or a push model. Let's explore both approaches:

  • Pull Model:
      • The backend periodically requests location updates from the clients.
      • Clients send their current location to the server in response to these requests.
      • This approach is more suitable when the server needs updates at regular intervals and the client's location updates are not frequent.
      • It can help reduce the amount of data transmitted and server load, since updates are only sent when requested.
  • Push Model:
      • Clients actively send location updates to the backend without waiting for requests.
      • Clients continuously stream their location data to the server in real time.
      • This approach is more suitable for scenarios where real-time location updates are crucial, such as ride tracking in a ride-sharing app.
      • It provides faster updates and responsiveness, since location data is sent immediately without waiting for requests from the server.

Comparison:

  • Pull Model: Suitable for scenarios where periodic location updates are sufficient, reducing server load and data transmission.
  • Push Model: Ideal for real-time tracking scenarios requiring immediate location updates, providing faster responsiveness but potentially increasing server load and data transmission.

Decision:

  • For live location tracking in a ride-sharing platform like Uber, the pull model is not scalable: it creates unnecessary request overhead on our servers, and most of the time the response will be empty, wasting resources. The push model is therefore typically preferred, as it ensures real-time updates, enabling accurate ride tracking and timely driver dispatching. That said, the choice depends on the specific use-case requirements, such as update frequency and data volume, as well as considerations for server load and resource utilization. A sketch of the push model follows.
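
Here is a driver-side sketch of the push model using the Python websockets library; the endpoint URL, message schema, and 4-second interval are assumptions for illustration:

```python
import asyncio
import json

import websockets  # pip install websockets

async def stream_location(driver_id: str, get_gps_fix) -> None:
    """Push the driver's location to the backend every few seconds."""
    # Hypothetical tracking endpoint; the real URL and auth would differ.
    async with websockets.connect("wss://tracking.example.com/drivers") as ws:
        while True:
            lat, lon = get_gps_fix()
            await ws.send(json.dumps({"driver_id": driver_id,
                                      "lat": lat, "lon": lon}))
            await asyncio.sleep(4)  # assumed 4-second update interval

# Example usage with a stubbed GPS reader:
# asyncio.run(stream_location("driver-456", lambda: (37.7564, -122.4016)))
```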


Ride Matching

We need a way to efficiently store and query nearby drivers.


GeoHashing

Geohashing is a geocoding method used to encode geographic coordinates such as latitude and longitude into short alphanumeric strings. A geohash is a hierarchical spatial index that uses Base-32 alphabet encoding: the first character identifies the initial location as one of 32 cells, and each of those cells is in turn subdivided into 32 cells. This means that to represent a point, the world is recursively divided into smaller and smaller cells with each additional bit until the desired precision is attained.


San Francisco, with coordinates 37.7564, -122.4016, can be represented by the geohash 9q8yy9mf. Using the customer's geohash, we can determine the nearest available drivers by simply comparing it with the drivers' geohashes. For better performance, we will index and store driver geohashes in memory for faster retrieval.
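
Below is a self-contained sketch of the standard geohash encoding algorithm, plus prefix matching to find nearby drivers. The first coordinate pair is the San Francisco example above; the driver positions and the precision-5 prefix search are illustrative:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Standard geohash: interleave longitude/latitude bits, 5 bits per char."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, idx, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            idx = (idx << 1) | 1
            rng[0] = mid
        else:
            idx <<= 1
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:  # every 5 bits become one Base-32 character
            chars.append(BASE32[idx])
            bits, idx = 0, 0
    return "".join(chars)

print(geohash_encode(37.7564, -122.4016))  # 9q8yy9mf, as in the example above

# Nearby search: drivers whose geohash shares a long prefix are close by.
# (A full solution also checks the 8 neighboring cells for boundary cases.)
drivers = {"d1": geohash_encode(37.7570, -122.4005),   # a few blocks away
           "d2": geohash_encode(40.7128, -74.0060)}    # New York, far away
rider_prefix = geohash_encode(37.7564, -122.4016, precision=5)
print([d for d, gh in drivers.items() if gh.startswith(rider_prefix)])  # ['d1']
```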


Quadtrees


A Quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions, with each child or leaf node storing spatial information. We can update the Quadtree every time we receive a new location update from a driver. To reduce the load on the quadtree servers, we can use an in-memory datastore such as Redis to cache the latest updates.
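
A minimal point quadtree with insert and bounding-box query illustrates the structure; the node capacity of 4 and the coordinates are arbitrary choices for this sketch:

```python
class QuadTree:
    """Stores (lon, lat, driver_id) points; splits a cell when it overflows."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)  # (min lon, min lat, max lon, max lat)
        self.capacity = capacity
        self.points = []
        self.children = None            # four sub-quadrants once split

    def insert(self, x, y, driver_id):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False                # point lies outside this cell
        if self.children is None:
            self.points.append((x, y, driver_id))
            if len(self.points) > self.capacity:
                self._split()
            return True
        return any(c.insert(x, y, driver_id) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadTree(x0, y0, mx, my, self.capacity),
                         QuadTree(mx, y0, x1, my, self.capacity),
                         QuadTree(x0, my, mx, y1, self.capacity),
                         QuadTree(mx, my, x1, y1, self.capacity)]
        for p in self.points:           # push existing points down a level
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, qx0, qy0, qx1, qy1, found=None):
        """Return all points inside the query rectangle."""
        found = [] if found is None else found
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
            return found                # no overlap with this cell
        for (x, y, d) in self.points:
            if qx0 <= x <= qx1 and qy0 <= y <= qy1:
                found.append((x, y, d))
        for c in self.children or []:
            c.query(qx0, qy0, qx1, qy1, found)
        return found

world = QuadTree(-180, -90, 180, 90)
world.insert(-122.4016, 37.7564, "driver-1")
print(world.query(-122.45, 37.70, -122.35, 37.80))  # drivers near downtown SF
```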


How to find the best drivers nearby?


Once we have a list of nearby drivers from the Quadtree servers, we can perform some sort of ranking based on parameters like average ratings, relevance, past customer feedback, etc. This will allow us to broadcast notifications to the best available drivers first.
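
For example, a simple scoring function might combine rating and distance; the 0.5 weight on distance is an arbitrary illustrative choice:

```python
def rank_drivers(candidates):
    """Sort nearby drivers: higher rating is better, closer is better.

    Each candidate is a dict like {"id": ..., "distance_km": ..., "rating": ...}.
    """
    return sorted(candidates,
                  key=lambda d: d["rating"] - 0.5 * d["distance_km"],
                  reverse=True)

drivers = [{"id": "d1", "distance_km": 1.2, "rating": 4.9},
           {"id": "d2", "distance_km": 0.4, "rating": 4.2},
           {"id": "d3", "distance_km": 0.8, "rating": 4.8}]
print([d["id"] for d in rank_drivers(drivers)])  # ['d3', 'd1', 'd2']
```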


Dealing with high demand


In cases of high demand, we can use surge pricing. Surge pricing is a dynamic pricing method where prices are temporarily increased in response to increased demand and limited supply. The surge amount is added to the base price of the trip.


When prices are surging, a multiplier to standard rates, an additional surge amount, or an upfront fare including the surge amount is shown to the customer; this varies depending on the user's city. Surge pricing rates are updated based on real-time demand, so surge can change quickly. Surge pricing is also specific to different areas within a city, so some neighborhoods may have surge pricing while others do not.
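
A toy surge calculation based on the demand-to-supply ratio in an area; the 1.0 floor, linear ramp, and 3.0 cap are invented for illustration:

```python
def surge_multiplier(open_requests: int, available_drivers: int,
                     cap: float = 3.0) -> float:
    """Return a price multiplier for an area based on demand vs. supply.

    The 1.0 floor, linear ramp, and cap are illustrative assumptions;
    real surge models are far more sophisticated.
    """
    if available_drivers == 0:
        return cap
    ratio = open_requests / available_drivers
    return min(cap, max(1.0, ratio))

base_fare = 12.50
# 90 open requests vs. 60 drivers -> 1.5x multiplier -> 18.75
print(base_fare * surge_multiplier(open_requests=90, available_drivers=60))
```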


Notifications

Using a message queue or message broker like Apache Kafka with the notification service is a robust approach for dispatching push notifications efficiently. Here's how the setup works (a producer sketch follows the list):

  • Notification Service:
      • Receives requests to send push notifications from various parts of the system, such as ride updates, chat messages, or payment confirmations.
      • Enqueues these requests into a message queue or a message broker for asynchronous processing.
  • Message Queue / Broker:
      • Acts as an intermediary between the notification service and the push notification delivery services (e.g., Firebase Cloud Messaging, Apple Push Notification Service).
      • Stores notification requests in a queue or topic, ensuring reliable and ordered delivery to the delivery service.
  • Delivery Service (e.g., FCM, APNS):
      • Retrieves notification requests from the message queue or topic.
      • Sends push notifications to user devices based on the provided recipient information and message content.
      • Handles device registration, message routing, and delivery status tracking.
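
Here is a sketch of the enqueue side using the kafka-python client; the topic name, payload shape, and keying-by-user choice are assumptions for illustration:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enqueue_notification(user_id: str, kind: str, payload: dict) -> None:
    """Publish a notification event; a downstream worker delivers via FCM/APNS."""
    event = {"user_id": user_id, "type": kind, "payload": payload}
    # Keying by user sends all of a user's notifications to one partition,
    # which preserves per-user ordering.
    producer.send("notifications", key=user_id.encode(), value=event)

enqueue_notification("user-42", "driver_arrived",
                     {"ride_id": "ride-123", "eta_minutes": 0})
producer.flush()  # ensure the message is actually sent before exiting
```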


Benefits of this approach:

  • Scalability: Message queues and brokers can handle high volumes of notification requests, allowing the system to scale horizontally to meet increasing demand.
  • Reliability: By decoupling the notification service from the delivery service, the system can ensure reliable message delivery even in the face of failures or fluctuations in load.
  • Asynchronous Processing: Using a message queue allows for asynchronous processing of notification requests, preventing blocking operations and improving system responsiveness.
  • Fault Tolerance: Message queues and brokers provide fault tolerance features such as message persistence and replication, ensuring that notification requests are not lost even in the event of system failures.


Trade-offs/Tech choices


  1. Asynchronous Communication with Message Queue:
      • Trade-off: While asynchronous communication with a message queue enhances system scalability and fault tolerance, it introduces complexity in managing message processing and potential message delivery failures.
      • Tech Choice: Despite the added complexity, we opted for asynchronous communication with a message queue to decouple components, improve responsiveness, and handle high volumes of notifications efficiently.
  2. Database Choices: SQL vs. NoSQL:
      • Trade-off: SQL databases offer strong consistency and structured data storage but may face scalability challenges with increasing data volume. NoSQL databases provide better scalability and flexibility but may sacrifice some consistency guarantees.
      • Tech Choice: We opted for a combination of SQL and NoSQL databases to leverage the strengths of each. SQL databases are used for structured data such as user profiles and payments, while NoSQL databases handle unstructured or semi-structured data like real-time tracking updates and notifications.
  3. Push Notification Delivery Service:
      • Trade-off: Choosing between Firebase Cloud Messaging (FCM) and Apple Push Notification Service (APNS) involves trade-offs in terms of platform support, features, and integration complexity.
      • Tech Choice: We selected FCM and APNS as they are widely adopted, offer robust features for push notification delivery across Android and iOS platforms, and can sit behind a message queue such as Apache Kafka.


Future improvements

To make our system more resilient we can do the following:

  • Running multiple instances of each of our services.
  • Introducing load balancers between clients, servers, databases, and cache servers.
  • Using multiple read replicas for our databases.
  • Multiple instances and replicas for our distributed cache.
  • Exactly-once delivery and message ordering are challenging in a distributed system; we can use a dedicated message broker such as Apache Kafka or NATS to make our notification system more robust.