Codemia | Master System Design Interviews Through Active Practice

My Solution for Design a Network Connection Path Analyzer

by nectar4678

System requirements

Functional:

Find Shortest Connection Path: The system should determine the shortest connection path (in terms of hops) between two users in the professional network.
Connection Discovery: Users should be able to discover how they are connected with another user (first-degree, second-degree, etc.).
Query for Potential Connections: The system should provide a way to suggest potential contacts based on mutual connections.
Real-time Queries: The system should return results in real-time (or near real-time) for queries.
Profile Search: A search feature should be available to find users in the network and view their connection path.
Scalable Architecture: The system should be able to scale as the user base and connections grow.
API for External Access: Provide APIs for third-party applications to access connection paths between users.

Non-Functional:

Scalability: The system must handle up to 1 million users with an average of 200 connections per user.
Low Latency: Query response time should be under 1 second, even under peak loads (up to 1000 queries per second).
Availability: The system must ensure 99.99% uptime and be fault-tolerant.
Consistency: Ensure that the data is eventually consistent, with up-to-date connections visible as they are made.
Data Privacy: Handle user data securely, ensuring that users only see connection paths they are authorized to view.
Fault Tolerance: The system should handle hardware failures, network issues, and handle automatic recovery in case of a crash.

Capacity estimation

To estimate capacity, we can use the following assumptions:

Users: 1 million users.
Average Connections per User: 200 connections per user.
Peak Queries per Second (QPS): Up to 1000 queries during peak time.

Storage Estimation

We assume that the connections are stored as an adjacency list or graph structure in a database. Each connection consists of two user IDs, so the required storage space can be calculated as:

Total Connections: 1,000,000×200=200,000,0001,000,000 \times 200 = 200,000,0001,000,000×200=200,000,000 connections.
Each connection entry: Assuming each user ID is 8 bytes (64-bit), a connection between two users requires 16 bytes.
Total Storage for Connections: 200,000,000×16 bytes≈3.2 GB200,000,000 \times 16 \text{ bytes} \approx 3.2 \text{ GB}200,000,000×16 bytes≈3.2 GB.

We need to allocate additional space for indexing, metadata, and other entities, such as user profiles. Estimating around 50% overhead, the total space required for storing connections and metadata would be approximately 4.8-5 GB.

API design

Find Shortest Connection Path
Add New Connection
Search for User

Database design

High-level design

We can break the system into the following components:

API Gateway: Handles user requests and routes them to the appropriate microservices.
Graph Processor: The service that calculates shortest paths between users.
User Service: Manages user data, including profiles and connections.
Graph Database: Stores the graph structure of user connections.
Cache Layer: For faster retrieval of frequently queried connection paths.

Request flows

Find Shortest Connection Path

API Request:
- The client sends a GET request to /api/connection/shortest, providing the from_user_id and to_user_id.
API Gateway:
- The API Gateway receives the request and forwards it to the Graph Processor service for connection path calculation.
Graph Processor:
- The Graph Processor retrieves the necessary connection data from the Graph Database.
- If the requested path is complex or has been recently calculated, it first checks the Cache Layer to see if the result is already available.
Graph Database:
- If the path is not in the cache, the Graph Processor queries the Graph Database to fetch the connections related to both users (from_user_id and to_user_id).
- It uses a graph traversal algorithm (like Breadth-First Search) to find the shortest path between the two users.
Cache Update:
- After computing the shortest path, the result is stored in the Cache Layer for future queries, reducing the need for repeated calculations.
Response:
- The Graph Processor sends the shortest path result back to the API Gateway, which returns it to the client.

Add New Connection

API Request:
- The client sends a POST request to /api/connection/add, providing the two user IDs that are forming a connection.
API Gateway:
- The request is routed to the User Service, which handles all user-related data, including adding new connections.
User Service:
- The User Service validates the request and writes the new connection to the Graph Database.
Graph Database Update:
- The new connection is inserted into the Graph Database, updating the adjacency list or graph representation of the network.
Cache Invalidation:
- The system invalidates any cached connection paths that may involve either of the two users, ensuring that future queries fetch fresh data.
Response:
- The User Service returns a success message to the API Gateway, which sends the response back to the client.

Detailed component design

Graph Processor

The Graph Processor is responsible for calculating the shortest path between two users. Given that the system must scale to handle millions of users and hundreds of millions of connections, the Graph Processor must be designed to handle this efficiently.

Key Responsibilities:

Path Calculation: Implement graph traversal algorithms like Breadth-First Search (BFS) to calculate the shortest path (minimum hops) between two users.
Cache Querying: Before performing any graph calculations, check the Cache Layer for precomputed paths.
Load Balancing: Distribute graph processing tasks across multiple servers to ensure the system scales under high load.
Asynchronous Processing: For heavy queries, ensure that requests can be processed asynchronously to prevent blocking.

Algorithm Choice: Breadth-First Search (BFS)

Breadth-First Search (BFS) is ideal for finding the shortest path in an unweighted graph, such as a social network where each connection represents an equal "hop" between users.

Time Complexity: O(V+E)O(V + E)O(V+E), where VVV is the number of users (vertices) and EEE is the number of connections (edges). This is efficient given the structure of social networks.
Process:
1. Begin at the from_user_id node.
2. Explore each neighboring node (connected users), adding them to a queue.
3. Continue this exploration, keeping track of the number of hops until the to_user_id is found.
4. Return the path and hop count.

Scalability and Parallelism:

For larger queries (with high-degree nodes or long paths), the Graph Processor can distribute the BFS search across multiple instances using a distributed graph processing framework such as Apache Giraph or GraphX in Apache Spark.
The system can shard the user graph into smaller subgraphs, assigning each shard to a dedicated server for parallel processing.

Fault Tolerance:

Redundant Processing: If a graph processor instance fails while calculating a path, the request is retried on another instance.
Timeouts: Set timeout limits for BFS searches to prevent excessive resource consumption for large or complex queries.

Trade offs/Tech choices

Performance vs Complexity: While caching and horizontal scaling improve performance, they introduce complexity in managing cache invalidation and distributed data.
Consistency vs Availability: The system prioritizes availability over strict consistency, opting for eventual consistency to ensure responsiveness.
Simple Algorithm (BFS) vs More General (Dijkstra): BFS is chosen for simplicity and speed since we’re working with an unweighted graph, though it may need to change if we introduce weights in the future.

Failure scenarios/bottlenecks

Graph Database Bottleneck: Mitigated by replication, sharding, and graceful degradation.
Cache Invalidation Issues: Handled via event-driven invalidation and short TTL settings.
Cache Overload: Managed through distributed caching and autoscaling.
Graph Processor Failures: Addressed through redundancy, load balancing, and retry mechanisms.
Network Partitioning: Ensured by designing for partition tolerance and eventual consistency.
API Gateway Overload: Controlled with autoscaling, rate limiting, and circuit breakers.
User Data Privacy Violations: Prevented through access control and data encryption.

Future improvements

Introduce weighted edges in the graph to reflect the strength of a connection (e.g., close colleagues vs. distant acquaintances). This would allow for more meaningful pathfinding queries that consider relationship strength, not just the shortest path in terms of hops.
Implement a real-time event-driven architecture to handle updates to the graph. With real-time updates, users can see new connections instantly without any delay.
Use approximate query algorithms (like Approximate Nearest Neighbor or Sketching Algorithms) to optimize pathfinding for users with many connections, sacrificing some accuracy for much faster responses.

Supports markdown