My Solution for Design a Key Value Store with Score: 8/10
by iridescent_luminous693
System requirements
Functional Requirements
- Basic Operations:
  - Support `set`, `get`, `delete`, and `update` operations.
  - Enable conditional updates (e.g., compare-and-swap).
- Data Consistency:
- Provide configurable consistency models (e.g., eventual consistency or strong consistency).
- Data Partitioning:
- Use consistent hashing or range-based partitioning to distribute keys across nodes.
- Replication and Fault Tolerance:
- Replicate data across multiple nodes for durability and availability.
- Ensure failover and recovery in case of node failures.
- Persistence:
- Provide options for durable storage to prevent data loss (e.g., snapshots, write-ahead logs).
- Scalability:
- Scale horizontally by adding more nodes to handle increased load.
- High Availability:
- Ensure the system remains operational during node failures or maintenance.
- Efficient Querying:
- Optimize for low-latency reads and writes.
- Support batch operations for efficiency.
Non-Functional Requirements
- Performance:
- Maintain sub-millisecond latency for read and write operations under normal loads.
- Scalability:
- Handle millions of operations per second and scale seamlessly as the cluster grows.
- Availability:
- Ensure 99.99% uptime, supporting redundancy and failover mechanisms.
- Durability:
- Guarantee data persistence with replication and backups.
- Security:
- Secure communication with encryption and access controls (e.g., role-based access).
- Consistency:
- Provide configurable options for strong or eventual consistency.
Capacity estimation
Estimate the scale of the system you are going to design...
Assumptions:
- Keys and Values:
- Average key size: 50 bytes.
- Average value size: 500 bytes.
- Total entries: 1 billion keys.
- Storage:
- Total data size: (50 + 500) bytes × 1 billion = 550 GB.
- Replication factor: 3.
- Total storage required: 550 GB × 3 = 1.65 TB.
- Requests:
- Daily operations: 10 billion (`get` 80%, `set` 15%, `delete` 5%).
- Average operations per second: 10 billion ÷ 86,400 ≈ 115,740 RPS (peaks will be higher, so provision headroom above this average).
- Nodes:
- Node capacity: 100 GB.
- Nodes required (with replication): 1.65 TB ÷ 100 GB = 16.5, rounded up to 17 nodes (sanity-checked in the sketch below).
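A quick back-of-envelope check of the estimates above; the sizes, rates, and node capacity are the assumptions stated in this section, not measured values.

```python
# Capacity sanity check using the assumptions above (not measured values).
import math

KEY_BYTES, VALUE_BYTES = 50, 500
TOTAL_KEYS = 1_000_000_000
REPLICATION_FACTOR = 3
DAILY_OPS = 10_000_000_000
NODE_CAPACITY_GB = 100

raw_gb = (KEY_BYTES + VALUE_BYTES) * TOTAL_KEYS / 1e9          # 550.0 GB
replicated_tb = raw_gb * REPLICATION_FACTOR / 1e3              # 1.65 TB
avg_rps = DAILY_OPS / 86_400                                   # ~115,741 RPS
nodes = math.ceil(replicated_tb * 1e3 / NODE_CAPACITY_GB)      # 17 nodes

print(f"{raw_gb:.0f} GB raw, {replicated_tb:.2f} TB replicated, "
      f"{avg_rps:,.0f} avg RPS, {nodes} nodes")
```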
API design
Define what APIs are expected from the system...
1. Core Data Operations APIs
- POST `/api/set`:
  - Input: `{ key: string, value: string, ttl: int (optional) }`
  - Output: `{ success: boolean }`
  - Stores a key-value pair with an optional time-to-live (TTL).
- GET `/api/get/{key}`:
  - Input: `{ key: string }`
  - Output: `{ value: string, ttl: int (if applicable) }`
  - Retrieves the value associated with a key.
- DELETE `/api/delete/{key}`:
  - Input: `{ key: string }`
  - Output: `{ success: boolean }`
  - Deletes the specified key-value pair.
- POST `/api/compare_and_set`:
  - Input: `{ key: string, old_value: string, new_value: string }`
  - Output: `{ success: boolean }`
  - Updates the value only if it matches the old value.
2. Cluster Management APIs
- POST `/api/nodes/add`:
  - Adds a new node to the cluster and rebalances data.
- DELETE `/api/nodes/remove/{node_id}`:
  - Removes a node and redistributes its data.
- GET `/api/nodes/status`:
  - Retrieves the health and status of all nodes in the cluster.
3. Monitoring and Metrics APIs
- GET `/api/metrics`:
  - Provides system performance metrics (e.g., throughput, latency).
- GET `/api/replication/status`:
  - Retrieves replication status for all keys.
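A minimal client sketch showing how the core APIs above might be called; the base URL, bearer-token header, and use of the `requests` library are illustrative assumptions, not part of the API definition.

```python
# Illustrative client for the APIs above (base URL and auth token are hypothetical).
import requests

BASE = "http://kvstore.example.com"                     # assumed endpoint
HEADERS = {"Authorization": "Bearer <api-token>"}       # assumed auth scheme

# Store a key with a 1-hour TTL.
requests.post(f"{BASE}/api/set",
              json={"key": "user:42", "value": "alice", "ttl": 3600},
              headers=HEADERS)

# Read it back.
resp = requests.get(f"{BASE}/api/get/user:42", headers=HEADERS)
print(resp.json())                                      # {"value": "alice", "ttl": ...}

# Conditional update: only succeeds if the current value is still "alice".
requests.post(f"{BASE}/api/compare_and_set",
              json={"key": "user:42", "old_value": "alice", "new_value": "bob"},
              headers=HEADERS)
```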
Database design
Defining the system data model early on will clarify how data will flow among different components of the system. Also you could draw an ER diagram using the diagramming tool to enhance your design...
1. Key-Value Store Database
- Schema Details:
  - Table Name: `KeyValueStore`
    - `key` (Primary Key): Unique identifier for each key.
    - `value`: Associated value for the key.
    - `ttl`: Time-to-live in seconds (optional).
    - `replica_nodes`: List of nodes storing replicas.
    - `updated_at`: Last modification timestamp.
- Purpose:
- Store and manage key-value pairs with replication and TTL support.
- Tech Used:
- Distributed NoSQL Databases (e.g., DynamoDB, Cassandra).
- Tradeoff:
- Pros: High scalability, fault tolerance, and low-latency operations.
- Cons: Limited support for complex queries.
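A plain-Python model of one `KeyValueStore` record, useful for reasoning about TTL handling; the `is_expired` helper is an illustrative addition, not part of the schema above.

```python
# One KeyValueStore record (field names follow the schema above; is_expired is illustrative).
from dataclasses import dataclass, field
from typing import List, Optional
import time

@dataclass
class KeyValueRecord:
    key: str                                    # primary key
    value: str
    ttl: Optional[int] = None                   # seconds; None means no expiry
    replica_nodes: List[str] = field(default_factory=list)
    updated_at: float = field(default_factory=time.time)

    def is_expired(self, now: Optional[float] = None) -> bool:
        if self.ttl is None:
            return False
        return (now or time.time()) >= self.updated_at + self.ttl
```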
2. Replication Log Database
- Schema Details:
  - Table Name: `ReplicationLog`
    - `log_id` (Primary Key): Unique identifier for the log entry.
    - `key`: Key being replicated.
    - `source_node`: Node from which the key is replicated.
    - `destination_node`: Node to which the key is replicated.
    - `status`: Replication status (e.g., in-progress, completed).
    - `timestamp`: Timestamp of the replication event.
- Purpose:
- Track and manage data replication across nodes.
- Tech Used:
- Time-Series Databases (e.g., InfluxDB).
- Tradeoff:
- Pros: Optimized for sequential writes and replication tracking.
- Cons: Less suited for complex joins or relational queries.
3. Cluster Metadata Database
- Schema Details:
  - Table Name: `ClusterMetadata`
    - `node_id` (Primary Key): Unique identifier for each node.
    - `capacity`: Storage capacity of the node.
    - `available_space`: Remaining space on the node.
    - `health_status`: Health of the node (e.g., healthy, degraded).
    - `updated_at`: Last status update timestamp.
- Purpose:
- Manage metadata for cluster nodes, including health and storage status.
- Tech Used:
- Relational Databases (e.g., PostgreSQL).
- Tradeoff:
- Pros: Ensures consistency and integrity for cluster metadata.
- Cons: Requires sharding or partitioning for scalability.
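If PostgreSQL were chosen as suggested above, the `ClusterMetadata` table could be created with DDL along these lines; the data types, defaults, and connection string are assumptions for illustration.

```python
# Hypothetical DDL for the ClusterMetadata table, issued through psycopg2
# (connection parameters and column types are placeholders).
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS cluster_metadata (
    node_id          TEXT PRIMARY KEY,
    capacity         BIGINT NOT NULL,
    available_space  BIGINT NOT NULL,
    health_status    TEXT NOT NULL DEFAULT 'healthy',
    updated_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

with psycopg2.connect("dbname=kvstore user=admin") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```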
High-level design
You should identify enough components that are needed to solve the actual problem from end to end. Also remember to draw a block diagram using the diagramming tool to augment your design. If you are unfamiliar with the tool, you can simply describe your design to the chat bot and ask it to generate a starter diagram for you to modify...
1. API Gateway
Overview:
Acts as the entry point for all client requests, handling routing, authentication, rate limiting, and load balancing. It ensures secure and efficient communication with backend services.
Responsibilities:
- Authenticate incoming requests (e.g., API tokens).
- Route requests to appropriate backend services.
- Enforce rate limits and quotas for client requests.
2. Partitioning and Routing Service
Overview:
Determines which node stores a given key and routes requests to the correct node. It implements consistent hashing or range-based partitioning for efficient data distribution.
Responsibilities:
- Map keys to nodes using partitioning algorithms.
- Handle node additions/removals and rebalance partitions.
- Provide fault-tolerant routing during node failures.
3. Storage Nodes
Overview:
Store the actual key-value pairs and serve `get`, `set`, and `delete` operations. Each node is responsible for a specific subset of the key space.
Responsibilities:
- Store, retrieve, and delete key-value pairs.
- Ensure data durability with replication or persistence mechanisms.
- Respond to replication or rebalancing requests.
4. Replication Manager
Overview:
Manages data replication across nodes to ensure fault tolerance and availability. It tracks the replication status and re-replicates data from failed nodes to healthy ones.
Responsibilities:
- Enforce the replication factor for each key.
- Monitor node health and redistribute data as needed.
- Handle consistency during replication (e.g., quorum-based updates).
5. Persistence Layer
Overview:
Provides durable storage for key-value pairs, ensuring no data loss even after node failures or crashes. It uses mechanisms like write-ahead logging (WAL) or snapshots.
Responsibilities:
- Store data on disk or other persistent media.
- Periodically take snapshots to enable faster recovery.
- Manage log compaction for efficiency.
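A minimal sketch of crash recovery from the write-ahead log; it assumes the simple one-operation-per-line WAL format used in the storage-node example later in this design, with a hypothetical `DEL` entry for deletions.

```python
# Minimal WAL replay on startup (assumes "SET key value" / "DEL key" entries, one per line).
def replay_wal(wal_path: str) -> dict:
    store = {}
    try:
        with open(wal_path) as wal:
            for line in wal:
                parts = line.rstrip("\n").split(" ", 2)
                if parts[0] == "SET" and len(parts) == 3:
                    store[parts[1]] = parts[2]
                elif parts[0] == "DEL" and len(parts) == 2:
                    store.pop(parts[1], None)
    except FileNotFoundError:
        pass                      # no WAL yet: start with an empty store
    return store
```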
6. Monitoring and Health Manager
Overview:
Tracks system health, performance, and usage metrics. It detects failures and triggers alerts or recovery actions when anomalies are detected.
Responsibilities:
- Monitor node health and resource usage (CPU, memory, disk).
- Log metrics like request latency and throughput.
- Notify administrators about failures or performance degradation.
7. Cluster Manager
Overview:
Handles cluster-wide operations like adding/removing nodes, rebalancing partitions, and managing configuration changes.
Responsibilities:
- Rebalance the key space when nodes join or leave.
- Maintain a consistent view of the cluster state across all nodes.
- Coordinate with other services for seamless scaling and maintenance.
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
1. Set Request
Objective: Store a key-value pair in the system.
Steps:
- API Gateway:
- Receives a `POST /api/set` request with the key-value pair.
- Authenticates the client and enforces rate limits.
- Partitioning and Routing Service:
- Determines the storage node responsible for the key using consistent hashing.
- Routes the request to the primary node for the partition.
- Storage Node:
- Writes the key-value pair to its in-memory store.
- Logs the operation in the write-ahead log (WAL) for persistence.
- Replication Manager:
- Replicates the key-value pair to secondary nodes.
- Ensures replication is successful and notifies the primary node.
- Response:
- Confirms the success of the operation to the client.
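A condensed sketch of the write path described above; the component handles (`router`, `primary`, `replication_manager`) and the `get_replica_nodes` helper are illustrative names for the services in the high-level design, not APIs defined elsewhere.

```python
# Illustrative "set" path (component names and get_replica_nodes are placeholders).
def handle_set(key, value, router, replication_manager, replicas=2):
    primary = router.get_node(key)                            # consistent-hashing lookup
    primary.set(key, value)                                   # in-memory write + WAL append
    secondaries = router.get_replica_nodes(key, count=replicas)
    replication_manager.replicate(key, value, secondaries)    # push to secondary nodes
    return {"success": True}
```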
2. Get Request
Objective: Retrieve the value associated with a key.
Steps:
- API Gateway:
- Receives a `GET /api/get/{key}` request.
- Authenticates the client and enforces rate limits.
- Partitioning and Routing Service:
- Identifies the primary node for the key and routes the request.
- Storage Node:
- Looks up the key in its in-memory store or on-disk storage.
- Verifies data consistency if requested (e.g., quorum reads).
- Response:
- Returns the value to the client.
3. Delete Request
Objective: Remove a key-value pair from the system.
Steps:
- API Gateway:
- Receives a `DELETE /api/delete/{key}` request.
- Authenticates the client and enforces rate limits.
- Partitioning and Routing Service:
- Determines the node responsible for the key and routes the request.
- Storage Node:
- Deletes the key-value pair from its in-memory store and updates the WAL.
- Notifies the Replication Manager to propagate the deletion to replicas.
- Replication Manager:
- Ensures the key is deleted across all replicas.
- Response:
- Confirms the success of the deletion to the client.
4. Conditional Update Request
Objective: Update the value of a key only if it matches an existing value.
Steps:
- API Gateway:
- Receives a `POST /api/compare_and_set` request with the key, old value, and new value.
- Authenticates the client and enforces rate limits.
- Partitioning and Routing Service:
- Routes the request to the primary node for the key.
- Storage Node:
- Checks if the current value matches the provided old value.
- Updates the value if the condition is met and logs the change in the WAL.
- Replication Manager:
- Replicates the updated value to secondary nodes.
- Response:
- Confirms success or failure of the conditional update.
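A sketch of the compare-and-set check on the storage node; `node` is assumed to be the `StorageNode` sketched later in the detailed design, and the lock is an assumption to make the check atomic on a single node (cross-replica atomicity still relies on the Replication Manager).

```python
# Node-local compare-and-set (the lock is an assumption for single-node atomicity).
import threading

_cas_lock = threading.Lock()

def compare_and_set(node, key, old_value, new_value):
    """Update key on `node` only if its current value equals old_value."""
    with _cas_lock:
        if node.store.get(key) != old_value:
            return False                            # mismatch: report failure to the client
        node.wal.write(f"SET {key} {new_value}\n")  # log before applying
        node.wal.flush()
        node.store[key] = new_value
        return True
```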
5. Node Addition
Objective: Add a new node to the cluster and rebalance data.
Steps:
- Cluster Manager:
- Adds the new node to the cluster metadata.
- Rebalances the key space by redistributing partitions.
- Partitioning and Routing Service:
- Updates the consistent hashing ring to include the new node.
- Replication Manager:
- Replicates data from existing nodes to the new node.
- Monitoring Service:
- Tracks the new node’s health and usage metrics.
- Response:
- Confirms the addition of the new node and rebalancing completion.
6. Node Failure Recovery
Objective: Recover data from a failed node.
Steps:
- Monitoring and Health Manager:
- Detects node failure through heartbeat checks.
- Notifies the Cluster Manager and Replication Manager.
- Replication Manager:
- Identifies data stored on the failed node.
- Re-replicates data from existing replicas to healthy nodes.
- Partitioning and Routing Service:
- Updates the routing map to exclude the failed node.
- Response:
- Confirms recovery and restores system redundancy.
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
2. Partitioning and Routing Service
End-to-End Working:
This service determines which storage node is responsible for a given key using partitioning algorithms like consistent hashing. When a key is provided (e.g., for `get` or `set` operations), it calculates the hash, finds the appropriate partition, and routes the request to the responsible node.
Communication:
- Protocols Used:
- REST or gRPC: For communicating with the Storage Nodes.
- Inter-Service Communication:
- Interacts with the Storage Nodes to perform data operations.
- Coordinates with the Cluster Manager during rebalancing events.
Data Structures and Algorithms:
- Consistent Hashing:
- Maps keys to nodes on a virtual hash ring. Nodes are assigned multiple virtual slots to ensure even distribution.
- Routing Table:
- Maintains a mapping of partitions to nodes, updated dynamically during scaling.
Implementation Example (Consistent Hashing):
```python
from sortedcontainers import SortedDict  # pip install sortedcontainers

class ConsistentHashing:
    def __init__(self, replicas=3):
        self.replicas = replicas      # number of virtual nodes per physical node
        self.ring = SortedDict()      # hash position -> physical node, kept sorted

    def add_node(self, node):
        for i in range(self.replicas):
            # In production use a stable hash (e.g., MD5); Python's hash() varies per process.
            key = hash(f"{node}-{i}")
            self.ring[key] = node

    def get_node(self, key):
        if not self.ring:
            return None
        hashed_key = hash(key)
        # First ring position at or after the key's hash, wrapping around the ring.
        node_index = self.ring.bisect_left(hashed_key) % len(self.ring)
        return self.ring.peekitem(node_index)[1]
```
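A brief usage example of the ring above; the node names are placeholders, and the exact mapping depends on the hash values, so the output is only illustrative.

```python
ring = ConsistentHashing(replicas=100)   # more virtual nodes -> smoother key distribution
for node in ["node-a", "node-b", "node-c"]:
    ring.add_node(node)

print(ring.get_node("user:42"))          # one of node-a/b/c, stable until membership changes
```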
Scaling for Peak Traffic:
- Dynamic Rebalancing:
- Reassigns partitions to new nodes dynamically as the cluster grows.
- Efficient Hashing:
- Uses consistent hashing to minimize data movement during scaling.
Edge Cases:
- Node Failures:
- Routes requests to replica nodes to maintain availability.
- Hot Partitions:
- Mitigated by splitting partitions or using weighted hashing.
3. Storage Nodes
End-to-End Working:
Storage nodes are responsible for storing key-value pairs and ensuring data durability. For `set` requests, the data is stored in memory and logged in a write-ahead log (WAL) for persistence. For `get` requests, the node retrieves the value from memory or disk.
Communication:
- Protocols Used:
- REST or gRPC: For receiving requests from the Partitioning and Routing Service.
- Inter-Service Communication:
- Notifies the Replication Manager about updates for data replication.
Data Structures and Algorithms:
- In-Memory Storage:
- Uses a hash map for fast key-value lookups.
- Write-Ahead Log (WAL):
- Logs every write operation to disk for crash recovery.
- TTL Index:
- Maintains an ordered map of keys with expiration times for TTL-based operations.
Implementation Example (In-Memory Storage with WAL):
```python
class StorageNode:
    def __init__(self, wal_path="wal.log"):
        self.store = {}                          # in-memory key-value map
        self.wal = open(wal_path, "a+")          # append-only write-ahead log

    def set(self, key, value):
        # Log before applying so the operation can be replayed after a crash.
        self.wal.write(f"SET {key} {value}\n")
        self.wal.flush()
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)
```
Scaling for Peak Traffic:
- Horizontal Scaling:
- Add more nodes to distribute the key space.
- Sharding:
- Split large partitions into smaller shards.
Edge Cases:
- Data Corruption:
- Validates data integrity using checksums.
- Disk Full:
- Implements disk space monitoring and alerts.
4. Replication Manager
End-to-End Working:
The Replication Manager ensures data redundancy by replicating key-value pairs to secondary nodes. It monitors the replication factor and re-replicates data during node failures.
Communication:
- Protocols Used:
- gRPC: For initiating replication tasks with storage nodes.
- Inter-Service Communication:
- Interacts with the Monitoring Service to detect node failures.
Data Structures and Algorithms:
- Replication Tracker:
- Maintains a list of replica nodes for each key.
- Quorum-Based Replication:
- Ensures consistency by requiring a quorum of nodes to acknowledge writes.
Implementation Example (Replication Tracker):
```python
class ReplicationManager:
    def __init__(self):
        self.replica_map = {}                    # key -> list of replica nodes

    def replicate(self, key, value, nodes):
        # Record which nodes hold replicas, then push the value to each of them.
        self.replica_map[key] = nodes
        for node in nodes:
            node.set(key, value)
```
Scaling for Peak Traffic:
- Asynchronous Replication:
- Handles replication tasks asynchronously to reduce latency.
- Batch Replication:
- Groups updates to reduce network overhead.
Edge Cases:
- Replication Delays:
- Uses timeouts and retries to ensure eventual consistency.
- Conflicts:
- Resolves conflicts using version vectors or timestamps.
5. Cluster Manager
End-to-End Working:
The Cluster Manager handles cluster-wide operations like adding/removing nodes and rebalancing partitions. It ensures a consistent view of the cluster state and coordinates updates with other services.
Communication:
- Protocols Used:
- gRPC: Communicates with all nodes for rebalancing tasks.
- Inter-Service Communication:
- Coordinates with the Partitioning and Routing Service for updates.
Data Structures and Algorithms:
- Membership Table:
- Tracks the status and capacity of all nodes.
- Rebalancing Algorithm:
- Redistributes partitions to ensure even load distribution.
Scaling for Peak Traffic:
- Dynamic Scaling:
- Automatically adds or removes nodes based on load metrics.
Edge Cases:
- Split Brain Scenarios:
- Mitigated using consensus algorithms like Raft.
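No implementation example is given for this component, so here is a minimal sketch of the membership table with heartbeat-based health tracking; the timeout value and field names are assumptions, and real deployments would back this with a consensus layer such as Raft as noted above.

```python
# Minimal membership table with heartbeat-based health checks (timeout is an assumed value).
import time

class MembershipTable:
    HEARTBEAT_TIMEOUT = 10.0                 # seconds without a heartbeat before exclusion

    def __init__(self):
        self.nodes = {}                      # node_id -> {"capacity_gb": ..., "last_heartbeat": ...}

    def add_node(self, node_id, capacity_gb):
        self.nodes[node_id] = {"capacity_gb": capacity_gb, "last_heartbeat": time.time()}

    def heartbeat(self, node_id):
        if node_id in self.nodes:
            self.nodes[node_id]["last_heartbeat"] = time.time()

    def healthy_nodes(self):
        now = time.time()
        return [nid for nid, meta in self.nodes.items()
                if now - meta["last_heartbeat"] < self.HEARTBEAT_TIMEOUT]
```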
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
Consistent Hashing vs. Range-Based Partitioning:
- Trade-off: Consistent hashing minimizes data movement during scaling but can lead to uneven load distribution without virtual nodes.
- Reason: Chose consistent hashing with virtual nodes to balance simplicity and even partition distribution.
Replication Factor of 3:
- Trade-off: Increased storage costs due to multiple replicas.
- Reason: Ensures high availability and fault tolerance in case of node failures.
In-Memory vs. Persistent Storage:
- Trade-off: In-memory storage offers low latency but risks data loss without persistence.
- Reason: Combined in-memory caching with a write-ahead log for durability and performance.
Eventual Consistency vs. Strong Consistency:
- Trade-off: Eventual consistency may lead to temporary inconsistencies in replicas.
- Reason: Prioritized performance and availability for non-critical operations while offering optional strong consistency.
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
Node Failures:
- Issue: Data stored on failed nodes becomes unavailable.
- Mitigation: Use replication to maintain multiple copies and redistribute data during recovery.
Hot Partitions:
- Issue: Uneven load distribution may overload certain nodes.
- Mitigation: Use consistent hashing with virtual nodes to balance the load across the cluster.
Replication Delays:
- Issue: Slow replication may lead to stale data in replicas.
- Mitigation: Implement asynchronous replication with versioning to track updates.
Network Partitions:
- Issue: Nodes in different partitions may serve inconsistent data.
- Mitigation: Use quorum-based reads and writes to ensure consistency.
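The quorum rule referenced here can be stated concretely: with N replicas, requiring W write acknowledgements and R read responses such that R + W > N guarantees that every read quorum overlaps the most recent write quorum. The N=3, W=2, R=2 configuration below is just an example matching the replication factor chosen earlier.

```python
# Quorum overlap check: R + W > N means read and write quorums always intersect.
def quorums_overlap(n: int, w: int, r: int) -> bool:
    return r + w > n

print(quorums_overlap(n=3, w=2, r=2))   # True  -> reads see the latest acknowledged write
print(quorums_overlap(n=3, w=1, r=1))   # False -> stale reads are possible
```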
Storage Limitations:
- Issue: Nodes running out of disk space can cause write failures.
- Mitigation: Monitor storage usage and rebalance data proactively.
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?
Dynamic Scaling:
- Improvement: Automate node addition/removal based on load.
- Mitigation: Use monitoring tools to trigger scaling actions and rebalance data dynamically.
Erasure Coding:
- Improvement: Reduce storage costs by replacing full replication with erasure coding.
- Mitigation: Use efficient reconstruction algorithms to minimize latency during reads.
Geo-Replication:
- Improvement: Deploy replicas across multiple regions for lower latency and disaster recovery.
- Mitigation: Use asynchronous cross-region replication with eventual consistency.
Advanced Conflict Resolution:
- Improvement: Introduce CRDTs (Conflict-Free Replicated Data Types) to handle conflicts gracefully in an eventually consistent system.
- Mitigation: Ensure that updates are commutative, associative, and idempotent.
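As an illustration of the CRDT approach, a grow-only counter (G-Counter) is one of the simplest conflict-free types; this is a generic textbook sketch under that assumption, not tied to any specific library or to the value types used elsewhere in this design.

```python
# G-Counter CRDT: each node increments its own slot; merge takes element-wise maxima.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}                      # node_id -> count contributed by that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Commutative, associative, and idempotent: replicas converge regardless of merge order.
        for node_id, count in other.counts.items():
            self.counts[node_id] = max(self.counts.get(node_id, 0), count)
```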
Improved Monitoring and Self-Healing:
- Improvement: Implement AI-based anomaly detection to predict failures.
- Mitigation: Trigger automated self-healing processes for node recovery or data rebalancing.