Design a Key Value Store with Score: 8/10
by alchemy1135
System requirements
Functional:
- Set Operation: The system should allow clients to set a key-value pair, where the key is unique and the value can be any data.
- Get Operation: Clients should be able to retrieve the value associated with a given key from the system.
- Delete Operation: There should be a way for clients to delete a key-value pair from the system.
- Conditional Updates: Clients should be able to perform updates on a key only if certain conditions are met (e.g., only update if the key exists).
Non-Functional:
- Data Replication: The system should replicate data across multiple nodes to ensure fault tolerance and high availability.
- Partitioning: Data should be partitioned to distribute it evenly across the cluster of nodes, allowing for efficient data access and scalability.
- Persistence Options: The system should provide options for persisting data to safeguard against data loss in case of failures.
- Consistency Guarantees: The system should ensure consistency in data access and updates, maintaining the integrity of the stored information.
- Durability Guarantees: Data should be durably stored to prevent data loss even in the face of failures.
Capacity estimation
- Storage Capacity Utilization:
- With 600 TB of total storage capacity, we need to consider both the data size and any overhead introduced by replication, indexes, and metadata.
- Assuming a replication factor of 3 for fault tolerance and high availability, the effective usable storage capacity reduces to approximately 200 TB (600 TB / 3).
- Additionally, we need to factor in any overhead introduced by indexing and metadata, typically ranging from 20% to 50% of the actual data size.
- Considering these factors, the effective storage capacity available for actual data would be roughly 133 TB to 167 TB (200 TB divided by an overhead factor of 1.2 to 1.5).
- Query Per Second (QPS) and Throughput:
- With a maximum QPS of 100,000, the system needs to be capable of handling this level of throughput efficiently.
- Each query operation (set, get, delete, etc.) would consume system resources such as CPU, memory, and network bandwidth.
- It's essential to ensure that the system architecture and hardware configuration can support the required throughput without degradation in performance.
- Key Value Size:
- Given that the maximum key-value size is 10 KB, it's essential to consider the distribution of key sizes and their impact on overall storage utilization.
- At an average size of 10 KB per key-value pair, the roughly 200 TB of post-replication storage estimated above can hold approximately 21.5 billion key-value pairs.
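The back-of-the-envelope arithmetic above can be sanity-checked with a short script (all figures are the assumed estimates from this section, not measurements):

```python
# Capacity estimation sketch using the assumptions from this section.
TOTAL_STORAGE_TB = 600
REPLICATION_FACTOR = 3
OVERHEAD_RANGE = (0.20, 0.50)   # index/metadata overhead as a fraction of data size
AVG_PAIR_SIZE_KB = 10

# Replication factor 3 leaves one logical copy's worth of capacity.
usable_tb = TOTAL_STORAGE_TB / REPLICATION_FACTOR                 # 200 TB

# Dividing out the overhead factor gives the range for actual data.
effective_tb = [usable_tb / (1 + o) for o in OVERHEAD_RANGE]      # ~167 .. ~133 TB

# How many 10 KB pairs fit into the 200 TB logical budget (TB -> KB is x1024^3).
pairs = usable_tb * 1024**3 // AVG_PAIR_SIZE_KB

print(f"usable per replica set: {usable_tb:.0f} TB")
print(f"effective data capacity: {effective_tb[1]:.0f}-{effective_tb[0]:.0f} TB")
print(f"max 10 KB pairs: {pairs:,}")                              # ~21.5 billion
```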
API design
- Set Operation API: This API allows clients to set a key-value pair in the system.
- Parameters:
- Key: The unique identifier for the data.
- Value: The data to be stored, associated with the specified key.
- Return Value:
- Status: Indicates whether the operation was successful or encountered an error.
- Get Operation API: This API allows clients to retrieve the value associated with a given key from the system.
- Parameters:
- Key: The unique identifier for the data.
- Return Value:
- Value: The data associated with the specified key.
- Status: Indicates whether the operation was successful or encountered an error.
- Delete Operation API: This API allows clients to delete a key-value pair from the system.
- Parameters:
- Key: The unique identifier for the data to be deleted.
- Return Value:
- Status: Indicates whether the operation was successful or encountered an error.
- Conditional Update Operation API: This API allows clients to perform updates on a key only if certain conditions are met.
- Parameters:
- Key: The unique identifier for the data to be updated.
- Value: The new data to be stored.
- Return Value:
- Status: Indicates whether the update operation was successful or the condition was not met.
- List Keys API: This API allows clients to retrieve a list of keys stored in the system.
- Parameters: None
- Return Value:
- Keys: A list of unique identifiers for the data stored in the system.
- Status: Indicates whether the operation was successful or encountered an error.
- Get Operation with Specified Version API: This API allows clients to retrieve the value associated with a given key from the system based on a specified version.
- Parameters:
- Key: The unique identifier for the data.
- Version: The version number or timestamp associated with the value that the client wants to retrieve.
- Return Value:
- Value: The data associated with the specified key and version.
- Status: Indicates whether the operation was successful or encountered an error.
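To make the API surface concrete, here is a minimal in-process sketch covering most of the operations above (class, method, and status names are illustrative; a real deployment would expose these over HTTP or gRPC, and versioned reads would need a version-aware storage layer):

```python
# Hypothetical in-process sketch of the key-value API described above.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Result:
    status: str                 # "OK", "NOT_FOUND", or "CONDITION_FAILED"
    value: Any = None

@dataclass
class KeyValueStore:
    _data: dict = field(default_factory=dict)

    def set(self, key: str, value: Any) -> Result:
        self._data[key] = value
        return Result("OK")

    def get(self, key: str) -> Result:
        if key not in self._data:
            return Result("NOT_FOUND")
        return Result("OK", self._data[key])

    def delete(self, key: str) -> Result:
        if key not in self._data:
            return Result("NOT_FOUND")
        del self._data[key]
        return Result("OK")

    def conditional_update(self, key: str, value: Any) -> Result:
        # Update only if the key already exists (the condition from the spec).
        if key not in self._data:
            return Result("CONDITION_FAILED")
        self._data[key] = value
        return Result("OK")

    def list_keys(self) -> Result:
        return Result("OK", list(self._data))
```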
Database design
Data Partitioning using Consistent Hashing:
Data partitioning is crucial for distributing data across multiple nodes in a distributed key-value store system to achieve scalability and efficient data access. Consistent hashing is a popular technique used for data partitioning in distributed systems.
- Consistent Hashing Overview:
- Consistent hashing is a hashing technique that minimizes the need for rehashing when the number of slots or nodes in the system changes.
- It provides a way to map keys to nodes in a distributed system in a consistent manner, ensuring that each key is assigned to a specific node regardless of changes in the system's topology.
- Consistent hashing achieves this by mapping both keys and nodes onto a common hash ring, where each node and key is associated with a point on the ring.
- Using Consistent Hashing for Data Partitioning:
- In a distributed key-value store system, consistent hashing can be used to partition data across a cluster of nodes.
- Each node in the cluster is assigned a range of hash values on the hash ring, forming a virtual "token" space.
- When a key needs to be stored or retrieved, its hash value is computed, and the corresponding node responsible for that hash range is determined.
- This ensures that each key is consistently mapped to the same node, allowing for efficient data access and distribution across the cluster.
- Advantages of Consistent Hashing:
- Load Balancing: Consistent hashing distributes data evenly across nodes, preventing hotspots and ensuring balanced load distribution.
- Scalability: As the cluster size changes (nodes added or removed), only a fraction of keys need to be remapped, minimizing the impact on the system.
- Fault Tolerance: In case of node failures or additions, consistent hashing allows the system to redistribute data efficiently without significant data movement.
- Considerations for Consistent Hashing:
- Replica Handling: Consistent hashing can be extended to handle data replication by assigning multiple replicas for each key across different nodes.
- Virtual Nodes: To improve load balancing and reduce data movement during node additions or failures, virtual nodes can be used, where each physical node is represented by multiple virtual nodes on the hash ring.
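A minimal sketch of a consistent-hash ring with virtual nodes might look like the following (the hash function, vnode count, and node names are arbitrary choices for illustration):

```python
# Sketch of consistent hashing with virtual nodes, as described above.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed on the ring `vnodes` times.
        self.ring = []                      # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First virtual node clockwise from the key's position (wrap at the end).
        i = bisect.bisect(self.ring, (self._hash(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # the same key always maps to the same node
```

Because keys and nodes share one hash space, adding a fourth node remaps only the slice of keys that now falls into its vnode ranges, rather than reshuffling everything.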
Data Replication Strategies:
Data replication ensures fault tolerance and high availability by storing multiple copies of data across different nodes in the system. Common data replication strategies include:
- Full Replication:
- Every piece of data is replicated across all nodes in the system.
- Provides strong fault tolerance but may lead to high storage overhead and network traffic.
- Partial Replication:
- Each piece of data is replicated only on a subset of nodes.
- Reduces storage overhead compared to full replication but may require careful placement of replicas to ensure fault tolerance.
- Quorum-based Replication:
- Data is replicated across a subset of nodes, and read/write operations require a quorum (a minimum number of replicas) to be successful.
- Provides a balance between fault tolerance, consistency, and performance by allowing tunable consistency levels.
For our design, we will go with quorum-based replication; it is explained in detail in the detailed component design section.
Consistency and Types of Consistency:
Consistency in a distributed key-value store system refers to the agreement of data across multiple replicas. Different consistency models offer varying levels of guarantees:
- Strong Consistency:
- All replicas return the same value for read operations, ensuring that clients always see the most up-to-date data.
- Achieved through synchronous replication and coordination mechanisms such as distributed transactions or strict quorums.
- Eventual Consistency:
- Replicas may temporarily diverge but eventually converge to a consistent state.
- Allows for higher availability and better performance but may lead to temporary inconsistencies visible to clients.
- Read Consistency Levels:
- Read operations can be tuned to provide different consistency guarantees, such as strong consistency, eventual consistency, or read-your-writes consistency, where a client always sees its own writes.
Strong consistency is usually achieved by forcing a replica not to accept new reads/writes until every replica has agreed on the current write. This approach is not ideal for highly available systems because it can block new operations. Dynamo and Cassandra adopt eventual consistency, which is the consistency model we recommend for our key-value store.
High-level design
- Client Interface:
- Provides an interface for clients to interact with the system. It accepts requests for key-value operations (e.g., set, get, delete) from clients and forwards them to the appropriate components.
- API Gateway:
- Acts as a single entry point for clients to access the system's APIs. It handles client requests, performs authentication and authorization, and routes requests to the corresponding services.
- Load Balancer:
- Distributes incoming client requests across multiple instances of the web server or API Gateway to ensure optimal resource utilization and scalability.
- Web Server:
- Hosts the APIs and serves as the interface between clients and the key-value service. It handles HTTP requests from clients and forwards them to the KeyValueService for processing.
- KeyValueService:
- The core component responsible for handling key-value operations. It manages data storage, retrieval, updates, replication, partitioning, and consistency guarantees.
- Database:
- Stores the actual data in the key-value store system. It may consist of one or more data storage technologies such as distributed databases, in-memory databases, or disk-based storage systems.
- Replication Manager:
- Manages data replication across multiple nodes in the system to ensure fault tolerance and high availability. It handles replication synchronization, conflict resolution, and consistency maintenance.
- Partitioning Manager:
- Responsible for partitioning data across nodes in the cluster. It ensures even distribution of data to facilitate efficient data access, scalability, and load balancing.
- Monitoring Agent:
- Monitors the health, performance, and status of various components in the system. It collects metrics, logs, and events to provide visibility into the system's operation and detect any anomalies or issues.
Request flows
The sequence diagram below shows how a user performs the basic operations.
Detailed component design
Data Storage Choices: Redis vs. Cassandra
When choosing between technologies like Redis and Cassandra for data storage in a distributed key-value store system, several factors need to be considered to make an informed decision. Here are some selection criteria and details on each:
- Data Model and Use Case:
- Redis: Redis is an in-memory data store that supports a wide range of data structures, including strings, hashes, lists, sets, and sorted sets. It is well-suited for use cases requiring fast read and write operations on relatively small datasets, such as caching, session management, real-time analytics, and pub/sub messaging.
- Cassandra: Cassandra is a distributed database that offers a highly scalable and fault-tolerant architecture optimized for handling large volumes of data across multiple nodes. It is suitable for use cases requiring high availability, linear scalability, and eventual consistency, such as time-series data storage, logging, messaging systems, and user profile management.
- Consistency Requirements:
- Redis: Redis typically offers strong consistency guarantees within a single node or data center, making it suitable for applications requiring immediate and predictable data consistency.
- Cassandra: Cassandra offers tunable consistency levels, allowing developers to choose between strong consistency, eventual consistency, or anything in between based on application requirements. It provides flexible consistency models that can adapt to different use cases, including high availability scenarios where eventual consistency is acceptable.
- Scalability and Performance:
- Cassandra is designed for linear scalability, allowing organizations to easily add new nodes to the cluster as data volumes and traffic increase, so it is a better choice here.
- Durability and Persistence:
- Cassandra provides built-in durability and fault tolerance by replicating data across multiple nodes in the cluster. It offers tunable durability options, including configurable replication factors, consistency levels, and durability settings, ensuring that data remains available and durable even in the face of node failures or data center outages.
Inconsistency Resolution and Concurrent Updates for Key-Values: Versioning
Replication gives high availability but causes inconsistencies among replicas. Versioning and vector clocks are used to solve inconsistency problems. Versioning means treating each data modification as a new immutable version of data. Before we talk about versioning, let us use an example to explain how inconsistency happens:
Let's assume there are two replica nodes, n1 and n2, that hold the same value; call it the original value. Server 1 and server 2 get the same value for the get("name") operation.
Next, server 1 changes the name to "johnSanFrancisco", and server 2 changes the name to "johnNewYork", as shown in Figure 8. These two changes are performed simultaneously. Now we have two conflicting values, called versions v1 and v2.
In this example, the original value could be ignored because the modifications were based on it. However, there is no clear way to resolve the conflict of the last two versions. To resolve this issue, we need a versioning system that can detect conflicts and reconcile conflicts. A vector clock is a common technique to solve this problem.
Let us examine how vector clocks work.
A vector clock is a list of [server, version] pairs associated with a data item. It can be used to check if one version precedes, succeeds, or is in conflict with others.
Assume a vector clock is represented by D([S1, v1], [S2, v2], …, [Sn, vn]), where D is a data item, vi is a version counter, and Si is a server number. If data item D is written to server Si, the system must perform one of the following tasks:
- Increment vi if [Si, vi] exists.
- Otherwise, create a new entry [Si, 1].
The above abstract logic is illustrated with a concrete example.
- A client writes a data item D1 to the system, and the write is handled by server Sx, which now has the vector clock D1[(Sx, 1)].
- Another client reads the latest D1, updates it to D2, and writes it back. D2 descends from D1 so it overwrites D1. Assume the write is handled by the same server Sx, which now has vector clock D2([Sx, 2]).
- Another client reads the latest D2, updates it to D3, and writes it back. Assume the write is handled by server Sy, which now has vector clock D3([Sx, 2], [Sy, 1]).
- Another client reads the latest D2, updates it to D4, and writes it back. Assume the write is handled by server Sz, which now has D4([Sx, 2], [Sz, 1]).
- When another client reads D3 and D4, it discovers a conflict, which is caused by data item D2 being modified by both Sy and Sz. The conflict is resolved by the client and updated data is sent to the server. Assume the write is handled by Sx, which now has D5([Sx, 3], [Sy, 1], [Sz, 1]). We will explain how to detect conflict shortly.
Vector Clocks
Using vector clocks, it is easy to tell that a version X is an ancestor (i.e., no conflict) of version Y if the version counters for each participant in the vector clock of Y are greater than or equal to the ones in version X. For example, the vector clock D([s0, 1], [s1, 1]) is an ancestor of D([s0, 1], [s1, 2]). Therefore, no conflict is recorded.
Similarly, you can tell that a version X is a sibling (i.e., a conflict exists) of Y if there is any participant in Y's vector clock who has a counter that is less than its corresponding counter in X. For example, the following two vector clocks indicate there is a conflict: D([s0, 1], [s1, 2]) and D([s0, 2], [s1, 1]).
Even though vector clocks can resolve conflicts, there are two notable downsides.
- Vector clocks add complexity to the client because it needs to implement conflict resolution logic.
- The [server: version] pairs in the vector clock could grow rapidly. To fix this problem, we set a threshold on the length; if it is exceeded, the oldest pairs are removed.
This can lead to inefficiencies in reconciliation because the descendant relationship cannot be determined accurately. However, based on Dynamo paper, Amazon has not yet encountered this problem in production; therefore, it is probably an acceptable solution for most companies.
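The write rule and conflict test described above can be sketched in a few lines (representing a vector clock as a plain {server: counter} dict is one convenient encoding, not the only one):

```python
# Vector clocks as described above: a {server: counter} map per data item.

def write(clock: dict, server: str) -> dict:
    """Apply the write rule: increment vi if [Si, vi] exists, else create [Si, 1]."""
    new = dict(clock)
    new[server] = new.get(server, 0) + 1
    return new

def descends(y: dict, x: dict) -> bool:
    """Y descends from X (no conflict) iff every counter in X is <= its counter in Y."""
    return all(y.get(s, 0) >= v for s, v in x.items())

def conflict(x: dict, y: dict) -> bool:
    """X and Y are siblings if neither descends from the other."""
    return not descends(x, y) and not descends(y, x)

# Replaying the D1..D4 example from this section:
d1 = write({}, "Sx")     # D1[(Sx, 1)]
d2 = write(d1, "Sx")     # D2([Sx, 2]) -- descends from D1, so it overwrites it
d3 = write(d2, "Sy")     # D3([Sx, 2], [Sy, 1])
d4 = write(d2, "Sz")     # D4([Sx, 2], [Sz, 1])
assert descends(d2, d1) and not conflict(d2, d1)
assert conflict(d3, d4)  # both branched off D2: a conflict the client must resolve
```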
Quorum-based replication
Quorum-based replication is a data replication strategy commonly used in distributed systems to balance consistency, availability, and partition tolerance. In quorum-based replication, read and write operations require acknowledgment from a subset of replicas known as a quorum. By adjusting the quorum size, system designers can control the trade-off between consistency and availability.
Example of Quorum-Based Replication:
Consider a distributed key-value store system with a total of 5 replicas (nodes) replicating data across the cluster. In this example, we'll explore how quorum-based replication works for read and write operations.
- Write Operation:
- When a client initiates a write operation (e.g., set key-value pair), the system requires acknowledgment from a majority of replicas to consider the operation successful.
- Let's assume the quorum size is set to 3, meaning at least 3 replicas must acknowledge the write operation for it to succeed.
- The client sends the write request to the replicas and waits for acknowledgments.
- Once acknowledgments arrive from 3 replicas (a majority), the system considers the write operation successful and returns a confirmation to the client.
- Read Operation:
- For read operations (e.g., get value for a key), the system can tune the consistency level by adjusting the quorum size.
- If strong consistency is desired, the quorum size may be set to a majority (e.g., 3 out of 5 replicas).
- The client sends a read request to the replicas and waits for responses from the quorum.
- Once responses arrive from a majority of replicas, the system returns the value to the client.
- Because the read quorum overlaps with the write quorum, the client is guaranteed to receive the most up-to-date acknowledged value.
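The read/write flow above rests on a simple arithmetic condition: with N replicas, write quorum W, and read quorum R, every read quorum intersects every write quorum whenever R + W > N. For the example's N = 5 and W = R = 3:

```python
# Quorum arithmetic for the example above (N = 5 replicas, W = R = 3).
N, W, R = 5, 3, 3

# Any 3 writers and any 3 readers out of 5 must share at least one replica,
# so a read always sees the latest acknowledged write.
assert R + W > N

# A majority write quorum also prevents two concurrent conflicting writes
# from both being acknowledged.
assert W > N / 2

# Tuning toward availability instead: W = R = 1 gives fast, eventually
# consistent operations because the quorums may no longer overlap.
assert not (1 + 1 > N)
```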
Benefits of Quorum-Based Replication:
- Tunable Consistency Levels:
- Quorum-based replication allows system administrators to adjust the quorum size to achieve the desired consistency level.
- By selecting appropriate quorum sizes, systems can balance between strong consistency and availability according to application requirements.
- Fault Tolerance:
- Quorum-based replication provides fault tolerance by ensuring that a subset of replicas can continue to function even if some replicas are unavailable or failed.
- As long as a quorum of replicas is reachable, the system can continue to serve read and write requests.
- Scalability:
- Quorum-based replication scales well with the size of the cluster since the quorum size can be adjusted accordingly.
- Adding more replicas to the cluster allows for increased fault tolerance and scalability without sacrificing consistency.
Overall, quorum-based replication offers a flexible and robust approach to data replication in distributed key-value store systems, allowing system designers to tailor consistency levels to meet the specific needs of their applications.
Data center outage and Disaster-Recovery
Handling data center outages is crucial for ensuring the availability and reliability of a distributed key-value store system, especially in scenarios such as power outages, network failures, natural disasters, or other unforeseen events. Here are some strategies and considerations for building a system capable of handling data center outages:
- Replication Across Multiple Data Centers:
- Replicating data across multiple geographically distributed data centers is essential for ensuring fault tolerance and high availability.
- Each data center serves as a replica of the data, allowing users to access data even if one or more data centers are offline.
- Cross-Data Center Replication (XDCR):
- Implementing cross-data center replication mechanisms ensures that changes made to data in one data center are asynchronously replicated to other data centers.
- XDCR provides redundancy and ensures that data remains consistent across all data centers in the event of an outage.
- Quorum-Based Replication:
- Utilize quorum-based replication strategies to handle read and write operations across multiple data centers.
- By requiring acknowledgment from a quorum of replicas distributed across different data centers, the system can ensure consistency and availability even during outages.
- Load Balancing and Failover Mechanisms:
- Implement load balancers and failover mechanisms to automatically route client requests to available data centers in the event of an outage.
- Load balancers monitor the health and status of data centers and distribute traffic to healthy and operational data centers to minimize service disruptions.
- Disaster Recovery Planning:
- Develop and regularly update disaster recovery plans to ensure quick and efficient recovery from data center outages.
- This includes procedures for data backup and restoration, failover mechanisms, communication protocols, and coordination with third-party service providers.
Load Balancing Strategies
Load balancing is a critical component in distributed systems to evenly distribute incoming requests across multiple servers or nodes, ensuring optimal resource utilization, scalability, and fault tolerance. There are several load balancing strategies available, each with its own advantages, disadvantages, and use cases. Let's discuss some common load balancing strategies:
- Round Robin:
- Description: In a round-robin strategy, incoming requests are distributed sequentially in a circular order among the available servers. Each server receives an equal share of requests.
- Advantages: Simple and easy to implement.
- Fairly distributes the load among servers.
- Disadvantages: Doesn't consider server load or capacity, leading to potential uneven distribution of workload.
- May not be suitable for scenarios where servers have different capabilities or performance characteristics.
- Least Connections:
- Description: The least connections strategy directs incoming requests to the server with the fewest active connections at the time the request is received. This ensures that the load is distributed based on the current server load.
- Advantages: Helps in achieving better load distribution and prevents overloading of individual servers.
- Suitable for scenarios where server capacity varies dynamically.
- Disadvantages: Requires monitoring of server connections, which may introduce overhead.
- May not be effective in scenarios where connection durations vary significantly.
- Weighted Load Balancing:
- Description: Weighted load balancing assigns a weight or priority to each server based on its capacity, performance, or other factors. Servers with higher weights receive more incoming requests than those with lower weights.
- Advantages: Allows administrators to allocate resources based on server capabilities and requirements.
- Provides flexibility in managing server loads and priorities.
- Disadvantages: Requires manual configuration and tuning of weights, which may be time-consuming and error-prone.
- May not adapt well to dynamic changes in server loads or capacities.
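The three strategies above can each be sketched in a few lines (server names, connection counts, and weights are made-up illustration data):

```python
# Illustrative sketches of the three load-balancing strategies above.
import itertools
import random

servers = ["s1", "s2", "s3"]

# Round robin: hand out servers in a fixed circular order.
rr = itertools.cycle(servers)
def next_round_robin():
    return next(rr)

# Least connections: pick the server with the fewest active connections.
active = {"s1": 12, "s2": 3, "s3": 7}   # hypothetical live connection counts
def next_least_connections():
    return min(active, key=active.get)

# Weighted: s1 is assumed twice as capable, so it gets half the traffic.
weights = {"s1": 2, "s2": 1, "s3": 1}
def next_weighted():
    return random.choices(list(weights), weights=list(weights.values()))[0]
```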
Trade offs/Tech choices
Here are a few trade-offs we have to make with each tech choice.
- Data Storage Technology:
- Utilizing an in-memory database like Redis provides extremely fast read/write operations but may have limitations in terms of data size and durability, making it suitable for use cases requiring low-latency access to relatively small datasets.
- Opting for a distributed database such as Cassandra offers scalability, fault tolerance, and consistency features out-of-the-box but may come with a steeper learning curve and higher operational complexity, particularly in managing data consistency and replication settings.
- Replication Mechanisms:
- Implementing synchronous replication ensures strong consistency but may introduce higher latency and lower availability due to the need for acknowledgment from all replicas before completing write operations, making it suitable for applications prioritizing data consistency over performance.
- Employing asynchronous replication offers higher availability and lower latency but may lead to eventual consistency and the possibility of data divergence between replicas, making it preferable for scenarios where low-latency access and high availability are paramount.
- Load Balancing Strategies:
- Implementing round-robin load balancing ensures fair distribution of incoming requests among servers but doesn't consider server load or capacity, potentially leading to uneven workload distribution, making it suitable for environments with homogeneous server capabilities.
- Employing least connections load balancing directs requests to the server with the fewest active connections, ensuring better load distribution and preventing overloading of individual servers, but may require continuous monitoring of server connections, introducing additional overhead, making it preferable for environments with heterogeneous server capacities or dynamically changing workloads.
Future improvements
Below are three improvements we could make to our design.
- Enhanced Consistency Models: Introducing support for more sophisticated consistency models, such as causal consistency or session consistency, would allow for finer-grained control over consistency guarantees, enabling applications to balance between strong consistency and performance more effectively.
- Optimized Replication Strategies: Developing advanced replication strategies that dynamically adjust replication factors based on workload patterns, network conditions, and data access patterns could improve resource utilization and scalability while maintaining consistency and availability in diverse deployment environments.
- Intelligent Load Balancing: Implementing intelligent load balancing algorithms powered by machine learning or artificial intelligence techniques could enable the system to dynamically adapt to changing traffic patterns, optimize resource allocation, and mitigate the impact of sudden spikes or fluctuations in workload, improving overall performance and user experience.