My Solution for Design a Key Value Store

by nectar4678

System requirements

Functional Requirements:

  1. Assign a value to a specified key, updating it if already present.
  2. Retrieve the value associated with a given key.
  3. Delete the value associated with a key, effectively removing the key from the system.


Non-Functional:

  1. Scalability: The system must effectively manage surges in demand by seamlessly incorporating additional instances.
  2. Consistency: Every request to the system must yield a reliable and accurate response consistently.
  3. Durability: The system must ensure data integrity even during network partition incidents, preventing data loss.
  4. Availability: The system should be resilient against single points of failure. Because the CAP theorem forces a choice between consistency and availability during network partitions, this trade-off is revisited explicitly in the design below.


Capacity estimation

Key considerations include the volume of queries per second (QPS) and the amount of data the system needs to manage. For the purposes of our estimation:

  1. Total Storage Capacity: 100 terabytes
  2. Total Queries Per Second (QPS): 100,000
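
As a rough back-of-the-envelope sketch of what these totals imply, the per-node storage, per-node QPS, and replication factor below are illustrative assumptions rather than part of the original estimate:

import math

# Given estimates from above
total_storage_tb = 100            # total data to store
total_qps = 100_000               # total queries per second

# Assumed per-node figures (illustrative only; tune to real hardware)
storage_per_node_tb = 10          # usable storage per storage node
qps_per_node = 10_000             # queries a single node can serve
replication_factor = 3            # copies kept for durability (see replication section)

nodes_for_storage = math.ceil(total_storage_tb * replication_factor / storage_per_node_tb)
nodes_for_qps = math.ceil(total_qps / qps_per_node)

# The cluster must satisfy both constraints at once
print(max(nodes_for_storage, nodes_for_qps))   # 30 nodes under these assumptions

Under these assumptions the cluster size is driven by storage rather than by QPS; with different hardware the bottleneck could easily flip.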


API design

Like a conventional hash table, a key-value store offers two main operations: retrieval (get) and insertion (put). Now, let's delve into the design of the API.


GET

We retrieve the corresponding value based on the provided key parameter. When data is replicated, the system resolves which replica holds the object for a particular key; this routing remains concealed from the end user. If the store is configured with a less stringent data consistency model, such as eventual consistency, a read may even return multiple conflicting values for a single key.

get(key)


PUT

The API call for inserting a value into the system should be

put(key, value)

It stores the value associated with the key, with the system autonomously deciding where the data is placed. The system also frequently maintains metadata about the stored object, such as the object's version.

In a key-value store, the key typically serves as a primary identifier, whereas the value can encompass any arbitrary binary data.
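
A minimal sketch of this API surface in Python follows; the class names and the optional version field are illustrative assumptions, not the interface of any particular product:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class VersionedValue:
    value: bytes        # the value is arbitrary binary data
    version: int        # example of metadata the store may keep per object


class KeyValueStore(ABC):
    @abstractmethod
    def get(self, key: str) -> Optional[VersionedValue]:
        """Return the value (and metadata) for key, or None if absent."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None:
        """Store value under key; the system decides where to place it."""

    @abstractmethod
    def delete(self, key: str) -> None:
        """Remove key from the store (functional requirement 3)."""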

Now that we've defined the operations on the cache, our attention shifts to the implementation specifics of these functionalities.


A Simple in-memory Cache

Designing a key-value store residing on a single server, fully in memory, presents a straightforward and uncomplicated approach. This setup typically involves a single OS process, equipped with an in-memory dictionary to house all data associations. The primary advantages of such a system are its simplicity in design and its inherent speed, making it an appealing solution at first glance.
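
As a rough illustration, a minimal single-process version of such a store, assuming a plain Python dictionary guarded by a lock for concurrent request handlers, might look like this:

import threading
from typing import Optional


class InMemoryKeyValueStore:
    """Single-server, fully in-memory store backed by a dictionary."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}
        self._lock = threading.Lock()    # serialize access from concurrent handlers

    def put(self, key: str, value: bytes) -> None:
        with self._lock:
            self._data[key] = value      # insert or update

    def get(self, key: str) -> Optional[bytes]:
        with self._lock:
            return self._data.get(key)   # None if the key is absent

    def delete(self, key: str) -> None:
        with self._lock:
            self._data.pop(key, None)    # deleting a missing key is a no-op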

Expanding on the benefits of a single-machine cache design:

  1. It offers ease of maintenance and management: with all data stored within a single server's memory, administrative tasks such as monitoring, troubleshooting, and updates become more streamlined.
  2. The reduced complexity simplifies deployment and minimizes potential points of failure, enhancing overall system stability.

However, despite these merits, relying solely on this architecture falls short of meeting the non-functional requirements outlined. Notably, it lacks reliability, scalability, and high availability, which are crucial for robust and resilient systems.


Issue with this approach

To understand the shortcomings of this approach, we need to understand linearizability and the CAP theorem.

Linearizability means that modifications appear to take effect instantaneously: once a value is written to a register, any subsequent read operation will return that same value for as long as the register does not undergo another modification.


So, if the requirements call for a linearizable system, then when replicas are disconnected from the other replicas owing to a network fault, some replicas cannot execute requests while detached: they must either wait until the network problem is resolved or return an error, and either way they become unavailable (prioritizing consistency).

In this case, we don't have a single register or a single source of truth. If we employ asynchronous database replication, with a primary node that handles both reads and writes and follower nodes that handle only reads, a follower can serve stale data until replication catches up. This is the fundamental problem when designing a distributed cache system.


CAP Theorem

In distributed systems, network partitions are inevitable, creating a trade-off between consistency and availability. Modern systems must decide whether to prioritize availability (AP) or consistency (CP) in the face of network partitions. This choice becomes crucial as distributed systems inherently face network partitioning.

Learn more about the CAP theorem and its limitations here: https://ibvishal.medium.com/why-cap-theorem-is-not-enough-2771e6d9f949


Distributed Key-value store

To address one of the fundamental design requirements, scalability, our system employs storage nodes to manage key-value data. As demand fluctuates, the need may arise to adjust the number of storage nodes, necessitating the partitioning of data across these nodes to ensure even distribution of the workload.



For instance, let's envision a scenario with four nodes, where we aim to evenly distribute the load by directing 25% of the requests to each node. The conventional approach to achieve this involves utilizing the modulus operator. Upon the arrival of a request, associated with a specific key, we compute the hash of that key. Subsequently, by taking the modulus of the hashed value with the total number of nodes (denoted as m), we obtain a remainder value, x. This remainder value corresponds to the node number to which we route the request for processing.

Our aim is to ensure scalability while minimizing disruption during node addition or removal. However, this approach proves inefficient when nodes are added or removed, because a large number of keys must be transferred. For example, a key whose hash is 15 maps to node 3 when there are four nodes (15 mod 4 = 3), but if one node is removed, leaving three, the same key now maps to node 0 (15 mod 3 = 0). Since each node maintains its own local cache of keys and values, the data must be moved to the node that is now responsible for it, and with most keys changing owners this migration is resource-intensive and leads to heightened latency.
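
The following sketch quantifies this (the hash function and key names are arbitrary choices): it counts how many keys change owners when the cluster shrinks from four nodes to three under the modulus scheme.

import hashlib


def node_for(key: str, num_nodes: int) -> int:
    # Stable hash so the mapping is reproducible across processes
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes


keys = [f"user:{i}" for i in range(10_000)]

before = {k: node_for(k, 4) for k in keys}   # four nodes
after = {k: node_for(k, 3) for k in keys}    # one node removed

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed owner")   # roughly 75%, far above the ideal ~25%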


Consistent Hashing

Consistent hashing proves to be a highly efficient method for distributing the workload among a cluster of nodes. In this approach, we envision a virtual ring of hash values ranging from 0 to (n-1), where 'n' represents the total number of available hash values. Each node's unique identifier undergoes hashing, and its resultant hash value is mapped onto this ring. Similarly, requests follow the same process. Each request is fulfilled by the next node encountered while traversing the ring in a clockwise direction.

The addition of a new node to the ring impacts only the immediate next node. This node must redistribute its data to accommodate the newly added node, while the other nodes remain unaffected. This scalability feature is facilitated by minimizing changes to the existing nodes, as only a small fraction of keys need to be relocated. Furthermore, since the hash values are uniformly distributed, we anticipate that the request load will be evenly distributed across the ring on average.
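
A minimal sketch of such a ring, assuming MD5 as an illustrative hash function and Python's bisect module for the sorted ring positions:

import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes: list[str]) -> None:
        self._ring: list[tuple[int, str]] = []    # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (self._hash(node), node))

    def remove_node(self, node: str) -> None:
        self._ring.remove((self._hash(node), node))

    def node_for(self, key: str) -> str:
        # Walk clockwise: the first node at or after the key's position, wrapping around
        position = self._hash(key)
        index = bisect.bisect_left(self._ring, (position, ""))
        return self._ring[index % len(self._ring)][1]


ring = ConsistentHashRing(["N1", "N2", "N3", "N4"])
print(ring.node_for("user:42"))    # adding or removing a node only remaps nearby keys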

One of the primary benefits of consistent hashing is its capacity to minimize the necessity for key relocation when nodes are added or removed from the system. However, in real-world scenarios, the distribution of requests may not be uniform. This can lead to certain servers handling a disproportionate amount of data, resulting in what are known as hotspots.

For instance, a significant portion of requests may be directed between nodes N4 and N1. Consequently, Node N1 could bear the brunt of handling a majority of the requests, creating a hotspot. This non-uniform distribution of the workload increases the strain on Node N1, potentially impacting overall system performance.


Data replication

Primary-secondary approach

In a primary-secondary setup, one storage node acts as the primary, while the others serve as secondaries that replicate data from the primary. The primary handles write requests, while the secondaries handle read requests. Because of replication lag, reads from a secondary may briefly return stale data after a write. Additionally, if the primary fails, write operations are not possible, making it a single point of failure.
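
A minimal routing sketch for this structure (class and method names are illustrative; the store objects are assumed to expose put and get, like the in-memory store shown earlier):

import random


class PrimarySecondaryRouter:
    """Send writes to the primary and spread reads across secondaries."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = secondaries

    def write(self, key, value):
        # All writes go to the single primary; if it is down, writes fail
        self.primary.put(key, value)

    def read(self, key):
        # Reads are served by secondaries; replication lag can make them stale
        replica = random.choice(self.secondaries)
        return replica.get(key)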


Peer-to-peer approach

In a peer-to-peer approach, all participating storage nodes act as primaries, continuously replicating data to maintain synchronization. Both read and write operations are permitted on all nodes. However, replicating data across all available nodes (n) is often impractical and resource-intensive. Instead, it's common to opt for a smaller number of replicas, typically three or five nodes, to balance efficiency and redundancy.
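
One common placement strategy, borrowed from Dynamo-style designs rather than prescribed here, is to walk the consistent-hash ring clockwise from the key's position and take the next few distinct nodes as the replica set; a self-contained sketch:

import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def replicas_for(key: str, nodes: list[str], n: int = 3) -> list[str]:
    # Place each node on the ring, then walk clockwise from the key's position,
    # collecting the next n distinct nodes as the replica set.
    ring = sorted((_hash(node), node) for node in nodes)
    start = bisect.bisect_left(ring, (_hash(key), ""))
    replicas: list[str] = []
    i = start
    while len(replicas) < min(n, len(ring)):
        node = ring[i % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas


print(replicas_for("user:42", ["N1", "N2", "N3", "N4", "N5"]))   # three of the five nodes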

In the framework of the CAP theorem, key-value stores face a trade-off between consistency and availability during network partitions. For key-value stores, prioritizing availability over consistency is often preferred. This means that in the event of network partitioning where two storage nodes lose connection for replication, they continue to handle incoming requests independently. Upon re-establishing the connection, the nodes synchronize their data. However, during the disconnected phase, it's probable that the nodes may become inconsistent. Therefore, mechanisms are required to resolve conflicts arising from such inconsistencies. In subsequent lessons, we will explore concepts such as data versioning to address these issues.


How to handle temporary failures

In distributed systems, a quorum-based approach is commonly employed to handle failures. A quorum represents the minimum number of votes required for a distributed transaction to proceed with an operation. When a server, essential for achieving consensus, is unavailable, it impedes the system's availability and durability.
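
One common way to make this concrete is the N/W/R formulation (an assumption borrowed from quorum-replication literature, not defined above): with N replicas, a write waits for W acknowledgements and a read consults R replicas, and every read is guaranteed to see the latest write when W + R > N.

def is_strict_quorum(n: int, w: int, r: int) -> bool:
    """True when every read set overlaps every write set (W + R > N)."""
    return w + r > n


# With 3 replicas, 2 write acknowledgements and 2 read responses always overlap
print(is_strict_quorum(n=3, w=2, r=2))   # True
# A single acknowledgement in each direction does not guarantee overlap
print(is_strict_quorum(n=3, w=1, r=1))   # False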

Instead of adhering strictly to quorum membership, we can opt for a sloppy quorum: when some of the designated replicas are unreachable, the first healthy nodes that can be reached temporarily accept reads and writes in their place, so a brief outage does not block the system. Typically, a leader oversees communication among the consensus participants: participants acknowledge successful writes, and upon receiving these acknowledgments, the leader responds to the client. However, participants are vulnerable to network outages. If the leader experiences a temporary outage and participants cannot reach it, they declare the leader inactive, necessitating the election of a new leader. Frequent leader elections can hurt performance, as the system spends more time selecting a leader than executing actual work.


How to handle permanent failures

In the face of permanent node failures, maintaining synchronized replicas becomes imperative for enhancing system durability. To expedite the detection of inconsistencies between replicas and minimize data transfer, we employ Merkle trees.

In a Merkle tree, individual key values are hashed and serve as the leaves of the tree. Parent nodes higher up the tree contain hashes of their respective children. Each branch of the Merkle tree can be independently verified without necessitating the download of the complete tree or dataset. During the process of cross-checking for inconsistencies among copies, Merkle trees significantly reduce the volume of exchanged data.
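
A minimal sketch of how two replicas might compare their data with such a tree (the hash choice, leaf ordering, and padding of odd levels are illustrative):

import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    # Fold leaf hashes pairwise until a single root hash remains
    level = [_h(leaf) for leaf in leaves] or [_h(b"")]
    while len(level) > 1:
        if len(level) % 2:                # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Two replicas hash their key-value pairs in the same (sorted) order
replica_a = {"k1": b"v1", "k2": b"v2"}
replica_b = {"k1": b"v1", "k2": b"stale"}

root_a = merkle_root([k.encode() + v for k, v in sorted(replica_a.items())])
root_b = merkle_root([k.encode() + v for k, v in sorted(replica_b.items())])

# Differing roots mean the replicas diverge; only the differing subtree must be synced
print(root_a == root_b)   # False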


Conclusion

A key-value store offers versatility and enables scalability for applications dealing with unstructured data. Web applications, for instance, can leverage key-value stores to manage user session information and preferences efficiently. By using a unique user key, all associated data becomes readily accessible. Key-value stores excel in facilitating rapid read and write operations, making them ideal for applications requiring quick data access.

Furthermore, key-value stores can power real-time recommendation engines and advertising platforms due to their ability to swiftly retrieve and present up-to-date recommendations.