My Solution for Design a Key Value Store with Score: 8/10
by vortex5266
System requirements
Functional:
- Read and write: the system should support reading and writing key-value pairs.
Non-Functional:
- Low latency: high performance, fast access to data
- Availability: the store should be highly available with near-zero downtime
- Durability: no data loss
Capacity estimation
Let's say this key value store is used by a tinyUrl service.
QPS:
100 million DAU -> each user creates one tinyUrl per day -> write: 100M / 86,400 ≈ 1,200 QPS
Each user reads 10 tinyUrls per day -> read: 100M * 10 / 86,400 ≈ 12,000 QPS
Storage:
100M per day * 30 days * 12 months = 36 billion mappings per year
(8 bytes for id + 100 bytes for longurl) * 36B ≈ 4TB per year
If each server node can provide 100GB of memory cache, we need about 40 primary servers for one year of data.
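A quick back-of-the-envelope check of these numbers in Go; the DAU, record size, and per-node cache size are just the assumptions stated above:

```go
package main

import "fmt"

func main() {
	const (
		dau           = 100_000_000 // 100M daily active users (assumption above)
		readsPerUser  = 10
		secondsPerDay = 86_400
		bytesPerKey   = 8   // id
		bytesPerValue = 100 // long URL
		nodeCacheGB   = 100 // memory cache per node (assumption above)
	)

	writeQPS := dau / secondsPerDay
	readQPS := dau * readsPerUser / secondsPerDay
	mappingsPerYear := int64(dau) * 30 * 12
	storageTB := float64(mappingsPerYear*(bytesPerKey+bytesPerValue)) / 1e12
	nodes := storageTB * 1000 / nodeCacheGB

	fmt.Printf("write QPS ~%d, read QPS ~%d\n", writeQPS, readQPS)
	fmt.Printf("mappings/year ~%d, storage ~%.1f TB, nodes ~%.0f\n",
		mappingsPerYear, storageTB, nodes) // ~36B, ~3.9 TB, ~39 nodes
}
```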
High-level design
flowchart TB
subgraph server
direction TB
p1[P1] --> s1[S1.1]
p1[P1] --> s2[S1.2]
p2[P2] --> s3[S2.1]
p2[P2] --> s4[S2.2]
p3[P3] --> s5[S3.1]
p3[P3] --> s6[S3.2]
end
client <--> server
Client: the client library is imported by the caller and provides the get/set API. It is cluster-aware: when it is initialized, it is populated with the store's cluster topology, which we will cover later.
Cluster: the cluster consists of primary nodes and their secondary nodes; data is partitioned across the primaries and replicated to the secondaries.
Data Partition by Consistent Hashing
flowchart LR
vn1((A-01)):::green --> vn2((B-01)):::red
vn2((B-01)):::red --> vn3((C-01)):::blue
vn3((C-01)) --> vn4((A-02)):::green
vn4((A-02)):::green --> vn5((B-02)):::red
vn5((B-02)):::red --> vn6((C-02)):::blue
vn6((C-02)) --> vn7((A-03)):::green
vn7((A-03)):::green --> vn8((B-03)):::red
vn8((B-03)):::red --> vn9((C-03)):::blue
vn9((C-03)) --> vn1((A-01)):::green
classDef red stroke:#f00
classDef green stroke:#0f0
classDef blue stroke:#00f
We want to distribute data across different nodes to improve performance and scalability. The straightforward approach picks the node for a key-value pair with hash(key) % n, where n is the total number of nodes. This is severely inflexible when nodes are added or removed: the formula becomes hash(key) % (n + 1) or hash(key) % (n - 1), almost every key maps to a different node, and nearly the whole data set has to be migrated, which is a significant overhead.
We introduce consistent hashing to improve this.
- The hashing ring is divided into a fixed number of equal slots, say 2^10 = 1024.
- To avoid non-uniform data and load distribution, we place virtual nodes evenly on the ring. In the graph above there are three primary nodes, each made up of three virtual nodes; in the real world one primary node consists of hundreds of virtual nodes so the granularity is very small.
- The mod function becomes mod_result = hash(key) % 1024. After the key is hashed and modded, it falls onto one point of the ring; we move clockwise until we meet the next virtual node. The primary node that this virtual node belongs to is the server that stores the key-value pair.
We can then build a metadata map from server id to slot ranges. For example:
{
NodeA: ranges 0-99, 300-399, 600-699, 900-999
NodeB: ranges 100-199, 400-499, 700-799, 1000-1023
NodeC: ranges 200-299, 500-599, 800-899
}
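Here is a minimal Go sketch of the ring lookup, assuming 1024 slots; the way virtual nodes are hashed onto the ring (md5 of "name-i") is illustrative, not prescribed by the design above:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
	"sort"
)

const ringSlots = 1024

// vnode is a virtual node: one position on the ring owned by a primary node.
type vnode struct {
	slot int
	node string
}

type ring struct {
	vnodes []vnode // kept sorted by slot
}

// addNode places `count` virtual nodes for `node` on the ring.
func (r *ring) addNode(node string, count int) {
	for i := 0; i < count; i++ {
		h := md5.Sum([]byte(fmt.Sprintf("%s-%d", node, i)))
		slot := int(binary.BigEndian.Uint64(h[:8]) % ringSlots)
		r.vnodes = append(r.vnodes, vnode{slot: slot, node: node})
	}
	sort.Slice(r.vnodes, func(i, j int) bool { return r.vnodes[i].slot < r.vnodes[j].slot })
}

// lookup hashes the key onto the ring and walks clockwise to the next virtual node.
func (r *ring) lookup(key string) string {
	h := md5.Sum([]byte(key))
	slot := int(binary.BigEndian.Uint64(h[:8]) % ringSlots)
	i := sort.Search(len(r.vnodes), func(i int) bool { return r.vnodes[i].slot >= slot })
	if i == len(r.vnodes) { // wrapped past the last virtual node
		i = 0
	}
	return r.vnodes[i].node
}

func main() {
	r := &ring{}
	for _, n := range []string{"NodeA", "NodeB", "NodeC"} {
		r.addNode(n, 3) // hundreds per primary in practice; 3 here to mirror the diagram
	}
	fmt.Println("abc123 ->", r.lookup("abc123"))
}
```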
Pros:
- Small migration overhead when nodes are added or removed: only a small set of keys needs to move.
- Uniform data distribution (thanks to the virtual nodes)
Data Replication
To meet the durability requirement and avoid a single point of failure, each primary node has at least two secondary nodes to replicate data to.
Resilience: the primary node and the 1st secondary node are in the same region but different availability zones, and the 2nd secondary node is in another region. This makes the system resilient to a zone failure or a whole region going down.
When a write request comes in, the primary node writes the data to its memory first, then replicates it to its secondary nodes. After all of them succeed, the primary returns success to the client.
So there is a trade-off between availability and consistency here. Users can configure their consistency level, choosing between strong consistency and eventual consistency.
- Advantage of synchronous replication: followers are guaranteed to have an up-to-date copy of data that is consistent with the primary.
- Disadvantage: if a synchronous follower doesn’t respond, the write cannot be processed. The whole system grinds to a halt.
- Semi-synchronous: some followers are synchronous, the others are asynchronous.
Asynchronous replication has weak durability: a write is not guaranteed to survive a failure. Still, async replication is widely used. We can use semi-sync to make sure the nodes within the same zone are written synchronously.
Decision: we will use the semi-synchronous solution. Each primary node maintains a set of in-sync replicas (ISR) that are caught up with it; only members of this set are eligible for election as leader (a sketch of this write path follows).
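A rough Go sketch of the semi-synchronous write path under this decision: the primary waits for acks from the in-sync (same-region) replicas and ships to the remote replica asynchronously. The Replica type, latencies, and timeout values are made up for illustration:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Replica is a stand-in for a connection to a secondary node.
type Replica struct {
	Name   string
	InSync bool // member of the ISR set (kept caught up with the primary)
}

// replicate pretends to ship a log entry to the replica.
func (r *Replica) replicate(ctx context.Context, key, value string) error {
	select {
	case <-time.After(10 * time.Millisecond): // simulated network + apply latency
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// write applies locally, waits for every ISR member, and lets the rest catch up async.
func write(ctx context.Context, replicas []*Replica, key, value string) error {
	// 1. Append to the local write-ahead log and apply in memory (elided).

	acks := make(chan error, len(replicas))
	waitFor := 0
	for _, r := range replicas {
		r := r
		if r.InSync {
			waitFor++
			go func() { acks <- r.replicate(ctx, key, value) }()
		} else {
			go func() { _ = r.replicate(context.Background(), key, value) }() // async, best effort
		}
	}

	// 2. Block until all synchronous (ISR) replicas ack, or the request times out.
	for i := 0; i < waitFor; i++ {
		select {
		case err := <-acks:
			if err != nil {
				return fmt.Errorf("in-sync replica failed: %w", err)
			}
		case <-ctx.Done():
			return errors.New("timed out waiting for in-sync replicas")
		}
	}
	return nil // now safe to report success to the client
}

func main() {
	replicas := []*Replica{
		{Name: "S1.1", InSync: true},  // same region, other AZ: synchronous
		{Name: "S1.2", InSync: false}, // remote region: asynchronous
	}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	fmt.Println("write:", write(ctx, replicas, "abc123", "https://example.com/long"))
}
```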
For read requests, all the nodes, both primary and secondary, can be involved in handling traffic.
Storage on a Single Node
Since we store the data in memory, we use an LRU cache as the data structure for fast access. An LRU cache comprises a hash table and a doubly linked list. When an item is accessed (read or written), it becomes the most recently used item and is moved to the front of the list; when the cache is full, the least recently used item at the back is evicted. LRU caches are particularly useful in scenarios where the same data is likely to be accessed repeatedly over a short period.
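A minimal Go sketch of that LRU structure, built on a hash map plus container/list; the capacity and the simple string values are illustrative:

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key, value string
}

// LRU: hash table for O(1) lookup, doubly linked list for recency order.
type LRU struct {
	capacity int
	items    map[string]*list.Element
	order    *list.List // front = most recently used, back = least recently used
}

func NewLRU(capacity int) *LRU {
	return &LRU{capacity: capacity, items: make(map[string]*list.Element), order: list.New()}
}

func (c *LRU) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el) // accessed -> most recently used
	return el.Value.(*entry).value, true
}

func (c *LRU) Set(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, value})
	if c.order.Len() > c.capacity { // evict the least recently used item
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	cache := NewLRU(2)
	cache.Set("a", "1")
	cache.Set("b", "2")
	cache.Get("a")      // "a" becomes most recently used
	cache.Set("c", "3") // evicts "b"
	_, ok := cache.Get("b")
	fmt.Println("b present:", ok) // false
}
```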
Request flows
Client API:
- Get(string key) returns (string value, error)
- Set(string key, string value) returns error
- Del(string key) returns error
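In Go terms, the client API could be expressed as an interface like the sketch below (the interface and package names are made up for illustration):

```go
package kvclient

// Client is a sketch of the cluster-aware client API described above.
type Client interface {
	// Get returns the value stored under key, or an error (e.g. not found, timeout).
	Get(key string) (string, error)
	// Set creates or overwrites the value under key.
	Set(key, value string) error
	// Del removes the key.
	Del(key string) error
}
```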
Flow:
- Client: run the md5 hash function on the key and mod the result by the pre-defined number of hashing-ring slots: md5(key) % n.
- Client: the client then looks up the target host IP from its cached topology and sends the request to the target primary node based on the slot calculation result (see the routing sketch after this list).
- Primary node: the primary node appends to its write-ahead log first; the log is an append-only data structure. Then it saves the key-value pair in memory.
- Primary node: it ships the newly written log entries to the other replicas so they build the same copy of the log.
- Replicas: each replica receives the log entries, appends them to its own write-ahead log, then applies the update or creation of the key-value pair.
- Replicas return success to the primary; the primary returns success to the client.
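A sketch of the client-side routing step: md5 of the key, mod by the slot count, then a lookup in the cached slot-range -> node table. The table values mirror the metadata example above; the function names are illustrative:

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

const ringSlots = 1024

// slotRange maps a contiguous slot range to the primary node that owns it.
type slotRange struct {
	from, to int // inclusive
	node     string
}

// topology is the client's cached view of the cluster (from the metadata example).
var topology = []slotRange{
	{0, 99, "NodeA"}, {100, 199, "NodeB"}, {200, 299, "NodeC"},
	{300, 399, "NodeA"}, {400, 499, "NodeB"}, {500, 599, "NodeC"},
	{600, 699, "NodeA"}, {700, 799, "NodeB"}, {800, 899, "NodeC"},
	{900, 999, "NodeA"}, {1000, 1023, "NodeB"},
}

// route hashes the key, mods it onto the ring, and finds the owning primary.
func route(key string) (string, int) {
	h := md5.Sum([]byte(key))
	slot := int(binary.BigEndian.Uint64(h[:8]) % ringSlots)
	for _, r := range topology {
		if slot >= r.from && slot <= r.to {
			return r.node, slot
		}
	}
	return "", slot // unreachable if the table covers every slot 0..1023
}

func main() {
	node, slot := route("abc123")
	fmt.Printf("key abc123 -> slot %d -> %s\n", slot, node)
}
```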
Failure Tolerance
Primary node is down:
- A majority of primary nodes must be available for the cluster to accept requests. Nodes in the cluster perform health checks on each other and share that information using a gossip protocol. A failure could be a crash, power outage, network issue, etc. A node is declared failed once a majority of primaries agree, at which point one of its replicas is promoted to primary.
- Choose a new leader: via an election process, or appointed by a previously elected controller node. This is a consensus problem.
- Reconfigure the system to use the new leader and let the clients know.
Secondary node is down:
- Catch-up recovery: each follower keeps a log of data changes, so it knows the last transaction it processed before the fault occurred. It connects to the leader, requests all the data changes that occurred while it was down, and replays those writes from the log (see the sketch below). We only select replicas in the in-sync list to serve reads, to guarantee data freshness.
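A rough Go sketch of catch-up recovery, assuming each log entry carries a monotonically increasing index and the leader can serve entries after a given index; the types and method names are hypothetical:

```go
package main

import "fmt"

// LogEntry is one record in the write-ahead log.
type LogEntry struct {
	Index int
	Key   string
	Value string
}

// Leader is a stand-in for the primary the follower reconnects to.
type Leader struct{ wal []LogEntry }

// EntriesSince returns all log entries with Index > after.
func (l *Leader) EntriesSince(after int) []LogEntry {
	var out []LogEntry
	for _, e := range l.wal {
		if e.Index > after {
			out = append(out, e)
		}
	}
	return out
}

// Follower remembers the last index it applied before the fault.
type Follower struct {
	lastApplied int
	store       map[string]string
}

// catchUp pulls the missing entries from the leader and replays them in order.
func (f *Follower) catchUp(leader *Leader) {
	for _, e := range leader.EntriesSince(f.lastApplied) {
		f.store[e.Key] = e.Value // replay the write
		f.lastApplied = e.Index
	}
}

func main() {
	leader := &Leader{wal: []LogEntry{
		{1, "a", "1"}, {2, "b", "2"}, {3, "a", "3"},
	}}
	follower := &Follower{lastApplied: 1, store: map[string]string{"a": "1"}}
	follower.catchUp(leader)
	fmt.Println(follower.store, "lastApplied =", follower.lastApplied) // map[a:3 b:2] lastApplied = 3
}
```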
Write to primary node failure:
- A write can fail because of a network issue while the primary node is still up; in that case a retry helps. If the primary node is down, we may need to return an error to the caller and rely on the caller to handle it, but primary recovery or replacement should be very fast thanks to cluster management.
Read failure:
- The client just retries a few times, ideally against a different replica (see the sketch below); since we usually have multiple replicas serving each key range, a single node going down doesn't affect overall availability.
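A small Go sketch of that read retry logic, rotating across replicas with a simple backoff; the replica list, attempt count, and backoff values are made-up parameters:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// getFrom is a stand-in for a read RPC against one replica.
type getFrom func(replica, key string) (string, error)

// getWithRetry tries each replica in turn, backing off between attempts.
func getWithRetry(replicas []string, key string, attempts int, do getFrom) (string, error) {
	backoff := 10 * time.Millisecond
	var lastErr error
	for i := 0; i < attempts; i++ {
		replica := replicas[i%len(replicas)] // rotate across replicas
		value, err := do(replica, key)
		if err == nil {
			return value, nil
		}
		lastErr = err
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff
	}
	return "", fmt.Errorf("read failed after %d attempts: %w", attempts, lastErr)
}

func main() {
	replicas := []string{"P1", "S1.1", "S1.2"}
	// Simulated RPC: the first replica is down, the others answer.
	do := func(replica, key string) (string, error) {
		if replica == "P1" {
			return "", errors.New("connection refused")
		}
		return "https://example.com/long", nil
	}
	v, err := getWithRetry(replicas, "abc123", 3, do)
	fmt.Println(v, err)
}
```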
Future Improvements
- Twemproxy: decouple the routing rules and node discovery from the client into a lightweight proxy.
- Cluster management: set up a control plane to manage node reboot, node replacement, node recovery, auto-scaling, etc.