My Solution for Design a Key Value Store with Score: 9/10
by dynamo7055
System requirements
Functional:
- clients can store key-value data by calling the APIs provided by the store
- the store is distributed
- the system can auto-scale depending on the load
- the store can support large data sets
Non-Functional:
- high availability: clients can access the key-value store at any time
- high reliability: no data loss
- low latency: clients get responses quickly even if some nodes have failed
Capacity estimation
Estimate the scale of the system you are going to design...
- the size of a key-value entry is less than 10 KB
API design
- T get(T key): returns the value associated with the key if the key-value entry exists.
- put(T key, T value): if the key already exists in the store, update its value; otherwise, create a new key-value entry (see the sketch below).
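A minimal sketch of the get/put semantics, using a plain in-memory dict as a stand-in for the distributed store; the class name and string/bytes types are illustrative assumptions, not part of the design:

```python
from typing import Dict, Optional

class KeyValueStore:
    """Single-node illustration of the get/put semantics (names are assumptions)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        # Return the value if the key-value entry exists, otherwise None.
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        # Update the value if the key exists; otherwise create a new entry.
        self._data[key] = value
```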
Database design
To store the key-value entries on disk, we can use SSTables (sorted string tables).
High-level design
Read: client -> Server -> SSTable
Write: client -> Server -> SSTable
Overall design
client ----> Nodes
the nodes are distributed and each has a server, a memory cache, and SSTables.
System Diagram:
client --------Coordinator Node ----->>>> node1, node2, node3, .... nodeN (these nodes and the coordinator are on the hash ring).
Request flows
Explain how the request flows from end to end in your high level design. Also you could draw a sequence diagram using the diagramming tool to enhance your explanation...
The diagram is in the high-level design section above. This is the request flow at a high level:
- clients use APIs to communicate with the key-value store: get() and put()
- the client sends read/write requests to the coordinator node, which acts as a proxy between the client and the server nodes.
- Nodes are distributed and are spread on a hash ring
- Nodes are decentralized; they can auto-scale depending on the load.
- data is replicated onto multiple nodes.
- No single point of failure, since every node has the same responsibilities: serving client APIs, replicating data, detecting failures, repairing failures, etc.
Detailed component design
Dig deeper into 2-3 components and explain in detail how they work. For example, how well does each component scale? Any relevant algorithm or data structure you like to use for a component? Also you could draw a diagram using the diagramming tool to enhance your design...
- Auto scaling:
- to minimize the number of keys that must be remapped when servers are added or removed, we use consistent hashing for servers. First we determine the hash space, whichever hash function we use. We then form a hash ring by connecting the last and first values of the hash space. Virtual nodes (derived from the physical nodes) are spread evenly on the ring. When a key is mapped onto the ring, we walk clockwise to find the first virtual node; that virtual node is where the key-value entry is stored (see the sketch after this item).
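A minimal sketch of a consistent-hash ring with virtual nodes, assuming MD5 as the hash function and 100 virtual nodes per physical node (both are illustrative choices, not fixed by the design):

```python
import bisect
import hashlib
from typing import Dict, List

def _hash(value: str) -> int:
    # Any stable hash works; MD5 is used here only to define a fixed hash space.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Hash ring with virtual nodes; more virtual nodes per physical node
    spreads the load more evenly."""

    def __init__(self, vnodes_per_node: int = 100) -> None:
        self.vnodes_per_node = vnodes_per_node
        self._ring: List[int] = []          # sorted virtual-node positions
        self._owners: Dict[int, str] = {}   # position -> physical node

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes_per_node):
            pos = _hash(f"{node}#vn{i}")
            bisect.insort(self._ring, pos)
            self._owners[pos] = node

    def remove_node(self, node: str) -> None:
        for i in range(self.vnodes_per_node):
            pos = _hash(f"{node}#vn{i}")
            self._ring.remove(pos)
            del self._owners[pos]

    def node_for_key(self, key: str) -> str:
        # Map the key onto the ring and walk clockwise to the next virtual node.
        if not self._ring:
            raise RuntimeError("no nodes on the ring")
        pos = _hash(key)
        idx = bisect.bisect_right(self._ring, pos) % len(self._ring)
        return self._owners[self._ring[idx]]
```

Usage would look like `ring.add_node("node1")` for each server, then `ring.node_for_key("user:42")` to find the owner; only keys near the added/removed node's positions get remapped.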
- Data Partition
- For a large application, we cannot store the whole data set on a single server. We need to partition the data set across multiple servers. We can use the consistent hashing described above for the key-to-node mapping.
- Data Replication
- To avoid data loss if a server crashes, we need to replicate each entry onto multiple nodes.
- When a key is mapped onto the hash ring, walk clockwise to collect N virtual nodes and store copies of the data on them. These N virtual nodes must belong to different physical nodes (see the sketch after this item).
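A sketch of the clockwise replica selection, reusing the `ConsistentHashRing` and `_hash` from the consistent hashing sketch above; the replication factor N=3 is an assumed value:

```python
import bisect

class ReplicatedHashRing(ConsistentHashRing):  # extends the sketch above
    def replica_nodes(self, key: str, n: int = 3) -> list:
        """Walk clockwise from the key's position, collecting virtual nodes
        until N distinct physical nodes are found."""
        pos = _hash(key)
        start = bisect.bisect_right(self._ring, pos)
        replicas: list = []
        for step in range(len(self._ring)):
            owner = self._owners[self._ring[(start + step) % len(self._ring)]]
            if owner not in replicas:   # skip extra vnodes of already-chosen nodes
                replicas.append(owner)
            if len(replicas) == n:
                break
        return replicas
```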
- Consistency
- To ensure all clients get the same data when reading the same key from different nodes, the data needs to be propagated to the other replicas whenever it is created or changed. The main consistency models are:
- strong consistency: read/write operations are blocked until all replicas agree on the write. This lowers system performance. It is usually used where strong consistency is required, such as in banking systems.
- weak consistency: a client may read outdated data from other replicas after a write lands on one node.
- eventual consistency: a specific form of weak consistency; given enough time, the new data is propagated to all replicas. Based on our requirements, we will use this model.
- Consistency Resolution
- when two clients write to the same key on different nodes, their values will conflict.
- We can attach a vector clock (a set of [ServerNumber, DataVersion] pairs) to each value and determine the final value by checking whether one version precedes or succeeds the other; if neither does, the writes are concurrent and must be reconciled (a minimal sketch follows this item).
- downsides of the vector clock:
- it adds complicated resolution logic
- the vector clock may grow rapidly. This can be mitigated by removing the oldest [server, version] pairs once a predefined length threshold is exceeded.
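A minimal sketch of vector-clock comparison, assuming the clock is stored as a server-to-version map; the `resolve` helper and all names here are illustrative:

```python
from typing import Dict, Optional

class VectorClock:
    """Vector clock as a map of server -> version, matching the
    [ServerNumber, DataVersion] pairs described above."""

    def __init__(self, versions: Optional[Dict[str, int]] = None) -> None:
        self.versions = dict(versions or {})

    def increment(self, server: str) -> None:
        # The coordinating server bumps its own counter on every write.
        self.versions[server] = self.versions.get(server, 0) + 1

    def descends_from(self, other: "VectorClock") -> bool:
        # True if this clock is equal to or newer than `other` on every server.
        return all(self.versions.get(s, 0) >= v for s, v in other.versions.items())

def resolve(a: VectorClock, b: VectorClock) -> Optional[VectorClock]:
    """Keep the successor if one clock descends from the other; otherwise the
    two writes are concurrent and must be reconciled (e.g. by the client)."""
    if a.descends_from(b):
        return a
    if b.descends_from(a):
        return b
    return None  # conflict: concurrent writes
```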
- Failure Detection
- usually we cannot declare a node down based on a single source; we need confirmation from at least two independent sources.
- Decentralized Failure Detection (Gossip Protocol):
- Each node maintains a membership list containing member IDs and heartbeat counters.
- Each node periodically increments its own heartbeat counter.
- Each node periodically sends its membership list to a random set of other nodes, which in turn propagate it to other nodes.
- If a node's heartbeat counter has not been incremented for a predefined period, the node is considered offline (see the sketch after this list).
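A sketch of the gossip-style membership bookkeeping described above; the failure timeout and fanout values are assumptions, and the actual network transport is omitted:

```python
import random
import time
from dataclasses import dataclass, field
from typing import Dict, List

FAIL_AFTER_SECONDS = 10.0   # assumed "predefined period" for marking a node offline

@dataclass
class MemberEntry:
    heartbeat: int = 0
    last_updated: float = field(default_factory=time.time)

class GossipNode:
    """Membership-list bookkeeping for one node (network sending omitted)."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.members: Dict[str, MemberEntry] = {node_id: MemberEntry()}

    def tick(self) -> None:
        # Periodically increment our own heartbeat counter.
        self.members[self.node_id].heartbeat += 1
        self.members[self.node_id].last_updated = time.time()

    def merge(self, remote_members: Dict[str, MemberEntry]) -> None:
        # Merge a membership list received via gossip: keep the higher heartbeat.
        for node_id, entry in remote_members.items():
            local = self.members.get(node_id)
            if local is None or entry.heartbeat > local.heartbeat:
                self.members[node_id] = MemberEntry(entry.heartbeat, time.time())

    def gossip_targets(self, fanout: int = 3) -> List[str]:
        # Pick a random subset of peers to send our membership list to.
        peers = [n for n in self.members if n != self.node_id]
        return random.sample(peers, min(fanout, len(peers)))

    def offline_nodes(self) -> List[str]:
        # A node whose heartbeat has not advanced for the period is considered offline.
        now = time.time()
        return [n for n, e in self.members.items()
                if n != self.node_id and now - e.last_updated > FAIL_AFTER_SECONDS]
```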
- Handling failure:
- if a node is down, other nodes will handle its requests temporarily and buffer the missed writes. When the node comes back up, those changes are pushed back to it to restore data consistency. This is called "hinted handoff" (see the sketch below).
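A minimal sketch of a hinted-handoff buffer, assuming hints are kept in memory on the node that handled the write; the names and the `send_write` callback are illustrative:

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

class HintedHandoffBuffer:
    """Buffers writes intended for a down node and replays them later."""

    def __init__(self) -> None:
        self._hints: Dict[str, List[Tuple[str, bytes]]] = defaultdict(list)

    def store_hint(self, down_node: str, key: str, value: bytes) -> None:
        # The temporary handler keeps the write as a hint for the down node.
        self._hints[down_node].append((key, value))

    def replay(self, node: str, send_write: Callable[[str, str, bytes], None]) -> None:
        # Once the node is back up, push the buffered changes to it.
        for key, value in self._hints.pop(node, []):
            send_write(node, key, value)
```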
- Read Path
- [Cache Hit]: client ----> memory cache (in memory)
- 1. the client sends a read request to the memory cache; if the data is in the cache, it is returned to the client.
- [Cache Miss]: client ----> memory cache ----> bloom filter ----> SSTable ----> data result ----> client
- 1. if the requested data is not in the cache, it has to be read from disk (SSTables).
- 2. a bloom filter is used to determine which SSTables might contain the requested data.
- 3. when the requested data is found in an SSTable, it is returned to the client (a bloom filter sketch follows this list).
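A small bloom filter sketch, assuming one filter per SSTable so reads can skip tables that definitely do not contain the key; the bit-array size and number of hash functions are illustrative:

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: False means "definitely absent",
    True means "possibly present" (so the SSTable must be checked)."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 3) -> None:
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive several bit positions per key from salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

On a cache miss, the node only reads an SSTable whose filter returns True for the key, which avoids most unnecessary disk lookups at the cost of occasional false positives.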
- Write Path
- client --(1)--> commit log --(2)--> memory cache --(3)--> SSTable
(1). the client sends a write request; the write is first persisted to the commit log.
(2). the data is then saved to the memory cache.
(3). the data is later flushed from the cache to an SSTable (a minimal sketch follows these steps).
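A minimal sketch of the write path above (commit log, then memory cache, then flush to a sorted SSTable); the file names and the flush threshold are assumptions:

```python
import json
from typing import Dict

class WritePath:
    """Commit-log -> memory cache -> SSTable flush flow for a single node."""

    def __init__(self, commit_log_path: str = "commit.log", flush_threshold: int = 1000):
        # The commit log stays open for appending; a real node would also fsync.
        self.commit_log = open(commit_log_path, "a")
        self.cache: Dict[str, str] = {}        # in-memory write cache (memtable)
        self.flush_threshold = flush_threshold
        self.sstable_count = 0

    def put(self, key: str, value: str) -> None:
        # (1) persist to the commit log first so the write survives a crash
        self.commit_log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.commit_log.flush()
        # (2) save the data in the memory cache
        self.cache[key] = value
        # (3) flush the cache to a new SSTable once it is large enough
        if len(self.cache) >= self.flush_threshold:
            self._flush()

    def _flush(self) -> None:
        self.sstable_count += 1
        with open(f"sstable_{self.sstable_count}.txt", "w") as sstable:
            for key in sorted(self.cache):     # SSTable keys are written in sorted order
                sstable.write(json.dumps({"k": key, "v": self.cache[key]}) + "\n")
        self.cache.clear()
```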
Trade offs/Tech choices
Explain any trade offs you have made and why you made certain tech choices...
- How to trade off consistency, availability and partition tolerance?
- because network partitions are unavoidable in practice, the system has to tolerate partitions between nodes.
- Since a real system can only guarantee two out of the three, we have to choose between consistency and availability.
- In our design, since strong consistency is not required, we prefer a highly available system with an eventual consistency strategy.
Failure scenarios/bottlenecks
Try to discuss as many failure scenarios/bottlenecks as possible.
- how to handle when a node failed?
- Since we use consistent hashing to map keys to server nodes, when a node fails, other nodes will process its requests and buffer the changes as hints. When the server comes back up, those changes are pushed back to it to restore data consistency. This is called "hinted handoff" (sketched in the detailed component design above).
Future improvements
What are some future improvements you would make? How would you mitigate the failure scenario(s) you described above?