What happens if a TiDB leader goes down? How does TiDB use Raft to ensure data security and consistency?

TiDB is a distributed SQL database designed for horizontal scalability and strong consistency, and it is widely used where high throughput and reliability are essential. TiDB stores its data in TiKV, a distributed transactional key-value store. TiKV relies on the Raft consensus algorithm to keep replicas of the data consistent, and Raft matters most when the leader of a replication group fails. The TiDB SQL servers themselves are stateless, so the "leader" discussed here is the Raft leader in the storage layer.

Raft in TiDB

Raft is a consensus algorithm designed to be easier to understand than Paxos. It keeps a replicated log consistent across multiple nodes. An entry counts as committed only once it is stored on a majority of nodes, so the failure of a single node (or of any minority of nodes) cannot cause committed data to be lost or replicas to diverge.
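To make the majority rule concrete, here is a minimal sketch of the quorum arithmetic. The function names `quorum` and `tolerated_failures` are illustrative and are not part of TiDB or TiKV.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while a majority is still reachable."""
    return replicas - quorum(replicas)


if __name__ == "__main__":
    for n in (3, 4, 5):
        # prints: 3 2 1, then 4 3 1, then 5 3 2
        print(n, quorum(n), tolerated_failures(n))
```

A group of four replicas tolerates no more failures than a group of three, which is why odd replica counts are the usual choice.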

Raft divides time into terms, and each term has at most one leader, elected from among the nodes. The leader handles client requests and replicates data to the followers, so every change (recorded as a log entry) is applied in the same order across the cluster.

Leader Election

In Raft, leader election is triggered when a follower does not receive a heartbeat from the current leader within its election timeout. Assuming the leader is down or unreachable, the follower increments its term number, becomes a candidate, and requests votes from the other nodes.
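The timeout-driven transition can be sketched as a small state machine. This is an illustrative model, not TiKV's implementation: the `Node` class, its field names, and the 150 to 300 ms timeout range are assumptions chosen for the example. The timeout is randomized so that followers are unlikely to become candidates at the same moment.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: str
    term: int = 0
    role: str = "follower"
    voted_for: Optional[str] = None
    last_heartbeat: float = 0.0
    election_timeout: float = 0.0

    def reset_timer(self, now: float, rng: random.Random) -> None:
        # Pick a fresh randomized timeout (illustrative range, in seconds).
        self.last_heartbeat = now
        self.election_timeout = rng.uniform(0.15, 0.30)

    def on_heartbeat(self, now: float, leader_term: int, rng: random.Random) -> None:
        # Heartbeats from a stale term are ignored.
        if leader_term < self.term:
            return
        if leader_term > self.term:
            self.voted_for = None  # a new term starts with no vote cast
        self.term = leader_term
        self.role = "follower"
        self.reset_timer(now, rng)

    def tick(self, now: float, rng: random.Random) -> bool:
        """Start an election if the timeout has expired; return True if so."""
        if self.role == "leader":
            return False
        if now - self.last_heartbeat < self.election_timeout:
            return False
        self.term += 1
        self.role = "candidate"
        self.voted_for = self.node_id  # a candidate votes for itself
        self.reset_timer(now, rng)
        return True
```

A caller would invoke `tick` periodically with the current time and, when it returns `True`, send vote requests to the other nodes.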

If the candidate receives votes from a majority of the nodes in the cluster, it becomes the leader for that term. Each vote request carries the index and term of the candidate’s last log entry, and a node grants its vote only if the candidate’s log is at least as up-to-date as its own. Because every committed entry is stored on a majority of nodes, this rule ensures that the elected leader already holds all committed entries.
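The voting rule can be sketched as follows. The `Voter` class and its method name are hypothetical; the comparison follows the rule in the Raft paper, where logs are compared first by the term of the last entry and then by its index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    current_term: int
    last_log_term: int
    last_log_index: int
    voted_for: Optional[str] = None

    def handle_request_vote(
        self,
        candidate_id: str,
        candidate_term: int,
        candidate_last_log_term: int,
        candidate_last_log_index: int,
    ) -> bool:
        # Reject candidates from an older term.
        if candidate_term < self.current_term:
            return False
        # A newer term resets any vote cast in the previous term.
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None
        # At most one vote per term.
        if self.voted_for not in (None, candidate_id):
            return False
        # Tuple comparison: higher last term wins; on a tie, the longer log wins.
        up_to_date = (candidate_last_log_term, candidate_last_log_index) >= (
            self.last_log_term,
            self.last_log_index,
        )
        if up_to_date:
            self.voted_for = candidate_id
        return up_to_date
```

Note that a candidate with a shorter log can still win a vote if its last entry comes from a later term.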

Scenario: Leader Fails in TiDB

When the Raft leader in a TiDB cluster fails, the following steps occur:

  1. Detection of Leader Failure: Followers notice that the leader’s heartbeats have stopped. Once a follower’s election timeout expires, it starts a new leader election.
  2. Election of New Leader: The follower increments its term, becomes a candidate, and requests votes from the other nodes. Because of the voting rule, the winner already holds every committed entry; it does not fetch entries from followers. Instead, it brings each follower’s log in line with its own.
  3. Resuming Operations: The new leader begins accepting and processing client requests. Entries that the previous leader had not yet committed are either committed by the new leader (if they are present in its log) or discarded, and any request that was never acknowledged can be retried by the client.
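The way a new leader brings a follower's log in line with its own can be sketched with a simplified version of Raft's AppendEntries consistency check. The function below is an illustrative model with assumed names and an in-memory list as the log; it is not TiKV code, and it leaves out terms on the request, commit indexes, and persistence.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(
    log: List[Entry], prev_index: int, prev_term: int, entries: List[Entry]
) -> bool:
    """Apply a leader's AppendEntries request to a follower's log.

    Indexes are 1-based; prev_index == 0 means "before the first entry".
    Returns False when the follower's log does not match at prev_index,
    in which case the leader would retry with an earlier prev_index.
    """
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False
    for offset, entry in enumerate(entries):
        pos = prev_index + offset  # 0-based slot for this entry
        if pos < len(log):
            if log[pos][0] != entry[0]:
                # Conflict: drop this entry and everything after it.
                del log[pos:]
                log.append(entry)
        else:
            log.append(entry)
    return True


if __name__ == "__main__":
    # The follower holds an uncommitted entry from the failed leader (term 2).
    follower = [(1, "a"), (1, "b"), (2, "x")]
    # The new leader (term 3) sends its entry for index 3.
    append_entries(follower, prev_index=2, prev_term=1, entries=[(3, "c")])
    print(follower)  # [(1, 'a'), (1, 'b'), (3, 'c')]
```

In this example the uncommitted entry from term 2 is overwritten, which is safe because it was never acknowledged as committed.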

Data Security and Consistency

TiDB relies on Raft’s log replication to keep data durable and consistent:

  • Log Replication: The leader replicates each log entry to its followers and treats the entry as committed only after a majority of replicas (counting itself) have persisted it. This ensures that a committed transaction survives the failure of any minority of nodes.
  • Consistent Reads and Writes: Unless a stale read is requested, read requests are sent to the leader by default so that they return up-to-date data. For writes, the leader replicates the entry and waits for a majority to acknowledge it before applying it.
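The majority-acknowledgement rule can be sketched as a calculation over the highest log index each replica is known to have persisted. The function name `commit_index` and the `match_index` mapping are assumptions for this example. Full Raft adds one more condition that is omitted here: a leader only commits an entry this way if the entry belongs to its own current term.

```python
from typing import Dict


def commit_index(match_index: Dict[str, int], leader_last_index: int) -> int:
    """Highest log index stored on a majority of replicas.

    match_index maps each follower to the highest index it has acknowledged;
    the leader's own last index counts as one more replica.
    Simplification: the current-term check from the Raft paper is omitted.
    """
    indexes = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th largest index is held by at least `majority` replicas.
    return indexes[majority - 1]


if __name__ == "__main__":
    # Three replicas: the leader is at 10, one follower at 8, one lagging at 7.
    print(commit_index({"f1": 8, "f2": 7}, leader_last_index=10))  # 8
```

With one follower lagging or offline, writes still commit as long as the leader and the other follower persist them.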

Summary Table of TiDB's Failure Handling

Event | Action
------ | ------
Leader Fails | Followers time out and trigger a leader election
Election | A candidate whose log is at least as up-to-date as those of a majority is elected leader
Data Integrity | New leader brings followers' logs in line with its own
Data Recovery | Lagging or restarted replicas catch up through log replication
Service Resumption | New leader accepts queries and updates state

Conclusion

In summary, TiDB handles a leader failure through the Raft protocol: a new leader is elected automatically, committed data is preserved, and service resumes after a brief interruption. This matters for systems where downtime or data inconsistencies carry significant operational and financial costs.

