Learn about the Raft Consensus Algorithm developed by Stanford Platform Lab, its key ideas, client-server interactions, leader election, log replication, ensuring safety, Raft basics, server states, leader terms, APIs, log organization, and leader election process. Discover how this algorithm simplifies reaching consensus in distributed systems.
Distributed Systems: Raft Consensus Algorithm
• Developed by Stanford Platform Lab
• Motivated by the needs of RAMCloud
• Goals: i) develop a more “understandable” consensus algorithm; ii) reach consensus not on a single value, but on the entire (replicated) state machine (or log)
• Key Ideas:
  • Separate the notions of log & state machine
  • Use a (strong) leader-based approach to simplify the process of reaching consensus: only the leader can propose
  • Separate leader election from reaching consensus
  • Notions of leader term & log index
  • Two simple RPC APIs
  • Crash recovery via AppendEntries
  • Handling membership changes via joint consensus during the transition
CSci8211: Distributed Systems: Raft Consensus Algorithm
Log vs. State Machine
• Each server (or process) maintains a separate log & state machine
  • commands are stored linearly in the log
  • log is immutable: append only!
  • log is persistent: stored in NV memory buffer and disk
• Log consistency ensures state machine consistency
  • as long as commands are stored in the same order in the logs of all servers (log consistency)
  • and applied to the state machines in the same order, the state machines are consistent
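The link between log consistency and state machine consistency can be illustrated with a minimal sketch. The key-value command format and the `apply_log` helper below are assumptions made for illustration; they are not part of Raft itself.

```python
from typing import Dict, List, Tuple

# Hypothetical command format: ("set", key, value) or ("delete", key, None).
Command = Tuple[str, str, object]


def apply_log(log: List[Command]) -> Dict[str, object]:
    """Replay every command, in log order, against an empty key-value store."""
    state: Dict[str, object] = {}
    for op, key, value in log:
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3), ("delete", "y", None)]

# Two servers holding identical logs end up with identical state machines.
server_a = apply_log(log)
server_b = apply_log(list(log))

# Applying the same commands in a different order can give a different state.
reordered = apply_log(list(reversed(log)))
```

Because the state machine is deterministic, agreeing on the log order is enough to keep every replica's state identical.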
Client-Server Interactions
• Client sends a command to one of the servers (the leader)
• The server (leader) adds the command to its log
• Leader forwards the new log entry to the other servers
• Once a consensus has been reached, each server's state machine processes the command
• Leader sends the reply to the client
Basic Raft Consensus Algorithm
Decomposes the problem into three fairly independent sub-problems
• Leader election: how servers will pick a single leader during each term
  • Employ a randomized algorithm (w/ randomized timeout value)
  • Periodic “heartbeat” messages to maintain leadership
• Log replication:
  • How the leader will accept log entries from clients and propagate them to the other servers
  • How to ensure the logs of servers remain in a consistent state, especially how to recover past log entries after a crash
• Ensuring safety
  • Impose conditions on who can be elected a leader in a new term!
Raft Basics: Servers
• Raft cluster: consists of several (e.g., 5) servers
• Each server can be in one of three states: Leader, Candidate (to be a new leader), Follower
• Followers are passive: they simply reply to requests coming from their leader
Raft Basics: (Leader) Terms
• Raft timelines: elections & terms (normal operations)
• Terms: epochs of arbitrary length
  • start with the election of a leader
  • end when the current leader becomes unavailable, or when no leader can be selected during the election (split vote)
• Different servers may observe transitions between terms at different times or even miss them
Raft Basics: More on Terms
• Terms act as (totally ordered global) logical clocks
  • allow servers to detect and discard obsolete information (messages from stale leaders, …)
• Each server maintains a current term number
  • includes it in all its communications
• A server receiving a message with a higher term number updates its own number
• A leader or a candidate receiving a message with a higher term number becomes a follower
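The term rules above can be sketched as a single check run on every incoming message. The `Server` class and the `observe_term` helper are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    current_term: int
    role: str  # "follower", "candidate", or "leader"


def observe_term(server: Server, message_term: int) -> bool:
    """Apply the term rules to an incoming message.

    Returns True if the message is current enough to process,
    False if it carries a stale term and should be discarded.
    """
    if message_term > server.current_term:
        # A higher term always wins: adopt it and step down to follower.
        server.current_term = message_term
        server.role = "follower"
        return True
    # Equal terms are current; smaller terms come from stale senders.
    return message_term == server.current_term
```

For example, a leader in term 3 that sees a message from term 5 adopts term 5 and becomes a follower, while a later message from term 4 is rejected as stale.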
Log Organization & Log Index
Colors identify terms
Raft Basics: APIs
• Two simple APIs: RequestVote & AppendEntries
  • RequestVote: initiated by candidates during elections
  • AppendEntries: initiated by leaders
    • to replicate log entries to other servers
    • as a heartbeat message: empty AppendEntries( )
  • AppendEntries is idempotent: duplicates lead to the same result
• Both implemented as RPCs
  • this implies that every successful RPC call receives a result upon return; otherwise the RPC call fails (e.g., times out) or receives a “false” return value
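As a rough sketch, the two RPCs can be modelled as plain message types. The fields follow the argument lists summarized in the Raft paper; the Python class names, the snake_case spelling, and the `is_heartbeat` helper are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RequestVoteArgs:
    term: int            # candidate's term
    candidate_id: int    # candidate requesting the vote
    last_log_index: int  # index of candidate's last log entry
    last_log_term: int   # term of candidate's last log entry


@dataclass
class AppendEntriesArgs:
    term: int            # leader's term
    leader_id: int
    prev_log_index: int  # index of the entry immediately preceding the new ones
    prev_log_term: int   # term of that preceding entry
    leader_commit: int   # leader's most recently committed index
    # Each entry is (term, command); an empty list makes this a heartbeat.
    entries: List[Tuple[int, str]] = field(default_factory=list)

    def is_heartbeat(self) -> bool:
        return not self.entries


@dataclass
class RpcReply:
    term: int      # receiver's current term, so a stale sender can update itself
    success: bool  # vote granted, or entries accepted
```

A heartbeat is then just `AppendEntriesArgs(...)` with no entries, which is why it can reuse the same handler as ordinary replication.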
Leader Election
• All servers start as followers, and remain so as long as they receive valid RPCs from a leader or candidate (during election)
  • each server maintains an election timer, with a random value, say, from [150ms, 300ms], and resets it upon RPCs from the leader/candidate
• A follower starts an election when it loses contact w/ the leader/candidate, i.e., when its election timer times out:
  • increments its current term
  • transitions to candidate state (and restarts the election timer)
  • votes for itself
  • issues RequestVote RPCs in parallel to all the other servers
• There are additional conditions on whether a follower can be a leader candidate, to ensure certain safety properties
  • will be discussed later
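The steps a follower takes when its election timer fires can be sketched as follows. The `Server` fields and the function names are hypothetical, the RPCs themselves are not sent here, and the returned dictionary only stands in for the RequestVote arguments.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Server:
    server_id: int
    current_term: int = 0
    role: str = "follower"
    voted_for: Optional[int] = None
    timeout_ms: float = 0.0


def new_election_timeout(rng: random.Random,
                         low_ms: float = 150.0,
                         high_ms: float = 300.0) -> float:
    """Pick a fresh randomized timeout from [low_ms, high_ms]."""
    return rng.uniform(low_ms, high_ms)


def start_election(server: Server, rng: random.Random) -> Dict[str, int]:
    """Run the follower-to-candidate transition and build the vote request."""
    server.current_term += 1                       # increment current term
    server.role = "candidate"                      # transition to candidate
    server.timeout_ms = new_election_timeout(rng)  # restart the election timer
    server.voted_for = server.server_id            # vote for itself
    # A real server would now send this to all other servers in parallel.
    return {"term": server.current_term, "candidate_id": server.server_id}
```

Drawing a new random timeout for every election is what keeps servers from timing out in lockstep.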
Leader Election (cont’d)
• A (leader) candidate remains so until i) it wins the election; ii) it drops out of the race; iii) a period goes by with no winner
• When to drop out of the leader election race:
  • when a candidate receives an AppendEntries RPC (not RequestVote) from another server claiming to be the leader
  • if the leader’s term is greater than or equal to its own term, it recognizes that leader and becomes a follower again
  • otherwise the candidate rejects the RPC and remains a candidate
• Winning an election: a candidate must receive votes from a majority of the servers in the cluster for the same term
  • each server will vote for at most one candidate in a given term -- the first one that contacted it; esp. reject those w/ smaller terms
  • majority (i.e., quorum) rule ensures that at most one candidate can win the election during each term
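The voting rules above can be sketched as below. The `Voter` class and function names are hypothetical, and the sketch deliberately leaves out the additional log-based restriction on candidates that is introduced later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[int] = None  # candidate voted for in current_term


def handle_request_vote(voter: Voter, candidate_term: int, candidate_id: int) -> bool:
    """Grant at most one vote per term, on a first-come, first-served basis."""
    if candidate_term < voter.current_term:
        return False  # reject candidates with smaller terms
    if candidate_term > voter.current_term:
        # Entering a new term clears the vote cast in the old one.
        voter.current_term = candidate_term
        voter.voted_for = None
    if voter.voted_for in (None, candidate_id):
        voter.voted_for = candidate_id
        return True
    return False  # already voted for someone else in this term


def wins_election(votes: int, cluster_size: int) -> bool:
    """A candidate wins with votes from a strict majority of the cluster."""
    return votes > cluster_size // 2
```

Granting a repeated request from the same candidate keeps the handler safe against retransmitted RPCs.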
Leader Election
• Split elections: no candidate obtains a majority of the votes of the servers in the cluster
  • a split election is “detected” or “declared” when the election timer at a candidate times out
  • the candidate starts a new election after incrementing its term number, and restarts its election timer
• Randomized election timer values make it unlikely that many servers compete as candidates for the new leader, so split elections are rare in general
  • Q: What about network partition?
• Once a new leader is elected:
  • it immediately issues “heartbeat” messages (empty AppendEntries) with the new term contained in the message
  • this forces other candidates to drop out as soon as possible!
Strong Leader & Log Replication
• (Strong) Leader: all client operations go thru the leader
  • accepts client commands, and appends them to its log (new entry)
  • issues AppendEntries RPCs in parallel to all followers
  • applies the entry to its state machine once it has been safely replicated (on a majority of the servers)
    • RPCs to followers return successfully only after the followers have appended the entry to their logs (persistent storage)
    • entry is then considered committed
    • leader applies the command to its state machine
A client sends a request
[Figure for this and the next three slides: a client and three servers, each with a log and a state machine]
• Leader stores the request in its log and forwards it to its followers
The followers receive the request
• Followers store the request in their logs and acknowledge its receipt
The leader tallies followers' ACKs
• Once it ascertains the request has been stored by a majority of the servers, it updates its state machine
The leader tallies followers' ACKs (cont’d)
• Leader's heartbeats convey the news to its followers: they update their state machines
Committed Entries & Log Consistency
• Committed entries: guaranteed to be both durable and eventually executed by all the available state machines
• Committing an entry also commits all previous entries
  • all AppendEntries RPCs, including heartbeats, include the index of the leader's most recently committed entry
  • this is how followers learn which entries the leader has committed, without a 2PC protocol, and then apply them to their state machines
• Handling slow followers (e.g., due to network issues):
  • Leader retries AppendEntries -- idempotent operations
• Ensuring log consistency: strict sequential ordering, “no skipping”
  • the leader forces followers to replicate its version of the log & delete their inconsistent entries via a “Go-Back-N”-like protocol
  • Leader: maintains per-server (volatile) state: nextIndex, matchIndex
  • If AppendEntries is rejected, decrement nextIndex and send AppendEntries with the previous log entry; repeat until a matching entry is found
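The “Go-Back-N”-like repair can be sketched with in-memory logs. This is a simplified illustration, not a full implementation: entries are assumed to be `(term, command)` pairs, indices are 1-based, the RPC is replaced by a direct function call, and the names `append_entries` and `replicate` are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower side. Indices are 1-based; prev_index 0 means 'before the log'."""
    # Consistency check: the follower must hold a matching previous entry.
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False
    for offset, entry in enumerate(entries):
        index = prev_index + 1 + offset
        if index <= len(log):
            if log[index - 1][0] != entry[0]:
                del log[index - 1:]  # conflict: drop it and everything after
                log.append(entry)
            # Same index and term: duplicate delivery, nothing to do.
        else:
            log.append(entry)
    return True


def replicate(leader_log: List[Entry], follower_log: List[Entry]) -> int:
    """Leader side: back off next_index until the follower accepts.

    Returns the number of AppendEntries attempts that were needed.
    """
    next_index = len(leader_log) + 1
    attempts = 0
    while True:
        attempts += 1
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1][0] if prev_index > 0 else 0
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[prev_index:]):
            return attempts
        next_index -= 1  # rejected: retry with one more earlier entry


leader = [(1, "a"), (1, "b"), (3, "c"), (3, "d")]
follower = [(1, "a"), (2, "x"), (2, "y")]
attempts = replicate(leader, follower)
```

Since entries that are already present are skipped rather than appended again, resending the same AppendEntries leaves the follower's log unchanged, which is what makes retries safe.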
Log Replication
[Figure: logs of the current leader for term 6 and of followers (a) and (b), which have missing log entries]
• AppendEntries(6,10) will bring (a)'s log up to date
• The same AppendEntries(6,10) fails at (b)
• By successive trial and error, the leader finds out that the first log entry that follower (b) will accept is log entry 3
• It then forwards log entries 3 to 10 to (b)
• [Note: (b) could have been elected leader in term 2, appended a log entry, then promptly crashed and not recovered until recently]
Log Entry Matching Property
Colors indicate terms
• two entries in different logs match if they have the same index and term
• all previous entries in the two logs are then identical (i.e., match)
Log Consistency: Handling Leader Crashes
• A leader crash may lead to an inconsistent state if the leader had not fully replicated some previous entries
  • the newly elected leader may be “missing” entries that the old leader (and some followers) had (partially) replicated
  • other followers may also miss entries that the new leader has
• Impose some conditions on who can be elected as the new leader, to ensure certain safety properties
  • a candidate includes a summary of the state of its log
  • followers vote only for a candidate whose log is at least as “up-to-date” as their own
• Force followers to duplicate the log of the new leader
• Leader completeness: committed entries are in the new leader’s log
• State machine safety: same log entries applied in the same order
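The slide leaves “up-to-date” undefined. In the Raft paper the comparison uses the term and index of the last log entry: the log whose last entry has the later term wins, and with equal terms the longer log wins. A minimal sketch, with a hypothetical function name:

```python
def candidate_is_up_to_date(cand_last_term: int, cand_last_index: int,
                            voter_last_term: int, voter_last_index: int) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != voter_last_term:
        # The log whose last entry has the later term is more up-to-date.
        return cand_last_term > voter_last_term
    # Same last term: the longer log is more up-to-date.
    return cand_last_index >= voter_last_index
```

Note that a shorter log can still win if its last entry comes from a later term, so log length alone is not the criterion.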
Handling Leader Crashes (new term)
Leader Completeness?
• How will the leader know which log entries (e.g., from previous terms in which it was not the leader) it can commit?
  • it cannot always gather a majority, since some of the replies were sent to the old leader
• Fortunately, any follower accepting an AppendEntries RPC implicitly acknowledges that it has processed all previous AppendEntries RPCs
  • followers’ logs cannot skip entries: they either have missing entries or extra entries that are not committed
• A follower/candidate with missing committed entries cannot be elected leader, as its log is not the most up-to-date!
• Leader Append-Only: a leader never overwrites or deletes entries in its log; it only appends new ones!
Committing Entries from Previous Term S1 leader (4) S1 leader (4) S1 leader (2) OR S5 leader (3) S5 leader (5) Can S5 be elected a leader?
Safety & Election Restriction
• Servers holding the last committed log entry (they form a majority before the failure)
• Servers having elected the new leader (they must form a majority after failures)
• Two majorities of the same cluster must intersect, due to the quorum (majority) requirement: at least one server that voted for the new leader holds the last committed entry, so a new leader can be elected safely!
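The intersection claim can be checked by brute force for a small cluster. The helper names below are hypothetical; the sketch enumerates every subset that contains a strict majority and confirms that no two of them are disjoint.

```python
from itertools import combinations
from typing import FrozenSet, Iterator, Sequence


def majorities(servers: Sequence[str]) -> Iterator[FrozenSet[str]]:
    """Yield every subset of servers that contains a strict majority."""
    need = len(servers) // 2 + 1
    for size in range(need, len(servers) + 1):
        for subset in combinations(servers, size):
            yield frozenset(subset)


def all_majorities_intersect(servers: Sequence[str]) -> bool:
    """True if every pair of majorities shares at least one server."""
    quorums = list(majorities(servers))
    return all(a & b for a in quorums for b in quorums)


cluster = ["S1", "S2", "S3", "S4", "S5"]
result = all_majorities_intersect(cluster)
```

The same follows from counting: two sets that each hold more than half of the servers together hold more than the whole cluster, so they must share a member.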
Explanations In (a) S1 is leader and partially replicates the log entry at index 2. In (b) S1 crashes; S5 is elected leader for term 3 with votes from S3, S4, and itself, and accepts a different entry at log index 2. In (c) S5 crashes; S1 restarts, is elected leader, and continues replication. Log entry from term 2 has been replicated on a majority of the servers, but it is not committed. If S1 crashes as in (d), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3. However, if S1 replicates an entry from its current term on a majority of the servers before crashing, as in (e), then this entry is committed (S5 cannot win an election). At this point all preceding entries in the log are committed as well.
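The rule behind cases (c) to (e) can be sketched as the leader's commit check: an index is committed only if it is stored on a majority and its entry belongs to the leader's current term. The function name and the list-based representation are assumptions; `match_index` holds, for every server including the leader, the highest index known to match the leader's log. The logs below are assumed for illustration: S1 holds entries from terms 1, 2 and 4, index 1 is already committed, and S5 matches the leader only at index 1.

```python
from typing import List


def advance_commit_index(log_terms: List[int], match_index: List[int],
                         commit_index: int, current_term: int) -> int:
    """Return the leader's new commit index.

    log_terms[i - 1] is the term of entry i in the leader's log.
    """
    majority = len(match_index) // 2 + 1
    # Scan from the newest entry down to just above the current commit index.
    for n in range(len(log_terms), commit_index, -1):
        replicated = sum(1 for m in match_index if m >= n)
        if replicated >= majority and log_terms[n - 1] == current_term:
            return n  # committing n also commits every earlier entry
    return commit_index


leader_log_terms = [1, 2, 4]  # S1's log as leader in term 4

# As in (c): index 2 (term 2) is on S1, S2, S3, but index 3 is only on S1.
case_c = advance_commit_index(leader_log_terms, [3, 2, 2, 1, 1], 1, 4)

# As in (e): the term-4 entry at index 3 has reached S1, S2, S3.
case_e = advance_commit_index(leader_log_terms, [3, 3, 3, 1, 1], 1, 4)
```

In the first case the commit index stays at 1 even though the term-2 entry is on a majority; in the second it jumps to 3, which commits the term-2 entry indirectly.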
Cluster Membership Changes
• Use a two-phase approach:
  • switch first to a transitional joint consensus configuration
  • once the joint consensus has been committed, transition to the new configuration
• Not possible to do an atomic switch
  • i.e., changing the membership of all servers at once
Joint Consensus Configuration
• Log entries are transmitted to all servers, old and new
• Any server from either configuration can act as leader
• Agreement for entry commitment and elections requires separate majorities from both the old and the new configurations
• Cluster configurations are stored and replicated in special log entries
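The double-majority requirement can be sketched as a predicate over the set of servers that acknowledged an entry or granted a vote. The function names and the integer server ids are assumptions made for this illustration.

```python
from typing import Set


def has_majority(acks: Set[int], config: Set[int]) -> bool:
    """True if a strict majority of `config` is present in `acks`."""
    return len(acks & config) > len(config) // 2


def joint_agreement(acks: Set[int], old_config: Set[int],
                    new_config: Set[int]) -> bool:
    """During joint consensus, both configurations must agree separately."""
    return has_majority(acks, old_config) and has_majority(acks, new_config)


old_config = {1, 2, 3}
new_config = {3, 4, 5}

only_old = joint_agreement({1, 2, 3}, old_config, new_config)
both = joint_agreement({2, 3, 4}, old_config, new_config)
```

With these sets, acknowledgements from the whole old configuration are not enough on their own, because only one server of the new configuration has agreed.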
Conclusion
• Implementation:
  • Two thousand lines of C++ code, not including tests, comments, or blank lines
  • About 25 independent third-party open source implementations in various stages of development
  • Some commercial implementations
• Raft is much easier to understand and implement than Paxos and has no performance penalty