State Machines CS 614 Thursday, Feb 21, 2002 Bill McCloskey
Introduction • State machines provide fault-tolerance through replication. • They consist of state variables and commands to change the state. • Clients request the state machine to execute commands. [Figure labels: State A, State B, Command, Client]
An Example: Memory
State variables:
  store: array[0..n] of word
Commands:
  read: command(loc: 0..n)
    send store[loc] to client
  write: command(loc: 0..n, value: word)
    store[loc] := value
Reads and writes values to and from storage.
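The memory example can be sketched in Python as follows. This is only an illustration: the class name `Memory`, the zero-initialised cells, and modelling "send to client" as a return value are assumptions, not part of the slides.

```python
class Memory:
    """Deterministic state machine: `store` is the state, read/write are the commands."""

    def __init__(self, n):
        # store: array[0..n] of word, so n + 1 cells.
        # Initialising every cell to 0 is an assumption of this sketch.
        self.store = [0] * (n + 1)

    def read(self, loc):
        # "send store[loc] to client" is modelled as a return value.
        return self.store[loc]

    def write(self, loc, value):
        self.store[loc] = value


m = Memory(n=3)
m.write(2, 42)
print(m.read(2))
```

Because both commands depend only on the current state and their arguments, two copies of `Memory` fed the same commands in the same order end in the same state.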
Ordering of Commands • The machine will have multiple clients. • Commands from the same client must be executed in the order they were issued. • Commands from different clients must be executed in an order determined by causality.
Fault-tolerance • Replicas of the state machine are run on multiple processors for fault-tolerance. • Replicas must start in the same initial state and must process the same set of requests in the same order. This is a consensus problem. • Agreement: Every non-faulty replica receives every request. • Order: Every non-faulty replica processes the requests in the same order. • There are several ways of achieving these conditions…
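A small sketch of why both conditions matter, assuming a two-cell store and write-only requests (the function `run_replica` and the request list are made up for illustration):

```python
def run_replica(requests):
    """Apply (loc, value) write requests, in the given order, to a fresh two-cell store."""
    store = [0, 0]  # every replica starts in the same initial state
    for loc, value in requests:
        store[loc] = value
    return store


requests = [(0, 5), (0, 7), (1, 1)]

replica_a = run_replica(requests)
replica_b = run_replica(requests)
# Same requests, but the first two are swapped: the Order condition is violated.
reordered = run_replica([requests[1], requests[0], requests[2]])

print(replica_a, replica_b, reordered)
```

The two replicas that see the same sequence agree, while the replica that processes the same set in a different order ends with a different value in cell 0.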
Agreement • Often, agreement is met using a Byzantine Agreement protocol. Every non-faulty processor will receive each command. • Clients can transmit commands to the replicas, or a single replica can serve as transmitter for the client. • More efficient techniques for fail-stop failures can be used instead. • Other techniques are also possible, such as the Paxos algorithm (more later).
Order and Stability • A request can be labeled with a unique ID (uid). • The request is considered stable at a certain state machine replica when requests with lower uids cannot be received from correct clients. • The replica must wait until a request is stable before executing it. • So, a state machine processes requests in order of uids. Therefore, uid ordering is constrained by causality of requests (from earlier). • Possible stability tests use Lamport clocks or real-time clocks. Replicas may also agree on a uid using agreement.
Achieving Stability with Lamport Clocks • Each message is marked with a logical timestamp, which serves as its uid. • Causality requirements are satisfied. • Clients must periodically make “null” requests. • A request is stable at a replica when a request with a larger timestamp has been received from every client. Then no lower uids can arrive. • Requires FIFO channels (easy). Works in the presence of fail-stop failures.
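The stability test with logical clocks might look roughly like this. The class and method names are hypothetical, and the sketch ignores tie-breaking between equal timestamps from different clients.

```python
class LamportStabilityTest:
    """Tracks, per client, the largest timestamp a replica has received so far."""

    def __init__(self, client_ids):
        self.latest = {client: 0 for client in client_ids}

    def receive(self, client_id, timestamp):
        # With FIFO channels, timestamps from one client arrive in increasing order,
        # so the stored value is also a lower bound on anything still in flight.
        self.latest[client_id] = max(self.latest[client_id], timestamp)

    def is_stable(self, timestamp):
        # Stable once every client has sent something with a larger timestamp.
        return all(seen > timestamp for seen in self.latest.values())


test = LamportStabilityTest(["a", "b"])
test.receive("a", 3)
test.receive("b", 1)
before = test.is_stable(3)   # neither client has sent a timestamp above 3 yet
test.receive("b", 5)
test.receive("a", 4)         # e.g. a "null" request from client a
after = test.is_stable(3)
print(before, after)
```

The null request from client `a` is what lets the replica conclude that nothing with a timestamp of 3 or less can still arrive from `a`.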
Achieving Stability with Real-Time Clocks • The real-time clock value, together with the identity of the sending process, is the uid. • To satisfy causality, a client can make only one request per clock tick, and message delivery must take longer than the difference between clocks on different processors. • Let Δ be the time for a request to reach every correct processor. • A request is stable if its timestamp is at least Δ time units in the past, according to the local clock. This imposes a delay in processing.
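As a rough sketch of this test, assume uids are `(timestamp, process_id)` tuples and call the time for a request to reach every correct processor `delta`. The function name and the sample values below are illustrative only.

```python
def is_stable(request_timestamp, local_clock, delta):
    """A request is stable once its timestamp is at least `delta` units in the past."""
    return local_clock - request_timestamp >= delta


DELTA = 5      # assumed delivery bound, in clock ticks
now = 16       # current reading of the replica's local clock

# Pending requests, identified by (timestamp, sending process).
pending = [(12, "p2"), (10, "p1"), (10, "p0")]

# Execute stable requests in uid order; the process id breaks timestamp ties.
ready = sorted(uid for uid in pending if is_stable(uid[0], now, DELTA))
print(ready)
```

The request stamped 12 has to wait until the local clock reads 17, which is the processing delay the slide mentions.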
Replica-Generated Uids • Ordering using Lamport clocks requires all processors to communicate (null requests). • Real-time clock ordering requires clock synchronization, which is also expensive. • The replicas themselves could instead agree on a uid for each request. • Each replica proposes a candidate uid. The replicas agree on a uid, accepting the request. • A client cannot issue a new request until its previous request is accepted, to guarantee causality.
Implementing Replica-Generated Uids • Final uid is always at least the candidate uid. • A request r’ seen after a request r has been accepted has a higher candidate uid than the final uid of r. • A new candidate uid will be one greater than any candidate or final uid so far, plus a fraction i/N (for replica i of N) to make it unique. • Each replica broadcasts its candidate uid. • The final uid is selected as the maximum of all the uids received.
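One way to sketch this scheme is shown below. The class `Replica`, the fields `seen` and `accepted`, and the use of `floor` before adding one are assumptions of this sketch; the floor keeps the fractional part equal to i/N so that candidates from different replicas never collide.

```python
import math
from fractions import Fraction


class Replica:
    def __init__(self, index, n_replicas):
        self.index = index             # i, with 0 <= i < N
        self.n = n_replicas            # N
        self.seen = Fraction(0)        # largest candidate uid proposed by this replica
        self.accepted = Fraction(0)    # largest final uid accepted by this replica

    def propose(self):
        # One greater than any candidate or final uid so far, plus i/N for uniqueness.
        base = max(math.floor(self.seen), math.floor(self.accepted)) + 1
        self.seen = base + Fraction(self.index, self.n)
        return self.seen

    def accept(self, final_uid):
        self.accepted = max(self.accepted, final_uid)


replicas = [Replica(i, 3) for i in range(3)]

candidates = [r.propose() for r in replicas]   # each replica broadcasts its candidate
final_uid = max(candidates)                    # the final uid is the maximum received
for r in replicas:
    r.accept(final_uid)

next_candidates = [r.propose() for r in replicas]
print(final_uid, next_candidates)
```

Every candidate for the next request is larger than the final uid just accepted, which is the second requirement on the slide.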
Paxos: Another Approach • Lamport’s Paxon Synod is an agreement algorithm. • It is efficient and practical. • It assumes a partially synchronous model. • Messages are not always delivered on time. • Messages may be lost or duplicated. • Processors may fail silently. • Guarantees: • Agreement: Everyone agrees on the same value. • Validity: The chosen value was one of the candidates. • Termination is not guaranteed.
Stability • An execution fragment is stable if: • No processors fail or recover in it. • No packets are lost or duplicated in it. • Delivery of messages is on time. • An execution fragment is nice if it is stable and if a majority of processes are alive. • We’ll see that Paxos terminates if there is an execution fragment which is nice for long enough.
Leader Election • Paxos requires a leader to “run” the algorithm. • Processes exchange “Alive” messages to try to detect failures. • When the current leader fails, the process with the largest processor ID among those believed alive is selected as the new leader. • The failure detector doesn’t always work, so there may be multiple leaders or no leader. • The algorithm may not terminate if there are always too many leaders.
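A minimal sketch of this election rule, assuming each process records when it last heard an “Alive” message from every other process. The function name, the `timeout` parameter, and the sample times are hypothetical.

```python
def elect_leader(last_alive, now, timeout):
    """Pick the largest process ID among those heard from within `timeout`.

    `last_alive` maps process ID -> time the last "Alive" message was received.
    Returns None if no process appears to be alive.
    """
    alive = [pid for pid, heard in last_alive.items() if now - heard <= timeout]
    return max(alive) if alive else None


# Process 3 has been silent for 8 time units, so this process suspects it has failed.
last_alive = {1: 9.0, 2: 9.5, 3: 2.0}
leader = elect_leader(last_alive, now=10.0, timeout=3.0)
print(leader)
```

Two processes with different views of `last_alive` can pick different leaders, which is how multiple simultaneous leaders arise.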
Setup • The algorithm operates in a sequence of rounds. • Multiple rounds may be ongoing at the same time. • In each round, the leader tries to get a majority for a certain value. • Processes vote in each round. • If a majority of processes vote in a round, then the value chosen is the one proposed by the leader. • If too few processes vote in the round, it fails and a new one is started.
Rounds • Each round is numbered with a tuple (l, r), where l is the process ID of the leader and r is the leader’s index for that round. • Lexicographic ordering is used on the rounds. • This way, round numbers are unique. • Thus, each round has a unique value, since a leader only proposes one value, and a round only has one leader.
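Python tuples already compare lexicographically, so the round numbering can be illustrated directly. The helper `next_round` and the sample rounds are assumptions made for this sketch.

```python
def next_round(leader_id, last_index):
    """Return the leader's next round number as a (leader ID, index) tuple."""
    return (leader_id, last_index + 1)


rounds = [(2, 0), (1, 3), (1, 1), (2, 1)]
ordered = sorted(rounds)          # compares l first, then r
newest = next_round(2, 1)         # leader 2 starts another round
print(ordered, newest)
```

Since the leader's ID is part of the tuple, two different leaders can never produce the same round number.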
Algorithm • Leader informs other replicas that round R is starting. • Each replica finds the last round before R in which it voted. It sends this vote to the leader. • The leader waits for these votes from a majority set (quorum) Q. • Based on these previous votes, the leader decides to propose a certain value v for the new round, and informs the replicas in Q. • Each replica may vote in this round or not. If they choose to vote, they send the vote to the leader.
Algorithm (2) • If the leader receives a vote from every replica in Q, it informs everyone that v is the consensus value.
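The steps of one round can be sketched as a failure-free, in-memory simulation. Everything here is illustrative: the `Replica` class, the `promised` field, and `run_round` are made-up names, messages are plain method calls, and the sketch assumes rounds are started in increasing order (a real replica would ignore a stale announcement).

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.last_vote = None   # (round, value) of the most recent vote, or None
        self.promised = None    # latest round to which last_vote was reported

    def report_last_vote(self, rnd):
        # Step 2: report the last vote and remember which round it was reported to.
        if self.promised is None or rnd > self.promised:
            self.promised = rnd
        return self.last_vote

    def vote(self, rnd, value):
        # Step 5: refuse to vote if a report was already sent to a later round.
        if self.promised is not None and rnd < self.promised:
            return False
        self.last_vote = (rnd, value)
        return True


def run_round(rnd, leader_value, quorum):
    # Steps 1-3: announce the round and collect last votes from the quorum.
    reports = [r.report_last_vote(rnd) for r in quorum]
    votes = [v for v in reports if v is not None]
    # Step 4: reuse the value of the latest reported vote, else the leader's own value.
    value = max(votes, key=lambda v: v[0])[1] if votes else leader_value
    # Steps 5-6: the round succeeds only if every replica in the quorum votes.
    ballots = [r.vote(rnd, value) for r in quorum]
    return value if all(ballots) else None


a, b, c = Replica("A"), Replica("B"), Replica("C")
first = run_round((1, 1), "x", [a, b])    # nobody has voted yet, so "x" is proposed
second = run_round((2, 1), "y", [b, c])   # b reports its vote, so "x" is proposed again
late = a.vote((0, 1), "z")                # a already reported to round (1, 1)
print(first, second, late)
```

The second leader wanted `"y"` but is forced to propose `"x"` because its quorum overlaps the first one in replica B.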
Voting • Why would a replica ever decide not to vote? • When it gives its last vote (which, say, was in round R’) to the leader of round R, it must not vote in any round between R’ and R, since that would invalidate the information it sent. • This means that if leaders keep starting new rounds, everyone will be forbidden to vote and the algorithm will never terminate. • If a majority of processes are forbidden to vote in a round, the round is dead. A dead round can never succeed.
Anchored Rounds • Let vR be the value that the leader proposes for round R. (We saw that this is well defined.) • If no quorum is found, vR = null. • A round is anchored if all rounds before it are either dead or have the same value vR. An anchored round stays anchored (stable). • Paxos will be set up so that every round is anchored or has vR = null. • This implies that any two successful rounds have the same value…
Any Two Successful Rounds Have the Same Value • For all R: vR = null or R is anchored. • For all R and all R’ < R: if R’ is not dead, then vR = null or vR = vR’. • For all R and all R’ < R: if R’ is successful, then vR = null or vR = vR’. • Any two successful rounds have the same value. This is the essential property that we needed. It tells us that once there is agreement by a majority, all future rounds will agree on the same value. Now we need to ensure that all rounds will be anchored.
Anchoring the Rounds • When the leader has received the most recent votes (and the values voted for) from a majority of replicas, it must propose a value for the current round R that keeps R anchored. • It looks backward through the rounds before R, skipping over those in which no value was reported. These rounds must be dead, since a majority chose not to vote in them. • When it gets to a round R’ with a value, it chooses the same value for the new round. • Since R’ was anchored, and all rounds between R’ and R are dead, R is anchored. • If it finds no R’, it can choose the value to be its initial value (given as part of agreement).
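The leader's choice can be sketched as a backward scan over earlier rounds. For simplicity the sketch numbers rounds with plain integers starting at 1; the function name and the sample reports are hypothetical.

```python
def choose_value(current_round, reports, initial_value):
    """Pick the value for `current_round` from the quorum's reported last votes.

    `reports` maps replica name -> (round, value) of its last vote, or None.
    """
    by_round = {}
    for vote in reports.values():
        if vote is not None:
            rnd, value = vote
            by_round[rnd] = value   # all votes within one round carry the same value

    # Walk backward from the current round, skipping rounds with no reported vote.
    for rnd in range(current_round - 1, 0, -1):
        if rnd in by_round:
            return by_round[rnd]
    return initial_value            # no earlier vote was reported by the quorum


with_votes = choose_value(4, {"A": (2, 8), "B": (1, 9)}, initial_value=7)
no_votes = choose_value(4, {"A": None, "B": None}, initial_value=7)
print(with_votes, no_votes)
```

In the first call round 3 is skipped because nobody in the quorum voted in it, and the value from round 2 is reused.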
An Example All rounds are dead. So with complete information, leader could choose any value. Q = {A,B}: Round 4 will use value 8 Q = {A,C}: Round 4 will use value 9 Q = {B,C}: Round 4 will use value 9
Another Example Round 2 succeeds. Rounds 1 and 3 are dead. Q = {A,B}: Round 4 will use value 8 Q = {A,C}: Round 4 will use value 8 Q = {B,C}: Round 4 will use value 8
Summary of Paxon Synod • This completes the proof that Paxos is correct. • Validity: Leaders always propose values that they were given or that were proposed before. • Agreement: The leader sends the consensus result to everyone. Even if more rounds take place, they’ll always produce the same value. • Now we have an agreement algorithm which works in a realistic environment, but which may not terminate when failures occur.
The Paxon Parliament • Paxos agrees on a single value. For state machines, we need to agree on the commands to execute. • We can consider a numbered list of commands which will be executed. The identity of these commands will be decided by consensus. • A single leader will run an instance of Paxos for each index. • For a finite number of indices, the leader is forced to pick commands based on previous voting. For the rest, it chooses commands as they come from the client.
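A sketch of how a replica might consume the numbered list, assuming commands are executed strictly in index order and that execution stops at the first index that has not reached consensus yet. The function name and the sample log are made up for illustration.

```python
def executable_prefix(decided):
    """Return the commands for indices 0, 1, 2, ... up to the first undecided index.

    `decided` maps index -> command chosen by the Paxos instance for that index.
    """
    commands = []
    index = 0
    while index in decided:
        commands.append(decided[index])
        index += 1
    return commands


# Index 2 has not reached consensus yet, so the command at index 3 must wait.
decided = {0: "write(0, 5)", 1: "write(1, 7)", 3: "read(0)"}
runnable = executable_prefix(decided)
print(runnable)
```

Holding back index 3 keeps every replica executing the same commands in the same order, even though the instances may finish out of order.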
Summary • A client must wait for a command to reach consensus before requesting another, to satisfy causality. • Read operations can be satisfied by checking the local state or by executing a read command, which guarantees proper ordering. • Lamport proposes many other optimizations. • Ideally, agreeing on a command takes 3n messages for n replicas.