390 likes | 621 Views
Fault Tolerance and Recovery. Mostly taken from www.logos.ic.i.u-tokyo.ac.jp/~horita/study/ recovery . ppt. Contents. Introduction Recovery Checkpointing Difficulty of Checkpointing Synchronous checkpointing / recovery ( Asynchronous checkpointing / recovery ). Introduction.
E N D
Fault Tolerance and Recovery Mostly taken from www.logos.ic.i.u-tokyo.ac.jp/~horita/study/recovery.ppt
Contents • Introduction • Recovery • Checkpointing • Difficulty of Checkpointing • Synchronous checkpointing / recovery • (Asynchronous checkpointing / recovery)
Introduction • Long computation in distributed environments • High failure rate • Host failure (a lot of hosts) • Network failure One failure may disturb entire computation ⇒ Need to start it again from the beginning • High cost Why don’t we utilize the previous computation? Recovery
Recovery is not easy Suppose that a parallel computation is running in distributed resources… 1 7 8 1 7 8 1 7 1 7 for(i=0; i<MAXITER; i++){ local_compute(); // compute at each host global_state_exchange(); // communicate with neighbors } • need to save process states periodically • usually other processes have to restore to previous state • overhead
Back/Forward Error Recovery • Forward-error recovery • Only when it is possible to remove errors • Enable processes to move forward • Ex) Redundancy, vote • Backward-error recovery • General • Restore to a previous error-free state • Ex) Checkpoint
Backward-error recovery • operational-based approach • Record all modifications of a process’ state • state-based approach • Record complete state at certain point
State-based approach Terminology • checkpointingthe process of saving state • checkpoint the recovery point at which checkpointing occurs • rolling back the process of restoring a process to a prior-state
Problem of naïve checkpointing • Orphan Messages and the Domino Effect • Orphan message :a message that make an inconsistent state • Domino Effect : what a single rolling back induce other rolling back • Lost Messages • Livelocks
Orphan message and Domino Effect x1 x2 x3 [ [ [ X Y has not sent yet, but X has received. y1 y2 : Orphan message [ [ Y Roll back z1 z2 [ [ Z : Domino Effect • Suppose Y fails after sending message m and is rolled back to y2. Now X has received message m from Y, but Y has no record of sending it. m is referred to as an orphan message. • if Z is rolled back, all three processes must roll back to their very first recovery points, namely, x1, y1, and z1.
Lost messages x1 x2 x3 [ [ [ X X has sent, but Y cannot receive forever y1 y2 : Lost message [ [ Y Roll back z1 z2 [ [ Z If Y fails after receiving message m, the system is restored to state {x1, y1}, in which message is lost as process X is past the point where it sends message m.
Livelocks • Process Y fails before receiving message n1. When Y rolls back to y1, there is no record of sending message m1, hence X must roll back to x1. And then Y receives message n1, but X has no record of sending it, so Y must roll back again. The situation will repeat infinitely. x1 [ X n2 n1 m2 m1 n1 y1 Y [
Consistency of Checkpoint • Strongly consistent set of checkpoints no messages penetrating the set • Consistent set of checkpoints no messages penetrating the set backward x1 x2 [ [ No orphans and domino effect but need to deal with lost messages y1 y2 [ [ Strongly consistent consistent z1 z2 [ [
C1 C2 p q r s Consistent Set of Checkpoints • Informally, a cut (global state) in the time diagram is consistent if no arrow starts on the right-hand side and ends on the left-hand side of it. • C1 is consistent, but C2 is inconsistent.
Useless Checkpoints • Q_2 is a useless checkpoint • Q_2 records the receipt of m1, but not the sending of m2 • {P1,Q_2} cannot be consistent (otherwise m1 would become an orphan); similarly {P_2,Q_2} cannot be consistent (since otherwise m2 would become an orphan)
Checkpoint/Recovery Algorithm • Synchronous • with global synchronization at checkpointing • Asynchronous • without global synchronization at checkpointing
Preliminary (Assumption) ~Synchronous Checkpoint~ Goal To make a consistent global checkpoint Assumptions • Communication channels are FIFO • No partition of the network • End-to-end protocols cope with message loss due to rollback recovery and communication failure • No failure during the execution of the algorithm
Preliminary (Two types of checkpoint) ~Synchronous Checkpoint~ tentative checkpoint: • a temporary checkpoint • a candidate for permanent checkpoint permanent checkpoint: • a local checkpoint at a process • a part of a consistent global checkpoint
Checkpoint Algorithm ~Synchronous Checkpoint~ Algorithm • an initiating process (a single process that invokes this algorithm) takes a tentative checkpoint • it requests all the processes to take tentative checkpoints • it waits for receiving from all the processes whether taking a tentative checkpoint has been succeeded • if it learns all the processes has succeeded, it decides all tentative checkpoints should be made permanent; otherwise, should be discarded. • it informs all the processes of the decision • The processes that receive the decision act accordingly Supplement Once a process has taken a tentative checkpoint, it shouldn’t send messages until it is informed of initiator’s decision.
Diagram of Checkpoint Algorithm ~Synchronous Checkpoint~ Tentative checkpoint decide to commit Initiator permanent checkpoint [ | [ request to take a tentative checkpoint OK [ | [ [ | [ consistent global checkpoint Unnecessary checkpoint consistent global checkpoint
Optimized Algorithm ~Synchronous Checkpoint~ Each message is labeled by order of sending Labeling Scheme ⊥ : smallest label т : largest label last_label_rcvdX[Y] : the last message that X received from Y after X has taken its last permanent or tentative checkpoint. if not exists, ⊥is in it. first_label_sentX[Y] : the first message that X sent to Y after X took its last permanent or tentative checkpoint . if not exists, ⊥is in it. ckpt_cohortX :the set of all processes that may haveto take checkpoints when X decides to take a checkpoint. [ X x3 x2 y1 y2 [ Y y2 x2 Checkpoint request need to be sent to only the processes included in ckpt_cohort
Optimized Algorithm ~Synchronous Checkpoint~ ckpt_cohortX : { Y | last_label_rcvdX[Y] > ⊥} Y takes atentative checkpoint only if last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥ last_label_rcvdX[Y] [ X [ Y first_label_sentY[X]
Optimized Algorithm ~Synchronous Checkpoint~ Algorithm • an initiating process takes a tentative checkpoint • it requests p∈ ckpt_cohortto take tentative checkpoints ( this message includes last_label_rcvd[reciever] of sender ) • if the processes that receive the request need to take a checkpoint, they do the same as 1.2.; otherwise, return OK messages. • they wait for receiving OK from all of p∈ ckpt_cohort • if the initiator learns all the processes have succeeded, it decides all tentative checkpoints should be made permanent; otherwise, should be discarded. • it informs p∈ ckpt_cohortof the decision • The processes that receive the decision act accordingly
Diagram of Optimized Algorithm ~Synchronous Checkpoint~ Tentative checkpoint Permanent checkpoint decide to commit [ [ | A 2 >= 0 > 0 ab1 ba1 ba2 ac1 ca2 [ | [ 2 >= 1 > 0 B OK ac2 cb2 cb1 bd1 [ [ | 2 >= 2 > 0 C cd1 dc1 dc2 [ D ckpt_cohortX : { Y | last_label_rcvdX[Y] > ⊥ } last_label_rcvdX[Y] >= first_label_sentY[X] > ⊥
Correctness ~Synchronous Checkpoint~ • A set of permanent checkpoints taken by this algorithm is consistent • No process sends messages after taking a tentative checkpoint until the receipt of the decision • New checkpoints include no message from the processes that don’t take a checkpoint • The set of tentative checkpoints is fully either made to permanent checkpoints or discarded.
Recovery Algorithm ~Synchronous Recovery~ Labeling Scheme ⊥ : smallest label т : largest label last_label_rcvdX[Y] : the last message that X received from Y after X has taken its last permanent or tentative checkpoint. If not exists, ⊥is in it. first_label_sentX[Y] :the first message that X sent to Y after X took its last permanent or tentative checkpoint . If not exists, ⊥is in it. roll_cohortX :the set of all processes that may have to roll back to the latest checkpoint when process X rolls back. last_label_sentX[Y] : the last message that X sent to Y before X takes its latest permanent checkpoint. If not exist, т is in it.
Recovery Algorithm ~Synchronous Recovery~ roll_cohortX = { Y | X can send messages to Y} Y will restart from thepermanent checkpoint only if last_label_rcvdY[X] > last_label_sentX[Y]
Recovery Algorithm ~Synchronous Recovery~ Algorithm • an initiator requests p∈ roll_cohortto prepare to rollback ( this message includes last_label_sent[reciever] of sender ) • if the processes that receive the request need to rollback, they do the same as 1.; otherwise, returnOKmessage. • they wait for receiving OK from all of p∈ ckpt_cohort. • if the initiator learns p∈ roll_cohort have succeeded, it decides to rollback; otherwise, not to rollback. • it informs p∈ roll_cohortof the decision • the processes that receive the decision act accordingly
Diagram of Synchronous Recovery decide to roll back [ [ A ab1 ba1 ba2 ac1 OK [ [ 2 > 1 0 > 1 B request to roll back ac2 cb2 cb1 bd1 [ [ C 2 > 1 dc1 dc1 dc2 [ D 0 >т 0 >т roll_cohortX = { Y | X can send messages to Y } last_label_rcvdY[X] > last_label_sentX[Y]
Drawbacks of Synchronous Approach • Additional messages are exchanged • Synchronization delay • An unnecessary extra load on the system if failure rarely occurs
Asynchronous Checkpoint Characteristic • Each process takes checkpoints independently • No guarantee that a set of local checkpoints is consistent • A recovery algorithm has to search consistent set of checkpoints • No additional message • No synchronization delay • Lighter load during normal execution
Preliminary (Assumptions) ~Asynchronous Checkpoint / Recovery~ Goal To find the latest consistent set of checkpoints Assumptions • Communication channels are FIFO • Communication channels are reliable • The underlying computation is event-driven
Asynchronous Checkpointing • Basic idea • Save states, messages sent at each event • Volatile logging • Each processor notes number of messages sent to others, and received from others • Use counters to determine orphan messages
Preliminary (Two types of log) ~Asynchronous Checkpoint / Recovery~ • save an event on the memory at receipt of messages (volatile log) • volatile log periodically flushed to the disk (stable log) ⇔ checkpoint volatile log : quick accesslost if the corresponding processor fails stable log : slow accessnot lost even if processors fail
Preliminary (Definition) ~Asynchronous Checkpoint / Recovery~ Definition CkPti : the checkpoint (stable log) that i rolled back to when failure occurs RCVDi←j (CkPti / e ) :the number of messages received by processor i from processor j, per the information stored in the checkpoint CkPti or event e. SENTi→j(CkPti / e ) :the number of messages sent by processor i to processor j, per the information stored in the checkpoint CkPti or event e
Recovery Algorithm ~Asynchronous Checkpoint / Recovery~ Algorithm • When one process crashes, it recovers tothe latest checkpoint CkPt. • It broadcasts the message that it had failed. Others receive this message, and rollback to the latest event. • Each process sends SENT(CkPt) to neighboring processes • Each process waits for SENT(CkPt) messages from every neighbor • On receiving SENTj→i(CkPtj) from j, if i notices RCVDi←j (CkPti) > SENTj→i(CkPtj), it rolls back to the event e such that RCVDi←j (e) = SENTj→i(e), • repeat 3,4,and 5 N times (N is the number of processes)
Asynchronous Recovery X:Y X:Z x1 Ex0 Ex1 Ex2 Ex3 [ X 2 <= 2 3 <= 2 0 <= 0 (X,2) (Z,0) (Y,2) Y:X Y:Z y1 Ey0 Ey1 Ey2 Ey3 [ 1 <= 2 1 <= 1 Y (X,0) (Z,1) (Y,1) Z:X Z:Y Ez1 Ez2 Ez0 [ 0 <= 0 1 <= 1 2 <= 1 Z z1 RCVDi←j (CkPti) <= SENTj→i(CkPtj)