12. Recovery Study Meeting M1 Yuuki Horita 2004/5/14
Contents
• Introduction
• Recovery
• Checkpointing
• Difficulty of Checkpointing
• Synchronous checkpointing / recovery
• (Asynchronous checkpointing / recovery)
Introduction
• Long computation in distributed environments
• High failure rate
   • Host failure (many hosts are involved)
   • Network failure
• One failure may disrupt the entire computation
   ⇒ The computation has to be restarted from the beginning
   • High cost
• Why not reuse the computation that has already been done? ⇒ Recovery
Recovery is not easy
Suppose that a parallel computation is running on distributed resources:
   for (i = 0; i < MAXITER; i++) {
       local_compute();          // compute at each host
       global_state_exchange();  // communicate with neighbors
   }
• Process states need to be saved periodically
• When one process fails, the other processes usually have to be restored to a previous state as well
• Both add overhead
Back/Forward Error Recovery
• Forward-error recovery
   • Applicable only when the errors can be removed
   • Enables processes to move forward
   • Example: redundancy, voting
• Backward-error recovery
   • General
   • Restores processes to a previous error-free state
   • Example: checkpointing
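As a small illustration of forward-error recovery by redundancy and voting, the sketch below assumes the same computation was run on several replicas and accepts a result only when a strict majority agrees. The function name `majority_vote` is hypothetical and not part of the slides.

```python
from collections import Counter


def majority_vote(replica_results):
    """Return the value reported by a strict majority of replicas, or None."""
    if not replica_results:
        return None
    # most_common(1) gives the single most frequent value and its count.
    value, count = Counter(replica_results).most_common(1)[0]
    return value if count > len(replica_results) / 2 else None


if __name__ == "__main__":
    # One faulty replica is outvoted, so the computation can move forward.
    print(majority_vote([42, 42, 17]))
```

If no value reaches a majority, the error cannot be masked and backward-error recovery is the remaining option.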
Backward-error recovery
• Operation-based approach
   • Record all modifications of a process's state
• State-based approach
   • Record the complete state at certain points
State-based approach
Terminology
• checkpointing: the process of saving state
• checkpoint: the recovery point at which checkpointing occurs
• rolling back: the process of restoring a process to a prior state
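The three terms can be sketched for a single process as follows. This is a minimal in-memory illustration: the `Process` class, its `state` dictionary, and the `step` method are assumptions made for the example, and a real system would write the checkpoint to stable storage.

```python
import copy


class Process:
    """Hypothetical process whose whole state is one dictionary."""

    def __init__(self):
        self.state = {"iteration": 0, "value": 0.0}
        self._checkpoint = None  # the saved recovery point, if any

    def step(self):
        # Stand-in for one round of local computation.
        self.state["iteration"] += 1
        self.state["value"] += 1.5

    def take_checkpoint(self):
        # Checkpointing: save a complete copy of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self):
        # Rolling back: restore the state saved at the checkpoint.
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self._checkpoint)
```

Work done after the checkpoint is lost on rollback, which is why the checkpoint interval trades overhead against recomputation.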
Problems of naïve checkpointing
• Orphan messages and the domino effect
   • Orphan message: a message whose receipt is recorded although, after a rollback, its sending is not, which leaves the state inconsistent
   • Domino effect: a single rollback forces other processes to roll back as well
• Lost messages
• Livelocks
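Orphan message and Domino Effect
[Figure: processes X, Y, Z with checkpoints x1, x2, x3 / y1, y2 / z1, z2; a rollback is shown at Y]
• Orphan message: Y has not sent the message yet, but X has already received it
• Domino effect: the rollback propagates to the other processes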
Lost messages
[Figure: processes X, Y, Z with checkpoints x1, x2, x3 / y1, y2 / z1, z2; a rollback is shown at Y]
• Lost message: X has sent the message, but Y will never receive it
Livelocks
[Figure: processes X and Y with checkpoints x1 and y1, exchanging messages m1, m2, n1, n2]
Consistency of Checkpoints
• Strongly consistent set of checkpoints: no message crosses the set in either direction
• Consistent set of checkpoints: no message crosses the set backward (no orphan messages)
   • Lost messages still need to be dealt with
[Figure: processes X, Y, Z with checkpoints x1, x2 / y1, y2 / z1, z2; one strongly consistent set and one consistent set are marked]
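The two definitions can be checked mechanically if each checkpoint records how many messages its process has sent to and received from every peer. The sketch below assumes FIFO channels, so that counts are enough to compare; the dictionary layout and the function name `classify` are assumptions made for this example.

```python
def classify(checkpoints):
    """Classify a set of checkpoints, one per process.

    checkpoints: {proc: {"sent": {peer: count}, "rcvd": {peer: count}}}
    """
    crossing_forward = False
    for i, ckpt_i in checkpoints.items():
        for j, ckpt_j in checkpoints.items():
            if i == j:
                continue
            received = ckpt_i["rcvd"].get(j, 0)  # i's record of messages from j
            sent = ckpt_j["sent"].get(i, 0)      # j's record of messages to i
            if received > sent:
                # A receipt is recorded without its send: orphan message.
                return "inconsistent"
            if received < sent:
                # A send is recorded without its receipt: possible lost message.
                crossing_forward = True
    return "consistent" if crossing_forward else "strongly consistent"
```

A result of "consistent" means recovery is possible from this set, but the messages that cross it forward have to be replayed or retransmitted.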
Checkpoint/Recovery Algorithm
• Synchronous
   • with global synchronization at checkpointing
• Asynchronous
   • without global synchronization at checkpointing
Preliminary (Assumptions) ~Synchronous Checkpoint~
Goal
To make a consistent global checkpoint
Assumptions
• Communication channels are FIFO
• The network is never partitioned
• End-to-end protocols cope with message loss due to rollback recovery and communication failure
• No failure occurs during the execution of the algorithm
Preliminary (Two types of checkpoint) ~Synchronous Checkpoint~
Tentative checkpoint:
• a temporary checkpoint
• a candidate for a permanent checkpoint
Permanent checkpoint:
• a local checkpoint at a process
• a part of a consistent global checkpoint
Checkpoint Algorithm ~Synchronous Checkpoint~
Algorithm
1. An initiating process (the single process that invokes this algorithm) takes a tentative checkpoint
2. It requests all the processes to take tentative checkpoints
3. It waits for every process to report whether it succeeded in taking a tentative checkpoint
4. If it learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, that they should be discarded
5. It informs all the processes of the decision
6. The processes that receive the decision act accordingly
Supplement
Once a process has taken a tentative checkpoint, it must not send messages until it is informed of the initiator's decision.
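The steps above can be sketched as a two-phase exchange. This is a single-threaded simulation in which method calls stand in for messages; the class `Process`, its fields, and the `can_checkpoint` flag (used to simulate a process that fails to take a tentative checkpoint) are all assumptions made for illustration.

```python
class Process:
    """Hypothetical process taking part in a synchronous checkpoint."""

    def __init__(self, name, state=0, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint
        self.tentative = None
        self.permanent = None
        self.may_send = True

    def take_tentative(self):
        # Phase 1: try to take a tentative checkpoint and report success.
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.may_send = False  # no sending until the decision arrives
        return True

    def decide(self, commit):
        # Phase 2: make the tentative checkpoint permanent or discard it.
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.may_send = True


def synchronous_checkpoint(initiator, others):
    """Run both phases; return True if the checkpoints were committed."""
    processes = [initiator] + list(others)
    replies = [p.take_tentative() for p in processes]
    commit = all(replies)
    for p in processes:
        p.decide(commit)
    return commit
```

Because every tentative checkpoint is either committed or discarded together, the permanent checkpoints always form one global set.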
Diagram of Checkpoint Algorithm ~Synchronous Checkpoint~
[Figure: the initiator takes a tentative checkpoint and sends requests to take a tentative checkpoint; the other processes reply OK; the initiator decides to commit, and the tentative checkpoints become permanent checkpoints that form a new consistent global checkpoint. One checkpoint in the figure is marked as unnecessary.]
Optimized Algorithm ~Synchronous Checkpoint~
Each message is labeled in order of sending.
Labeling Scheme
• ⊥ : smallest label
• ⊤ : largest label
• last_label_rcvd_X[Y] : the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists
• first_label_sent_X[Y] : the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists
• ckpt_cohort_X : the set of all processes that may have to take checkpoints when X decides to take a checkpoint
[Figure: processes X and Y exchanging labeled messages x2, x3, y1, y2]
A checkpoint request needs to be sent only to the processes in ckpt_cohort.
Optimized Algorithm ~Synchronous Checkpoint~
ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }
Y takes a tentative checkpoint only if
   last_label_rcvd_X[Y] >= first_label_sent_Y[X] > ⊥
[Figure: processes X and Y; last_label_rcvd_X[Y] is marked at X and first_label_sent_Y[X] at Y]
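The two conditions translate directly into code. The sketch below assumes that labels are positive integers and encodes ⊥ as 0; that encoding and the function names are assumptions of the example, not part of the slides.

```python
BOTTOM = 0  # assumed encoding of the smallest label; real labels start at 1


def ckpt_cohort(last_label_rcvd):
    """Processes from which X received a message since its last checkpoint.

    last_label_rcvd: {sender: label} as kept by X.
    """
    return {y for y, label in last_label_rcvd.items() if label > BOTTOM}


def must_take_tentative(last_label_rcvd_x_y, first_label_sent_y_x):
    """Y's check on receiving X's request, which carries last_label_rcvd_X[Y]."""
    return last_label_rcvd_x_y >= first_label_sent_y_x > BOTTOM
```

For example, `must_take_tentative(2, 1)` holds because X has received a message that Y sent after Y's last checkpoint, whereas `must_take_tentative(2, 0)` does not because Y has sent nothing to X since then.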
Optimized Algorithm ~Synchronous Checkpoint~
Algorithm
1. An initiating process takes a tentative checkpoint
2. It requests every p ∈ ckpt_cohort to take a tentative checkpoint (the request includes the sender's last_label_rcvd[receiver])
3. If a process that receives the request needs to take a checkpoint, it performs steps 1 and 2 itself; otherwise, it returns an OK message
4. Each requesting process waits for OK from every p ∈ ckpt_cohort
5. If the initiator learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, that they should be discarded
6. It informs every p ∈ ckpt_cohort of the decision
7. The processes that receive the decision act accordingly
Diagram of Optimized Algorithm ~Synchronous Checkpoint~
[Figure: processes A, B, C, D exchanging messages ab1, ba1, ba2, ac1, ac2, ca2, cb1, cb2, bd1, cd1, dc1, dc2; tentative checkpoints are taken, OK is returned, the initiator decides to commit, and the tentative checkpoints become permanent. The checks shown in the figure are 2 >= 0 > 0, 2 >= 1 > 0 and 2 >= 2 > 0.]
ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }
last_label_rcvd_X[Y] >= first_label_sent_Y[X] > ⊥
Correctness ~Synchronous Checkpoint~
• The set of permanent checkpoints taken by this algorithm is consistent
   • No process sends messages after taking a tentative checkpoint until it receives the decision
   • New checkpoints include no message from processes that do not take a checkpoint
   • The set of tentative checkpoints is either entirely made permanent or entirely discarded
Recovery Algorithm ~Synchronous Recovery~
Labeling Scheme
• ⊥ : smallest label
• ⊤ : largest label
• last_label_rcvd_X[Y] : the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists
• first_label_sent_X[Y] : the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists
• roll_cohort_X : the set of all processes that may have to roll back to their latest checkpoint when process X rolls back
• last_label_sent_X[Y] : the label of the last message that X sent to Y before X took its latest permanent checkpoint; ⊤ if no such message exists
Recovery Algorithm ~Synchronous Recovery~
roll_cohort_X = { Y | X can send messages to Y }
Y will restart from its permanent checkpoint only if
   last_label_rcvd_Y[X] > last_label_sent_X[Y]
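The rollback test is a single comparison of labels. The sketch below assumes integer labels, with ⊥ encoded as 0 and ⊤ as positive infinity; the encoding and the function name `must_roll_back` are assumptions made for this example.

```python
BOTTOM = 0          # assumed encoding of the smallest label
TOP = float("inf")  # assumed encoding of the largest label


def must_roll_back(last_label_rcvd_y_x, last_label_sent_x_y):
    """Y's check on receiving X's rollback request.

    The request carries last_label_sent_X[Y]. Y has to restart from its
    permanent checkpoint if it received a message that X sent after X's
    latest permanent checkpoint, because X will no longer remember sending it.
    """
    return last_label_rcvd_y_x > last_label_sent_x_y
```

With these encodings, `must_roll_back(2, 1)` is true, while `must_roll_back(0, 1)` and `must_roll_back(0, TOP)` are false.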
Recovery Algorithm ~Synchronous Recovery~
Algorithm
1. An initiator requests every p ∈ roll_cohort to prepare to roll back (the request includes the sender's last_label_sent[receiver])
2. If a process that receives the request needs to roll back, it performs step 1 itself; otherwise, it returns an OK message
3. Each requesting process waits for OK from every p ∈ roll_cohort
4. If the initiator learns that every p ∈ roll_cohort has succeeded, it decides to roll back; otherwise, not to roll back
5. It informs every p ∈ roll_cohort of the decision
6. The processes that receive the decision act accordingly
Diagram of Synchronous Recovery
[Figure: processes A, B, C, D exchanging messages ab1, ba1, ba2, ac1, ac2, cb1, cb2, bd1, dc1, dc2; requests to roll back are sent, OK is returned, and the initiator decides to roll back. The checks shown in the figure are 2 > 1, 0 > 1, 2 > 1, 0 > ⊤ and 0 > ⊤.]
roll_cohort_X = { Y | X can send messages to Y }
last_label_rcvd_Y[X] > last_label_sent_X[Y]
Drawbacks of Synchronous Approach
• Additional messages are exchanged
• Synchronization delay
• An unnecessary extra load on the system if failures rarely occur
Asynchronous Checkpoint
Characteristics
• Each process takes checkpoints independently
• No guarantee that a set of local checkpoints is consistent
• The recovery algorithm has to search for a consistent set of checkpoints
• No additional messages
• No synchronization delay
• Lighter load during normal execution
Preliminary (Assumptions) ~Asynchronous Checkpoint / Recovery~
Goal
To find the latest consistent set of checkpoints
Assumptions
• Communication channels are FIFO
• Communication channels are reliable
• The underlying computation is event-driven
Preliminary (Two types of log) ~Asynchronous Checkpoint / Recovery~
• On receipt of a message, the event is saved in memory (volatile log)
• The volatile log is periodically flushed to disk (stable log) ⇔ checkpoint
Volatile log: quick access, but lost if the corresponding processor fails
Stable log: slow access, but not lost even if processors fail
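The relation between the two logs can be sketched as below. The class `EventLog` is hypothetical, and the `stable` list merely stands in for disk storage, so this shows the bookkeeping only and not real durability.

```python
class EventLog:
    """Hypothetical event log with a volatile part and a stable part."""

    def __init__(self):
        self.volatile = []  # in memory: fast, lost when the processor fails
        self.stable = []    # stand-in for disk: slow, survives failures

    def record(self, event):
        # Called on receipt of a message.
        self.volatile.append(event)

    def flush(self):
        # Periodic flush; this corresponds to taking a checkpoint.
        self.stable.extend(self.volatile)
        self.volatile.clear()

    def crash(self):
        # A failure wipes the volatile log; only flushed events remain.
        self.volatile.clear()
        return list(self.stable)
```

Events recorded after the last flush are exactly the work a failed processor loses.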
Preliminary (Definition) ~Asynchronous Checkpoint / Recovery~
Definition
• CkPt_i : the checkpoint (stable log) that processor i rolls back to when a failure occurs
• RCVD_i←j(CkPt_i / e) : the number of messages received by processor i from processor j, per the information stored in the checkpoint CkPt_i or event e
• SENT_i→j(CkPt_i / e) : the number of messages sent by processor i to processor j, per the information stored in the checkpoint CkPt_i or event e
Recovery Algorithm ~Asynchronous Checkpoint / Recovery~
Algorithm
1. When a process crashes, it recovers to its latest checkpoint CkPt
2. It broadcasts a message saying that it has failed. The other processes receive this message and roll back to their latest event
3. Each process sends SENT(CkPt) to its neighboring processes
4. Each process waits for the SENT(CkPt) messages from every neighbor
5. On receiving SENT_j→i(CkPt_j) from j, if i notices that RCVD_i←j(CkPt_i) > SENT_j→i(CkPt_j), it rolls back to the latest event e such that RCVD_i←j(e) = SENT_j→i(CkPt_j)
6. Steps 3, 4, and 5 are repeated N times (N is the number of processes)
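The iteration in steps 3 to 5 can be sketched as a single-threaded simulation. It assumes that each process's log is a list of event states, that every state stores the SENT and RCVD counters, and that entry 0 is the initial state with all counters at zero. The data layout and the function name `recover` are assumptions of this example; steps 1 and 2 are represented by the starting indices passed in.

```python
def recover(logs, start):
    """Find a consistent recovery line.

    logs:  {proc: [state0, state1, ...]}, each state being
           {"sent": {peer: count}, "rcvd": {peer: count}}.
    start: {proc: index}, the failed process's latest checkpoint and the
           latest logged event of every other process.
    Returns {proc: index} after N rounds, N being the number of processes.
    """
    current = dict(start)
    procs = list(logs)
    for _ in range(len(procs)):
        # Step 3: every process announces SENT at its current recovery point.
        announced = {j: logs[j][current[j]]["sent"] for j in procs}
        # Steps 4 and 5: every process compares RCVD with what was announced.
        for i in procs:
            for j in procs:
                if i == j:
                    continue
                limit = announced[j].get(i, 0)
                # Roll back past every receipt that j no longer remembers
                # sending; state 0 has RCVD = 0, so the loop terminates.
                while logs[i][current[i]]["rcvd"].get(j, 0) > limit:
                    current[i] -= 1
    return current
```

Repeating the round N times lets a rollback at one process propagate to every other process in the worst case.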
Asynchronous Recovery
[Figure: event logs of processors X, Y, Z with events Ex0 to Ex3, Ey0 to Ey3, Ez0 to Ez2 and checkpoints x1, y1, z1. The checks shown in the figure are 2 <= 2, 3 <= 2, 0 <= 0 at X; 1 <= 2, 1 <= 1 at Y; 0 <= 0, 1 <= 1, 2 <= 1 at Z.]
Condition checked: RCVD_i←j(CkPt_i) <= SENT_j→i(CkPt_j)