Recovery Chapter 12



  1. Recovery: Chapter 12

  2. Recovery
  • Failure of a site/node in a distributed system can leave the system in an inconsistent state.
  • Recovery: bringing the failed node back in step with the other nodes in the system.
  • Classification of failures:
    • Process failure: deadlocks, protection violations, erroneous user input, etc.
    • System failure: failure of the processor/system. A system failure can involve full or partial amnesia. It can be a pause failure (the system restarts in the same state it was in before the crash) or a complete halt.
    • Secondary storage failure: data becomes inaccessible.
    • Communication failure: the network becomes inaccessible.

  3. Fault-to-Recovery
  [Figure: fault sources (manufacturing, design, external, fatigue) lead to a fault, which produces an erroneous system state and ultimately a system failure that recovery must handle.]

  4. Backward & Forward Recovery
  • Forward recovery:
    • Assess the damage that could be caused by faults, remove that damage (the errors), and help processes continue.
    • Forward assessment is difficult to do; generally a tough approach.
  • Backward recovery:
    • Used when forward assessment is not possible. Restore processes to a previous error-free state.
    • Expensive to roll back states.
    • Does not prevent the same fault from occurring again (i.e., the system can loop on fault + recovery).
    • Some actions are unrecoverable: printouts, cash dispensed at ATMs.

  5. Recovery System Model
  • For backward recovery:
    • A single system with secondary and stable storage.
    • Stable storage does not lose information on failures.
    • Stable storage is used for logs and recovery points.
    • Stable storage is assumed to be more secure than secondary storage.
    • Data on secondary storage is assumed to be archived periodically.

  6. Approaches
  • Operation-based approach:
    • Maintain logs: all modifications to the state of a process are recorded in enough detail that a previous state can be restored by reversing all changes made to the state.
    • Example: commit in database transactions. If a transaction is committed by all nodes, its changes are permanent; if it does not commit, the effects of the transaction must be undone.
    • Updating-in-place: every write (update) produces a log record of (1) the object name, (2) the old object state, and (3) the new state. Operations:
      • A do operation performs the update and writes the log.
      • An undo operation uses the log to remove the effect of a do.
      • A redo operation uses the log to repeat a do.
    • Write-ahead log: to avoid the problem of a crash after the update but before logging, write the (undo & redo) log records before the update. A sketch follows below.
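
A minimal write-ahead-log sketch in Python, illustrating the do/undo/redo operations above. The class and variable names are illustrative (not from the text), and an in-memory list stands in for stable storage; the key point is that the (object, old state, new state) record is written before the in-place update.

class WriteAheadLog:
    def __init__(self):
        self.records = []                      # stands in for the stable-storage log

    def do(self, store, name, new_value):
        old_value = store.get(name)                         # capture the old state
        self.records.append((name, old_value, new_value))   # write the log record first
        store[name] = new_value                             # then update in place

    def undo(self, store):
        # remove the effect of the do operations, newest first
        for name, old_value, _ in reversed(self.records):
            if old_value is None:
                store.pop(name, None)
            else:
                store[name] = old_value

    def redo(self, store):
        # repeat the do operations using the logged new states
        for name, _, new_value in self.records:
            store[name] = new_value

db = {"x": 1}
log = WriteAheadLog()
log.do(db, "x", 2)
log.do(db, "y", 5)
log.undo(db)        # db is back to {"x": 1}
log.redo(db)        # db is {"x": 2, "y": 5} again
print(db)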

  7. Approaches
  • State-based approach:
    • Establish a recovery point where the process state is saved.
    • Recovery is done by restoring the process state saved at the recovery point, called a checkpoint; the restoration itself is called rollback.
    • The act of saving the state is called checkpointing, or taking a checkpoint.
    • Rollback is normally done to the most recent checkpoint, so many checkpoints are taken over the execution of a process.
    • The shadow pages technique can be used for checkpointing: the page containing the object to be updated is duplicated, and the copy is maintained as a checkpoint in stable storage (see the sketch below).
    • The actual update is done on the page in secondary storage; the copy in stable storage is used for rollback.
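
A small shadow-page sketch, assuming in-memory dictionaries stand in for the page in secondary storage and its copy in stable storage; names are illustrative. The checkpoint duplicates the page, updates touch only the live page, and rollback restores the duplicate.

class ShadowPagedObject:
    def __init__(self, page):
        self.page = dict(page)        # "secondary storage": updated in place
        self.shadow = None            # "stable storage": checkpoint copy

    def checkpoint(self):
        self.shadow = dict(self.page)       # duplicate the page

    def update(self, key, value):
        self.page[key] = value              # actual update on the live page

    def rollback(self):
        if self.shadow is not None:
            self.page = dict(self.shadow)   # restore from the shadow copy

obj = ShadowPagedObject({"balance": 100})
obj.checkpoint()
obj.update("balance", 250)
obj.rollback()
print(obj.page)    # {'balance': 100}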

  8. Recovery in Concurrent Systems
  • A distributed system's state involves message exchanges.
  • In distributed systems, rolling back one process can force other processes to roll back as well.
  • Orphan messages & the domino effect (assume Y fails after sending m):
    • X has a record of m at x3, but Y has no record of sending it, so m becomes an orphan message.
    • Y rolls back to y2, so X must go back to x2.
    • If Z rolls back, X and Y have to go back to x1 and y1: the domino effect, where the rollback of one process causes one or more other processes to roll back.
  [Figure: processes X, Y, Z with checkpoints x1-x3, y1-y2, z1-z2 and message m from Y to X.]

  9. Lost Messages
  • If Y fails after receiving m, it rolls back to y1.
  • X then rolls back to x1.
  • m becomes a lost message: X has recorded it as sent, but Y has no record of receiving it.
  [Figure: X and Y with checkpoints x1, y1; message m from X to Y; the failure occurs after m is received.]

  10. Livelocks
  • Y crashes before receiving n1. Y rolls back to y1, forcing X back to x1.
  • Y recovers, receives n1, and sends m2.
  • X recovers and sends n2, but has no record of sending n1.
  • Hence Y is forced to roll back a second time. X also rolls back, since it has received m2 but Y has no record of m2.
  • This sequence can repeat indefinitely, causing a livelock.
  [Figures: the first rollback after the failure, with messages m1 and n1; the second rollback, with messages m2, n1, and n2.]

  11. Consistent Checkpoints
  • To overcome the domino effect and livelocks, checkpoints should not have messages in transit.
  • Consistent checkpoints: no message exchange between any pair of processes in the set, or with processes outside the set, during the interval spanned by the checkpoints.
  • {x1, y1, z1} in the figure is a strongly consistent checkpoint. A small consistency checker is sketched below.
  [Figure: processes X, Y, Z over time with checkpoints x1-x2, y1-y2, z1-z2 and message m.]
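
A small checker for these notions, under an assumed (hypothetical) representation: each process's checkpoint is an index into its event sequence, and each message records its send and receive event indices. A set with no orphan messages is consistent; a set with no orphans and no messages in transit is strongly consistent.

def classify(checkpoints, messages):
    """checkpoints: {process: event index of its checkpoint}
    messages: list of (sender, send_idx, receiver, recv_idx).
    Returns (consistent, strongly_consistent)."""
    orphan = in_transit = False
    for sender, s_idx, receiver, r_idx in messages:
        sent_before = s_idx < checkpoints[sender]
        recd_before = r_idx < checkpoints[receiver]
        if recd_before and not sent_before:
            orphan = True        # recorded as received but "never sent"
        if sent_before and not recd_before:
            in_transit = True    # sent but not yet received across the checkpoint line
    return (not orphan, not orphan and not in_transit)

# {x1, y1, z1} from the figure: message m is sent and received after the
# checkpoints, so nothing crosses the line and the set is strongly consistent.
print(classify({"X": 1, "Y": 1, "Z": 1}, [("Y", 2, "X", 2)]))    # (True, True)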

  12. Synchronous Approach
  • Checkpointing (sketched in code below):
    • First phase:
      • An initiating process, Pi, takes a tentative checkpoint.
      • Pi requests all other processes to take tentative checkpoints.
      • Every process informs Pi whether it was able to take a checkpoint.
      • A process can fail to take a checkpoint due to the nature of the application, e.g., lack of log space or unrecoverable transactions.
    • Second phase:
      • If all processes took checkpoints, Pi decides to make the checkpoints permanent; otherwise the checkpoints are to be discarded.
      • Pi conveys this decision to all processes, i.e., whether the checkpoints are to be made permanent or discarded.
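
A sketch of the two-phase structure above, assuming direct function calls stand in for the request/reply messages; class and method names are illustrative, not from the text.

class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint   # e.g. False when log space is exhausted
        self.tentative = self.permanent = None
        self.state = 0

    def take_tentative(self):
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def decide(self, make_permanent):
        if make_permanent:
            self.permanent = self.tentative
        self.tentative = None                  # discarded either way

def synchronous_checkpoint(initiator, others):
    # Phase 1: the initiator and then every other process take tentative checkpoints.
    ok = initiator.take_tentative() and all(p.take_tentative() for p in others)
    # Phase 2: the initiator's decision is conveyed to all processes.
    for p in [initiator] + others:
        p.decide(ok)
    return ok

ok = synchronous_checkpoint(Process("Pi"), [Process("P1"), Process("P2", can_checkpoint=False)])
print(ok)    # False: one process could not checkpoint, so all checkpoints are discarded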

  13. Assumptions: Synchronous Approach
  • Processes communicate by exchanging messages through communication channels.
  • Channels are FIFO.
  • End-to-end protocols (e.g., TCP) are assumed to cope with message loss due to rollback recovery and communication failure.
  • Communication failures do not partition the network.
  • A process is not allowed to send messages between phase 1 and phase 2.

  14. Synchronous Approach...
  • Optimization: taking a checkpoint is expensive, and the algorithm discussed above may take unnecessary checkpoints.
  [Figure: X initiates checkpointing; processes X, Y, Z, W with checkpoints x1-x3, y1-y3, z1-z3, w2-w3.]

  15. Synchronous Approach...
  • Optimization: taking a checkpoint is expensive, and the algorithm discussed above may take unnecessary checkpoints.
  [Figure: the same scenario with the earlier checkpoints omitted, highlighting the checkpoints affected by the optimization.]

  16. Checkpointing Optimization
  • Each process uses monotonically increasing labels in its outgoing messages.
  • Notation:
    • L: largest label. S: smallest label.
    • Let m be the last message X received from Y after X's last permanent checkpoint. last_label_recdx[Y] = m.l if m exists; otherwise it is set to S.
    • Let m be the first message X sent to Y after X's last checkpoint (permanent or tentative). first_label_sentx[Y] = m.l if m exists; otherwise it is set to L.
  • In a checkpointing request to Y, X sends last_label_recdx[Y].
  • Y takes a tentative checkpoint iff last_label_recdx[Y] >= first_label_senty[X] > S, i.e., X has received one or more messages from Y after Y's last checkpoint, so Y should also take a checkpoint. (See the predicate sketched below.)
  • ckpt_cohortx = {Y | last_label_recdx[Y] > S}, i.e., the set of all processes from which X has received messages after its checkpoint.
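
The checkpoint-need test and the cohort definition above, written as small predicates. This is a sketch: S is modelled as negative infinity, and the function and variable names are illustrative.

S = float("-inf")    # smallest possible label (the slide's S)

def y_should_take_checkpoint(last_label_recd_x_from_y, first_label_sent_y_to_x):
    # Y takes a tentative checkpoint iff X has received at least one message
    # that Y sent after Y's last checkpoint.
    return last_label_recd_x_from_y >= first_label_sent_y_to_x > S

def ckpt_cohort(last_label_recd):
    # the processes from which messages were received since the last checkpoint
    return {y for y, label in last_label_recd.items() if label > S}

# Example: X received Y's message labelled 7; Y's first post-checkpoint message to X was labelled 5.
print(y_should_take_checkpoint(7, 5))    # True: Y must also checkpoint
print(ckpt_cohort({"Y": 7, "Z": S}))     # {'Y'}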

  17. Checkpointing Optimization
  • Initial state at every process p:
    • first_label_sentp[q] := S.
    • OK_to_take_ckptp := "yes" if p is willing to take a checkpoint; "no" otherwise.
  • At the initiator Pi:
    • for all p in ckpt_cohortPi do send a Take_a_tentative_ckpt(Pi, last_label_recdPi[p]) message;
    • if all processes replied "yes", then for all p in ckpt_cohortPi do send Make_tentative_ckpt_permanent;
    • else send Undo_tentative_ckpt.
  • At every process p:
    • Upon receiving a Take_a_tentative_ckpt message from q do
      • if OK_to_take_ckptp = "yes" AND last_label_recdq[p] >= first_label_sentp[q] > S then take a tentative checkpoint (continued on the next slide).

  18. Checkpointing Optimization...
  • At every process p (continuing the previous slide):
    • take a tentative checkpoint;
    • for all processes r in ckpt_cohortp do send a Take_a_tentative_ckpt(p, last_label_recdp[r]) message;
    • if all processes r replied "yes" then OK_to_take_ckptp := "yes" else OK_to_take_ckptp := "no";
    • send (p, OK_to_take_ckptp) to q.
  • Upon receiving a Make_tentative_ckpt_permanent message do
    • make the tentative checkpoint permanent;
    • for all processes r in ckpt_cohortp do send a Make_tentative_ckpt_permanent message.
  • Upon receiving an Undo_tentative_ckpt message do
    • undo the tentative checkpoint;
    • for all processes r in ckpt_cohortp do send an Undo_tentative_ckpt message.
  (A simulation of this cascading protocol is sketched below.)
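
A sketch of the cascading tentative-checkpoint protocol from slides 17-18, simulated with direct recursive calls instead of messages. Class and attribute names are illustrative; phase 2 (Make_tentative_ckpt_permanent / Undo_tentative_ckpt) is omitted, and a process whose label test fails is assumed to reply "yes" without checkpointing.

S = float("-inf")    # smallest label, as in the previous slide's notation

class Proc:
    def __init__(self, name, willing=True):
        self.name = name
        self.ok_to_take_ckpt = willing
        self.first_label_sent = {}   # first label sent to each process since the last checkpoint
        self.last_label_recd = {}    # last label received from each process since the last checkpoint
        self.tentative = False

    def cohort(self):
        # ckpt_cohort: processes this one has received messages from since its checkpoint
        return [q for q, lbl in self.last_label_recd.items() if lbl > S]

    def on_take_tentative_ckpt(self, requester, last_label_recd_req):
        # the test from slide 17; if no post-checkpoint message went to the
        # requester, the default makes the test fail and no checkpoint is taken
        needed = (self.ok_to_take_ckpt and
                  last_label_recd_req >= self.first_label_sent.get(requester, S) > S)
        if not needed:
            return self.ok_to_take_ckpt          # reply without checkpointing
        if not self.tentative:
            self.tentative = True                # take a tentative checkpoint
            replies = [q.on_take_tentative_ckpt(self, self.last_label_recd[q])
                       for q in self.cohort()]
            self.ok_to_take_ckpt = all(replies)
        return self.ok_to_take_ckpt

# The initiator has already taken its own tentative checkpoint; here it asks its cohort.
pi, p1, p2 = Proc("Pi"), Proc("P1"), Proc("P2")
p1.first_label_sent[pi] = 4;  pi.last_label_recd[p1] = 4    # P1 -> Pi, label 4
p2.first_label_sent[p1] = 9;  p1.last_label_recd[p2] = 9    # P2 -> P1, label 9
ok = all(q.on_take_tentative_ckpt(pi, pi.last_label_recd[q]) for q in pi.cohort())
print(ok, p1.tentative, p2.tentative)    # True True True -> checkpoints can be made permanent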

  19. Synchronous Rollback
  • Rolling back:
    • First phase:
      • Pi initiates a rollback by asking whether all processes are willing to roll back to their previous checkpoints.
      • Any process may say no, e.g., if it is already involved in another recovery.
    • Second phase:
      • Pi conveys the decision (whether agreement was reached) to all the other processes.
  [Figure: processes X, Y, Z with checkpoints x1-x2, y1-y2, z1-z2; a failure at X triggers the rollback.]

  20. Rollback Optimization
  • Additional notation:
    • Let m be the last message X sent to Y. last_label_sentx[Y] = m.l if m exists; otherwise it is set to S.
    • When X requests Y to restart from its permanent checkpoint, it sends last_label_sentx[Y] along with the request. Y restarts from its permanent checkpoint only if last_label_recdy[X] > last_label_sentx[Y] (see the predicate sketched below).
    • roll_cohortx = {Y | X can send messages to Y}.
  • Algorithm:
    • Initial state at every process p:
      • resume_executionp := true;
      • for all processes q do last_label_recdp[q] := L;
      • willing_to_rollp := "yes" if p is willing to roll back, "no" otherwise.
    • At the initiator process Pi:
      • for all p in roll_cohortPi do send a Prepare_to_rollback(Pi, last_label_sentPi[p]) message.
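
The restart test above as a predicate. A sketch only, with illustrative names; the labels follow the same conventions as in the checkpointing optimization.

def y_should_rollback(last_label_recd_y_from_x, last_label_sent_x_to_y):
    # Y restarts from its permanent checkpoint only if it holds a message from X
    # that X's rollback will "un-send" (i.e., an orphan with respect to X).
    return last_label_recd_y_from_x > last_label_sent_x_to_y

print(y_should_rollback(6, 4))   # True: Y received label 6, X only claims to have sent up to 4
print(y_should_rollback(3, 4))   # False: nothing Y received becomes orphaned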

  21. Rollback Optimization...
  • At the initiator process Pi (continued):
    • if all processes reply "yes", then for all p in roll_cohortPi do send a Roll_back message;
    • else for all p in roll_cohortPi do send a Donot_roll_back message.
  • At every process p:
    • Upon receiving a Prepare_to_rollback(q, last_label_sentq[p]) message from q do
      • if willing_to_rollp AND last_label_recdp[q] > last_label_sentq[p] AND resume_executionp then
        • resume_executionp := false;
        • for all r in roll_cohortp do send a Prepare_to_rollback(p, last_label_sentp[r]) message;
        • if all r in roll_cohortp replied "yes" then willing_to_rollp := "yes" else willing_to_rollp := "no";
      • send a (p, willing_to_rollp) message to q.

  22. Rollback Optimization...
  • At every process p:
    • Upon receiving a Roll_back message, if resume_executionp = false do
      • restart from p's permanent checkpoint;
      • for all r in roll_cohortp do send a Roll_back message.
    • Upon receiving a Donot_roll_back message do
      • resume execution;
      • for all r in roll_cohortp do send a Donot_roll_back message.

  23. Rollback Optimization...
  • X fails; X rolls back to x1, and Y & Z roll back to y1 and z1.
  [Figure: processes X, Y, Z with checkpoints x1, y1, z1 and message labels 0, 2, 3, 4 on the exchanged messages.]

  24. Rollback Optimization...
  • Both Y & Z do not roll back; only X rolls back to x1.
  • Message 3 will be handled by the retransmission mechanism of the network protocol (e.g., TCP).
  [Figure: the same processes and message labels; here the restart condition does not force Y and Z to roll back.]

  25. Asynchronous Approach
  • Disadvantages of the synchronous approach:
    • Additional message exchanges are needed for taking checkpoints.
    • Normal execution is delayed, since messages cannot be exchanged during checkpointing.
    • The overhead is unnecessary if no failures occur between checkpoints.
  • Asynchronous approach: each processor takes checkpoints independently; a consistent set of checkpoints is identified only when needed, for rollback.
  • E.g., in the figure {x3, y3, z2} is not consistent, while {x2, y2, z2} is consistent and would be used for rollback.
  [Figure: processes X, Y, Z with independent checkpoints x1-x3, y1-y3, z1-z2.]

  26. Asynchronous Approach...
  • Assumption: two types of logging.
    • Volatile logging: takes less time, but contents are lost on failure; periodically flushed to stable logs.
    • Stable logging: may take more time, but contents are not lost.
  • A log entry is the tuple {s, m, msgs_sent}: s is the process state, m the message received, and msgs_sent the set of messages sent during the event.
  • Event logging is initiated on message receipt.
  • Notation & data structures:
    • RCVDi<-j(CkPti): number of messages received by processor Pi from Pj, as recorded in checkpoint CkPti.
    • SENTi->j(CkPti): number of messages sent by processor Pi to Pj, as recorded in checkpoint CkPti.
  • Basic idea: each processor keeps track of the number of messages it has sent to / received from every other processor (see the logging sketch below).
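
A sketch of the per-event logging and counters above. The class and field names are illustrative; plain lists stand in for the volatile and stable logs, and, as an added assumption for later use, each logged entry also snapshots the RCVD/SENT counters at that event.

class AsyncLogger:
    def __init__(self, name, neighbors):
        self.name = name
        self.state = 0
        self.volatile_log = []                        # lost on failure
        self.stable_log = []                          # survives failures
        self.sent = {n: 0 for n in neighbors}         # SENT_i->j counters
        self.recd = {n: 0 for n in neighbors}         # RCVD_i<-j counters

    def on_receive(self, sender, message, msgs_sent):
        # logging is initiated on message receipt: record {s, m, msgs_sent}
        self.recd[sender] = self.recd.get(sender, 0) + 1
        for dest in msgs_sent:
            self.sent[dest] = self.sent.get(dest, 0) + 1
        self.volatile_log.append(
            {"s": self.state, "m": (sender, message), "msgs_sent": list(msgs_sent),
             "recd": dict(self.recd), "sent": dict(self.sent)})

    def flush(self):
        # periodically move volatile log entries to the stable log
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

p = AsyncLogger("X", ["Y", "Z"])
p.on_receive("Y", "m1", msgs_sent=["Z"])   # RCVD_X<-Y becomes 1, SENT_X->Z becomes 1
p.flush()                                  # the entry now survives a failure of X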

  27. Asynchronous Approach...
  • Basic idea (continued):
    • The existence of orphan messages is identified by comparing the number of messages sent and received.
    • If the number of messages received exceeds the number sent, orphans are present and the receiving process needs to roll back.
  • Algorithm:
    • A recovering processor broadcasts a message to all processors.
    • if Pi is the recovering processor then CkPti := its latest stable log entry
    • else CkPti := the latest event that took place at Pi.
    • for k := 1 to N do (N is the total number of processors in the system)
      • for each neighboring processor j do send a ROLLBACK(i, SENTi->j(CkPti)) message;
      • wait for a ROLLBACK message from every neighbor.

  28. Asynchronous Approach...
  • Algorithm (continued):
    • for every ROLLBACK(j, c) message received from a neighbor j, Pi does the following:
      • if RCVDi<-j(CkPti) > c then /* orphans present */
        • find the latest event e such that RCVDi<-j(e) = c;
        • CkPti := e.
    • end for k.
  • The algorithm has N iterations.
  • During the kth iteration (k != 1), Pi, using the CkPti determined in the (k-1)th iteration, computes SENTi->j(CkPti) for each neighbor j; this value is sent in the ROLLBACK message of the kth iteration.
  • At the end of each iteration, at least one processor rolls back to its final recovery point. (A centralized sketch of the loop appears below.)
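
A centralized sketch of the iterative rollback above, assuming the logging layout from the earlier sketch: each processor i is represented by a list of per-event {recd, sent} counter snapshots, and CkPti is an index into that list. The ROLLBACK exchange is simulated in one place rather than by messages; the data layout and names are assumptions, not from the text.

def recover(logs, ckpt):
    """logs: {i: [ {'recd': {j: count}, 'sent': {j: count}}, ... ]}, one snapshot per
    logged event (index 0 is assumed to be the initial state with all counts zero).
    ckpt: {i: index of the latest usable event}; the recovering processor starts at
    its latest stable entry, the others at their latest event."""
    procs = list(logs)
    for _ in range(len(procs)):                      # N iterations
        # values carried by this iteration's ROLLBACK(j, SENT_j->i(CkPt_j)) messages
        sent_at_ckpt = {j: logs[j][ckpt[j]]["sent"] for j in procs}
        for i in procs:
            for j in procs:
                if i == j:
                    continue
                c = sent_at_ckpt[j].get(i, 0)        # ROLLBACK(j, c) as received by i
                if logs[i][ckpt[i]]["recd"].get(j, 0) > c:      # orphans present
                    e = ckpt[i]                      # latest event e with RCVD_i<-j(e) <= c
                    while e > 0 and logs[i][e]["recd"].get(j, 0) > c:
                        e -= 1
                    ckpt[i] = e
    return ckpt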

  29. Asynch. Approach Example
  • Y fails and restarts from y1. CkPtx is ex3 and CkPtz is ez2.
  • 1st iteration:
    • Y sends RollBack(Y,2) to X & RollBack(Y,1) to Z.
    • X sends RollBack(X,1) to Y & RollBack(X,0) to Z.
    • Z sends RollBack(Z,0) to X & RollBack(Z,1) to Y.
  • Discussion:
    • RCVDx<-y(CkPtx) = 3 > 2 (the value in Y's RollBack message), so CkPtx is set to ex2 to satisfy the equality constraint.
    • RCVDz<-y(CkPtz) = 2 > 1 (the value in Y's message), so CkPtz is set to ez1.
  [Figure: processes X, Y, Z with events ex1-ex3, ey1-ey3, ez1-ez2, checkpoints x1, y1, z1, and Y's failure point.]

  30. Asynch. Approach Example...
  • Discussion (continued):
    • At Y, RCVDy<-x and RCVDy<-z satisfy the constraints, so CkPty is unchanged at y1.
  • 2nd iteration:
    • Y sends RollBack(Y,2) to X & RollBack(Y,1) to Z.
    • X sends RollBack(X,0) to Z & RollBack(X,1) to Y.
    • Z sends RollBack(Z,1) to Y & RollBack(Z,0) to X.
  • Checkpoint y1 is the same as ey2.
  • {ex2, y1/ey2, ez1} are identified as the consistent checkpoints to roll back to.

  31. Distributed Databases
  • Checkpointing objectives in distributed database systems (DDBS):
    • Checkpointing should interfere minimally with normal operations.
    • Since a DDBS may update different objects at different sites, local checkpointing at each site is preferable.
    • For faster recovery, checkpoints should be consistent (a desirable property).
    • Activity in a DDBS is expressed in terms of transactions, so a consistent checkpoint should either include the updates of a transaction completely or not include them at all.
  • Issues in identifying checkpoints:
    • How sites agree on which transactions are to be included.
    • Taking checkpoints without interfering with ongoing transactions.

  32. DDBS Checkpointing
  • Assumptions:
    • The basic unit of activity is the transaction.
    • Transactions follow some concurrency control protocol.
    • Lamport's logical clocks are used for time-stamping transactions.
    • Failures are detected by network protocols or timeouts.
    • Network partitioning never occurs.
  • Basic idea:
    • All sites agree on a Global Checkpoint Number (GCPN).
    • Transactions with timestamps <= GCPN are included in the checkpoint; these are the BCPTs (Before Checkpoint Transactions).
    • Timestamps of ACPTs (After Checkpoint Transactions) are > GCPN.
    • Each site keeps multiple versions of the data items being updated by ACPTs in volatile storage, so checkpointing does not interfere with them.

  33. DDBS Checkpointing...
  • Data structures:
    • LC: the local clock, maintained as a Lamport logical clock.
    • LCPN (local checkpoint number): determined locally for the current checkpoint.
  • The algorithm is initiated by a checkpoint coordinator (CC), which uses checkpoint subordinates (CSs).
  • Phase 1 at the CC:
    • The CC broadcasts a Checkpoint_Request message carrying its local timestamp LCcc.
    • LCPNcc := LCcc.
    • CONVERTcc := false.
    • Wait for replies from the CSs.
  • Phase 1 at the CSs: (continued on the next slide)

  34. DDBS Checkpointing...
  • Phase 1 at the CSs:
    • On receiving a Checkpoint_Request message, a site m updates its local clock: LCm := MAX(LCm, LCcc + 1).
    • LCPNm := LCm.
    • m informs the CC of LCPNm.
    • CONVERTm := false.
    • m marks all transactions with timestamps <= LCPNm as BCPTs and the rest as temporary ACPTs.
    • All updates of temporary ACPTs are stored in the buffers of the ACPTs.
    • If a temporary ACPT commits, its updates are not flushed to the database but are maintained as committed temporary versions (CTVs).
    • Other transactions read from the CTVs; for writes, another version of the CTV is created.

  35. DDBS Checkpointing...
  • Phase 2 at the CC:
    • Once all CSs' replies are received, GCPN := MAX(LCPN1, ..., LCPNn).
    • Broadcast the GCPN.
  • Phase 2 at the CSs:
    • On receiving the GCPN, m marks as BCPTs all temporary ACPTs that satisfy LCPNm < transaction timestamp <= GCPN.
    • Updates of these converted BCPTs are included in the checkpoint.
    • CONVERTm := true (i.e., the GCPN is known and the BCPTs have been identified).
    • When all BCPTs have terminated and CONVERTm = true, m takes a local checkpoint by saving the state of the data objects.
    • After local checkpointing, the database is updated with the CTVs and the CTVs are deleted.
  (See the sketch of the GCPN computation and transaction classification below.)
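
A small sketch of the GCPN agreement and the BCPT/ACPT classification above, assuming Lamport clock values and transaction timestamps are reduced to plain integers; the function names and example values are illustrative.

def decide_gcpn(lcpns):
    # Phase 2 at the coordinator: the GCPN is the maximum of the reported LCPNs.
    return max(lcpns.values())

def classify_transactions(transactions, lcpn_m, gcpn):
    """transactions: {txn_id: timestamp} at site m.
    Returns (bcpts, acpts); the BCPTs are included in the checkpoint."""
    bcpts, acpts = [], []
    for txn, ts in transactions.items():
        if ts <= lcpn_m or lcpn_m < ts <= gcpn:    # original BCPTs plus converted ones
            bcpts.append(txn)
        else:                                      # ts > GCPN: remains an ACPT
            acpts.append(txn)
    return bcpts, acpts

# Example: the coordinator timestamps its request at 10; the sites report their LCPNs.
lcpns = {"CC": 10, "S1": 12, "S2": 11}
gcpn = decide_gcpn(lcpns)                          # 12
print(classify_transactions({"T1": 9, "T2": 11, "T3": 13}, lcpn_m=10, gcpn=gcpn))
# (['T1', 'T2'], ['T3']): T1 and T2 become BCPTs, T3 remains an ACPT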
