Recovery
• Failure of a site/node in a distributed system causes inconsistencies in the state of the system.
• Recovery: bringing the failed node back in step with the other nodes in the system.
• Failures:
  • Process failure: deadlocks, protection violations, erroneous user input, etc.
  • System failure: failure of the processor/system. A system failure can cause full or partial amnesia. It can be a pause failure (the system restarts in the same state it was in before the crash) or a complete halt.
  • Secondary storage failure: data inaccessible.
  • Communication failure: network inaccessible.
B. Prabhakaran
Fault-to-Recovery
[Figure: faults (manufacturing, design, external, fatigue) lead to an erroneous system state, which leads to system failure.]
Backward & Forward Recovery
• Forward recovery:
  • Assess the damage that could be caused by faults, remove that damage (the errors), and help processes continue.
  • Forward assessment is difficult in general.
• Backward recovery:
  • Used when forward assessment is not possible: restore processes to a previous error-free state.
  • Rolling back states is expensive.
  • Does not prevent the same fault from occurring again (i.e., a loop of fault + recovery).
  • Unrecoverable actions: printouts, cash dispensed at ATMs.
Recovery System Model
• For backward recovery:
  • A single system with secondary and stable storage.
  • Stable storage does not lose information on failures.
  • Stable storage is used for logs and recovery points.
  • Stable storage is assumed to be more secure than secondary storage.
  • Data on secondary storage is assumed to be archived periodically.
Approaches
• Operation-based approach
  • Maintaining logs: all modifications to the state of a process are recorded in sufficient detail that a previous state can be restored by reversing all changes made to the state.
  • E.g., commit in database transactions: if a transaction is committed by all nodes, its changes are permanent. If it does not commit, its effects are undone.
  • Updating-in-place: every write (update) produces a log record of (1) the object name, (2) the old object state, and (3) the new state. Operations:
    • A do operation performs the update and writes the log record.
    • An undo operation uses the log to remove the effect of a do.
    • A redo operation uses the log to repeat a do.
  • Write-ahead log: avoids the problem of a crash after the update but before logging.
    • Write the (undo & redo) log records before the update.
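The do, undo, and redo operations can be sketched in a few lines of Python. This is a minimal illustration, not a real storage engine: the class `LoggedStore` and its in-memory `data` and `log` fields are hypothetical stand-ins for secondary and stable storage, and an object that did not exist before an update is represented by an old state of `None`.

```python
class LoggedStore:
    """Toy update-in-place store with a write-ahead log (illustrative names)."""

    def __init__(self):
        self.data = {}   # stands in for secondary storage
        self.log = []    # stands in for the log on stable storage

    def do(self, name, new_state):
        old_state = self.data.get(name)
        # Write-ahead rule: append the log record before touching the data.
        self.log.append((name, old_state, new_state))
        self.data[name] = new_state

    def undo(self):
        # Remove the effect of the most recent do, using its log record.
        name, old_state, _ = self.log.pop()
        if old_state is None:
            self.data.pop(name, None)
        else:
            self.data[name] = old_state

    def redo_all(self):
        # Repeat every logged do in order, e.g. after the data copy is lost.
        for name, _, new_state in self.log:
            self.data[name] = new_state


store = LoggedStore()
store.do("balance", 100)
store.do("balance", 70)
store.undo()          # balance is 100 again
store.data.clear()    # simulate losing the in-place copy
store.redo_all()      # rebuild it from the log
print(store.data)
```

Because each record holds both the old and the new state, the same log supports undo and redo.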
Approaches
• State-based approach
  • Establish a recovery point at which the process state is saved.
  • Recovery is done by restoring the process state saved at the recovery point, called a checkpoint. This restoration is called rollback.
  • The process of saving is called checkpointing or taking a checkpoint.
  • Rollback normally goes to the most recent checkpoint, hence many checkpoints are taken over the execution of a process.
  • The shadow pages technique can be used for checkpointing: the page containing the object to be updated is duplicated and kept as a checkpoint in stable storage.
  • The actual update is done on the page in secondary storage. The copy in stable storage is used for rollback.
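As a rough sketch of checkpointing and rollback for a single process, assume the process state is an ordinary Python object and that a list of deep copies can stand in for stable storage. The class name `CheckpointedProcess` is hypothetical.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        self.checkpoints = []   # stands in for stable storage

    def take_checkpoint(self):
        # Deep copy so that later updates cannot alter the saved state.
        self.checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restore the most recent checkpoint.
        self.state = copy.deepcopy(self.checkpoints[-1])


proc = CheckpointedProcess({"counter": 0})
proc.take_checkpoint()
proc.state["counter"] = 5
proc.take_checkpoint()
proc.state["counter"] = 9   # work done after the last checkpoint
proc.rollback()             # that work is discarded
print(proc.state)
```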
Recovery in Concurrent Systems
• Distributed system state involves message exchanges.
• In distributed systems, rolling back one process can cause other processes to roll back.
• Orphan messages & the domino effect: assume Y fails after sending m.
  • X has a record of m at x3 but Y has no record of sending it: m is an orphan message.
  • Y rolls back to y2 -> X should go to x2.
  • If Z rolls back, X and Y have to go to x1 and y1 -> domino effect: the rollback of one process causes one or more other processes to roll back.
[Figure: timelines of X (checkpoints x1, x2, x3), Y (y1, y2), and Z (z1, z2), with message m sent from Y to X.]
Lost Messages
• If Y fails after receiving m, it will roll back to y1.
• X will roll back to x1.
• m will be a lost message, as X has recorded it as sent and Y has no record of receiving it.
[Figure: timelines of X (checkpoint x1) and Y (checkpoint y1), with message m sent from X to Y and the failure marked.]
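The difference between orphan and lost messages can be stated as set arithmetic over what the restored states record. The sketch below assumes each restored state carries the set of message ids it remembers sending and receiving. The function `classify_messages` and the dictionary layout are illustrative only.

```python
def classify_messages(restored):
    """restored maps a process name to {"sent": set, "received": set}, the
    message ids recorded in the state that process rolls back to."""
    all_sent = set().union(*(s["sent"] for s in restored.values()))
    all_received = set().union(*(s["received"] for s in restored.values()))
    return {
        "orphan": all_received - all_sent,   # received, but no record of sending
        "lost": all_sent - all_received,     # sent, but no record of receipt
    }


# Orphan case: X remembers receiving m, Y rolled back to before sending it.
orphan_case = classify_messages({
    "X": {"sent": set(), "received": {"m"}},
    "Y": {"sent": set(), "received": set()},
})

# Lost case: X remembers sending m, Y rolled back to before receiving it.
lost_case = classify_messages({
    "X": {"sent": {"m"}, "received": set()},
    "Y": {"sent": set(), "received": set()},
})
print(orphan_case, lost_case)
```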
Livelocks
[Figure: two timelines of X (checkpoint x1) and Y (checkpoint y1): the first failure, with messages n1 and m1, and the second rollback, with messages n1, n2, and m2.]
• Y crashes before receiving n1. Y rolls back to y1 -> X rolls back to x1.
• Y recovers, receives n1, and sends m2.
• X recovers and sends n2, but has no record of sending n1.
• Hence Y is forced to roll back a second time. X also rolls back, as it has received m2 but Y has no record of m2.
• The above sequence can repeat indefinitely, causing a livelock.
Consistent Checkpoints
• Overcoming the domino effect and livelocks: checkpoints should not have messages in transit.
• Strongly consistent set of checkpoints: no messages are exchanged between any pair of processes in the set, or between a process in the set and one outside it, during the interval spanned by the checkpoints.
• {x1, y1, z1} is a strongly consistent set of checkpoints.
[Figure: timelines of X (checkpoints x1, x2, x3), Y (y1, y2), and Z (z1, z2), with message m sent from Y to X.]
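One way to check a candidate set of checkpoints is to compare it with the send and receive positions of each message. The sketch below is an assumption-laden illustration: every process numbers its own events, a checkpoint is given as the index of the last event it includes, and the names `Message` and `check_cut` are hypothetical. It reports a set as inconsistent if it contains an orphan message, as strongly consistent if no message crosses it at all, and as consistent otherwise.

```python
from collections import namedtuple

# send_idx and recv_idx are local event numbers at the sender and receiver.
Message = namedtuple("Message", "sender send_idx receiver recv_idx")


def check_cut(cut, messages):
    """cut maps each process to the local index of its checkpoint."""
    orphan = any(m.send_idx > cut[m.sender] and m.recv_idx <= cut[m.receiver]
                 for m in messages)
    in_transit = any(m.send_idx <= cut[m.sender] and m.recv_idx > cut[m.receiver]
                     for m in messages)
    if orphan:
        return "inconsistent"
    return "consistent" if in_transit else "strongly consistent"


# Made-up example: X sends one message at its event 5, Y receives it at event 4.
msgs = [Message("X", 5, "Y", 4)]
print(check_cut({"X": 2, "Y": 2}, msgs))  # message lies entirely after the cut
print(check_cut({"X": 6, "Y": 2}, msgs))  # sent before, received after the cut
print(check_cut({"X": 2, "Y": 6}, msgs))  # received before, sent after the cut
```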
Synchronous Approach
• Checkpointing:
  • First phase:
    • An initiating process, Pi, takes a tentative checkpoint.
    • Pi requests all other processes to take tentative checkpoints.
    • Every process reports whether it was able to take a checkpoint.
    • A process can fail to take a checkpoint due to the nature of the application, e.g., lack of log space or unrecoverable transactions.
  • Second phase:
    • If all processes took checkpoints, Pi decides to make the checkpoints permanent.
    • Otherwise, the checkpoints are discarded.
    • Pi conveys to all the processes its decision on whether the checkpoints are to be made permanent or discarded.
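The two phases can be sketched as a single-threaded simulation in which method calls stand in for request and reply messages. The names `Process`, `take_tentative`, `finish`, and `coordinated_checkpoint` are hypothetical, and the process state is reduced to one integer.

```python
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # False models e.g. lack of log space
        self.state = 0
        self.tentative = None
        self.permanent = None

    def take_tentative(self):
        # Phase 1: try to save the state and report success or failure.
        if self.can_checkpoint:
            self.tentative = self.state
        return self.can_checkpoint

    def finish(self, commit):
        # Phase 2: make the tentative checkpoint permanent, or discard it.
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None


def coordinated_checkpoint(initiator, others):
    group = [initiator] + others
    # A list (not a generator) so that every process is asked in phase 1.
    replies = [p.take_tentative() for p in group]
    decision = all(replies)
    for p in group:
        p.finish(decision)
    return decision


x, y, z = Process("X"), Process("Y"), Process("Z", can_checkpoint=False)
x.state, y.state, z.state = 3, 7, 1
first = coordinated_checkpoint(x, [y, z])    # Z refuses: nothing is kept
z.can_checkpoint = True
second = coordinated_checkpoint(x, [y, z])   # everyone agrees
print(first, second, x.permanent, y.permanent, z.permanent)
```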
Assumptions: Synchronous Appr.
• Processes communicate by exchanging messages through communication channels.
• Channels are FIFO in nature.
• End-to-end protocols (e.g., TCP) are assumed to cope with message loss due to rollback recovery and communication failure.
• Communication failures do not partition the network.
• A process is not allowed to send messages between phases 1 and 2.
Synchronous Approach...
• Optimization:
  • Taking a checkpoint is expensive, and the algorithm discussed may take unnecessary checkpoints.
[Figure: timelines of X (checkpoints x1, x2, x3), Y (y1, y2, y3), Z (z1, z2, z3), and W (w2, w3), with the point where checkpointing is initiated marked.]
Checkpointing Optimization
• Each process uses monotonically increasing labels in its outgoing messages.
• Notation:
  • L: largest label. S: smallest label.
  • Let m be the last message X received from Y after X’s last permanent checkpoint. last_label_recd_X[Y] = m.l if m exists; otherwise it is set to S.
  • Let m be the first message X sent to Y after checkpointing at X (permanent or temporary). first_label_sent_X[Y] = m.l if m exists; otherwise it is set to L.
• With a checkpointing request to Y, X sends last_label_recd_X[Y].
• Y takes a temporary checkpoint iff last_label_recd_X[Y] >= first_label_sent_Y[X], i.e., X has received one or more messages that Y sent after Y’s checkpoint, and hence Y should take a checkpoint.
• ckpt_cohort_X = {Y | last_label_recd_X[Y] > S}, i.e., the set of all processes from which X has received messages after its checkpoint.
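The label test is small enough to write out directly. In this sketch the smallest and largest labels S and L are modeled as negative and positive infinity, which is an assumption made for convenience. The function names `must_checkpoint` and `ckpt_cohort` are illustrative.

```python
S, L = float("-inf"), float("inf")   # smallest and largest labels (assumed encoding)


def must_checkpoint(last_label_recd_x_from_y, first_label_sent_y_to_x):
    """Y takes a tentative checkpoint iff X has received a message that Y
    sent after Y's own last checkpoint."""
    return last_label_recd_x_from_y >= first_label_sent_y_to_x


def ckpt_cohort(last_label_recd):
    """Processes from which this process has received a message since its
    last permanent checkpoint."""
    return {y for y, label in last_label_recd.items() if label > S}


# Y sent labels 4 and 5 to X after its checkpoint, and X received both.
print(must_checkpoint(5, 4))
# Y sent nothing to X since its checkpoint, so first_label_sent is L.
print(must_checkpoint(3, L))
# X has heard from Y but not from Z.
print(ckpt_cohort({"Y": 5, "Z": S}))
```

With this encoding, a process that has sent nothing since its checkpoint can never satisfy the test, so it is never asked to take an unnecessary checkpoint.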
Checkpointing Optimization
• Initial state at all processes p:
  • first_label_sent_p[q] := L.
  • OK_to_take_ckpt_p := “yes” if p is willing; “no” otherwise.
• At initiator Pi:
  • for all p in ckpt_cohort_Pi do send Take_a_tentative_ckpt(Pi, last_label_recd_Pi[p]) message.
  • if all processes replied “yes”, then for all p in ckpt_cohort_Pi do send Make_tentative_ckpt_permanent.
  • else send Undo_tentative_ckpt.
• At all processes p:
  • Upon receiving a Take_a_tentative_ckpt message from q do
    • if OK_to_take_ckpt_p = “yes” AND last_label_recd_q[p] >= first_label_sent_p[q]
      • take a tentative checkpoint.
Checkpointing Optimization...
• At all processes p:
      • take a tentative checkpoint.
      • for all processes r in ckpt_cohort_p do send Take_a_tentative_ckpt(p, last_label_recd_p[r]) message.
      • if all processes r replied “yes” then OK_to_take_ckpt_p := “yes”
      • else OK_to_take_ckpt_p := “no”.
    • send (p, OK_to_take_ckpt_p) to q.
  • Upon receiving a Make_tentative_ckpt_permanent message do
    • Make the tentative checkpoint permanent.
    • for all processes r in ckpt_cohort_p do send Make_tentative_ckpt_permanent message.
  • Upon receiving an Undo_tentative_ckpt message do
    • Undo the tentative checkpoint.
    • for all processes r in ckpt_cohort_p do send Undo_tentative_ckpt message.
Synchronous Rollback
• Rolling back:
  • First phase:
    • Pi initiates a rollback by asking all processes whether they are willing to roll back to the previous checkpoint.
    • Any process may say no if it is involved in another recovery process.
  • Second phase:
    • Pi conveys the decision on agreement to all others.
[Figure: timelines of X (checkpoints x1, x2), Y (y1, y2), and Z (z1, z2), with a failure marked.]
Rollback Optimization
• Additional notation:
  • Let m be the last message X sent to Y before X’s last permanent checkpoint. last_label_sent_X[Y] = m.l if m exists; otherwise it is set to S.
  • When X requests Y to restart from its permanent checkpoint, it sends last_label_sent_X[Y] along with the request. Y will restart from its permanent checkpoint only if last_label_recd_Y[X] > last_label_sent_X[Y].
  • roll_cohort_X = {Y | X can send messages to Y}
• Algorithm:
  • Initial state at all processes p:
    • resume_execution_p := true;
    • for all processes q do last_label_recd_p[q] := S;
    • willing_to_roll_p := “yes” if p is willing to roll back; “no” otherwise.
  • At initiator process Pi:
    • for all p in roll_cohort_Pi do send Prepare_to_rollback(Pi, last_label_sent_Pi[p]) message.
Rollback Optimization...
• At initiator process Pi...
  • if all processes reply “yes”, then for all p in roll_cohort_Pi do send Roll_back message.
  • else for all p in roll_cohort_Pi do send Donot_roll_back message.
• At all processes p:
  • Upon receiving a Prepare_to_rollback(q, last_label_sent_q[p]) message from q do
    • if willing_to_roll_p AND last_label_recd_p[q] > last_label_sent_q[p] AND resume_execution_p
      • resume_execution_p := false;
      • for all r in roll_cohort_p do send Prepare_to_rollback(p, last_label_sent_p[r]) message;
      • if all r in roll_cohort_p replied “yes” then willing_to_roll_p := “yes”
      • else willing_to_roll_p := “no”
    • send (p, willing_to_roll_p) message to q
Rollback Optimization...
• At all processes p:
  • Upon receiving a Roll_back message AND if resume_execution_p = false do
    • restart from p’s permanent checkpoint
    • for all r in roll_cohort_p do send Roll_back message
  • Upon receiving a Donot_roll_back message do
    • resume execution
    • for all r in roll_cohort_p do send Donot_roll_back message
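The heart of the rollback optimization is the comparison a process makes when it receives Prepare_to_rollback. A minimal sketch, assuming labels are integers, S is encoded as negative infinity, and the function name `must_roll_back` is hypothetical:

```python
S = float("-inf")   # smallest label (assumed encoding)


def must_roll_back(last_label_recd_p_from_q, last_label_sent_q_to_p,
                   willing_to_roll=True, resume_execution=True):
    """p joins q's rollback only if it has received a message from q whose
    label is larger than the last one q recorded sending to p."""
    return (willing_to_roll
            and last_label_recd_p_from_q > last_label_sent_q_to_p
            and resume_execution)


# q's checkpoint records labels up to 2 sent to p, but p has received label 3:
# that message will be undone by q's rollback, so p must roll back too.
print(must_roll_back(3, 2))
# p has received nothing beyond what q's checkpoint records: p keeps running.
print(must_roll_back(2, 2))
# p is already taking part in this rollback (resume_execution is false).
print(must_roll_back(3, 2, resume_execution=False))
```

The `resume_execution` flag keeps a process from re-entering the same rollback when requests reach it along several paths.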
Rollback Optimization...
[Figure: timelines of X, Y, and Z with checkpoints x1, y1, and z1, a failure marked, and messages annotated with their labels.]
• X rolls back to x1. Y and Z roll back to y1 and z1.
Rollback Optimization...
[Figure: timelines of X, Y, and Z with checkpoints x1, y1, and z1, a failure marked, and messages annotated with their labels.]
• Neither Y nor Z rolls back. X rolls back to x1.
• Message 3 will be handled by retransmission in the network protocol (e.g., TCP).
Asynchronous Approach
• Disadvantages of the synchronous approach:
  • Additional message exchanges for taking checkpoints.
  • Delays normal execution, as messages cannot be exchanged during checkpointing.
  • Unnecessary overhead if no failures occur between checkpoints.
• Asynchronous approach: independent checkpoints at each processor. A consistent set of checkpoints is identified when needed for rollback.
• E.g., {x3, y3, z2} is not consistent; {x2, y2, z2} is consistent and is used for rollback.
[Figure: timelines of X (checkpoints x1, x2, x3), Y (y1, y2, y3), and Z (z1, z2).]
Asynchronous Approach...
• Assumption: 2 types of logging.
  • Volatile log: takes less time, but its contents are lost on failure. Periodically flushed to the stable log.
  • Stable log: may take more time, but its contents are not lost.
• Logging: tuple {s, m, msgs_sent}, where s is the process state, m the message received, and msgs_sent the set of messages sent during the event.
• Event logging is initiated on message receipt.
• Notation & data structures:
  • RCVD_i<-j(CkPt_i): number of messages received by processor Pi from Pj as per checkpoint CkPt_i.
  • SENT_i->j(CkPt_i): number of messages sent by processor Pi to Pj as per checkpoint CkPt_i.
• Basic idea:
  • Each processor keeps track of the number of messages sent to and received from other processors.
Asynchronous Approach...
• Basic idea...
  • The existence of orphan messages is identified by comparing the number of messages sent and received.
  • If the number of received messages > the number of sent messages -> orphans are present -> the receiving process needs to roll back.
• Algorithm:
  • A recovering processor broadcasts a message to all processors.
  • if Pi is the recovering processor, CkPt_i := latest stable log.
  • else CkPt_i := latest event that took place in i.
  • for k := 1 to N do (N is the total number of processors in the system)
    • for each neighboring processor j do send ROLLBACK(i, SENT_i->j(CkPt_i)) message.
    • Wait for a ROLLBACK message from every neighbor.
Asynchronous Approach...
• Algorithm...
    • for every ROLLBACK(j, c) message received from a neighbor j, i does the following:
      • if RCVD_i<-j(CkPt_i) > c then /* orphans present */
        • find the latest event e such that RCVD_i<-j(e) = c;
        • CkPt_i := e.
  • end for k.
• The algorithm has N iterations.
• During the kth iteration (k != 1), Pi computes SENT_i->j(CkPt_i) for each neighbor j, based on the CkPt_i determined in the (k-1)th iteration.
• This value is sent in a ROLLBACK message (in the kth iteration).
• At the end of each iteration, at least 1 processor will roll back to its final recovery point.
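The iteration can be simulated in one process to see how rollbacks propagate. The sketch below makes several assumptions that are not in the slides: every processor treats all others as neighbors, each logged state is a pair of dictionaries with cumulative sent and received counts, the first logged state of each processor has received nothing, and the scenario at the bottom is made up. The function name `recover` is hypothetical.

```python
def recover(states, start):
    """states[i]: processor i's logged states, oldest first, each a
    (sent, rcvd) pair of dicts with cumulative message counts per peer.
    start[i]: index of the state processor i starts from (the latest stable
    state for the recovering processor, the latest event for the others)."""
    ckpt = dict(start)
    procs = list(states)

    def rcvd(i, j, idx):
        return states[i][idx][1].get(j, 0)

    for _ in range(len(procs)):                    # N iterations
        # ROLLBACK(i, SENT_i->j(CkPt_i)) from every i to every other j.
        announced = {(i, j): states[i][ckpt[i]][0].get(j, 0)
                     for i in procs for j in procs if i != j}
        for (j, i), c in announced.items():        # message from j arrives at i
            if rcvd(i, j, ckpt[i]) > c:            # orphans present
                # Latest state in which i had received at most c from j; this
                # equals the "= c" rule when counts grow by one per event.
                ckpt[i] = max(k for k in range(ckpt[i] + 1)
                              if rcvd(i, j, k) <= c)
    return ckpt


# Y's stable state records 1 message sent to X; a second send was lost in the crash.
states = {
    "X": [({}, {}), ({}, {"Y": 1}), ({"Z": 1}, {"Y": 2})],
    "Y": [({"X": 1}, {}), ({"X": 2}, {})],
    "Z": [({}, {}), ({}, {"X": 1})],
}
result = recover(states, {"X": 2, "Y": 0, "Z": 1})
print(result)
```

In this made-up run X rolls back in the first iteration because it received two messages from Y while Y only remembers sending one. Z rolls back in the second iteration because the message it received from X was sent in the state X has just abandoned.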
Asynch. Approach Example
[Figure: timelines of X (checkpoint x1, events ex1, ex2, ex3), Y (checkpoint y1, events ey1, ey2, ey3), and Z (checkpoint z1, events ez1, ez2), with the failure marked.]
• Y fails and restarts from y1. CkPt_x is ex3 and CkPt_z is ez2.
• 1st iteration:
  • Y sends RollBack(Y,2) to X & RollBack(Y,1) to Z.
  • X sends RollBack(X,1) to Y & RollBack(X,0) to Z.
  • Z sends RollBack(Z,0) to X & RollBack(Z,1) to Y.
• Discussion:
  • RCVD_x<-y(CkPt_x) = 3 > 2 (the count in Y’s RollBack message), so CkPt_x is set to ex2 to satisfy the equality constraint.
  • RCVD_z<-y(CkPt_z) = 2 > 1 (the count in Y’s message), so CkPt_z is set to ez1.
Asynch. Approach Example..
• Discussion...
  • At Y, RCVD_y<-x and RCVD_y<-z satisfy the constraints, so CkPt_y is unchanged at y1.
• 2nd iteration:
  • Y sends RollBack(Y,2) to X & RollBack(Y,1) to Z.
  • X sends RollBack(X,0) to Z & RollBack(X,1) to Y.
  • Z sends RollBack(Z,1) to Y & RollBack(Z,0) to X.
• Checkpoint y1 is the same as ey2.
• {ex2, y1/ey2, ez1} are identified as the consistent checkpoints to roll back to.
Distributed Databases
• Checkpointing objectives in distributed database systems (DDBS):
  • Checkpointing should interfere minimally with normal operations.
  • A DDBS may update different objects at different sites, so local checkpointing at each site is better.
  • For faster recovery, checkpoints should be consistent (a desirable property).
  • Activity in a DDBS is in terms of transactions, so a consistent checkpoint should either include the updates of a transaction completely or not include them at all.
• Issues in identifying checkpoints:
  • How sites agree on which transactions are to be included.
  • Taking checkpoints without interference.
DDBS Checkpointing
• Assumptions:
  • The basic unit of activity is the transaction.
  • Transactions follow some concurrency control protocol.
  • Lamport’s logical clocks are used for time-stamping transactions.
  • Failures are detected by network protocols or timeouts.
  • Network partitioning never occurs.
• Basic idea:
  • All sites agree on a Global Checkpoint Number (GCPN).
  • Transactions with timestamps <= GCPN are included in the checkpoint. These are called BCPTs: Before Checkpoint Transactions.
  • Timestamps of After Checkpoint Transactions (ACPTs) are > GCPN.
  • Each site maintains multiple versions of the data items being updated by ACPTs in volatile storage -> no interference during checkpointing.
DDBS Checkpointing ...
• Data structures:
  • LC: local clock, as per Lamport’s logical clock.
  • LCPN (local checkpoint number): determined locally for the current checkpoint.
• Algorithm: initiated by the checkpoint coordinator (CC). The CC uses checkpoint subordinates (CSs).
• Phase 1 at the CC:
  • The CC broadcasts a Checkpoint_Request message with a local timestamp LC_cc.
  • LCPN_cc := LC_cc
  • CONVERT_cc := false
  • Wait for replies from the CSs.
DDBS Checkpointing ...
• Phase 1 at the CSs:
  • On receiving a Checkpoint_Request message, a site m updates its local clock as LC_m := MAX(LC_m, LC_cc + 1).
  • LCPN_m := LC_m
  • m informs the CC of LCPN_m.
  • CONVERT_m := false
  • m marks all the transactions with timestamps <= LCPN_m as BCPTs and the rest as temporary-ACPTs.
  • All updates of temporary-ACPTs are stored in the buffers of the ACPTs.
  • If a temporary-ACPT commits, its updates are not flushed to the database but are maintained as committed temporary versions (CTVs).
  • Other transactions access CTVs for reads. For writes, another version of the CTV is created.
DDBS Checkpointing ...
• Phase 2 at the CC:
  • All CSs’ replies received -> GCPN := Max(LCPN_1, ..., LCPN_n).
  • Broadcast GCPN.
• Phase 2 at the CSs:
  • On receiving GCPN, m marks as BCPTs all temporary-ACPTs that satisfy the following condition:
    • LCPN_m < transaction timestamp <= GCPN
  • Updates of the converted BCPTs are included in the checkpoint.
  • CONVERT_m := true (i.e., GCPN & BCPTs identified).
  • When all BCPTs terminate and CONVERT_m = true, m takes a local checkpoint by saving the state of the data objects.
  • After local checkpointing, the database is updated with the CTVs and the CTVs are deleted.
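The timestamp arithmetic of the two phases can be sketched as follows. Message passing is collapsed into plain function calls, and the names `checkpoint_round` and `classify`, the site names, and the sample timestamps are all hypothetical.

```python
def checkpoint_round(lc_cc, local_clocks):
    """lc_cc: the coordinator's timestamp in Checkpoint_Request.
    local_clocks: Lamport clock value at each subordinate site."""
    # Phase 1 at each CS: LC_m := MAX(LC_m, LC_cc + 1), then LCPN_m := LC_m.
    lcpn = {m: max(lc, lc_cc + 1) for m, lc in local_clocks.items()}
    # Phase 2 at the CC: GCPN is the largest LCPN reported.
    gcpn = max(lcpn.values())
    return lcpn, gcpn


def classify(timestamps, lcpn_m, gcpn):
    """Label each transaction at site m once GCPN is known."""
    labels = {}
    for txn, ts in timestamps.items():
        if ts <= lcpn_m:
            labels[txn] = "BCPT"              # marked in phase 1
        elif ts <= gcpn:
            labels[txn] = "converted BCPT"    # temporary-ACPT converted in phase 2
        else:
            labels[txn] = "ACPT"
    return labels


lcpn, gcpn = checkpoint_round(10, {"A": 7, "B": 15})
labels = classify({"t1": 9, "t2": 13, "t3": 20}, lcpn["A"], gcpn)
print(lcpn, gcpn, labels)
```

In this example site A adopts LCPN 11 while site B already stands at 15, so GCPN is 15 and transaction t2 at site A, first treated as a temporary-ACPT, ends up inside the checkpoint.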