
Prof. Ravi K. Iyer Center for Reliable and High-Performance Computing

Design of Reliable Systems and Networks, ECE 442. Checkpointing II: Distributed Checkpointing & Recovery. Prof. Ravi K. Iyer, Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, iyer@crhc.uiuc.edu


Presentation Transcript


  1. Design of Reliable Systems and Networks, ECE 442 • Checkpointing II: Distributed Checkpointing & Recovery • Prof. Ravi K. Iyer, Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign • iyer@crhc.uiuc.edu • http://www.crhc.uiuc.edu/DEPEND

  2. Outline • Checkpoint and recovery in distributed/networked systems • Global consistent state • Recovery line • Synchronous checkpointing and recovery

  3. Recovery in Distributed/Networked Systems • Processes cooperate by exchanging information to accomplish a task • Message passing (distributed systems) • Shared memory (e.g., multiprocessor systems) • Rollback of one process may require that other processes also roll back to an earlier state. • All cooperating processes need to establish recovery points. • Rolling back processes in concurrent systems is more difficult than for a single process due to • Domino effect • Lost messages • Livelocks

  4. Domino Effect • X, Y, Z are cooperating processes; "[" marks a recovery point. • Rollback of X does not affect other processes. • Rollback of Z requires all three processes to roll back to their very first recovery points. [Figure: time lines of processes X (recovery points x1, x2, x3), Y (y1, y2), and Z (z1, z2).]
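The domino effect can be reproduced with a small rollback-propagation sketch in Python. The timeline below is hypothetical: the checkpoint times and messages are assumptions chosen to match the two bullets above, not values read from the original figure, and the name `recovery_line` is illustrative. The sketch only tracks orphan messages (a receive is recorded while the matching send has been undone) and ignores lost messages.

```python
import math

def recovery_line(checkpoints, messages, failed):
    """Return {process: time it is restored to}; math.inf means no rollback.

    checkpoints: {process: ascending list of recovery-point times}
                 (assumes every process has an initial recovery point at 0)
    messages:    list of (sender, send_time, receiver, recv_time)
    failed:      the process that fails and restarts from its latest recovery point
    """
    cut = {p: math.inf for p in checkpoints}
    cut[failed] = checkpoints[failed][-1]
    changed = True
    while changed:
        changed = False
        for sender, t_send, receiver, t_recv in messages:
            # Orphan message: the send is undone but the receive is still recorded.
            if t_send > cut[sender] and t_recv <= cut[receiver]:
                earlier = [c for c in checkpoints[receiver] if c < t_recv]
                cut[receiver] = earlier[-1]  # latest recovery point before the receive
                changed = True
    return cut

# Hypothetical timeline (times are arbitrary units).
checkpoints = {"X": [0, 3, 10], "Y": [0, 6], "Z": [0, 9]}
messages = [
    ("Y", 1, "X", 2),
    ("Z", 4, "Y", 5),
    ("Y", 7, "Z", 8),
    ("Z", 11, "Y", 12),
]

after_x_fails = recovery_line(checkpoints, messages, "X")
after_z_fails = recovery_line(checkpoints, messages, "Z")
print(after_x_fails)
print(after_z_fails)
```

In this timeline a failure of X only restores X to its latest recovery point, while a failure of Z cascades until all three processes are back at their first recovery points.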

  5. Lost Messages and Livelocks • Lost messages: message loss due to rollback recovery. [Figure: processes X and Y with recovery points x1 and y1, a message m, and a failure.] • Livelocks: a livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing the system from making progress. [Figure: three panels of processes X and Y with recovery points x1 and y1 and messages m1, m2, n1, n2; after a failure, the processes roll back to restore a consistent global state.]

  6. Consistent Set of Checkpoints: Recovery Lines • A strongly consistent set of checkpoints (recovery line) corresponds to a strongly consistent global state: • there is one recovery point for each process in the set • during the interval spanned by the checkpoints, there is no information flow between any pair of processes in the set, or between a process in the set and any process outside the set • A consistent set of checkpoints corresponds to a consistent global state. • Set {x1, y1, z1} is a strongly consistent set of checkpoints. • Set {x2, y2, z2} is a consistent set of checkpoints (need to handle lost messages). [Figure: time lines of processes X (x1, x2), Y (y1, y2), and Z (z1, z2).]

  7. Networked/Distributed Systems: Local State • For a site (computer, process) Si, its local state LSi at a given time is defined by the local context of the distributed application. • Notation: send(mij) is the send event of a message mij by Si to Sj; rec(mij) is the receive event of message mij by site Sj; time(x) is the time at which state x was recorded. • We say that send(mij) ∈ LSi iff time(send(mij)) < time(LSi), and rec(mij) ∈ LSj iff time(rec(mij)) < time(LSj). • Two sets of messages are defined for sites Si and Sj: • Transit: transit(LSi, LSj) = {mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj} • Inconsistent: inconsistent(LSi, LSj) = {mij | send(mij) ∉ LSi ∧ rec(mij) ∈ LSj}
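As a minimal sketch of the two message sets, the Python below computes transit and inconsistent from timestamps. The `Message` type, the site names, and all times are assumptions made for illustration; an event counts as recorded in a local state when its time is strictly earlier than the time of that local state, as on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    name: str
    sender: str
    receiver: str
    send_time: float
    recv_time: float

def transit(msgs, ls_time, i, j):
    """Messages from i to j whose send is recorded in LSi but whose receive is not in LSj."""
    return {m.name for m in msgs
            if m.sender == i and m.receiver == j
            and m.send_time < ls_time[i] and not (m.recv_time < ls_time[j])}

def inconsistent(msgs, ls_time, i, j):
    """Messages from i to j whose receive is recorded in LSj but whose send is not in LSi."""
    return {m.name for m in msgs
            if m.sender == i and m.receiver == j
            and not (m.send_time < ls_time[i]) and m.recv_time < ls_time[j]}

# Hypothetical example: LS1 recorded at time 5, LS2 recorded at time 8.
ls_time = {"S1": 5, "S2": 8}
msgs = [
    Message("a", "S1", "S2", 1, 2),  # send and receive both recorded
    Message("b", "S2", "S1", 7, 9),  # sent before LS2, received after LS1
    Message("c", "S1", "S2", 6, 7),  # sent after LS1, received before LS2
]

transit_21 = transit(msgs, ls_time, "S2", "S1")
inconsistent_12 = inconsistent(msgs, ls_time, "S1", "S2")
print(transit_21, inconsistent_12)
```

Message b is still in the channel with respect to these local states, and message c is received without a recorded send, which is the situation a consistent global state has to exclude.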

  8. Networked/Distributed Systems: Global State • A global state (GS) of a system is a collection of the local states of its sites, i.e., GS = {LS1, LS2, …, LSn}, where n is the number of sites in the system. • Consistent global state: a global state GS is consistent iff ∀i, ∀j : 1 ≤ i, j ≤ n :: inconsistent(LSi, LSj) = ∅ (for every received message, a corresponding send event is recorded in the global state). • Transitless global state: a global state GS is transitless iff ∀i, ∀j : 1 ≤ i, j ≤ n :: transit(LSi, LSj) = ∅ (all communication channels are empty). • Strongly consistent global state: a global state that is both consistent and transitless.
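The three definitions can be checked mechanically. The Python sketch below classifies a global state from the times at which each local state was recorded; the function name `classify`, the message list, and the times are assumptions for illustration only.

```python
def classify(ls_time, messages):
    """Classify a global state given {site: time its local state was recorded}.

    messages: list of (sender, send_time, receiver, recv_time)
    """
    has_transit = False
    has_inconsistent = False
    for sender, t_send, receiver, t_recv in messages:
        sent = t_send < ls_time[sender]        # send event is in the sender's local state
        received = t_recv < ls_time[receiver]  # receive event is in the receiver's local state
        if received and not sent:
            has_inconsistent = True            # receive recorded without its send
        if sent and not received:
            has_transit = True                 # message still in the channel
    if has_inconsistent:
        return "inconsistent"
    return "consistent" if has_transit else "strongly consistent"

# Hypothetical run with three sites and two messages.
messages = [("S1", 2, "S2", 3), ("S2", 5, "S3", 7)]

gs_a = classify({"S1": 1, "S2": 1, "S3": 1}, messages)
gs_b = classify({"S1": 4, "S2": 6, "S3": 6}, messages)
gs_c = classify({"S1": 1, "S2": 4, "S3": 4}, messages)
print(gs_a, gs_b, gs_c, sep=" | ")
```

The first state is taken before any message is sent, the second leaves one message in transit, and the third records a receive at S2 whose send at S1 is missing.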

  9. Networked/Distributed Systems: Local/Global State - Examples • The global states: • GS1 = {LS11, LS21, LS31} is a strongly consistent global state. • GS2 = {LS12, LS23, LS33} is a consistent global state. • GS3 = {LS11, LS22, LS32} is an inconsistent global state. [Figure: time lines of sites S1 (local states LS11, LS12), S2 (LS21, LS22, LS23), and S3 (LS31, LS32, LS33).]

  10. Synchronous Checkpointing and Recovery (Koo & Toueg) • Assumptions • Processes communicate by exchanging messages through communication channels. • Channels are FIFO. • End-to-end protocols (such as sliding window) are assumed to cope with message loss due to rollback recovery and communication failures. • Communication failures do not partition the network. • A single process invokes the algorithm. • The checkpoint and the rollback recovery algorithms are not invoked concurrently.

  11. Synchronous Checkpointing and Recovery (Koo & Toueg) • Two types of checkpoints • Permanent - a local checkpoint at a process • Tentative - a temporary checkpoint that is made a permanent checkpoint on the successful termination of the checkpoint algorithm

  12. Checkpoint Algorithm • Phase One • Initiating process Pi takes a tentative checkpoint and requests that all the processes take tentative checkpoints. • Each process informs Pi whether it succeeded in taking a tentative checkpoint. • If Pi learns that all processes have taken tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent. • Otherwise, Pi decides that all tentative checkpoints should be discarded. • Phase Two • Pi propagates its decision to all processes. • On receiving the message from Pi, all processes act accordingly.
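The two phases can be sketched as a single-threaded Python simulation. This is only an illustration under simplifying assumptions: message exchange is replaced by direct method calls, and the class `Process`, the flag `can_checkpoint`, and the function `coordinated_checkpoint` are hypothetical names, not part of the Koo & Toueg description.

```python
class Process:
    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state              # stand-in for the process state
        self.can_checkpoint = can_checkpoint
        self.permanent = None           # last permanent checkpoint
        self.tentative = None           # tentative checkpoint, if any

    def take_tentative(self):
        """Phase one: try to take a tentative checkpoint and report success."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def finish(self, commit):
        """Phase two: make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None

def coordinated_checkpoint(initiator, others):
    # Phase one: the initiator checkpoints and collects every reply.
    replies = [initiator.take_tentative()]
    replies += [p.take_tentative() for p in others]
    commit = all(replies)
    # Phase two: the decision is propagated to all processes.
    for p in [initiator, *others]:
        p.finish(commit)
    return commit

p1, p2, p3 = Process("P1", 5), Process("P2", 7), Process("P3", 9)
first_round = coordinated_checkpoint(p1, [p2, p3])

# Second round: the states advance, but P2 cannot take a checkpoint.
p1.state, p2.state, p3.state = 6, 8, 10
p2.can_checkpoint = False
second_round = coordinated_checkpoint(p1, [p2, p3])
print(first_round, second_round, [p.permanent for p in (p1, p2, p3)])
```

The first round commits every tentative checkpoint; the second round is aborted, so all processes keep the permanent checkpoints from the first round.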

  13. Checkpoint Algorithm (cont.) • Optimization of the checkpoint algorithm • A minimal number of processes take checkpoints. • Processes use a labeling scheme to decide whether to take a checkpoint. • All processes from which Pi has received messages after it has taken its last checkpoint take a checkpoint to record the sending of those messages. [Figure: time lines of processes X (x1, x2), Y (y1, y2), and Z (z1, z2), showing a message m, a tentative checkpoint, and the messages requesting a checkpoint.]

  14. Labeling Scheme • Each process uses monotonically increasing labels in its outgoing messages. • For any two processes X and Y, let m be the last message that X received from Y after X took its last permanent or tentative checkpoint; then last_label_rcvdX[Y] = m.l if m exists, ⊥ otherwise. • Let m be the first message that X sent to Y after X took its last permanent or tentative checkpoint; then first_label_sentX[Y] = m.l if m exists, ⊥ otherwise. • Here m.l is the label of message m, ⊥ is the smallest label, and ⊤ is the largest label.

  15. Labeling Scheme (cont.) • When X requests Y to take a tentative checkpoint, X sends last_label_rcvdX[Y] along with its request; Y takes a tentative checkpoint only if last_label_rcvdX[Y] ≥ first_label_sentY[X] > ⊥ • The checkpoint at X has recorded the receipt of one or more messages sent by Y after Y took its last checkpoint. • Y should take a checkpoint to record the events that send those messages. • Checkpoint cohort: the set of all processes that should be asked to take a checkpoint initiated by X: ckpt_cohortX = {Y | last_label_rcvdX[Y] > ⊥} • These are all the processes from which X has received messages after it took its last checkpoint. • Those processes take a checkpoint to record the sending of those messages.
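The bookkeeping behind the checkpoint condition and the cohort can be sketched in Python. This is a simplified illustration, not the full protocol: delivery is modeled as instantaneous, ⊥ is represented by 0 with real labels starting at 1, and the class and method names (`Proc`, `must_checkpoint`, `ckpt_cohort`) are assumptions.

```python
BOTTOM = 0  # stands for the smallest label; real labels start at 1

class Proc:
    def __init__(self, name, peers):
        self.name = name
        self.next_label = 1
        self.last_label_rcvd = {q: BOTTOM for q in peers}
        self.first_label_sent = {q: BOTTOM for q in peers}

    def send(self, dest):
        """Send a labeled message; delivery is assumed to be instantaneous."""
        label = self.next_label
        self.next_label += 1
        if self.first_label_sent[dest.name] == BOTTOM:
            self.first_label_sent[dest.name] = label
        dest.last_label_rcvd[self.name] = label
        return label

    def checkpoint(self):
        """Taking a checkpoint resets both label tables to the smallest label."""
        for q in self.last_label_rcvd:
            self.last_label_rcvd[q] = BOTTOM
        for q in self.first_label_sent:
            self.first_label_sent[q] = BOTTOM

    def ckpt_cohort(self):
        """Processes from which a message arrived since the last checkpoint."""
        return {q for q, label in self.last_label_rcvd.items() if label > BOTTOM}

    def must_checkpoint(self, requester, last_label_rcvd_by_requester):
        """Condition checked by the process that receives a checkpoint request."""
        first = self.first_label_sent[requester]
        return last_label_rcvd_by_requester >= first > BOTTOM

x = Proc("X", ["Y", "Z"])
y = Proc("Y", ["X", "Z"])
z = Proc("Z", ["X", "Y"])
y.send(x)  # label 1
y.send(x)  # label 2
x.send(z)  # X sends to Z, but Z never sends to X

cohort = x.ckpt_cohort()
y_needed = y.must_checkpoint("X", x.last_label_rcvd["Y"])
z_needed = z.must_checkpoint("X", x.last_label_rcvd["Z"])

y.checkpoint()
y_needed_again = y.must_checkpoint("X", x.last_label_rcvd["Y"])
print(cohort, y_needed, z_needed, y_needed_again)
```

Only Y is in X's cohort, because only Y's sends would otherwise be missing from the recorded global state; once Y has taken a checkpoint, a repeated request with the same label no longer forces another one.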

  16. Rollback Recovery Algorithm • Phase One: • Process Pi checks whether all processes are willing to restart from their previous checkpoints. • A process may reply “no” if it is already participating in a checkpointing or recovery process initiated by some other process. • If all processes are willing to restart from their previous checkpoints, Pi decides that they should restart. • Otherwise, Pi decides that all the processes continue with their normal activities. • Phase Two: • Pi propagates its decision to all processes. • On receiving Pi’s decision, the processes act accordingly.

  17. Rollback Recovery Algorithm (cont.) • Optimization • A minimum number of processes roll back. • Processes use a labeling scheme to decide whether they need to roll back. • Y will restart from its permanent checkpoint only if X is rolling back to a state where the sending of one or more messages from X to Y is being undone. [Figure: time lines of processes X (x1, x2), Y (y1, y2), and Z (z1, z2), with a failure at X.]

  18. Labeling Scheme - extension • For any two processes X and Y, let m be the last message that X sent to Y before its last permanent checkpoint; then last_label_sentX[Y] = m.l if m exists, ⊤ otherwise. • When X requests Y to restart from the permanent checkpoint, it sends last_label_sentX[Y] along with this request. • Y will restart from its permanent checkpoint only if last_label_rcvdY[X] > last_label_sentX[Y], i.e., X is rolling back to a state where the sending of one or more messages from X to Y is being undone.
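The restart condition is a single comparison per process, sketched below in Python. The function name `processes_to_restart`, the process names, and the label values are hypothetical; ⊥ is represented by 0 and ⊤ by infinity.

```python
BOTTOM = 0             # stands for the smallest label
TOP = float("inf")     # stands for the largest label

def processes_to_restart(last_label_sent_x, last_label_rcvd_from_x):
    """Return the processes Y for which last_label_rcvdY[X] > last_label_sentX[Y].

    last_label_sent_x:      {Y: label of the last message X sent to Y before its
                             last permanent checkpoint, or TOP if there is none}
    last_label_rcvd_from_x: {Y: label of the last message Y received from X}
    """
    return {y for y, sent in last_label_sent_x.items()
            if last_label_rcvd_from_x.get(y, BOTTOM) > sent}

# Hypothetical situation when X rolls back to its permanent checkpoint:
#  - Y received label 4 from X, but X's checkpoint only covers labels up to 2
#  - W received label 3, which X sent before its checkpoint
#  - Z never exchanged messages with X
last_label_sent_x = {"Y": 2, "W": 3, "Z": TOP}
last_label_rcvd_from_x = {"Y": 4, "W": 3, "Z": BOTTOM}

restart_set = processes_to_restart(last_label_sent_x, last_label_rcvd_from_x)
print(restart_set)
```

Only Y has received messages whose sending is undone by X's rollback, so only Y is asked to restart.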

  19. Synchronous Checkpointing: Disadvantages • Additional messages must be exchanged to coordinate checkpointing. • Synchronization delays are introduced during normal operations. • No computational messages can be sent while the checkpointing algorithm is in progress. • If failure rarely occurs between successive checkpoints, then the checkpoint algorithm places an unnecessary extra load on the system, which can significantly affect performance.

  20. IRIX Operating System (SGI) Checkpoint and Restart • Facility for saving running process(es) and, at some other time, restarting the saved process(es) from the point already reached, without starting all over again. • A checkpoint image is saved in a set of disk files and can comprise • A set of processes (one or more), e.g., $ cpr -c ckptSep7 -p 1234 where cpr -c is the checkpoint command, ckptSep7 is the statefile name, and the -p option specifies a process ID • All processes in the process group (a set of processes that constitute a logical job) • All processes in a process session (a set of processes started from the same physical or logical terminal) • All processes in an IRIX array session (a set of related processes running on different nodes in an array) • The array service daemon supports checkpointing across the nodes. • To restart a set of processes, the cpr command is used with the option -r: $ cpr -r ckptSep7 • If the restart involves more than one process, all restarts must succeed before any process can run; otherwise all restarts fail.

  21. IRIX Operating System (SGI) Checkpointable & Non-Checkpointable Objects • Checkpointable objects (objects that are checkpoint safe) • Process set ID • User memory (data, text, stack) • Kernel execution state (e.g., signal mask, scheduling information, current and root directory) • System calls • Undelivered and queued signals • List of open files and devices • Pipeline setup and shared memory • Non-checkpointable objects (objects that are not checkpoint safe) • Network socket connections • X terminals and X11 client sessions • Graphic state • File pointers to mounted CD-ROM(s)

  22. IRIX Operating System (SGI) Application Handling of Non-Checkpointable Objects • To handle non-checkpointable objects (e.g., network sockets, file pointers to mounted CD-ROM(s)), an application needs to: • Add an event handler to catch the signals SIGCKPT and SIGRESTART • Run signal handlers to disconnect any open socket (or close open cdFiles and unmount the CD-ROM) before checkpoint, and reconnect the socket (or mount the CD-ROM and reopen the cdFiles) after restart. • Two functions are provided for applications to add cpr event handlers: • atcheckpoint(my_cpt_handler()) adds the application’s checkpoint handling function to the list of functions that get called upon receipt of SIGCKPT • atrestart(my_callback()) registers the application’s callback function for execution upon receipt of SIGRESTART.
