Prof. Ravi K. Iyer Center for Reliable and High-Performance Computing

Design of Reliable Systems and Networks
ECE 442
Checkpointing II
Distributed Checkpointing & Recovery
Prof. Ravi K. Iyer
Center for Reliable and High-Performance Computing
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer@crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND

Outline
• Checkpoint and recovery in distributed/networked systems
• Global consistent state
• Recovery line
• Synchronous checkpointing and recovery

Recovery in Distributed/Networked Systems
• Processes cooperate by exchanging information to accomplish a task
  • Message passing (distributed systems)
  • Shared memory (e.g., multiprocessor systems)
• Rollback of one process may require that other processes also roll back to an earlier state.
• All cooperating processes need to establish recovery points.
• Rolling back processes in concurrent systems is more difficult than for a single process due to
  • Domino effect
  • Lost messages
  • Livelocks

Domino Effect
[Figure: time lines of the cooperating processes X, Y, and Z with recovery points x1, x2, x3 on X, y1, y2 on Y, and z1, z2 on Z; "[" marks a recovery point.]
• X, Y, Z are cooperating processes.
• Rollback of X does not affect other processes.
• Rollback of Z requires all three processes to roll back to their very first recovery points.
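The cascade behind the domino effect can be sketched in a few lines of Python. This is a minimal illustration, not the slide's figure: the checkpoint times and the message pattern below are hypothetical (the original figure's messages did not survive extraction), and `recovery_line` is a name introduced here. The function rolls a receiver back whenever it still records the receipt of a message whose send has been undone.

```python
INF = float("inf")  # "no rollback": the process keeps its current state

def recovery_line(checkpoints, messages, failed):
    """Return {process: time of the checkpoint it restarts from, INF if none}.

    checkpoints: {process: sorted checkpoint times, first one at time 0}
    messages:    list of (sender, send_time, receiver, recv_time)
    """
    line = {p: INF for p in checkpoints}
    line[failed] = checkpoints[failed][-1]  # failed process uses its latest checkpoint
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Orphan message: the send is undone but the receipt is still recorded.
            if sent >= line[sender] and received < line[receiver]:
                # Roll the receiver back to its latest checkpoint not after the receipt.
                line[receiver] = max(c for c in checkpoints[receiver] if c <= received)
                changed = True
    return line

# Hypothetical scenario (assumed data, chosen to reproduce the slide's two claims).
checkpoints = {"X": [0, 4, 8], "Y": [0, 5], "Z": [0, 6]}
messages = [
    ("Z", 7.0, "Y", 7.5),
    ("Y", 5.5, "X", 6.0),
    ("X", 4.5, "Y", 4.8),
    ("Y", 3.0, "Z", 3.5),
    ("Y", 1.0, "X", 2.0),
]

after_x_fails = recovery_line(checkpoints, messages, "X")
after_z_fails = recovery_line(checkpoints, messages, "Z")
print(after_x_fails)
print(after_z_fails)
```

With this assumed pattern, a failure of X leaves Y and Z untouched, while a failure of Z drags all three processes back to their first recovery points.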

Lost Messages and Livelocks
Lost Messages
• Message loss due to rollback recovery
[Figure: time lines of processes X and Y with recovery points x1 and y1, a message m, and a failure.]
Livelocks
• Livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing the system from making progress.
[Figure: three successive time lines of processes X and Y with recovery points x1 and y1 and messages m1, m2, n1, n2, showing a failure followed by a rollback to restore a consistent global state.]

Consistent Set of Checkpoints: Recovery Lines
• A strongly consistent set of checkpoints (recovery line) corresponds to a strongly consistent global state.
  • There is one recovery point for each process in the set.
  • During the interval spanned by the checkpoints, there is no information flow between any
    • pair of processes in the set
    • process in the set and any process outside the set
• A consistent set of checkpoints corresponds to a consistent global state.
[Figure: time lines of processes X, Y, and Z with checkpoints x1, x2 on X, y1, y2 on Y, and z1, z2 on Z.]
• Set {x1, y1, z1} is a strongly consistent set of checkpoints.
• Set {x2, y2, z2} is a consistent set of checkpoints (need to handle lost messages).

Networked/Distributed Systems: Local State
• For a site (computer, process) $S_i$, its local state $LS_i$ at a given time is defined by the local context of the distributed application.
• Let us denote:
  • $\mathit{send}(m_{ij})$ - send event of a message $m_{ij}$ by $S_i$ to $S_j$
  • $\mathit{rec}(m_{ij})$ - receive event of message $m_{ij}$ by site $S_j$
  • $\mathit{time}(x)$ - time at which state $x$ was recorded
• We say that
  $\mathit{send}(m_{ij}) \in LS_i$ iff $\mathit{time}(\mathit{send}(m_{ij})) < \mathit{time}(LS_i)$
  $\mathit{rec}(m_{ij}) \in LS_j$ iff $\mathit{time}(\mathit{rec}(m_{ij})) < \mathit{time}(LS_j)$
• Two sets of messages are defined for sites $S_i$ and $S_j$:
  • Transit: $\mathit{transit}(LS_i, LS_j) = \{\, m_{ij} \mid \mathit{send}(m_{ij}) \in LS_i \wedge \mathit{rec}(m_{ij}) \notin LS_j \,\}$
  • Inconsistent: $\mathit{inconsistent}(LS_i, LS_j) = \{\, m_{ij} \mid \mathit{send}(m_{ij}) \notin LS_i \wedge \mathit{rec}(m_{ij}) \in LS_j \,\}$

Networked/Distributed Systems: Global State
• A global state (GS) of a system is a collection of the local states of its sites, i.e., $GS = \{LS_1, LS_2, \ldots, LS_n\}$, where $n$ is the number of sites in the system.
• Consistent global state: A global state GS is consistent iff
  $\forall i, \forall j : 1 \le i, j \le n :: \mathit{inconsistent}(LS_i, LS_j) = \emptyset$
• Transitless global state: A global state GS is transitless iff
  $\forall i, \forall j : 1 \le i, j \le n :: \mathit{transit}(LS_i, LS_j) = \emptyset$
• Strongly consistent global state: A global state that is both consistent and transitless
  • For every received message, a corresponding send event is recorded in the global state.
  • All communication channels are empty.
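The three definitions can be checked mechanically. The sketch below is an illustration under stated assumptions: it uses a single global clock for time(x), as the slides do, and the `Message` type, the `cut` dictionary (site to the time its local state was recorded), and the `classify` helper are names introduced here, not part of the lecture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    sender: int
    receiver: int
    send_time: float
    recv_time: float

def transit(msgs, cut, i, j):
    """Messages from i to j whose send is in LS_i but whose receipt is not in LS_j."""
    return {m for m in msgs if m.sender == i and m.receiver == j
            and m.send_time < cut[i] and not (m.recv_time < cut[j])}

def inconsistent(msgs, cut, i, j):
    """Messages from i to j whose receipt is in LS_j but whose send is not in LS_i."""
    return {m for m in msgs if m.sender == i and m.receiver == j
            and not (m.send_time < cut[i]) and m.recv_time < cut[j]}

def classify(msgs, cut):
    """Classify the global state {LS_i recorded at time cut[i]}."""
    pairs = [(i, j) for i in cut for j in cut if i != j]
    is_consistent = all(not inconsistent(msgs, cut, i, j) for i, j in pairs)
    is_transitless = all(not transit(msgs, cut, i, j) for i, j in pairs)
    if is_consistent and is_transitless:
        return "strongly consistent"
    return "consistent" if is_consistent else "inconsistent"

# One message from site 1 to site 2, sent at t=5 and received at t=7 (assumed data).
msgs = [Message(sender=1, receiver=2, send_time=5, recv_time=7)]
print(classify(msgs, {1: 3, 2: 4}))  # neither send nor receipt recorded
print(classify(msgs, {1: 6, 2: 6}))  # send recorded, receipt not: message in transit
print(classify(msgs, {1: 4, 2: 8}))  # receipt recorded, send not: orphan message
```

The three calls correspond to the three kinds of global state: nothing recorded on either side, a message still in the channel, and a receipt without a matching send.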

Networked/Distributed Systems: Local/Global State - Examples
[Figure: time lines of sites S1, S2, and S3 with local states LS11, LS12 on S1, LS21, LS22, LS23 on S2, and LS31, LS32, LS33 on S3.]
The global states:
• GS1 = {LS11, LS21, LS31} is a strongly consistent global state.
• GS2 = {LS12, LS23, LS33} is a consistent global state.
• GS3 = {LS11, LS22, LS32} is an inconsistent global state.

Synchronous Checkpointing and Recovery (Koo & Toueg)
• Assumptions
  • Processes communicate by exchanging messages through communication channels.
  • Channels are FIFO.
  • End-to-end protocols (such as sliding window) are assumed to cope with message loss due to rollback recovery and communication failures.
  • Communication failures do not partition the network.
  • A single process invokes the algorithm.
  • The checkpoint and the rollback recovery algorithms are not invoked concurrently.

Synchronous Checkpointing and Recovery (Koo & Toueg)
• Two types of checkpoints
  • Permanent - a local checkpoint at a process
  • Tentative - a temporary checkpoint that is made a permanent checkpoint on the successful termination of the checkpoint algorithm

Checkpoint Algorithm
• Phase One
  • Initiating process Pi takes a tentative checkpoint and requests that all the processes take tentative checkpoints.
  • Each process informs Pi whether it succeeded in taking a tentative checkpoint.
  • If Pi learns that all processes have taken tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent.
  • Otherwise, Pi decides that all tentative checkpoints should be discarded.
• Phase Two
  • Pi propagates its decision to all processes.
  • On receiving the message from Pi, all processes act accordingly.
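The two phases can be sketched as a single-threaded simulation. This is an illustration only: the `Process` class, its `can_checkpoint` flag (standing in for a process that fails to take a tentative checkpoint), and `coordinated_checkpoint` are names assumed here, and real message exchange, timeouts, and failures during the protocol are left out.

```python
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # hypothetical switch to simulate a refusal
        self.state = 0          # stand-in for the application state
        self.permanent = 0      # last permanent checkpoint
        self.tentative = None   # tentative checkpoint, if any

    def take_tentative(self):
        """Phase one: try to save the current state as a tentative checkpoint."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def commit(self):
        """Phase two, decision 'make permanent'."""
        if self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None

    def discard(self):
        """Phase two, decision 'discard'."""
        self.tentative = None

def coordinated_checkpoint(initiator, others):
    everyone = [initiator, *others]
    # Phase one: the initiator and all other processes take tentative checkpoints
    # and report whether they succeeded.
    replies = [p.take_tentative() for p in everyone]
    decision = all(replies)
    # Phase two: the initiator propagates its decision; every process acts on it.
    for p in everyone:
        if decision:
            p.commit()
        else:
            p.discard()
    return decision

procs = [Process("P1"), Process("P2"), Process("P3")]
for n, p in enumerate(procs):
    p.state = 10 + n
first = coordinated_checkpoint(procs[0], procs[1:])
after_first = [p.permanent for p in procs]

procs[2].can_checkpoint = False       # P3 now refuses
for p in procs:
    p.state += 5
second = coordinated_checkpoint(procs[0], procs[1:])
after_second = [p.permanent for p in procs]
print(first, after_first, second, after_second)
```

In the second round one refusal is enough to discard every tentative checkpoint, so the permanent checkpoints stay at the values committed in the first round.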

Checkpoint Algorithm (cont.)
• Optimization of the checkpoint algorithm
  • A minimal number of processes take checkpoints.
  • Processes use a labeling scheme to decide whether to take a checkpoint.
  • All processes from which Pi has received messages after it has taken its last checkpoint take a checkpoint to record the sending of those messages.
[Figure: time lines of processes X, Y, and Z with checkpoints x1, x2 on X, y1, y2 on Y, and z1, z2 on Z, a message m, a tentative checkpoint, and the messages requesting a checkpoint.]

Labeling Scheme
• Each process uses monotonically increasing labels in its outgoing messages.
• For any two processes X and Y, let m be the last message that X received from Y after X took its last permanent or tentative checkpoint. Then
  $$\mathit{last\_label\_rcvd}_X[Y] = \begin{cases} m.l & \text{if } m \text{ exists} \\ \bot & \text{otherwise} \end{cases}$$
• Let m be the first message that X sent to Y after X took its last permanent or tentative checkpoint. Then
  $$\mathit{first\_label\_sent}_X[Y] = \begin{cases} m.l & \text{if } m \text{ exists} \\ \bot & \text{otherwise} \end{cases}$$
• $\bot$ = smallest label, $\top$ = largest label

Labeling Scheme (cont.)
• When X requests Y to take a tentative checkpoint, X sends $\mathit{last\_label\_rcvd}_X[Y]$ along with its request; Y takes a tentative checkpoint only if
  $\mathit{last\_label\_rcvd}_X[Y] \ge \mathit{first\_label\_sent}_Y[X] > \bot$
• Checkpoint cohort - set of all processes that should be asked to take a checkpoint initiated by X:
  $\mathit{ckpt\_cohort}_X = \{\, Y \mid \mathit{last\_label\_rcvd}_X[Y] > \bot \,\}$
• The checkpoint at X has recorded the receipt of one or more messages sent by Y after Y took its last checkpoint.
• Y should take a checkpoint to record the events that send those messages.
• The cohort consists of all processes from which X has received messages after it has taken its last checkpoint.
• Those processes take a checkpoint to record the sending of those messages.
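A small sketch of the labeling scheme follows, assuming immediate message delivery and using negative infinity as a stand-in for the smallest label. The `Process` class and the method names `reset_labels`, `ckpt_cohort`, and `should_checkpoint` are introduced here for illustration only.

```python
BOTTOM = float("-inf")  # stand-in for the smallest label

class Process:
    def __init__(self, name, peers):
        self.name = name
        self.peers = list(peers)
        self.next_label = 1  # labels on outgoing messages increase monotonically
        self.reset_labels()

    def reset_labels(self):
        """Called when this process takes a tentative or permanent checkpoint."""
        self.last_label_rcvd = {p: BOTTOM for p in self.peers}
        self.first_label_sent = {p: BOTTOM for p in self.peers}

    def send(self, dest):
        label = self.next_label
        self.next_label += 1
        if self.first_label_sent[dest.name] == BOTTOM:
            self.first_label_sent[dest.name] = label  # first message since checkpoint
        dest.last_label_rcvd[self.name] = label       # assumed: delivered at once

    def ckpt_cohort(self):
        """Processes this one has heard from since its last checkpoint."""
        return {y for y, label in self.last_label_rcvd.items() if label > BOTTOM}

    def should_checkpoint(self, requester, requester_last_label_rcvd):
        """Condition evaluated by Y when X's request carries last_label_rcvd_X[Y]."""
        return requester_last_label_rcvd >= self.first_label_sent[requester] > BOTTOM

x = Process("X", ["Y", "Z"])
y = Process("Y", ["X", "Z"])
z = Process("Z", ["X", "Y"])
y.send(x)                                   # Y -> X after both took their last checkpoints

cohort = x.ckpt_cohort()
y_needed = y.should_checkpoint("X", x.last_label_rcvd["Y"])
z_needed = z.should_checkpoint("X", x.last_label_rcvd["Z"])
y.reset_labels()                            # Y takes its tentative checkpoint
y_needed_again = y.should_checkpoint("X", x.last_label_rcvd["Y"])
print(cohort, y_needed, z_needed, y_needed_again)
```

Only Y ends up in X's cohort, and once Y has checkpointed, a repeated request from X no longer forces another checkpoint.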

Rollback Recovery Algorithm
• Phase One:
  • Process Pi checks whether all processes are willing to restart from their previous checkpoints.
  • A process may reply “no” if it is already participating in a checkpointing or recovering process initiated by some other process.
  • If all processes are willing to restart from their previous checkpoints, Pi decides that they should restart.
  • Otherwise, Pi decides that all the processes continue with their normal activities.
• Phase Two:
  • Pi propagates its decision to all processes.
  • On receiving Pi’s decision, the processes act accordingly.

Rollback Recovery Algorithm (cont.)
• Optimization
  • A minimum number of processes roll back.
  • Processes use a labeling scheme to decide whether they need to roll back.
  • Y will restart from its permanent checkpoint only if X is rolling back to a state where the sending of one or more messages from X to Y is being undone.
[Figure: time lines of processes X, Y, and Z with checkpoints x1, x2 on X, y1, y2 on Y, and z1, z2 on Z, and a failure of X.]

Labeling Scheme - extension
• For any two processes X and Y, let m be the last message that X sent to Y before its last permanent checkpoint. Then
  $$\mathit{last\_label\_sent}_X[Y] = \begin{cases} m.l & \text{if } m \text{ exists} \\ \top & \text{otherwise} \end{cases}$$
• When X requests Y to restart from the permanent checkpoint, it sends $\mathit{last\_label\_sent}_X[Y]$ along with this request.
• Y will restart from its permanent checkpoint only if
  $\mathit{last\_label\_rcvd}_Y[X] > \mathit{last\_label\_sent}_X[Y]$
  that is, X is rolling back to a state where the sending of one or more messages from X to Y is being undone.
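The restart test can be written down directly from these definitions. The sketch below covers only the single decision Y makes on receiving X's request. It does not model the two-phase exchange or the way a restarting process would in turn ask others to restart. The helper names and the label values are assumptions made for illustration.

```python
BOTTOM = float("-inf")  # stand-in for the smallest label
TOP = float("inf")      # stand-in for the largest label

def last_label_sent(labels_sent_before_ckpt):
    """Label of the last message X sent to Y before X's last permanent checkpoint."""
    return labels_sent_before_ckpt[-1] if labels_sent_before_ckpt else TOP

def last_label_rcvd(labels_rcvd_since_ckpt):
    """Label of the last message Y received from X since Y's last checkpoint."""
    return labels_rcvd_since_ckpt[-1] if labels_rcvd_since_ckpt else BOTTOM

def must_restart(rcvd_by_y_from_x, sent_by_x_to_y):
    """Y restarts only if it received a message that X's rollback undoes."""
    return rcvd_by_y_from_x > sent_by_x_to_y

# Assumed history: before its permanent checkpoint X sent labels 1 and 3 to Y
# and label 2 to Z; after the checkpoint X sent label 4 to Y, then failed.
x_sent_before_ckpt = {"Y": [1, 3], "Z": [2]}
rcvd_from_x_since_ckpt = {"Y": [4], "Z": []}

decisions = {
    peer: must_restart(last_label_rcvd(rcvd_from_x_since_ckpt[peer]),
                       last_label_sent(x_sent_before_ckpt[peer]))
    for peer in ("Y", "Z")
}
print(decisions)
```

Y has received label 4, which X's rollback undoes, so Y restarts. Z has received nothing from X since its checkpoint and continues.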

Synchronous Checkpointing: Disadvantages
• Additional messages must be exchanged to coordinate checkpointing.
• Synchronization delays are introduced during normal operations.
• No computational messages can be sent while the checkpointing algorithm is in progress.
• If failure rarely occurs between successive checkpoints, then the checkpoint algorithm places an unnecessary extra load on the system, which can significantly affect performance.

IRIX Operating System (SGI): Checkpoint and Restart
• Facility for saving running process(es) and, at some other time, restarting the saved process(es) from the point already reached, without starting all over again.
• A checkpoint image is saved in a set of disk files and can comprise
  • A set of processes (one or more), e.g.,
      $ cpr -c ckptSep7 -p 1234
    where cpr -c is the checkpoint command, ckptSep7 is the statefile name, and the -p option specifies a process ID
  • All processes in the process group (a set of processes that constitute a logical job)
  • All processes in a process session (a set of processes started from the same physical or logical terminal)
  • All processes in an IRIX array session (a set of related processes running on different nodes in an array)
    • The array service daemon supports checkpointing across the nodes.
• To restart a set of processes, the cpr command is used with the option -r:
      $ cpr -r ckptSep7
• If the restart involves more than one process, all restarts must succeed before any process can run; otherwise all restarts fail.

IRIX Operating System (SGI): Checkpointable & Non-Checkpointable Objects
• Checkpointable objects (objects that are checkpoint safe)
  • Process set ID
  • User memory (data, text, stack)
  • Kernel execution state (e.g., signal mask, scheduling information, current and root directory)
  • System calls
  • Undelivered and queued signals
  • List of open files and devices
  • Pipeline setup and shared memory
• Non-checkpointable objects (objects that are not checkpoint safe)
  • Network socket connections
  • X terminals and X11 client sessions
  • Graphic state
  • File pointers to mounted CD-ROM(s)

IRIX Operating System (SGI): Application Handling of Non-Checkpointable Objects
• To handle non-checkpointable objects (e.g., network sockets, file pointers to mounted CD-ROM(s)), an application needs to:
  • Add an event handler to catch the signals SIGCKPT and SIGRESTART
  • Run signal handlers to disconnect any open socket (or close open cdFiles and unmount the CD-ROM) before checkpoint, and to reconnect the socket (or mount the CD-ROM and reopen the cdFiles) after restart.
• Two functions are provided for applications to add cpr event handlers:
  • atcheckpoint(my_cpt_handler()) adds the application’s checkpoint handling function to the list of functions that get called upon receipt of SIGCKPT.
  • atrestart(my_callback()) registers the application’s callback function for execution upon receipt of SIGRESTART.
