Checkpointing and Recovery
Purpose • Consider a long-running application • Regularly checkpoint the application • Expensive task • In case of failure, restore to the previous checkpoint • What happens in the case of a distributed application? • One (or more) processes fail • Restoration to the previous checkpoint should be done consistently
What to Save? • Depends on application • Could be as simple as just program counter information • Could be the state of the entire process, including messages received, etc
Stable Storage • Checkpoints must survive failure of processes (including failure during a disk write) • A simple approach for stable storage
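The slide does not spell out which approach it has in mind. One common way to keep a checkpoint intact when the process dies in the middle of a write is to write to a temporary file and then rename it atomically. The sketch below assumes a single local disk and a picklable state; the names `save_checkpoint` and `load_checkpoint` are illustrative and do not come from the slides.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` to `path` so that a crash mid-write leaves the old file intact."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory (same filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before renaming
        # Atomic rename: readers see either the old or the new checkpoint.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read back the most recent complete checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

This protects against a process failure during the write, not against loss of the disk itself; surviving that would need a second copy on independent storage.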
Approaches • Asynchronous • The local checkpoints at different processes are taken independently • Synchronous • The local checkpoints at different processes are coordinated • They need not be taken at the same time
Asynchronous Checkpointing • Problem • Domino effect • [Figure: rollback propagating from a failed process]
Other Issues with Asynchronous Checkpointing • Useless checkpoints • Need for garbage collection • Recovery requires significant coordination
Asynchronous Checkpointing (Continued) • Identify dependencies between different checkpoint intervals • This information is stored along with the checkpoints in stable storage • When a failed process recovers, it requests this information from the others to determine the need for rollback
Two Examples of Asynchronous Checkpointing • Bhargava and Lian • Wang et al
Algorithm by Bhargava and Lian • Draw an edge from c_{i,x} to c_{j,y} if either • i = j and y = x+1 • i ≠ j and a message m is sent from I_{i,x} and received in I_{j,y} • Where I_{i,x} is the interval between c_{i,x-1} and c_{i,x} • Rollback recovery line used for recovery as well as garbage collection
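The edge rule above can be turned into a small reachability computation. The sketch below is one reading of it, under stated assumptions: checkpoints of each process are numbered from 0 (the initial state), the last checkpoint of each process is a "volatile" one standing for its state at the time of failure, and every process has at least two checkpoints. The function name `recovery_line` and the message tuple format are hypothetical.

```python
from collections import defaultdict, deque


def recovery_line(num_ckpts, messages, failed):
    """Return, per process, the index of the checkpoint to roll back to.

    num_ckpts[i]  number of checkpoints of process i, the last one volatile
    messages      tuples (i, x, j, y): sent in interval I_{i,x}, received in I_{j,y}
    failed        set of indices of failed processes
    """
    edges = defaultdict(list)
    # Rule 1: c_{i,x} -> c_{i,x+1} within the same process.
    for i, n in enumerate(num_ckpts):
        for x in range(n - 1):
            edges[(i, x)].append((i, x + 1))
    # Rule 2: c_{i,x} -> c_{j,y} for a message between different processes.
    for i, x, j, y in messages:
        if i != j:
            edges[(i, x)].append((j, y))

    # Mark the volatile checkpoints of failed processes, then everything reachable.
    marked = {(i, num_ckpts[i] - 1) for i in failed}
    queue = deque(marked)
    while queue:
        node = queue.popleft()
        for nxt in edges[node]:
            if nxt not in marked:
                marked.add(nxt)
                queue.append(nxt)

    # The recovery line is the last unmarked checkpoint of each process.
    return [max(x for x in range(n) if (i, x) not in marked)
            for i, n in enumerate(num_ckpts)]
```

For example, with two processes that each have checkpoints 0, 1 and a volatile checkpoint 2, `recovery_line([3, 3], [(0, 2, 1, 1)], {0})` gives `[1, 0]`: process 0 loses the send, so process 1 must drop the checkpoint that recorded the receive. Adding the message `(1, 1, 0, 1)` pushes both processes back to checkpoint 0, which is the domino effect.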
Algorithm by Wang et al • Difference • If a message sent from I_{i,x} is received in I_{j,y}, then draw an edge from c_{i,x-1} to c_{j,y} • Recovery line obtained is similar to that by Bhargava and Lian • Advantage • Number of useful checkpoints is at most N(N+1)/2 • This can be shown by bounding the number of checkpoints that are ahead of the recovery line
Coordinated Checkpointing • Using diffusing computation • How can we use diffusing computation to obtain a consistent snapshot?
Algorithm by Tamir and Sequin • Blocking checkpoint • A coordinator decides when a checkpoint is taken • Coordinator sends a request message to all • Each process • Stops executing • Flushes the channels • Takes a tentative checkpoint • Replies to coordinator • When all processes send replies, the coordinator asks them to change it to a permanent checkpoint
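The two-phase structure above can be sketched as a single-threaded simulation. This is an illustration of the message pattern only, not the published algorithm: channel flushing is not modeled, the class and function names are hypothetical, and resuming execution after the commit is an assumption.

```python
class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.running = True
        self.tentative = None   # checkpoint taken but not yet committed
        self.permanent = None   # last committed checkpoint

    def on_checkpoint_request(self):
        self.running = False                 # stop executing
        # Channel flushing would happen here (not modeled in this sketch).
        self.tentative = dict(self.state)    # take a tentative checkpoint
        return "ack"                         # reply to the coordinator

    def on_commit(self):
        self.permanent = self.tentative      # tentative becomes permanent
        self.tentative = None
        self.running = True                  # assumption: resume after commit


def coordinated_checkpoint(processes):
    """Coordinator: request, collect all replies, then ask everyone to commit."""
    replies = [p.on_checkpoint_request() for p in processes]
    if all(r == "ack" for r in replies):
        for p in processes:
            p.on_commit()
        return True
    return False
```

Keeping the old permanent checkpoint until the commit arrives means a failure during the first phase still leaves every process with a usable checkpoint.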
Algorithm by Tamir and Sequin • How many checkpoints need to be stored per process?
Tamir and Sequin assume a fully connected graph? • How would you do it if the graph were not fully connected? • Use diffusing computation • Each node stops the 'original computation' when it propagates the diffusing computation • Each node takes a tentative checkpoint at completion • Channel flushing is achieved in between
Checkpointing in Timed Systems • If perfectly synchronized clocks?
Checkpointing in Timed Systems • What if clocks are loosely synchronized? • Max clock drift, δ, is known? • All processes take a checkpoint at a fixed (local) time • After the checkpoint, a process does not send any messages for 2δ • The set of local checkpoints is guaranteed to be consistent
Minimal Checkpoint Coordination • Approach by Koo and Toueg • Require processes to take a checkpoint only if they have to
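One way to read "only if they have to" is as a transitive closure: the initiator must involve every process it has received a message from since its last checkpoint, and those processes must do the same. The sketch below computes that set; the function name and the `received_from` map are hypothetical, and the rest of the protocol (tentative checkpoints, commit, failures) is left out.

```python
def processes_to_checkpoint(initiator, received_from):
    """Return the set of processes that must checkpoint with the initiator.

    received_from[p] is the set of processes from which p has received
    a message since p's last checkpoint.
    """
    must = {initiator}
    stack = [initiator]
    while stack:
        p = stack.pop()
        for q in received_from.get(p, ()):
            if q not in must:
                must.add(q)      # q's send must also be recorded
                stack.append(q)
    return must
```

With `received_from = {"P1": {"P2"}, "P2": {"P3"}, "P4": {"P1"}}` and initiator `"P1"`, the result is `{"P1", "P2", "P3"}`. `P4` only received from `P1`, so it is not forced to checkpoint.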
Logging Protocols • Pessimistic • Optimistic • Causal
Concept of Logging • If a restarted process is guaranteed to behave exactly as it did before the failure, then other processes need not be aborted • Log non-deterministic events
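A minimal sketch of the idea, assuming message delivery is the only non-deterministic event and that an in-memory list stands in for stable storage. The class `LoggedProcess` and its update rule are invented for illustration; the rule is deliberately order-sensitive so that replaying the log in order matters.

```python
class LoggedProcess:
    def __init__(self):
        self.checkpoint = 0   # last saved state
        self.state = 0
        self.log = []         # deliveries since the last checkpoint

    def _apply(self, value):
        # Deterministic, order-sensitive state update.
        self.state = self.state * 2 + value

    def deliver(self, value):
        self.log.append(value)   # log the non-deterministic event first
        self._apply(value)

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()         # older log entries are no longer needed

    def recover(self):
        # Restore the checkpoint, then replay logged deliveries in order.
        self.state = self.checkpoint
        for value in self.log:
            self._apply(value)
```

After `recover()`, the process is back in the state it had just before the failure, so processes that received its messages see nothing inconsistent.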
Definitions • Depend(m) • Processes that depend on m • Stable(m) • m stored on stable storage • Log(m) • Processes that have logged m • C • Set of failed processes
Pessimistic Protocols • Not Stable(m) => |Depend(m)| = 0 • What if • Not Stable(m) => |Depend(m)| <= 1
Causal Protocols • Save m on volatile memory of other processes • Ensure • Depend(m) ⊆ Log(m)
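The conditions on the pessimistic and causal slides can be written as predicates over sets. The sketch below encodes them exactly as stated on the slides, with Depend and Log represented as dictionaries from message ids to sets of process names and Stable as a set of message ids. The function names and this representation are assumptions made for illustration.

```python
def pessimistic_ok(depend, stable):
    """Not Stable(m) => |Depend(m)| = 0, for every message m."""
    return all(len(depend[m]) == 0 for m in depend if m not in stable)


def causal_ok(depend, log):
    """Depend(m) is a subset of Log(m), for every message m."""
    return all(depend[m] <= log.get(m, set()) for m in depend)


# Hypothetical example: m1 is stable, m2 is not and nobody depends on it yet.
depend = {"m1": {"P", "Q"}, "m2": set()}
stable = {"m1"}
log = {"m1": {"P", "Q", "R"}, "m2": set()}
```

In this example both predicates hold. If `Q` comes to depend on `m2` before `m2` is stable or logged at `Q`, both conditions are violated.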