Checkpointing and Logging: Ensuring Secure and Dependable Computing

EEC 688/788 Secure and Dependable Computing
Lecture 7
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
wenbing@ieee.org

Outline
- Checkpointing and logging
- Checkpoint-based protocols
  - Uncoordinated checkpointing
  - Coordinated checkpointing
- Logging-based protocols
  - Pessimistic logging
  - Optimistic logging
  - Causal logging

Chandy and Lamport Distributed Snapshot Protocol
- The CL snapshot protocol is nonblocking, whereas the TS checkpointing protocol is blocking
- The CL protocol is more desirable for applications that do not wish to suspend normal operation
- However, the CL protocol is concerned only with how to obtain a consistent global checkpoint
- CL protocol: there is no coordinator; any node may initiate a global checkpoint
- Data structures
  - Marker message: equivalent to the CHECKPOINT message
  - Marker certificate: keeps track of whether a marker has been received from every incoming channel

CL Distributed Snapshot Protocol

Example
- P1->P0 channel state: m0
- P2->P1 channel state: m1
- All other channel states: empty
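The marker rules can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions, not the lecture's own code: the `Network`, `Process`, `transfer`, and `run_example` names, the use of account balances as application state, and the fully connected FIFO channels are all hypothetical. The delivery order is chosen so that the recorded channel states mirror the example (m0 in transit on P1->P0, m1 in transit on P2->P1).

```python
from collections import deque

MARKER = "MARKER"  # hypothetical sentinel standing in for the marker message


class Network:
    """FIFO channels between every ordered pair of processes (an assumption)."""

    def __init__(self, procs):
        self.procs = {p.pid: p for p in procs}
        self.channels = {(s, d): deque()
                         for s in self.procs for d in self.procs if s != d}
        for p in procs:
            p.incoming = [s for s in self.procs if s != p.pid]

    def send(self, src, dst, msg):
        self.channels[(src, dst)].append(msg)

    def deliver(self, src, dst):
        """Deliver the oldest message on the src->dst channel."""
        msg = self.channels[(src, dst)].popleft()
        self.procs[dst].receive(src, msg, self)


class Process:
    def __init__(self, pid, balance):
        self.pid = pid
        self.balance = balance     # application state
        self.incoming = []         # peer ids, filled in by Network
        self.checkpoint = None     # local state recorded for the snapshot
        self.marker_seen = set()   # marker certificate: channels whose marker arrived
        self.channel_state = {}    # src -> messages recorded as in transit

    def transfer(self, dst, amount, net):
        """Application message: move `amount` to another process."""
        self.balance -= amount
        net.send(self.pid, dst, amount)

    def take_checkpoint(self, net):
        self.checkpoint = self.balance
        self.channel_state = {src: [] for src in self.incoming}
        for dst in self.incoming:  # fully connected, so the peers are the same set
            net.send(self.pid, dst, MARKER)

    def receive(self, src, msg, net):
        if msg == MARKER:
            if self.checkpoint is None:   # first marker: checkpoint, then relay
                self.take_checkpoint(net)
            self.marker_seen.add(src)     # stop recording this channel
            return
        self.balance += msg               # normal operation is never suspended
        if self.checkpoint is not None and src not in self.marker_seen:
            self.channel_state[src].append(msg)

    def snapshot_complete(self):
        return (self.checkpoint is not None
                and self.marker_seen == set(self.incoming))


def run_example():
    p0, p1, p2 = Process(0, 10), Process(1, 10), Process(2, 10)
    net = Network([p0, p1, p2])
    p1.transfer(0, 5, net)     # plays the role of m0
    p2.transfer(1, 7, net)     # plays the role of m1
    p0.take_checkpoint(net)    # any node may initiate; here it is P0
    for src, dst in [(0, 1), (2, 1), (1, 0), (0, 2),
                     (1, 0), (1, 2), (2, 0), (2, 1)]:
        net.deliver(src, dst)
    return p0, p1, p2
```

Because transfers only move value around, the checkpointed balances plus the recorded in-transit amounts should add up to the initial total of 30, which is one simple way to sanity-check that the global checkpoint is consistent.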

Comparison of TS & CL Protocols: Similarities
- Both rely on control msgs to coordinate checkpointing
- Both capture channel state in virtually the same way
  - Start logging channel state upon receiving the 1st checkpoint msg from another channel
  - Stop logging the state of an incoming channel once the checkpoint msg has been received on that channel
- Communication overhead is similar

Comparison of TS & CL Protocols: Differences
- They differ in their strategies for producing a global checkpoint
  - The TS protocol suspends normal operation upon the 1st checkpoint msg, while CL does not
  - The TS protocol captures channel state prior to taking a checkpoint, while CL captures channel state after taking a checkpoint
- The TS protocol is more complete and robust than CL: it has a fault handling mechanism

Log Based Protocols
- Work might be lost upon recovery using checkpoint-based protocols
- By logging messages, we may be able to recover the system to where it was prior to the failure
- System model: the execution of a process is modeled as a set of consecutive state intervals
  - Each interval is initiated by a nondeterministic event or the initial state
  - We assume the only type of nondeterministic event is the receiving of a message

Log Based Protocols
- In practice, logging is always used together with checkpointing
  - Limits the recovery time: start with the latest checkpoint instead of from the initial state
  - Limits the size of the log: after taking a checkpoint, previously logged events can be purged
- Logging protocol types
  - Pessimistic logging: msgs are logged prior to execution
  - Optimistic logging: msgs are logged asynchronously
  - Causal logging: nondeterministic events that are not yet logged (to stable storage) are piggybacked on each msg sent
- For optimistic and causal logging, the dependencies between processes have to be tracked => more complexity, longer recovery time

Pessimistic Logging
- Synchronously log every incoming message to stable storage prior to execution
- Each process periodically checkpoints its state: no need for coordination
- Recovery: a process restores its state using the last checkpoint and replays all logged incoming msgs
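A minimal sketch of the log-then-execute discipline, assuming a local directory stands in for stable storage and the application state is a running sum. The class name `PessimisticProcess`, the file names, and the message shape `{"value": ...}` are hypothetical choices made for illustration.

```python
import json
import os


class PessimisticProcess:
    def __init__(self, directory):
        self.log_path = os.path.join(directory, "msg.log")
        self.ckpt_path = os.path.join(directory, "ckpt.json")
        self.state = 0

    def _log(self, msg):
        # Synchronous logging: do not return until the record is on disk.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(msg) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def _execute(self, msg):
        self.state += msg["value"]

    def receive(self, msg):
        self._log(msg)      # 1. log to stable storage
        self._execute(msg)  # 2. only then execute

    def checkpoint(self):
        # Write the checkpoint to a temp file, then rename it into place.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"state": self.state}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.ckpt_path)
        # Messages before the checkpoint can now be purged. A crash between
        # the rename and this truncation is NOT handled in this sketch.
        open(self.log_path, "w").close()

    @classmethod
    def recover(cls, directory):
        """Restore the last checkpoint, then replay the logged messages."""
        p = cls(directory)
        if os.path.exists(p.ckpt_path):
            with open(p.ckpt_path) as f:
                p.state = json.load(f)["state"]
        if os.path.exists(p.log_path):
            with open(p.log_path) as f:
                for line in f:
                    p._execute(json.loads(line))
        return p
```

For example, after `receive` of 3, a `checkpoint`, and then `receive` of 4 and 5, calling `PessimisticProcess.recover` on the same directory rebuilds a state of 12 without consulting any other process.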

Pessimistic Logging: Example
- Pessimistic logging can cope with concurrent failures and the recovery of two or more processes

Benefits of Pessimistic Logging
- Processes do not need to track their dependencies
  - The logging mechanism is easy to implement and less error prone
- Output commit is automatically ensured
- No need to carry out coordinated global checkpointing
  - By replaying the logged msgs, a process can always bring itself to be consistent with other processes
- Recovery can be done completely locally
  - The only impact on other processes: duplicate msgs (which can be discarded)

Pessimistic Logging: Discussion
- Reconnection
  - A process must be able to cope with temporary connection failures and be ready to accept reconnections from other processes
  - Application logic should be made independent of transport-level events: event-based or document-based computing paradigm
- Message duplicate detection
  - Messages may be replayed during recovery => duplicate messages
  - Transport-level duplicate detection is irrelevant here; a mechanism must be added in application-level protocols, e.g., WS-ReliableMessaging
- Atomic message receiving and logging
  - A process may fail right after receiving a message, before it has a chance to log it to stable storage
  - This calls for an application-level reliable messaging mechanism
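Application-level duplicate detection can be sketched with per-sender sequence numbers. This is a simplified illustration, not the WS-ReliableMessaging wire format: the `DedupReceiver` class and the assumption that each sender numbers its messages 1, 2, 3, ... and delivers them in order are hypothetical.

```python
class DedupReceiver:
    def __init__(self):
        self.last_seq = {}    # sender id -> highest sequence number accepted
        self.delivered = []   # payloads handed to the application

    def on_message(self, sender, seq, payload):
        """Return True if the message is new, False if it is a duplicate."""
        if seq <= self.last_seq.get(sender, 0):
            return False      # replayed during the sender's recovery: discard
        self.last_seq[sender] = seq
        self.delivered.append(payload)
        return True
```

Note that `last_seq` has to survive the receiver's own failures, so in a real system it would need to be part of the checkpointed and logged state.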

Application-Level Reliable Messaging
- The sender buffers each message sent until it receives an application-level ack
- Benefits of application-level reliable messaging
  - Atomic message receiving and logging
  - Facilitates distributed system recovery from process failures: enables reconnection
  - Enables optimization: a received message can be executed immediately, and the logging can be deferred until another message is to be sent
    - Logging and msg execution can be done concurrently
    - If a process sends out a message after receiving several msgs, the logging of those msgs can be batched
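The buffering and deferred-logging ideas can be sketched as follows. Everything here is an assumption made for illustration: the `ReliableSender` and `LoggingReceiver` names, the in-memory list standing in for stable storage, and the choice to issue acks only once a message has been logged.

```python
class ReliableSender:
    def __init__(self):
        self.next_seq = 1
        self.unacked = {}     # seq -> payload, kept until an app-level ack arrives

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq, payload   # the caller puts this on the wire

    def on_ack(self, seq):
        self.unacked.pop(seq, None)

    def retransmit(self):
        """Messages to resend after a reconnection, in sequence order."""
        return sorted(self.unacked.items())


class LoggingReceiver:
    def __init__(self):
        self.stable_log = []  # stands in for stable storage
        self.pending = []     # executed but not yet logged
        self.state = 0

    def on_message(self, seq, payload):
        self.state += payload             # execute immediately
        self.pending.append((seq, payload))

    def flush(self):
        """Log all pending messages in one batch; return the seqs to ack."""
        self.stable_log.extend(self.pending)
        acks = [seq for seq, _ in self.pending]
        self.pending.clear()
        return acks

    def send(self, payload):
        # Logging is deferred until this process is about to send something.
        acks = self.flush()
        return acks, payload
```

If the receiver fails before `flush`, the unlogged messages are still in the sender's `unacked` buffer and are resent on reconnection, which is what makes receiving and logging effectively atomic.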

Sender Based Message Logging
- Basic idea
  - Log the message at the sending side in volatile memory
  - Should the receiving process fail, it can obtain the messages logged at the sending processes for recovery
  - To avoid restarting from the initial state after a failure, a process can periodically checkpoint its local state and write the message log to stable storage (as part of the checkpoint) asynchronously
- Tradeoff
  - The relative ordering of messages must be explicitly supplied by the receiver to the sender (quite counter-intuitive!)
  - The receiver must wait for an explicit ack for the ordering message before it sends any msgs to other processes (however, it can execute the received message immediately without delay)
  - This mechanism prevents the formation of orphan messages and orphan processes
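The message exchange above can be sketched step by step. This is a simplified model under stated assumptions: the names `Process`, `on_order`, `on_order_ack`, and `recover`, and the use of a receive sequence number (`rsn`) to carry the ordering information, are illustrative choices; checkpointing and messages whose ordering was never recorded are left out.

```python
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.history = []          # application state: payloads in execution order
        self.send_log = {}         # volatile log at the sender: msg_id -> record
        self.next_msg_id = 1
        self.next_rsn = 1          # next receive sequence number to assign
        self.awaiting_ack = set()  # ordering msgs sent but not yet acknowledged

    def send(self, dst_pid, payload):
        if self.awaiting_ack:
            # Sending now could create an orphan message.
            raise RuntimeError("ordering info not yet acknowledged")
        msg_id = (self.pid, self.next_msg_id)
        self.next_msg_id += 1
        self.send_log[msg_id] = {"dst": dst_pid, "payload": payload, "rsn": None}
        return msg_id, payload

    def receive(self, msg_id, payload):
        self.history.append(payload)   # execute immediately, no delay
        rsn = self.next_rsn
        self.next_rsn += 1
        self.awaiting_ack.add(rsn)
        return msg_id, rsn             # ordering message back to the sender

    def on_order(self, msg_id, rsn):
        self.send_log[msg_id]["rsn"] = rsn
        return rsn                     # ack for the ordering message

    def on_order_ack(self, rsn):
        self.awaiting_ack.discard(rsn)

    def logged_for(self, dst_pid):
        return [(r["rsn"], r["payload"]) for r in self.send_log.values()
                if r["dst"] == dst_pid and r["rsn"] is not None]


def recover(pid, senders):
    """Rebuild a failed process by replaying sender-side logs in RSN order."""
    p = Process(pid)
    for rsn, payload in sorted(rec for s in senders for rec in s.logged_for(pid)):
        p.history.append(payload)
        p.next_rsn = rsn + 1
    return p
```

The `RuntimeError` in `send` is where the tradeoff shows up: the receiver may execute a message right away, but it cannot send anything until the sender has confirmed that the ordering information is logged.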

Orphan Message and Orphan Process
- An orphan message is one that was sent by a process prior to a failure, but cannot be guaranteed to be regenerated upon the recovery of the process
- An orphan process is a process that receives an orphan message
- If a process sends out a message and subsequently fails before the determinants of the messages it has received are properly logged, the message sent becomes an orphan message
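The definitions can be turned into a small check. This is a sketch under assumptions: the `find_orphans` function and the record fields `id`, `receiver`, and `depends_on` are hypothetical, and only direct receivers are reported (orphan status that spreads through further messages is not followed).

```python
def find_orphans(sent_messages, logged_determinants):
    """Classify the messages a failed process sent before it crashed.

    sent_messages: list of dicts with keys
        "id"         - identifier of the sent message
        "receiver"   - process that received it
        "depends_on" - set of received-message ids that preceded the send
    logged_determinants: set of received-message ids whose determinants
        reached stable storage before the failure
    """
    orphan_msgs = [m["id"] for m in sent_messages
                   if not m["depends_on"] <= logged_determinants]
    orphan_procs = {m["receiver"] for m in sent_messages
                    if m["id"] in orphan_msgs}
    return orphan_msgs, orphan_procs


# P1 received r1 (logged) and r2 (not logged), sending s1 and s2 along the way.
sent = [
    {"id": "s1", "receiver": "P0", "depends_on": {"r1"}},
    {"id": "s2", "receiver": "P2", "depends_on": {"r1", "r2"}},
]
orphans, orphan_processes = find_orphans(sent, {"r1"})
```

Here `s2` is the orphan message, because P1 cannot be guaranteed to re-receive `r2` in the same way during recovery, and P2 becomes an orphan process.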

Exercise 1. Identify the set of most recent checkpoints that can be used to recover the system shown here after the crash of P1.
12/20/2019 EEC693: Secure and Dependable Computing EEC688: Secure & Dependable Computing Wenbing Zhao

Exercise 2. The Chandy and Lamport distributed snapshot protocol is used to produce a consistent global state of the system shown below. Draw all control msgs sent in the CL protocol and the checkpoints taken at P1 and P2, and specify the channel state for the P0 to/from P1 channels, the P1 to/from P2 channels, and the P2 to/from P0 channels.
