This lecture discusses checkpoint-based and logging-based protocols in secure and dependable computing, including uncoordinated and coordinated checkpointing, as well as pessimistic, optimistic, and causal logging. It also compares the Chandy and Lamport distributed snapshot protocol with the TS checkpointing protocol. The lecture concludes with a discussion on reconnection and the importance of application-level reliable messaging for atomic message receiving and logging.
EEC 688/788 Secure and Dependable Computing
Lecture 7
Wenbing Zhao
Department of Electrical and Computer Engineering
Cleveland State University
wenbing@ieee.org
Outline
- Checkpointing and logging
- Checkpoint-based protocols
  - Uncoordinated checkpointing
  - Coordinated checkpointing
- Logging-based protocols
  - Pessimistic logging
  - Optimistic logging
  - Causal logging
EEC688/788: Secure & Dependable Computing
Chandy and Lamport Distributed Snapshot Protocol
- The CL snapshot protocol is nonblocking; the TS checkpointing protocol is blocking
- The CL protocol is more desirable for applications that do not wish to suspend normal operation
- However, the CL protocol is only concerned with how to obtain a consistent global checkpoint
- CL protocol: no coordinator; any node may initiate a global checkpoint
- Data structures
  - Marker message: equivalent to the CHECKPOINT message
  - Marker certificate: keeps track of whether a marker has been received from every incoming channel
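The marker handling described above can be sketched for a single process as follows. This is a minimal illustration, not the lecture's own code: the class name `SnapshotProcess`, the channel naming, the integer application state, and the `sent` list standing in for the network are all assumptions made for the example.

```python
class SnapshotProcess:
    """One process taking part in a Chandy-Lamport style snapshot (sketch)."""

    def __init__(self, incoming, outgoing, state=0):
        self.incoming = list(incoming)  # names of incoming channels
        self.outgoing = list(outgoing)  # names of outgoing channels
        self.state = state              # application state (a plain counter here)
        self.recorded = False           # has the local checkpoint been taken?
        self.checkpoint = None          # recorded local state
        self.marker_seen = set()        # "marker certificate": channels whose marker arrived
        self.channel_state = {}         # channel -> messages recorded as in transit
        self.sent = []                  # (channel, message) pairs; stands in for the network

    def _take_checkpoint(self):
        # Record local state, start recording every incoming channel,
        # and send a marker on every outgoing channel.
        self.recorded = True
        self.checkpoint = self.state
        self.channel_state = {ch: [] for ch in self.incoming}
        for ch in self.outgoing:
            self.sent.append((ch, "MARKER"))

    def initiate(self):
        # Any process may initiate; there is no coordinator.
        if not self.recorded:
            self._take_checkpoint()

    def on_marker(self, channel):
        # First marker triggers the checkpoint; the channel it arrived on
        # is recorded as empty because recording for it stops immediately.
        if not self.recorded:
            self._take_checkpoint()
        self.marker_seen.add(channel)

    def on_message(self, channel, msg):
        # Normal operation is never suspended (nonblocking protocol).
        if self.recorded and channel not in self.marker_seen:
            self.channel_state[channel].append(msg)
        self.state += msg

    def snapshot_done(self):
        return self.recorded and self.marker_seen == set(self.incoming)
```

A message that arrives on a channel after the checkpoint but before that channel's marker is recorded as channel state; anything arriving after the marker belongs to the next snapshot.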
Example
- P1->P0 channel state: m0
- P2->P1 channel state: m1
- All other channel states: empty
Comparison of TS & CL Protocols
Similarities
- Both rely on control msgs to coordinate checkpointing
- Both capture channel state in virtually the same way
  - Start logging channel state upon receiving the 1st checkpoint msg from another channel
  - Stop logging the state of an incoming channel once the checkpoint msg has been received on that channel
- Communication overhead is similar
Comparison of TS & CL Protocols
Differences: strategies in producing a global checkpoint
- The TS protocol suspends normal operation upon the 1st checkpoint msg, while CL does not
- The TS protocol captures channel state prior to taking a checkpoint, while CL captures channel state after taking a checkpoint
- The TS protocol is more complete and robust than CL
  - It has a fault handling mechanism
Log Based Protocols
- Work might be lost upon recovery using checkpoint-based protocols
- By logging messages, we may be able to recover the system to where it was prior to the failure
- System model: the execution of a process is modeled as a set of consecutive state intervals
  - Each interval is initiated by a nondeterministic event or the initial state
  - We assume the only type of nondeterministic event is the receiving of a message
Log Based Protocols
- In practice, logging is always used together with checkpointing
  - Limits the recovery time: start with the latest checkpoint instead of from the initial state
  - Limits the size of the log: after taking a checkpoint, previously logged events can be purged
- Logging protocol types:
  - Pessimistic logging: msgs are logged prior to execution
  - Optimistic logging: msgs are logged asynchronously
  - Causal logging: nondeterministic events that are not yet logged (to stable storage) are piggybacked on each msg sent
- For optimistic and causal logging, the dependency of processes has to be tracked => more complexity, longer recovery time
Pessimistic Logging
- Synchronously log every incoming message to stable storage prior to execution
- Each process periodically checkpoints its state: no need for coordination
- Recovery: a process restores its state using the last checkpoint and replays all logged incoming msgs
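A minimal sketch of the log-then-execute rule and of recovery by replay. The class name `PessimisticProcess`, the arithmetic used as the "application", and the in-memory fields standing in for stable storage are assumptions for illustration; a real implementation would write the log and checkpoint to disk and force them out before executing.

```python
class PessimisticProcess:
    """Logs every incoming message before executing it (illustrative sketch)."""

    def __init__(self):
        self.state = 0
        self.stable_checkpoint = 0  # stand-in for a checkpoint on stable storage
        self.stable_log = []        # stand-in for the message log on stable storage

    def _execute(self, msg):
        # Deterministic and order-sensitive, so replay order matters.
        self.state = self.state * 2 + msg

    def receive(self, msg):
        self.stable_log.append(msg)  # 1. log synchronously
        self._execute(msg)           # 2. only then execute

    def take_checkpoint(self):
        # Uncoordinated: the process decides on its own when to checkpoint.
        self.stable_checkpoint = self.state
        self.stable_log.clear()      # messages before the checkpoint can be purged

    def crash(self):
        self.state = None            # volatile state is lost

    def recover(self):
        # Restore the last checkpoint, then replay the logged messages in order.
        self.state = self.stable_checkpoint
        for msg in self.stable_log:
            self._execute(msg)
```

Because receiving a message is assumed to be the only nondeterministic event, replaying the same messages in the same order from the checkpoint reproduces the pre-failure state.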
Pessimistic Logging: Example
- Pessimistic logging can cope with concurrent failures and the recovery of two or more processes
Benefits of Pessimistic Logging
- Processes do not need to track their dependencies
  - The logging mechanism is easy to implement and less error-prone
- Output commit is automatically ensured
- No need to carry out coordinated global checkpointing
  - By replaying the logged msgs, a process can always bring itself to be consistent with other processes
- Recovery can be done completely locally
  - Only impact on other processes: duplicate msgs (can be discarded)
Pessimistic Logging: Discussion
- Reconnection
  - A process must be able to cope with temporary connection failures and be ready to accept reconnections from other processes
  - Application logic should be made independent of transport-level events: event-based or document-based computing paradigm
- Message duplicate detection
  - Messages may be replayed during recovery => duplicate messages
  - Transport-level duplicate detection is irrelevant. A mechanism must be added in application-level protocols, e.g., WS-ReliableMessaging
- Atomic message receiving and logging
  - A process may fail right after receiving a message, before it has a chance to log it to stable storage
  - An application-level reliable messaging mechanism is needed
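Application-level duplicate detection can be sketched with per-sender sequence numbers. This is an assumed scheme for illustration only, not the WS-ReliableMessaging wire format; the class name `DedupReceiver` and the assumptions that sequence numbers start at 1 and arrive in order from each sender are hypothetical.

```python
class DedupReceiver:
    """Discards messages already delivered, keyed by (sender, sequence number)."""

    def __init__(self):
        self.last_seq = {}   # sender -> highest sequence number delivered so far
        self.delivered = []  # payloads handed to the application

    def on_message(self, sender, seq, payload):
        # A replayed message carries a sequence number we have already seen.
        if seq <= self.last_seq.get(sender, 0):
            return False     # duplicate: discard
        self.last_seq[sender] = seq
        self.delivered.append(payload)
        return True
```

The sequence numbers live in the application message itself, so they survive a reconnection that resets transport-level state. For the check to survive a crash of the receiver, `last_seq` would itself need to be recoverable, for example as part of the checkpoint.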
Application-Level Reliable Messaging
- The sender buffers each message sent until it receives an application-level ack
- Benefits of application-level reliable messaging
  - Atomic message receiving and logging
  - Facilitates distributed system recovery from process failures: enables reconnection
  - Enables optimization: a message received can be executed immediately and the logging can be deferred until another message is to be sent
    - Logging and msg execution can be done concurrently
    - If a process sends out a message after receiving several msgs, the logging of those msgs can be batched
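The buffering and deferred-logging optimization can be sketched as below. The class names `ReliableSender` and `DeferredLoggingReceiver`, the integer payloads, and the in-memory `stable_log` are assumptions for the example. The sketch also assumes the application-level ack is issued only after the message has been logged, which is what makes receiving and logging appear atomic to the sender.

```python
class ReliableSender:
    """Buffers every message until an application-level ack arrives."""

    def __init__(self):
        self.next_seq = 1
        self.unacked = {}  # seq -> payload, kept for retransmission

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq, payload  # what would go on the wire

    def on_ack(self, seq):
        self.unacked.pop(seq, None)

    def to_retransmit(self):
        # Everything not yet acked is resent after a reconnection.
        return sorted(self.unacked.items())


class DeferredLoggingReceiver:
    """Executes messages immediately; logs them in a batch before sending."""

    def __init__(self):
        self.state = 0
        self.pending = []     # executed but not yet logged
        self.stable_log = []  # stand-in for stable storage
        self.flushes = 0      # number of (batched) log writes

    def on_message(self, seq, payload):
        self.state += payload  # execute right away
        self.pending.append((seq, payload))

    def flush_log(self):
        # One batched write; returns the sequence numbers that may now be acked.
        if not self.pending:
            return []
        self.stable_log.extend(self.pending)
        self.flushes += 1
        acked = [seq for seq, _ in self.pending]
        self.pending.clear()
        return acked

    def send(self, payload):
        # Nothing may leave the process before its inputs are logged.
        acks = self.flush_log()
        return acks, payload
```

If the receiver fails before `flush_log` runs, no ack was sent, so the sender still holds the messages and retransmits them on reconnection.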
Sender Based Message Logging
- Basic idea
  - Log the message at the sending side in volatile memory
  - Should the receiving process fail, it can obtain the messages logged at the sending processes for recovery
  - To avoid restarting from the initial state after a failure, a process can periodically checkpoint its local state and write the message log to stable storage (as part of the checkpoint) asynchronously
- Tradeoff
  - The relative ordering of messages must be explicitly supplied by the receiver to the sender (quite counter-intuitive!)
  - The receiver must wait for an explicit ack for the ordering message before it sends any msgs to other processes (however, it can execute the message received immediately without delay)
  - The mechanism is there to prevent the formation of orphan messages and orphan processes
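The exchange above can be sketched as follows. All names (`SBSender`, `SBReceiver`, `recover`, the `order` field) are hypothetical, the "state" is just a string of concatenated payloads, and message delivery is simulated by direct method calls.

```python
class SBSender:
    """Keeps a volatile log of the messages it sent, plus their receive order."""

    def __init__(self):
        self.log = []  # entries: {"dest", "payload", "order"}

    def send(self, dest, payload):
        self.log.append({"dest": dest, "payload": payload, "order": None})
        return len(self.log) - 1  # log index doubles as a message id

    def on_order(self, msg_id, order):
        # The receiver tells the sender where the message fell in its receive order.
        self.log[msg_id]["order"] = order
        return True  # explicit ack for the ordering message

    def replay_for(self, dest):
        return [(e["order"], e["payload"]) for e in self.log
                if e["dest"] == dest and e["order"] is not None]


class SBReceiver:
    def __init__(self):
        self.state = ""
        self.next_order = 1
        self.unconfirmed = set()  # orders not yet acked by the sender

    def on_message(self, payload):
        order = self.next_order
        self.next_order += 1
        self.state += payload     # executing immediately is allowed
        self.unconfirmed.add(order)
        return order              # to be reported back to the sender

    def on_order_ack(self, order):
        self.unconfirmed.discard(order)

    def can_send(self):
        # Sending is blocked until every ordering message has been acked.
        return not self.unconfirmed


def recover(dest, senders):
    """Rebuild the receiver's state from the senders' logs, in receive order."""
    entries = []
    for s in senders:
        entries.extend(s.replay_for(dest))
    state = ""
    for _, payload in sorted(entries):
        state += payload
    return state
```

Blocking sends until the ordering is acked means no other process can depend on a receive whose order has not been recorded somewhere, which is how this sketch avoids orphan messages.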
Orphan Message and Orphan Process
- An orphan message is one that was sent by a process prior to a failure, but cannot be guaranteed to be regenerated upon the recovery of the process
- An orphan process is a process that receives an orphan message
- If a process sends out a message and subsequently fails before the determinants of the messages it has received are properly logged, the message sent becomes an orphan message
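The definition can be turned into a small check over one process's execution trace. The function name `orphan_messages`, the event tuple format, and the `logged` set are assumptions made for this sketch.

```python
def orphan_messages(events, logged):
    """Return the ids of sent messages that would be orphans after a failure.

    events: list of ("recv", msg_id) or ("send", msg_id) in execution order.
    logged: set of received msg_ids whose determinants were logged
            before the failure.
    """
    orphans = []
    replayable = True  # can execution up to this point be reproduced?
    for kind, msg_id in events:
        if kind == "recv" and msg_id not in logged:
            # From here on, recovery cannot be guaranteed to follow the same path.
            replayable = False
        elif kind == "send" and not replayable:
            orphans.append(msg_id)
    return orphans
```

For example, with the trace recv m1, send m2, recv m3, send m4 and only m1 logged, m4 is an orphan: it was sent after m3 was received, and m3 cannot be replayed.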
Exercise
1. Identify the set of most recent checkpoints that can be used to recover the system shown here after the crash of P1
12/20/2019 EEC693: Secure and Dependable Computing EEC688: Secure & Dependable Computing Wenbing Zhao
Exercise
2. The Chandy and Lamport distributed snapshot protocol is used to produce a consistent global state of the system shown below. Draw all control msgs sent in the CL protocol and the checkpoints taken at P1 and P2, and specify the channel state for the P0 to/from P1 channels, the P1 to/from P2 channels, and the P2 to/from P0 channels