This lecture covers the results of the midterm exam, group communication systems, membership protocols, agreed and safe delivery, and checkpointing and recovery in secure and dependable computing.
EEC 693/793 Special Topics in Electrical Engineering: Secure and Dependable Computing
Lecture 14
Wenbing Zhao
Department of Electrical and Computer Engineering, Cleveland State University
wenbing@ieee.org
Outline
• Midterm #2 result
• Group communication systems
• Membership protocols
• Agreed and safe delivery
• Checkpointing and recovery
• Reference:
  • Reliable Distributed Systems, by K. P. Birman, Springer; Chapters 14-16
EEC693: Secure & Dependable Computing
Midterm #2 Result
• High 98, low 79, mean 92.7
• Average score per question: Q1 18.9, Q2 17.6, Q3 18.3, Q4 19.1, Q5 18.9
Unreliable Failure Detection
• Recall that failures are hard to distinguish from network delay
• So we accept the risk of a mistake
• If p is running a protocol to exclude q because “q has failed”, all processes that hear from p will cut their channels to q
  • Avoids “messages from the dead”
• q must rejoin to participate in the group membership service (GMS) again
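The exclusion rule above can be sketched as a timeout-based detector. This is only an illustration under stated assumptions: the class name, the heartbeat mechanism, and the timeout parameter are hypothetical and do not come from the lecture.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical timeout-based failure detector (illustrative names only).
class TimeoutFailureDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeard = new HashMap<>();
    private final Set<String> excluded = new HashSet<>();

    TimeoutFailureDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record a heartbeat. Traffic from an excluded process is dropped
    // ("messages from the dead") until that process rejoins.
    boolean onHeartbeat(String process, long nowMillis) {
        if (excluded.contains(process)) {
            return false;
        }
        lastHeard.put(process, nowMillis);
        return true;
    }

    // Suspect every process that has been silent longer than the timeout.
    // This can be wrong: a slow network looks the same as a crash.
    Set<String> suspects(long nowMillis) {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Long> entry : lastHeard.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Called on hearing that a process is being excluded: cut the channel to it.
    void exclude(String process) {
        excluded.add(process);
        lastHeard.remove(process);
    }

    // An excluded process must rejoin explicitly before it is heard again.
    void rejoin(String process, long nowMillis) {
        excluded.remove(process);
        lastHeard.put(process, nowMillis);
    }
}
```

The detector never claims certainty. It only produces suspicions, and the membership protocol decides whether to act on them.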
Basic GMP (Group Membership Protocol)
• Someone reports that “q has failed”
• The leader (process p) runs a 2-phase commit protocol
  • Announces a “proposed new GMS view”
    • Excludes q, or might add some members who are joining, or could do both at once
  • Waits until a majority of the members of the current view have voted “ok”
  • Then commits the change
GMP Example
• Proposes new view: {p,r} [-q]
• Needs majority consent: p itself, plus one more (“current” view had 3 members)
• Can add members at the same time
[Figure: timeline for p, q, r. p sends Proposed V1 = {p,r}; r replies OK; p sends Commit V1. The view changes from V0 = {p,q,r} to V1 = {p,r}.]
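The leader's side of the two-phase view change can be sketched as follows. This is a minimal sketch under assumptions: messaging is omitted, the set of “ok” voters is passed in directly, and the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the leader's bookkeeping for a two-phase view change.
class ViewChangeLeader {
    private Set<String> currentView;

    ViewChangeLeader(Set<String> initialView) {
        this.currentView = new LinkedHashSet<>(initialView);
    }

    Set<String> currentView() {
        return new LinkedHashSet<>(currentView);
    }

    // Phase 1 outcome: do the "ok" voters form a majority of the CURRENT view?
    // Votes from processes outside the current view (e.g. joiners) do not count.
    boolean hasMajority(Set<String> okVoters) {
        int votes = 0;
        for (String member : currentView) {
            if (okVoters.contains(member)) {
                votes++;
            }
        }
        return votes > currentView.size() / 2;
    }

    // Phase 2: commit the proposed view only with majority consent.
    // Returning false means the leader must keep waiting.
    boolean tryCommit(Set<String> proposedView, Set<String> okVoters) {
        if (!hasMajority(okVoters)) {
            return false;
        }
        currentView = new LinkedHashSet<>(proposedView);
        return true;
    }
}
```

With a current view of {p,q,r}, a proposal of {p,r} commits once p and r have voted “ok”. A vote from p alone is not enough.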
Special Concerns?
• What if someone doesn’t respond?
  • p can tolerate failures of a minority of the members of the current view
  • The first round of the next view change “overlaps” the commit of the current one:
    • “Commit that q has left. Propose add s and drop r”
• p must wait if it can’t contact a majority
  • Avoids risk of partitioning
What If Leader Fails?
• Here we do a 3-phase protocol
• The new leader identifies itself based on age ranking (oldest surviving process)
• It runs an inquiry phase
  • “The adored leader has died. Did he say anything to you before passing away?”
  • Note that this causes participants to cut connections to the adored previous leader
• It then runs the normal 2-phase protocol, but “terminates” any interrupted view changes the previous leader had initiated
GMP Example
• New leader first sends an inquiry
• Then proposes new view: {q,r} [-p]
• Needs majority consent: q itself, plus one more (“current” view had 3 members)
• Again, can add members at the same time
[Figure: timeline for p, q, r. p fails; q sends Inquire [-p]; r replies “OK: nothing was pending”; q sends Proposed V1 = {q,r}; r replies OK; q sends Commit V1. The view changes from V0 = {p,q,r} to V1 = {q,r}.]
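The age-ranking rule for choosing the new leader can be sketched as below. It assumes the view is kept as a list ordered from oldest to youngest member, which is a representation chosen here for illustration. The inquiry phase itself is not modeled, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper for leader succession by age ranking.
class LeaderSuccession {
    // viewByAge lists the members from oldest to youngest (assumed representation).
    // The new leader is the oldest member that is not believed to have failed.
    static Optional<String> newLeader(List<String> viewByAge, Set<String> failed) {
        for (String member : viewByAge) {
            if (!failed.contains(member)) {
                return Optional.of(member);
            }
        }
        return Optional.empty();
    }

    // The view the new leader would propose: the old view minus the failed
    // members, keeping the age order so later successions still work.
    static List<String> proposedView(List<String> viewByAge, Set<String> failed) {
        List<String> survivors = new ArrayList<>();
        for (String member : viewByAge) {
            if (!failed.contains(member)) {
                survivors.add(member);
            }
        }
        return survivors;
    }
}
```

Because every member applies the same rule to the same view, the survivors agree on who the new leader is without an election.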
Safe and Agreed Delivery
• For totally ordered reliable multicast, there are two delivery policies
  • Safe delivery: a message is delivered only after all correct processes have received it
  • Agreed delivery: a message is delivered as soon as it is the next message in the total order
Safe and Agreed Delivery
• Safe delivery guarantees the uniformity of multicast:
  • If a message is delivered at any process, it is delivered at all correct processes
• Agreed delivery does not:
  • A message may be delivered at one or more processes and yet not be delivered at some correct process
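The difference between the two policies can be sketched as a single receive buffer with a policy switch. This is an illustration under assumptions: sequence numbers start at 1, “all correct processes” is approximated by every member of the current group view, and receipt is learned through explicit acknowledgements. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical receive buffer for totally ordered multicast.
class OrderedDeliveryQueue {
    private final Set<String> group;   // members of the current view
    private final boolean safe;        // true = safe delivery, false = agreed delivery
    private final Map<Long, String> received = new HashMap<>();
    private final Map<Long, Set<String>> acks = new HashMap<>();
    private long nextSeq = 1;

    OrderedDeliveryQueue(Set<String> group, boolean safe) {
        this.group = new HashSet<>(group);
        this.safe = safe;
    }

    // A message with its position in the total order has arrived.
    void onMessage(long seq, String payload) {
        if (seq >= nextSeq) {
            received.put(seq, payload);
        }
    }

    // Some member (possibly this one) reports that it holds message seq.
    void onAck(long seq, String from) {
        acks.computeIfAbsent(seq, k -> new HashSet<>()).add(from);
    }

    // Deliver, in total order, as many messages as the policy allows.
    List<String> deliverable() {
        List<String> out = new ArrayList<>();
        while (received.containsKey(nextSeq)) {
            Set<String> holders = acks.getOrDefault(nextSeq, Collections.<String>emptySet());
            if (safe && !holders.containsAll(group)) {
                break; // safe delivery waits until every member has the message
            }
            out.add(received.remove(nextSeq));
            acks.remove(nextSeq);
            nextSeq++;
        }
        return out;
    }
}
```

Agreed delivery releases a message as soon as it is next in order. Safe delivery pays an extra round of acknowledgements so that a delivered message is known to be held by every member.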
Checkpointing
• Checkpointing: the act of taking a snapshot of an entity so that we can restore it later
• A replica is a process running in an operating system. The state of a process includes:
  • The process's memory, stack and registers
  • Threads
  • Open or mmap'ed files
  • Current working directory
  • Interprocess communication:
    • Semaphores, shared memory, pipes, sockets
  • Dynamically loaded libraries
  • …
Checkpointing
• Many tools are available to perform checkpointing transparently or semi-transparently
  • http://www.checkpointing.org/
  • Condor, libckpt, etc.
• Checkpoints taken this way are generally not portable
• Checkpoints can be large
Checkpointing of Application State
• Sometimes it is more efficient to save and restore only the application state
• Such checkpoints can be very portable and compact
• Example:
    class Counter {
        int counter;
        Counter(int initVal) { counter = initVal; }
        void increment() { counter++; }
        void decrement() { counter--; }
        // Restore the application state from a checkpoint.
        void setState(int c) { counter = c; }
        // Take a checkpoint: the application state is a single int.
        int getState() { return counter; }
    }
Logging
• Logging of messages
  • Checkpointing in general is expensive
  • Logging of messages is cheaper => we can checkpoint periodically or on demand, and log all messages in between
• Logging of other non-deterministic activities
  • Access order to shared data
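Combining a checkpoint with a message log can be sketched as follows. This is a sketch under assumptions: message handling is deterministic, the log lives in memory rather than on stable storage, and the message names "increment" and "decrement" as well as the crash() helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical counter that checkpoints its state and logs messages in between.
class LoggedCounter {
    private int counter;
    private int checkpoint;                              // last saved application state
    private final List<String> log = new ArrayList<>();  // messages since the checkpoint

    LoggedCounter(int initVal) {
        counter = initVal;
        checkpoint = initVal;
    }

    // Apply a message, then log it. A real system would write the log entry
    // to stable storage; this sketch keeps it in memory.
    void handle(String message) {
        apply(message);
        log.add(message);
    }

    private void apply(String message) {
        if (message.equals("increment")) {
            counter++;
        } else if (message.equals("decrement")) {
            counter--;
        } else {
            throw new IllegalArgumentException("unknown message: " + message);
        }
    }

    // Save the state; the log can then be truncated.
    void takeCheckpoint() {
        checkpoint = counter;
        log.clear();
    }

    // Simulate losing the in-memory state (for demonstration only).
    void crash() {
        counter = 0;
    }

    // Restore the last checkpoint and replay the logged messages in order.
    void recover() {
        counter = checkpoint;
        for (String message : log) {
            apply(message);
        }
    }

    int getState() {
        return counter;
    }
}
```

Replay reproduces the lost state only because each message has a deterministic effect, which is why other sources of non-determinism also have to be logged.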
Recovery
• Roll-backward recovery
  • Used primarily in transaction processing
  • When a failure occurs, roll back to the most recent checkpoint (and retry)
• Roll-forward recovery
  • Used primarily with space redundancy (replication)
  • To recover a repaired replica, transfer the state from a current replica to the recovering replica
Roll-Forward Recovery
• With replication in space, it is possible to recover from a fault while the system keeps making progress
• Roll-forward recovery is made possible by
  • Checkpointing of replica state
  • Logging of incoming messages
  • A reliable, totally ordered group communication system
Roll-Forward Recovery
• We want the newly admitted replica to start with a state consistent with that of the other replicas
• Steps of adding a new replica to a group (with on-demand checkpointing)
  • A recovered (or new) replica joins the group
  • A join message is multicast in total order
  • On receiving the join message, an existing replica places it in its incoming message queue to await processing
  • When the join message reaches the head of the queue, a checkpoint is taken and transferred to the new replica
Roll-Forward Recovery
• The new replica starts queueing messages once it receives the join message (sent by itself)
• When the new replica receives the checkpoint, it restores its state from it (the checkpoint is delivered out of order!)
• The queued messages are then delivered in order at the new replica
• Other replicas do not stop and wait for the new replica
• The steps for adding a new replica to a group with periodic checkpointing are similar
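The join steps can be sketched with a counter as the replicated state. This is an illustration under assumptions: the group communication system is replaced by direct method calls in total order, and the message formats "join:<id>", "increment" and "decrement" and all class names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.OptionalInt;
import java.util.Queue;

// Shared, deterministic state transition (hypothetical message names).
class CounterOps {
    static int apply(int state, String message) {
        switch (message) {
            case "increment": return state + 1;
            case "decrement": return state - 1;
            default: throw new IllegalArgumentException("unknown message: " + message);
        }
    }
}

// A replica that is already in the group.
class ExistingReplica {
    private int state;

    ExistingReplica(int initial) { state = initial; }

    // Returns a checkpoint to send to the joiner when the message is a join;
    // otherwise applies the message. The replica never stops to wait.
    OptionalInt onOrderedMessage(String message) {
        if (message.startsWith("join:")) {
            return OptionalInt.of(state);
        }
        state = CounterOps.apply(state, message);
        return OptionalInt.empty();
    }

    int state() { return state; }
}

// A recovering (or new) replica that is joining the group.
class JoiningReplica {
    private final String id;
    private boolean joinSeen = false;
    private boolean stateRestored = false;
    private int state;
    private final Queue<String> pending = new ArrayDeque<>();

    JoiningReplica(String id) { this.id = id; }

    // Totally ordered message from the group communication system.
    void onOrderedMessage(String message) {
        if (!joinSeen) {
            // Messages ordered before our own join are already in the checkpoint.
            if (message.equals("join:" + id)) {
                joinSeen = true;
            }
            return;
        }
        if (message.startsWith("join:")) {
            return; // joins by other replicas are not handled in this sketch
        }
        if (!stateRestored) {
            pending.add(message); // queue until the checkpoint arrives
            return;
        }
        state = CounterOps.apply(state, message);
    }

    // The checkpoint arrives outside the total order.
    void onCheckpoint(int checkpointState) {
        state = checkpointState;
        stateRestored = true;
        while (!pending.isEmpty()) {
            state = CounterOps.apply(state, pending.remove());
        }
    }

    int state() { return state; }
}
```

The position of the join message in the total order marks the cut: everything ordered before it is reflected in the checkpoint, and everything after it is replayed from the queue.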
[Figure: Steps of Roll-Forward Recovery]