Fault Tolerance II

Fault Tolerance II Quiz 22 due at 5 PM Saturday, 18 October 2014

Lost Reply Messages • Safely repeated requests are idempotent; e.g., resend a file block, not retransfer $1000. • It is safest to assume that no request is idempotent: • Mark all requests with sequence numbers to distinguish originals from repeats. • Set a bit in the header of repeated requests, so that the server can handle it with care, which depends upon circumstances.

R U O K ? 1. What can you do to help safeguard against lost reply messages? • Assume that no request is idempotent. • Mark all requests with sequence numbers to distinguish originals from repeats. • Set a bit in the header of repeated requests, so that the server can handle it with care. • All of the above. • None of the above.

Client Crashes • Un-received server responses are “orphans”: • They waste CPU cycles, lock files and use resources. • Their premature arrival after client reboots can be confusing. • What to do about orphans? • Log every step, and read log after reboot. If it shows request was issued, kill the orphan. • Broadcast every step completion and broadcast reboot message. Let listeners kill the orphans. • Upon receiving reboot messages, others try to locate parents. If they are dead, the orphans die. • Orphans die, when client’s response times out. • Killing orphans can have lasting undesired side effects.

R U O K ? 2. Why should you care about “orphans” (i.e., unreceivedserver responses)? • They waste CPU cycles and use resources. • Their premature arrival after client reboots can be confusing. • Even if killed without mercy, they can leave devastating lasting effects; e.g., locked files. • All of the above. • None of the above.

Reliable Group Communication • Reliable multicast services are as important as resilient process replication. • Multicasts should guarantee deliveries to all members of a group. • But that ain’t easy…!

Basic Reliable-Multicasting Schemes • TCP only guarantees point-to-point deliveries. • Broadcasting via point-to-point connections is efficient for a few group members (see above). • Sequence numbers on every broadcast message prompt receivers to NAK missing messages. (Sender retains each message till every receiver ACKs.)

R U O K ? 3. Which of the following describe basic reliable multicasting? • TCP only guarantees point-to-point deliveries. • Broadcasting via point-to-point connections is efficient for relatively few group members. • Sequence numbering broadcast messages enables receivers to NAK missing messages. • All of the above. • None of the above.

Scalability in ReliableMulticasting • Receivers sending a few NAKs but not a lot of ACKs, scales up to larger groups. • Server’s deleting an old message risks the possibility that some receiver still has not received it.

Nonhierarchical Feedback Control • The Scalable Reliable Multicasting protocol does just the right amount of feedback suppression. • When a receiver misses a message, it multicasts its NAK (see above), which suppresses all others’ NAKs. • NAK collisions are prevented by randomly delaying the NAK while listening for others’ NAKs, as in the Ethernet protocol. • WANs with long propagation delays can’t do this very well. Neighboring nodes can team up on NAKing, by communicating with each other via a separate channel.

R U O K ? 4. Which of the following accurately characterizes the Scalable Reliable Multicasting protocol doing just the right amount of feedback suppression? • When a receiver misses a message, it multicasts its NAK, which suppresses all others’ NAKs. • NAK collisions are prevented by the receiver’s randomly delaying its NAK while listening for others’ NAKs, as in the Ethernet protocol. • WANs with long propagation delays can’t do this very well, but neighboring nodes can team up on NAKing, by communicating with each other via a separate channel. • All of the above. • None of the above.

Hierarchical Feedback Control • Hierarchical groups scale better than flat ones. • Sender sends to roots of large spanning trees. • Root’s local coordinators buffer and relay messages, as well as handle their subgroups’ ACKs and NAKs. • Application-level multicasting (pp.166-170) can solve the hierarchical subgroups’ dynamic growth and contraction problems.

R U O K ? 5. Which of the following is a reason why hierarchical groups scale better than flat ones. • Sender sends to roots of large spanning trees. • Roots’ local coordinators buffer and relay messages, as well as handle their subgroups’ ACKs and NAKs. • Application-level multicasting can solve the hierarchical subgroups’ dynamic growth and contraction problems. • All of the above. • None of the above.

Atomic Multicast • We need to guarantee that… • In the presence of process failures, • a message delivers to all processes or none at all, • messages are delivered to all in the same order; • i.e., “atomic multicast.” • When a replica crashes (a above), it loses its group membership. (The group it abandoned is complete, so condition b above is satisfied.) • When it recovers, it must rejoin the group. (All of the messages that it missed must be received in proper order, to satisfy condition c above.)

R U O K ? 6. What is an atomic multicast? • A multicast that can perform in the presence of process failures. • It delivers each message to all or no processes. • It delivers all messages in the same order. • All of the above. • None of the above.

Virtual Synchrony • Receiving a message and message delivery are different (see above left). • If a group loses or gains a member (i.e., “view change,” vc) at the same time it receives a message, then that message must not be delivered to anyone. (Atomic multicast prohibits delivery to a nonmember and failing to deliver to a member.) • Purposefully deciding not to deliver to anyone (i.e., all members ignoring whatever fragment of the delivery all members already have seen), on the occasion of a VC, does not make multicasting unreliable. • In fact, ignoring fragments makes multicasting “virtually synchro-nous” (above right). That is, it is equivalent to the message never having been sent. VCs must be delayed till a multicast is complete.

R U O K ? 7. What is virtual synchrony? • Clearly separating message reception in the operating system from message delivery in the application layer. • Purposely delaying message deliveries till the current view change (i.e., a group’s losing or gaining a member) is completed. • Purposefully delaying view changes till pending message deliveries are completed. • All of the above. • None of the above.

Message Ordering • Messages can be reliably and virtually synchronously multicast in four different orders: • Reliable unordered multicasts—messages deliver in any order (see upper left). • R. FIFO-ordered m.—each sender’s messages deliver in the order they were sent (upper right). • R. causally-ordered m.—timestamp-ordered deliveries from all senders. • Totally-ordered m.—all messages delivered in the same order to all group members.

R U O K ? 8. In what orders can messages be reliably and virtually synchronously multicast? • Unordered multicast—messages deliver in any order. • FIFO-ordered multicast—each sender’s messages deliver in the order they were sent. • Causally-ordered multicast—timestamp-ordered deliveries from all senders. • Totally-ordered multicast—all messages delivered in the same order to all group members. • All of the above.

R U O K ? 9. What are the permissible delivery orderings for the combination of FIFO and total-ordered multicasting in Fig. 8-14, on p.352? __ • m1, m2, m3, m4. • m1, m3, m2, m4. • m1, m3, m4, m2. • m3, m1, m2, m4. • m3, m1, m4, m2. • m3, m4, m1, m2. • All of the above.

Implementing Virtual Synchrony • Reliable TCP point-to-point messaging to each group member, but not all (e.g., sender could fail halfway through). • Every member’s communication layer holds each message till it is “stable”; i.e. received by every member. Then all deliver together.

R U O K ? 10. How can reliable virtual synchrony be assured, in the event of a sender failing halfway through a group’s message delivery? • Reliable TCP point-to-point messaging delivers to each group member, but not all. • Every member’s communication layer holds each message till it is “stable”(i.e., received by every member), then all deliver together. • Both of the above. • None of the above.

Implementing Virtual Synchrony (continued) What if a processor fails in the middle of a multicast or in the middle of a view change? • Process 4 notices that process 7 has crashed and sends a view change. • Process 6 sends out all its unstable messages, followed by a flush message. • Process 6 installs the new view when it has received a flush message from everyone else.

R U O K ? 11. What if a processor fails in the middle of a multicast or in the middle of a view change? • A functional process, which notices that another process has crashed, sends a view change to all. • A third process sends its unstable (i.e., partially sent) messages to all members, followed by a flush message. • That process installs the new view, after it receives a flush message from everyone else. • All of the above. • None of the above.

Distributed Commit • “Distributed commit” is a distributed transaction in which all members complete a transaction, or none at all. • In a one-phase commit, a coordinator tells all participants to simultaneously perform the transaction. • But what if one participant crashes and cannot tell the coordinator that it was unable to perform…?

R U O K ? 12. Which of the following accurately describes a one-phase commit? • A distributed transaction in which all members complete a transaction or none at all. • One participant crashes without telling the coordinator it was unable to perform. • A coordinator tells all participants to simultaneously perform a transaction. • All of the above. • None of the above.

Two-Phase Commit • “Two-phase commit” is a distributed transaction with a 2-way handshake: • Coordinator sends VOTE_REQUEST to all participants (above left). • Each participant replies with a VOTE_COMMIT or VOTE_ABORT (above center). • If vote was unanimous, coordinator sends GLOBAL_COMMIT, else she sends GLOBAL_ABORT. • Every participant either commits the transaction or aborts it as directed. • In general, all participants block waiting for messages, until time runs out, which aborts the transaction (above right). • But what if the coordinator crashes, after sending global commit to half of the members…?

R U O K ? 13. Which of the following accurately describe a two-phase commit? • Coordinator sends VOTE_REQUEST to all participants. • Each participant replies with a VOTE_COMMIT or VOTE_ABORT. • If vote was unanimous, coordinator sends GLOBAL_COMMIT, else she sends GLOBAL_ABORT. • Every participant either commits the transaction or aborts as directed. • All of the above. • None of the above.

Fault Tolerance II

Fault Tolerance II

Presentation Transcript

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault tolerance

Fault Tolerance II

Fault tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance

Fault Tolerance