
Fault Tolerance I


Presentation Transcript


  1. Fault Tolerance I CSE5306 Lecture Quiz due 17 July 2014

  2. Fault Tolerance • Single-machine centralized systems go down when essential parts fail. • A distributed system is said to “tolerate a fault” if it recovers and continues to perform while its faulty part is being repaired.

  3. R U O K ? 1. What does “tolerate a fault” mean? • A centralized system limping along after a debilitating failure. • Continuing to perform, except for an unimportant requirements violation caused by a minor failure. • A distributed system recovering and continuing to perform while its faulty part is being repaired. • All of the above. • None of the above.

  4. Basic Concepts of Fault Tolerance • A distributed system is “dependable” if it is… • Available: ready for immediate use; availability = MTTF/(MTTF+MTTR) (see the sketch below). • Reliable: works well, except during maintenance 1-2AM. • Safe: nuclear power plant controller failure does not cause catastrophe. • Maintainable: easily repaired. • Secure: failure does not expose users’ secrets. • Vocabulary words: • Failure: performance that violates system requirements. • Error: a component’s unexpected state that leads to failure. • Fault: a transient (bird strike), intermittent (loose connector) or permanent (burned-out chip) cause of an error.
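
The availability ratio above is easy to make concrete. A minimal sketch in Python, using illustrative MTTF/MTTR figures that are assumptions rather than course data:

```python
# Availability = MTTF / (MTTF + MTTR), as defined on the slide.
# The numbers below are illustrative assumptions, not course data.
MTTF_HOURS = 1000.0   # mean time to failure
MTTR_HOURS = 2.0      # mean time to repair

availability = MTTF_HOURS / (MTTF_HOURS + MTTR_HOURS)
print(f"availability = {availability:.4%}")   # about 99.80% for these numbers
```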

  5. R U O K ? Match the following terms with their definitions or examples below. 2. Dependable __ 3. Available __ 4. Reliable __ 5. Safe __ 6. Maintainable __ 7. Secure __ 8. Failure __ 9. Error __ 10. Fault __ • Nuclear power plant controller failure does not cause catastrophe. • Works well, except during regularly scheduled maintenance 1-2AM. • Ready for immediate use, MTTF/(MTTF+MTTR). • Easily repaired. • Available, reliable, safe, maintainable and secure. • Any performance that violates system requirements. • Failure does not expose users’ secrets. • A component’s unexpected state that leads to failure. • A transient, intermittent or permanent cause of an error.

  6. Failure Models • Crash failure: nothing to do but reboot. • Omission f.: no transport layer, no listening thread, send buffer overflow, infinite loop, scrambled dialog. • Timing f.: server’s late response drops connection, client responds before receive buffer allocation. • Response f.: Web search for beagles returns cats. • State transition f.: server takes default action after a reasonable request. • Arbitrary (Byzantine) f.: insecure server sends deliberately wrong answers. • Fail-stop f.: clearly visible to other processes, after a warning perhaps. • Fail-silent systems: marginal performance; e.g., slow responses. • Fail-safe faults: pretend to perform, but close analysis reveals nonsense. (The slide’s severity scale runs from fatal through serious to merely annoying.)

  7. R U O K ? Match the following terms with their definitions or examples below. 11. Crash failure __ 12. Omission f. __ 13. Timing f. __ 14. Response f. __ 15. State transition f. __ 16. Arbitrary (Byzantine) f. __ 17. Fail-stop f. __ 18. Fail-silent systems __ 19. Fail-safe faults __ • A server takes default action after a reasonable request. • No transport layer, no listening thread, send buffer overflow, infinite loop, scrambled dialog. • Nothing to do but reboot. • Marginal performance; e.g., slow responses. • Pretends to perform, but close analysis reveals nonsense. • Insecure server sends deliberately wrong answers. • Clearly visible to other processes, after a warning perhaps. • Web search for beagles returns cats. • Server’s late response drops connection, client responds before receive buffer allocation.

  8. Failure Masking by Redundancy • Fault-tolerant systems hide failures; e.g., by 3-way voting on every decision (“triple modular redundancy” above); a minimal voter sketch follows below. • Redundancy: • Information: Hamming code error-correcting bits. • Time: hide transient/intermittent faults by aborting the transaction and trying again. • Physical: hospital can run on batteries till its diesel generators start.
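
Triple modular redundancy reduces to a majority vote over three replica outputs. A minimal voter sketch (the function name and inputs are illustrative, not from the slides):

```python
from collections import Counter

def tmr_vote(replica_outputs):
    """Return the majority of three replica outputs; one faulty replica
    is out-voted by the other two (triple modular redundancy)."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica disagrees")
    return value

print(tmr_vote([42, 42, 7]))   # the faulty third replica is masked -> 42
```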

  9. R U O K ? Match the following terms with their definitions or examples below. 20. Triple modular redundancy __ 21. Redundancy types __ 22. Information __ 23. Time __ 24. Physical __ • Hospital runs on batteries till its diesel generators start. • Hamming code error correcting bits. • A fault tolerant system hiding failures by 3-way voting on every decision. • Information, time and physical. • Hide transient/intermittent faults by aborting transaction and trying again.

  10. Process Resilience • Collaborative groups of k+1 members can tolerate k crashes (p.331). • Byzantine groups of 3k+1 members can tolerate k lies (Fig. 8-5, p.333). • Processors of different administrative domains are “BAR fault tolerant”; i.e., Byzantine, altruistic and rational. (Their management is beyond the scope of this course, p.335.)
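
The two sizing rules above fit in a one-line helper. A sketch covering only the crash and Byzantine cases named on the slide:

```python
def min_group_size(k, byzantine=False):
    """Smallest group that tolerates k faulty members:
    k + 1 for crash faults, 3k + 1 for Byzantine (lying) faults."""
    return 3 * k + 1 if byzantine else k + 1

print(min_group_size(2))                  # 3 members survive 2 crashes
print(min_group_size(2, byzantine=True))  # 7 members out-vote 2 liars
```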

  11. R U O K ? 25. Which of the following is true of fault tolerant group decision making? • Collaborative groups of k+1 members can tolerate k crashes. • Byzantine groups of 3k+1 members can tolerate k lies. • Processors of different administrative domains are “BAR fault tolerant.” • All of the above. • None of the above.

  12. Design Issues • Tolerate a faulty process by organizing several identical processes into a group, in which all members receive the group’s messages. • A process may be a member of many groups, and join or drop out as needed. • Clients who rely upon a group’s services don’t know the members or how many there are.

  13. R U O K ? 26. Which of the following accurately characterizes collaborative groups? • They tolerate faulty processes by organizing identical processes into a group, in which all members receive the group’s messages. • A process may be a member of many groups, and join or drop out as needed. • Clients who rely upon a group’s services don’t know the members or how many there are. • All of the above. • None of the above.

  14. Flat Groups vs. Hierarchical Groups • Flat: all members are equal; symmetry with no single point of failure; voting takes time. • Hierarchical: director assigns specialists; director is a single point of failure; her quick decisions don’t distract specialists.

  15. R U O K ? Match the following group attributes with the group types below. 27. Director assigns specialists __ 28. All members are equal __ 29. Director is single point of failure __ 30. Symmetrical __ 31. No single point of failure __ 32. Voting takes time __ 33. Quick decisions don’t distract specialists__ • Flat. • Hierarchical.

  16. Group Membership • A group server uses its calling lists of “first responders” to muster groups (e.g., “Tiger Teams”) as needs arise. • Problem: the group server is itself a single point of failure for group management. • Solution: group members can manage themselves by multicasting their joining messages, and by leaving (becoming unresponsive) when needed elsewhere. • Joiners must receive the group’s legacy messages; leavers must not receive group messages. • Protocols must exist for… • Reconstituting a group that loses too many members. • Arbitrating between two contenders for leadership.

  17. R U O K ? 34. Which of the following is a group server’s responsibility? • Mustering new groups as needed. • Reconstituting a group that loses too many members. • Arbitrating among contenders for leadership. • All of the above. • None of the above.

  18. Failure Masking and Replication • Primary-based replication: if a sound statistical analysis shows that a replicated system must tolerate k crashes, then k backup systems must be ready for election as the primary system’s replacement (e.g., the Catholic Pope). • Flat groups use replicated-write and quorum-based protocols to coordinate groups of identical backup processes (a quorum-condition sketch follows below). • To tolerate k “sick” processes (i.e., Byzantine failures), 3k+1 flat group members are required. (The 2k+1 honest processes must out-vote the k liars.)
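
The quorum-based protocols mentioned above rest on two inequalities: every read quorum must overlap every write quorum, and two concurrent write quorums must intersect. A hedged sketch of that check, in the style of Gifford’s voting scheme (parameter names are illustrative):

```python
def valid_quorums(n, n_read, n_write):
    """Check the classic quorum conditions for n replicas: every read
    quorum overlaps every write quorum, and two concurrent write
    quorums always intersect."""
    return (n_read + n_write > n) and (2 * n_write > n)

print(valid_quorums(n=5, n_read=3, n_write=3))   # True
print(valid_quorums(n=5, n_read=1, n_write=2))   # False: a read can miss the latest write
```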

  19. Agreement in Faulty Systems • Distributed agreement algorithms seek consensus among processes in a limited number of steps, under the following assumptions: • All processes march together (synchronous) or not. • All messages arrive within a maximum time or not. • Processes’ messages are naturally ordered (TCP) or not. • Unicasting (separate messages) or multicasting. • Agreement is possible in only half of these combinations (see above, not Fig. 8-4, p.333).

  20. R U O K ? 35. Which of the following describes assumptions that a distributed agreement algorithm designer must make? • All messages arrive within a maximum time or not. • Processes’ messages are naturally ordered (TCP) or not. • Unicasting (separate messages) or multicasting. • All of the above. • None of the above.

  21. Agreement in Byzantine Systems • How can a group of 4 agree, in spite of one sick member (see a above)? (Assume synchronous operation, unicasts and bounded message delays.) The 3-step solution (sketched in code below): • Honest nodes send their node numbers to all others; the liar sends her correct node number only to herself (b above). • Every node sends a vector of everything she received to everyone else (c above). • Every node sees an accurate majority vote in every matrix column.
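
Those three steps can be simulated directly: announce, relay the received vectors, then take a column-wise majority. A minimal sketch for the 4-node, 1-liar case on the slide; the liar here simply reports random values, and all names and numbers are illustrative assumptions:

```python
import random
from collections import Counter

N, LIARS = 4, {2}          # 3k+1 = 4 nodes, k = 1 Byzantine member (node 2)
values = [10, 20, 30, 40]  # each node's private value (illustrative)

def say(sender, truth):
    """A liar reports a random value; honest nodes report the truth."""
    return random.randint(0, 99) if sender in LIARS else truth

# Step 1: node j announces its value to every node i.
heard = [[say(j, values[j]) for j in range(N)] for i in range(N)]

# Step 2: node m relays the whole vector it heard to every node i.
relayed = [[[say(m, heard[m][j]) for j in range(N)] for m in range(N)]
           for i in range(N)]

# Step 3: each honest node takes the majority of every matrix column.
for i in range(N):
    if i in LIARS:
        continue
    decision = [Counter(relayed[i][m][j] for m in range(N)).most_common(1)[0][0]
                for j in range(N)]
    print(f"node {i} decides {decision}")   # columns 0, 1, 3 agree: 10, 20, 40
```

The honest nodes end up agreeing on every honest member’s value; the liar’s own column cannot be trusted, which is why 3k+1 members are needed in the first place.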

  22. R U O K ? 36. How can a group of 3k+1 Byzantine group members reach an agreement, in spite of its k lying members? • Honest nodes send their node numbers to all others; liars send their correct node numbers only to themselves. • Every node sends a vector of everything she received to everyone else. • Every node sees an accurate majority vote in every matrix column. • All of the above. • None of the above.

  23. Failure Detection • How do you know if a server is alive? • Ping it. If there is no response before the timer expires, assume it is dead (a minimal probe sketch follows below). But… maybe the network is unreliable, and pinging is crude. • If the ping times out, ask another node to ping via another path. • Gossiping (saying, “I’m alive”) is more reliable. • Regularly exchange information with neighbors. • When a ping times out, honor your fallen comrade by failing to ACK too, till the whole group dies!
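
A hedged sketch of the ping-with-timeout idea, assuming the peer runs a TCP service on a host and port of our choosing (both are placeholders, not course infrastructure):

```python
import socket

# Host, port and timeout are illustrative placeholders.
HOST, PORT, TIMEOUT = "server.example.org", 7000, 2.0

def is_alive(host=HOST, port=PORT, timeout=TIMEOUT):
    """Return True if a TCP connection succeeds before the timeout.
    False only means "suspected dead": the network, not the server,
    may be at fault -- exactly the caveat on the slide."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("alive" if is_alive() else "suspected dead")
```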

  24. R U O K ? 37. How do you find out if a server is alive? • Ping it, and listen for “I’m alive.” • If ping times out, ask another node to ping via its alternate path. • Regularly exchange information with neighbors. • All of the above. • None of the above.

  25. Reliable Point-to-Point Client-Server Communications • Communication failures: • Crash: server can attempt to set up a new connection, if client drops. • Omission: TCP masks lost messages by NAKing and getting message sent again. • Timing: late message deliveries. • Arbitrary: network may send an old buffered message after sender resends it.

  26. R U O K ? Match the following communication failures with their definitions or remedies below. 38. Crash __ 39. Omission __ 40. Timing __ 41. Arbitrary __ • Network may send an old buffered message, after sender resends it. • Late message deliveries. • Server can attempt to set up a new connection, if client drops. • TCP masks lost messages by NAKing and getting message sent again.

  27. RPC Semantics in the Presence of Failures • Five different failures foil systems’ attempts to hide communications and make RPCs appear local: • The client cannot locate the server. • The client-to-server request message is lost. • The server crashes after receiving a request. • The server-to-client reply message is lost. • The client crashes after sending a request.

  28. R U O K ? 42. What communications failures can foil systems’ attempts to make RPCs appear local? • The client cannot locate the server. • The client-to-server request message is lost. • The server crashes after receiving a request. • All of the above. • None of the above.

  29. Client Cannot Locate Server • Reasons why the client can’t reach the server: • Server is down. • Client’s interface protocol is obsolete. • What to do about it: raise an exception. • Drawbacks: • Languages disagree on how to handle exceptions. • It destroys the illusion that the RPC is local.

  30. R U O K ? 43. What is wrong with raising an exception when an RPC client can’t reach the server? • Languages disagree on how to handle exceptions. • It destroys the illusion that the RPC is local. • All of the above. • None of the above.

  31. Lost Request Messages • What to do about it: when a reasonable response time expires, resend the message. • Drawbacks: • When resent messages get lost too, see “Client Cannot Locate Server” above. • When the response is merely slow (the message was not lost), the server must deal with duplicated messages (see “Lost Reply Messages” below).

  32. Server Crashes • Servers crash (and fail to reply) after or before executing the requested process (b and c above). • What to do (a retry sketch follows below): • Wait till the server reboots and call again; i.e., “at-least-once semantics.” • Give up and report a failure; i.e., “at-most-once semantics.” • Do nothing; i.e., don’t help or even explain. • What if the server crashes before printing a large file? • Client can resend the request, risking printing two copies. • Client doesn’t resend the request, risking getting no print out. • Client can resend, if its request is not ACKed. • Client can resend, if the server did not say, “Print out is ready.”
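
A hedged sketch of the first two policies, using a hypothetical send_request() transport and ServerDown exception (neither comes from any real RPC library):

```python
import time

class ServerDown(Exception):
    """Raised when no reply arrives before the client gives up."""

def send_request(msg):
    # Placeholder transport: pretend the server never answers.
    raise ServerDown(msg)

def call_at_least_once(msg, retries=5, delay=1.0):
    """Keep retrying until a reply arrives; the request may execute twice."""
    for _ in range(retries):
        try:
            return send_request(msg)
        except ServerDown:
            time.sleep(delay)   # wait for the server to reboot, then try again
    raise ServerDown("gave up after retries")

def call_at_most_once(msg):
    """Send exactly once; on failure, report it rather than risk a duplicate."""
    try:
        return send_request(msg)
    except ServerDown:
        return None             # caller is told the call may not have run
```

Under at-least-once the slide’s printer may produce two copies; under at-most-once it may produce none, which is exactly the trade-off listed above.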

  33. Server Crashes (continued) • What if the print server crashes before (or after) printing a large file? • With M = send the completion message, P = print the text and C = crash, these events can occur in six different orderings (parenthesized events never happened): • M → P → C: A crash occurs after sending the completion message and printing the text. • M → C (→ P): A crash happens after sending the completion message, but before the text could be printed. • P → M → C: A crash occurs after sending the completion message and printing the text. • P → C (→ M): The text printed, after which a crash occurs before the completion message could be sent. • C (→ P → M): A crash happens before the server could do anything. • C (→ M → P): A crash happens before the server could do anything. • See the client’s possible/necessary responses and outcomes above.

  34. Lost Reply Messages • Safely repeated requests are idempotent; e.g., resend a file block, but don’t retransfer $1000. • It is safest to assume that no request is idempotent: • Mark all requests with sequence numbers to distinguish originals from repeats. • Set a bit in the header of repeated requests, so that the server can handle them with whatever care the circumstances require (a duplicate-filtering sketch follows below).
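
A minimal sketch of server-side duplicate filtering, assuming every request carries a (client_id, seq) pair and the repeat bit mentioned above; the class and parameter names are illustrative:

```python
class DedupServer:
    """Filter duplicate requests identified by (client_id, seq)."""

    def __init__(self):
        self.completed = {}     # (client_id, seq) -> cached reply

    def handle(self, client_id, seq, repeat, do_work):
        key = (client_id, seq)
        if key in self.completed:
            # The original already executed: replay the cached reply.
            return self.completed[key]
        # A request flagged 'repeat' whose original never arrived may be run,
        # though a cautious server would first check that do_work is idempotent.
        reply = do_work()
        self.completed[key] = reply
        return reply

server = DedupServer()
print(server.handle("client-1", 7, repeat=False, do_work=lambda: "printed"))
print(server.handle("client-1", 7, repeat=True,  do_work=lambda: "printed again?"))
```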

  35. R U O K ? 44. What can you do to help safeguard against lost reply messages? • Assume that no request is idempotent. • Mark all requests with sequence numbers to distinguish originals from repeats. • Set a bit in the header of repeated requests, so that the server can handle it with care. • All of the above. • None of the above.

  36. Client Crashes • Un-received server responses are “orphans”: • They waste CPU cycles, lock files and use resources. • Their premature arrival after client reboots can be confusing. • What to do about orphans? • Log every step, and read log after reboot. If it shows request was issued, kill the orphan. • Broadcast every step completion and broadcast reboot message. Let listeners kill the orphans. • Upon receiving reboot messages, others try to locate parents. If they are dead, the orphans die. • Orphans die, when client’s response times out. • Killing orphans can have lasting undesired side effects.
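
The first orphan remedy above (log every request, then exterminate orphans after reboot) can be sketched in a few lines; the log file name and kill_orphan() hook are placeholders, not part of any real RPC runtime:

```python
import json, os

LOG = "rpc_requests.log"   # illustrative log path

def log_request(req_id):
    """Append every outgoing RPC to stable storage before sending it."""
    with open(LOG, "a") as f:
        f.write(json.dumps({"id": req_id}) + "\n")

def kill_orphan(req_id):
    print(f"killing orphaned computation for request {req_id}")

def recover_after_reboot():
    """After the client reboots, read the pre-crash log and kill every orphan."""
    if not os.path.exists(LOG):
        return
    with open(LOG) as f:
        for line in f:
            kill_orphan(json.loads(line)["id"])
    os.remove(LOG)   # start the new epoch with a clean slate
```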

  37. R U O K ? 45. Why should you care about “orphans” (i.e., unreceived server responses)? • They waste CPU cycles and use resources. • Their premature arrival after the client reboots can be confusing. • Even if killed without mercy, they can leave devastating lasting effects; e.g., locked files. • All of the above. • None of the above.

  38. Reliable Group Communication • Reliable multicast services are as important as resilient process replication. • Multicasts should guarantee deliveries to all members of a group. • But that ain’t easy…!

  39. Basic Reliable-Multicasting Schemes • TCP only guarantees point-to-point deliveries. • Broadcasting via point-to-point connections is efficient for a few group members (see above). • Sequence numbers on every broadcast message prompt receivers to NAK missing messages; the sender retains each message till every receiver ACKs it (a receiver-side sketch follows below).
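
A minimal receiver-side sketch of that scheme: detect gaps in the sequence numbers and NAK whatever is missing. The send_nak callback is a placeholder, not a real multicast API:

```python
class ReliableReceiver:
    """Track the sender's sequence numbers and NAK any gaps."""

    def __init__(self, send_nak):
        self.expected = 0        # next sequence number we expect
        self.send_nak = send_nak

    def on_message(self, seq, payload):
        if seq > self.expected:
            for missing in range(self.expected, seq):
                self.send_nak(missing)   # ask the sender to retransmit
        if seq >= self.expected:
            self.expected = seq + 1
        return payload                    # delivery/reordering is left out here

rx = ReliableReceiver(send_nak=lambda s: print(f"NAK {s}"))
rx.on_message(0, "a")
rx.on_message(2, "c")   # prints "NAK 1"
```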

  40. R U O K ? 46. Which of the following describe basic reliable multicasting? • TCP only guarantees point-to-point deliveries. • Broadcasting via point-to-point connections is efficient for relatively few group members. • Sequence numbering broadcast messages enables receivers to NAK missing messages. • All of the above. • None of the above.

  41. Scalability in Reliable Multicasting • Having receivers send a few NAKs, rather than many ACKs, scales up to larger groups. • But the server’s deleting an old message risks the possibility that some receiver still has not received it.

  42. Nonhierarchical Feedback Control • The Scalable Reliable Multicasting protocol does just the right amount of feedback suppression (a suppression sketch follows below). • When a receiver misses a message, it multicasts its NAK (see above), which suppresses all others’ NAKs. • NAK collisions are prevented by randomly delaying the NAK while listening for others’ NAKs, as in the Ethernet protocol. • WANs with long propagation delays can’t do this very well; neighboring nodes can team up on NAKing by communicating with each other via a separate channel.
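
A hedged sketch of that suppression rule: delay your own NAK by a random amount and cancel it if someone else’s NAK for the same message arrives first. Class and callback names are illustrative:

```python
import random

class NakSuppressor:
    """Delay our NAK randomly; cancel it if another receiver NAKs first."""

    def __init__(self, multicast_nak, max_delay=0.5):
        self.multicast_nak = multicast_nak
        self.max_delay = max_delay
        self.pending = {}        # seq -> time at which our NAK is due

    def schedule_nak(self, seq, now):
        self.pending.setdefault(seq, now + random.uniform(0, self.max_delay))

    def on_foreign_nak(self, seq):
        self.pending.pop(seq, None)   # someone beat us to it: suppress ours

    def tick(self, now):
        for seq, due in list(self.pending.items()):
            if now >= due:
                self.multicast_nak(seq)
                del self.pending[seq]

s = NakSuppressor(multicast_nak=lambda seq: print(f"NAK {seq}"))
s.schedule_nak(7, now=0.0)
s.on_foreign_nak(7)   # another receiver's NAK arrived first, so ours is dropped
s.tick(now=1.0)       # prints nothing
```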

  43. R U O K ? 47. Which of the following accurately characterizes the Scalable Reliable Multicasting protocol doing just the right amount of feedback suppression? • When a receiver misses a message, it multicasts its NAK, which suppresses all others’ NAKs. • NAK collisions are prevented by the receiver’s randomly delaying its NAK while listening for others’ NAKs, as in the Ethernet protocol. • WANs with long propagation delays can’t do this very well, but neighboring nodes can team up on NAKing, by communicating with each other via a separate channel. • All of the above. • None of the above.

  44. Hierarchical Feedback Control • Hierarchical groups scale better than flat ones. • Sender sends to roots of large spanning trees. • The roots’ local coordinators buffer and relay messages, as well as handle their subgroups’ ACKs and NAKs. • Application-level multicasting (pp.166-170) can solve the hierarchical subgroups’ dynamic growth and contraction problems.

  45. R U O K ? 48. Which of the following is a reason why hierarchical groups scale better than flat ones? • Sender sends to roots of large spanning trees. • Roots’ local coordinators buffer and relay messages, as well as handle their subgroups’ ACKs and NAKs. • Application-level multicasting can solve the hierarchical subgroups’ dynamic growth and contraction problems. • All of the above. • None of the above.
