ITEC801 Distributed Systems

ITEC801 Distributed Systems: Fault Tolerance (Coulouris, Chapters 8 and 14)

Introduction
• Characteristic feature of a distributed system: partial failure.
  • Occurs when one component fails.
  • Affects the proper operation of some components, while others continue to work.
• In contrast, a failure in a non-distributed system is total.
• Distributed design goal: automatic recovery from partial failure without seriously affecting overall performance.
  • The system keeps operating acceptably while recovery takes place.

Introduction to Fault Tolerance
• Also known as fail-safe design: a design that enables a system to continue operating, possibly at a reduced level (also known as graceful degradation), rather than failing completely when some part of the system fails.
• "More or less fully operational" is judged by measures such as throughput and response time.
• Other examples:
  • Motor vehicle
  • Structure
• Fault tolerance is not just a property of individual machines; it can also characterise the protocols by which they interact, e.g. TCP.

Key Properties
Dependable system:
• Availability: the property that a system is ready to be used immediately.
• Reliability: the property that a system can run continuously without failure.
• Safety: when the system temporarily fails to operate correctly, nothing catastrophic happens.
• Maintainability: how easily a failed system can be repaired.
• Note the difference between availability and reliability: availability refers to an instant in time, reliability to a time interval.
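
The difference between availability and reliability can be made concrete with a little arithmetic. The sketch below assumes availability is measured as uptime divided by total time; the two systems and their figures are illustrative assumptions, not measurements.

```python
def availability(uptime, downtime):
    """Fraction of time the system is ready for use (assumed definition)."""
    return uptime / (uptime + downtime)


# Hypothetical system A: goes down for 1 millisecond every hour (units: seconds).
# It is almost always available, but it cannot run for long without a failure.
system_a = availability(uptime=3600 - 0.001, downtime=0.001)

# Hypothetical system B: never crashes, but is shut down for 2 weeks
# every year (units: weeks). Reliable while running, yet less available.
system_b = availability(uptime=50, downtime=2)

print(f"A: {system_a:.8f}  B: {system_b:.4f}")
```

System A is highly available but not reliable; system B is reliable but has lower availability.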

Error and Fault
• Error: a part of the system state that may lead to a failure.
  • Example: damaged packets.
• Fault: the cause of an error.
  • Example: a bad transmission medium.

Classification of Faults
• Transient: occurs once and then disappears.
  • Caused by environmental conditions
  • Soft
• Intermittent: occurs, vanishes, and then reappears.
  • Difficult to diagnose
  • Unstable/variation
• Permanent: continues to exist.
  • Hard

Failure Models

Failure Models
• Crash failure: occurs when a component permanently halts, but was working correctly until it stopped.
  • Nothing more is heard from the component.
  • Example: operating system failure.
• Omission fault/failure: a component that does not respond to an input from another component, and thereby fails to produce the expected output, exhibits an omission fault; the corresponding failure is an omission failure.
  • Example: a server fails to respond to a request.

Failure Models
• Timing fault/failure: a timing fault causes the component to respond with the correct value but outside the specified interval (either too soon or too late). The corresponding failure is a timing failure.
  • Example: an overloaded server that processes requests slowly.

Failure Models
• Response failure: the response of a component (for example, a server) is simply incorrect.
• Value fault/failure: a fault that causes a component to respond within the correct time interval but with an incorrect value is termed a value fault (the corresponding failure is called a value failure).
  • Example: a faulty communication link.
• State transition failure: a component reacts unexpectedly to an incoming request.
  • Example: a server receives a message it does not recognize.

Failure Models
• Arbitrary failures: a component may fail in both the time and the value domains in a manner that is not covered by any of the previous classes. A failed component that produces such output is said to exhibit an arbitrary failure (Byzantine failure).
• Examples:
  • A server produces incorrect output that is not detected as incorrect.
  • Malicious collusion.
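
The failure models above can be summarised as a small classifier over the time and value domains. This is only a rough sketch: the function name, the `(value, time)` response format, and the rule "wrong in both domains means arbitrary" are simplifying assumptions made for illustration.

```python
def classify(response, expected_value, earliest, latest):
    """Classify one observed response against its specification.

    response is None (nothing was heard) or a (value, time) tuple.
    An observer cannot tell a crash from an omission by a single
    missing response, so both are reported together.
    """
    if response is None:
        return "crash or omission failure"
    value, time = response
    on_time = earliest <= time <= latest
    correct = value == expected_value
    if on_time and correct:
        return "no failure"
    if correct:
        return "timing failure"   # right value, outside the interval
    if on_time:
        return "value failure"    # inside the interval, wrong value
    return "arbitrary failure"    # wrong in both domains (simplification)


print(classify((42, 9.0), expected_value=42, earliest=0.0, latest=5.0))
```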

Agreement in Faulty Systems
• In most cases we assume that the processes in a group can reach agreement.
• Examples:
  • Coordinator election
  • Whether or not to commit
  • Task division
  • Synchronization
• Achieving agreement can be non-trivial.
• Assumption: processes cooperate. This may not be the case.

Agreement in Faulty Systems
• Challenge: reach consensus
  • among all non-faulty processes
  • within a finite number of steps.
• Problem: different assumptions about the underlying system require different solutions.
  • Synchronous versus asynchronous
  • Communication delay bounded or not
  • Message delivery ordered or not
  • Unicasting versus multicasting

Agreement in Faulty Systems
• Figure: circumstances under which distributed agreement can be reached.

Byzantine Agreement Problem
• The problem of reaching consensus among distributed units when some of them give misleading answers.
• The problem is couched in terms of generals deciding on a common plan of attack.
• Some traitorous generals may lie about whether they will support a particular plan and about what other generals told them. Exchanging only messages, what decision-making algorithm should the generals use to reach consensus?
• What percentage of liars can the algorithm tolerate and still correctly determine a consensus?

Byzantine Agreement Problem

Byzantine Agreement Problem
• Figure 8-5. The Byzantine agreement problem for three nonfaulty and one faulty process. (a) Each process sends its value to the others. (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives in step 3.
• With k faulty processes, agreement requires 2k+1 nonfaulty processes, for a total of 3k+1.

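Byzantine Agreement Problem
• The same as in the previous case, except now with two correct processes and one faulty process.
• With only three processes the total is below 3k+1 for k = 1, so no majority emerges and agreement cannot be reached.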
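
The vector exchange described in the two figures can be simulated in a few lines. This is a minimal sketch assuming reliable, synchronous message exchange; `byzantine_round` and the `lie` function (which makes a faulty process send a different bogus value to every receiver) are hypothetical names introduced for illustration.

```python
from collections import Counter

UNKNOWN = None


def lie(sender, receiver, slot):
    """Hypothetical faulty behaviour: a different bogus value each time."""
    return f"lie-{receiver}-{slot}"


def byzantine_round(values, faulty):
    """values: dict pid -> true value; faulty: set of faulty pids.

    Returns, for each nonfaulty process, the result vector it computes.
    """
    pids = sorted(values)
    # Steps 1-2: every process sends its value to the others, and each
    # process assembles a vector of what it heard.
    vectors = {
        p: {q: lie(q, p, q) if q in faulty and q != p else values[q]
            for q in pids}
        for p in pids
    }
    results = {}
    for p in pids:
        if p in faulty:
            continue
        # Step 3: p receives the vector of every other process;
        # a faulty process forwards garbage instead.
        received = [
            {s: lie(q, p, s) for s in pids} if q in faulty else vectors[q]
            for q in pids if q != p
        ]
        # Step 4: keep a slot's value only if a majority of the
        # received vectors agree on it.
        result = {}
        for s in pids:
            value, votes = Counter(v[s] for v in received).most_common(1)[0]
            result[s] = value if votes > len(received) / 2 else UNKNOWN
        results[p] = result
    return results


four = byzantine_round({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3})
three = byzantine_round({1: 1, 2: 2, 3: 3}, faulty={3})
```

With four processes (one faulty) the nonfaulty processes agree on every slot except the faulty one; with three processes every slot stays unknown.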

Reliable Client-Server Communication
• Reliable point-to-point communication: TCP.
• TCP masks omission failures by means of acknowledgements and retransmissions.
• Crash failures are not masked: the connection is abruptly broken.
  • Recovery: resend a connection request.

RPC Semantics in the Presence of Failures
• Five different classes of failures can occur in RPC systems:
  • The client is unable to locate the server.
  • The request message from the client to the server is lost.
  • The server crashes after receiving a request.
  • The reply message from the server to the client is lost.
  • The client crashes after sending a request.

Server Crashes (1)
• Figure 8-7. A server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution.

Server Crashes
• Three approaches:
  • At-least-once semantics
  • At-most-once semantics
  • Guarantee nothing
• The ideal, exactly-once semantics, cannot in general be guaranteed.
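
The difference between at-least-once and at-most-once semantics shows up in how the client stub reacts to a missing reply. The sketch below is an in-process simulation under assumed names (`FlakyServer`, `call_at_least_once`, `call_at_most_once`); a lost reply is modelled as a `TimeoutError`.

```python
class FlakyServer:
    """Hypothetical server that executes every request but loses
    the reply for the first `lost_replies` calls."""

    def __init__(self, lost_replies):
        self.lost_replies = lost_replies
        self.executions = 0

    def handle(self, request):
        self.executions += 1            # the operation is carried out
        if self.lost_replies > 0:
            self.lost_replies -= 1
            raise TimeoutError("no reply received")
        return f"done:{request}"


def call_at_least_once(server, request, max_tries=5):
    """Keep retrying until a reply arrives: the operation runs one
    or more times."""
    for _ in range(max_tries):
        try:
            return server.handle(request)
        except TimeoutError:
            continue
    raise TimeoutError("server unreachable")


def call_at_most_once(server, request):
    """Give up after the first attempt and report failure: the
    operation runs zero times or once, never twice."""
    try:
        return server.handle(request)
    except TimeoutError:
        return None


retrying = FlakyServer(lost_replies=2)
reply = call_at_least_once(retrying, "print")
print(reply, retrying.executions)
```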

Server Crashes (2)
Scenario: remote print job
• The client sends a request; the server has two options:
  • First send the completion message, then tell the printer to print.
  • First tell the printer to print, then send the completion message.
• Three events can happen at the server (in different orderings):
  • Send the completion message (M)
  • Print the text (P)
  • Crash (C)
• Strategies for the client to follow after a server crash:
  • Never reissue a request
  • Always reissue a request
  • Reissue only if no acknowledgement of delivery was received
  • Reissue only if an acknowledgement for the print request was received

Server Crashes (3)
• These events can occur in six different orderings (parentheses mark events that can no longer happen):
  • M → P → C: a crash occurs after sending the completion message and printing the text.
  • M → C (→ P): a crash happens after sending the completion message, but before the text could be printed.
  • P → M → C: a crash occurs after printing the text and sending the completion message.
  • P → C (→ M): the text is printed, after which a crash occurs before the completion message could be sent.
  • C (→ P → M): a crash happens before the server could do anything.
  • C (→ M → P): a crash happens before the server could do anything.

Server Crashes (4)
• Figure 8-8. Different combinations of client and server strategies in the presence of server crashes.
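
The combinations of client and server strategies can be enumerated mechanically. The sketch below assumes four client strategies (always reissue, never reissue, reissue only when acknowledged, reissue only when not acknowledged) and assumes that a reissued request is printed exactly once by the recovered server; the function and strategy names are illustrative.

```python
from itertools import permutations

STRATEGIES = ("always", "never", "only_when_acked", "only_when_not_acked")


def outcome(order, strategy):
    """order: a permutation of 'M', 'P', 'C'. Events listed after C
    never happen. Returns 'ZERO', 'OK' or 'DUP' (text printed 0, 1
    or 2 times)."""
    crash = order.index("C")
    printed = order.index("P") < crash
    acked = order.index("M") < crash
    reissue = {
        "always": True,
        "never": False,
        "only_when_acked": acked,
        "only_when_not_acked": not acked,
    }[strategy]
    return ("ZERO", "OK", "DUP")[int(printed) + int(reissue)]


def table():
    """Map (server strategy, event ordering, client strategy) to outcome."""
    rows = {}
    for order in permutations("MPC"):
        server = "M->P" if order.index("M") < order.index("P") else "P->M"
        for strategy in STRATEGIES:
            rows[(server, "".join(order), strategy)] = outcome(order, strategy)
    return rows


for key, result in sorted(table().items()):
    print(*key, result)
```

Under these assumptions, no pairing of a client strategy with a server strategy yields OK for every crash ordering.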

Lost Reply Messages
• To the client, a lost reply looks like a server crash.
• Idempotent requests can safely be repeated; non-idempotent requests cannot.
  • Example: operations on bank accounts.
• Other mechanisms:
  • Use of sequence numbers
  • Bookkeeping
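
Sequence numbers let a server recognise a retransmitted request and return the earlier reply instead of executing the operation again. The following is a minimal sketch using a bank account; the `Account` class, the `(client_id, seq)` key, and the unbounded reply cache are assumptions made for illustration.

```python
class Account:
    """Hypothetical server-side account that filters duplicate requests."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # (client_id, seq) -> reply already sent

    def deposit(self, client_id, seq, amount):
        key = (client_id, seq)
        if key in self.replies:
            # Retransmission of a request we already executed:
            # resend the old reply, do not deposit again.
            return self.replies[key]
        self.balance += amount
        self.replies[key] = self.balance
        return self.balance


account = Account()
first = account.deposit("client-1", seq=1, amount=100)
retry = account.deposit("client-1", seq=1, amount=100)  # reply was lost, client retries
print(first, retry, account.balance)
```

A real server would also need to decide when cached replies can be discarded, which this sketch ignores.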

Client Crashes
• If a client crashes before getting the reply, the computation becomes an orphan (the computation is assumed to take a long time).
• Orphans are
  • harmful
  • a cause of confusion.
• Solutions:
  • Orphan extermination
    • Problem: grand-orphans
  • Reincarnation, using time-based epochs
  • Gentle reincarnation
  • Expiration

Fault Tolerance in Groups
• The key approach to tolerating a faulty process is to organize several identical processes into a group.
• A collection of processes is treated as a single abstraction.
• When a message is sent to the group, all members of the group receive it.
• We have seen several group communication models already.
• The objective here is to see how fault tolerance can be achieved.

Receipt & Delivery
• The communication layer on a node receives a message.
• It informs the communication layers of all other nodes that it has received the message.
• When it has received such confirmations from all other nodes, it delivers the message to the application.
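
The split between receiving a message and delivering it can be sketched as a small buffer in the communication layer. The class below is an assumption-laden illustration: `CommLayer`, its method names, and the message identifiers are hypothetical, and the network itself is not modelled.

```python
class CommLayer:
    """Holds received messages back until every other node has
    confirmed receipt, then delivers them to the application."""

    def __init__(self, node_id, group):
        self.node_id = node_id
        self.others = set(group) - {node_id}
        self.pending = {}     # msg_id -> {"payload": ..., "confirms": set()}
        self.delivered = []   # payloads handed to the application

    def _entry(self, msg_id):
        return self.pending.setdefault(
            msg_id, {"payload": None, "received": False, "confirms": set()})

    def receive(self, msg_id, payload):
        """A message arrives. Returns the confirms to send to the others."""
        entry = self._entry(msg_id)
        entry["payload"], entry["received"] = payload, True
        self._try_deliver(msg_id)
        return [(peer, msg_id) for peer in sorted(self.others)]

    def confirm(self, msg_id, from_node):
        """Another node reports that it has received msg_id."""
        self._entry(msg_id)["confirms"].add(from_node)
        self._try_deliver(msg_id)

    def _try_deliver(self, msg_id):
        entry = self.pending[msg_id]
        if entry["received"] and entry["confirms"] == self.others:
            self.delivered.append(entry["payload"])   # message is now stable
            del self.pending[msg_id]

    def unstable(self):
        """Identifiers of messages received but not yet delivered."""
        return sorted(m for m, e in self.pending.items() if e["received"])


layer = CommLayer(node_id=1, group={1, 2, 3})
confirms_to_send = layer.receive("m1", "hello")
layer.confirm("m1", from_node=2)
```

After only one of the two confirms, the message is still held back and counts as unstable.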

Send Message
[Figure: a group of eight nodes, numbered 1 to 8]

Send Confirms
[Figure: a group of eight nodes, numbered 1 to 8]

Reliability
• First, use something like TCP:
  • Reliable point-to-point
  • Ordered
• What happens if the sender fails while sending?

Reliability
• When a message is delivered, it is marked as stable.
• So what should be done with unstable messages?
• What if the group changes?

Send Message
[Figure: a group of eight nodes, numbered 1 to 8]

Failure
• If a process fails, not all confirms will be received.
• The message will not be marked as stable.
• The removal of the failed process from the group will be noticed.

Group Membership Change
• The process that registers the change notifies all other remaining members.
• On receipt of such a message, a process
  • multicasts its unstable messages
  • multicasts a flush message.
• When a process has received a flush message from all remaining processes, it knows the new group membership.
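
The flush procedure above can be simulated for a single membership change. This is a simplified sketch that assumes reliable channels and no further failures during the flush; `flush_view_change` and its data layout are hypothetical.

```python
def flush_view_change(unstable, survivors):
    """Simulate one group membership change.

    unstable:  dict pid -> set of message ids that pid has received
               but that are not yet marked stable.
    survivors: the pids that remain in the group.
    Returns (installed, delivered): the processes that install the new
    view, and the messages each one delivers before doing so.
    """
    survivors = sorted(survivors)
    delivered = {p: set(unstable.get(p, ())) for p in survivors}
    flushes = {p: set() for p in survivors}
    for sender in survivors:
        for receiver in survivors:
            if receiver == sender:
                continue
            # The sender first multicasts its unstable messages ...
            delivered[receiver] |= set(unstable.get(sender, ()))
            # ... and then multicasts a flush message.
            flushes[receiver].add(sender)
    # A process knows the new membership once it holds a flush
    # message from every other remaining process.
    installed = [p for p in survivors if flushes[p] == set(survivors) - {p}]
    return installed, delivered


# Hypothetical scenario: a failed sender got "m1" to processes 1 and 3
# only, and "m2" to process 3 only.
view, seen = flush_view_change({1: {"m1"}, 2: set(), 3: {"m1", "m2"}}, [1, 2, 3])
print(view, seen)
```

Every surviving process ends up with the same set of messages before the new membership takes effect.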

Group Membership Change
[Figure: a group of eight nodes, numbered 1 to 8]

Unstable & Flush: process 2
[Figure: a group of eight nodes, numbered 1 to 8; legend: unstable message, flush message]

Distributed Commit
• An operation is performed by every member of a process group, or by none at all.
  • Reliable multicasting: the operation is the delivery of a message.
  • Distributed transaction: the operation is the commit.
• Often established by means of a coordinator.
  • Participants are told to perform an operation.
• Distributed commit is implemented using
  • two-phase commit
  • three-phase commit.

Two-Phase Commit (1)
• Step 1: The coordinator sends VOTE_REQUEST to all participants.
• Step 2: Each participant returns either VOTE_COMMIT or VOTE_ABORT.
• Step 3: The coordinator gathers the responses and issues GLOBAL_COMMIT if all participants voted to commit, otherwise GLOBAL_ABORT.
• Step 4: Each participant either commits or aborts accordingly.
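
The four steps can be sketched as a failure-free, in-process simulation. The `Participant` class, its `can_commit` flag, and `run_two_phase_commit` are assumptions made for illustration; timeouts, logging, and crashes are deliberately left out.

```python
class Participant:
    """Hypothetical participant; can_commit decides how it votes."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self):
        # Step 2: answer the coordinator's VOTE_REQUEST.
        if self.can_commit:
            self.state = "READY"
            return "VOTE_COMMIT"
        self.state = "ABORT"
        return "VOTE_ABORT"

    def on_decision(self, decision):
        # Step 4: follow the coordinator's global decision.
        self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"


def run_two_phase_commit(participants):
    # Step 1: the coordinator asks every participant to vote.
    votes = [p.on_vote_request() for p in participants]
    # Step 3: commit only if every single vote is VOTE_COMMIT.
    if all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    for p in participants:
        p.on_decision(decision)
    return decision


group = [Participant("a"), Participant("b"), Participant("c", can_commit=False)]
print(run_two_phase_commit(group), [p.state for p in group])
```

A single VOTE_ABORT is enough to abort the operation at every participant.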

Two-Phase Commit

Two-Phase Commit (1)
• (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.

Two-Phase Commit (2)
• Actions taken by a participant P when residing in state READY and having contacted another participant Q.
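
The actions referred to here can be written as a lookup on Q's state. The sketch below follows the usual textbook treatment of 2PC, in which P adopts a decision Q has already reached, aborts if Q is still in INIT, and otherwise tries another participant; the names `ACTION_FOR_P` and `resolve_from_peers` are hypothetical.

```python
# State of Q -> state P may move to (None means "ask someone else").
ACTION_FOR_P = {
    "COMMIT": "COMMIT",   # Q already saw GLOBAL_COMMIT
    "ABORT": "ABORT",     # Q already saw GLOBAL_ABORT
    "INIT": "ABORT",      # Q never voted, so the coordinator cannot have committed
    "READY": None,        # Q is waiting as well: contact another participant
}


def resolve_from_peers(peer_states):
    """P is in READY and cannot reach the coordinator.

    peer_states lists the states of the participants P contacts, in
    order. Returns the state P moves to, or 'READY' if every peer is
    also READY and P has to block until the coordinator recovers.
    """
    for q_state in peer_states:
        action = ACTION_FOR_P[q_state]
        if action is not None:
            return action
    return "READY"


print(resolve_from_peers(["READY", "COMMIT"]))
```

The blocking case, where all participants are in READY, is what motivates three-phase commit.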

Two-Phase Commit (3)
• Outline of the steps taken by the coordinator in a two-phase commit protocol. ...

Two-Phase Commit (4)
• Outline of the steps taken by the coordinator in a two-phase commit protocol. ...

Two-Phase Commit (5)
• (a) The steps taken by a participant process in 2PC.

Two-Phase Commit (7)
• (b) The steps for handling incoming decision requests.

Three-Phase Commit (1)
• The states of the coordinator and each participant satisfy the following two conditions:
  • There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
  • There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.

Three-Phase Commit (2)
• Figure 8-22. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.

Recovery
• So far, the focus has been on algorithms that tolerate faults.
• Once a failure has occurred, it is essential to bring the process back to a correct state (the state before the failure happened).
• What do we mean by recovery?
• How are the states recorded?
