Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
Paper by Felix C. Gärtner
Graeme Coakley
COEN 317
November 23, 2003

Presentation Overview
• Introduction
• Terminology
• Formal View of Fault Tolerance
• Four Types of Fault Tolerance
• Redundancy as the Key to Fault Tolerance
• Models of Computation And Their Relevance
• Achieving Safety
• Achieving Liveness
• Conclusions

Introduction
• Until the early 1990s, work in fault-tolerant computing focused on specific technologies and applications.
• This resulted in distinct terminologies and methodologies.
• Goals of the paper:
  • Structure the area clearly.
  • Survey the fundamental building blocks.

Terminology: States, Configurations, and Guarded Commands
• Distributed system: a finite set of processes.
• Local state: the values of the variables of a process.
• State transition: an event (send, receive, or internal) that changes the local state of a process.
• Guarded commands: an abstract representation of a local algorithm, written <guard> => <command>. The command may execute only when its guard holds.
• Configuration: the local states of all processes plus the state of the communication subsystem.

process Ping
var z : ℕ init 0
    ack : boolean init true
begin
    ¬ack ∧ rcv(m) => ack := true; z := z + 1
    ack => snd(a); ack := false
end

process Pong
var wait : boolean init true
begin
    ¬wait => snd(m); wait := true
    wait ∧ rcv(a) => wait := false
end

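The Ping/Pong guarded commands can be traced with a small simulation. The following is a minimal sketch in Python, not part of the paper: it assumes reliable FIFO channels and a scheduler that always picks the first enabled command, and the names `run_ping_pong`, `to_ping`, and `to_pong` are hypothetical.

```python
from collections import deque

def run_ping_pong(steps):
    """Execute up to `steps` guarded commands and return Ping's counter z."""
    z, ack = 0, True                      # Ping's local state
    wait = True                           # Pong's local state
    to_pong, to_ping = deque(), deque()   # channels (assumed reliable and FIFO)

    for _ in range(steps):
        # Collect every command whose guard holds in the current configuration.
        enabled = []
        if not ack and to_ping:
            enabled.append("ping_rcv")    # ¬ack ∧ rcv(m)
        if ack:
            enabled.append("ping_snd")    # ack
        if not wait:
            enabled.append("pong_snd")    # ¬wait
        if wait and to_pong:
            enabled.append("pong_rcv")    # wait ∧ rcv(a)
        if not enabled:
            break
        action = enabled[0]               # assumption: deterministic scheduler
        if action == "ping_rcv":
            to_ping.popleft(); ack = True; z += 1
        elif action == "ping_snd":
            to_pong.append("a"); ack = False
        elif action == "pong_snd":
            to_ping.append("m"); wait = True
        else:                             # pong_rcv
            to_pong.popleft(); wait = False
    return z
```

Under these assumptions one full round (send a, receive a, send m, receive m) takes four steps, so `run_ping_pong(8)` completes two rounds.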
Terminology (continued): Defining Faults and Fault Models
• Fault: may cause an error.
• Error: may lead to a failure.
• Failure: the system has left its correctness specification.
• Fault models: crash failure, fail-stop, and Byzantine.
• A fault can be modeled as an unwanted state transition of a process.

Terminology (continued): Properties of Distributed Systems: Safety and Liveness
• Safety property: some specific “bad thing” never happens within the system.
• Liveness property: some “good thing” will eventually happen during system execution.
• Problem specification: consists of a safety property and a liveness property.

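The two kinds of property can be illustrated by checking them on a recorded trace of states. This is a minimal sketch, assuming a hypothetical mutual-exclusion trace; the function names and the trace are illustrative and do not come from the paper.

```python
def satisfies_safety(trace, bad):
    """Safety: no state in the trace is a 'bad thing'."""
    return not any(bad(state) for state in trace)

def satisfies_liveness(trace, good):
    """Liveness, checked on a finite prefix: some state is a 'good thing'."""
    return any(good(state) for state in trace)

# Hypothetical trace: each state is (p1_in_critical_section, p2_in_critical_section).
trace = [(False, False), (True, False), (False, False), (False, True)]

def both_in_cs(state):
    return state[0] and state[1]      # the "bad thing": mutual exclusion violated

def p2_enters(state):
    return state[1]                   # the "good thing": p2 eventually gets in
```

Note the asymmetry: a finite trace can refute a safety property, but it can never refute a liveness property, because the good thing might still happen later.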
Formal View of Fault Tolerance
Definition: A distributed program A is said to tolerate faults from a fault class F for an invariant P iff there exists a predicate T for which the following requirements hold:
• P => T
• T is closed in A and T is closed in F.
• Starting from any state where T holds, every computation that executes actions from A alone eventually reaches a state where P holds.

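For a finite state space, the three requirements of the definition can be checked mechanically. The sketch below is an illustration only: it assumes that program and fault actions are given as functions from a state to the set of successor states, that no fairness assumption is needed, and the toy countdown program is hypothetical.

```python
def tolerates(states, program, fault, P, T):
    """Check the three requirements for invariant P and candidate predicate T."""
    # 1. P => T in every state.
    if any(P(s) and not T(s) for s in states):
        return False
    # 2. T is closed in A and in F: no action leads from a T-state out of T.
    for s in states:
        if T(s) and any(not T(t) for t in program(s) | fault(s)):
            return False
    # 3. Convergence: from every T-state, all program-only computations reach P.
    #    Fixpoint: a state converges if it has successors and all of them converge.
    converges = {s for s in states if P(s)}
    changed = True
    while changed:
        changed = False
        for s in states:
            succ = program(s)
            if s not in converges and succ and succ <= converges:
                converges.add(s)
                changed = True
    return all(s in converges for s in states if T(s))

# Hypothetical toy program: count x down to 0; the fault resets x to 2.
states = range(4)
program = lambda x: {x - 1} if x > 0 else {0}
fault = lambda x: {2}
P = lambda x: x == 0
```

With T chosen as x <= 2 all three requirements hold, whereas x <= 1 fails because the fault action leaves it.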
Four Types of Fault Tolerance

                                  Liveness property satisfied
                                  Yes            No
Safety property satisfied   Yes   Masking        Fail-safe
                            No    Nonmasking     None

Redundancy as the Key to Fault Tolerance: Defining Redundancy
• A distributed program A is said to be redundant in space iff for all executions e of A in which no faults occur, the set of all configurations of A contains configurations that are not reached in e.
• A is said to be redundant in time iff for all executions e of A in which no faults occur, the set of actions of A contains actions that are never executed in e.
• A program is said to employ redundancy iff it is redundant in space or in time.

Example: program with redundancy in space and in time
process Redundancy
var x ∈ {0, 1, 2} init 1    {* local state *}
begin
    {* normal program actions: *}
    x = 1 => x := 2    {* 1 *}
    x = 2 => x := 1    {* 2 *}
    x = 0 => x := 1    {* 3 *}
    {* fault action: *}
    true => x := 0
end

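Both kinds of redundancy in the Redundancy process can be confirmed by running it without the fault action and recording what is reached. This is a minimal sketch; the names `ACTIONS`, `ALL_CONFIGS`, and `fault_free_run` are hypothetical. Because exactly one guard is enabled in each state, the single fault-free run stands for all fault-free executions.

```python
# Normal program actions, numbered as in the example: (guard, command).
ACTIONS = {
    1: (lambda x: x == 1, lambda x: 2),
    2: (lambda x: x == 2, lambda x: 1),
    3: (lambda x: x == 0, lambda x: 1),
}
ALL_CONFIGS = {0, 1, 2}

def fault_free_run(init=1, max_steps=100):
    """Run without faults; return (configurations reached, actions executed)."""
    x = init
    reached, executed = {x}, set()
    for _ in range(max_steps):
        enabled = [n for n, (guard, _) in ACTIONS.items() if guard(x)]
        if not enabled:
            break
        n = enabled[0]            # deterministic here: only one guard holds
        executed.add(n)
        x = ACTIONS[n][1](x)
        reached.add(x)
    return reached, executed

reached, executed = fault_free_run()
unreached_configs = ALL_CONFIGS - reached     # redundancy in space
unused_actions = set(ACTIONS) - executed      # redundancy in time
```

Configuration x = 0 is never reached and action 3 is never executed unless the fault action fires, which is exactly what makes them redundant.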
Redundancy as the Key to Fault Tolerance (continued)
Claim:
• If A is a nontrivial distributed program that does not employ redundancy, then A may become incorrect regarding its correctness specification in the presence of faults.
Conclusion:
• While redundancy is not sufficient for fault tolerance, it is a necessary condition.
• Redundancy in space is widespread.

Models of Computation And Their Relevance: Models of Distributed Systems
• Synchronous systems: there are real-time bounds on message transmission and process response times.
• Partially synchronous systems: intermediate models in which such bounds hold to varying degrees.
• Asynchronous systems: no timing bounds are assumed.
  • This is the weakest model, and a realistic one in many applications.
  • Every algorithm that works in this model also works in all other models.
  • A crashed process cannot be distinguished from one that is merely slow.

Achieving Safety: Detection as the Basis for Safety
• To ensure safety, we need to employ detection and subsequently inhibit dangerous actions.
• Common detection mechanisms: parity, checksums.
• Detection includes checking whether a certain predicate Q holds over the entire system.
• Q is easier to specify if the type and effect of faults from F are known.

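A single even-parity bit is the simplest instance of "detect, then inhibit". The sketch below is illustrative only; the function names are hypothetical, and a lone parity bit detects any odd number of bit flips but misses an even number.

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """Detection predicate: True when the word still has even parity."""
    return sum(word) % 2 == 0

def deliver(word):
    """Inhibit the dangerous action: hand over the payload only if the check passes."""
    if not parity_ok(word):
        return None               # detected corruption; refuse to deliver
    return word[:-1]

word = add_parity([1, 0, 1, 1])
corrupted = list(word)
corrupted[2] ^= 1                 # simulate a single bit flip in transit
```

Returning `None` keeps the system safe but does nothing for liveness; recovering the lost payload is a separate correction step.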
Achieving Safety: Detection in Distributed Settings
• Deciding whether a predicate over the global state does or does not hold is not easy.
• Cooper and Marzullo introduced two predicate transformers:
  • Possibly(Q) is true iff there exists a continuous observation of the computation for which Q holds at some point.
  • Definitely(Q) is true iff for all possible continuous observations of the computation Q holds at some point.

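For a small recorded computation, the two transformers can be evaluated by exploring the consistent global states. This is a sketch under stated assumptions rather than Cooper and Marzullo's algorithm as published: `states[p][k]` is the local state of process p after k events, each message is a tuple (sender, send event number, receiver, receive event number) with 1-based event numbers, and all function names are hypothetical.

```python
def consistent(cut, messages):
    """A cut is consistent if every message it has received was also sent."""
    return all(cut[rcv_p] < rcv_no or cut[snd_p] >= snd_no
               for snd_p, snd_no, rcv_p, rcv_no in messages)

def reachable(states, messages, allowed):
    """Consistent cuts reachable from the initial cut through allowed cuts only."""
    lengths = tuple(len(s) - 1 for s in states)
    start = (0,) * len(states)
    seen = set()
    stack = [start] if allowed(start) else []
    while stack:
        cut = stack.pop()
        if cut in seen:
            continue
        seen.add(cut)
        for p in range(len(cut)):            # advance one process by one event
            if cut[p] < lengths[p]:
                nxt = cut[:p] + (cut[p] + 1,) + cut[p + 1:]
                if consistent(nxt, messages) and allowed(nxt):
                    stack.append(nxt)
    return seen

def view(states, cut):
    """Global state (tuple of local states) at a cut."""
    return tuple(states[p][k] for p, k in enumerate(cut))

def possibly(Q, states, messages):
    """Some observation passes through a global state where Q holds."""
    return any(Q(view(states, c)) for c in reachable(states, messages, lambda c: True))

def definitely(Q, states, messages):
    """No observation can reach the final cut while avoiding Q everywhere."""
    final = tuple(len(s) - 1 for s in states)
    avoiding = reachable(states, messages, lambda c: not Q(view(states, c)))
    return final not in avoiding

# Two processes, one event each: x goes 0 -> 1 and y goes 0 -> 1.
states = [[0, 1], [0, 1]]
Q = lambda g: g == (1, 0)
```

Without messages, an observer may see y change first, so Q is possible but not definite. If the first event of process 0 sends a message that the first event of process 1 receives, every observation sees x change first and Q becomes definite.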
Achieving Safety: Adapting Consensus Algorithms
• Consensus: a set of processes, each with an initial value, must all decide on a common value.
• A central process acts as an observer that can construct all possible observations.
• The central process scheme is not very fault tolerant:
  • The central observer can crash.
  • The central observer can send arbitrary messages.
• Solution: diffuse information among all nodes.

Achieving Safety: Detecting Process Crashes
• Fully asynchronous model: crashes are impossible to detect reliably.
• Chandra and Toueg proposed unreliable failure detectors to extend the asynchronous model.
• A key property of failure detectors is accuracy:
  • Weak: at least one correct process is never suspected of having crashed.
  • Eventually weak: the failure detector may suspect every process at one time or another, but there is a time after which some correct process is no longer suspected.

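One common way to approximate an unreliable failure detector is heartbeats with a timeout that grows after each false suspicion. The sketch below is an illustration under that assumption, not Chandra and Toueg's formal construction; the class name, the doubling policy, and the explicit `now` timestamps are all hypothetical. It can only become eventually accurate if message delays are eventually bounded.

```python
class HeartbeatDetector:
    """Suspects a peer when no heartbeat has arrived within that peer's timeout."""

    def __init__(self, peers, timeout=1.0):
        self.timeout = {p: timeout for p in peers}
        self.last_seen = {p: 0.0 for p in peers}
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        self.last_seen[peer] = now
        if peer in self.suspected:
            # The suspicion was wrong: retract it and be more patient next time.
            self.suspected.discard(peer)
            self.timeout[peer] *= 2

    def check(self, now):
        """Return the set of peers currently suspected of having crashed."""
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)

detector = HeartbeatDetector(["p", "q"], timeout=1.0)
detector.on_heartbeat("p", now=0.5)
detector.on_heartbeat("q", now=0.5)
first = detector.check(now=2.0)       # both are late, so both are suspected
detector.on_heartbeat("p", now=2.5)   # p was only slow; its timeout doubles
second = detector.check(now=4.0)      # q never reported again
```

The slow process p is suspected once and then trusted again with a longer timeout, while the silent process q stays suspected.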
Achieving Liveness: Correction
• Liveness is tied to the notion of correction.
• Correction refers to turning a bad state into a good one.
• Common methods include retransmission, error-correction codes, rollback recovery, and rollforward recovery.
• On detecting a bad state via a detection predicate Q, the system must try to impose a new target predicate R onto the system.

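Rollback recovery shows the detect-then-correct pattern in a few lines. In this minimal sketch the detection predicate Q is "the state is bad" and the target predicate R is "the state equals the last checkpoint"; the class name and the account example are hypothetical.

```python
import copy

class RollbackRecovery:
    """Apply updates; on detecting a bad state, restore the last good checkpoint."""

    def __init__(self, state, is_bad):
        self.state = state
        self.is_bad = is_bad                        # detection predicate Q
        self.checkpoint = copy.deepcopy(state)

    def step(self, update):
        """Run one update. Returns True if it was kept, False if rolled back."""
        update(self.state)
        if self.is_bad(self.state):
            # Correction: impose R by restoring the checkpointed state.
            self.state = copy.deepcopy(self.checkpoint)
            return False
        self.checkpoint = copy.deepcopy(self.state)  # new good state
        return True

account = RollbackRecovery({"balance": 10}, is_bad=lambda s: s["balance"] < 0)
kept = account.step(lambda s: s.update(balance=s["balance"] - 3))
rolled_back = account.step(lambda s: s.update(balance=s["balance"] - 20))
```

The second update would overdraw the account, so it is undone and the balance returns to the last checkpointed value.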
Achieving Liveness: Correction via Consensus
• Correction corresponds to the decision phase of consensus algorithms.
• State machine approach (Schneider):
  • Servers are made fault tolerant by replicating them and coordinating their behavior via consensus algorithms.
• Other methods are based on several forms of fault-tolerant broadcasts.

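The state machine approach can be sketched by feeding the same ordered command log to several replicas and voting on their outputs. The example below assumes the ordering has already been agreed upon (the consensus step itself is not shown), and the `Replica` class, its commands, and the way a faulty replica misbehaves are all hypothetical.

```python
from collections import Counter

class Replica:
    """A deterministic state machine holding a single integer."""

    def __init__(self, faulty=False):
        self.value = 0
        self.faulty = faulty

    def apply(self, command):
        op, arg = command
        if op == "add":
            self.value += arg
        elif op == "set":
            self.value = arg
        # Assumption: a faulty replica reports a wrong result.
        return -1 if self.faulty else self.value

def replicated_apply(replicas, log):
    """Apply each command everywhere and return the majority output of the last one."""
    result = None
    for command in log:
        outputs = [replica.apply(command) for replica in replicas]
        answer, votes = Counter(outputs).most_common(1)[0]
        if votes <= len(replicas) // 2:
            raise RuntimeError("no majority among replica outputs")
        result = answer
    return result

replicas = [Replica(), Replica(faulty=True), Replica()]
log = [("add", 5), ("set", 2), ("add", 3)]
result = replicated_apply(replicas, log)
```

With three replicas the vote masks one faulty replica. If a majority were faulty, the vote would return their wrong answer, so the replication degree has to match the assumed fault class.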
Conclusions
• The paper introduces a formal approach that structures the area of fault-tolerant distributed computing, surveys fundamental methodologies, and discusses their relations.
• This approach reveals the inherent limitations of fault-tolerance methodologies and their interactions with system models.
• The paper could not integrate the entire area of fault-tolerant distributed computing.
• Many topics still need further attention.