A Theory of Fault-Tolerance

Unifying Fault-Tolerance Approaches
• Several disciplines with focus on different faults and specific architectures
  • Crash recovery
  • Atomic transactions
  • Fault-tolerance of digital systems
  • Fault-tolerance in message-passing systems
• Verification of fault-tolerance
  • Application-specific
  • Verify recovery and safely terminate (mask the faults)
• Less attention given to non-maskable faults
[Arora 1992] A Foundation of Fault-Tolerant Computing, PhD thesis, University of Texas at Austin, 1992.

A Foundation of Fault-Tolerant Computing
• Provide a uniform definition of fault-tolerance
• Provide verification methods independent of technology, architecture, or application
[Arora 1992] A Foundation of Fault-Tolerant Computing, PhD thesis, University of Texas at Austin, 1992.

Program and Fault
• Program model: synchronization skeleton of finite-state programs
  • Finite number of variables with finite domains
  • Finite number of processes
• State: a valuation of program variables
• Finite state space Sp
• Program p ⊆ Sp × Sp, fault f ⊆ Sp × Sp
• Use Dijkstra's guarded commands (actions) as a shorthand to represent program and fault transitions
  • Guard → Statement;
[Figure: state space Sp with program transitions p and fault transitions f]

Examples of Intermittent Faults
• Intermittent faults
  • Sudden acceleration in cruise control systems
    • E.g., cruise control that only works in wet weather
  • Malfunction in a component of an electronic circuit when the voltage goes beyond a threshold
    • x and y are two contact points in a circuit that have independent voltages. However, when the voltage level of x goes beyond 3.5 V, y gets the same voltage as x. We model this class of faults by the following guarded command:
      x > 3.5 → y := x;

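
A guarded command can be read as a (guard, statement) pair over states. The following is a minimal Python sketch of that reading, not part of the original slides; the names `Action`, `voltage_fault`, and `enabled_successors`, and the sample voltages, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
# A guarded command is a (guard, statement) pair: the statement may run
# only in states where the guard holds.
Action = Tuple[Callable[[State], bool], Callable[[State], State]]

# Fault action from the slide: x > 3.5 -> y := x
voltage_fault: Action = (
    lambda s: s["x"] > 3.5,
    lambda s: {**s, "y": s["x"]},
)

def enabled_successors(state: State, actions: List[Action]) -> List[State]:
    """Return the states reachable by executing one enabled action."""
    return [stmt(state) for guard, stmt in actions if guard(state)]

# Hypothetical sample states: the fault is disabled at 3.0 V and enabled at 4.0 V.
below = enabled_successors({"x": 3.0, "y": 1.2}, [voltage_fault])
above = enabled_successors({"x": 4.0, "y": 1.2}, [voltage_fault])
```

In this sketch `below` is empty because the guard is false, while `above` holds the single state in which y has taken the voltage of x.
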
Examples of Transient Faults
• Transient faults
  • A hardware interrupt routine gets called without any interrupt being raised by hardware devices
  • Solar radiation corrupts the communication and navigation systems
  • The variables of the controlling software of space shuttles may be corrupted by transient solar radiation
    true → x := (corrupted value);
• The guarded command above means that, at any state of the system, the variable x may be corrupted due to transient faults

Transient vs. Intermittent Faults
• Transient faults are difficult (if not impossible) to reproduce
  • Can we reproduce solar radiation?
• Intermittent faults may be reproduced under certain conditions
  • E.g., pressing the 'Ctrl' key causes the system to reset

State Predicate
• State predicate X: X ⊆ Sp
• Closure: X is closed in p
• Projection: p|X = {(s0, s1) | (s0, s1) ∈ p ∧ s0 ∈ X ∧ s1 ∈ X}
[Figure: state predicate X inside the state space Sp, with transitions of p]

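
When p is given explicitly as a set of state pairs, the projection is a one-line set comprehension. The sketch below assumes a hypothetical four-state program (states 0 to 3 and its transitions are illustrative, not taken from the slides).

```python
def projection(p, X):
    """p|X: the transitions of p that start and end in X."""
    return {(s0, s1) for (s0, s1) in p if s0 in X and s1 in X}

# Hypothetical program over states 0..3, written as a set of transitions.
p = {(0, 1), (1, 0), (2, 1), (3, 2)}

p_on_X = projection(p, {0, 1})  # keeps the 0 <-> 1 transitions
p_on_Y = projection(p, {1, 2})  # drops (1, 0) because 0 is outside {1, 2}
```
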
Program Computations
• Program computations
  • Infinite sequences of program transitions
• Computation prefix
  • Finite sequences of program transitions
• Program computations in the presence of faults (denoted p[]f)
  • Infinite sequences of program and fault transitions
[Figure: state sequences s0, s1, s2, s3, ..., sn, ...]

Specification, Invariant, and Fault-Span
• Safety specification: something bad never happens
  • Formal representation: a subset of Sp × Sp (the set of bad transitions)
  • E.g., transitions that change the value of a counter from non-zero values to zero
• Liveness specification: something good will eventually happen
  • In the absence of faults, fault-tolerant program p' satisfies the liveness specification of the fault-intolerant program p
• Invariant S ⊆ Sp, fault-span T ⊆ Sp
[Figure: invariant S and fault-span T inside Sp, with transitions of p, f, and p[]f]

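
With safety represented as a set of bad transitions, checking a computation prefix reduces to scanning its consecutive state pairs. The sketch below uses the counter example from the slide; the counter domain 0..3 and the function name `violates_safety` are assumptions made for illustration.

```python
def violates_safety(prefix, bad):
    """True if some consecutive pair of states in the prefix is a bad transition."""
    return any((s0, s1) in bad for s0, s1 in zip(prefix, prefix[1:]))

# Assumed counter domain 0..3; a state is just the counter value.
DOMAIN = range(4)
# Bad transitions: the counter changes from a non-zero value to zero.
bad = {(c, 0) for c in DOMAIN if c != 0}

ok_prefix = [0, 1, 2, 3]   # never resets to zero
bad_prefix = [1, 2, 0]     # 2 -> 0 is a bad transition
```
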
Token Ring Example
• Processes: P0, P1, P2, P3
• Variables: x0, x1, x2, x3 (domain: {0, 1, ⊥})
• Dijkstra's guarded commands (actions)
  • Guard → Statement;
• Fault-intolerant program
  • Process P0
    TR0: (x0 = 1) ∧ (x3 = 1) → x0 := 0;
    TR'0: (x0 = 0) ∧ (x3 = 0) → x0 := 1;
[Figure: ring of processes P0, P1, P2, P3]

Token Ring Example – Continued
• Processes P1, P2, P3
    TRi: (xi = 0) ∧ (x(i-1) = 1) → xi := 1;
    TR'i: (xi = 1) ∧ (x(i-1) = 0) → xi := 0;
• Fault transitions: process-restart
    true → xj := ⊥;

Token Ring Example – Continued
• Invariant (a state is represented as a tuple <x0, x1, x2, x3>):
    <0, 0, 0, 0>, <0, 1, 1, 1>, <1, 0, 0, 0>, <0, 0, 1, 1>,
    <1, 1, 0, 0>, <0, 0, 0, 1>, <1, 1, 1, 0>, <1, 1, 1, 1>
• Safety specification
  • A corrupted value does not affect a non-corrupted process
  • There is only one token in the ring
• Liveness of the fault-intolerant program
  • The token should be circulated infinitely often

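
The token ring actions and the listed invariant can be checked by brute force. The following Python sketch is an illustration under simplifying assumptions: it covers only the fault-intolerant program over the values 0 and 1 (the corrupted value and the fault action are left out), and the helper names `successors` and `run` are hypothetical.

```python
N = 4  # processes P0..P3

# Invariant states from the slide, as tuples <x0, x1, x2, x3>.
INVARIANT = {
    (0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0),
    (1, 1, 1, 1), (0, 1, 1, 1), (0, 0, 1, 1), (0, 0, 0, 1),
}

def successors(state):
    """States reachable by one action of the fault-intolerant program."""
    result = []
    # P0 (TR0, TR'0): enabled when x0 = x3; flips x0.
    if state[0] == state[3]:
        result.append((1 - state[0],) + state[1:])
    # Pi, i = 1..3 (TRi, TR'i): enabled when xi != x(i-1); copies x(i-1).
    for i in range(1, N):
        if state[i] != state[i - 1]:
            result.append(state[:i] + (state[i - 1],) + state[i + 1:])
    return result

def run(start, steps):
    """Follow the first enabled action for the given number of steps."""
    trace = [start]
    for _ in range(steps):
        trace.append(successors(trace[-1])[0])
    return trace

# Closure: every program transition from an invariant state stays in the invariant.
closed = all(set(successors(s)) <= INVARIANT for s in INVARIANT)
# Eight steps from <0, 0, 0, 0> pass the token once around the ring.
trace = run((0, 0, 0, 0), 8)
```

In each invariant state exactly one action is enabled, which corresponds to a single token in the ring, and the trace visits all eight invariant states before returning to its start.
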
Defining Fault-Tolerance: Closure
• Let S be a state predicate of a program p. S is closed in p iff, for every action G → st, executing st in a state of (S ∧ G) results in a state in S
[Figure: state predicate S inside Sp with transitions of p, labeled "No such transitions" and "Such transitions are allowed"]

Defining Fault-Tolerance: Convergence
• Let S and T be state predicates of program p. T converges to S in p iff
  • S is closed in p
  • T is closed in p
  • Starting in T, each computation of p reaches a state in S
[Figure: S inside T inside Sp, with transitions of p]

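
For an explicit finite transition relation, closure and convergence can both be checked mechanically. The Python sketch below is one possible reading, not the authors' method: it assumes p is a set of state pairs, ignores fairness, and treats a state in T outside S with no outgoing transition as a failure to reach S. The example programs are hypothetical.

```python
def is_closed(p, X):
    """X is closed in p: no transition of p leads from X to outside X."""
    return all(s1 in X for (s0, s1) in p if s0 in X)

def converges(p, T, S):
    """T converges to S in p, for p given as a finite set of transitions."""
    if not (is_closed(p, S) and is_closed(p, T)):
        return False
    succ = {}
    for s0, s1 in p:
        succ.setdefault(s0, set()).add(s1)
    # 'good' grows to contain every state from which all computations reach S.
    good = set(S)
    pending = set(T) - good
    changed = True
    while changed:
        changed = False
        for s in list(pending):
            nxt = succ.get(s, set())
            # A state is good if it can move and every move leads to a good state.
            if nxt and nxt <= good:
                good.add(s)
                pending.discard(s)
                changed = True
    # States left in 'pending' lie on a cycle outside S or lead to a deadlock.
    return not pending

# Hypothetical programs over states 0..3 with S = {0} and T = {0, 1, 2, 3}.
S, T = {0}, {0, 1, 2, 3}
p_ok = {(3, 2), (2, 1), (1, 0), (0, 0)}   # every path counts down to 0
p_cycle = p_ok | {(2, 3)}                 # 2 -> 3 -> 2 can loop forever
```

Here `converges(p_ok, T, S)` holds, while `converges(p_cycle, T, S)` does not, because the computation that alternates between states 2 and 3 never reaches S.
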
Levels of Fault-Tolerance
• Failsafe (program p' is failsafe f-tolerant for spec from S)
  • Guarantee safety in the presence of faults
• Nonmasking (program p' is nonmasking f-tolerant for spec from S)
  • Guarantee recovery in the presence of faults
• Masking (program p' is masking f-tolerant for spec from S)
  • Guarantee safety and recovery in the presence of faults
[Figure: invariant S and fault-span T inside Sp, with transitions of p, f, and p[]f, and safety-violating transitions]

Component-Based Design of Fault-Tolerance
• A fault-tolerant program = a fault-intolerant program + fault-tolerance components
• Two types of fault-tolerance components are necessary and sufficient for the design of fault-tolerance: detectors and correctors
[Kulkarni 1999] Component-Based Design of Fault-Tolerance, PhD thesis, The Ohio State University, 1999.

Synthesis of Fault-Tolerance
• It is difficult to anticipate all classes of faults at design time
• New classes of faults require the addition of a corresponding level of fault-tolerance
• Can we do it automatically?
[Figure: fault-intolerant program p and faults f as inputs to a synthesis algorithm that outputs fault-tolerant program p']
[Ebnenasir 2004] Automatic Synthesis of Fault-Tolerance, PhD thesis, Michigan State University, 2004.

Conclusion
• Fault-tolerance is an important factor in the survivability of software systems
• A well-defined need for
  • the design of correct fault-tolerant programs
  • the design of programs that tolerate multiple classes of faults (multitolerance)
  • development methodologies that provide correctness guarantees
• Automatic addition of fault-tolerance generates a program that is correct by construction
• Future work:
  • Developing tools for automation