A Theory of Fault-Tolerance


Unifying Fault-Tolerance Approaches
• Several disciplines focus on different faults and specific architectures
  • Crash recovery
  • Atomic transactions
  • Fault-tolerance of digital systems
  • Fault-tolerance in message-passing systems
• Verification of fault-tolerance
  • Application-specific
  • Verify recovery and safely terminate (mask the faults)
  • Less attention given to non-maskable faults
[Arora 1992] A Foundation of Fault-Tolerant Computing, PhD thesis, University of Texas at Austin, 1992.

A Foundation of Fault-Tolerant Computing
• Provide a uniform definition of fault-tolerance
• Provide verification methods independent of technology, architecture, or application
[Arora 1992] A Foundation of Fault-Tolerant Computing, PhD thesis, University of Texas at Austin, 1992.

Program and Fault
• Program model: synchronization skeleton of finite-state programs
  • Finite number of variables with finite domains
  • Finite number of processes
• State: a valuation of program variables
• Finite state space Sp
• Program p, fault f ⊆ Sp × Sp
• Use Dijkstra’s guarded commands (actions) as a shorthand to represent program and fault transitions
  • Guard → Statement;
[Figure: state space Sp with program transitions (p) and fault transitions (f)]
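The program model above can be sketched in Python by treating a guarded command as a (guard, statement) pair and expanding it into a transition relation over the finite state space. This is only an illustration: the variables `x` and `y`, their domain, and both example actions are assumptions, not part of the slides.

```python
from itertools import product

# Hypothetical program with two variables, each ranging over {0, 1, 2}.
VARS = ("x", "y")
DOMAIN = (0, 1, 2)

# The finite state space Sp: every valuation of the program variables.
STATES = [dict(zip(VARS, vals)) for vals in product(DOMAIN, repeat=len(VARS))]

# A guarded command "Guard → Statement;" as a (guard, statement) pair.
# Illustrative program action: x < 2 → x := x + 1;
program_actions = [
    (lambda s: s["x"] < 2, lambda s: {**s, "x": s["x"] + 1}),
]
# Illustrative fault action: true → y := 0;
fault_actions = [
    (lambda s: True, lambda s: {**s, "y": 0}),
]

def freeze(state):
    """Turn a valuation into a hashable tuple ordered like VARS."""
    return tuple(state[v] for v in VARS)

def transitions(actions):
    """Expand guarded commands into a subset of Sp × Sp."""
    rel = set()
    for s in STATES:
        for guard, stmt in actions:
            if guard(s):
                rel.add((freeze(s), freeze(stmt(s))))
    return rel

p = transitions(program_actions)  # program transitions
f = transitions(fault_actions)    # fault transitions
```

Both `p` and `f` end up as plain sets of state pairs, which is the form the later definitions (closure, convergence, fault-span) operate on.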

Examples of Intermittent Faults
• Intermittent faults
  • Sudden acceleration in cruise control systems
    • E.g., cruise control that only works in wet weather
  • Malfunction in a component of an electronic circuit when the voltage goes beyond a threshold
• x and y are two points of contact in a circuit that have independent voltages. However, when the voltage level of x goes beyond 3.5 V, y gets the same voltage as x. We model this class of faults by the following guarded command:
  x > 3.5 → y := x;
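A minimal Python sketch of the guarded command x > 3.5 → y := x; follows. Representing a state as a dictionary and returning `None` when the guard is false are illustrative choices, not something the slides prescribe.

```python
def intermittent_fault(state):
    """Fault action x > 3.5 → y := x;

    Returns the next state if the guard holds, or None if the fault
    is not enabled in this state.
    """
    if state["x"] > 3.5:
        return {**state, "y": state["x"]}
    return None
```

The fault is enabled only in states where the guard holds, which is what makes this class of fault reproducible under known conditions.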

Examples of Transient Faults
• Transient faults
  • A hardware interrupt routine gets called without any interrupt being raised by hardware devices
  • Solar radiation corrupts the communication and navigation systems
• The variables of the controlling software of space shuttles may be corrupted by transient solar radiation:
  true → x := ?;
The above guarded command means that in any state of the system, the variable x may be corrupted by transient faults.
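As a rough sketch, a transient fault of this kind can be modeled as an always-enabled, nondeterministic action. The code below assumes that corruption may set x to any value of a small finite domain; both that reading and the domain `range(4)` are assumptions made for illustration.

```python
X_DOMAIN = range(4)  # assumed finite domain of x

def transient_fault_successors(state):
    """All states the fault 'true → x := <corrupted value>' can produce.

    The guard is 'true', so the fault is enabled in every state; the
    corrupted value is modeled as an arbitrary element of X_DOMAIN.
    """
    return [{**state, "x": v} for v in X_DOMAIN]
```

Unlike a guard-triggered fault, there is no condition to set up in order to provoke this one, so every state has fault successors.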

Transient vs. Intermittent Faults
• Transient faults are difficult (if not impossible) to reproduce
  • Can we reproduce solar radiation?
• Intermittent faults may be reproduced under certain conditions
  • E.g., pressing the ‘Ctrl’ key causes the system to reset

State Predicate
• State predicate X: X ⊆ Sp
• Closure: X is closed in p
• Projection: p|X = {(s0, s1) | (s0, s1) ∈ p ∧ s0 ∈ X ∧ s1 ∈ X}
[Figure: state predicate X inside state space Sp, with transitions of p]
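With states as plain values and p as a set of pairs, closure and projection can be sketched directly from the definitions. The four-state relation used here is a made-up example.

```python
def is_closed(X, p):
    """X is closed in p iff no transition of p starts in X and ends outside X."""
    return all(s1 in X for (s0, s1) in p if s0 in X)

def project(p, X):
    """Projection p|X: the transitions of p that start and end in X."""
    return {(s0, s1) for (s0, s1) in p if s0 in X and s1 in X}

# Hypothetical program over states 0..3.
p = {(0, 1), (1, 2), (2, 0), (3, 0)}
```

For instance, `{0, 1, 2}` is closed in this `p`, while `{0, 1}` is not, because the transition (1, 2) leaves it.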

Program Computations
• Program computations
  • Infinite sequences of program transitions
• Computation prefix
  • Finite sequences of program transitions
• Program computations in the presence of faults (denoted p[]f)
  • Infinite sequences of program and fault transitions
[Figure: a computation s0, s1, s2, s3, ..., sn, ... of p, and a computation of p[]f in which fault transitions are interleaved with program transitions]
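Computation prefixes can be enumerated mechanically once p and f are sets of transitions. The sketch below treats p[]f as the union of the two relations; the three-state program and the single fault transition are assumptions chosen to keep the output small.

```python
def prefixes(relation, start, length):
    """All state sequences made of `length` transitions of `relation` from `start`.

    Paths that reach a state with no outgoing transition are dropped
    before the requested length is reached.
    """
    result = [[start]]
    for _ in range(length):
        result = [
            path + [t]
            for path in result
            for (s, t) in sorted(relation)
            if s == path[-1]
        ]
    return result

p = {(0, 1), (1, 0)}   # hypothetical program transitions
f = {(1, 2)}           # hypothetical fault transition
p_with_f = p | f       # transitions available to p[]f
```

From state 0, `p` alone has exactly one prefix of two transitions, while `p_with_f` has two, since the fault may strike in state 1.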

Specification, Invariant, and Fault-Span
• Safety specification: something bad never happens
  • Formal representation: a subset of Sp × Sp (set of bad transitions)
  • E.g., transitions that change the value of a counter from non-zero values to zero
• Liveness specification: something good will eventually happen
  • In the absence of faults, fault-tolerant program p’ satisfies the liveness specification of the fault-intolerant program p
• Invariant S, fault-span T
[Figure: state space Sp with invariant S, fault-span T, and transitions of p, f, and p[]f]
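The counter example above can be written out as a set of bad transitions, against which a computation prefix is checked. The counter domain `range(4)` is an assumption for illustration.

```python
COUNTER_DOMAIN = range(4)  # assumed domain of the counter

# Bad transitions: the counter changes from a non-zero value to zero.
bad_transitions = {(a, 0) for a in COUNTER_DOMAIN if a != 0}

def violates_safety(prefix, bad):
    """True iff some consecutive pair of states in the prefix is a bad transition."""
    return any((s0, s1) in bad for s0, s1 in zip(prefix, prefix[1:]))
```

Because safety is defined over individual transitions, a violation is always visible in a finite prefix.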

Token Ring Example
• Processes: P0, P1, P2, P3
• Variables: x0, x1, x2, x3 (domain: {0, 1, ⊥})
• Dijkstra’s guarded commands (actions)
  • Guard → Statement;
• Fault-intolerant program
  • Process P0
    TR0: (x0 = 1) ∧ (x3 = 1) → x0 := 0;
    TR’0: (x0 = 0) ∧ (x3 = 0) → x0 := 1;
[Figure: ring of processes P0, P1, P2, P3]

Token Ring Example – Continued
• Processes P1, P2, P3
  TRi: (xi = 0) ∧ (x(i-1) = 1) → xi := 1;
  TR’i: (xi = 1) ∧ (x(i-1) = 0) → xi := 0;
• Fault transitions: process-restart
  true → xj := ⊥;

Token Ring Example – Continued
• Invariant (a state is represented as a tuple <x0, x1, x2, x3>):
  <0, 0, 0, 0>, <0, 1, 1, 1>, <1, 0, 0, 0>, <0, 0, 1, 1>,
  <1, 1, 0, 0>, <0, 0, 0, 1>, <1, 1, 1, 0>, <1, 1, 1, 1>
• Safety specification
  • Corrupted value does not affect a non-corrupted process
  • There is only one token in the ring
• Liveness of the fault-intolerant program
  • Token should be circulated infinitely often
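The listed invariant can be cross-checked by a small reachability computation over the fault-free actions TR0, TR’0, TRi, and TR’i. The sketch assumes that a process "holds the token" exactly when one of its actions is enabled (P0 when x0 = x3, Pi when xi ≠ x(i-1)); that reading, the starting state, and the use of `None` for the corrupted value ⊥ are assumptions of this example.

```python
N = 4
BOT = None  # stands for the corrupted value ⊥

def successors(state):
    """States reachable from `state` by one program action."""
    x = list(state)
    result = []
    # P0 (TR0, TR'0): when x0 and x3 are both 0 or both 1, flip x0.
    if x[0] in (0, 1) and x[0] == x[N - 1]:
        result.append(tuple([1 - x[0]] + x[1:]))
    # Pi (TRi, TR'i): when xi and x(i-1) are uncorrupted and differ, copy x(i-1).
    for i in range(1, N):
        if x[i] in (0, 1) and x[i - 1] in (0, 1) and x[i] != x[i - 1]:
            y = x.copy()
            y[i] = x[i - 1]
            result.append(tuple(y))
    return result

def reachable(start):
    """All states reachable from `start` by program actions only."""
    seen, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def token_holders(state):
    """Processes whose action is enabled (assumed meaning of 'has the token')."""
    holders = [0] if state[0] == state[N - 1] else []
    holders += [i for i in range(1, N) if state[i] != state[i - 1]]
    return holders

LISTED_INVARIANT = {
    (0, 0, 0, 0), (0, 1, 1, 1), (1, 0, 0, 0), (0, 0, 1, 1),
    (1, 1, 0, 0), (0, 0, 0, 1), (1, 1, 1, 0), (1, 1, 1, 1),
}
invariant = reachable((0, 0, 0, 0))
```

Under these assumptions the reachable set coincides with the eight listed states, and each of them has exactly one token holder.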

Defining Fault-Tolerance: Closure
• Let S be a state predicate of a program p. S is closed in p iff, for every action G → st of p, executing st in a state where S ∧ G holds results in a state in S
[Figure: S within state space Sp; no transitions of p leave S, while transitions entering S are allowed]

Defining Fault-Tolerance: Convergence
• Let S and T be state predicates of program p. T converges to S in p iff
  • S is closed in p
  • T is closed in p
  • Starting in T, each computation of p reaches a state in S
[Figure: S inside T within state space Sp, with transitions of p]
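For a finite state space, the three conditions above can be checked with a fixpoint computation. The sketch below is one possible reading: it ignores fairness and treats a state outside S with no outgoing transition as a failure to reach S. The example relations are hypothetical.

```python
def is_closed(X, p):
    """X is closed in p iff no transition of p leaves X."""
    return all(s1 in X for (s0, s1) in p if s0 in X)

def converges(T, S, p):
    """Check that T converges to S in p (finite-state sketch)."""
    if not (is_closed(S, p) and is_closed(T, p)):
        return False
    succ = {}
    for s0, s1 in p:
        succ.setdefault(s0, set()).add(s1)
    # `reached` collects states from which every computation reaches S.
    reached = set(S)
    changed = True
    while changed:
        changed = False
        for s in T - reached:
            nxt = succ.get(s, set())
            # s qualifies if it has successors and all of them already qualify.
            if nxt and nxt <= reached:
                reached.add(s)
                changed = True
    return T <= reached

p_good = {(0, 0), (1, 0), (2, 1), (3, 2)}
p_loop = p_good | {(3, 3)}  # state 3 may now loop forever outside S
```

With `S = {0}` and `T = {0, 1, 2, 3}`, `p_good` converges, whereas `p_loop` does not, because a computation can stay in state 3 indefinitely.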

Levels of Fault-Tolerance
• Failsafe (program p’ is failsafe f-tolerant for spec from S)
  • Guarantee safety in the presence of faults
• Nonmasking (program p’ is nonmasking f-tolerant for spec from S)
  • Guarantee recovery in the presence of faults
• Masking (program p’ is masking f-tolerant for spec from S)
  • Guarantee safety and recovery in the presence of faults
[Figure: state space Sp with invariant S, fault-span T, transitions of p, f, and p[]f, and safety-violating transitions]
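The three levels can be told apart on a small model. The following is a simplified sketch, not the formal definitions: it assumes the fault-span T is the set of states reachable from S under p[]f, that safety means no bad transition of p[]f starts in T, and that recovery means every computation of p from T reaches S. The function names and the example relations are hypothetical.

```python
def reachable(start, rel):
    """States reachable from `start` via transitions in `rel`."""
    seen, frontier = set(start), list(start)
    while frontier:
        s = frontier.pop()
        for a, b in rel:
            if a == s and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

def always_reaches(T, S, p):
    """True iff every computation of p that starts in T reaches S."""
    reached = set(S)
    changed = True
    while changed:
        changed = False
        for s in T - reached:
            nxt = {b for a, b in p if a == s}
            if nxt and nxt <= reached:
                reached.add(s)
                changed = True
    return T <= reached

def tolerance_level(p, f, S, bad):
    """Classify p as masking, failsafe, nonmasking, or intolerant to f from S."""
    T = reachable(S, p | f)  # assumed fault-span
    safe = not any((a, b) in bad for (a, b) in p | f if a in T)
    recovers = always_reaches(T, S, p)
    if safe and recovers:
        return "masking"
    if safe:
        return "failsafe"
    if recovers:
        return "nonmasking"
    return "intolerant"

S = {0, 1}
f = {(1, 2)}                       # fault perturbs state 1 to state 2
p_rec = {(0, 1), (1, 0), (2, 0)}   # has a recovery transition from 2
p_norec = {(0, 1), (1, 0)}         # halts in state 2
```

Here `p_rec` is masking when no transition is bad and only nonmasking when its recovery transition (2, 0) is declared bad, while `p_norec` is failsafe in that second case because it never executes the bad transition but also never recovers.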

Component-Based Design of Fault-Tolerance
• A fault-tolerant program = a fault-intolerant program + fault-tolerance components
• Two types of fault-tolerance components are necessary and sufficient for the design of fault-tolerance: detectors and correctors
[Kulkarni 1999] Component-Based Design of Fault-Tolerance, PhD thesis, The Ohio State University, 1999.

Synthesis of Fault-Tolerance
• It is difficult to anticipate all classes of faults at design time
• New classes of faults require the addition of a corresponding level of fault-tolerance
• Can we do it automatically?
[Figure: fault-intolerant program p and faults f → Synthesis Algorithm → fault-tolerant program p’]
[Ebnenasir 2004] Automatic Synthesis of Fault-Tolerance, PhD thesis, Michigan State University, 2004.

Conclusion
• Fault-tolerance is an important factor in the survivability of software systems
• A well-defined need for
  • the design of correct fault-tolerant programs
  • the design of programs that tolerate multiple classes of faults (multitolerance)
  • development methodologies that provide correctness guarantees
• Automatic addition of fault-tolerance generates a program that is correct by construction
• Future work:
  • Developing tools for automation
