This paper discusses the enhancement of fault-tolerance in nonmasking programs, addressing the complexity of automatic addition and proposing heuristics for step-wise addition. It also explores enhancement in high atomicity models and for distributed programs.
Enhancing The Fault-Tolerance of Nonmasking Programs
Sandeep S. Kulkarni and Ali Ebnenasir
Software Engineering and Network Systems Laboratory
Computer Science and Engineering Department
Michigan State University
Acknowledgement • This work is partially sponsored by: • NSF, • DARPA NEST, • ONR URI, and • Michigan State University
Motivation • Programs are subject to unanticipated faults • Encounter new classes of faults, add corresponding fault-tolerance • How to add fault-tolerance? • Develop from scratch (expensive approach) • Incrementally add fault-tolerance • Reuse of the behaviors of the fault-intolerant program • Potential to preserve properties that are hard to specify (e.g., efficiency) • How to ensure correctness? • After the fact verification • Automatic addition of fault-tolerance (correct by construction)
Motivation (Continued) • Problem: Complexity of automatic addition • Automatic addition of fault-tolerance to distributed programs is NP-hard [FTRTFT00], [ICDCS02] • How do we deal with this complexity? • Develop heuristics • Identifying the boundary of polynomial-time addition • Step-wise addition (weaker forms of fault-tolerance) • The goal of this paper • Enhance the fault-tolerance of nonmasking programs • Partial automation of fault-tolerance programs
Outline • Preliminary Concepts • Enhancement Problem • Enhancement in High Atomicity Model • Enhancement for Distributed Programs • Example: Byzantine Agreement Program • Conclusion and Future Work
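Preliminary Concepts: Programs and Faults
• Finite state space Sp
• Invariant S, fault-span T ⊆ Sp
• Program p, fault f, safety ⊆ { (s0, s1) | (s0, s1) ∈ Sp × Sp }
• Fault-tolerance
  • Failsafe, Nonmasking, Masking
[Figure: state space Sp containing fault-span T, which contains invariant S; program transitions p, fault transitions f, and combined p/f transitions]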
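The definitions on this slide can be made concrete with a small sketch. It assumes a toy state space of four integer states; the specific transitions, the invariant, and the helper name `is_closed` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy state space Sp: a single counter with values 0..3.
STATES = {0, 1, 2, 3}

# A program, a fault class, and a safety specification are each modeled
# as a set of transitions (s0, s1) over Sp x Sp.
program = {(0, 0), (1, 0), (2, 1)}   # assumed: steps back toward state 0
faults = {(0, 1), (1, 2)}            # assumed: faults perturb the counter upward
bad_transitions = {(2, 3)}           # assumed: transitions forbidden by safety

def is_closed(predicate, transitions):
    """Return True if no transition starting in the predicate leaves it."""
    return all(s1 in predicate for (s0, s1) in transitions if s0 in predicate)

S = {0}          # invariant: closed under program transitions
T = {0, 1, 2}    # fault-span: closed under program and fault transitions
```

Here `is_closed(S, program)` and `is_closed(T, program | faults)` both hold, while `is_closed(S, faults)` does not, which is why a fault-span larger than the invariant is needed.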
Step-Wise Addition
[Figure: Intolerant Program; Failsafe fault-tolerant and Nonmasking fault-tolerant ([FTRTFT00], [ICDCS02]); Masking fault-tolerant (this paper)]
Enhancement Problem
• Input to the synthesis algorithm: nonmasking program p, invariant S, faults f, specification Spec
• Output: masking program p', invariant S', fault-span T'
• Requirements: Only fault-tolerance is added; no new functional behavior is added
[Figure: state predicates S, T, S', and T' inside the state space Sp, with fault transitions f]
Enhancement in High Atomicity Model
• High Atomicity Model
  • Each process can read/write all program variables
• ms: states from which safety will be violated by fault transitions
[Figure: invariant S inside fault-span T, with ms and fault transitions f]
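One way to compute ms is a backward fixpoint over the fault transitions. The sketch below assumes that reading (a state is in ms if some sequence of fault transitions from it ends in a safety violation); the function name `compute_ms` and the toy transitions are hypothetical.

```python
def compute_ms(faults, bad_transitions):
    """States from which fault transitions alone can violate safety."""
    # Base case: a single fault step is itself a safety violation.
    ms = {s0 for (s0, s1) in faults if (s0, s1) in bad_transitions}
    changed = True
    while changed:
        changed = False
        # Inductive case: a fault step leads into a state already in ms.
        for (s0, s1) in faults:
            if s1 in ms and s0 not in ms:
                ms.add(s0)
                changed = True
    return ms

# Hypothetical example: the fault (2, 3) violates safety, and state 1
# reaches state 2 by a fault, so both 1 and 2 belong to ms.
example_ms = compute_ms({(1, 2), (2, 3), (4, 0)}, {(2, 3)})
```

The loop terminates because ms only grows and the state space is finite.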
Enhancement in High Atomicity Model (Continued)
• Find a state predicate T' such that:
  • T' is closed in the computations of the program in the presence of faults
  • The specification is satisfied from every state of T' (i.e., no deadlocks)
• Construct p' such that for every (s0, s1) ∈ p':
  • (s0, s1) does not violate safety
  • s0 ∈ T' ⇒ s1 ∈ T'
• Deadlock states appear due to removing some transitions
[Figure: S' and T' inside the original S and T, with ms excluded]
Enhancement vs. Addition
[Figure: Addition takes a fault-intolerant program to a masking program. Partial automation takes a fault-intolerant program to a nonmasking program (manual), then to a masking program (automatic: enhancement)]

Enhancement:

  HighAtomicityEnhancement (p, f: transitions, T: StatePredicate, specification spec)
  {
    Calculate ms; Calculate mt;
    T' = ConstructFaultSpan( );
    if ( T' = {} )
      declare no masking f-tolerant program exists; exit;
    else
      Construct the transitions of p';
  }

Addition [FTRTFT00]:

  AddMasking (p, f: transitions, S: StatePredicate, specification spec)
  {
    1. Calculate ms; Calculate mt;
    2. . . .
    3. . . .
    4. repeat
       4-1) . . .
       4-2) . . .
       4-3) T := ConstructFaultSpan( );
       4-4) . . .
       4-5) if (S = {} \/ T = {})
              declare no masking f-tolerant program exists; exit;
       until (ExitConditionHolds);
    5. Remove cycles outside the invariant in T;
    6. Construct the transitions of p';
  }
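The outline of HighAtomicityEnhancement can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: mt is taken to be the set of transitions that violate safety or lead into ms, ConstructFaultSpan is approximated by repeatedly discarding states that faults can push out of T' or that are left deadlocked, and every legitimate state is assumed to have at least one outgoing program transition (for example a self-loop). The function name and the toy data are hypothetical.

```python
def high_atomicity_enhancement(program, faults, T, bad_transitions, ms):
    """Return (T', p'), or (set(), set()) if no fault-span survives."""
    def in_mt(t):
        # Assumed meaning of mt: violates safety or reaches ms.
        return t in bad_transitions or t[1] in ms

    t_prime = set(T) - set(ms)
    while t_prime:
        # Keep only safe program transitions that stay inside T'.
        p_prime = {(s0, s1) for (s0, s1) in program
                   if s0 in t_prime and s1 in t_prime and not in_mt((s0, s1))}
        # States from which a fault leaves T' break closure.
        escaping = {s0 for (s0, s1) in faults
                    if s0 in t_prime and s1 not in t_prime}
        # States with no remaining program transition are deadlocked.
        sources = {s0 for (s0, _) in p_prime}
        deadlocked = t_prime - sources
        removed = escaping | deadlocked
        if not removed:
            return t_prime, p_prime
        t_prime -= removed
    return set(), set()

# Hypothetical toy instance: the only way out of state 2 violates safety.
toy_program = {(0, 0), (1, 0), (2, 3), (3, 0)}
toy_result = high_atomicity_enhancement(
    toy_program, {(0, 1)}, {0, 1, 2, 3}, {(2, 3)}, set())
```

Because p' only keeps transitions of the nonmasking program, the sketch relies on that program's existing recovery and only has to rule out unsafe steps and deadlocks.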
Difficulties with Distribution
• Read/Write restrictions (low atomicity model)
• A program p
  • Two processes j, k
  • Two Boolean variables a and b
  • Process j cannot read b
• Can we include the following transition?
  • (a=0, b=0) → (a=1, b=0)
  • Only if we include the transition (a=0, b=1) → (a=1, b=1)
• Groups of transitions (instead of individual transitions) must be chosen.
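The grouping effect of read restrictions can be sketched as follows. The helper `group_of` is hypothetical; it assumes Boolean variables and that a process does not write a variable it cannot read, and it enumerates every transition that the process cannot distinguish from a given one.

```python
from itertools import product

VARS = ("a", "b")  # variable order used when printing states as tuples

def group_of(transition, unreadable):
    """All transitions a process cannot tell apart from `transition`."""
    s0, s1 = transition
    # Assumption: unreadable variables are left unchanged by the process.
    assert all(s0[v] == s1[v] for v in unreadable)
    group = set()
    for values in product((0, 1), repeat=len(unreadable)):
        hidden = dict(zip(unreadable, values))
        g0 = {**s0, **hidden}   # same readable values, any hidden values
        g1 = {**s1, **hidden}
        group.add((tuple(g0[v] for v in VARS), tuple(g1[v] for v in VARS)))
    return group

# Process j cannot read b, so flipping a from 0 to 1 comes as a pair.
j_group = group_of(({"a": 0, "b": 0}, {"a": 1, "b": 0}), unreadable=("b",))
```

A synthesis algorithm for distributed programs then has to include or exclude `j_group` as a whole.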
Enhancement of Nonmasking Distributed Programs
[Flowchart]
• Start
• Calculate T'high
• Calculate S'init = S'low
• Calculate Sreachable from S'low by fault/program transitions
• Sreachable = {} ?
  • Yes: T' = S'low; calculate p' transitions; Stop
  • No: calculate Srecovery from where recovery is possible to S'low (search in (T'high − S'low) under distribution restrictions)
• Srecovery = {} ?
  • Yes: declare failure
  • No: S'low = S'low ∪ Srecovery; recalculate Sreachable
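The loop in this flowchart can be sketched as below. The sketch assumes the reading in which S'low grows by the recovery states found in each round; the distribution restrictions are hidden behind a hypothetical `find_recovery` callback, and `successors` is a hypothetical function returning the states reached by one fault or program transition.

```python
def enhance_distributed(t_high, s_init, successors, find_recovery):
    """Grow S'low until it is closed; return T', or None on failure."""
    s_low = set(s_init)
    while True:
        # States outside S'low reached by one fault/program transition.
        s_reachable = {s1 for s0 in s_low for s1 in successors(s0)} - s_low
        if not s_reachable:
            return s_low          # T' = S'low
        # Search (T'high - S'low) for states that can recover to S'low.
        s_recovery = find_recovery(set(t_high) - s_low, s_low)
        if not s_recovery:
            return None           # declare failure
        s_low |= s_recovery

# Hypothetical toy instance over integer states.
FAULT_OR_PROGRAM = {0: {1}, 1: {0, 2}, 2: {1}}
RECOVERY = {1: {0}, 2: {1}}

def successors(s):
    return FAULT_OR_PROGRAM.get(s, set())

def find_recovery(candidates, target):
    # A candidate qualifies if one recovery step lands inside the target.
    return {s for s in candidates if RECOVERY.get(s, set()) & target}

t_prime = enhance_distributed({0, 1, 2}, {0}, successors, find_recovery)
```

With T'high = {0, 1, 2} the loop ends with T' = {0, 1, 2}; shrinking T'high to {0, 1} leaves state 2 reachable but not recoverable, so the sketch reports failure.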
A High Atomicity Fault-Span
• T'high: the largest possible domain for the states that can be included in the fault-span of the distributed program
• S'high = S ∩ T'high
[Figure: T'high inside T, excluding ms; S'high inside S]
The Initial Low Atomicity Invariant
• Remove states from where an outgoing transition crosses the boundary of S'high
  • E.g., s0
• Removal is a non-deterministic choice when there is more than one state to remove
[Figure: S'init inside S'high inside T'high; state s0 has a transition leaving S'high]
Single-Step Reachable States
• Reachable by a fault/program transition (denoted Sreachable)
[Figure: S'low (initially S'init) inside T'high; states s0 to s3, with a fault transition f leading into Sreachable]
Single-Step Recovery States
• Safer recovery in a single step (denoted Srecovery)
• Goal: infinite computations are possible from all states in S'low
• s0 represents a typical recovery state
[Figure: S'low (initially S'init) inside T'high; states s0, s2, and s3, with Srecovery adjacent to S'low]
Example: Byzantine Agreement
• Why this example?
  • Was used to illustrate the addition of masking fault-tolerance in [SRDS01]
  • Manual enhancement has already been applied [TSE98]
• Processes: General, g, and three non-generals j, k, and l
• Variables
  • d.g : {0, 1}
  • d.j, d.k, d.l : {0, 1, ┴ }
  • b.g, b.j, b.k, b.l : {0, 1}
  • f.j, f.k, f.l : {0, 1}
• Safety Specification:
  • Agreement: No two non-Byzantine non-generals can finalize with different decisions
  • Validity: If g is not Byzantine, no process can finalize with a decision different from that of g
  • A finalized process should not execute any transition
[Figure: general g connected to non-generals j, k, and l]
Example: Byzantine Agreement
• Read/Write restrictions
  • Readable variables for process j: b.j, d.j, f.j, d.g, d.k, d.l
  • Process j can write d.j, f.j
• Dijkstra's guarded commands
  • Guard → Statement
  • { (s0, s1) | Guard holds at s0 and atomic execution of Statement yields s1 }
• Nonmasking fault-tolerant program transitions
  • d.j = ┴ ∧ f.j = 0 → d.j := d.g
  • d.j ≠ ┴ ∧ f.j = 0 → f.j := 1
  • d.j = 1 ∧ d.k = 0 ∧ d.l = 0 → d.j := 0
  • d.j = 0 ∧ d.k = 1 ∧ d.l = 1 → d.j := 1
• Fault transitions
  • ¬b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l → b.j := true
  • b.j → d.j := 0|1
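The guarded commands for process j can be turned into explicit successor functions. The sketch below is an assumption-laden illustration: it reads each guard as a conjunction of its conditions, represents a state as a dictionary keyed by variable name, uses `None` for the undecided value ┴, and uses the hypothetical function names `nonmasking_actions_j` and `fault_actions_j`.

```python
BOT = None  # stands for the undecided value (bottom)

def nonmasking_actions_j(s):
    """Successor states of s under the four program actions of process j."""
    succ = []
    if s["d.j"] is BOT and s["f.j"] == 0:
        succ.append({**s, "d.j": s["d.g"]})      # copy the general's decision
    if s["d.j"] is not BOT and s["f.j"] == 0:
        succ.append({**s, "f.j": 1})             # finalize
    if s["d.j"] == 1 and s["d.k"] == 0 and s["d.l"] == 0:
        succ.append({**s, "d.j": 0})             # follow the other non-generals
    if s["d.j"] == 0 and s["d.k"] == 1 and s["d.l"] == 1:
        succ.append({**s, "d.j": 1})
    return succ

def fault_actions_j(s):
    """Successor states of s under the fault actions that affect process j."""
    succ = []
    if not (s["b.g"] or s["b.j"] or s["b.k"] or s["b.l"]):
        succ.append({**s, "b.j": 1})             # j becomes Byzantine
    if s["b.j"]:
        succ.extend([{**s, "d.j": 0}, {**s, "d.j": 1}])  # arbitrary decision
    return succ

# Hypothetical start state: nobody is Byzantine, only g has a decision.
start = {"d.g": 1, "d.j": BOT, "d.k": BOT, "d.l": BOT,
         "f.j": 0, "b.g": 0, "b.j": 0, "b.k": 0, "b.l": 0}
after_copy = nonmasking_actions_j(start)
```

Each guarded command contributes the set of pairs (s, s') with s' in the returned list, which matches the transition-set view of guarded commands stated on the slide.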
Example: Byzantine Agreement (Continued)
• Why is enhancement easier?
[Figure: a computation through states S0 to S4, with edge labels "A good transition inside the invariant", "Premature finalization", and "Fault transition"]
• S0: d.j = d.k = ┴, d.g = 1, d.l = 1, f.l = 0
• S1: d.j = d.k = ┴, d.g = 1, d.l = 1, f.l = 1
• S2: d.j = d.k = ┴, d.g = 0, d.l = 1, f.l = 1
• S3: b.g = 1, d.j = d.k = ┴, d.g = 0, d.l = 1, f.l = 1
• S4: d.j = d.k = 0, d.g = 0, d.l = 1, f.l = 1 (a deadlock state)
Example: Byzantine Agreement (Continued)
• Masking fault-tolerant program
• High atomicity reasoning
  • Synthesize a masking program in high atomicity and then refine it to a distributed program
• Program actions:
  • d.j = ┴ ∧ f.j = 0 → d.j := d.g
  • d.j ≠ ┴ ∧ f.j = 0 → f.j := 1
  • d.j = 1 ∧ d.k = 0 ∧ d.l = 0 → d.j := 0
  • d.j = 0 ∧ d.k = 1 ∧ d.l = 1 → d.j := 1
• Additional guard conjuncts shown on the slide: ((d.j = d.k) ∨ (d.j = d.l)), (f.j = 0), (f.j = 0)
Enhancement vs. Addition • Reuse the computations of the nonmasking program • Reasoning in high atomicity model has the potential to reduce the complexity of addition
Synthesis Framework • Development of a synthesis framework • Developers of fault-tolerance can interactively add fault-tolerance to fault-intolerant programs • Partial automation helps us to reap the benefits of automation as much as possible • Enhancement identifies programs where partial automation is possible • Implementation of enhancement algorithms in the synthesis framework • http://www.cse.msu.edu/~sandeep/software/Code/synthesis-framework/
Conclusion and Future Work • Enhancement simplifies automated design of masking programs • Lower asymptotic complexity • Polynomial-time enhancement in the low atomicity model (in the state space of the nonmasking program) • Sound, but not complete • Reasoning in high atomicity simplifies the synthesis of masking distributed programs • Future Work: • A polynomial-time sound and complete enhancement algorithm for a restricted class of programs and specifications
Thank You! Questions?
Example: Triple Modular Redundancy
• Processes: Three processes: j, k, and l
• Variables and their domains
  • in.j, in.k, and in.l are Boolean variables
  • out belongs to { 0, 1, ┴ }
• Nonmasking program (+ addition in modulo 3):
  N1: (out = ┴) --> out := in.j
  N2: (out != ┴) /\ (out != in.j) /\ ((in.j = in.k) \/ (in.j = in.l)) --> out := in.j
• Faults:
  F: (in.j = in.k) /\ (in.j = in.l) --> in.j := 0|1
• Safety specification:
  • Do not reach states where out is different from the majority of inputs.
  • out should not be changed after it is assigned a value.
Example: Triple Modular Redundancy
• Invariant:
  S = ((out = ┴) /\ (in.j = in.k = in.l)) \/ (out = in.j = in.k) \/ (out = in.j = in.l) \/ (out = in.k = in.l)
• Fault-span:
  T = ( (in.j = in.k = in.l) => ((out = ┴) \/ (out = in.j = in.k = in.l)) )
• Enhancement algorithm:
  • Compute ms: ms = { }
  • Remove bad transitions: {t: t violates safety} and {t: t reaches ms}
  • Construct a new fault-span T’:
    T’ = T – { s: (out != ┴) /\ (out is not equal to majority of inputs) }
• Masking program:
  M1: (out = ┴) /\ ((in.j = in.k) \/ (in.j = in.l)) --> out := in.j
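The effect of strengthening the guard can be checked by enumerating all 24 states. The sketch below is illustrative only: it models a state as the tuple (in.j, in.k, in.l, out), uses `None` for ┴, and reads the guard of M1 with the disjunction grouped as (in.j = in.k) \/ (in.j = in.l). The function names are hypothetical.

```python
from itertools import product

BOT = None  # stands for the unassigned output value (bottom)

def majority(i_j, i_k, i_l):
    return 1 if i_j + i_k + i_l >= 2 else 0

def n1(state):
    """Nonmasking action N1: assign out from in.j whenever out is unset."""
    i_j, i_k, i_l, out = state
    return (i_j, i_k, i_l, i_j) if out is BOT else None

def m1(state):
    """Masking action M1: as N1, but only if in.j agrees with another input."""
    i_j, i_k, i_l, out = state
    if out is BOT and (i_j == i_k or i_j == i_l):
        return (i_j, i_k, i_l, i_j)
    return None

def violates_safety(s0, s1):
    wrong = s1[3] is not BOT and s1[3] != majority(*s1[:3])
    reassigned = s0[3] is not BOT and s1[3] != s0[3]
    return wrong or reassigned

STATES = list(product((0, 1), (0, 1), (0, 1), (0, 1, BOT)))
n1_bad = [s for s in STATES if n1(s) and violates_safety(s, n1(s))]
m1_bad = [s for s in STATES if m1(s) and violates_safety(s, m1(s))]
```

In this model N1 violates safety exactly when in.j has been corrupted and disagrees with both other inputs, while M1 never does, since an input that matches at least one other input is the majority value.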
Enhancement of Nonmasking Distributed Programs
[Flowchart]
• Start
• Calculate T'high
• Calculate S'init = S'low (S'init = S'low at the first iteration)
• Calculate Sreachable from S'low by fault/program transitions
• Sreachable = {} ?
  • Yes: T' = S'low, calculate p' transitions
  • No: calculate Srecovery from where recovery is possible to S'low
• Srecovery = {} ?
  • Yes: declare failure
  • No: S'low = S'low ∪ Srecovery; recalculate Sreachable