Failure Mode Assumptions and Assumption Coverage David Powell
Fault-Tolerance • Key questions • How may components fail? (Prevention strategies) • At what rate may they fail? (The amount of redundancy needed) • What are the important types of faults? (Types of redundancy needed) • What is the relation between dependability, redundancy, and faults? • General FT design guidelines
An F-T Paradox/Dilemma • More faults → more redundancy → more possibility of faults • ???
Solution: Some Key Steps • Classify, quantify, and verify the assumptions
Overview • Single-user service • Service Model • Potential Errors • Multiple-user service • Service Model • Potential Errors
Single-user Service Model • Service items: $s_i$, $i = 1, 2, \ldots$ • Value of $s_i$: $vs_i$ • Observation time of $s_i$: $ts_i$ • Service model: $s_i = \langle vs_i, ts_i \rangle$ • Observations are made by an omniscient observer
Correctness Model • Service item $s_i$ is correct iff $(vs_i \in SV_i) \wedge (ts_i \in ST_i)$ • $SV_i$ and $ST_i$ are, respectively, the specified sets of values and times for service item $s_i$
Potential Errors • Arbitrary value error: $s_i : vs_i \notin SV_i$ • Noncode error: $s_i : vs_i \notin CV$ ($CV$ defines a code) • Arbitrary timing error: $s_i : ts_i \notin ST_i$ • Early timing error: $s_i : ts_i < \min(ST_i)$ • Late timing error: $s_i : ts_i > \max(ST_i)$ • Omission error: $s_i : ts_i = \infty$ • Impromptu error: $s_i$ : ($vs_i$ = ) ($ts_i$ = )
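The error classes above can be sketched as a small classifier. This is only an illustration, not code from the talk: the names `ServiceItem`, `classify`, and `OMITTED` are hypothetical, $ST_i$ is assumed to be a closed interval (so an arbitrary timing error reduces to early, late, or omission), and an omitted item is assumed to carry an infinite observation time. Impromptu errors are not modelled.

```python
import math
from dataclasses import dataclass

# Assumption: an omitted service item is modelled by an infinite observation time.
OMITTED = math.inf


@dataclass(frozen=True)
class ServiceItem:
    value: object  # vs_i
    time: float    # ts_i, as seen by the omniscient observer


def classify(item, sv, st, cv=None):
    """Return (value_error, timing_error); None means no error in that domain.

    sv: set of specified values SV_i
    st: (earliest, latest) bounds of ST_i, assumed to be a closed interval
    cv: optional set of valid code words CV
    """
    t_min, t_max = st
    # Omission must be tested first, since infinity is also greater than t_max.
    if item.time == OMITTED:
        timing = "omission"
    elif item.time < t_min:
        timing = "early"
    elif item.time > t_max:
        timing = "late"
    else:
        timing = None

    # An omitted item delivers no value, so no value error is reported for it.
    if timing == "omission" or item.value in sv:
        value = None
    elif cv is not None and item.value not in cv:
        value = "noncode"    # detectable: not even a valid code word
    else:
        value = "arbitrary"  # wrong value that still looks valid
    return value, timing
```

For example, `classify(ServiceItem(4, 2.0), {3}, (0.5, 1.5), cv={3, 4})` reports an arbitrary value error together with a late timing error, because 4 is a valid code word but not a specified value, and 2.0 lies after the latest specified time.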
Multi-user Service Model • Service item $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$ • Service model: $s_i(u) = \langle vs_i(u), ts_i(u) \rangle$ for all $i, u$ • New issue: “consistency”
Correctness Model • $vs_i(u)$: the value of service item $i$ on process $u$ • $vs_i$: the value of service item $i$ • $SV_i$: the set of specified values for service item $i$ • $ts_i(u)$: the observation time of service item $i$ on process $u$ • $ST_i(u)$: the range of specified observation times of service item $i$ on process $u$ • uv: the time bound of related occurrences
Examples of Potential Errors • Consistent value error • Consistent timing error • Semi-consistent value error
Failure Mode Assumptions • An attempt to formalize the concept of an assumed failure mode • Expressed as assertions on the sequences of service items delivered by a component
Examples of Value Error Assertions • No value errors occur ($V_{none}$): $\forall i,\ vs_i \in SV_i$ • The only value errors that occur are noncode value errors ($V_n$): $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin CV)$ • Arbitrary value errors can occur ($V_{arb}$): $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin SV_i)$
Examples of Timing Error Assertions • No timing errors occur ($T_{none}$) • The only timing errors are omission errors ($T_O$) • The only timing errors are late timing errors ($T_L$) • The only timing errors are early timing errors ($T_E$) • Arbitrary timing errors can occur ($T_{arb}$) • Permanent omission/crash ($T_P$) • Bounded omission degree ($T_{Bk}$)
Failure Mode Assertions (FMA) • A complete FMA entails an assertion on the errors occurring in both the value and time domains • By taking the Cartesian product of the two sets of assertions, we get a family of FMAs
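As a rough illustration of the Cartesian product, the sketch below pairs every value-error assertion with every timing-error assertion. The labels are the ones listed on the preceding slides; the function name `fma_family` is hypothetical, and `TBk` is treated as a single label even though it is really parameterised by the omission degree k.

```python
from itertools import product

# Assertion labels as they appear on the slides.
VALUE_ASSERTIONS = ["Vnone", "Vn", "Varb"]
TIMING_ASSERTIONS = ["Tnone", "TO", "TL", "TE", "Tarb", "Tp", "TBk"]


def fma_family(value_assertions, timing_assertions):
    """Pair each value assertion with each timing assertion.

    Each (V, T) pair stands for the complete failure mode assertion "V and T".
    """
    return list(product(value_assertions, timing_assertions))


family = fma_family(VALUE_ASSERTIONS, TIMING_ASSERTIONS)
for v, t in family:
    print(f"{v} and {t}")
```

With three value assertions and seven timing assertions this yields 21 candidate FMAs, ranging from `("Vnone", "Tnone")` (no errors at all) to `("Varb", "Tarb")` (anything can happen).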
So what? • The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that can process errors of increasing severity!
Assumption Coverage • Establishing a link between the assumed component failure mode and system dependability • (The design of an FT system relies on the assumptions it makes) • (The dependability of an FT system is related to the failure mode it assumes)
Motivation • Components may fail • They may fail in a way that violates the assumptions of the system • The system, in turn, can fail • Question: to what degree does a component FMA prove to be true in the real system?
The Coverage of the Assumption • Definition: $P(X) = \Pr\{X = \text{true} \mid \text{component failed}\}$ • $P(V_{arb} \wedge T_{arb}) = 1$ • $P(V_{none} \wedge T_{none}) = 0$
Coverage of an FT system • $P_S(X) = \Pr\{\text{correct error processing} \mid X = \text{true}\} \times \Pr\{X = \text{true} \mid \text{component failed}\}$
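The product form of the system coverage can be explored numerically. The sketch below is illustrative only: `system_coverage` is a hypothetical helper, and the probabilities in the usage lines are made-up numbers chosen to show the trade-off, not figures from the paper.

```python
def system_coverage(p_processing: float, p_assumption: float) -> float:
    """Return P_S(X) = Pr{correct error processing | X} * Pr{X | component failed}."""
    for p in (p_processing, p_assumption):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing * p_assumption


# Hypothetical numbers, for illustration only.
# Strong assumption: simple error processing that always works when the
# assumption holds, but the assumption itself holds for only 99% of failures.
strong = system_coverage(p_processing=1.0, p_assumption=0.99)

# Weak (arbitrary) assumption: it always holds, but the more complex
# error processing is assumed to succeed slightly less often.
weak = system_coverage(p_processing=0.995, p_assumption=1.0)

print(f"strong assumption: {strong:.3f}, weak assumption: {weak:.3f}")
```

With these invented numbers the weaker assumption yields the higher overall coverage, which is the kind of trade-off the case study examines.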
Influence of Assumption Coverage on System Dependability A Case Study
The System • A system of n processors • Connected via unidirectional message-passing bus • Each processor carries out the same computation steps • The result of each processing step is communicated to all other processors • Each processor has a decision function (DF) • The DF is applied to the results received from other processors • … • Each processor and its associated bus is viewed as a single component
Fail-Silent Processor-bus • A fail-silent processor • Only has semi-consistent value errors • Always produces messages on time • Or ceases to produce messages forever • If a message is delivered to a processor, it is delivered to all processors with a consistent, fixed delay
Fail-Consistent Processor Bus • Only semi-consistent value errors may occur • Faulty processors may send erroneous values • Consistent timing error may occur
Fail-uncontrolled Processor Bus • Arbitrary timing error • Arbitrary value error
Implications of Assumption Coverage • Failure mode relations • Coverage relations
Dependability Expressions From Markov Models • $r = e^{-\lambda t}$ • $\lambda$ = failure rate
A Life-critical Application • System reliability objective: $R > 1 - 10^{-9}$ over 10 hours • Single processor reliability: • $r = e^{-\lambda t}$ • $1/\lambda$ = 5 years
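A quick calculation shows how far a single processor is from this objective. The sketch below assumes a 365-day year (8760 hours) to convert $1/\lambda$ = 5 years into hours; the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: 365-day year


def reliability(mttf_hours: float, mission_hours: float) -> float:
    """r = exp(-lambda * t), with lambda = 1 / MTTF."""
    return math.exp(-mission_hours / mttf_hours)


def unreliability(mttf_hours: float, mission_hours: float) -> float:
    """1 - r, computed with expm1 to keep precision when r is close to 1."""
    return -math.expm1(-mission_hours / mttf_hours)


mttf = 5 * HOURS_PER_YEAR          # 1/lambda = 5 years, expressed in hours
r = reliability(mttf, 10.0)
q = unreliability(mttf, 10.0)

print(f"single-processor reliability over 10 h:   {r:.6f}")
print(f"single-processor unreliability over 10 h: {q:.3e}")
print(f"target unreliability:                     {1e-9:.0e}")
```

The single-processor unreliability over 10 hours comes out at roughly 2.3 × 10⁻⁴, about five orders of magnitude above the 10⁻⁹ target, which is why redundancy (and hence the coverage of the failure mode assumption) matters here.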
A Money-Critical Application • The concern here is the availability of the system rather than its reliability • Please see the paper for more details
Conclusion • A formalism for describing component failure modes • Multiplicity of value and timing errors • The notion of assumption coverage • The relation between dependability, availability and assumption coverage