Failure Mode Assumptions and Assumption Coverage

Failure Mode Assumptions and Assumption Coverage • David Powell

Fault-Tolerance • Key questions • How may components fail? → Prevention strategies • At what rate may they fail? → The amount of redundancy needed • What are the important types of faults? • Types of redundancy needed • What is the relation between dependability, redundancy, and faults? • General FT design guidelines

An F-T Paradox/Dilemma • More faults → more redundancy → more possibility of faults • ???

Solution: Some Key Steps • Classify, quantify, and verify the assumptions

Types of Failures

Overview • Single-user service • Service Model • Potential Errors • Multiple-user service • Service Model • Potential Errors

Single-user Service Model • Service items: $s_i$, $i = 1, 2, \ldots$ • Value of $s_i$: $vs_i$ • Observation time of $s_i$: $ts_i$ • Service model: $s_i = \langle vs_i, ts_i \rangle$ • An omniscient observer

Correctness Model • Service item $s_i$ is correct iff $(vs_i \in SV_i) \wedge (ts_i \in ST_i)$ • $SV_i$ and $ST_i$ are, respectively, the specified sets of values and times for service item $s_i$

Potential Errors • Arbitrary value error: $s_i : vs_i \notin SV_i$ • Noncode error: $s_i : vs_i \notin CV$ ($CV$ defines a code) • Arbitrary timing error: $s_i : ts_i \notin ST_i$ • Early timing error: $s_i : ts_i < \min(ST_i)$ • Late timing error: $s_i : ts_i > \max(ST_i)$ • Omission error: $s_i : ts_i = \infty$ • Impromptu error: $s_i$ : ($vs_i$ = ) ($ts_i$ = )
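The error classes above can be checked mechanically. The following is a minimal Python sketch, not taken from the slides. It assumes that the specified times form a closed interval [t_min, t_max], that an omitted item is represented by an infinite observation time, and that every specified value is also a code word. The function name `classify_item` and the label strings are illustrative.

```python
import math

def classify_item(vs, ts, SV, t_min, t_max, CV):
    """Classify one service item <vs, ts> against its specification.

    Returns a (value_label, timing_label) pair. All names are illustrative.
    """
    # Assumption: an omitted item never arrives, modelled as ts = infinity.
    # Its value is then irrelevant.
    if ts == math.inf:
        return ("no value", "omission error")

    # Value domain: correct, noncode (outside the code CV), or any other
    # value outside the specified set SV.
    if vs in SV:
        value_label = "correct"
    elif vs not in CV:
        value_label = "noncode value error"
    else:
        value_label = "arbitrary value error"

    # Time domain: assumption that ST is the closed interval [t_min, t_max].
    if ts < t_min:
        timing_label = "early timing error"
    elif ts > t_max:
        timing_label = "late timing error"
    else:
        timing_label = "correct"

    return (value_label, timing_label)
```

For example, with SV = {3}, CV = {1, 2, 3, 4} and a time window of [4.0, 6.0], the item <9, 7.0> is classified as a noncode value error combined with a late timing error.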

Multi-user Service Model • Service item: $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$ • Service model: $\langle vs_i(u), ts_i(u) \rangle$ for all $i$, $u$ • New issue: consistency

Correctness Model • $vs_i(u)$: the value of service item $i$ on process $u$ • $vs_i$: the value of service item $i$ • $SV_i$: the set of specified values for service item $i$ • $ts_i(u)$: the observation time of service item $i$ on process $u$ • $ST_i(u)$: the range of specified observation times of service item $i$ on process $u$ • uv: the time bound of related occurrences

Examples of Potential Errors • Consistent value error • Consistent timing error • Semi-consistent value error

Failure Mode Assumptions • An attempt to formalize the concept of an assumed failure mode • By means of assertions on the sequences of service items delivered by a component

Examples of Value Error Assertions • No value errors occur ($V_{none}$): $\forall i,\ vs_i \in SV_i$ • The only value errors that occur are noncode value errors ($V_n$): $\forall i,\ (vs_i \notin SV_i) \Rightarrow (vs_i \notin CV)$ • Arbitrary value errors can occur ($V_{arb}$): $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin SV_i)$
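As a hedged illustration, the three value error assertions can be written as predicates over a recorded sequence of service items. In this sketch a trace is a list of (value, specified value set) pairs. That representation and the function names are assumptions made for the example, not part of the slides.

```python
def v_none(trace):
    """Vnone: every delivered value lies in its specified set."""
    return all(vs in SV for vs, SV in trace)

def v_noncode(trace, CV):
    """Vn: any value outside its specified set is also outside the code CV."""
    return all(vs in SV or vs not in CV for vs, SV in trace)

def v_arb(trace):
    """Varb: a tautology, so it holds for every trace."""
    return True

def strongest_value_assertion(trace, CV):
    """Return the most restrictive value assertion that the trace satisfies."""
    if v_none(trace):
        return "Vnone"
    if v_noncode(trace, CV):
        return "Vn"
    return "Varb"
```

With CV = {0, 1, 2, 3}, a trace containing the out-of-code value 9 where 2 was specified still satisfies Vn, whereas a trace containing 3 where 2 was specified satisfies only Varb, because 3 is a valid code word that passes undetected.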

Examples of Timing Error Assertions • No timing errors occur ($T_{none}$) • The only timing errors are omission errors ($T_O$) • The only timing errors are late timing errors ($T_L$) • The only timing errors are early timing errors ($T_E$) • Arbitrary timing errors can occur ($T_{arb}$) • Permanent omission/crash ($T_p$) • Bounded omission degree ($T_{Bk}$)
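The timing assertions can be sketched in the same style. The code below is one possible reading, under stated assumptions: each trace entry is (observation time, t_min, t_max), an omission is an infinite observation time, and "permanent omission" is taken to mean that once an item is omitted, every later item is omitted too. The bounded omission degree assertion is left out because the slide does not define it. All function names are illustrative.

```python
import math

def in_spec(ts, t_min, t_max):
    """True when the observation time lies in the specified window."""
    return t_min <= ts <= t_max

def t_none(trace):
    """No timing errors occur."""
    return all(in_spec(ts, lo, hi) for ts, lo, hi in trace)

def t_omission(trace):
    """The only timing errors are omissions (ts = infinity)."""
    return all(in_spec(ts, lo, hi) or ts == math.inf for ts, lo, hi in trace)

def t_late(trace):
    """The only timing errors are late ones, so no item arrives early."""
    return all(ts >= lo for ts, lo, hi in trace)

def t_early(trace):
    """The only timing errors are early ones, so no item arrives late."""
    return all(ts <= hi for ts, lo, hi in trace)

def t_permanent(trace):
    """Assumed reading of crash: correct timing until the first omission,
    then nothing but omissions."""
    crashed = False
    for ts, lo, hi in trace:
        if crashed:
            if ts != math.inf:
                return False
        elif ts == math.inf:
            crashed = True
        elif not in_spec(ts, lo, hi):
            return False
    return True
```

Because an omission is modelled as an infinitely late item, any trace that satisfies `t_omission` also satisfies `t_late` in this sketch.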

Timing Error Implications

Failure Mode Assertions (FMA) • A complete FMA entails an assertion on the errors occurring in both the value and time domains • By taking the Cartesian product of the two sets of assertions, we get a family of FMAs
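A small sketch of the Cartesian product, using only the example assertions named on the earlier slides. The label strings are plain-text stand-ins for the subscripted symbols, and the list is not claimed to be exhaustive.

```python
from itertools import product

# Example assertion names from the slides, written without subscripts.
VALUE_ASSERTIONS = ["Vnone", "Vn", "Varb"]
TIMING_ASSERTIONS = ["Tnone", "Tp", "TBk", "TO", "TL", "TE", "Tarb"]

def fma_family(value_assertions, timing_assertions):
    """Pair every value assertion with every timing assertion."""
    return list(product(value_assertions, timing_assertions))

family = fma_family(VALUE_ASSERTIONS, TIMING_ASSERTIONS)
```

Three value assertions and seven timing assertions give 21 combined assertions, ranging from (Vnone, Tnone), where no errors occur at all, to (Varb, Tarb), where anything may happen.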

FMA Implication Graph

So what? • The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that can process errors of increasing severity!

Assumption Coverage • Establishing a link between the assumed component failure mode and system dependability • The design of an FT system relies on the assumptions it makes • The dependability of an FT system is related to the failure modes it assumes

Motivation • Components may fail • They may fail in a bad way → this leads to a violation of the system's assumptions • The system, in turn, can fail • Question: to what degree does a component FMA hold true in the real system?

The Coverage of the Assumption • Definition: $P(X) = \Pr\{X = \text{true} \mid \text{component failed}\}$ • $P(V_{arb} \wedge T_{arb}) = 1$ • $P(V_{none} \wedge T_{none}) = 0$

Coverage of an FT System • $P_S(X) = \Pr\{\text{correct error processing} \mid X = \text{true}\} \times \Pr\{X = \text{true} \mid \text{component failed}\}$
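The two definitions can be illustrated numerically. In the sketch below the assumption coverage is estimated as the fraction of observed component failures for which the assertion X held, and the system coverage is the product of the two conditional probabilities. The function names and the sample figures are hypothetical and serve only to show the arithmetic.

```python
def assumption_coverage(held):
    """Estimate Pr{X = true | component failed} from a list of booleans,
    one per observed component failure (True when assertion X held)."""
    if not held:
        raise ValueError("need at least one observed failure")
    return sum(held) / len(held)

def system_coverage(p_processing, p_assumption):
    """Pr{correct error processing | X} times Pr{X | component failed}."""
    return p_processing * p_assumption

# Hypothetical data: X held in 3 of 4 observed failures.
p_x = assumption_coverage([True, True, True, False])
# Hypothetical error-processing coverage of 0.99 given that X holds.
p_s = system_coverage(0.99, p_x)
```

Even with near-perfect error processing, the system coverage in this example cannot exceed the assumption coverage of 0.75, which is the point of the product form.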

Influence of Assumption Coverage on System Dependability: A Case Study

The System • A system of n processors • Connected via unidirectional message-passing bus • Each processor carries out the same computation steps • The result of each processing step is communicated to all other processors • Each processor has a decision function (DF) • The DF is applied to the results received from other processors • … • Each processor and its associated bus is viewed as a single component
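The slides do not say which decision function is used. Purely as an illustration, a strict majority vote is one possible DF. The function name `majority_decision` and the choice of voting rule are assumptions, not the method of the case study.

```python
from collections import Counter

def majority_decision(results):
    """Return the value reported by a strict majority of processors,
    or None when no value has a strict majority."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None
```

For example, `majority_decision([4, 4, 5])` selects 4, while three mutually different results yield no decision.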

Fail-Silent Processor-Bus • A fail-silent processor • Only has semi-consistent value errors • Always produces messages on time • Or ceases to produce messages forever • If a message is delivered to one processor, it is delivered to all processors with a consistent, fixed delay

Fail-Consistent Processor-Bus • Only semi-consistent value errors may occur • Faulty processors may send erroneous values • Consistent timing errors may occur

Fail-Uncontrolled Processor-Bus • Arbitrary timing errors • Arbitrary value errors

Implications of Assumption Coverage • Failure mode relations • Coverage relations

Dependability Expressions From Markov Models • $r = e^{-\lambda t}$ • $\lambda$ = failure rate

A Life-critical Application • System reliability objective: $R > 1 - 10^{-9}$ over 10 hours • Single-processor reliability: $r = e^{-\lambda t}$ • $1/\lambda = 5$ years
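To see how far a single processor is from this objective, the figures above can be plugged into the exponential reliability formula. The sketch below assumes a 365-day year (8760 hours) when converting the 5-year mean time to failure. The function and variable names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: 365-day year

def single_processor_unreliability(mttf_years, mission_hours):
    """Return 1 - r, where r = exp(-lambda * t) and lambda = 1 / MTTF."""
    failure_rate = 1.0 / (mttf_years * HOURS_PER_YEAR)  # lambda, per hour
    # expm1 keeps precision when lambda * t is very small.
    return -math.expm1(-failure_rate * mission_hours)

unreliability = single_processor_unreliability(mttf_years=5, mission_hours=10)
target = 1e-9  # allowed system unreliability over the 10-hour mission
meets_objective = unreliability < target
```

Under these assumptions the single-processor unreliability over 10 hours is on the order of 2 × 10^-4, several orders of magnitude above the 10^-9 target, so a single processor cannot meet the objective on its own.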

A Money-Critical Application • The concern is the availability of the system rather than its reliability • See the paper for more details

Unavailability vs. Coverage

Conclusion • A formalism for describing component failure modes • Multiplicity of value and timing errors • The notion of assumption coverage • The relation between dependability, availability and assumption coverage

Thank you
