Software Fault Tolerance – The Big Picture. RTS, April 2008. Anders P. Ravn, Aalborg University
Fault Tolerance • Means to isolate component faults • Prevents system failures • May increase system dependability
Dependability - attributes • Availability • Reliability • Safety • Confidentiality • Integrity • Maintainability BW p. 129
Dependability - impairments • Faults • Errors • Failures (Diagram: the causal chain ... → Fault → Error → Failure → Fault → ...) BW p. 103, ..., 130
Dependability - means • Fault prevention • Fault tolerance • Error Removal • Failure Forecasting BW p. 106, ..., 130
Fault classification • Origin: physical (internal/external), logical (design/interaction) • Kind: omission, value, timing, byzantine • Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent)
Error Classification (Fault → Error) • Effect: latent, effective • Extent: local, distributed
Failure Classification (Fault → Error → Failure) • Consequence: benign, malign (a mishap) BW (Failure modes) p. 105
Fault Avoidance • Careful Design: process (activities), notations, tools • Conservative Design: robust functionality, testability, traceability
Error Removal • Verification (analysis of design) • Test (analysis of implementation)
Failure Forecasting • Calculation – analysis of design • Simulation – measurement on design • Test – measurement on implementation
Fault Tolerance • Means to isolate component faults ... and mask them • Prevents system failures • May increase system dependability
Dependability - means • Fault prevention • Fault tolerance • Error Removal • Failure Forecasting BW p. 106, ...
FT - levels • Full tolerance • Graceful Degradation • Fail safe BW p. 107
FT basis: Redundancy • Time: Try, then Retry • Space: several Try units side by side BW p. 109
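Time redundancy can be illustrated with a retry loop. This is a minimal sketch, not taken from the slides: the names `retry` and `flaky_read` are hypothetical, and the "transient fault" is simulated by a counter.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int) -> T:
    """Time redundancy: repeat the same operation until it succeeds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as error:  # a transient fault may be gone on the next try
            last_error = error
    assert last_error is not None
    raise last_error


# Hypothetical operation: fails twice with a simulated transient fault, then succeeds.
calls = {"count": 0}


def flaky_read() -> int:
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("transient fault")
    return 42


result = retry(flaky_read, attempts=5)
print(result, calls["count"])
```

Retrying only helps against transient faults. A permanent fault exhausts the attempts, and the last error is passed on to the caller.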
N-version programming • Versions V1, V2, V3 • Driver (comparator) • Comparison vectors (votes) • Comparison status indicators • Comparison points BW p. 109
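The driver's role can be sketched as a majority voter over independently written versions. The function names and the three toy "versions" below are assumptions made for illustration; a real driver would also handle comparison points, inexact voting and timing.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def n_version_vote(versions: Sequence[Callable[[int], int]], argument: int) -> int:
    """Run every version on the same input and return the majority result."""
    votes = []
    for version in versions:
        try:
            votes.append(version(argument))
        except Exception:
            continue  # a version that crashes casts no vote
    if not votes:
        raise RuntimeError("all versions failed")
    value, count = Counter(votes).most_common(1)[0]
    if 2 * count <= len(versions):
        raise RuntimeError("no majority among the versions")
    return value


# Three hypothetical versions of an integer square root; v3 is deliberately faulty.
def v1(n: int) -> int:
    return math.isqrt(n)


def v2(n: int) -> int:
    return int(n ** 0.5)


def v3(n: int) -> int:
    return n // 2


root = n_version_vote([v1, v2, v3], 16)
print(root)
```

With three versions, one faulty vote is outvoted. If two versions share the same design fault, the wrong answer can win the vote.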
Fault classification (scope of N-VP) • Origin: physical (internal/external), logical (design/interaction) • Kind: omission, value, timing, byzantine • Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent) (N-VP coverage marks on the slide, in extracted order: + + (+) ++ (+) + / (+) + / + + / +)
Dynamic Redundancy • Error detection • Damage confinement and assessment • Error recovery • Fault treatment and continued service BW p. 114
Error Detection f: State × Input → State × Output • Environment (exception) • Application • Assertion: precondition (input), postcondition (input, output), invariant (state, state’) • Timing: WCET(f, input), Deadline(f, input) BW p. 115
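Application-level detection for a function f: State × Input → State × Output can be sketched as a wrapper that checks the assertions and the deadline around each call. All names here (`checked`, `DetectedError`, `accumulate`) are hypothetical. The deadline is only checked after the call returns; it is not enforced, and WCET analysis is not modeled.

```python
import time
from typing import Callable, Tuple

State = int
Step = Callable[[State, int], Tuple[State, int]]


class DetectedError(Exception):
    """Raised when one of the run-time checks fails."""


def checked(f: Step,
            pre: Callable[[int], bool],
            post: Callable[[int, int], bool],
            invariant: Callable[[State, State], bool],
            deadline_s: float) -> Step:
    """Wrap f with precondition, postcondition, invariant and deadline checks."""
    def wrapper(state: State, value: int) -> Tuple[State, int]:
        if not pre(value):
            raise DetectedError("precondition violated")
        start = time.monotonic()
        new_state, output = f(state, value)
        if time.monotonic() - start > deadline_s:
            raise DetectedError("deadline missed")
        if not post(value, output):
            raise DetectedError("postcondition violated")
        if not invariant(state, new_state):
            raise DetectedError("invariant violated")
        return new_state, output
    return wrapper


# Hypothetical component: accumulates non-negative inputs.
def accumulate(state: State, value: int) -> Tuple[State, int]:
    total = state + value
    return total, total


checked_accumulate = checked(
    accumulate,
    pre=lambda value: value >= 0,
    post=lambda value, output: output >= value,
    invariant=lambda old, new: new >= old,
    deadline_s=1.0,
)
print(checked_accumulate(5, 3))
```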
Damage Confinement • Static structure • Dynamic structure (Diagram: two boxes, each labeled "object" and "I") BW p. 117
Error Recovery • Forward: repair the state, if you can! • Backward: define recovery points, checkpoint state at each recovery point, roll back, retry • Risk: domino effect BW p. 118
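Backward recovery for a single process can be sketched as checkpoint, roll back, retry. The class and function names are hypothetical. The sketch keeps the state in one object, so the domino effect, which arises between communicating processes, does not appear here.

```python
import copy
from typing import Any, Callable, List


class Recoverable:
    """Holds a state together with a stack of checkpoints (recovery points)."""

    def __init__(self, state: Any) -> None:
        self.state = state
        self._checkpoints: List[Any] = []

    def establish_recovery_point(self) -> None:
        self._checkpoints.append(copy.deepcopy(self.state))

    def discard_recovery_point(self) -> None:
        self._checkpoints.pop()

    def roll_back(self) -> None:
        self.state = self._checkpoints.pop()


def run_with_rollback(store: Recoverable,
                      action: Callable[[Any], None],
                      retries: int) -> bool:
    """Try the action; on an exception restore the checkpoint and retry."""
    for _ in range(retries + 1):
        store.establish_recovery_point()
        try:
            action(store.state)
            store.discard_recovery_point()
            return True
        except Exception:
            store.roll_back()
    return False


# Hypothetical action: fails on the first attempt after a partial update.
attempt = {"count": 0}


def debit(state: dict) -> None:
    attempt["count"] += 1
    state["balance"] -= 30          # partial update happens first
    if attempt["count"] == 1:
        raise RuntimeError("simulated error after partial update")
    state["log"].append("debit 30")


store = Recoverable({"balance": 100, "log": []})
succeeded = run_with_rollback(store, debit, retries=2)
print(succeeded, store.state)
```

The first attempt leaves the balance at 70 with no log entry. The rollback restores 100 before the retry, so the debit is applied only once.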
Recovery blocks
ENSURE acceptance_test
BY { module_1 }
ELSE BY { module_2 }
...
ELSE BY { module_m }
ELSE ERROR
BW p. 120
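The ENSURE ... BY ... ELSE BY ... ELSE ERROR scheme can be approximated in a general-purpose language. The sketch below is one possible reading and not the notation's definition: `recovery_block`, the two sort modules and the acceptance test are all hypothetical.

```python
import copy
from typing import Any, Callable, Sequence


class RecoveryBlockError(Exception):
    """Raised when every alternative fails the acceptance test (ELSE ERROR)."""


def recovery_block(acceptance_test: Callable[[Any], bool],
                   modules: Sequence[Callable[[Any], Any]],
                   state: Any) -> Any:
    """Try each module on a fresh copy of the state until one result is accepted."""
    for module in modules:
        candidate = copy.deepcopy(state)  # recovery point: the original stays untouched
        try:
            result = module(candidate)
        except Exception:
            continue                      # treat an exception as a failed alternative
        if acceptance_test(result):
            return result
    raise RecoveryBlockError("all alternatives failed the acceptance test")


data = [3, 1, 2]


def primary_sort(items: list) -> list:
    return sorted(items[:-1])             # deliberately faulty: drops one element


def alternate_sort(items: list) -> list:
    return sorted(items)


def acceptable(result: list) -> bool:
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(data)


print(recovery_block(acceptable, [primary_sort, alternate_sort], data))
```

The acceptance test is usually a cheap plausibility check, not a full specification. Here it checks ordering and length but would not notice a changed element.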
The ideal FT-component • Normal mode • Exception handler • Towards its clients: request/response, interface exception, failure exception • Towards the components it uses: request/response, interface exception, failure exception BW p. 126
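The distinction between interface exceptions and failure exceptions can be sketched with two layered components. Everything below is a hypothetical example: the `Sensor` and `Thermostat` classes, the valid range for the target, and the policy of masking a sensor failure with the last good reading.

```python
from typing import List, Optional


class InterfaceException(Exception):
    """The request itself was invalid; the caller is at fault."""


class FailureException(Exception):
    """The component could not deliver its service despite a valid request."""


class Sensor:
    """Lower-level component used by the thermostat."""

    def __init__(self, readings: List[float]) -> None:
        self._readings = list(readings)

    def read(self, channel: int) -> float:
        if channel != 0:
            raise InterfaceException("unknown channel")
        if not self._readings:
            raise FailureException("no reading available")
        return self._readings.pop(0)


class Thermostat:
    def __init__(self, sensor: Sensor) -> None:
        self._sensor = sensor
        self._last_good: Optional[float] = None

    def heating_needed(self, target: float) -> bool:
        if not 0.0 <= target <= 40.0:
            raise InterfaceException("target out of range")
        try:
            temperature = self._sensor.read(0)       # normal mode
        except FailureException:
            # Exception handler: mask the failure if a previous value exists.
            if self._last_good is None:
                raise FailureException("no temperature available")
            temperature = self._last_good
        self._last_good = temperature
        return temperature < target


thermostat = Thermostat(Sensor([18.0]))
first = thermostat.heating_needed(20.0)   # uses the sensor reading
second = thermostat.heating_needed(15.0)  # sensor fails, last good value is used
print(first, second)
```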
Safety Assessment Find faults that may lead to mishaps, analyze their relations, and estimate their consequences. May involve probabilistic reasoning (Reliability Engineering)
Fault Tree - Events • Primary events: basic event (fault in atomic component); undeveloped event (fault in composite component, may be analyzed later); external event (expected event from environment) • Intermediate event: a node inside a fault tree
Fault Tree - Gates • Inhibit gate, with its condition (Diagram: gate symbols with their inputs and outputs)
Example – "Wake too late" • Top event: Wake too late • Contributing events: "Inner clock" fails, Phone fails, Alarm clock fails
Example – "Alarm clock fails" • Top event: Alarm clock fails • Events below it: Power fails, Beeper fails, Electronics fail, Button fails, SW fails, Beeper not set, Button read fails
Cut Set • A cut set is a set of events that together cause the top-level event. • A singleton cut set is a single point of failure.
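Minimal cut sets can be computed mechanically from a tree of AND and OR gates. The sketch below reuses event names from the examples, but the gate types are an assumption, since the extracted slides do not show them, and the alarm-clock subtree is shortened to three events.

```python
from itertools import product
from typing import FrozenSet, Set, Tuple, Union

Tree = Union[str, Tuple[str, list]]   # a basic event, or (gate, children)
CutSet = FrozenSet[str]


def minimize(sets: Set[CutSet]) -> Set[CutSet]:
    """Drop every cut set that strictly contains another one."""
    return {s for s in sets if not any(other < s for other in sets)}


def cut_sets(node: Tree) -> Set[CutSet]:
    """Return the minimal cut sets of a fault tree with AND and OR gates."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":       # any one child suffices
        combined = set().union(*child_sets)
    elif gate == "AND":    # one cut set from every child is needed
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimize(combined)


def single_points_of_failure(node: Tree) -> Set[str]:
    return {next(iter(s)) for s in cut_sets(node) if len(s) == 1}


# Assumed gate types, for illustration only.
alarm_clock_fails: Tree = ("OR", ["Power fails", "Beeper fails", "SW fails"])
wake_too_late: Tree = ("AND", ['"Inner clock" fails', "Phone fails", alarm_clock_fails])

print(sorted(single_points_of_failure(alarm_clock_fails)))
print(len(cut_sets(wake_too_late)), single_points_of_failure(wake_too_late))
```

Under these assumed gates the alarm clock alone has three single points of failure. The top event has none, because every cut set needs all three wake-up mechanisms to fail.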
Extensions etc. • Probabilities on edges • Event tree (forward analysis from initiating event) • Combinations (cause-consequence diagrams) • Many tools. Reference: Kirsten M. Hansen, Anders P. Ravn and Victoria Stavridou, From Safety Analysis to Formal Specification, IEEE Trans. Softw. Eng. 24, pp. 573-584, July 1998
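The probabilistic extension can be sketched by attaching a probability to each basic event and propagating it through the gates. This sketch makes several assumptions: the basic events are independent, no event appears in more than one branch, the gate types are invented for illustration, and the numbers are made up and not taken from any analysis.

```python
from math import prod
from typing import Dict, Tuple, Union

Tree = Union[str, Tuple[str, list]]   # a basic event, or (gate, children)


def top_probability(node: Tree, p: Dict[str, float]) -> float:
    """Probability of the event at this node, assuming independent basic events."""
    if isinstance(node, str):
        return p[node]
    gate, children = node
    child_p = [top_probability(child, p) for child in children]
    if gate == "AND":      # all children must occur
        return prod(child_p)
    if gate == "OR":       # at least one child occurs
        return 1.0 - prod(1.0 - q for q in child_p)
    raise ValueError(f"unknown gate: {gate}")


# Illustrative tree and made-up probabilities.
alarm_clock_fails: Tree = ("OR", ["Power fails", "Beeper fails"])
wake_too_late: Tree = ("AND", ['"Inner clock" fails', "Phone fails", alarm_clock_fails])

probabilities = {
    '"Inner clock" fails': 0.5,
    "Phone fails": 0.5,
    "Power fails": 0.1,
    "Beeper fails": 0.2,
}

p_top = top_probability(wake_too_late, probabilities)
print(round(p_top, 4))
```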
Procedure • Model the correct component and check that it has the desired properties. • Model relevant faults and introduce them as internal transitions to error states. Check that this fault-affected. • Introduce into the model the mechanisms for fault detection, error recovery and masking and check that the desired properties are valid for this design.
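The three steps can be sketched with a tiny explicit-state reachability check. The state names, the transitions and the property "the state failed is never reached" are all hypothetical. The second check, that the fault-affected model violates the property, is an assumption, because the corresponding sentence in the source is cut off.

```python
from collections import deque
from typing import Dict, Set

Model = Dict[str, Set[str]]   # state -> successor states


def reachable(model: Model, start: str) -> Set[str]:
    """Breadth-first search over the transition relation."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for successor in model.get(state, set()):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def is_safe(model: Model) -> bool:
    """Desired property in this sketch: 'failed' is unreachable from 'idle'."""
    return "failed" not in reachable(model, "idle")


# Step 1: the correct component.
correct: Model = {"idle": {"busy"}, "busy": {"idle"}}

# Step 2: a fault as an internal transition to an error state,
# which propagates to a failure if nothing handles it.
fault_affected: Model = {
    "idle": {"busy"},
    "busy": {"idle", "error"},
    "error": {"failed"},
}

# Step 3: detection and recovery replace the propagation to failure.
fault_tolerant: Model = {
    "idle": {"busy"},
    "busy": {"idle", "error"},
    "error": {"idle"},
}

print(is_safe(correct), is_safe(fault_affected), is_safe(fault_tolerant))
```

A real analysis would use a model checker and richer properties. The sketch only shows how the three models relate to each other.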