Software Fault Tolerance – The big Picture

RTS, April 2008. Anders P. Ravn, Aalborg University.


Presentation Transcript


  1. Software Fault Tolerance – The big Picture. RTS, April 2008. Anders P. Ravn, Aalborg University

  2. Fault Tolerance • Means to isolate component faults • Prevents system failures • May increase system dependability

  3. Dependability - attributes • Availability • Reliability • Safety • Confidentiality • Integrity • Maintainability BW p. 129

  4. Dependability - impairments • Faults • Errors • Failures • Chain: ... → Fault → Error → Failure → Fault → ... BW p. 103, ..., 130

  5. System and Component

  6. Dependability - means • Fault prevention • Fault tolerance • Error Removal • Failure Forecasting BW p. 106, ..., 130

  7. Fault classification • Origin: physical (internal/external), logical (design/interaction) • Kind: omission, value, timing, byzantine • Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent)

  8. Error Classification (Fault → Error) • Effect: latent, effective • Extent: local, distributed

  9. Failure Classification (Fault → Error → Failure) • Consequence: benign, malign (a mishap) BW (Failure modes) p. 105

  10. Fault Avoidance • Careful Design: process (activities), notations, tools • Conservative Design: robust functionality, testability, traceability

  11. Error Removal • Verification (analysis of design) • Test (analysis of implementation)

  12. Failure Forecasting • Calculation – analysis of design • Simulation – measurement on design • Test – measurement on implementation

  13. Fault Tolerance • Means to isolate component faults ... and mask them • Prevents system failures • May increase system dependability

  14. Dependability - means • Fault prevention • Fault tolerance • Error Removal • Failure Forecasting BW p. 106, ...

  15. Fault Tolerance

  16. FT - levels • Full tolerance • Graceful Degradation • Fail safe BW p. 107

  17. FT basis: Redundancy • Time (Try, Retry, ...) • Space (Try, Try, Try, ...) BW p. 109
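
Redundancy in time can be sketched as a simple retry loop. This is a minimal illustration, not code from the slides: the names `retry` and `make_flaky` are hypothetical, and the sketch assumes the fault is transient, so repeating the same operation can eventually succeed.

```python
def retry(operation, attempts=3):
    """Time redundancy: re-run the same operation until it succeeds."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # assumed to be a transient fault
            last_error = exc
    raise RuntimeError(f"failed after {attempts} attempts") from last_error


def make_flaky(failures):
    """Hypothetical operation that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def operation():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise IOError("transient fault")
        return "ok"

    return operation


result = retry(make_flaky(2))  # two failed tries, the third succeeds
```

Redundancy in space would instead run several copies side by side, which is what N-version programming does with independently written versions.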

  18. N-version programming • Versions V1, V2, V3 • Driver (comparator) • Comparison vectors (votes) • Comparison status indicators • Comparison points BW p. 109

  19. Fault classification (scope of N-VP) • Origin: physical (internal/external), logical (design/interaction) • Kind: omission, value, timing, byzantine • Property: duration (permanent, transient), consistency (determinate, nondeterminate), autonomy (spontaneous, event-dependent) • Coverage marks on the slide: + + (+) ++ (+) + / (+) + / + + / +
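
The driver in N-version programming could look roughly like the following sketch. It assumes majority voting over exactly comparable (hashable) results; the function name `n_version_vote` and the three toy versions are hypothetical and only illustrate the comparison step.

```python
from collections import Counter

_FAILED = object()  # sentinel marking a version that raised an exception


def n_version_vote(versions, *args):
    """Run every version on the same input and return the majority result.

    Returns (value, votes), where votes is the comparison vector.
    Raises RuntimeError when no strict majority agrees.
    """
    votes = []
    for version in versions:
        try:
            votes.append(version(*args))
        except Exception:
            votes.append(_FAILED)  # a crashed version casts no valid vote
    counts = Counter(v for v in votes if v is not _FAILED)
    if counts:
        value, count = counts.most_common(1)[0]
        if count * 2 > len(versions):
            return value, votes
    raise RuntimeError("no majority among versions")


# Three hypothetical versions of "square"; the third has a design fault.
v1 = lambda x: x * x
v2 = lambda x: x ** 2
v3 = lambda x: x * 2

value, votes = n_version_vote([v1, v2, v3], 3)  # value == 9, votes == [9, 9, 6]
```

Exact equality is a simplification: with floating-point results a real comparator typically needs a tolerance, which is one source of the consistent comparison problem.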

  20. Dynamic Redundancy • Error detection • Damage confinement and assessment • Error recovery • Fault treatment and continued service BW p. 114

  21. Error Detection f: State × Input → State × Output • Environment (exception) • Application • Assertion: precondition (input), postcondition (input, output), invariant (state, state') • Timing: WCET(f, input), Deadline(f, input) BW p. 115

  22. Damage Confinement • Static structure • Dynamic structure BW p. 117

  23. Error Recovery • Forward: repair the state, if you can! • Backward: define recovery points, checkpoint state at r. p., roll back, retry • Domino effect BW p. 118
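
Application-level error detection for a function f: State × Input → State × Output can be sketched as a wrapper that checks the assertions listed on this slide. The names `checked`, `DetectedError`, and `accumulate` are hypothetical. Note the assumption on timing: the sketch only measures elapsed time after the call and compares it with a deadline; it does not perform WCET analysis and cannot interrupt an overrunning call.

```python
import time


class DetectedError(Exception):
    """Raised when one of the run-time checks fails."""


def checked(f, pre, post, invariant, deadline_s):
    """Wrap f(state, inp) -> (new_state, out) with assertion and timing checks."""
    def wrapper(state, inp):
        if not pre(inp):
            raise DetectedError("precondition violated")
        start = time.monotonic()
        new_state, out = f(state, inp)
        elapsed = time.monotonic() - start
        if elapsed > deadline_s:
            raise DetectedError("deadline missed")
        if not post(inp, out):
            raise DetectedError("postcondition violated")
        if not invariant(state, new_state):
            raise DetectedError("invariant violated")
        return new_state, out
    return wrapper


def accumulate(state, inp):
    """Hypothetical component: adds the input to a running total."""
    total = state + inp
    return total, total


safe_accumulate = checked(
    accumulate,
    pre=lambda inp: inp >= 0,                   # input must be non-negative
    post=lambda inp, out: out >= inp,           # output at least the input
    invariant=lambda old, new: new >= old,      # total never decreases
    deadline_s=1.0,
)

new_state, out = safe_accumulate(1, 2)  # (3, 3)
```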
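
Backward error recovery for a single process can be sketched as follows. This is an illustration under simplifying assumptions: the state is small enough to deep-copy, there is only one process (so no domino effect can arise), and the names `run_with_rollback` and `withdraw` are hypothetical.

```python
import copy


def run_with_rollback(state, action, acceptable, retries=2):
    """Checkpoint the state, run the action, and roll back and retry on error."""
    saved = copy.deepcopy(state)            # checkpoint at the recovery point
    for _ in range(retries + 1):
        working = copy.deepcopy(saved)      # every attempt starts from the checkpoint
        try:
            action(working)
            if acceptable(working):
                return working
        except Exception:
            pass                            # treat an exception as a detected error
        # Dropping `working` here is the roll back: the next try uses `saved`.
    raise RuntimeError("recovery failed: no acceptable state reached")


calls = {"n": 0}


def withdraw(account):
    """Hypothetical action that corrupts the state on its first call only."""
    calls["n"] += 1
    if calls["n"] == 1:
        account["balance"] = -1             # transient fault corrupts the state
    else:
        account["balance"] -= 30


account = {"balance": 100}
recovered = run_with_rollback(account, withdraw, lambda s: s["balance"] >= 0)
# recovered == {"balance": 70}; the original `account` is left untouched
```

With several communicating processes, rolling one back may force its partners back to their own earlier recovery points, which is the domino effect the slide mentions.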

  24. Recovery blocks ENSURE acceptance_test BY { module_1 } ELSE BY { module_2 } ... ELSE BY { module_m } ELSE ERROR BW p. 120
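
The ENSURE ... BY ... ELSE BY ... ELSE ERROR structure can be approximated in Python as below. This is a sketch, not a language feature: `recovery_block`, `buggy_sort`, and `is_sorted` are hypothetical names, and restoring the recovery point is modelled by giving each module its own deep copy of the input state.

```python
import copy


def recovery_block(acceptance_test, modules, state):
    """Try each alternative module in order until one passes the acceptance test."""
    for module in modules:
        candidate = copy.deepcopy(state)    # restore the recovery point
        try:
            result = module(candidate)
        except Exception:
            continue                        # a crash counts as a failed alternative
        if acceptance_test(result):
            return result
    raise RuntimeError("ELSE ERROR: all alternatives failed")


def buggy_sort(items):
    """Hypothetical primary module with a design fault: it only reverses."""
    return items[::-1]


def is_sorted(items):
    """Acceptance test: checks ordering only, not that elements are preserved."""
    return all(a <= b for a, b in zip(items, items[1:]))


result = recovery_block(is_sorted, [buggy_sort, sorted], [3, 1, 2])
# buggy_sort yields [2, 1, 3] and is rejected; sorted yields [1, 2, 3]
```

The acceptance test here is deliberately weak. How much it checks determines which faults the recovery block can actually mask.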

  25. The ideal FT-component • Normal mode • Exception Handler • Interfaces (to the caller and to sub-components): Request/response, Interface exception, Failure exception BW p. 126

  26. Safety Assessment Find faults that may lead to mishaps, analyze their relations, and estimate their consequences. May involve probabilistic reasoning (Reliability Engineering)

  27. Fault Tree - Events • Primary events: Basic event – fault in atomic component; Undeveloped event – fault in composite component (may be analyzed later); External event – expected event from environment • Intermediate event: nodes inside a fault tree

  28. Fault Tree - Gates ... • Inhibit gate (with condition)

  29. Example – "Wake too late" • Wake too late • "Inner clock" fails • Phone fails • Alarm clock fails

  30. Example – "Alarm clock fails" • Alarm clock fails • Power fails • Beeper fails • Electronics fail • Button fails • SW fails • Beeper not set • Button read fails

  31. Cut Set • A cut set is a set of events that causes a top-level event. • A singleton cut set is a single point of failure.

  32. Example – "Wake too late" • Wake too late • "Inner clock" fails • Phone fails • Alarm clock fails

  33. Example – "Alarm clock fails" • Alarm clock fails • Power fails • Beeper fails • Electronics fail • Button fails • SW fails • Beeper not set • Button read fails
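
Minimal cut sets of a small fault tree can be computed as in the sketch below. The tree encoding (nested `("AND", [...])` / `("OR", [...])` tuples with strings as basic events) and the function names are hypothetical. The gate types in the example are also assumptions, since the transcript does not preserve which gates the "Wake too late" tree uses, and the alarm clock subtree is reduced to three events.

```python
from itertools import product


def minimal(sets):
    """Drop duplicates and every set that strictly contains another cut set."""
    unique = set(sets)
    return [s for s in unique if not any(other < s for other in unique)]


def cut_sets(node):
    """Return the minimal cut sets of a fault tree as a list of frozensets."""
    if isinstance(node, str):               # basic event
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":                        # any one child causes the event
        combined = [s for sets in child_sets for s in sets]
    elif gate == "AND":                     # all children must occur together
        combined = [frozenset().union(*combo) for combo in product(*child_sets)]
    else:
        raise ValueError(f"unknown gate: {gate}")
    return minimal(combined)


def single_points_of_failure(node):
    """Events that form a singleton cut set."""
    return sorted(next(iter(s)) for s in cut_sets(node) if len(s) == 1)


# Assumed gates: the alarm clock fails if ANY part fails (OR);
# waking too late requires ALL three wake-up mechanisms to fail (AND).
alarm = ("OR", ["Power fails", "Beeper fails", "SW fails"])
wake_too_late = ("AND", ["Inner clock fails", "Phone fails", alarm])

alarm_spofs = single_points_of_failure(alarm)        # all three events
top_spofs = single_points_of_failure(wake_too_late)  # empty list
```

Under these assumed gates the top-level event has no single point of failure, because every cut set contains three events.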

  34. Extensions etc. • Probabilities on edges • Event tree (forward analysis from initiating event) • Combinations (cause-consequence diagrams) • Many tools • Reference: Kirsten M. Hansen, Anders P. Ravn and Victoria Stavridou, From Safety Analysis to Formal Specification, IEEE Trans. Softw. Eng. 24, pp. 573-584, July 1998

  35. Example

  36. Fault Hypotheses

  37. Fault-Tolerant System

  38. Impulse Generator

  39. CU

  40. Voter and Arbiter

  41. Parameters

  42. Properties

  43. Procedure • Model the correct component and check that it has the desired properties. • Model relevant faults and introduce them as internal transitions to error states. Check that this fault-affected. • Introduce into the model the mechanisms for fault detection, error recovery and masking and check that the desired properties are valid for this design.
