Software Fault-Tolerance Techniques from a Real-Time Systems Point of View by: Martin Hiller

Software Fault-Tolerance Techniques from a Real-Time Systems Point of View
by: Martin Hiller
Kevin Peelman, Rob Hunt, Chuck Agnew
December 1st, 2003

Outline
• Introduction & Terminology (Chuck Agnew)
• Faults and Fault Tolerance (Chuck Agnew)
• Real-Time Systems Concepts (Chuck Agnew)
• Introduction to Software Fault Tolerance Techniques (Rob Hunt)
• Software Fault Tolerance Techniques, part 1 (Rob Hunt)
• Software Fault Tolerance Techniques, part 2 (Kevin Peelman)
• Conclusion (Kevin Peelman)

Introduction
• “Software Faults are always due to flaws in the design of the system.”
• Older fault-tolerance methods relied on backward recovery, which is unsuitable for real-time systems because of time constraints and changes in operational state.
• Dependability of a computer system can be achieved by four means:
  • Fault prevention: preventing the occurrence of faults
  • Fault tolerance: providing service even when faults occur
  • Fault removal: minimizing the presence of faults
  • Fault forecasting: estimating the manifestation of faults

Topic Terminology
• What is dependability?
  • “Defined as the trustworthiness of a system such that reliance can justifiably be placed on the service it provides”
• Dependability has 4 main attributes:
  • Availability: a measure of the system’s readiness for use
  • Reliability: the probability that the system will not fail during operation
  • Safety: avoidance of catastrophic consequences for the environment
  • Security: the extent to which the system prevents unauthorized access
• Other related terms:
  • Failure: the system can no longer provide its specified service
  • Error: a system state that can lead to a subsequent failure
  • Fault: the cause of an error
• How do we classify faults?

Fault Classification
• Nature: whether the fault was intended
  • Accidental
  • Intentional
• Origin: categorized along 3 dimensions
  • Phenomenon: does the fault stem from a physical or a human-made phenomenon?
  • Extent: is the cause of the fault external or internal to the system?
  • Phase: is the fault introduced during design or during operation?
• Persistence: the duration of the fault state
  • Permanent
  • Temporary

Fault Tolerance
• What is fault tolerance?
  • The ability of an operational system to tolerate the presence of faults
• Why tolerate faults?
  • It is not feasible to test a practical-sized system completely.
  • Therefore, it is important to implement techniques that allow a system to detect and tolerate faults during normal operation.
• 4 phases of fault tolerance:
  • Error detection: detecting an erroneous state
  • Damage assessment: assessing the severity of the damage caused by the fault
  • Error processing: substituting an error-free state for the erroneous one
  • Fault treatment: determining the cause of the error, then passivating the fault so that it does not recur

Real-Time Systems Concepts
• Real-time systems are characterized by 3 major components:
  • Time
    • The most important resource of a real-time system
    • Correctness depends not only on the results of computations, but also on the time at which those results are produced
  • Environment
    • An active component of a real-time system
    • A real-time system consists of a controlling system and a controlled system. The controlled system (the external process) is the environment of the controlling system.
  • Reliability
    • Crucial, because the failure of a real-time system can have catastrophic consequences
    • A system whose failures do not jeopardize the safety of the environment is called fail-safe.
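The point about time can be made concrete with a small sketch. It is illustrative only: `TimedResult`, `run_with_deadline`, and `slow_task` are hypothetical names, and the check happens after the task returns, so it detects a missed deadline rather than preventing one.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TimedResult:
    value: float      # what the task computed
    elapsed_s: float  # how long the computation took, in seconds
    on_time: bool     # True only if the deadline was met


def run_with_deadline(task: Callable[[], float], deadline_s: float) -> TimedResult:
    """Run task and record whether it finished within deadline_s seconds."""
    start = time.monotonic()
    value = task()
    elapsed = time.monotonic() - start
    return TimedResult(value, elapsed, elapsed <= deadline_s)


def slow_task() -> float:
    """Hypothetical workload: the right value, about 50 ms too slow."""
    time.sleep(0.05)
    return 42.0
```

With a 10 ms deadline, `run_with_deadline(slow_task, 0.01)` returns the right value but `on_time` is false, so in the real-time sense the result counts as a failure.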

Software Fault Tolerance Techniques
• The key to fault tolerance is redundancy
• Three domains of redundancy:
  • Space: several hardware channels, each executing the same task
  • Information: the system is recovered from data structures that store the system contents
  • Repetition: a faulty module is restarted
• Two major schemes have evolved:
  • Recovery Block (RB)
    • 1H/Nds/NT system
    • Faults are tolerated by executing diverse software modules sequentially
  • N-Version Programming (NVP)
    • NH/Nds/1T system
    • Faults are tolerated by identical hardware channels executing diverse software modules concurrently

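Software Fault Tolerance: Recovery Block
[Figure: recovery block flow. A checkpoint is taken, and a switch selects the primary or one of the alternates (Alternate 1 ... Alternate N-1). The result goes to the acceptance test. If the test passes, the result is delivered. If it fails, the state is restored from the checkpoint and the next alternate is tried, as long as more alternates remain and the deadline is not exceeded; otherwise a fault is signaled.]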

Software Fault Tolerance: Recovery Block
• Considerations
  • Software diversity
    • Idea: different teams, one specification, different products
    • The hope is that the failure domains do not overlap
  • Difficulties in designing the acceptance test
    • A single test serves all modules of the recovery block
    • The test is the most crucial element in improving reliability
  • Design of the recovery cache
    • Must be simple enough to be free of faults
  • Increased system overhead
  • Domino effect
    • Recovery blocks can push communicating concurrent tasks into uncontrolled rollback
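A minimal sketch of the recovery block flow, under stated assumptions: the state is a plain dictionary, the checkpoint is a deep copy of it, and the deadline is only checked between variants. The names `recovery_block` and `RecoveryBlockFailure` are hypothetical and not taken from the report.

```python
import copy
import time
from typing import Any, Callable, Dict, Sequence


class RecoveryBlockFailure(Exception):
    """No variant passed the acceptance test before the deadline."""


def recovery_block(
    state: Dict[str, Any],
    variants: Sequence[Callable[[Dict[str, Any]], Any]],
    acceptance_test: Callable[[Any], bool],
    deadline_s: float,
) -> Any:
    """Try the primary, then each alternate, until one result is accepted."""
    checkpoint = copy.deepcopy(state)  # assumption: state is cheap to copy
    start = time.monotonic()
    for variant in variants:  # variants[0] is the primary
        if time.monotonic() - start > deadline_s:
            break  # deadline exceeded: stop trying alternates
        try:
            result = variant(state)
            if acceptance_test(result):
                return result
        except Exception:
            pass  # a crash is treated like a failed acceptance test
        # Backward recovery: restore the checkpoint before the next variant.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockFailure("no variant produced an acceptable result")
```

Because the variants run one after another, the worst-case execution time is the sum over all variants plus the restore cost, which is why the deadline check matters in a real-time setting.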

Software Fault Tolerance: N-Version Programming
• N-Version Programming (NH/Nds/1T)
  • Several hardware channels
  • “Software diverse” versions of the code
  • Results are voted upon
  • The initial specification is crucial
[Figure: N-version programming flow. A switch feeds the input to Version 1 ... Version N, which are synchronized and pass their results to a voter. Majority agreement produces the output; no agreement is a failure.]

Software Fault Tolerance: N-Version Programming
• Considerations
  • Software diversity!
    • It is difficult to create a good specification
  • Decision mechanism
    • Results from different versions will not always be identical, whether they are valid or invalid
    • Defining a range of valid results addresses this, but brings the voter closer to the acceptance test approach
  • System overhead
    • Temporal: synchronization and the decision algorithm
    • Space: multiple hardware channels and storage for multiple software versions
  • Extensions
    • Community Error Recovery (forward recovery)
      • The good versions supply enough information to recover the failed versions
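The decision mechanism can be sketched as a majority voter with a tolerance. This is an illustration under assumptions: results are numeric, threads stand in for the separate hardware channels, and `majority_vote` and `n_version` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def majority_vote(
    results: Sequence[float], n_versions: int, tolerance: float = 0.0
) -> Optional[float]:
    """Return the first result that a majority of all N versions agree with."""
    needed = n_versions // 2 + 1
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tolerance]
        if len(agreeing) >= needed:
            return candidate
    return None  # no agreement


def n_version(
    versions: Sequence[Callable[[float], float]], x: float, tolerance: float = 0.0
) -> Optional[float]:
    """Run every version concurrently on the same input, then vote."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        futures = [pool.submit(version, x) for version in versions]
    results: List[float] = []
    for future in futures:
        try:
            results.append(future.result())
        except Exception:
            pass  # a crashed version simply casts no vote
    return majority_vote(results, len(versions), tolerance)
```

A wider `tolerance` accepts more results as "the same", which is exactly the trade-off noted above: the voter starts to behave like an acceptance test on a range of values.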

Software Fault Tolerance: Consensus Recovery Block (CRB)
• NH/Nds/1T
• Synthesis of N-version programming and the recovery block
• Basic assumptions:
  • No similar errors will occur (erroneous results resembling each other)
  • If two or more versions agree, the result is considered correct
[Figure: consensus recovery block flow. A switch feeds the input to Version 1 ... Version N, whose results go to a voter. Agreement produces the output. With no agreement, results are passed to an acceptance test (AT) as long as versions remain untried and the time limit has not expired; otherwise the block fails.]
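A minimal sketch of the consensus recovery block, assuming hashable results compared by exact equality, versions listed in order of preference, and sequential execution in place of parallel hardware channels. `consensus_recovery_block` and `ConsensusFailure` are hypothetical names.

```python
import time
from collections import Counter
from typing import Any, Callable, List, Sequence


class ConsensusFailure(Exception):
    """Neither the voter nor the acceptance test produced a result."""


def consensus_recovery_block(
    versions: Sequence[Callable[[Any], Any]],
    x: Any,
    acceptance_test: Callable[[Any], bool],
    time_limit_s: float,
) -> Any:
    start = time.monotonic()
    results: List[Any] = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashed version contributes no result
    # Stage 1 (NVP): if two or more versions agree, accept that result.
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes >= 2:
            return value
    # Stage 2 (RB): no agreement, so apply the acceptance test to each
    # result in order until one passes or the time limit expires.
    for result in results:
        if time.monotonic() - start > time_limit_s:
            break
        if acceptance_test(result):
            return result
    raise ConsensusFailure("no agreement and no acceptable result")
```

The voter is tried first because agreement between diverse versions does not depend on the quality of the acceptance test; the test is only the fallback.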

Software Fault Tolerance: Distributed Recovery Block
• NH/NS/1T or Nhs/Nds/1T
• Reproduces the RB scheme on multiple network nodes
• Considerations
  • Synchronization between nodes, especially during rollback
[Figure: distributed recovery block. The input goes to a primary node and a secondary node. Each node holds Version A and Version B followed by an acceptance test; when a result is not accepted, the node tries the other version while more alternates remain and the deadline is not exceeded.]

Extended Distributed Recovery Block
• Heartbeat scheme
  • Active node
  • Shadow node
  • Supervisor node
• Each node contains
  • Primary version
  • Alternate version
  • Acceptance test
  • Device drivers
[Figure: extended distributed recovery block. The supervisor hosts the recovery manager and exchanges heartbeat/reset, request, and consent messages with the active and shadow nodes ("Node Exec."). The active and shadow nodes exchange heartbeats with each other. Each node holds a primary version, an alternate version, an acceptance test, and device drivers that connect to the system.]
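The slide does not spell out the heartbeat protocol, so the following is only a sketch of one plausible reading: a node is presumed failed once its last heartbeat is older than a timeout, and the supervisor then prefers the shadow node. `HeartbeatMonitor`, `choose_active`, the timeout policy, and the injectable clock are all assumptions.

```python
import time
from typing import Callable, Dict, Optional


class HeartbeatMonitor:
    """Tracks the time of the last heartbeat received from each node."""

    def __init__(self, timeout_s: float, now: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self._now = now  # injectable clock, so the logic can be tested
        self._last_beat: Dict[str, float] = {}

    def beat(self, node: str) -> None:
        """Record a heartbeat from node."""
        self._last_beat[node] = self._now()

    def is_alive(self, node: str) -> bool:
        """A node is alive if it sent a heartbeat within the timeout."""
        last = self._last_beat.get(node)
        return last is not None and self._now() - last <= self.timeout_s


def choose_active(monitor: HeartbeatMonitor, active: str, shadow: str) -> Optional[str]:
    """Assumed supervisor policy: keep the active node while it is alive,
    otherwise promote the shadow, otherwise report that no node is available."""
    if monitor.is_alive(active):
        return active
    if monitor.is_alive(shadow):
        return shadow
    return None
```

The timeout trades detection latency against false alarms: a short timeout detects a dead node quickly but may demote a node that is merely slow.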

Roll-Forward Checkpointing Scheme
• Used for multiprocessor systems
• Pool of active processing modules
  • Processor
  • Volatile storage
  • Stable storage
• Checkpoint processor
  • The checkpoint processor detects module failures by comparing the state of each pair of processing modules that perform the same task.
  • The two processors execute their tasks, checkpoint their states, and send the checkpoints to the checkpoint processor.
  • The checkpoint processor compares the states. If they match, the new checkpoint is considered correct and replaces the old checkpoint.
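The comparison step can be sketched as follows. The sketch covers only what the slide states: matching states are committed as the new checkpoint, and a mismatch signals a failure while the old checkpoint is kept. What happens after a mismatch is not described here and is not modeled. `CheckpointProcessor` and `submit` are hypothetical names, and states are assumed to be comparable with `==`.

```python
import copy
from typing import Any


class CheckpointProcessor:
    """Keeps the last checkpoint that both processing modules agreed on."""

    def __init__(self, initial_state: Any):
        self.committed = copy.deepcopy(initial_state)

    def submit(self, state_a: Any, state_b: Any) -> bool:
        """Compare the checkpoints sent by a pair of processing modules.

        Returns True and replaces the old checkpoint if the states match.
        Returns False (failure detected) and keeps the old one otherwise.
        """
        if state_a == state_b:
            self.committed = copy.deepcopy(state_a)
            return True
        return False
```

Copying on commit keeps the committed checkpoint independent of any later changes the processing modules make to their own state.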

N Self-Checking Program
• Made up of several self-checking components
  • Each component is made up of different variants
  • Variants are either associated with an acceptance test, or paired together and associated with a comparison algorithm
• Components execute in parallel
  • Fault tolerance is provided by the parallel execution of components
  • Each component is responsible for determining whether its delivered result is acceptable
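Both kinds of self-checking component can be sketched in a few lines. The assumptions are labeled: threads stand in for parallel hardware, the comparison algorithm is plain equality, and the first component (in list order) that judges its own result acceptable delivers the output. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence, Tuple

# A component maps an input to (result, acceptable according to its own check).
Component = Callable[[Any], Tuple[Any, bool]]


def with_acceptance_test(
    variant: Callable[[Any], Any], test: Callable[[Any, Any], bool]
) -> Component:
    """Self-checking component: one variant plus an acceptance test."""
    def component(x: Any) -> Tuple[Any, bool]:
        result = variant(x)
        return result, test(x, result)
    return component


def with_comparison(
    variant_a: Callable[[Any], Any], variant_b: Callable[[Any], Any]
) -> Component:
    """Self-checking component: a pair of variants whose results are compared."""
    def component(x: Any) -> Tuple[Any, bool]:
        a, b = variant_a(x), variant_b(x)
        return a, a == b
    return component


def n_self_checking(components: Sequence[Component], x: Any) -> Any:
    """Run all components in parallel and deliver the first acceptable result."""
    with ThreadPoolExecutor(max_workers=len(components)) as pool:
        futures = [pool.submit(component, x) for component in components]
    for future in futures:  # all futures are finished once the pool exits
        try:
            result, acceptable = future.result()
        except Exception:
            continue  # a crashed component delivers nothing
        if acceptable:
            return result
    raise RuntimeError("no component delivered an acceptable result")
```

Unlike the recovery block, no rollback is needed: every component has already run, so switching to another component's result costs no extra execution time.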

Data Diversity
• Retry block
  • Executes the algorithm normally
  • If the results are accepted by the test, execution is complete
  • If the results are not accepted, the algorithm runs again once the input data has been restated (re-expressed)
• N-copy programming
  • Upon entry to the block, the data is restated in N-1 ways
  • This creates N different data sets
  • The copies execute in parallel
  • The output is selected with a voting scheme
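A minimal sketch of the retry block. The example fault and the restatement are invented for illustration: `buggy_sin` is a hypothetical sine routine that fails at exactly 0.0, and `shift_by_period` restates the input as x + 2π, which is the same input mathematically but only approximately so in floating point.

```python
import math
from typing import Callable


def retry_block(
    algorithm: Callable[[float], float],
    x: float,
    acceptance_test: Callable[[float], bool],
    restate: Callable[[float], float],
    max_retries: int,
) -> float:
    """Run the same algorithm again on restated input until a result passes."""
    data = x
    for _ in range(max_retries + 1):
        try:
            result = algorithm(data)
            if acceptance_test(result):
                return result
        except Exception:
            pass  # a crash is treated like a rejected result
        data = restate(data)  # data diversity: same algorithm, different data
    raise RuntimeError("no acceptable result after restating the input")


def buggy_sin(x: float) -> float:
    """Hypothetical faulty routine: wrong only for the input 0.0."""
    if x == 0.0:
        return float("nan")
    return math.sin(x)


def shift_by_period(x: float) -> float:
    """Restate x using the periodicity of sine."""
    return x + 2 * math.pi


def in_sine_range(result: float) -> bool:
    """Acceptance test: NaN fails both comparisons, so it is rejected."""
    return -1.0 <= result <= 1.0
```

Calling `retry_block(buggy_sin, 0.0, in_sine_range, shift_by_period, 2)` rejects the first result and returns a value very close to 0 on the retry. Only one version of the algorithm is needed, which is what makes data diversity cheaper than design diversity.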

Conclusion
• Fault-tolerant systems cost more than simplex systems
• Fault-tolerant design considerations
  • Anticipated faults
    • In most cases, a simple acceptance test is all that is needed
  • Unanticipated faults
    • Designers must decide which solution is the most practical
• Most of the techniques in this report are hardware based, and many designers will not be able to use them
• This leaves designers with
  • Recovery blocks (software design diversity)
  • Retry blocks (data diversity)

Questions?
