Developing Dependable Systems

Developing Dependable Systems CIS 376 Bruce R. Maxim UM-Dearborn

Software Dependability • Customers expect all software to be dependable. • They may accept some system failures in non-critical applications • Applications having high dependability requirements require special programming techniques

Achieving Dependability • Fault avoidance • software developed to minimize impact of human error • development process is organized so that faults in the software are detected and repaired before customer delivery • Fault tolerance • software designed so that faults in delivered software do not cause system failure

Fault Minimization • Current SE methods can produce fault-free software • Fault-free software merely conforms to its specification (it may or may not always perform correctly since the specification may be flawed) • Producing fault-free software is very expensive and may only be justified in exceptional situations • It may be cheaper to accept some software faults

Developing Fault-Free Software • Needs a precise (preferably formal) specification • Requires an organizational commitment to quality • Information hiding and encapsulation in software design are essential • A programming language with strict type checking and run-time checking should be used • Needs a dependable and repeatable development process

Error Prone Constructs - part 1 • Floating-point numbers • inherently imprecise, frequent comparison errors • Pointers • Dangling references and aliases possible • Dynamic Memory Allocation • memory overflow and garbage problems • Parallelism • race conditions and deadlocks are possible
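The floating-point comparison problem noted above can be illustrated with a short sketch. The helper name nearlyEqual and its default tolerances are assumptions chosen for illustration, not part of the original slides; suitable tolerances depend on the application.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: treats a and b as equal when they differ by no more
// than a relative tolerance (scaled by the larger magnitude) or by a small
// absolute tolerance (which covers values close to zero).
bool nearlyEqual(double a, double b,
                 double relTol = 1e-9, double absTol = 1e-12) {
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(relTol * scale, absTol);
}
```

A test such as nearlyEqual(0.1 + 0.2, 0.3) is safer than comparing the two values with ==, because the sum is only an approximation of 0.3 in binary floating point.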

Error Prone Constructs - part 2 • Recursion • memory overflow when errors occur • Interrupts • errors are difficult to trace • Inheritance • code is no longer localized, unexpected results can arise when changes are made Note: You can use these constructs as needed, but you must be careful to use them correctly.

Information Hiding • Information should only be available to program components on a need to know basis • reduces the probability of accidental corruption of information • information is encapsulated to prevent error propagation to rest of program • since information is localized, programmer is less likely to make errors and reviewers are more likely to find errors

Reliable Software Processes • Having a well-defined, repeatable software process will reduce the number of software faults • A well-defined repeatable process is one that does not depend entirely on individual skills, but can be carried out by a team • Significant verification and validation process activities must be included to minimize the number of software faults.

Process Validation Activities • Requirements inspections • Requirements management • Model checking • Design inspections • Code inspections • Static code analysis • Test planning and management • Configuration management

Fault Tolerance • Required in critical applications (high reliability needed and high failure costs) • System can continue operation, despite software failure • A system which seems to be fault-free must also be fault tolerant (in case specification errors exist or the validation is incorrect)

Fault Tolerant Actions • Fault detection • system determines an incorrect system state has occurred • Damage assessment • determine system parts affected by fault • Fault recovery • system must restore its state to a known safe state • Fault repair • for a non-transitory fault, system is modified to prevent repetition

Approaches • Defensive Programming • programmers assume faults exist in system code • redundant code is written to check system state for consistency after modifications are made • Fault Tolerant Architectures • HW and SW architectures that support redundancy are used • a fault tolerance controller that detects problems and supports recovery • Both approaches are important

Exception Management • Could be program error or an event like power failure • Exception handling facilities in programming languages allow exceptions to be handled without constant checking to detect them • Using normal control constructs to detect exceptions in a sequence of procedural calls adds considerable timing overhead to a program
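As a rough sketch of the point above, the low-level routine below signals a fault by throwing, the intermediate routine contains no checking code at all, and a single handler at the top deals with the problem. All function names and the fallback-value policy are assumptions made for illustration.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical low-level routine: reports a fault by throwing.
double average(const std::vector<double>& samples) {
    if (samples.empty()) {
        throw std::invalid_argument("no samples to average");
    }
    double sum = 0.0;
    for (double s : samples) sum += s;
    return sum / static_cast<double>(samples.size());
}

// Intermediate routine: no explicit error checks; an exception thrown by
// average() simply propagates through it.
double scaledAverage(const std::vector<double>& samples, double factor) {
    return factor * average(samples);
}

// Top-level routine: one handler covers the whole call sequence.
double safeScaledAverage(const std::vector<double>& samples,
                         double factor, double fallback) {
    try {
        return scaledAverage(samples, factor);
    } catch (const std::invalid_argument&) {
        return fallback;  // assumed recovery policy: use a caller-supplied value
    }
}
```

Without exceptions, every routine in the chain would have to test a status code after each call and pass it upward.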

Fault Detection • Languages with strict type checking allow many errors to be trapped during program compilation • Some types of errors can only be caught at run-time (e.g. cin >> I; cin >> A[I];)

Fault Detection Approaches • Preventative Fault Detection • fault detection mechanism is activated before a state change is committed • if an erroneous state is detected, the change is cancelled • Retrospective Fault Detection • fault detection mechanism is initiated after system state change has been made • used when correct sequence of actions can lead to erroneous system state or preventative fault detection has too much overhead
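A minimal sketch of retrospective fault detection follows, assuming a hypothetical Ledger class whose invariant is that a transfer never changes the total and never leaves a negative balance. The state change is made first, the invariant is checked afterwards, and the saved copy is restored if the check fails.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

class Ledger {  // hypothetical example class
public:
    explicit Ledger(std::vector<long> balances)
        : balances_(std::move(balances)) {}

    // Returns true if the transfer was kept, false if it was undone.
    bool transfer(std::size_t from, std::size_t to, long amount) {
        if (from >= balances_.size() || to >= balances_.size()) return false;
        const std::vector<long> before = balances_;  // copy kept for recovery
        const long totalBefore = total();

        balances_[from] -= amount;                   // state change is made first
        balances_[to] += amount;

        // Retrospective check: inspect the state after the change.
        if (total() != totalBefore || balances_[from] < 0 || balances_[to] < 0) {
            balances_ = before;                      // undo the erroneous change
            return false;
        }
        return true;
    }

    long balance(std::size_t i) const { return balances_.at(i); }

private:
    long total() const {
        return std::accumulate(balances_.begin(), balances_.end(), 0L);
    }
    std::vector<long> balances_;
};
```

Copying the whole state before each change is affordable here only because the example is small; a real system would likely save less.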

Type System Extension • Preventative fault detection really involves extending the current type system by including additional constraints as part of the type definition • These constraints are typically implemented by defining basic operations within a class definition
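One way to picture this is a small class that only ever holds positive even integers. The class name PositiveEven and the use of std::domain_error are assumptions for illustration; the point is that the constraint is checked inside the basic operations before the new value is committed.

```cpp
#include <stdexcept>

class PositiveEven {  // hypothetical constrained integer type
public:
    explicit PositiveEven(int v) : value_(checked(v)) {}

    // The check runs before the assignment, so a rejected value never
    // reaches value_ (preventative fault detection).
    void assign(int v) { value_ = checked(v); }

    int get() const { return value_; }

private:
    static int checked(int v) {
        if (v <= 0 || v % 2 != 0) {
            throw std::domain_error("value must be positive and even");
        }
        return v;
    }
    int value_;
};
```

After PositiveEven p(4), a call to p.assign(7) throws and p still holds 4.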

Damage Assessment • System is analyzed to judge the extent of corruption caused by a system failure • Must determine what parts of the state space have been affected by the failure • Generally based on “validity functions” which can be applied to the state elements to assess if their value is within an allowed range
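The idea of validity functions can be sketched as follows. The PumpState structure, its fields, and the allowed ranges are all invented for illustration; the function returns the names of the state elements whose values fall outside their allowed range.

```cpp
#include <string>
#include <vector>

struct PumpState {       // hypothetical system state
    double doseMl;       // assumed valid range: 0 to 5
    double reservoirMl;  // assumed valid range: 0 to 100
    int ratePerHour;     // assumed valid range: 0 to 12
};

// Applies one range check per state element and lists the elements that fail.
// The negated form also rejects NaN for the floating-point fields.
std::vector<std::string> assessDamage(const PumpState& s) {
    std::vector<std::string> damaged;
    if (!(s.doseMl >= 0.0 && s.doseMl <= 5.0)) damaged.push_back("doseMl");
    if (!(s.reservoirMl >= 0.0 && s.reservoirMl <= 100.0)) damaged.push_back("reservoirMl");
    if (!(s.ratePerHour >= 0 && s.ratePerHour <= 12)) damaged.push_back("ratePerHour");
    return damaged;
}
```

An empty result means that no damage was detected by these checks, not that the state is necessarily correct.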

Damage Assessment Techniques • Checksums are used to check for data transmission errors • Redundant pointers can be used to check integrity of data structures • Watchdog timers can help check for non-terminating processes (e.g. if there is no response for a long time, assume the worst)
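A checksum check might look like the sketch below. It uses a simple 8-bit additive checksum chosen for brevity; the Frame structure and function names are assumptions, and real protocols usually rely on stronger codes such as CRCs.

```cpp
#include <cstdint>
#include <string>

// Sum of all payload bytes, modulo 256.
std::uint8_t checksum(const std::string& payload) {
    std::uint8_t sum = 0;
    for (unsigned char c : payload) {
        sum = static_cast<std::uint8_t>(sum + c);
    }
    return sum;
}

struct Frame {  // hypothetical transmitted unit: data plus its checksum
    std::string payload;
    std::uint8_t check;
};

// Sender side: attach the checksum to the data.
Frame makeFrame(const std::string& payload) {
    return Frame{payload, checksum(payload)};
}

// Receiver side: recompute and compare.
bool isIntact(const Frame& f) {
    return checksum(f.payload) == f.check;
}
```

An additive checksum detects any single corrupted byte but misses some multi-byte errors, for example two bytes that were swapped.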

Fault Recovery • Forward Recovery • apply repairs to corrupted system state • usually application specific, requires domain knowledge • e.g. error coding like checksum added to data • Backward Recovery • restore system to known safe state • simpler, since archived safe state is used to replace erroneous state • e.g. use of checkpoints in a word processor
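Backward recovery with a checkpoint can be sketched in a few lines. The Document class is a hypothetical stand-in for an editor buffer; checkpoint() archives the current state and rollback() replaces the current state with the archived one.

```cpp
#include <string>

class Document {  // hypothetical editor buffer
public:
    void append(const std::string& text) { text_ += text; }

    // Archive the current state as the known safe state.
    void checkpoint() { saved_ = text_; }

    // Backward recovery: discard the current state, restore the archive.
    void rollback() { text_ = saved_; }

    const std::string& text() const { return text_; }

private:
    std::string text_;
    std::string saved_;  // last checkpointed state (empty until first checkpoint)
};
```

Any work done after the last checkpoint is lost on rollback, which is the price paid for the simplicity of this approach.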

Fault Tolerant Architecture • Defensive programming cannot cope with faults caused by HW and SW interactions • If requirements are not understood then SW checks are not likely to be correct • Systems with high availability requirements often require fault tolerant architectures • Must tolerate both HW and SW failure

Hardware Fault Tolerance • Triple-modular redundancy (TMR) • Three replicated components are included in the system • If one component produces different output than the other two, failure is assumed • This idea is based on the notion that most failures result from component failures, not design faults • Component failures should be a low probability event
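The comparison step in TMR is often described as a majority voter. A minimal software sketch is given below; the function name vote and the bool-plus-output-parameter interface are assumptions for illustration.

```cpp
// Majority voter over three replicated outputs. If at least two outputs
// agree, the agreed value is stored in out and true is returned. If all
// three differ there is no majority, out is left unchanged, and false is
// returned.
template <typename T>
bool vote(const T& a, const T& b, const T& c, T& out) {
    if (a == b || a == c) {
        out = a;
        return true;
    }
    if (b == c) {
        out = b;
        return true;
    }
    return false;
}
```

With outputs 5, 5 and 9, the voter selects 5 and the third component is presumed to have failed.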

Software Fault Tolerance • TMR is based on two assumptions • HW components do not include common design flaws • simultaneous component failures are not likely • Neither assumption is valid for software components • it is not possible to replicate SW components without replicating their design flaws • simultaneous component failure is inevitable • Software systems must be diverse

Design Diversity • Different versions of the system are designed and implemented in different ways (so they should have different failure modes) • Different approaches to design • object-oriented and function oriented • different implementation languages • different algorithms in the implementation • different tools or environments

Software Analogies to TMR • N-version Programming • same specification is implemented in a number of different versions by several teams • all versions compute simultaneously, the majority output is presumed correct • Recovery blocks • a number of explicitly distinct versions of a program are written for the same specification and executed in sequence • an acceptance test is used to select the output to keep
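The recovery block scheme can be sketched as a loop over alternates guarded by an acceptance test. Everything named here (recoveryBlock, the two square-root alternates, the tolerance in sqrtAcceptable) is a hypothetical illustration; the first alternate is deliberately faulty so that the second one is needed. Inputs are assumed to be non-negative.

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Runs the alternates in order and keeps the first result that passes the
// acceptance test. Returns false if every alternate is rejected.
bool recoveryBlock(double input,
                   const std::vector<std::function<double(double)>>& alternates,
                   const std::function<bool(double, double)>& acceptable,
                   double& result) {
    for (const auto& alternate : alternates) {
        const double candidate = alternate(input);
        if (acceptable(input, candidate)) {
            result = candidate;
            return true;
        }
    }
    return false;
}

// Primary version: deliberately wrong for most inputs.
double faultySqrt(double x) { return x / 2.0; }

// Alternate version: Newton's iteration with a fixed number of steps.
double newtonSqrt(double x) {
    double guess = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 60; ++i) {
        guess = 0.5 * (guess + x / guess);
    }
    return guess;
}

// Acceptance test: the square of the result must be close to the input.
bool sqrtAcceptable(double x, double r) {
    return r >= 0.0 && std::fabs(r * r - x) <= 1e-9 * (1.0 + x);
}
```

The alternates here are pure functions, so there is no state to restore between attempts; a fuller recovery block would also roll the system state back before trying the next version.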

Problems with Design Diversity • Teams tend to tackle the same problems in the same ways, so the resulting implementations may not be diverse • Characteristic errors • different teams are likely to make the same mistakes, since some parts of the implementation are more difficult than others • specification errors may cause the same errors to appear in all implementations (argument for developing multiple specifications)

Is software redundancy needed? • Unlike HW, SW faults are not an inevitable consequence of the real world • Some people believe that a higher level of reliability can be achieved by reducing software complexity instead • The existence of fault-tolerance controllers increases program complexity considerably and adds sources of errors that affect reliability
