Design of Reliable Systems and Networks ECE 442/CS 436 Software Fault Tolerance N-Version Programming and Recovery Blocks. Prof. Ravi K. Iyer Department of Electrical and Computer Engineering and Coordinated Science Laboratory University of Illinois at Urbana-Champaign iyer@crhc.uiuc.edu.
Design of Reliable Systems and Networks, ECE 442/CS 436: Software Fault Tolerance, N-Version Programming and Recovery Blocks. Prof. Ravi K. Iyer, Department of Electrical and Computer Engineering and Coordinated Science Laboratory, University of Illinois at Urbana-Champaign. iyer@crhc.uiuc.edu http://www.crhc.uiuc.edu/DEPEND
Outline • Recovery blocks • N-version programming • Practice of high-availability system design • IBM mainframe example
Motivation for Software Fault Tolerance • Software reliability can be increased through fault avoidance, using software engineering and testing methodologies • For large and complex systems, fault avoidance alone is not successful • Redundancy in software may be needed to detect, isolate, and recover from software failures • Both static redundancy and dynamic redundancy models exist • Hardware fault tolerance is easier to assess • Software is difficult to prove correct
HARDWARE FAULTS | SOFTWARE FAULTS
1. Faults are time-dependent | Faults are time-invariant
2. Duplicate hardware detects faults | Duplicate software is not effective
3. Random failure is the main cause | Complexity is the main cause
Consequences of Software Failure • General Accounting Office reports $4.2 million lost annually due to software errors • Launch failure of Mariner I (1962) • Destruction of French satellite (1988) • Problems with Space Shuttle and Apollo missions • Star Wars (SDI): billions of dollars in funding for “correct” software development • AT&T blockages (error in recovery-recognition software) (1990) • SS7 (signaling system) protocol implementation: untested patch with a mistyped character (1997) • Therac-25 (overdoses of medical radiation, thousands of rads in excess of the prescribed dosage)
Experiences with Current Software • Many computer crashes are due to software • Although recent data shows HW transients and design errors on the increase • Even though one expects software to be correct, it never is • Mature software exhibits fairly constant failure frequency • Number of failures is correlated with • Execution time • Code density • Software timing, synchronization points
Experiences with Current Software (cont.) [Figure: cumulative test failures (t), 0 to 300, plotted against the number of tests t, 0 to 5000, with the fail rate curve.]
Key parameters and variables (with defect reintroduction)
Parameter | Symbol | Value | Unit
Defect Detection Time Constant | s | 17.2 | Weeks
Defect Repair Time Constant | t | 4.7 | Weeks
Code Delivery | | 589810 | Lines
Initial Error Density | | 0.00387 | Defects per Line
Defect Reintroduction Rate | | 33 | Percent
Deployment Time | T | 100 | Week
Estimated Remaining Defects | ERDT | 664 | Defects
Estimated Current Defects | ECDT | 445 | Defects
Testing Process Quality | TPQT | 90 | Percent
Testing Process Efficiency | TPET | 60 | Percent
Difficulties • Improvements in software development methodologies reduce the incidence of faults, yielding fault avoidance • Need for test and verification • Formal verification techniques, such as proof of correctness, can be applied only to rather small programs • Potential exists for faulty translation of user requirements • Conventional testing is hit-or-miss. “Program testing can show the presence of bugs but never show their absence,” - Dijkstra, 1972. • There is a lack of good fault models.
Approaches to Software Fault Tolerance • ROBUSTNESS: The extent to which software continues to operate despite the introduction of invalid inputs. Examples: 1. Check input data => ask for new input => use default value and raise flag 2. Self-checking software • FAULT CONTAINMENT: Faults in one module should not affect other modules. Examples: reasonableness checks, watchdog timers, overflow/divide-by-zero detection, assertion checking • FAULT TOLERANCE: Provides uninterrupted operation in the presence of program faults through multiple implementations of a given function
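The robustness example on this slide (check input, ask for new input, then use a default value and raise a flag) can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name `read_checked`, its parameters, and the choice of three attempts are all assumptions.

```python
from typing import Callable, Tuple


def read_checked(read_input: Callable[[], str], low: float, high: float,
                 default: float, max_attempts: int = 3) -> Tuple[float, bool]:
    """Return (value, defaulted). defaulted=True is the raised flag."""
    for _ in range(max_attempts):
        raw = read_input()
        try:
            value = float(raw)
        except ValueError:
            continue              # not a number: ask for new input
        if low <= value <= high:  # range check on the input data
            return value, False
        # out of range: fall through and ask for new input
    return default, True          # give up: use default value and raise flag


# Hypothetical input source that delivers two bad readings, then a good one.
readings = iter(["abc", "999", "42"])
print(read_checked(lambda: next(readings), 0, 100, default=50.0))  # → (42.0, False)
```

The caller decides what the raised flag means (log it, degrade service, alert an operator); the point of the sketch is only that invalid input never propagates unchecked.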
Recovery Blocks/N-Version Programming • Two relevant observations from 1830: • Anon. “The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computations by different methods.” • Charles Babbage “When the formula to be computed is very complicated, it may be algebraically arranged for computation in two or more totally distinct ways, and two or more sets of cards may be made. If the same constants are now employed with each set, and if under these circumstances the results agree, we may be quite secure of the accuracy of them all.”
N-Version Programming Basic Model: Software Unit Enhancements for Fault-Tolerant Execution. [Figure: the N-version software (NVS) model with n = 3. Inside the execution environment (EE), the j-th N-version software unit runs Version 1, Version 2, and Version 3 in Lanes A, B, and C. The decision algorithm, supported by the EE's executive support functions, delivers the consensus results.]
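A minimal sketch of the NVS model with n = 3 may help here. It assumes a strict-majority vote as the decision algorithm and runs the lanes one after another rather than in parallel; `run_n_versions`, `NoConsensus`, and the three toy versions are hypothetical names introduced only for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


class NoConsensus(Exception):
    """Raised when no result is shared by a strict majority of the versions."""


def run_n_versions(versions: Sequence[Callable], x) -> Hashable:
    """Run every version on the same input and return the majority result."""
    results = []
    for version in versions:  # lanes run one after another in this sketch
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashed version simply casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count > len(versions) // 2:  # strict majority of all N versions
            return value
    raise NoConsensus(f"no majority among {len(versions)} versions")


# Three hypothetical versions of absolute value; version_3 is faulty for x < 0.
version_1 = abs
version_2 = lambda x: x if x >= 0 else -x
version_3 = lambda x: x

print(run_n_versions([version_1, version_2, version_3], -3))  # → 3
```

The faulty version is masked immediately, with no rollback. Note that the vote only masks faults that produce differing wrong answers; two versions sharing the same fault would win the vote.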
Recovery Blocks Basic Model: Software Unit Enhancements for Fault-Tolerant Execution. [Figure: the Recovery Block (RB) model. Inside the execution environment (EE), the j-th recovery block software unit runs Alternate 1 with the recovery cache holding the prior state. The acceptance test either passes (yes: accepted results) or fails (no: take next alternate, e.g., Alternate 2), with the EE's execution support functions controlling the selection.]
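The recovery block model can be sketched as below, under simplifying assumptions: the state is a plain dictionary, a full deep copy stands in for the recovery cache, and the alternates and acceptance test are toy functions. All names (`recovery_block`, `faulty_sort`, and so on) are hypothetical.

```python
import copy
from typing import Callable, Sequence


class RecoveryBlockFailure(Exception):
    """Raised when every alternate has been tried and none was accepted."""


def recovery_block(state: dict,
                   alternates: Sequence[Callable[[dict], None]],
                   acceptance_test: Callable[[dict], bool]) -> dict:
    checkpoint = copy.deepcopy(state)  # full copy; stands in for the recovery cache
    for alternate in alternates:       # primary first, then the other alternates
        try:
            alternate(state)
            if acceptance_test(state):
                return state           # accepted results
        except Exception:
            pass                       # a crash counts as a failed acceptance test
        state.clear()                  # roll back before taking the next alternate
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockFailure("no alternate passed the acceptance test")


def faulty_sort(state):      # hypothetical primary: drops the last element
    state["data"] = sorted(state["data"][:-1])


def simple_sort(state):      # hypothetical alternate
    state["data"] = sorted(state["data"])


def is_sorted_same_length(state, expected_len=3):
    data = state["data"]
    return len(data) == expected_len and all(a <= b for a, b in zip(data, data[1:]))


state = {"data": [3, 1, 2]}
recovery_block(state, [faulty_sort, simple_sort], is_sorted_same_length)
print(state["data"])  # → [1, 2, 3]
```

The acceptance test here checks plausibility (ordering and length), not absolute correctness, which mirrors the cost/complexity trade-off of acceptance tests.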
Execution Models for Software Fault-Tolerance Approaches. [Figure: flowcharts of the two execution models. Recovery blocks: start software execution; primary alternate execution; end primary alternate execution; acceptance test execution. If the acceptance test is passed, an acceptable result is provided and software execution ends. If the acceptance test is not passed, alternate selection picks the next alternate (2, ..., N), executes it, and repeats the acceptance test; when no alternate is available any more, the software has failed. N-version programming: start software execution; versions #1, #2, ..., #N execute; the versions' results are gathered; the decision algorithm executes. Either an acceptable result is provided and software execution ends, or no acceptable result is provided and the software has failed.]
Execution Models for Software Fault-Tolerance Approaches (cont.) [Figure: flowchart of N self-checking programming. Start software execution; self-checking components #1, #2, ..., #N execute; each component either provides an acceptable result or provides no acceptable result. If a result is selected, software execution ends; if no result is selected, the software has failed.]
Software Fault-Tolerance Approaches and Their Equivalent Hardware Counterparts • RB is equivalent to stand-by sparing (passive dynamic redundancy) in HW fault-tolerant architectures • NVP is equivalent to N-modular redundancy (static redundancy) in HW fault-tolerant architectures • NSCP is equivalent to active dynamic redundancy • A self-checking component results from either: • The association of an acceptance test with a version • The association of two variants with a comparison algorithm • Fault tolerance is provided by the parallel execution of N ≥ 2 self-checking components
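The two ways of building a self-checking component listed above can be sketched as small wrappers. This is an illustrative assumption rather than the lecture's design: components are run in sequence here (the slide describes parallel execution), and `with_acceptance_test`, `with_comparison`, and `n_self_checking` are hypothetical names.

```python
from typing import Callable, Sequence, Tuple

Component = Callable[[object], Tuple[bool, object]]


def with_acceptance_test(version, test) -> Component:
    """Self-checking component: one version plus an acceptance test."""
    def component(x):
        y = version(x)
        return (True, y) if test(x, y) else (False, None)
    return component


def with_comparison(variant_a, variant_b) -> Component:
    """Self-checking component: two variants plus a comparison algorithm."""
    def component(x):
        a, b = variant_a(x), variant_b(x)
        return (True, a) if a == b else (False, None)
    return component


def n_self_checking(components: Sequence[Component], x):
    """Select the result of the first component that judges itself acceptable."""
    for component in components:
        try:
            ok, y = component(x)
        except Exception:
            continue  # a crashed component delivers no acceptable result
        if ok:
            return y
    raise RuntimeError("no self-checking component delivered an acceptable result")


square_ok = lambda x: x * x
square_bad = lambda x: x + x  # agrees with square_ok only for x in {0, 2}
components = [
    with_comparison(square_ok, square_bad),
    with_acceptance_test(square_ok, lambda x, y: y >= 0),
]
print(n_self_checking(components, 3))  # → 9
```

Unlike the NVP voter, no cross-component vote is needed: each component declares on its own whether its result is acceptable.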
Concepts of N-Version Programming • N ≥ 2 versions of functionally equivalent programs • “Independent” generation of programs, carried out by N groups of individuals who do not talk to each other about the programming process (different algorithms, different programming languages, translation) • Initial specification is written in a formal specification language that • states the functional requirements unambiguously • leaves the widest possible choice of implementation • By making the development process diverse, it is hoped that the versions will contain diverse faults • The inventors of NVP emphasized that: • “the definition of NVP has never postulated an assumption of independence and that NVP is a rigorous process of software development”
An Assumption of Independence in N-Version Programming? • Do the N versions of a program fail independently (similar to hardware)? Are faults unrelated? Does Prob(failure of N-version system) = [Prob(failure of one version)]^N? • If so, then the system reliability can be very high • Why might such an assumption be false? • People make the same mistakes, e.g., incorrect treatment of boundary conditions • Some parts of a problem are more difficult than others • Statistics show similarity in programmers’ views of “difficult” regions
Observation from Experiments • Assumption of independence of failures of versions DOES NOT hold • This does not mean N-version programming is not useful • The reliability of the system will not be as high as in the case when the faults in different versions are independent • Example: PODS (Project on Diverse Software) • All faults were caused by omissions and ambiguities in the requirement specifications • Two common faults were found in two versions • Three different versions of software with failure rates 1.5 × 10^-6, 0.8 × 10^-3, and 0.8 × 10^-3 resulted in a failure rate of 0.8 × 10^-3 after majority voting • The common/coincident faults could not be excluded by majority voting
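To see how far the PODS result is from the independent case, the three quoted failure rates can be plugged into a 2-out-of-3 calculation. The sketch assumes that the voted system fails whenever at least two versions fail on the same input and that the rates can be read as per-input failure probabilities; the function name is hypothetical.

```python
def two_out_of_three_failure(p1: float, p2: float, p3: float) -> float:
    """P(at least two of three versions fail), assuming independent failures."""
    return (p1 * p2 * (1 - p3)      # versions 1 and 2 fail
            + p1 * (1 - p2) * p3    # versions 1 and 3 fail
            + (1 - p1) * p2 * p3    # versions 2 and 3 fail
            + p1 * p2 * p3)         # all three fail


independent = two_out_of_three_failure(1.5e-6, 0.8e-3, 0.8e-3)
observed = 0.8e-3  # failure rate after majority voting, as quoted on the slide

print(f"independent model: {independent:.3e}")
print(f"observed / independent: {observed / independent:.0f}")
```

Under independence the voted system would fail at roughly 6.4 × 10^-7, about three orders of magnitude better than the observed 0.8 × 10^-3. The observed rate equals that of the two weaker versions, which is what coincident faults in those two versions would produce.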
Limitations of N-Version Programming • All N versions originate from the same initial specification, whose correctness, completeness, and unambiguity must be assumed • Use formal correctness proofs on specs, rather than proofs on implementations • Exhaustive validation • Based on the assumption that software faults are distinguishable: • faults that cause disagreement between versions at the specified voting points might result from independent programming efforts to remove identical software defects
Concepts of Recovery Blocks • Characteristics: • Incorporates a general solution to the problem of switching to a spare • Explicitly structures a software system so that the extra software for spares and error detection does not reduce system reliability • First considered for a single sequential process; later extended to • Multiple processes within one system • Multiple processes in multiple systems => distributed recovery blocks • The progress of a program can be viewed as a sequence of basic operations: assignments to stored variables • A structured program has BLOCKS of code to simplify understanding of the functional description • Choose blocks as the units for error detection and recovery.
Alternates • Primary alternate is the one that is to be used normally • Other alternates attempt less desirable options • One source of alternates is earlier release of primary alternates • Gracefully degraded alternates
Acceptance Tests • Function: ensure the operation of recovery blocks is satisfactory • Should access variables in the program, NOT local to the recovery block, since these cannot have effect after exit. Also, different alternates use different local variables. • Need not check for absolute “correctness” - cost/complexity trade-off • Run-time overheads should be LOW • NO RESIDUAL EFFECTS should be present, since variables, if updated, might result in passing of successive alternates
Restoration of System State • Restoring system state is automatic • Taking a copy of entire system state on entry to each recovery block is too costly • Use Recovery Caches or “Recursive” Caches • When a process is to be backed up, it is to a state just before entry to primary alternate • Only NONLOCAL variables that have been MODIFIED have to be reset
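The idea that only modified nonlocal variables need to be reset can be sketched as a write-through cache that records a variable's prior value the first time it is written inside the block. This is a simplified illustration: the class and method names are hypothetical, and it saves references only, so in-place mutation of a mutable value would not be captured.

```python
_MISSING = object()  # marks a variable that did not exist on block entry


class RecoveryCache:
    """Records the entry value of each nonlocal variable on its first write."""

    def __init__(self, variables: dict):
        self._vars = variables
        self._saved = {}  # name -> value just before entry to the primary alternate

    def write(self, name, value):
        if name not in self._saved:            # first modification in this block
            self._saved[name] = self._vars.get(name, _MISSING)
        self._vars[name] = value

    def saved_names(self):
        return sorted(self._saved)

    def restore(self):
        """Back the process up to the state at entry to the recovery block."""
        for name, old in self._saved.items():
            if old is _MISSING:
                self._vars.pop(name, None)
            else:
                self._vars[name] = old
        self._saved.clear()

    def commit(self):
        """Acceptance test passed: discard the saved values."""
        self._saved.clear()


nonlocal_vars = {"a": 1, "b": 2}
cache = RecoveryCache(nonlocal_vars)
cache.write("a", 10)
cache.write("a", 11)        # second write: entry value 1 is already saved
cache.write("c", 3)
print(cache.saved_names())  # → ['a', 'c']  ("b" was never modified, so never saved)
cache.restore()
print(nonlocal_vars)        # → {'a': 1, 'b': 2}
```

The storage cost is proportional to the number of distinct variables written in the block, not to the size of the whole system state.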
Process Conversations • A systematic methodology for extending recovery blocks across processes by taking process interactions into consideration (considers time/space)
Process Conversations (cont.) • A recovery block spanning two or more processes is called a conversation • Within a conversation, processes communicate among themselves, NOT with others • Operations of a conversation • Within a conversation, communication is only among participants, not external • On entry, a process establishes a checkpoint • If an error is detected by any process, then all processes restore their checkpoints • Next, ALL processes execute their next available alternate • All processes leave the conversation together (they perform their acceptance tests just prior to leaving) • At the end of the conversation, ALL processes must satisfy their respective acceptance tests, and none may proceed otherwise
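The conversation rules above can be sketched with a sequential simulation in which the "processes" are plain functions sharing a mailbox. Everything here is an illustrative assumption: real conversations involve concurrent processes, and `run_conversation` and the P1/P2 alternates are hypothetical.

```python
import copy


def run_conversation(states, alternates, acceptance_tests):
    """Run one conversation; return the index of the alternate round that passed.

    states: {pid: dict}, alternates: {pid: [f(state, mailbox)]},
    acceptance_tests: {pid: f(state) -> bool}.
    """
    # On entry, every process establishes a checkpoint.
    checkpoints = {pid: copy.deepcopy(s) for pid, s in states.items()}
    rounds = min(len(alts) for alts in alternates.values())
    for attempt in range(rounds):
        mailbox = {}  # communication is visible to participants only
        try:
            for pid in states:
                alternates[pid][attempt](states[pid], mailbox)
            # All processes must pass their tests before anyone may leave.
            if all(acceptance_tests[pid](states[pid]) for pid in states):
                return attempt
        except Exception:
            pass  # an error in any process triggers recovery for all
        for pid in states:  # ALL processes restore their checkpoints
            states[pid].clear()
            states[pid].update(copy.deepcopy(checkpoints[pid]))
    raise RuntimeError("conversation failed: no alternates left")


def p1_primary(state, mailbox):   # hypothetical faulty sender
    state["sent"] = mailbox["x"] = -1

def p1_alternate(state, mailbox):
    state["sent"] = mailbox["x"] = 5

def p2_double(state, mailbox):    # receiver; same code in both rounds
    state["got"] = mailbox["x"] * 2

states = {"P1": {}, "P2": {}}
passed = run_conversation(
    states,
    {"P1": [p1_primary, p1_alternate], "P2": [p2_double, p2_double]},
    {"P1": lambda s: "sent" in s, "P2": lambda s: s["got"] > 0},
)
print(passed, states)  # → 1 {'P1': {'sent': 5}, 'P2': {'got': 10}}
```

P1 passes its own acceptance test in the first round, yet it is rolled back anyway because P2 fails. That is the defining property of a conversation: no participant proceeds unless all do.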
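Nested Conversations [Figure: timelines of processes P1, P2, and P3 showing nested conversations, with legend entries for checkpoint, inter-process communication, acceptance test, and conversation boundary.]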
Comparison of Recovery Blocks vs. N-Version Programming • Advantages of Recovery Block • Most software systems evolve by replacement of some modules by new ones - can be used as alternates • Nice hierarchical design - structured approach • Disadvantages of Recovery Block • System state must be saved before entry to recovery block -- excessive storage • Difficult to handle multiple processes -- might have domino effect • Difficult to undo effects in real-time systems • Effectiveness of acceptance test • Higher coverage is more complex • Lack of formal method to check
Comparison of Recovery Blocks vs. N-Version Programming (cont.) • Advantages of N-Version Programming • Immediate masking of software faults -- no delay in operation • Self-checking (acceptance tests) not required • Conventional fault-tolerant systems (HW and SW) already have redundant hardware, e.g., TMR, so it is easier to include N-version software on redundant hardware • Disadvantages of N-Version Programming • How to get N versions? • Design diversity must be imposed, since randomness does not give uncorrelated software faults • Extremely dependent on input specifications (formal correctness proofs…)
IBM 30xx Simplified System Model [Figure: block diagram with expanded storage, central memory, system controller, processor controller, CPUs, power distribution and cooling, channel control, and channel adapters and servers.]
Fault Isolation Using Hardware Checkers • Error checker placement is determined by Fault Isolation Domains (FID) • Checkers define the boundary of fault containment • If the checkers detect all faults, fault isolation consists of stopping the processor and identifying the triggered checker, i.e., the relevant FID and possibly the Field Replaceable Unit (FRU)
Fault Isolation Using Hardware Checkers [Figure: memory array 1, memory array 2, registers A, B, and C, drivers, cable, decoder, and checkers 1 to 3, grouped into FRU 1 to FRU 5. Red = Fault Isolation Domain, Blue = Field Replaceable Unit.] • If checker 2 is triggered and register C is the input to register B, the implicated set of FRUs is 3, 4
Mapping of Fault Isolation Domains to Field Replaceable Units
Function | FID | FRU | Syndrome
Memory array 1 | 1 | 1 | C1
Register A | 1 | 1 | C1
Checker 1 | 1 | 1 | C1
Drivers | 2 | 1 | C2
Cable | 2 | 2 | C2
Memory array 2 | 2 | 3 | C2
Register B | 2 | 3 | C2
Checker 2 | 2 | 3 | C2
Register C | 2 | 4 | C2
Decoder | 3 | 5 | C3
Checker 3 | 3 | 5 | C3
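The mapping above works as a lookup from a checker syndrome to the set of candidate FRUs. A minimal sketch, with the table transcribed from this slide and a hypothetical function name:

```python
# (function, FID, FRU, syndrome), transcribed from the mapping on this slide.
FID_TABLE = [
    ("Memory array 1", 1, 1, "C1"),
    ("Register A",     1, 1, "C1"),
    ("Checker 1",      1, 1, "C1"),
    ("Drivers",        2, 1, "C2"),
    ("Cable",          2, 2, "C2"),
    ("Memory array 2", 2, 3, "C2"),
    ("Register B",     2, 3, "C2"),
    ("Checker 2",      2, 3, "C2"),
    ("Register C",     2, 4, "C2"),
    ("Decoder",        3, 5, "C3"),
    ("Checker 3",      3, 5, "C3"),
]


def implicated_frus(syndrome: str) -> list:
    """Return every FRU that contains a function covered by the given checker."""
    return sorted({fru for _, _, fru, syn in FID_TABLE if syn == syndrome})


print(implicated_frus("C1"))  # → [1]
print(implicated_frus("C2"))  # → [1, 2, 3, 4]
print(implicated_frus("C3"))  # → [5]
```

Fault isolation domain 2 spans four FRUs, so syndrome C2 alone cannot name a single unit to replace; additional knowledge about the data path is needed to narrow the set.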
IBM 30XX Data Path Overview [Figure: data paths among expanded storage and central storage (each behind a storage controller with hardware-assisted memory tester, protected by ECC), the system controller, the processor controller, the CPU (cache, instruction fetch/decode, instruction execution, vector execution, control storage), the channel control element (primary data stager, secondary data stagers, I/O processor), channel adapters, and channel servers. P = parity, ECC marks ECC-protected paths, LSSG = Logic Support Station Group.]
Hardware-Based Retry • Instruction execution errors are detected by parity checks on register contents and on data buses, and by pattern validity checks in control logic circuits. [Figure: retry flowchart. During instruction execution, operands are placed into retry buffers. If no error is detected, execution continues. If an error is detected, the instruction and execution elements freeze execution (stop on error), restore operands, and communicate back to the processor controller through the LSS. Test: is retry permitted? If yes, instructions and data are fetched from the retry buffers, the instruction is retried, and execution restarts. If there is no retry or the threshold is crossed, the OS is signaled for SW recovery.]
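The retry flow can be mimicked in software to make the control logic concrete. This is only an analogy under stated assumptions: exceptions stand in for checker signals, a tuple stands in for the retry buffers, and the threshold of 3 and all names are hypothetical.

```python
class ParityError(Exception):
    """Stands in for an error raised by a hardware checker."""


class SoftwareRecoveryNeeded(Exception):
    """Stands in for signaling the OS to start software recovery."""


def execute_with_retry(instruction, operands, retryable=True, threshold=3):
    retry_buffer = tuple(operands)  # operands saved before execution
    retries = 0
    while True:
        try:
            # Every attempt restarts from the saved operands.
            return instruction(*retry_buffer)
        except ParityError:
            if not retryable or retries >= threshold:
                raise SoftwareRecoveryNeeded("no retry or threshold crossed")
            retries += 1


def make_transient_add(failures):
    """Hypothetical adder that raises ParityError on its first `failures` calls."""
    remaining = [failures]

    def add(a, b):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ParityError("transient fault")
        return a + b

    return add


print(execute_with_retry(make_transient_add(2), (2, 3)))  # → 5
```

A transient fault disappears within the retry budget, while a fault that persists past the threshold is escalated, which is the point at which the next recovery level takes over.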
Checkers in The Central Processor • Byte parity on data path registers • Parity checks on input/output of adders • Parity on microstore • Parity on microstore addresses • Encoder/decoder checks • Single-bit error detection in cache for data received from memory • Additional illegal pattern checks
Levels of Error Recovery • System operation is interrupted by a machine check interruption, and recovery escalates through four levels: 1. Functional recovery: perform instruction retry; if successful, the system continues 2. System recovery: terminate the affected task and continue system operation; if successful, the system continues with the task terminated 3. System-supported restart: restart system operation, stop for repair not required; if successful, the system is reloaded 4. System repair: stop, repair, and restart; notify the operator, external repair • Each level is attempted only if the previous one is unsuccessful
Recovery Processing Overview: Handling Hardware and Software Errors [Figure: block diagram with the labels ABEND (Abnormal Termination), Control, Recovery Termination Manager, Program, Retry Routines, Recovery Routines, and Termination Routines.]
System Level Facilities for Error Detection and Recovery • Installation error detection capability • Tools to build “profiles” of system software modules and inspect correct usage of system resources. • Software facilities to detect the occurrences of selected events, e.g., appendages allow user control of I/O; SLIP (serviceability level indication processing) aids in error detection and diagnosis (e.g., access to traps that cause a program interruption). • User defines detection mechanisms to detect programmer-defined exceptions, e.g., incorrect address or attempting privileged instructions. • The operator detects evident error conditions, e.g., loop conditions, endless wait states • The data management and supervisor routines ensure valid data is processed and non-conflicting requests are made