Sequence of Quality Related Activities • Error/Fault Prevention • Fault detection and Removal • Fault Tolerance/Containment
Fault Tolerance & Failure Containment • After extensive Quality Engineering activities such as Inspections, Testing, Formal Correctness Proofs, etc., there may still be some defects remaining in the software system. • To keep software systems operational we need to consider two strategies: • Tolerating “local” faults • Containing the failure damage so that it does not spread • Aside: I recently heard the term “self-healing” software, which implies more than fault tolerance
General Concept • Both Fault Tolerance and Failure Containment are practiced in many other disciplines: • Mechanical systems use duplicate systems to back up and replace faulty equipment; the cost is the duplicate hardware that replaces the faulty hardware • Chemical systems use containment mechanisms to locally enclose/limit the damage from a leak or melt-down; the cost is the containment “wall” or “closure” employed to limit the damage.
Examples in Software Systems • A network fault (a virus on a single node) may be contained/tolerated: • Via network re-routing algorithms that cordon off the node and provide alternative, albeit slower, routes to the destination • Duplication of routes (multiple and possibly duplicative connections) • Containment of failure (slower, reduced performance, but traffic eventually gets there) • A device control mechanism fault may be tolerated: • Via a back-up mechanism that captures check-pointed information and restarts processing from the last check-point, albeit duplicating some processing • Duplication of processing (re-processing the data from the last check-point) • Containment of failure (everything processed prior to the check-point still stands)
2 Important Assumptions for Fault Tolerance • Rare Event: The failure is rare and its probability is very low; thus it cannot be anticipated in advance and needs fault tolerance and failure containment considerations. (e.g. elevator control software goes into a stop state whenever “anything” goes wrong.) • Failure Independence: Different components of the system fail independently of one another and failures can be localized; thus the failing component may be replaced and/or the failure may be contained. (e.g. different valve control programs in process control software: if one fails, we may want to shut down part or all of the plant until the specific failing control program is fixed. If these control programs are all linked or coupled, then we may have a bigger problem.) ** Note that the second assumption is also why we promote “loose coupling” in software design
Techniques Classification • Fault Tolerance techniques: • Duplication: • We may use multiple, parallel processing and pick a consensus solution (n-version programming) • Backup/Recovery: • We have software (algorithm and db) running with regular checkpoints that back up the information processed; when a problem is encountered, the software may recover by going back to the last check-pointed data and reprocessing from there • We have primary and secondary software (programs); when the primary fails, the secondary (less functional) program may be swapped in to keep processing going in a degraded state. (*** This sometimes happens when the main operating system is attacked by a virus and a skeleton operating system is rolled in.***)
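The checkpoint-based Backup/Recovery idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `process_with_checkpoints` and `make_flaky_step` are hypothetical, the checkpoint is kept in memory rather than on disk, and a fault is modeled as a `RuntimeError` raised by the processing step.

```python
def process_with_checkpoints(items, step, checkpoint_every=3, max_retries=5):
    """Process items in order; on a fault, roll back to the last checkpoint."""
    results = []           # output produced so far
    checkpoint = (0, [])   # (index of next item, copy of results at that point)
    retries = 0
    i = 0
    while i < len(items):
        try:
            results.append(step(items[i]))
            i += 1
            if i % checkpoint_every == 0:
                checkpoint = (i, list(results))  # save a consistent snapshot
        except RuntimeError:
            retries += 1
            if retries > max_retries:
                raise                            # give up: fault is not transient
            i, saved = checkpoint                # roll back to the snapshot
            results = list(saved)
    return results


def make_flaky_step(fail_on, times=1):
    """Hypothetical step that doubles its input but fails `times` times on one item."""
    remaining = {"n": times}
    calls = []

    def step(x):
        calls.append(x)
        if x == fail_on and remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("transient fault")
        return x * 2

    return step, calls


step, calls = make_flaky_step(fail_on=5)
print(process_with_checkpoints([1, 2, 3, 4, 5, 6], step))  # [2, 4, 6, 8, 10, 12]
print(calls)  # [1, 2, 3, 4, 5, 4, 5, 6]
```

The call log shows the cost named on the slide: item 4 is processed twice because it came after the last checkpoint, while items 1 to 3 are not repeated.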
Techniques Classification • Failure Containment techniques: • Failure analysis for containment: • We focus on analyzing the potential preconditions of failure/damage and set up different mechanisms for accident reduction, containment, and control once a failure does occur • Damage control: • We focus more on how to limit the damage and severity of an accident once it has occurred; since damage and severity are domain specific, the containment mechanism is also domain specific (e.g. chemical hazards, mechanical safety, etc.)
Fault Tolerance Based on “Multiple Computation” • The notion of fault tolerance via multiple computation (duplication) is used by both hardware and software systems. The general notion of multiple computation includes backup/recovery and touches upon 3 domains: • Time • Hardware • Software • The general notation (from Avizienis) is nT/nH/nS: • We may repeat the processing n Times • We may duplicate processing on n pieces of Hardware • We may use n versions of Software
More on nT/nH/nS • Consider nT/1H/1S: • This is the situation where processing is performed several times on one piece of hardware, using the same version of software. • Backing up the information and recovering by reprocessing from the last checkpoint is an example of multiple processing over time. • Consider 1T/nH/1S: • This is the situation where multiple, duplicative hardware is used with the same version of the software, and it comes in two “flavors”: • replication, where two or more pieces of hardware (not necessarily of the same kind) run the same software in parallel and some algorithm is used to pick the “correct” output, such as the “majority vote” in Triple-Modular-Redundancy • redundancy, where multiple identical instances of the same system are provided but only one is running, and processing switches to another instance when the running one fails.
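The “redundancy” flavor of 1T/nH/1S (one active instance, switch-over on failure) might look like the sketch below. The `Replica` class and `run_with_failover` function are hypothetical stand-ins: each replica represents one hardware instance running the same software, and a hardware failure is modeled as a `RuntimeError`.

```python
class Replica:
    """Hypothetical stand-in for one hardware instance running the same software."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, x):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return x + 1  # the (identical) computation every replica performs


def run_with_failover(replicas, x):
    """Use the first replica that works; later ones act as standbys."""
    for replica in replicas:
        try:
            return replica.name, replica.run(x)
        except RuntimeError:
            continue  # switch to the next standby
    raise RuntimeError("all replicas failed")


replicas = [Replica("primary", healthy=False), Replica("standby")]
print(run_with_failover(replicas, 41))  # ('standby', 42)
```

Because every replica runs the same software, this arrangement tolerates hardware faults only; a software defect would reproduce on the standby as well.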
More on nT/nH/nS (continued) • Consider 1T/nH/nS: • This is the case where we have multiple versions of software running on multiple pieces of hardware, providing possibly different outputs. The key is the “decision algorithm” that determines which output is the “right” one. • This case also needs a sophisticated operating system or runtime tool to gather the outputs from the multiple, possibly different, hardware and software. • Consider 1T/1H/nS: • This is the case where we have multiple versions of software running on the same hardware, providing possibly different outputs. The key here, again, is the “decision algorithm” that determines the “correct” output. • This requires a compiler and runtime tool that facilitate multiple, parallel processing
N-version Programming • N-version programming is a fault-tolerance technique introduced by Avizienis and Chen, based on the notion of “multiple computation.” The general scheme works as follows: • There are multiple (n) independent versions of a program that perform identical functionality • The same input is distributed to all n versions • The individual outputs from all n versions are fed to a “decision box” • The “decision box,” using some algorithm, chooses the appropriate answer as the output
N-version Programming (diagram) • Input -> Version 1, Version 2, . . . , Version n (all receive the same input) -> Decision “Box” -> Output • Note that this may be 1T/1H/nS or 1T/nH/nS
N-version Programming • The “Decision Box” algorithm is an important factor in this approach. • The decision algorithm is often based on the assumption that the faults in the n versions are independent (the earlier mentioned “failure independence”) • This assumption says that if the faults are independent, then any one fault is likely to be local to a single version, and the other versions may process correctly with respect to that fault, even though they may have other faults of their own. • One popular algorithm is the “simple-majority” rule, which uses the answer given by the majority of versions. • Note that assuming the majority is correct is not a “guarantee.” What if the majority were wrong!?
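A minimal sketch of the scheme with a simple-majority decision box follows. All names are illustrative assumptions: `decision_box` and `run_n_versions` are hypothetical helpers, the three `abs_*` functions stand in for independently developed versions, and `abs_buggy` carries a deliberately planted fault.

```python
from collections import Counter


def decision_box(outputs, n_versions):
    """Return the output backed by a strict majority of all versions, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > n_versions / 2 else None


def run_n_versions(versions, x):
    """Feed the same input to every version and let the decision box choose."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            continue  # a crashed version simply casts no vote
    return decision_box(outputs, len(versions))


def abs_v1(x):
    return abs(x)


def abs_v2(x):
    return x if x >= 0 else -x


def abs_buggy(x):
    return x  # planted fault: wrong for negative inputs


print(run_n_versions([abs_v1, abs_v2, abs_buggy], -5))     # 5
print(run_n_versions([abs_v1, abs_buggy, abs_buggy], -5))  # -5
```

The second call illustrates the caveat on the slide: when two versions share the same fault, the majority is wrong and the decision box confidently returns the wrong answer.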
Facilitating N-version Programming • A way to make N-version Programming more reliable is to approach fault independence through version independence: • Use diverse people to develop the different versions • Use different development processes to develop the different versions • Use different technologies, tools, programming languages, methodologies, etc. to develop the different versions • So far, N-version Programming has been found to be quite costly! • What is a reasonable N: 3, 4, 20? • ***Can we use N-version Programming for security attack tolerance?***
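One rough way to think about “what is a reasonable N” is to compute how often a majority vote lacks a correct majority under the idealized independence assumption. The sketch below assumes every version fails independently with the same probability p on a given input, which is an assumption made here for illustration and not a claim from the slides; the function name is hypothetical. The vote is counted as failed when at least ceil(N/2) versions are wrong, which is conservative because wrong versions need not agree with each other.

```python
from math import comb


def majority_failure_probability(n, p):
    """P(no strict majority of correct versions), assuming independent failures."""
    k_min = (n + 1) // 2  # ceil(n / 2) wrong versions are enough to block a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


for n in (1, 3, 5):
    print(n, majority_failure_probability(n, 0.1))
```

With p = 0.1 the values are 0.1 for N = 1, 0.028 for N = 3, and 0.00856 for N = 5 (up to floating-point rounding), so each step buys less while the development cost grows roughly with N. If faults are correlated across versions, the real benefit is smaller than these numbers suggest.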
Failure Containment • Even with all the fault prevention and fault tolerance techniques, unfortunately, we will still have faults. In that case, can we (a) “prevent severe accidents” and (b) “reduce the damage of the accidents”? • We already know that we cannot prevent all accidents. But we can analyze the hazards of an accident and hopefully contain or limit the damage.
Fault Tree Analysis for failure analysis • 1. List the set of events that cause the “accident” or “failure” • 2. Build an upside-down tree that logically connects the events to the failure
Example tree (top event first, children indented):
Security Break-in F1
    AND
        Log-in Granted
            OR
                Password Exposed
                Log-in validation “bug”
        Access to F1 “bug”
• The top event is the “accident” • Circles are primary events • AND/OR are logical conditions
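A fault tree of this kind can be evaluated mechanically. The sketch below assumes one reading of the slide's tree: the top event “Security Break-in F1” is an AND of “Log-in Granted” and the “Access to F1” bug, and “Log-in Granted” is an OR of “Password Exposed” and the log-in validation bug. The helper names (`AND`, `OR`, `occurs`) and the event strings are hypothetical.

```python
def AND(*children):
    return ("AND", children)


def OR(*children):
    return ("OR", children)


def occurs(node, events):
    """True if the event at `node` occurs, given the set of primary events that happened."""
    if isinstance(node, str):  # leaf: a primary event
        return node in events
    gate, children = node
    results = [occurs(child, events) for child in children]
    return all(results) if gate == "AND" else any(results)


LOGIN_GRANTED = OR("password exposed", "log-in validation bug")
SECURITY_BREAK_IN = AND(LOGIN_GRANTED, "access-to-F1 bug")

print(occurs(SECURITY_BREAK_IN, {"password exposed"}))                      # False
print(occurs(SECURITY_BREAK_IN, {"password exposed", "access-to-F1 bug"}))  # True
```

Under this reading the AND gate is where containment has leverage: removing either the access bug or every path to a granted log-in is enough to prevent the top event.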
Containment • We use the fault-tree to analyze and understand the cause of the “accident.” We may use it for: • Accident elimination • Accident reduction • Accident control • The actual containment is a solution that is domain dependent and requires “domain specific” knowledge.