Systematic Failure and Safety Integrity Levels
Overview
• What is systematic failure? How does it arise? How can it be defended against?
• Safety integrity
• Integrity levels
  • DEF (AUST) 5679 Integrity Levels
  • Civil Aerospace (ARP 4754) Development Assurance Levels and DO-178B
• Independence requirements
• Implications of SILs / DALs for development and assessment
• Motor Industry (MISRA) SILs
Classes of Failure
• Random – failures due to physical causes; a variety of degradation mechanisms
• Wearout – a specific class of failure in which an item of limited life has worn out
• Systematic – failures due to flaws in the system; systems subjected to the same conditions fail consistently
• NB even random and wearout failures are the result of design decisions
Definition of Systematic Failure
Systematic failure is unwanted behaviour which
• is repeatable – if conditions can be exactly replicated
• is predictable (but not accurately) – all systems have flaws
• is indefensible – it should not occur… but it is extremely hard to prevent
• arises from human error – misconceptions and mistakes
Sources of Systematic Failure
Systematic failures may arise from
• Concept (“a bad idea”) – inherently flawed
• Specification (“building the wrong thing”) – safety is with respect to intent
• Design or manufacture (“building the thing wrong”) – including third-party contributions (COTS)
• Use and maintenance – mistakes; poor or unworkable procedures; violating designers’ intentions or expectations
• Change – the challenge of legacy systems
Cause by Lifecycle Phase
[Figure omitted: distribution of failure causes by lifecycle phase, based on HSE (UK Health and Safety Executive) data; similar figures appear in other studies.]
Preventing Systematic Failures 1
Actions to identify and remove errors:
• Design reviews
• Certified tools
• Testing
• (For software) formal methods – proof and refinement
These will help to demonstrate conformance to specification – but remember safety is with respect to intent:
• need techniques to reconsider intent after completion of a design proposal
Preventing Systematic Failures 2
Techniques which can help consider the design w.r.t. intent:
• creative
• “informal” (not unstructured)
• e.g. HAZOP
The bottom line: no non-trivial system can be guaranteed free from error. We must have an expectation of failure, and make appropriate provision:
• error handling, redundancy, diversity
Coping with Systematic Failures 1
Measures to deal with systematic failure depend on:
• System requirements – acceptability of failure modes, e.g.
  • Fail stop – rarely acceptable (for the whole system)
  • Fail safe – nuclear systems, manufacturing plant, ABS
  • Degraded operation – aircraft navigation, autopilot
  • “Failure free” – flight controls
• Availability
• Physical limitations (size, weight, power)
• Cost
Management of systematic failure is one trade-off dimension (ALARP).
Coping with Systematic Failures 2
For anticipated failure modes, we can:
• provide handlers / fallback mechanisms (sketch below)
• argue safety (that failure will not cause a system hazard) by analysis of the completed system
The more difficult problem is unanticipated failures:
• can attempt to provide general mechanisms, but
  • how effective will they be?
  • how much complexity will they add?
  • will they themselves cause failures?
• in general we will not be able to argue safety from the product – we will need process arguments (improbability of having built in errors)
• Integrity Levels or Development Assurance Levels help determine the appropriate process
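To make the “handlers / fallback mechanisms” bullet concrete, here is a minimal sketch of a handler for an anticipated failure mode. All names, ranges and the hold-last-good-value policy are hypothetical; whether such a fallback is actually safe must be argued by analysis of the completed system, as noted above.

```python
# Minimal sketch (hypothetical names and ranges): a sensor read wrapped so
# that anticipated failure modes -- a missing sample or an out-of-range
# value -- fall back to a defined, analysable behaviour.

def read_temperature_c(raw: float | None, last_good: float,
                       lo: float = -40.0, hi: float = 150.0) -> float:
    """Return a usable temperature reading.

    Anticipated failures: raw is None (no sample) or outside [lo, hi].
    Fallback policy: hold the last good value. This is acceptable only if
    the system-level safety analysis shows a stale reading cannot cause a
    hazard -- the 'argue safety by analysis' step on this slide.
    """
    if raw is None or not (lo <= raw <= hi):
        return last_good
    return raw
```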
Vehicle Speed Sensing
• Engine speed and load from the EMS are used as a diverse “sanity check” on the vehicle speed input to the gearbox controller (sketch below)
• speed sensor failure is an anticipated failure mode
• diversity may also protect against unanticipated failures
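As an illustration of this diverse sanity check, the sketch below derives a plausible vehicle speed from engine speed via the driveline and rejects a speed-sensor reading that disagrees with it. All names, ratios and tolerances are hypothetical, not taken from any real gearbox controller.

```python
# Hypothetical sketch of a diverse plausibility check on a speed sensor,
# using engine speed and driveline geometry as an independent estimate.

def plausible_speed_kmh(engine_rpm: float, overall_ratio: float,
                        wheel_circumference_m: float) -> float:
    """Estimate vehicle speed from engine speed via the driveline.
    overall_ratio = gearbox ratio * final-drive ratio (engine revs per
    wheel rev); wheel_circumference_m is the rolling circumference."""
    wheel_rpm = engine_rpm / overall_ratio
    return wheel_rpm * wheel_circumference_m * 60.0 / 1000.0  # m/min -> km/h

def speed_sensor_credible(measured_kmh: float, engine_rpm: float,
                          overall_ratio: float, wheel_circumference_m: float,
                          tolerance: float = 0.25) -> bool:
    """Reject the speed-sensor reading if it disagrees with the
    engine-derived estimate by more than `tolerance` (fractional)."""
    estimate = plausible_speed_kmh(engine_rpm, overall_ratio,
                                   wheel_circumference_m)
    if estimate == 0.0:
        return measured_kmh < 5.0  # near standstill: accept small readings
    return abs(measured_kmh - estimate) / estimate <= tolerance
```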
System Architectures
• Simplex systems – highly reliable components
• Dual systems – twin identical; twin dissimilar; control + monitor
• N-way redundant systems – identical / dissimilar; self-checking / voting
Redundancy and Diversity
Redundant components offer protection against failures:
• the probability of multiple failures is lower than that of a single failure, provided the components fail independently (worked example below)
• if components are identical, systematic failures (design flaws) are a source of common mode failure
Diverse (dissimilar) implementations offer (some) protection against systematic failure:
• even if all variants contain flaws, they are less likely to fail simultaneously
Two classes of diversity can be identified:
• conceptual
• mechanistic
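A worked example of the independence caveat, using the standard β-factor common-cause model; the numbers are illustrative, not from the slides.

```latex
% Two channels, each failing with probability p = 10^{-3} per demand.
% If the channels fail independently:
P_{\text{both}} = p^2 = 10^{-6}
% If a fraction \beta = 0.1 of failures are common cause (e.g. a shared
% design flaw in identical channels), the beta-factor model gives:
P_{\text{both}} \approx \beta p + \bigl((1-\beta)p\bigr)^2
              \approx 10^{-4} + 8.1 \times 10^{-7} \approx 10^{-4}
```

The common-cause term dominates: identical redundant channels end up two orders of magnitude worse than the p² independence intuition suggests, which is why diversity matters.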
Aircraft Fuel Gauging
Two alternative ways of calculating remaining fuel:
• depth of fuel in the tanks and tank geometry
• fuel loaded minus fuel used (measured by flow metering)
This gives diverse concepts, algorithms and hardware (sketch below):
• a high degree of protection against systematic failure
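A hedged sketch of the two diverse fuel calculations and a cross-check between them; the tank geometry, fuel density and disparity threshold are invented for illustration.

```python
# Hypothetical sketch: two independent estimates of fuel remaining,
# cross-checked before display.

def fuel_from_depth(depth_m: float, tank_area_m2: float,
                    density_kg_per_m3: float = 800.0) -> float:
    """Estimate fuel mass from measured depth and (simplified,
    constant-cross-section) tank geometry."""
    return depth_m * tank_area_m2 * density_kg_per_m3

def fuel_from_flow(loaded_kg: float, flow_samples_kg: list[float]) -> float:
    """Estimate fuel mass as fuel loaded minus integrated fuel flow."""
    return loaded_kg - sum(flow_samples_kg)

def gauged_fuel(depth_estimate: float, flow_estimate: float,
                max_disparity_kg: float = 200.0) -> tuple[float, bool]:
    """Return (displayed value, healthy flag). If the diverse estimates
    disagree, flag the gauging system and display the lower value, so a
    gauging fault errs toward 'less fuel than you think'."""
    healthy = abs(depth_estimate - flow_estimate) <= max_disparity_kg
    return min(depth_estimate, flow_estimate), healthy
```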
Problems with Diversity
Mechanistic diversity provides limited protection against systematic failure:
• specifications for all versions will be the same
• people (are trained to) think alike
• given identical (correct) specifications, similar errors in the product are highly likely
• several studies (e.g. Knight & Leveson) of n-version programming have demonstrated this effect
However:
• it can avoid systematic errors introduced by tools
• the timing of failures is likely to differ – this may be sufficient in some systems
Dual Systems
Controller and Monitor
• checks outputs
• suited to responsive systems
• slower error detection
• design effort similar to non-identical channels
• may offer more protection against systematic failures (conceptually more diverse)
Dual Channel
• checks during processing
• suited to continuous systems
• fast error detection
• design effort and protection against systematic failures depend on whether the channels are identical
Dual Channel Example
Anti-lock brake system:
• calculates when to release the brakes in response to wheel lock-up
• minute differences in data sampling can have a large effect on the algorithm – identical channels compare data and results to minimise this
• provides protection only against hardware failure
• safest action = do not release the brakes
NB – a high level of communication between channels carries the risk of propagating a failure to the healthy channel (see the cross-check sketch below)
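The sketch below illustrates the identical-channel cross-check and the fail-safe decision described above; the channel interface, the slip-based intermediate comparison and the tolerance are hypothetical.

```python
# Hypothetical dual-channel cross-check for the ABS example: both
# identical channels compute a brake-release decision; on any
# disagreement the system takes the safe action (keep braking).

def cross_checked_release(channel_a_release: bool, channel_b_release: bool,
                          slip_a: float, slip_b: float,
                          slip_tolerance: float = 0.02) -> bool:
    """Release brakes only if both channels agree on the decision AND
    their intermediate wheel-slip estimates match within tolerance
    (guards against the sampling-induced divergence noted above)."""
    data_agree = abs(slip_a - slip_b) <= slip_tolerance
    if not data_agree or channel_a_release != channel_b_release:
        return False  # disagreement: fail to the safe state (do not release)
    return channel_a_release
```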
Control + Safety Interlock Example
Missile launch control from an aircraft (sketch below):
• Controller – initiates launch actions; provides target and track information to the missile
• Monitor – provides a safety interlock against inadvertent launch
• Safest action = no launch
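A minimal sketch of the controller + safety-interlock pattern: the launch output is the conjunction of the controller's command and the monitor's independent permission, so the monitor can veto but never initiate a launch. All names are hypothetical.

```python
# Hypothetical control + interlock output stage. Safe action is 'no
# launch': loss of the monitor's permission signal forces False.

def launch_output(controller_commands_launch: bool,
                  monitor_interlock_open: bool) -> bool:
    """Fire only when the controller commands launch AND the independent
    monitor's interlock permits it; the monitor alone cannot launch."""
    return controller_commands_launch and monitor_interlock_open
```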
Triple Modular Redundancy
• TMR can handle any failure of a single channel
• Many variations on TMR exist
• The biggest problem is the voter:
  • a single point of failure
  • the comparison itself may be hard to define – what thresholds should be set? (see the voter sketch below)
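To make the thresholds question concrete, here is a minimal majority voter sketch for analogue channel outputs; the tolerance value is hypothetical, and note that this function is itself the single point of failure the slide warns about.

```python
# Minimal TMR voter sketch: "agreement" must be defined by a tolerance,
# and choosing that tolerance is itself a design decision.

def tmr_vote(a: float, b: float, c: float,
             tolerance: float = 0.05) -> tuple[float | None, bool]:
    """Return (voted value, agreement flag). Any two channels within
    `tolerance` of each other form a majority; their mean is the output,
    masking a failure of the third channel."""
    for x, y in [(a, b), (a, c), (b, c)]:
        if abs(x - y) <= tolerance:
            return (x + y) / 2.0, True
    return None, False  # no two channels agree: failure cannot be masked
```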
Complex Architectures
Many variants are possible. Common problems are:
• voting
• synchronisation
With communicating channels, the system may be susceptible to Byzantine failure (toy example below):
• simple failure of A – B and C both receive “A not OK” from A
• “malicious” failure of A – B receives “A not OK” from A, but C receives “A OK” from A
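A toy encoding of the two failure scenarios above (the message format is hypothetical), showing why B and C cannot reach a consistent view of A's health under a "malicious" failure without exchanging further messages.

```python
# Toy Byzantine illustration: node A reports its own status to B and C.

def simple_failure_reports() -> dict[str, str]:
    return {"to_B": "A not OK", "to_C": "A not OK"}  # consistent view

def byzantine_failure_reports() -> dict[str, str]:
    return {"to_B": "A not OK", "to_C": "A OK"}      # contradictory view

def views_consistent(reports: dict[str, str]) -> bool:
    """B and C can only agree on A's health if A told them the same thing."""
    return len(set(reports.values())) == 1

assert views_consistent(simple_failure_reports())
assert not views_consistent(byzantine_failure_reports())
```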
Safety Integrity
• Definition: the likelihood of a safety-related system satisfactorily performing the required safety functions under all the stated conditions within a stated period of time
• Another definition: freedom of safety functions from flaws
• Two elements:
  • Integrity w.r.t. random failure (hardware) – random failures in a dangerous mode, due to components wearing out etc.
  • Integrity w.r.t. systematic failure (hardware and software) – systematic failure in a dangerous mode, arising from flaws in specification, flaws in design, or unanticipated environmental influences
• Systematic failures typically dominate the achieved failure rate in complex computer-based systems
Safety Integrity Levels
• The problem:
  • random failures can be designed against
  • systematic failures are not known in advance, and any further design may introduce further errors
• A safety integrity level (or Development Assurance Level) is an indication of the required level of protection against failure
  • i.e. the integrity level indicates the degree to which a component must be free from flaws
• Principles of integrity levels:
  • allocate them early in the lifecycle to systems, functions and components
  • the allocated integrity levels then define the degree of rigour and scrutiny applied in development
  • the intuition is that better development techniques will result in fewer systematic flaws being introduced
DEF (AUST) 5679 Integrity Levels
• Assigns required Levels of Trust (LOTs) to System Safety Requirements (SSRs)
• Level of Trust (LOT) = a measure of the level of confidence required that the system meets that SSR
• Set as a requirement
Level of Trust Assignment
• A default LOT is assigned to each SSR using a standard table [table omitted]
• If Prob(Accident) given Prob(Hazard) can be determined, the LOT can be lowered
SIL Assignment to CSRs
• Seven SILs (S0–S6), corresponding to the seven LOTs
• The SIL assigned to a component safety requirement (CSR) depends on the protective measures in the design that reduce the chance of the corresponding hazard
  • with no such measures, the SIL maps directly from the LOT set for the CSR
• It is hard to provide a quantitative argument about protective measures, so a qualitative argument is expected
• A CSR’s SIL can be no more than two levels lower than the corresponding LOT
• A SIL assigned to a component implementing an operator procedure should not normally be higher than S3
  • justification of exceptions is required
• It is considered undesirable to assign a SIL above S4 to a CSR
• NB there is also a requirement for independence
(These rules are sketched in code below.)
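The assignment rules on this slide can be read as a small algorithm; the sketch below encodes them. The rule set comes from the slide, but the helper function and its interface are hypothetical, not part of DEF (AUST) 5679 itself.

```python
import warnings

# Hypothetical encoding of the DEF (AUST) 5679 SIL-assignment rules
# summarised above.

def assign_csr_sil(lot: int, reduction_for_measures: int,
                   operator_procedure: bool = False) -> int:
    """Assign a SIL (0..6, i.e. S0..S6) to a component safety requirement.

    lot: the Level of Trust (0..6) set for the corresponding requirement.
    reduction_for_measures: levels of reduction justified (qualitatively)
    by protective measures in the design.
    """
    assert 0 <= lot <= 6
    # The SIL can be no more than two levels below the LOT.
    sil = max(lot - min(reduction_for_measures, 2), 0)
    # Components implementing operator procedures: normally capped at S3
    # (exceptions require justification).
    if operator_procedure:
        sil = min(sil, 3)
    if sil > 4:
        warnings.warn("Assigning a SIL above S4 is considered undesirable")
    return sil
```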
SIL Influence – Examples
[Tables omitted: safety integrity design attributes, and safety integrity level attributes for software.]
DO-178B Software Level Definitions
Each level applies to software whose anomalous behaviour, as shown by the system safety assessment process, would cause or contribute to a failure of system function resulting in the stated failure condition for the aircraft:
• Level A – catastrophic
• Level B – hazardous / severe-major
• Level C – major
• Level D – minor
• Level E – no effect on aircraft operational capability or pilot workload
DO-178B Software Level Determination
• Initially, the software level is determined without regard to system design (by considering loss of function and malfunction)
• Where software contributes to more than one failure condition, the highest failure condition category sets the level
• Where a system function is allocated to partitioned parallel components:
  • (at least) one element will have the level associated with the highest failure condition
  • other elements are set to the level appropriate to the effect of loss of their function
  • examples: multiple-version dissimilar software architecture; safety monitoring architecture
• Where a system function is allocated to serial components:
  • all elements have the level associated with the highest failure condition
• See the next slide (taken from ARP 4754) for architecture examples; a code sketch of the mapping follows below
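A hedged sketch of the determination logic above: the level follows the most severe failure condition the software can cause or contribute to, and serial components all inherit that level. The data structures and function names are illustrative, not taken from DO-178B.

```python
# Hypothetical helper encoding the DO-178B level-determination rules
# summarised above.

SEVERITY_ORDER = ["catastrophic", "hazardous/severe-major",
                  "major", "minor", "no effect"]  # most to least severe
SEVERITY_TO_LEVEL = dict(zip(SEVERITY_ORDER, "ABCDE"))

def software_level(failure_conditions: list[str]) -> str:
    """Where software contributes to more than one failure condition,
    the highest (most severe) category sets the level."""
    worst = min(failure_conditions, key=SEVERITY_ORDER.index)
    return SEVERITY_TO_LEVEL[worst]

def serial_component_levels(n_components: int,
                            failure_conditions: list[str]) -> list[str]:
    """Serial components: every element carries the level associated
    with the highest failure condition."""
    return [software_level(failure_conditions)] * n_components

# Example: software contributing to both a major and a hazardous
# failure condition is Level B.
assert software_level(["major", "hazardous/severe-major"]) == "B"
```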
Independence 1
For different DALs / SILs, an argument of independence is required.
IEC 61508 definition – the systems should:
• be functionally diverse – use totally different approaches to achieve the same results
• be based on diverse technologies – use different types of equipment to perform the same function
• not share common parts or services whose failure could result in a failure of both systems
• have predominant failure modes that cause failure in a safe ‘direction’
• not share common maintenance or test procedures
• be physically separated
Independence 2
DO-178B
• For different levels between functionally independent components, partitioning is required, addressing the following aspects:
  • hardware resources – processors, memory, I/O devices, timers …
  • control coupling – vulnerability to external access
  • data coupling – shared data, including processor stacks and registers
  • hardware failure modes
DS 00-56 Issue 2
• Independence in specification, design, development and maintenance
• Components must be conceptually independent and rely on different design properties
Assessment: DO-178B Example
[Table omitted: Table A-3, Verification of Output of Software Requirements Process.]
Integrity Levels: A Critique
• High SIL ⇒ ‘better’ development and assurance techniques ⇒ fewer systematic errors
  • e.g. use of MC/DC testing coverage, use of formal methods
  • Who says?!
• Rules for allocation of SILs / DALs are either:
  • too vague – e.g. the controllability approach
  • too specific / wrong – e.g. UK DS 00-55
  • rationale, anybody?
• No experiential backing for quantification from SILs / DALs
  • SIL 4 ⇒ P(failure on demand) = 10⁻⁴ – why?
• SILs are ‘fragile’ and non-transferable
  • DS 00-55 SIL 4 = which DO-178B DAL?
Summary of Integrity Levels
• Safety Integrity Levels / Development Assurance Levels are a pragmatic approach to dealing with systematic error
• A SIL / DAL is an indication of the required degree to which a system must be free from flaws
• Each level has associated safety conditions
  • these conditions must be demonstrated in order to claim that the SIL / DAL has been achieved
• The intuition behind integrity levels is correct
  • however, the exact association of development techniques with each level, and the relationship to failure rate targets, is questionable
Conclusions
• No non-trivial system can be guaranteed free from error
• There are no easy answers:
  • a rigorous / formal process can avoid errors from specification to implementation
  • errors of requirements and specification are harder to control
• Complexity trade-off:
  • as complexity increases, the probability of correct implementation decreases
• Studies show that architecture, process, software language etc. all have less impact on the delivered product than domain knowledge