Safety Analysis Techniques 1: Failure Modes and Effects Analysis
System Level Safety Analysis – Overview • System level safety analysis is the first stage of evaluation of the safety properties of a system (as opposed to the predictive and requirement setting roles of hazard analysis) • considers effects of known component failure modes (inductive), or explores design for possible causes of known hazards (deductive) • Techniques may also be appropriate for use at Preliminary System Safety Analysis (PSSA) – effectively a detailed predictive analysis • The class of system level safety analyses encompasses most of the familiar safety techniques • of 101 techniques listed in the System Safety Analysis Handbook, approximately 25 fall into this category • most techniques can be applied at many levels of system decomposition
System Level Safety Analyses – Techniques to be covered This session • Failure Modes and Effects Analysis (FMEA) Next Session • Fault Tree Analysis (FTA) This afternoon • Failure Modes and Effects Summaries (FMES) • Cause-Consequence Analysis • Sneak Analysis • Common Cause Analysis (CCA) • Zonal Hazard Analysis (ZHA) • Particular Risk Analysis (PRA)
What is FMEA? • Analysis of effects of single failures (inductive) • In safety process, used to • investigate effects of known component failures within (sub-)system • provide information about sub-system failures for incorporation into system and platform level analyses • show that known single component failures will not lead to system failure (hazard) • Results presented in tabular form • In principle straightforward!
FMEA in Practice • In industrial practice, FMEA methods and presentation vary greatly: • FMEA-style tables used to record output of many different “methods” • note that most American texts regard FMEA as a purely quantitative technique • for reliability / availability analysis only • there is also a tendency for “FMEA” to be used as a general term for all tabular safety analysis techniques, leading to confusion with HAZOP, FFA etc • tendency for FMEAs to be regarded as write-only • Many different standards exist - see Dhillon • Also several tools - mainly databases
Mandatory Elements of the analysis 1 • Pre-requisites • system of interest – specification, current drawings or schematics, etc • Identification of component • piece part – item purchased as complete unit, e.g. resistor • line-replaceable units – lowest level at which repairs are made to a system, e.g. electronic modules • sub-system • system (function) and its environment • List of failure modes • From prior operational experience • By appeal to similar components previously used • Technology specific “checklist” of failure modes • From damage analysis • Identification of hierarchy of components / sub-systems / system so that failure effects can be built up
Mandatory Elements of the analysis 2 • Identification of effects of failure • effect – consequence a failure has on the operation, function, or status of a component, or the system • scope of effects investigation • immediate • system • environment • Effects at one level in the hierarchy may become failure modes at the next level • this is normally dealt with through a hierarchy of FMEAs • See FMES
Optional Elements of the analysis • Contributing factors, e.g. • means of detection / mitigation • identification of component function / state to which failure applies • Further columns for recommendations / justification / comments • verify analysis conclusions for any open issues - testing data, etc • reduce severity of effects by design revisions • reduce probability of occurrence through removal / control of cause or propagation mechanism • increase detection through increase in validation / verification activities • Failure probability (rate) • overall probability of failure of a component • modal failure probability (rate) – probability that component fails in a specific mode
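The mandatory and optional elements above map naturally onto one record per table row. The following is a minimal sketch only: the class name `FmeaRow`, the field names, the resistor `R1`, its failure rate and the split between modes are all hypothetical illustrations, and the rule "modal rate = overall rate × fraction of failures in that mode" is an assumed apportionment, not something the slides prescribe.

```python
from dataclasses import dataclass


@dataclass
class FmeaRow:
    """One row of a hypothetical FMEA table."""
    component: str          # mandatory: identification of component
    failure_mode: str       # mandatory: one entry from the list of failure modes
    immediate_effect: str   # mandatory: effect at the component's own level
    system_effect: str      # mandatory: effect at the next level up
    detection: str          # optional: means of detection / mitigation
    component_rate: float   # optional: overall failure rate per hour (illustrative)
    mode_fraction: float    # optional: assumed share of failures in this mode

    @property
    def modal_rate(self) -> float:
        # Assumption: the modal failure rate is the overall rate
        # apportioned by the fraction of failures seen in this mode.
        return self.component_rate * self.mode_fraction


# Illustrative values only; not taken from any data source.
rows = [
    FmeaRow("R1 (resistor)", "open circuit", "no current through R1",
            "loss of signal", "built-in test", 1e-8, 0.7),
    FmeaRow("R1 (resistor)", "short circuit", "excess current through R1",
            "signal out of range", "range check", 1e-8, 0.3),
]

for row in rows:
    print(f"{row.component:15} {row.failure_mode:14} {row.modal_rate:.1e} per hour")
```

If the mode fractions for a component cover all of its failure modes, the modal rates sum back to the overall component rate, which is a useful consistency check on a table.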
FMECA Failure Modes, Effects and Criticality Analysis • Extension of FMEA which considers risk factors (severity, probability, risk class) of the failure effects • Probability is derived from component failure data… • when this value is not available the FMECA provides a (qualitative) relative ranking of failure modes • ...but severity and risk are system properties • can only be assigned if FMEA is taken through to system level effects (may have several effects columns) • i.e. for simple systems, or when sub-system FMEA is integrated into system analysis
Example Component : Vehicle speed sensor Context:
Industrial FMECA Process [Figure: process diagram involving Suppliers, Unit Eng, Systems Eng, FMECA Team, Safety Team, BITE Team, System Test and Externals]
Comparison of FFA and FMEA FFA • considers hypothetical failure modes – interpretations of generic failure classes • early in the lifecycle – preliminary / budgeting phase • function is to give confidence in overall design concept, and identify areas requiring risk reduction FMEA • works from known (or at least expected), component-specific failure modes • cannot begin until considerable design work completed • main function is providing detailed information on behaviour of components / subsystems in context
Getting value from FMEA • For simple systems, analysis can investigate system level safety effects of component failures directly • For complex systems, FMEA must contribute information which can be used constructively • clear documentation of assumptions about use / environment of component / subsystem • clear indication of any failure modes which are hazardous in themselves • clear description of sub-system level failures for use at system (platform) level • very hard to achieve in practice
Pros and Cons Advantages • simple concept • flexible • widely applied, accepted and understood Disadvantages • may not identify effects of coincident failures – considering contributing factors may help to remedy this • can be difficult to apply to sub-systems • at what level should effects be studied? • no / insufficient view / understanding of system level effects • as with all tabular methods, can be hard to navigate large amounts of unstructured data
Fault Tree Analysis – Overview • The classic deductive analysis technique, which works back from undesired event to basic causes • Useful for both qualitative and quantitative analysis • Developed by Bell Labs and the USAF in early 1960s to investigate potential causes of inadvertent launch of Minuteman missile • Now most common diagrammatic safety analysis technique • US Nuclear Regulatory Commission Fault Tree Handbook is widely accepted as the definition of “standard” fault tree symbology and method
Overview • Aspects of Fault Tree Analysis: • Qualitative construction • Cutset analysis • Automated construction • Different roles of Fault Trees (PSSA and SSA) • Quantitative analysis of Fault Trees and what the numbers mean • Fault Tree calculations with exposure times • Pitfalls of Fault Trees [Figure: process stages – System Definition, Fault Tree Construction, Qualitative Analysis, Quantitative Analysis, Uncertainty Analysis, Interpretation of Results]
Fault Tree Construction [Figure: generic fault tree with the Top Event at the root, Gates linking it to Intermediate Events, and Basic or Primary Events (BEv1, BEv2, BEvm, BEvn) at the leaves]
Fault Tree Analysis Steps 1 • Select the event which is to be investigated • the top event • Identify immediate causes of this event • immediate – avoid missing out intermediate events – the “think small” principle • In the simple system below, the immediate causes of “No output from B” are “B fails” and “No transfer from A”, but not “A fails” • if several events must occur together to cause this event, use an AND gate • if any one of a set of events will cause this event, use an OR gate • Repeat [Figure: simple system with components A and B and flows labelled In, Transfer and Out]
Fault Tree Analysis Steps 2 • BASIC EVENT – no decomposition required • COMPONENT DEFECT – May be one of: • Primary failure – simple component failure – a basic event • Secondary failure – component fails as a result of external influence – usually requires further investigation • Command failure – component receives incorrect control signals – always requires further investigation • SYSTEM DEFECT – not attributable to a single component. Requires further investigation
Fault Tree Construction Rules Additional rules have evolved to help ensure correct construction of fault trees: • all inputs to a gate should be defined before any one is examined in more detail • output of a gate must never directly form input to another gate – must always be a named intermediate event • text in boxes should be complete – what event is and when it occurs • causes always chronologically precede consequences – sounds obvious, but important in closed-loop control
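The construction steps and rules above can be sketched as a small data structure in which every gate belongs to a named event, so a gate output never feeds another gate directly. This is an illustrative sketch only: the class `Event` and its fields are hypothetical, and the basic event "No input to A" is an assumption added to complete the example, since the slides name only "B fails", "No transfer from A" and "A fails".

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Event:
    """A named fault tree event; basic events have no gate."""
    name: str
    gate: str = ""                           # "AND", "OR", or "" for a basic event
    inputs: List["Event"] = field(default_factory=list)

    def occurs(self, failed: Set[str]) -> bool:
        """Return True if this event occurs given the set of basic events that occurred."""
        if not self.gate:
            return self.name in failed
        results = [child.occurs(failed) for child in self.inputs]
        if self.gate == "AND":
            return all(results)
        if self.gate == "OR":
            return any(results)
        raise ValueError(f"unknown gate type: {self.gate}")


# Immediate causes first: "A fails" sits one level below "No transfer from A".
no_transfer = Event("No transfer from A", "OR",
                    [Event("A fails"), Event("No input to A")])  # second input is assumed
top = Event("No output from B", "OR", [Event("B fails"), no_transfer])

print(top.occurs({"A fails"}))   # True: propagates through the intermediate event
print(top.occurs(set()))         # False: nothing has failed
```

Keeping the intermediate event explicit is what allows the text in each box to say what the event is and when it occurs.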
Further gates • Also specialised gates • e.g. summation and comparator, where inputs can be “weighted” • rarely used
Handy Hints • Always follow the immediate cause rule • remember immediate does not mean obvious • e.g. an obvious cause of “car fails to stop” is “driver fails to press brake pedal” - but the immediate causes are “insufficient braking at wheels” or “no friction between tyres and road” • Good idea is to write down a list of obvious causes, and cross them out as they are reached • a well-constructed tree following the immediate cause rule will always include all causes, though it may not be intuitively obvious that this is the case • If faced with a situation where there seems to be more than one plausible way of proceeding, try both (or all) • often find that some quickly lead to tree with lots of identical branches - a strong hint that other ways are probably better
Documenting Fault Tree Analysis • Fault tree diagrams are excellent… • well defined, unambiguous semantics • high information density • good communication tool • … but they cannot stand alone • Supporting documentation should include • relevant design data (e.g. references to sources, versions) • rationale and explanations where logic is not obvious • NB some tools allow annotations • assumptions
Example 1 • A deluge system uses a header tank and sprinkler heads incorporating melt disks to feed and activate the system. If a fire should break out and reach a sufficiently high temperature, one or more melt disks will melt, allowing water to pass through the sprinkler. A manually operated valve is provided to isolate each sprinkler for maintenance. • Two pumps, one diesel and one electric, are used to maintain the level in the header tank. As water flows from the sprinklers, the water level in the header tank will fall. Each pump has its own level sensor, which starts the pump when the water level in the tank drops too low.
Example 2 • Build a fault tree for the top event “system does not provide continuous deluge in Zone 1 when there is a fire in that zone” (assume perfect pipes and wiring!)
Qualitative Analysis
[Figure: fault tree in which Top is an AND of Wfails (W) and Z, and Z is an OR of Xfails (X) and Yfails (Y)]
Top = Wfails ∧ Z
= W ∧ (Xfails ∨ Yfails)
= W ∧ (X ∨ Y)
= (W ∧ X) ∨ (W ∧ Y)
where ∧ is logical AND and ∨ is logical OR
It is very common to just write Top = WX + WY when working with single letters for basic events
Cutsets and Minimal Cutsets • A cutset is a set of basic events that causes the top event to occur • A minimal cutset is a cutset which has no proper subsets that are cutsets
[Figure: fault tree in which Top is an AND of Wfails (W) and Z, and Z is an OR of Xfails (X) and Yfails (Y)]
Top = WX + WY
Minimal cutsets: {W,X}, {W,Y}
Cutsets: {W,X}, {W,Y}, {W,X,Y}
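The two definitions can be checked directly for this small tree by brute force. The sketch below enumerates every non-empty subset of basic events, keeps those that cause the top event, and then filters out any cut set that has a proper subset which is also a cut set. The function name `top_event` is hypothetical, and exhaustive enumeration is only practical for very small trees.

```python
from itertools import combinations

BASIC_EVENTS = ["W", "X", "Y"]


def top_event(failed):
    """Top = W AND (X OR Y), the tree used on this slide."""
    return "W" in failed and ("X" in failed or "Y" in failed)


# A cut set is any set of basic events that causes the top event.
cut_sets = [
    frozenset(combo)
    for size in range(1, len(BASIC_EVENTS) + 1)
    for combo in combinations(BASIC_EVENTS, size)
    if top_event(set(combo))
]

# A minimal cut set has no proper subset that is also a cut set.
minimal_cut_sets = [
    c for c in cut_sets if not any(other < c for other in cut_sets)
]

print(sorted(sorted(c) for c in cut_sets))
print(sorted(sorted(c) for c in minimal_cut_sets))  # [['W', 'X'], ['W', 'Y']]
```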
Informal Identification of Cut Sets 1 Tank Overflow fault tree has an AND gate at the top, so no single event is a cut set – will always need at least 2 events
Informal Identification of Cut Sets 2 Tank Overflows if A and B occur together, in combination with any other events, thus AB, ABC, ABD… are cut sets
Informal Identification of Cut Sets 3 Similarly, AC and any other events... …and also ADE and any other events
All Cut Sets for Tank Overflow Cut sets of size 2 • AB, AC Cut sets of size 3 • ABC, ABD, ABE, ACD, ACE, ADE Cut sets of size 4 • ABCD, ABCE, ABDE, ACDE All events • ABCDE Minimal cutsets • AB, AC, ADE (none of these has a proper subset that is a cut set) • Note that A is a normal event, so AB and AC are each only one failure away from the top event
Minimal Cutsets and Reduction • General idea is to perform algebraic manipulation of the Boolean expression until it is in the form: Top = C1 + C2 + … + Cn, where each Ci is a product expression for the ith minimal cutset • Lots of algorithms to do the required manipulations. Best leave it to tools for real systems, e.g. Fault Tree+
Minimal Cut Sets and Reduction 2
[Figure: original fault tree for E0 and the simpler tree it reduces to]
E0 = E1 + E2
= A.E3 + B + E4 (expand)
= A(B + C) + B + AC (expand)
= AB + AC + B + AC (multiply out)
= AB + AC + B (remove repeated cut set)
= AC + B (Boolean absorption rule)
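The same reduction can be sketched in code by holding each expression as a set of products, where each product is a set of basic events. OR then becomes set union, AND becomes pairwise merging of products, and the absorption rule removes any product that contains another. The helper names `basic`, `OR`, `AND` and `absorb` are hypothetical, and this is a simple top-down expansion for illustration, not the algorithm used by any particular tool.

```python
def basic(name):
    """A basic event as a one-term sum of products."""
    return {frozenset([name])}


def OR(*terms):
    """Sum: union of the products; repeated products collapse automatically."""
    return set().union(*terms)


def AND(*terms):
    """Product: merge every product of one term with every product of the next."""
    result = {frozenset()}
    for term in terms:
        result = {p | q for p in result for q in term}
    return result


def absorb(products):
    """Boolean absorption: drop any product that strictly contains another."""
    return {p for p in products if not any(q < p for q in products)}


A, B, C = basic("A"), basic("B"), basic("C")
E3 = OR(B, C)
E4 = AND(A, C)
E1 = AND(A, E3)
E2 = OR(B, E4)
E0 = OR(E1, E2)            # AB + AC + B, with the repeated AC already merged

minimal = absorb(E0)       # AC + B
print(sorted(sorted(p) for p in minimal))  # [['A', 'C'], ['B']]
```

Using sets means that A.A = A and repeated cut sets are handled without any extra step; only absorption needs explicit code.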
Working with Fault Tree Equations • Manual approach is slow, tedious… and error prone • Conventional Boolean algebra methods (e.g. Karnaugh maps, McCluskey’s algorithm) may be used to simplify fault trees • Other algorithms have been proposed • look in journals such as IEEE Transactions on Reliability • For real systems, use tools! • e.g. Fault Tree+ • “Sanity check” the results • remember garbage in… garbage out
System Evaluation 1 • Apply basic rules of probability to determine probability of top event • Use the minimal form of top event Boolean expression
[Figure: fault tree in which Top is an AND of Wfails (W) and Z, and Z is an OR of Xfails (X) and Yfails (Y)]
Top = WX + WY
pr(Top) = pr(WX) + pr(WY) - pr(WXWY)
= pr(WX) + pr(WY) - pr(WXY)
= pr(W)pr(X) + pr(W)pr(Y) - pr(W)pr(X)pr(Y) (assumption made: W, X and Y are independent)
Using:
pr(AB) = pr(A)pr(B) if independent
pr(A + B) = pr(A) + pr(B) - pr(AB)
System Evaluation 2 • Let pr(W) = pr(X) = pr(Y) = 0.001
Top = WX + WY
pr(Top) = pr(WX) + pr(WY) - pr(WXWY)
= pr(WX) + pr(WY) - pr(WXY)
= pr(W)pr(X) + pr(W)pr(Y) - pr(W)pr(X)pr(Y)
= 0.001 × 0.001 + 0.001 × 0.001 - 0.001 × 0.001 × 0.001
= 0.000001 + 0.000001 - 0.000000001
= 0.000001999
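The calculation above generalises to any list of minimal cut sets by applying pr(A + B) = pr(A) + pr(B) - pr(AB) repeatedly (inclusion-exclusion). The sketch below assumes, as the slide does, that all basic events are independent; the function name `top_probability` is hypothetical, and the number of terms grows exponentially with the number of cut sets, so this is only suitable for small examples.

```python
import math
from itertools import combinations


def top_probability(min_cut_sets, p):
    """Exact pr(Top) by inclusion-exclusion, assuming independent basic events."""
    total = 0.0
    for k in range(1, len(min_cut_sets) + 1):
        sign = 1 if k % 2 == 1 else -1
        for combo in combinations(min_cut_sets, k):
            # The intersection of several cut sets occurring is the union of
            # their basic events, e.g. WX and WY together is WXY.
            events = frozenset().union(*combo)
            total += sign * math.prod(p[e] for e in events)
    return total


min_cut_sets = [frozenset("WX"), frozenset("WY")]
p = {"W": 0.001, "X": 0.001, "Y": 0.001}

print(top_probability(min_cut_sets, p))  # approximately 0.000001999
```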
Upper Bound Approximation • First approximation for system failure is given by: pr(Top) ≈ pr(C1) + pr(C2) + … + pr(Cn), where pr(Ci) is small
• In previous example
Top = WX + WY
pr(Top) = pr(WX) + pr(WY) - pr(WXY) = 0.000001999
pr(WX) + pr(WY) = 0.000002
• Approximation is always conservative, but poor when pr(Ci) is not small
• Consider same fault tree as before, but pr(W) = pr(X) = pr(Y) = 0.5
pr(Top) = pr(WX) + pr(WY) - pr(WXY) = 0.375
pr(WX) + pr(WY) = 0.5, a poor estimate of 0.375
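A short sketch comparing the upper bound with the exact value for both sets of probabilities on this slide. The function names `upper_bound` and `exact_wxy` are hypothetical, and independence of W, X and Y is assumed as before.

```python
import math


def upper_bound(min_cut_sets, p):
    """Sum of the minimal cut set probabilities (independent basic events assumed)."""
    return sum(math.prod(p[e] for e in cut_set) for cut_set in min_cut_sets)


def exact_wxy(p):
    """Exact pr(Top) for Top = WX + WY."""
    return p["W"] * p["X"] + p["W"] * p["Y"] - p["W"] * p["X"] * p["Y"]


min_cut_sets = [("W", "X"), ("W", "Y")]

for value in (0.001, 0.5):
    p = {"W": value, "X": value, "Y": value}
    bound, exact = upper_bound(min_cut_sets, p), exact_wxy(p)
    error = (bound - exact) / exact
    print(f"pr = {value}: bound = {bound:.9f}, exact = {exact:.9f}, "
          f"overestimate = {error:.2%}")
```

The bound never falls below the exact value, but it is a sum rather than a probability and can exceed 1 when the cut set probabilities are large.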
Importance & Sensitivity • Measure of the contribution of a component to system reliability (or other characteristic) • employ conditional probability mathematics • Uses • measures effect of small changes in component reliability • means of highlighting critical items • to aid trade-offs and design • Purpose of a sensitivity analysis is to determine how sensitive system parameters are to changes in event failure probabilities • vary event failure probabilities above and below the normal values by a specified percentage • evaluate effect on system failure rate • can be done for other parameters, e.g. maintenance intervals
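Both ideas can be sketched for the W, X, Y tree used in the earlier slides. The importance measure shown is the Birnbaum measure, pr(Top | event occurred) - pr(Top | event did not occur); it is one common choice and is not named in the slides. The sensitivity function varies one event probability by a given percentage either side of its normal value. Function names are hypothetical and independence is assumed.

```python
def pr_top(p):
    """pr(Top) for Top = W(X + Y), assuming independent basic events."""
    return p["W"] * (p["X"] + p["Y"] - p["X"] * p["Y"])


def birnbaum(event, p):
    """Birnbaum importance: change in pr(Top) between event certain and impossible."""
    return pr_top({**p, event: 1.0}) - pr_top({**p, event: 0.0})


def sensitivity(event, p, change=0.10):
    """pr(Top) with one event probability moved below and above its normal value."""
    low = pr_top({**p, event: p[event] * (1 - change)})
    high = pr_top({**p, event: p[event] * (1 + change)})
    return low, high


p = {"W": 0.001, "X": 0.001, "Y": 0.001}

for event in p:
    low, high = sensitivity(event, p)
    print(f"{event}: importance = {birnbaum(event, p):.6f}, "
          f"pr(Top) range for ±10% = {low:.3e} to {high:.3e}")
```

W appears in every minimal cut set, so it has roughly twice the importance of X or Y here, which is the kind of critical item such measures are meant to highlight.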
Fault Tree Analysis – Pros and Cons Advantages • thorough, systematic method • well-defined semantics and clear structure of diagrams • widely accepted and applied • can be used for probabilistic analyses • identifies single points of failure leading to top events Disadvantages • can be difficult to express complex situations, especially those typically found in computer systems • does not identify groups of faults with identical effects