This session focuses on the assessment of essential properties, such as dependability, reliability, availability, robustness, fault tolerance, cyber-security, and safety, for digital equipment used in safety-critical applications. It discusses the real needs, potential failures, mitigation approaches, and the importance of cyber-security and safety in ensuring the reliable operation of digital systems.
Assessment of Digital Equipment for Safety and High Integrity Applications – Session 2 of 6
Assessment Topics, Part 1
Thuy Nguyen and Ray Torok
Joint IAEA - EPRI Workshop on Modernization of Instrumentation and Control Systems in NPPs
3 - 6 October 2006, Vienna, Austria
Essential Properties
Assessment of Digital Equipment for Safety and High Integrity Applications
Essential Properties – Dependability
• Property allowing a well-founded confidence in the ability of a system to correctly provide an expected service
• Different services may be associated with different dependability levels
• Usually includes the following main factors
  • Adequacy of the specified service with respect to the real needs to be addressed by the system
  • Reliability, i.e., the likelihood that the specified service will be provided as specified
  • Availability, i.e., the proportion of time during which the specified service is effectively provided
  • Robustness, i.e., the degree to which the system can provide an acceptable service even in abnormal conditions; also includes
    • Fault tolerance
    • Cyber-security
  • Safety, i.e., the avoidance of failure modes that have unacceptable consequences
  • Maintainability of the preceding factors over the required time period
Adequacy: What Are the Real Needs?
• Often, the dominant cause of failure of highly dependable digital systems is specification faults
  • Functional ambition and complexity of individual services
  • Interdependencies between services
  • Complexity of interfaces and interactions with other equipment or systems, and with human beings
• Typical specification issues
  • Understanding the real needs to be satisfied, the environment of the digital system(s), and / or operational constraints
  • Real needs and system contexts may change over time
  • “Traduttore, traditore” (“translator, traitor”): what is specified may not be exactly what is intended
  • “Intrinsic” errors: incompleteness, ambiguity, inconsistency, ...
Reliability: What Can Go Wrong?
• Random failures
  • Due, e.g., to hardware aging, wear, radiation
  • Effects specific to modern electronic technologies
• Failures caused by manufacturing / installation errors
• Failures caused by maintenance / modification errors
  • Competencies, data collection processes, spares
• Failures caused by incorrect human-system interactions
  • Digital HSIs may be inadequate / too complex (human factors)
  • Digital systems may reduce / mitigate human mistakes
• Digital failures
  • Digital faults: specification faults, design faults, incorrect data
  • Digital failures occur systematically in the same conditions
    • Risk of Common Cause Failures (CCF) of multiple systems or channels
  • Digital failures may originate in individual systems, or in interactions between systems
Digital Faults
• No rigorous means exist to eliminate all digital faults
• Mitigation approaches
  • Fault avoidance
    • Engineering processes, design rules
  • Fault detection & removal
    • Verification & Validation (V&V) processes and rules
  • Tolerance of residual faults (see the voting sketch below)
    • Avoid activation of residual faults
    • Activation of residual faults not resulting in system failure
    • Acceptable system failure modes
• Many types of digital faults
  • Avoidance, elimination, and tolerance approaches may depend on the fault types
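As a minimal, illustrative sketch (not from the original slides), the following shows one common fault-tolerance pattern, 2-out-of-3 voting across redundant channels, in which activation of a residual fault in a single channel does not result in a system-level failure. All names are invented; note that identical software in all channels would remain exposed to CCF.

```python
# Minimal sketch (illustrative): 2-out-of-3 voting, one common way to tolerate
# a residual fault in a single redundant channel. Names are invented.

def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Return the majority of three independent channel trip signals.

    A single faulty channel is outvoted, so one residual fault does not
    cause a system-level failure. (Identical faulty software in all three
    channels, i.e., a CCF, would defeat this measure.)
    """
    return (a and b) or (a and c) or (b and c)

# Channel B produces a spurious trip; the vote still says "no trip".
assert vote_2oo3(False, True, False) is False
# A genuine demand seen by two healthy channels still trips.
assert vote_2oo3(True, True, False) is True
```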
Availability
• Main causes of digital system unavailability
  • Failures
  • Maintenance, periodic testing
  • Repairs, restarts
• Restoring a complex digital system to service often requires more than just hardware repairs or software reboots
• Examples (see the sketch below)
  • Understanding the causes and consequences of the failure
  • Repairing databases jeopardized by the failure
  • Resynchronization with other systems of the infrastructure
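To make the impact concrete, here is a minimal sketch using the standard steady-state availability definition (not from the slides); all numbers are purely illustrative, chosen only to show how the extra recovery steps listed above lengthen the effective repair time.

```python
# Minimal sketch (standard definition, illustrative numbers): steady-state
# availability. The slide's point is that the repair time covers far more than
# a reboot: diagnosing the failure, repairing corrupted databases, and
# resynchronizing with other systems.

def availability(mttf_h: float, mttr_h: float) -> float:
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)

mttf = 50_000.0  # mean time to failure, hours (illustrative)
reboot_only = availability(mttf, 1.0)                # recovery = reboot only
full_restore = availability(mttf, 1.0 + 8 + 4 + 3)   # + diagnosis, DB repair, resync

print(f"reboot only:   {reboot_only:.6f}")   # ~0.999980
print(f"full recovery: {full_restore:.6f}")  # ~0.999680
```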
Robustness
• Ability to maintain the expected service even in abnormal situations
  • Abnormal external situations, including deliberate aggression
  • Internal failures
• Ability to provide “graceful degradation” if the service cannot be maintained
  • Identification and specification of acceptable failure modes (see the watchdog sketch below)
• Self-monitoring
  • Highly reliable, but delicate, electronic components
  • Digital failure modes are sometimes difficult to predict
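One simple way to connect self-monitoring with graceful degradation is a software watchdog that forces a pre-specified safe state when the application stops showing liveness. The sketch below is illustrative only (not from the slides); the timeout, function names, and "safe state" are invented.

```python
# Minimal sketch (illustrative): self-monitoring via a software watchdog,
# with graceful degradation to a failure mode specified in advance.
import time

WATCHDOG_TIMEOUT_S = 0.5   # illustrative deadline for one control cycle
last_kick = time.monotonic()

def kick_watchdog() -> None:
    """Called by the healthy main loop once per cycle to signal liveness."""
    global last_kick
    last_kick = time.monotonic()

def watchdog_expired() -> bool:
    return time.monotonic() - last_kick > WATCHDOG_TIMEOUT_S

def enter_safe_state() -> None:
    # Graceful degradation: fall back to an acceptable failure mode that was
    # identified and specified up front (e.g., outputs driven to the trip
    # position), rather than continuing with unpredictable behavior.
    print("degraded mode: outputs forced to the specified safe state")

# Simulate a hung application: the loop stops kicking the watchdog.
time.sleep(0.6)
if watchdog_expired():
    enter_safe_state()
```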
Cyber Security: Vulnerability Factors
• Need to optimize operation & maintenance of plant systems and I&C equipment
  • Remote operation, remote diagnostics, remote software maintenance, data collection
  • Defenses may need frequent updates, which may adversely affect the other dependability factors
• Use of COTS (Commercial Off-The-Shelf) products
  • “Black boxes”, products with unknown vulnerabilities
• Consequences may be serious
  • Unauthorized modification of critical data and software
  • Loss of confidentiality of information
• Standards exist but need to be adapted
  • Designed mainly for “classical” information systems
Safety
• Ability to avoid / mitigate dangerous failures
• Issues specific to digital systems
• International standards, best practices
Maintenance of Dependability
• Dependability must be maintained despite
  • Commercial obsolescence and aging of I&C components & platforms
  • Modifications in plant systems, other I&C systems, operating procedures and / or requirements
  • Staff turnover
Evaluating Quality & Dependability
Assessment of Digital Equipment for Safety and High Integrity Applications
Rule-Based Approaches
• Due to complexity, the quality, dependability, and safety of digital systems are often difficult to achieve and to assess
  • Particularly when high levels of achievement and confidence are required
• Standards, technical codes, and regulations often specify “how to” requirements
  • It is assumed that complying with the rules helps
• However
  • There is usually no strong guarantee that the desired properties will be achieved to the desired levels
  • The desired properties and levels could be achieved using approaches different from those specified by the rules
  • Rules are often technology- and application-domain-dependent
  • New (regulatory) issues may not be covered by existing rules
Performance-Based Approaches
• Direct justification of quality / dependability / safety “claims”
  • Claims that are difficult to justify can be decomposed (iteratively if necessary) into supposedly simpler sub-claims
  • Final sub-claims are supported by factual evidence
  • The result is a Claim – Argument – Evidence structure (see the sketch below)
• Solves some of the weaknesses of rule-based approaches
  • Good design measures (beyond the generic rules) can be credited and are encouraged
  • Appropriately documented claims, arguments, and evidence can be reviewed by independent assessors
  • Modifications might be easier to assess and justify when a suitable claim-argument-evidence justification already exists
• However
  • More practical experience is still needed
  • Existing systems and products have usually relied on rule-based approaches
  • Switching to performance-based approaches might be economically impractical
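As a minimal sketch (not from the slides), a claim-argument-evidence decomposition can be represented as a tree that independent assessors can walk: each claim either carries direct evidence or is decomposed into sub-claims. All claim texts, field names, and evidence items below are invented for illustration.

```python
# Minimal sketch (illustrative): a claim-argument-evidence tree. A claim is
# "supported" if it has direct factual evidence, or if all sub-claims are.
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    argument: str = ""  # why the sub-claims / evidence are sufficient
    evidence: list[str] = field(default_factory=list)
    subclaims: list["Claim"] = field(default_factory=list)

    def supported(self) -> bool:
        if self.evidence:
            return True
        return bool(self.subclaims) and all(c.supported() for c in self.subclaims)

top = Claim(
    "The trip function is delivered with the required reliability",
    argument="Decomposed by fault type; each type addressed by a dedicated measure",
    subclaims=[
        Claim("No unhandled runtime faults", evidence=["static-analysis report"]),
        Claim("Timing deadlines always met", evidence=["worst-case timing analysis"]),
    ],
)
print(top.supported())  # True: every leaf sub-claim carries evidence
```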
Types of Evidence – Development Process
• Most requirements in rule-based approaches concern the development process
• Good development processes can help, but they are neither necessary nor sufficient
• Most of these requirements represent good practice and should be followed anyway
Types of Evidence – Use of Standards
• Hundreds of software development standards are available
  • No consensus on which development approach is best
  • Usually intended for large-scale software development from scratch
    • Overkill for utility applications; a graded approach is most useful
• Regulators have endorsed standards
  • Basis for selection is not clear, perhaps because “something is better than nothing”
• Use of standards implies a systematic, well-documented development process
  • Reviewers should understand the principles and confirm that standards were
    • Correctly applied and documented
    • Used on the products of interest
Types of Evidence – Rigorous Reasoning
• Systematic “proof” that a (sub-)claim is true
  • High level of confidence, but usually dependent on assumptions that should be clearly stated
• Examples (see the sketch below)
  • Static resource allocation can guarantee that all the required resources will be available when necessary (a priori proof)
  • Formal verification may be used to guarantee freedom from particular “intrinsic software programming faults”, such as index overflows (a posteriori proof)
  • See also Defensive Measures, and Inter-Channel / Inter-System Data Communication and Susceptibility to Digital CCF
• Not applicable to all types of claims and designs
  • Also, usually not applicable to the last stages of system integration
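The sketch below gives only the flavor of the two proof styles named above, using invented names and sizes: storage is allocated once with a fixed bound (an a priori argument that resources cannot run out during operation), and an index invariant is then checked over many states (standing in for what a formal verifier would establish a posteriori).

```python
# Minimal sketch (illustrative): static allocation plus a verified index
# invariant. Names and sizes are invented; a real argument would use a
# formal tool, not a runtime check.

BUF_SIZE = 64
buf = [0.0] * BUF_SIZE   # fixed-size buffer, allocated before operation:
head = 0                 # "out of memory during operation" excluded by design

def push(sample: float) -> None:
    """Store a sample; the index provably stays within [0, BUF_SIZE)."""
    global head
    buf[head] = sample
    head = (head + 1) % BUF_SIZE  # modulo keeps the index in range: this is
                                  # the property a verifier would prove

# Stand-in for an a posteriori proof: the invariant holds in every state seen.
for _ in range(10 * BUF_SIZE):
    push(1.0)
    assert 0 <= head < BUF_SIZE
```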
Types of Evidence – “Sampling” Techniques
• Sampling is a universal practice
  • Testing, simulation
  • Many support tools
• Essential in the later stages of system integration and for validation
• Sufficiency criteria (coverage) and levels may be specified when high levels of achievement and confidence are required
  • But when is enough enough? (see the coverage sketch below)
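One concrete answer to "when is enough enough?" is a stated coverage criterion. The sketch below (not from the slides) shows a toy branch-coverage stopping rule; the trip thresholds, test cases, and bookkeeping are all invented for illustration.

```python
# Minimal sketch (illustrative): a branch-coverage sufficiency criterion for
# a tiny trip function. Thresholds and cases are invented.

def trip_logic(temp_c: float, pressure_bar: float) -> bool:
    if temp_c > 350.0:        # branch A: over-temperature trip
        return True
    if pressure_bar > 155.0:  # branch B: over-pressure trip
        return True
    return False              # branch C: no trip

covered: set[str] = set()

def run_case(temp_c: float, pressure_bar: float) -> None:
    # Record which branch this test case exercises, then run the test.
    if temp_c > 350.0:
        covered.add("A")
    elif pressure_bar > 155.0:
        covered.add("B")
    else:
        covered.add("C")
    trip_logic(temp_c, pressure_bar)

for case in [(400.0, 150.0), (300.0, 160.0), (300.0, 150.0)]:
    run_case(*case)

# Sufficiency criterion: the campaign stops only when every branch is covered.
assert covered == {"A", "B", "C"}, "test campaign incomplete"
```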
Types of Evidence – Experience in Operation
• Some commercial products benefit from large or massive experience in operation
• Necessary conditions
  • Credibility: can we trust the claimed information? How do we know that failures are reported and correctly analyzed?
  • Applicability: is the claimed experience applicable to the product we intend to use and to the expected conditions of use?
  • Sufficient volume of experience (see the sketch below)
• Usually not well suited to programmable products or to products with complex behavior when high levels of achievement and confidence are required
  • May be used as complementary evidence, or as confirmation
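A standard statistical bound (not from the slides) shows what a "sufficient volume" of failure-free experience can and cannot claim: with zero observed failures over T device-hours, an approximate 95% upper confidence bound on the failure rate is ln(20)/T, about 3/T (the "rule of three"). The pooled hours below are illustrative.

```python
# Minimal sketch (standard statistics, illustrative numbers): with zero
# failures over t_hours of credible, applicable operating experience, the
# ~95% upper confidence bound on the failure rate is ln(20)/t ~ 3/t.
import math

t_hours = 500_000.0                 # illustrative pooled operating experience
lam_upper = math.log(20) / t_hours  # ~95% upper bound, failures per hour
print(f"lambda <= {lam_upper:.2e} /h at 95% confidence")  # ~5.99e-06 /h

# The slide's caveat still applies: this says nothing about systematic faults
# triggered only in conditions the installed base never encountered, which is
# why such experience is complementary evidence, not a proof of suitability.
```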
Types of Evidence – Expert Judgment
• In most cases, key aspects of assessments must rely on subjective judgment
  • Trade-offs, acceptance criteria for testing, ...
  • “Fuzzy” properties like ease of understanding, clarity of documentation, ...
• Subjective judgment in an evaluation should be identified
  • So that other experts can state whether they agree
  • Whenever possible, review guidelines should be provided
Conclusion
• Rule-based evaluation approaches are still widely applied
  • And will remain so for the foreseeable future
• Performance-based evaluation approaches can be used where rule-based approaches cannot be applied
  • New technologies (see FPGA)
  • Issues not well covered by current rules (see Inter-Channel / Inter-System Data Communication, and Susceptibility to Digital CCF)
• Wider use of performance-based approaches and increased experience may help improve quality / dependability / safety evaluations