Critical systems, Lecture 6 • Critical systems • Critical systems specification • Critical systems development • Critical systems validation This lecture is based on Chapters 3, 9, 20, 24.
Critical systems • Safety-critical systems • a failure may result in injury, loss of life or major environmental damage • chemical manufacturing plant, aircraft • Mission-critical systems • a failure may result in the failure of some goal-directed activity • navigational system for a spacecraft • Business-critical systems • a failure may result in the failure of the business using that system • customer account system in a bank Critical systems are systems whose failures can result in significant economic losses, physical damage or threats to human life. • High severity of these failures • High cost of these failures
Dependability • Dependability = Trustworthiness • A dependable system is trusted by its users. • Trustworthiness essentially means the degree of user confidence that the system will operate as they expect and that the system will not 'fail' in normal use. • Levels of dependability: not dependable, very dependable, ultra dependable
Failures may occur in: • hardware • software • human mistakes The insulin pump system monitors blood sugar levels and delivers an appropriate dose of insulin. [Figure: insulin pump data-flow, from blood parameters through blood sugar analysis and insulin requirement computation to the insulin delivery controller and pump control commands.] • The system does not need to implement all dependability attributes. • It must be: • Available: to deliver insulin • Reliable: to deliver the correct amount of insulin • Safe: it should not deliver excessive doses of insulin • It need not be: • Secure: it is not exposed to external attacks.
Other dependability properties • Repairability • Reflects the extent to which the system can be repaired in the event of a failure. • Maintainability • Reflects the extent to which the system can be adapted to new requirements. • Survivability • Reflects the extent to which the system can deliver services whilst under hostile attack. • Error tolerance • Reflects the extent to which user input errors can be avoided and tolerated.
Increasing dependability • Increasing the dependability of a system costs a lot, due to additional design, implementation and validation. • High levels of dependability can only be achieved at the expense of system performance. • Extra and redundant code is needed for checking states and recovery. • Reasons for prioritising dependability: • Systems that are unreliable, unsafe or insecure are often unused. • System failure costs may be enormous. • It is difficult to retrofit dependability.
Availability and reliability • Reliability: The ability of a system or component to perform its required function under stated conditions for a specified period of time. • Availability: The degree to which a system or a component is operational and accessible when required for use. • System A: fails once a month; takes 5 minutes to repair. • System B: fails once a year; takes one week to repair. • Some users may tolerate frequent failures as long as the system recovers quickly from these failures.
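The trade-off between the two example systems can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a 365-day year, that repairs never overlap, and that availability is simply the fraction of time the system is up. The function name `availability` is a hypothetical helper, not something from the lecture.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def availability(failures_per_year: float, repair_minutes: float) -> float:
    """Fraction of the year the system is operational (repairs assumed not to overlap)."""
    downtime = failures_per_year * repair_minutes
    return 1.0 - downtime / MINUTES_PER_YEAR


# The system that fails once a month but is repaired in 5 minutes
frequent_but_quick = availability(failures_per_year=12, repair_minutes=5)

# The system that fails once a year but takes one week to repair
rare_but_slow = availability(failures_per_year=1, repair_minutes=7 * 24 * 60)

print(f"monthly failure, 5 min repair : {frequent_but_quick:.6f}")
print(f"yearly failure, 1 week repair : {rare_but_slow:.6f}")
```

Under these assumptions the system that fails more often is still the more available one, because its total downtime per year (one hour) is far smaller than one week.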
Reliability • System failure: An event that occurs at some point in time when the system does not deliver a service as expected by its users. • System error: Erroneous system behaviour where the behaviour of the system does not conform to its specification. • System fault: An incorrect system state, i.e., a system state that is unexpected by the designers of the system. • Human error: Human behaviour that results in the introduction of faults or mistakes into a system.
Approaches to improve reliability • Fault avoidance • Techniques are used that either minimise the possibility of mistakes and/or trap mistakes/faults before these result in the introduction of system faults (use of static analysis to detect faults). • Fault detection and removal • Verification and validation techniques increasing the chances to detect and remove faults before the system is used (systematic testing). • Fault tolerance • Techniques ensuring that faults in a system do not result in system errors or that system errors do not result in system failures (use of self-checking facilities, redundant system modules).
Input/output mapping • Different people use the system in different ways, so one user may encounter a system failure that another user never sees. • Removing X% of the faults in a system will not necessarily improve the reliability by X%. • At IBM: removing 60% of product defects resulted in a 3% improvement in reliability. • Program defects may be in rarely executed sections of the code so may never be encountered by users. • Removing these does not affect the perceived reliability. • A program with known faults may therefore still be seen as reliable by its users. • The reliability of the system depends on the probability that a particular input will lie in the set of inputs that cause erroneous outputs: the lower that probability, the more reliable the system.
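A small numerical sketch shows why removing many faults can change reliability very little. All numbers here are hypothetical, and the model is deliberately crude: each fault is assumed to be triggered by its own disjoint set of inputs, so the failure probability per input is the sum of the per-fault trigger probabilities.

```python
def failure_probability(trigger_probs):
    """Probability that a random input hits some fault (fault input sets assumed disjoint)."""
    return sum(trigger_probs)


# Hypothetical program with ten faults:
# six sit in rarely executed code, four in frequently executed code.
rare_faults = [1e-6] * 6
common_faults = [1e-3] * 4

before = failure_probability(rare_faults + common_faults)

# Remove 60% of the faults, but only the rarely triggered ones.
after = failure_probability(common_faults)

relative_improvement = (before - after) / before
print(f"faults removed       : {len(rare_faults) / 10:.0%}")
print(f"failure prob. before : {before:.6f}")
print(f"failure prob. after  : {after:.6f}")
print(f"relative improvement : {relative_improvement:.2%}")
```

In this made-up case, removing 60% of the faults lowers the failure probability by well under 1%, because the removed faults were almost never encountered in use.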
Safety models • Safety is concerned with ensuring that the system cannot cause damage. • Types of systems • Primary safety-critical software • Software malfunctioning causes hardware malfunction resulting in human injury or environmental damage. • Secondary safety-critical software • Software malfunctioning results in design defects, which in turn pose a threat to humans and the environment.
Unsafe reliable systems • Specification defects • If the system specification is incorrect then the system can behave as specified but still cause an accident. • Hardware failures may cause the system to behave in an unpredictable way • Hard to anticipate in the specification • Correct individual operator inputs may lead to system malfunction in specific contexts • Often the result of operator mistake
Safety terms • Accident (or mishap): An unplanned event or sequence of events which results in human death or injury, damage to property or to the environment. A computer-controlled machine injuring its operator is an example of an accident. • Hazard: A condition with the potential for causing or contributing to an accident. A failure of the sensor that detects an obstacle in front of a machine is an example of a hazard. • Damage: A measure of the loss resulting from a mishap. Damage can range from many people killed as a result of an accident to minor injury or property damage.
Safety terms • Hazard severity: An assessment of the worst possible damage which could result from a particular hazard. Hazard severity can range from catastrophic, where many people are killed, to minor, where only minor damage results. • Hazard probability: The probability of the events occurring which create a hazard. Probability values tend to be arbitrary but range from probable (say 1/100 chance of a hazard occurring) to implausible (no conceivable situations are likely where the hazard could occur). • Risk: This is a measure of the probability that the system will cause an accident. The risk is assessed by considering the hazard probability, the hazard severity and the probability that the hazard will result in an accident.
Assuring safety • Most accidents are due to a combination of malfunctions rather than single failures. • Assuring safety means ensuring either that accidents do not occur or that the consequences of an accident are minimal. • Hazard avoidance • The system is designed so that some classes of hazard simply cannot arise. • A cutting machine avoids the hazard of the operator's hands being in the blade pathway. • Hazard detection and removal • The system is designed so that hazards are detected and removed before they result in an accident. • A chemical plant system detects excessive pressure and opens a relief valve to reduce the pressure before an explosion occurs. • Damage limitation • The system includes protection features that minimise the damage that may result from an accident. • Fire extinguishers in an aircraft engine.
Security • The security of a system is an assessment of the extent to which the system protects itself from external attacks that may be accidental or deliberate: • virus attack • unauthorised use of system services • unauthorised modification of the system and its data. • Security is becoming increasingly important as systems are networked so that external access to the system through the Internet is possible. • Security is an essential pre-requisite for availability, reliability and safety. Security terms • Exposure: Possible loss or harm in a computing system. Analogous to accident. • Vulnerability: A weakness in a computer-based system that may be exploited to cause loss or harm. Analogous to hazard. • Attack: An exploitation of a system vulnerability. • Threats: Circumstances that have potential to cause loss or harm. • Control: A protective measure that reduces a system vulnerability.
Types of damage caused by external attack • Denial of service • Normal services are unavailable or service provision is significantly degraded. • Corruption of programs or data • The programs or data in the system may be modified in an unauthorised way. • Disclosure of confidential information • Information may be exposed to people who are not authorised to read or use that information. For some types of critical system, security is the most important dimension of system dependability, for instance, systems managing confidential information.
Security assurance • Vulnerability avoidance • The system is designed so that vulnerabilities do not occur. • If there is no network connection then external attack is impossible. • Attack detection and elimination • The system is designed so that attacks on vulnerabilities are detected and neutralised before they result in an exposure. • Virus checkers find and remove viruses before they infect a system. • Exposure limitation • The system is designed so that the adverse consequences of a successful attack are minimised. • A backup policy allows damaged information to be restored.
Critical systems specification • The specification for critical systems must be of high quality and accurately reflect the needs of users. • Types of requirements: • System functional requirements • define error checking, recovery facilities and other features • Non-functional requirements • for availability and reliability • 'Shall not' requirements • for safety and security • sometimes decomposed into more specific functional requirements. Examples of 'shall not' requirements: • The system shall not allow users to modify access permissions on any files that they have not created (security). • The system shall not allow reverse thrust mode to be selected when the aircraft is in flight (safety).
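A 'shall not' requirement is often decomposed into a functional check that refuses the forbidden action. The sketch below illustrates this for the reverse thrust example; the class `ThrustController`, its attributes and the exception `SafetyInterlockError` are hypothetical names invented for this illustration, not part of any real avionics interface.

```python
class SafetyInterlockError(Exception):
    """Raised when a request would violate a 'shall not' requirement."""


class ThrustController:
    """Hypothetical controller used only to illustrate the requirement."""

    def __init__(self) -> None:
        self.in_flight = False
        self.reverse_thrust_selected = False

    def select_reverse_thrust(self) -> None:
        # Requirement: reverse thrust shall not be selectable while in flight.
        if self.in_flight:
            raise SafetyInterlockError("reverse thrust refused: aircraft is in flight")
        self.reverse_thrust_selected = True


controller = ThrustController()
controller.in_flight = True
try:
    controller.select_reverse_thrust()
except SafetyInterlockError as err:
    print(err)
```

The check is made before the state changes, so a refused request leaves `reverse_thrust_selected` untouched.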
Stages of risk-based analysis • Risk identification: Identify potential risks that may arise. • Risk analysis and classification: Assess the seriousness of each risk. • Risk decomposition: Decompose risks to discover their potential root causes. • Risk reduction assessment: Define how each risk must be eliminated or reduced when the system is designed. Applicable to any dependability attribute.
Risk identification • Identify the risks faced by the critical system. • In safety-critical systems, the risks are the hazards that can lead to accidents. • In security-critical systems, the risks are the potential attacks on the system. Example risks for the insulin pump: • Insulin overdose (service failure). • Insulin underdose (service failure). • Power failure due to exhausted battery (electrical). • Electrical interference with other medical equipment (electrical). • Poor sensor and actuator contact (physical). • Parts of machine break off in body (physical). • Infection caused by introduction of machine (biological). • Allergic reaction to materials or insulin (biological).
Risk analysis and classification • The process is concerned with understanding the likelihood that a risk will arise and the potential consequences if an accident or incident should occur. Acceptability levels: • Unacceptable region: risk cannot be tolerated. • ALARP (as low as reasonably practicable) region: risk tolerated only if risk reduction is impractical or grossly expensive. • Acceptable region: negligible risk.
Risk analysis and classification • Estimate the risk probability and the risk severity. • The aim must be to identify risks that are likely to arise or that have high severity.
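One simple way to turn estimated probability and severity into an acceptability level is a scoring matrix. The sketch below is an assumption made for illustration only: the three-point scale, the product score and the thresholds are all hypothetical, and a real project would take its classification scheme from the applicable standard or from domain experts.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def classify(probability: Level, severity: Level) -> str:
    """Map a (probability, severity) estimate to an acceptability region.

    The score and thresholds are illustrative assumptions, not a standard.
    """
    score = probability * severity  # possible values: 1, 2, 3, 4, 6, 9
    if score >= 6:
        return "unacceptable"
    if score >= 3:
        return "ALARP"
    return "acceptable"


# Hypothetical estimates, chosen only to exercise each region
print(classify(Level.MEDIUM, Level.HIGH))  # unacceptable
print(classify(Level.LOW, Level.HIGH))     # ALARP
print(classify(Level.LOW, Level.MEDIUM))   # acceptable
```

A rare but very severe risk still lands in the ALARP region here, which reflects the aim of identifying risks that are either likely or of high severity.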
Risk decomposition, fault-tree technique • A deductive top-down technique concerned with discovering the root causes of risks in a particular system. • Put the risk or hazard at the root of the tree and identify the system states that could lead to that hazard. • Where appropriate, link these with ‘and’ or ‘or’ conditions. • A goal should be to minimise the number of single causes of system failure.
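A fault tree can be represented as nested 'and'/'or' gates over basic events, which makes it easy to find single causes of the top-level hazard. The tree and event names below are hypothetical and exist only for this illustration; the representation (a string for a basic event, a `(gate, children)` tuple for a gate) is one possible choice among many.

```python
def occurs(node, active_events):
    """Return True if the hazard at this node arises given the active basic events."""
    if isinstance(node, str):          # leaf: a basic event
        return node in active_events
    gate, children = node              # inner node: ("and" | "or", [children])
    results = [occurs(child, active_events) for child in children]
    return all(results) if gate == "and" else any(results)


def basic_events(node):
    """Collect every basic event that appears in the tree."""
    if isinstance(node, str):
        return {node}
    _, children = node
    return set().union(*(basic_events(child) for child in children))


def single_causes(tree):
    """Basic events that trigger the root hazard entirely on their own."""
    return sorted(e for e in basic_events(tree) if occurs(tree, {e}))


# Hypothetical tree: the hazard arises if the sensor fails, OR if a
# computation error happens AND the output check misses it.
hazard_tree = ("or", [
    "sensor_failure",
    ("and", ["computation_error", "output_check_missed"]),
])

print(single_causes(hazard_tree))
```

Events guarded by an 'and' gate do not show up as single causes, which is why adding an independent check reduces the number of single points of failure.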
Risk reduction assessment • The aim of this process is to identify dependability requirements that specify how the risks should be managed and ensure that accidents/incidents do not arise. • Risk reduction strategies • Risk avoidance: the system is designed so that the risk or hazard cannot arise • Risk detection and removal: the system is designed so that risks are detected and neutralised before they result in an accident. • Damage limitation: the system is designed so that the consequences of an accident are minimised.
Safety specification • The safety requirements of a system should be separately specified. • These requirements should be based on an analysis of the possible hazards and risks. • The system specification should be formulated so that the hazards are unlikely to result in an accident.
Security specification • Has some similarities to safety specification • Not possible to specify security requirements quantitatively; • The requirements are often ‘shall not’ rather than ‘shall’ requirements. • Differences • No well-defined notion of a security life cycle for security management; No standards; • Generic threats rather than system specific hazards; • Mature security technology (encryption, etc.). • The dominance of a single supplier (Microsoft) means that huge numbers of systems may be affected by security failure.
The security specification process • The approach to security analysis is based around the assets to be protected and their value to an organisation. • Asset identification: The assets (data and programs) and their required degree of protection are identified. • Threat analysis and risk assessment: Possible security threats are identified and the risks associated with each of these threats are estimated. • Threat assignment: Identified threats are related to the assets so that, for each identified asset, there is a list of associated threats. • Technology analysis: Available security technologies and their applicability against the identified threats are assessed. • Security requirements specification: The security requirements are specified. Where appropriate, these will explicitly identify the security technologies that may be used to protect against different threats to the system. [Figure: the security specification process, with intermediate outputs: system asset list, threat and risk matrix, asset and threat description, security technology analysis, security requirements.]
Types of security requirement • Identification requirements • Should the system identify users before interacting with them? • Authentication requirements • How are users identified? • Authorisation requirements • The privileges and access permissions of the users • Immunity requirements • How to protect the system against threats • Integrity requirements • How to avoid data corruption • Intrusion detection requirements • How to detect attacks • Non-repudiation requirements • A party in a transaction cannot deny its involvement • Privacy requirements • How to maintain data privacy • Security auditing requirements • How system use can be audited and checked • System maintenance security requirements • How not to accidentally destroy security
System reliability specification • Hardware reliability • What is the probability of a hardware component failing and how long does it take to repair that component? • Software reliability • How likely is it that a software component will produce an incorrect output? Software failures are different from hardware failures in that software does not wear out. It can continue in operation even after an incorrect result has been produced. • Operator reliability • How likely is it that the operator of a system will make an error?
Functional reliability requirements • A predefined range for all values that are input by the operator shall be defined and the system shall check that all operator inputs fall within this predefined range. • The system must use N-version programming to implement the braking control system. • The system must be implemented in a safe subset of Ada and checked using static analysis. Functional reliability requirements specify how failures may be avoided or tolerated.
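The first requirement above (a predefined range for every operator input) maps directly onto a validation step. The following is a minimal sketch: the input names and ranges in `ALLOWED_RANGES` are hypothetical values chosen for the example, and `check_operator_input` is an illustrative helper rather than an established API.

```python
# Hypothetical predefined ranges for operator inputs (inclusive bounds)
ALLOWED_RANGES = {
    "target_speed_kmh": (0, 120),
    "brake_pressure_bar": (0, 10),
}


def check_operator_input(name: str, value: float) -> float:
    """Return the value unchanged if it is inside its predefined range, else raise."""
    if name not in ALLOWED_RANGES:
        # An input without a predefined range violates the requirement itself.
        raise KeyError(f"no predefined range for input '{name}'")
    low, high = ALLOWED_RANGES[name]
    if not (low <= value <= high):
        raise ValueError(f"{name}={value} is outside the range [{low}, {high}]")
    return value


print(check_operator_input("target_speed_kmh", 80))
try:
    check_operator_input("brake_pressure_bar", 25)
except ValueError as err:
    print(err)
```

Rejecting inputs that have no defined range, rather than letting them through, keeps the check aligned with the wording "all values that are input by the operator".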
Non-functional reliability specification • An appropriate reliability metric should be chosen to specify the overall system reliability. • Reliability measurements do NOT take the consequences of failure into account. Non-functional reliability requirements are expressed quantitatively.
Failure classification • Transient: Occurs only with certain inputs. • Permanent: Occurs with all inputs. • Recoverable: System can recover without operator intervention. • Unrecoverable: Operator intervention needed to recover from failure. • Non-corrupting: Failure does not corrupt system state or data. • Corrupting: Failure corrupts system state or data. • The types of failure are system specific and the consequences of a system failure depend on the nature of that failure. • When specifying reliability, it is not just the number of system failures that matters but the consequences of these failures.
Steps to a reliability specification • For each sub-system, analyse the consequences of possible system failures. • From the system failure analysis, partition failures into appropriate classes. • Transient, permanent, recoverable, unrecoverable, corrupting, non-corrupting • Severity • For each failure class identified, set out the reliability using an appropriate metric. • POFOD (probability of failure on demand) = 0.002 for transient failures • POFOD = 0.00002 for permanent failures • Identify functional reliability requirements to reduce the chances of critical failures.
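A probability-of-failure-on-demand target can be compared against test observations per failure class. The sketch below uses the two targets quoted above; the observed failure and demand counts are hypothetical, and a simple ratio like this says nothing about statistical confidence, which a real reliability assessment would also need.

```python
def failure_on_demand_probability(failures: int, demands: int) -> float:
    """Observed fraction of service demands that ended in failure."""
    if demands <= 0:
        raise ValueError("at least one demand is needed")
    return failures / demands


# Targets per failure class, taken from the specification step above
targets = {"transient": 0.002, "permanent": 0.00002}

# Hypothetical test campaign: (failures observed, demands made)
observed = {"transient": (3, 2000), "permanent": (1, 2000)}

results = {}
for failure_class, (failures, demands) in observed.items():
    measured = failure_on_demand_probability(failures, demands)
    results[failure_class] = measured <= targets[failure_class]
    print(f"{failure_class}: measured {measured:.5f}, "
          f"target {targets[failure_class]:.5f}, met={results[failure_class]}")
```

In this made-up campaign the transient target is met while the permanent one is not, which shows why each failure class needs its own figure.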
Approaches to developing dependable software • Fault avoidance • Design and implementation process to minimise faults • Fault detection • V&V to discover and remove faults • Fault tolerance • Detect unexpected behaviour and prevent system failure A fault-tolerant system is a system that can continue in operation after some system faults have manifested themselves. The goal of fault tolerance is to ensure that system faults do not result in system failure. Software cost: The cost of producing fault-free software is very high. It is only cost-effective in exceptional situations. It is often cheaper to accept software faults and pay for their consequences than to expend resources on developing fault-free software.
Techniques for developing fault-free software • Dependable software processes • Quality management • Formal specification • Static verification • Strong typing • Safe programming • Protected information • Information hiding and encapsulation
Dependable processes • A dependable software development process is well defined, repeatable and includes a spectrum of verification and validation activities, irrespective of the people involved in the process.
Process activities for fault avoidance and detection • Requirements inspections. • Requirements management. • Model checking. • Internal, external (dynamic and static models are consistent) • Design and code inspection. • Static analysis. • Test planning and management. • Configuration management.
Programming constructs and techniques that contribute to fault avoidance and fault tolerance • Design for simplicity • Exception handling • Information hiding • Minimise the use of unsafe programming constructs. Unsafe constructs: • Floating-point numbers • Pointers • Dynamic memory allocation • Parallelism • Recursion • Interrupts • Inheritance • Aliasing • Unbounded arrays • Default input processing Some standards for safety-critical systems development completely prohibit the use of some of these constructs.
Fault tolerance actions • Fault detection • The system must detect that a fault (an incorrect system state) has occurred or will occur. • Damage assessment • The parts of the system state affected by the fault must be detected and assessed. • Fault recovery • The system must restore its state to a known safe state. • Fault repair • The system may be modified to prevent recurrence of the fault. As many software faults are transitory, this is often unnecessary.
Fault detection and damage assessment • Define constraints that must hold for all legal states. • Check the state against these constraints. • Checksums are used for damage assessment in data transmission. • Redundant pointers can be used to check the integrity of data structures. • Watchdog timers can check for non-terminating processes. If there is no response after a certain time, a problem is assumed. • Preventative fault detection: The fault detection mechanism is initiated before the state change is committed. If an erroneous state is detected, the change is not made. • Retrospective fault detection: The fault detection mechanism is initiated after the system state has been changed.
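Preventative fault detection can be sketched as a state constraint that is checked before a change is committed. The example below borrows the insulin pump setting, but the class `DoseLog`, the limits and the exception name are all hypothetical values invented for illustration.

```python
class StateConstraintError(Exception):
    """Raised when a proposed state change would break a state constraint."""


class DoseLog:
    """Hypothetical record of insulin delivered today (units)."""

    MAX_SINGLE_DOSE = 5    # assumed limit, for illustration only
    MAX_DAILY_DOSE = 25    # assumed limit, for illustration only

    def __init__(self) -> None:
        self.daily_total = 0

    def deliver(self, dose: int) -> None:
        proposed_total = self.daily_total + dose
        # Preventative detection: evaluate the constraints on the proposed
        # state first, and only commit the change if they all hold.
        if not 0 <= dose <= self.MAX_SINGLE_DOSE:
            raise StateConstraintError(f"single dose {dose} out of range")
        if proposed_total > self.MAX_DAILY_DOSE:
            raise StateConstraintError(f"daily total {proposed_total} too high")
        self.daily_total = proposed_total


log = DoseLog()
log.deliver(3)
try:
    log.deliver(10)
except StateConstraintError as err:
    print(err)
print(log.daily_total)
```

Because the check runs before the assignment, a rejected dose leaves the recorded total unchanged; a retrospective check would instead update the state and then have to recover.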
Fault recovery and repair • Backward recovery • Restore the system state to a known safe state. • Forward recovery • Apply repairs to a corrupted system state and set the system state to the intended value.
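Backward recovery is commonly implemented with checkpoints: save a copy of a known safe state, apply the change, and roll back if the result turns out to be invalid. The following is a minimal sketch under that assumption; `CheckpointedState` and `apply_with_backward_recovery` are hypothetical names, and the account example is invented.

```python
import copy


class CheckpointedState:
    """Holds a mutable state plus a saved copy of the last known safe state."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)


def apply_with_backward_recovery(store, update, is_valid):
    """Apply `update` in place; restore the checkpoint if the new state is invalid."""
    store.checkpoint()
    update(store.state)
    if not is_valid(store.state):   # retrospective check after the change
        store.rollback()
        return False
    return True


account = CheckpointedState({"balance": 100})
ok = apply_with_backward_recovery(
    account,
    update=lambda s: s.update(balance=s["balance"] - 150),
    is_valid=lambda s: s["balance"] >= 0,
)
print(ok, account.state)
```

Forward recovery would instead repair the corrupted state directly, which requires knowing what the intended value was.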
Hardware fault tolerance • Depends on triple-modular redundancy (TMR). • There are three replicated identical components that receive the same input and whose outputs are compared. • If one output is different, it is ignored and component failure is assumed. [Figure: components A1, A2 and A3 feed an output comparator, which produces the output.]
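The output comparator in a TMR arrangement is essentially a majority voter. The sketch below models it in software as an illustration only; `vote` and `NoMajorityError` are hypothetical names, and real comparators for numeric outputs often need a tolerance rather than exact equality.

```python
from collections import Counter


class NoMajorityError(Exception):
    """Raised when no output value is produced by a majority of components."""


def vote(outputs):
    """Return the value produced by a strict majority of the replicated components."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise NoMajorityError(f"no majority among {outputs}")
    return value


# Three replicas receive the same input; one has failed and disagrees.
print(vote([42, 42, 17]))
```

With three components this masks any single faulty output; if all three disagree, the voter cannot choose and signals the problem instead.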
N-version programming • As in hardware systems, the output comparator is a simple piece of software that uses a voting mechanism to select the output.
Recovery blocks [Figure: recovery blocks. Try Algorithm 1 and run the acceptance test; continue execution if the acceptance test succeeds; if the acceptance test fails, retry with Algorithm 2 and then Algorithm 3, re-testing each time; signal an exception if all algorithms fail.] The different system versions are designed and implemented by different teams.
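The recovery block scheme can be sketched as a loop that tries alternative algorithms in order and accepts the first result that passes the acceptance test. The function names and the square-root example below are hypothetical, chosen only to make the control flow concrete.

```python
import math


class AllAlgorithmsFailed(Exception):
    """Signalled when no algorithm produces a result that passes the test."""


def recovery_block(algorithms, acceptance_test, x):
    """Try each algorithm in turn; return the first result that is accepted."""
    for algorithm in algorithms:
        try:
            result = algorithm(x)
        except Exception:
            continue                      # a crash counts as a failed attempt
        if acceptance_test(x, result):
            return result                 # continue execution with this result
    raise AllAlgorithmsFailed(f"no acceptable result for input {x}")


def faulty_sqrt(x):
    return x / 2                          # deliberately wrong for most inputs


def library_sqrt(x):
    return math.sqrt(x)


def is_square_root(x, result):
    return abs(result * result - x) < 1e-9


print(recovery_block([faulty_sqrt, library_sqrt], is_square_root, 9.0))
```

Unlike voting, the alternatives run one after another, so the scheme relies on an acceptance test that is good enough to reject wrong results.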
Validation of critical systems • Verification and validation (V&V) for critical systems is shaped by: • High costs of failure • High cost of V&V • V&V costs take up more than 50% of the total system development costs.