Reliability Engineering
William W. McMillan
28 March 2013
In non-functional requirements, what are some of the reliability targets that might be defined?
General Approaches
• Avoiding faults
  • Develop in a way to prevent faults.
  • Careful specification and programming.
• Detecting faults
  • Formal verification
  • Extensive testing
• Tolerating faults
  • Run-time response to faults
  • Recover and proceed
Diminishing Returns
• Cost to catch each error goes up dramatically as more and more are caught.
• Considered impossible to catch all errors.
  • Especially in systems with complex interactions among modules, with hardware, or between threads.
• “Six Sigma” aims at 3.4 defects per 1 million items.
  • From Motorola, used by GE and others.
  • Spec limit is 6 SDs away from the mean of the measure.
  • E.g., spec is 1000 ± 0.6; if the mean is 1000, SD ≤ 0.1.
  • Still not perfect!
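The Six Sigma arithmetic above can be checked numerically. This is a minimal sketch that assumes a normally distributed measure and the conventional 1.5-SD long-term drift of the mean, which is how the 3.4-per-million figure is usually derived. The drift is not stated on the slide, and the name `upper_tail` is illustrative.

```python
import math

def upper_tail(z: float) -> float:
    """Probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Slide example: spec is 1000 +/- 0.6 with the mean at 1000.
spec_half_width = 0.6
max_sd = spec_half_width / 6  # largest SD that keeps the limit 6 SDs away

# Assumption: the mean drifts by 1.5 SDs, leaving 4.5 SDs to the nearer limit.
defects_per_million = upper_tail(6 - 1.5) * 1_000_000

# With no drift, both tails beyond 6 SDs together are far rarer.
centered_per_million = 2 * upper_tail(6) * 1_000_000

print(f"max SD: {max_sd:.3f}")
print(f"defects per million with 1.5-SD shift: {defects_per_million:.1f}")
print(f"defects per million, centered: {centered_per_million:.4f}")
```

Either way the defect rate is small but not zero, which is the point of "still not perfect."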
What Six Sigma goal could be defined for software reliability?
Redundancy
• Multiple versions of the software.
  • N-version programming
  • Different developers
  • Different languages and libraries
• Installations on multiple hardware platforms.
• Multiple methods to verify software.
• Multiple sets of eyes on code.
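N-version programming is often paired with a voter that compares the versions' outputs. The sketch below is one possible voter, not a standard API: `majority_vote` and the three integer square root functions are hypothetical, and one version is deliberately wrong to show it being outvoted.

```python
import math
from collections import Counter
from typing import Callable, Sequence

def majority_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on x and return the result a strict majority agrees on."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a version that crashes simply casts no vote
    if not results:
        raise RuntimeError("every version failed")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(versions):
        raise RuntimeError(f"no majority among results {results}")
    return value

# Three hypothetical, independently written integer square roots.
def isqrt_library(n: int) -> int:
    return math.isqrt(n)

def isqrt_search(n: int) -> int:
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r

def isqrt_buggy(n: int) -> int:
    return n // 2  # deliberately wrong for most inputs

print(majority_vote([isqrt_library, isqrt_search, isqrt_buggy], 16))
```

Requiring a strict majority of all versions, not just of those that answered, means a single surviving version cannot win the vote by default.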
How would you use redundancy in creating software to set off water sprinklers for fire suppression?
Observation
• Process
  • Documented, archived, standardized
• Monitoring at runtime
  • Performance: time, space, transmission rates
  • Inconsistencies between versions or measures
  • Deadlocks
  • Memory access problems
  • Failure of assertions
  • State of hardware
• Keep a trace.
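"Keep a trace" can be as small as a decorator that records each call, its outcome, and how long it took. This is a sketch under assumptions: the names `TRACE` and `traced`, the tuple layout, and the 1000-entry limit are all illustrative choices.

```python
import functools
import time
from collections import deque

# Bounded so the trace itself cannot exhaust memory; oldest entries fall off.
TRACE = deque(maxlen=1000)

def traced(func):
    """Record (name, status, detail, elapsed seconds) for every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            TRACE.append((func.__name__, "error", repr(exc),
                          time.perf_counter() - start))
            raise  # tracing must not hide the failure
        TRACE.append((func.__name__, "ok", repr(result),
                      time.perf_counter() - start))
        return result
    return wrapper

@traced
def divide(a, b):
    return a / b

divide(6, 3)
try:
    divide(1, 0)
except ZeroDivisionError:
    pass

for entry in TRACE:
    print(entry)
```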
Runtime Recovery
• Exception handling is critical.
  • Record state and problem.
  • Run diagnostic routines.
  • Reset hardware.
  • Return to functional state.
• Might have different versions “vote.”
• Can sometimes reduce performance and still do job.
  • Slow down data transmission.
  • Throw away some packets.
  • Disable some functions.
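The record, reset, and resume sequence above might look like the following. It is a sketch only: `run_with_recovery`, its parameters, and the logger name are hypothetical, and a real system would catch narrower exception types than `Exception`.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("recovery")

def run_with_recovery(operation: Callable[[], T],
                      reset: Callable[[], None],
                      max_attempts: int = 3) -> T:
    """Try operation; after each failure record it, reset, and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Record the problem before touching anything else.
            log.error("attempt %d failed: %r", attempt, exc)
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            reset()  # e.g. run diagnostics or reset hardware
    raise ValueError("max_attempts must be at least 1")
```

A caller would pass the hardware-specific steps in as `operation` and `reset`, which keeps the retry policy separate from the device code.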
Backup or Protection System
• Runs in parallel with primary system.
• Simpler than primary system.
• Monitors sensors (possibly alternate ones), performance, etc.
• Can intervene to:
  • Shut something down.
  • Start emergency actions (fire suppression, brakes, alarms…).
  • Take control from primary to get into safe state.
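Because the protection system should be simpler than the primary, its decision logic can be a few comparisons. The sketch below assumes the monitor sees a heartbeat age and one temperature reading; the field names, limits, and action strings are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    heartbeat_age_s: float  # seconds since the primary last reported in
    temperature_c: float    # reading from the monitor's own sensor

# Hypothetical limits chosen only for illustration.
MAX_HEARTBEAT_AGE_S = 2.0
MAX_TEMPERATURE_C = 90.0

def protective_action(s: Snapshot) -> str:
    """Return the intervention, if any, that the backup system should make."""
    if s.temperature_c > MAX_TEMPERATURE_C:
        return "start_emergency_action"  # hazard present: act regardless of primary
    if s.heartbeat_age_s > MAX_HEARTBEAT_AGE_S:
        return "take_control"            # primary is silent: move to a safe state
    return "none"
```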
What kinds of systems could not function well with degraded performance?
Programming Practices
• Validate data.
  • Range checks
  • Consistency checks
    • E.g., a car in “park” is not going 50 mph.
• Encapsulate.
  • Use good languages
  • Object-oriented design or similar
  • Private data
  • Simple interfaces
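Range and consistency checks like the car-in-park example can be sketched as below. The `VehicleReading` type, the gear names, and the 200 mph ceiling are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

GEARS = {"park", "reverse", "neutral", "drive"}
MAX_SPEED_MPH = 200.0  # hypothetical upper bound for a plausible reading

@dataclass
class VehicleReading:
    gear: str
    speed_mph: float

def validate(reading: VehicleReading) -> List[str]:
    """Return a list of problems; an empty list means the reading looks sane."""
    problems = []
    # Range checks: each field on its own.
    if reading.gear not in GEARS:
        problems.append(f"range: unknown gear {reading.gear!r}")
    if not 0.0 <= reading.speed_mph <= MAX_SPEED_MPH:
        problems.append(f"range: speed {reading.speed_mph} out of bounds")
    # Consistency check: fields against each other.
    if reading.gear == "park" and reading.speed_mph > 0.0:
        problems.append("consistency: moving while in park")
    return problems

print(validate(VehicleReading("park", 50.0)))
```

Returning every problem found, instead of stopping at the first, gives the audit trail more to work with.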
Programming Practices
• Control memory access
  • Array bounds
  • Pointers
• Handle exceptions
  • Throw specific exception types and info.
• Use assertions
  • Throw exception when one fails.
• Time out when waiting for resource.
• Install switches for debug mode, audit trails.
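Three of these practices (a specific exception type, an assertion, and a timeout on a resource) fit in one short sketch. `ResourceTimeout` and `use_resource` are hypothetical names; `threading.Lock.acquire(timeout=...)` is the standard-library call that does the waiting.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

class ResourceTimeout(Exception):
    """Raised when a named resource does not become available in time."""
    def __init__(self, resource: str, timeout_s: float):
        super().__init__(f"timed out after {timeout_s}s waiting for {resource}")
        self.resource = resource    # carried along so handlers can react
        self.timeout_s = timeout_s

def use_resource(lock: threading.Lock, resource: str,
                 timeout_s: float, action: Callable[[], T]) -> T:
    """Run action while holding lock, giving up after timeout_s seconds."""
    # A failed assertion raises AssertionError (unless Python runs with -O).
    assert timeout_s > 0, "timeout must be positive"
    if not lock.acquire(timeout=timeout_s):
        raise ResourceTimeout(resource, timeout_s)
    try:
        return action()
    finally:
        lock.release()  # always give the resource back
```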
Programming Practices
• Check versions of other components.
• Define hierarchy of hardware needed.
  • Alternate ports, sensors, actuators,…
  • Alternate storage devices
  • Move to another if there’s a problem.
• Make UI bulletproof
  • Consistency
  • Data types and ranges
  • Keep in sandbox
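A hardware hierarchy can be represented as an ordered list that the program walks until one device works. In this sketch `read_first_available` is a hypothetical helper, and device failures are assumed to surface as `OSError`.

```python
from typing import Callable, Sequence, Tuple

def read_first_available(
    sources: Sequence[Tuple[str, Callable[[], float]]]
) -> Tuple[str, float]:
    """Try each source in priority order; return the first reading that succeeds."""
    failures = []
    for name, read in sources:
        try:
            return name, read()
        except OSError as exc:
            failures.append(f"{name}: {exc}")  # remember why we moved on
    raise RuntimeError("all sources failed: " + "; ".join(failures))
```

Returning the name of the source along with the value lets the caller log that it is running on a backup device.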
Programming Practices
• Beware of recursion.
  • Can be inefficient.
  • Can blow the stack.
• Beware of interrupts.
  • Device might send interrupt and halt a time-critical operation.
• Program should have a plan for a full data structure.
  • Buffer
  • Disk file
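One possible plan for a full buffer is to discard the oldest item and count the loss so it can be reported. The class below is a sketch of that single policy, not the only reasonable one; `DropOldestBuffer` is a hypothetical name and the class is not thread-safe.

```python
from collections import deque

class DropOldestBuffer:
    """Fixed-capacity buffer that discards the oldest item when full."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._items = deque()
        self._capacity = capacity
        self.dropped = 0  # how many items the full-buffer policy has discarded

    def put(self, item) -> None:
        if len(self._items) == self._capacity:
            self._items.popleft()  # explicit policy: lose the oldest data
            self.dropped += 1
        self._items.append(item)

    def drain(self) -> list:
        """Remove and return everything currently buffered, oldest first."""
        items = list(self._items)
        self._items.clear()
        return items

buf = DropOldestBuffer(capacity=3)
for value in range(5):
    buf.put(value)
print(buf.drain(), buf.dropped)
```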
Think of a language that would not support these programming practices well. How would you use that language so as to overcome its deficiencies?
Measures of Reliability
• Mean time between failures
• Probability of failure on demand
  • When service is requested, how often is it not given?
• Percent time available
  • E.g., web services
• Percent of completed operations
  • Initiated by the program, e.g.,
    • Step of motor, writing to port, saving data item,…
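The first three measures reduce to simple ratios. The function names and the sample figures below are made up for illustration.

```python
def mean_time_between_failures(operating_hours: float, failures: int) -> float:
    """Average operating time per failure."""
    if failures < 1:
        raise ValueError("need at least one failure to compute MTBF")
    return operating_hours / failures

def probability_of_failure_on_demand(failed_requests: int,
                                     total_requests: int) -> float:
    """Fraction of service requests that were not satisfied."""
    if total_requests < 1:
        raise ValueError("need at least one request")
    return failed_requests / total_requests

def percent_available(uptime_hours: float, downtime_hours: float) -> float:
    """Share of total time the system was up, as a percentage."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total time must be positive")
    return 100.0 * uptime_hours / total

# Made-up observations for one month of a hypothetical service.
print(mean_time_between_failures(720.0, 4))
print(probability_of_failure_on_demand(2, 1000))
print(percent_available(719.0, 1.0))
```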
Measures of Reliability
• Percent of data acquired
  • E.g., reading from stream, how many values lost?
• Average quality of data
  • E.g., video
• Percent time that status bits are not as expected.
• …?
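Percent of data acquired can be estimated when the stream numbers its values. The sketch below assumes consecutive integer sequence numbers, which the slide does not specify, and it cannot see losses before the first or after the last value received.

```python
from typing import Iterable

def percent_acquired(sequence_numbers: Iterable[int]) -> float:
    """Percent of expected values received, judged by gaps in sequence numbers."""
    received = set(sequence_numbers)  # duplicates count once
    if not received:
        raise ValueError("no values received")
    expected = max(received) - min(received) + 1
    return 100.0 * len(received) / expected

print(percent_acquired([1, 2, 4, 5]))
```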
Think of some other reliability measures that might be useful.