Recovery Oriented Software
Joao Magalhães
Advisors: Arndt von Staa, Carlos J. P. Lucena
Motivation
• Writing software that can coexist with bugs is as important as trying to avoid them.
• It is not a matter of “if your software will fail”, but “when it will fail”.
• Effort should be spent on minimizing the consequences when that moment comes.
© LES/PUC-Rio
Motivation
• We need mechanisms which:
  • prevent bugs through
    • specification
    • architecture
    • design
    • coding standards
    • quality control prior to and after deployment
  • allow software to coexist with them:
    • trap and examine bugs when they first occur.
What is Recovery Oriented Software?
• Recovery Oriented Software (ROS) takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved
• This perspective is supported both by historical evidence and by recent studies on the main sources of outages in production systems
What is Recovery Oriented Software?
• The key ideas behind ROS are:
  • Concentrate on minimizing the number of failures in your software
  • Concentrate on reducing Mean Time to Repair (MTTR), thus offering higher availability
  • Concentrate on minimizing the consequences of failures
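The link between MTTR and availability can be made explicit with the standard steady-state availability formula. The slides do not state it; MTTF (mean time to failure) and the numbers below are introduced here only for illustration.

$$
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
$$

For example, with an assumed MTTF of 1000 hours, an MTTR of 1 hour gives $A = 1000/1001 \approx 0.9990$, while cutting MTTR to 0.1 hours (6 minutes) gives $A = 1000/1000.1 \approx 0.9999$. Availability improves tenfold in terms of downtime without removing a single failure.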
What is Recovery Oriented Software?
• The main axioms:
  • “It is impossible to build perfect software”
  • “We are only human, born to make mistakes”. And, yes, this also applies to software
  • “Software failures can be tolerated (to some extent) if their consequences are minimized”
    • When MS Word fails, do you get mad because of the software failure, or because of the possible loss of work?
• Some consequences to be analyzed:
  • Loss of life
  • Damage to equipment, ecology, enterprise
  • Loss of money
  • Loss of work
  • Time to restart
  • Time to restore the previous “workbench”
Building Recovery Oriented Software
• So, how do we build recovery oriented software?
• There are four important points:
  • Fault Prevention Effort
  • Fault Detection Effort
  • Fault Handling Effort
  • Fault Removal Effort
Building Recovery Oriented Software
• Fault Prevention Effort
  • Effort spent during development to avoid anomalies (bugs) in the software
  • Effort in good design
    • Allow for the use of stubs and mocks
  • Effort in tests (automated or not)
    • Generation of test cases
    • Validation of test cases
  • Use of Design by Contract
Building Recovery Oriented Software
• Fault Detection Effort
  • Effort spent during development to detect faults at runtime
  • Hardware dedicated to fault detection
  • Data-structure validators
  • Self-test algorithms
  • Use of Design by Contract with executable assertions turned on
  • Software redundancy and/or hardware redundancy
    • Comparison of the results obtained from different sources can indicate problems
  • Use of oracles to predict expected measurements (in control systems)
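Two of the detection techniques above, data-structure validators and comparison of redundant results, can be sketched in a few lines of Python. This is only an illustrative sketch: `BoundedQueue`, `disagreeing_sources`, and the median-based comparison rule are assumptions made for the example, not something the presentation prescribes.

```python
from statistics import median


class BoundedQueue:
    """A fixed-capacity FIFO queue whose invariants can be checked at runtime."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, item):
        # Deliberately does not enforce the capacity, so that the
        # validator below has a realistic bug to detect.
        self.items.append(item)

    def validate(self):
        """Data-structure validator: return the list of violated invariants."""
        problems = []
        if self.capacity <= 0:
            problems.append("capacity must be positive")
        if len(self.items) > self.capacity:
            problems.append("queue holds more items than its capacity")
        return problems


def disagreeing_sources(readings, tolerance):
    """Compare redundant readings and name those that stray from the median.

    `readings` maps a source name to its measured value.
    """
    mid = median(readings.values())
    return sorted(name for name, value in readings.items()
                  if abs(value - mid) > tolerance)
```

A validator like this can run periodically or after each operation; a non-empty result signals a possible fault without stating how to handle it, which keeps detection separate from handling.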
Building Recovery Oriented Software
• Fault Handling Effort
  • Effort spent during development to handle detected faults at runtime
  • Handling means recovering as gracefully as possible
  • Sometimes it is impossible to fully recover from an error
    • Degraded operation
    • Minimizing loss of data
  • Some examples:
    • Software redundancy and/or hardware redundancy
      • May avoid service interruption
    • Code that restores the system to a valid state
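"Code that restores the system to a valid state" can take the form of a checkpoint and rollback. The Python sketch below is one possible shape under stated assumptions: `RecoverableStore`, its `is_valid` predicate, and the account example are hypothetical names chosen for illustration.

```python
import copy


class RecoverableStore:
    """Keeps a checkpoint of the last known-valid state and rolls back to it."""

    def __init__(self, state, is_valid):
        self.state = state
        self._is_valid = is_valid            # hypothetical validity predicate
        self._checkpoint = copy.deepcopy(state)
        self.damaged_state = None            # kept for manual inspection
        self.degraded = False                # callers may reduce service

    def apply(self, operation):
        """Run `operation` on the state; restore the checkpoint if it fails."""
        try:
            operation(self.state)
            if not self._is_valid(self.state):
                raise ValueError("operation left the state invalid")
        except Exception:
            # Fault handling: set the damaged state aside and keep serving
            # from the last valid checkpoint.
            self.damaged_state = self.state
            self.state = copy.deepcopy(self._checkpoint)
            self.degraded = True
            return False
        self._checkpoint = copy.deepcopy(self.state)
        return True


def withdraw(amount):
    def operation(account):
        account["balance"] -= amount
    return operation


account = RecoverableStore({"balance": 10}, lambda s: s["balance"] >= 0)
account.apply(withdraw(4))     # valid: becomes the new checkpoint
account.apply(withdraw(100))   # invalid: rolled back to the checkpoint
```

Keeping `damaged_state` instead of discarding it is a design choice: it loses no data and leaves room for a person to attempt a manual recovery later.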
Building Recovery Oriented Software
• Fault Removal Effort
  • Once a fault has been detected, the first idea would be to fix it.
  • However, removing the fault can sometimes be too expensive compared to its impact on the system
  • An alternative is to try to coexist with the fault
  • Of course, this definitely cannot be applied to every fault!
Building Recovery Oriented Software
• Fault detection and fault handling require runtime effort
  • It is a continuous process
• Fault detection is possible without human intervention
  • Or, at least, the detection of a possible fault can be automated, since the user may still analyze the available data and decide that no fault is really present.
• Fault handling may require human intervention
  • Sometimes, the user may want to keep a corrupted state in order to try to recover manually (and reduce the loss of data).
Building Recovery Oriented Software
• From our previous experience in developing high availability software, we believe that balancing the resources spent on each effort is what makes the difference in achieving the desired level of quality
• But, once software is built, how do we determine whether the desired level of quality has been achieved? Moreover, would it be possible to define a software development process that, by construction, guarantees that level of quality?
• This is yet to be defined...
What extra effort does building recovery oriented software require?
• This is difficult to prove without a great number of experiments (of our own), and there are only a few.
• However, our feeling is that the “extra” effort spent on some issues ends up reducing the effort spent on other activities.
Are there technologies available for building recovery oriented software?
• We had good experiences with some existing technologies and practices
  • Software components
  • Design by contract
  • Mock elements
  • Extreme Programming (especially pair programming)
  • Strict coding discipline
• And also with some tools
  • Subversion/CVS
  • Eclipse
  • Valgrind
Software components
• Structuring software in small components:
  • Provides a better level of control over development complexities
  • Provides a better level of control over fault detection
  • Enhances the chances of isolating (existing) anomalies
  • Enhances the chances of recovering gracefully from faults
  • Enhances the chances of reuse (thus allowing components to mature naturally as time goes by)
Design by Contract
• Using contracts and executable assertions:
  • Significantly lengthens the design and coding phases
  • Dramatically shortens the test and acceptance (homologation) phases
  • Enhances fault detection capacity
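As a rough illustration of contracts with executable assertions, the Python sketch below checks a precondition and a postcondition around a function. The `contract` decorator, the `ContractViolation` exception, and the `CONTRACTS_ENABLED` switch are hypothetical names invented for this example; Python has no built-in Design by Contract support.

```python
import functools

CONTRACTS_ENABLED = True  # hypothetical switch: assertions "turned on"


class ContractViolation(Exception):
    """Raised when a precondition or postcondition does not hold."""


def contract(pre=None, post=None):
    """Attach an executable precondition and postcondition to a function."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if CONTRACTS_ENABLED and pre is not None and not pre(*args, **kwargs):
                raise ContractViolation(f"precondition of {func.__name__} failed")
            result = func(*args, **kwargs)
            if (CONTRACTS_ENABLED and post is not None
                    and not post(result, *args, **kwargs)):
                raise ContractViolation(f"postcondition of {func.__name__} failed")
            return result
        return wrapper
    return decorate


@contract(pre=lambda values: len(values) > 0,
          post=lambda result, values: min(values) <= result <= max(values))
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)
```

Calling `mean([])` raises `ContractViolation` at the call site instead of a `ZeroDivisionError` deep inside the function, which is the kind of early fault detection the slide refers to.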
Mock Elements
• Using mock elements:
  • Allows independent testing of components and groups of components
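A minimal sketch of testing a component in isolation, using `Mock` from Python's standard `unittest.mock` module. The `Thermostat` component and its `sensor` and `heater` collaborators are hypothetical, made up for the example.

```python
from unittest.mock import Mock


class Thermostat:
    """Component under test; depends on a sensor and a heater."""

    def __init__(self, sensor, heater):
        self.sensor = sensor
        self.heater = heater

    def regulate(self, target):
        # Switch the heater according to the current reading.
        if self.sensor.read_celsius() < target:
            self.heater.switch_on()
        else:
            self.heater.switch_off()


def test_heater_switches_on_when_cold():
    # Mocks stand in for hardware that may not exist yet.
    sensor = Mock()
    sensor.read_celsius.return_value = 15.0
    heater = Mock()

    Thermostat(sensor, heater).regulate(target=20.0)

    heater.switch_on.assert_called_once_with()
    heater.switch_off.assert_not_called()
```

The test exercises `Thermostat` without any real sensor or heater, so a failure points at the component itself rather than at its collaborators.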
Extreme Programming
• Using pair programming:
  • Enhances quality and reduces the number of bugs in complex code
Strict coding discipline
• Using a strict coding discipline:
  • Creates a uniform code appearance, thus reducing the effort spent in understanding code.
  • Enhances coding productivity by reducing the number of coding problems
Subversion/CVS
• Using CVS/Subversion:
  • Provides the basis for team development
  • Keeps track of code changes and enhancements
  • Provides tools for release control
Eclipse
• Using Eclipse:
  • Provides a uniform interface to Subversion/CVS, even across different operating systems
  • Works as a workbench where plugins (such as CDT and Maven) can be installed to enhance productivity and automate tasks
Valgrind
• Using Valgrind:
  • A tool for UNIX systems that checks for memory leaks and memory access violations
Conclusions
• As existing tools are not effective enough to build bug-free software, building recovery oriented systems is the best that can be done.
• Balancing the efforts spent on fault prevention, fault detection, fault handling, and fault removal seems to be key to developing ROS
• Some tools, practices, and technologies contribute to building ROS