Recovery Oriented Software

  1. Recovery Oriented Software • Joao Magalhães • Advisors: Arndt von Staa, Carlos J. P. Lucena

  2. Motivation • Writing software that can coexist with bugs is as important as trying to avoid them. • The question is not “if your software will fail”, but “when it will fail”. • Effort should be spent on minimizing the consequences when that moment comes. © LES/PUC-Rio

  3. Motivation • We need mechanisms that: • prevent bugs through specification, architecture, design, coding standards, and quality control before and after deployment • allow software to coexist with bugs by trapping and examining them when they first occur.

  4. What is Recovery Oriented Software? • ROS takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved • This perspective is supported both by historical evidence and by recent studies on the main sources of outages in production systems

  5. What is Recovery Oriented Software? • The key ideas behind ROS are: • Concentrate on minimizing the number of failures in your software • Concentrate on reducing Mean Time to Repair (MTTR), thus offering higher availability • Concentrate on minimizing the consequences of failures
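
The link between MTTR and availability on slide 5 can be illustrated with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is illustrative only: the function name and the figures (1000 hours between failures, 10 hours versus 1 hour to repair) are assumptions, not measurements from the presentation.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: same failure rate, two different repair times.
slow_repair = availability(mttf_hours=1000.0, mttr_hours=10.0)
fast_repair = availability(mttf_hours=1000.0, mttr_hours=1.0)

print(f"MTTR 10 h: {slow_repair:.4%}")
print(f"MTTR  1 h: {fast_repair:.4%}")
```

With these assumed numbers, cutting MTTR tenfold raises availability from roughly 99.0% to roughly 99.9% without making failures any rarer.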

  6. What is Recovery Oriented Software? • The main axioms: • “It is impossible to build perfect software” • “We are only human, born to make mistakes”. And, yes, this also applies to software • “Software failures can be tolerated (to some extent) if their consequences are minimized” • When MS Word fails, do you get mad because of the software failure, or because of the possible loss of work? • Some consequences to be analyzed: • Loss of life • Damage to equipment, the environment, or the enterprise • Loss of money • Loss of work • Time to restart • Time to restore the previous “workbench”

  7. Building Recovery Oriented Software • So, how do we build recovery oriented software? • Four kinds of effort matter: • Fault Prevention Effort • Fault Detection Effort • Fault Handling Effort • Fault Removal Effort

  8. Building Recovery Oriented Software • Fault Prevention Effort • Effort spent during development to avoid anomalies (bugs) in software • Effort in good design • Allow for the use of stubs and mocks • Effort in tests (automated or not) • Generation of test cases • Validation of test cases • Use of Design by Contract

  9. Building Recovery Oriented Software • Fault Detection Effort • Effort spent during development so that faults can be detected at runtime. • Hardware dedicated to fault detection • Data-structure validators • Self-test algorithms • Use of Design by Contract with executable assertions turned on • Software redundancy and/or hardware redundancy • Comparing the results obtained from different sources can reveal problems • Use of oracles to predict expected measurements (in control systems)
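
Comparing results from redundant sources, as slide 9 suggests, could look like the minimal sketch below. It flags any source whose reading strays too far from the median of all readings. The function name, the sensor names, the readings, and the tolerance are all hypothetical; the presentation does not prescribe a particular comparison rule.

```python
from statistics import median


def detect_disagreement(readings: dict[str, float], tolerance: float) -> list[str]:
    """Return the names of sources whose reading deviates from the median
    of all readings by more than `tolerance`."""
    if len(readings) < 3:
        raise ValueError("need at least three sources to single out a faulty one")
    reference = median(readings.values())
    return sorted(
        name for name, value in readings.items()
        if abs(value - reference) > tolerance
    )


# Hypothetical temperature readings from three redundant sensors.
suspects = detect_disagreement(
    {"sensor_a": 20.1, "sensor_b": 19.9, "sensor_c": 35.0},
    tolerance=1.0,
)
print(suspects)
```

The median is used as the reference because a single wildly wrong source cannot shift it, which a plain average would not guarantee.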

  10. Building Recovery Oriented Software • Fault Handling Effort • Effort spent during development so that detected faults can be handled at runtime. • Handling means recovering as gracefully as possible • Sometimes it is impossible to fully recover from an error • Degraded operation • Minimizing loss of data • Some examples: • Software redundancy and/or hardware redundancy • May avoid service interruption • Code that restores the system to a valid state
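
One way to write "code that restores the system to a valid state" is a checkpoint taken before a risky operation, with a rollback if the operation raises or leaves the state invalid. The sketch below assumes a toy `Inventory` component and a hypothetical faulty operation; it is not taken from the presentation.

```python
import copy


class Inventory:
    """Toy component whose invariant is: no item count is negative."""

    def __init__(self) -> None:
        self.stock: dict[str, int] = {}

    def is_valid(self) -> bool:
        return all(count >= 0 for count in self.stock.values())


def apply_with_rollback(inventory: Inventory, operation) -> bool:
    """Run `operation`; if it raises or breaks the invariant, restore the
    last known valid state. Returns True on success, False after rollback."""
    checkpoint = copy.deepcopy(inventory.stock)
    try:
        operation(inventory)
        if not inventory.is_valid():
            raise RuntimeError("invariant violated")
        return True
    except Exception:
        inventory.stock = checkpoint  # recover to the checkpointed state
        return False


def buggy_withdraw(inv: Inventory) -> None:
    # Hypothetical faulty operation: removes more items than are in stock.
    inv.stock["bolts"] -= 50


inv = Inventory()
inv.stock["bolts"] = 10
ok = apply_with_rollback(inv, buggy_withdraw)
print(ok, inv.stock)
```

The caller learns that the operation failed, which leaves room for degraded operation, while the component itself stays usable.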

  11. Building Recovery Oriented Software • Fault Removal Effort • Once a fault has been detected, the first idea would be to fix it. • However, removing the fault can sometimes be too expensive compared to its impact on the system • An alternative could be to try to coexist with the fault • Of course, this definitely cannot be applied to every fault!

  12. Building Recovery Oriented Software • Fault detection and fault handling require runtime effort • It is a continuous process • Fault detection is possible without human intervention • Or, at least, the detection of a possible fault can be automated; the user may then analyze the available data and decide that no fault is really present. • Fault handling may require human intervention • Sometimes the user may want to keep a corrupted state in order to try to recover manually (and reduce the loss of data).

  13. Building Recovery Oriented Software • From our previous experience developing high-availability software, we believe that balancing the resources spent on each effort is what makes the difference in achieving the desired level of quality • But once software is built, how do we determine whether the desired level of quality has been achieved? Moreover, would it be possible to define a software development process that, by construction, guarantees that level of quality? • This is yet to be defined...

  14. What extra effort does building recovery oriented software require? • This is difficult to prove without a large number of experiments, and we have run only a few of our own. • However, our impression is that the “extra” effort spent on some issues ends up reducing the effort spent on other activities.

  15. Are there technologies available for building recovery oriented software? • We had good experiences with some existing technologies and practices • Software components • Design by Contract • Mock elements • Extreme Programming (especially pair programming) • Strict coding discipline • And also with some tools • Subversion/CVS • Eclipse • Valgrind

  16. Software components • Structuring software into small components: • Provides a better level of control over development complexity • Provides a better level of control over fault detection • Enhances the chances of isolating (existing) anomalies • Enhances the chances of recovering gracefully from faults • Enhances the chances of reuse (thus allowing components to mature naturally over time)

  17. Design by Contract • Using contracts and executable assertions: • Significantly increases the effort in the design and coding phases • Dramatically reduces the effort in the test and acceptance phases • Enhances fault detection capability
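
A contract with executable assertions can be sketched as a decorator that checks a precondition before the call and a postcondition after it. Everything here (`contract`, `ContractViolation`, the `enabled` switch, and the `mean` example) is a hypothetical illustration, not an API from the presentation or from a specific library.

```python
import functools


class ContractViolation(Exception):
    """Raised when a precondition or postcondition does not hold."""


def contract(pre=None, post=None, enabled=True):
    """Attach an executable precondition and postcondition to a function.

    `pre` receives the call arguments; `post` receives the result followed
    by the call arguments. With enabled=False the checks are skipped.
    """
    def decorate(func):
        if not enabled:
            return func

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise ContractViolation(f"precondition of {func.__name__} failed")
            result = func(*args, **kwargs)
            if post is not None and not post(result, *args, **kwargs):
                raise ContractViolation(f"postcondition of {func.__name__} failed")
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda values: len(values) > 0,
    post=lambda result, values: min(values) <= result <= max(values),
)
def mean(values):
    return sum(values) / len(values)


print(mean([2.0, 4.0]))
try:
    mean([])
except ContractViolation as error:
    print("detected:", error)
```

Raising a dedicated exception instead of using Python's `assert` keeps the checks active even when the interpreter runs with `-O`, which strips `assert` statements; the `enabled` flag is the explicit switch for turning them off.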

  18. Mock Elements • Using mock elements: • Allows components and groups of components to be tested independently
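
As a minimal sketch of independent testing with mock elements, the example below uses Python's standard `unittest.mock.Mock` to replace two collaborators of a component. The `AlarmMonitor` class, its `sensor` and `notifier` dependencies, and the readings are hypothetical names chosen for illustration.

```python
from unittest.mock import Mock


class AlarmMonitor:
    """Component under test: raises an alarm when a reading exceeds a limit.
    It depends on a sensor and a notifier supplied by the caller."""

    def __init__(self, sensor, notifier, limit: float) -> None:
        self.sensor = sensor
        self.notifier = notifier
        self.limit = limit

    def check(self) -> bool:
        value = self.sensor.read()
        if value > self.limit:
            self.notifier.send(f"limit exceeded: {value}")
            return True
        return False


# Mocks stand in for the real sensor and notifier components.
sensor = Mock()
sensor.read.return_value = 120.0
notifier = Mock()

monitor = AlarmMonitor(sensor, notifier, limit=100.0)
alarm_raised = monitor.check()

# Verify the interaction with the collaborator, not just the return value.
notifier.send.assert_called_once_with("limit exceeded: 120.0")
print(alarm_raised)
```

Because the collaborators are injected, the monitor can be exercised before a real sensor or notifier exists, and fault conditions that are hard to provoke on real hardware can be simulated by changing `return_value`.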

  19. Extreme Programming • Using pair programming: • Improves quality and reduces the number of bugs in complex code

  20. Strict coding discipline • Using a strict coding discipline: • Creates a uniform code appearance, reducing the effort spent in understanding the code. • Enhances productivity by reducing the number of coding problems

  21. Subversion/CVS • Using CVS/Subversion: • Provides the basis for team development • Keeps track of code changes and enhancements • Provides tools for release control

  22. Eclipse • Using Eclipse: • Provides a uniform interface for Subversion/CVS across different operating systems • Works as a workbench where plugins such as CDT and Maven can be installed to enhance productivity and automate tasks

  23. Valgrind • Using Valgrind: • Provides a tool for UNIX systems that checks for memory leaks and memory access violations

  24. Conclusions • Since there are no tools efficient enough to build bug-free software, building recovery oriented systems is the best that can be done. • Balancing the effort spent on fault prevention, fault detection, fault handling, and fault removal seems to be key to developing ROS • Some tools, practices, and technologies contribute to building ROS

  25. References • Design for Testability for Object Oriented Software. Jeffery E. Payne et al. Object Magazine, 2001. • Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. David Patterson et al. http://www.stanford.edu/~candea/papers/roc_vision/roc_vision.html • Towards a Fault Tolerant Multi-Agent System Architecture. Sanjeev Kumar, Phillip Cohen. Agents 2000. • Generation of Self-Testing Components. Leonardo Mariani et al., 2003.

  26. References • Toward Systematic Design of Fault-Tolerant Systems. Algirdas Avizienis. IEEE, 1997. • Merging Components and Testing Tools: The Self-Testing COTS Components (STECC) Strategy. Sami Beydeda et al., 2004. • Endo-Testing: Unit Testing with Mock Objects. T. Mackinnon et al. XP2000. • Mocks Aren't Stubs. Martin Fowler. Martin Fowler's Blog, 2004.
