Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software. Subhachandra Chandra, Peter M. Chen, University of Michigan. Presentation – Lin Tan. Published in DSN 2000.
Hypothesis • Most faults in released applications are transient [Jim Gray86] • Transient faults are harder to reproduce and to debug • Can generic recovery techniques survive most application faults without using application-specific information?
Methodology • Classify software faults into 3 types • Only one type can be survived using generic recovery techniques • How many faults are of this type? • Study a subset of faults in 3 applications • Apache – widely used HTTP server • Gnome – desktop environment • MySQL – multithreaded SQL database server • Conclusions
Fixed environment -> deterministic execution • Given a fixed operating environment, a set of concurrent, sequential processes is completely deterministic. [Dijkstra 72]
Software Fault Classification • Environment-independent - Deterministic • Long URL • Environment-dependent • Environment-dependent non-transient (Subjective) • Disk full • Environment-dependent transient (Subjective) • Race condition
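The three-way classification on this slide can be sketched as a small decision function. This is only an illustration: the names `FaultClass`, `classify`, and `generic_recovery_may_help`, and the two yes/no questions used as inputs, are assumptions made for the sketch and do not come from the paper, where the transient versus non-transient call is a subjective judgment.

```python
from enum import Enum


class FaultClass(Enum):
    ENVIRONMENT_INDEPENDENT = "environment-independent"
    ENV_DEPENDENT_NONTRANSIENT = "environment-dependent non-transient"
    ENV_DEPENDENT_TRANSIENT = "environment-dependent transient"


def classify(depends_on_environment: bool, likely_gone_on_retry: bool) -> FaultClass:
    """Map two (hypothetical) yes/no judgments onto the slide's three classes."""
    if not depends_on_environment:
        # Same input always triggers the fault, whatever the environment.
        return FaultClass.ENVIRONMENT_INDEPENDENT
    if likely_gone_on_retry:
        # The triggering condition is unlikely to recur on re-execution.
        return FaultClass.ENV_DEPENDENT_TRANSIENT
    # The triggering condition persists until someone changes the environment.
    return FaultClass.ENV_DEPENDENT_NONTRANSIENT


def generic_recovery_may_help(fault: FaultClass) -> bool:
    """Only transient faults are candidates for application-generic recovery."""
    return fault is FaultClass.ENV_DEPENDENT_TRANSIENT


# The slide's three examples, encoded with the judgments described above.
EXAMPLES = {
    "long URL": classify(depends_on_environment=False, likely_gone_on_retry=False),
    "disk full": classify(depends_on_environment=True, likely_gone_on_retry=False),
    "race condition": classify(depends_on_environment=True, likely_gone_on_retry=True),
}
```

Under these assumptions, only the race-condition example falls into the class that a generic retry could plausibly survive.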
Program Operating Environment • Software • Other programs • Kernel • Hardware • ECC errors • Interrupts • Thread scheduler • Timing of workload requests: typing speed • User Input: • part of the program • NOT part of the environment
Selection of Bugs • Apache: 50 bugs out of 5220 bug reports • Severe or critical bugs • Gnome: 45 bugs out of 500 bug reports • Only in core files, libraries, and four commonly used Gnome applications • MySQL: 44 bugs out of 44,000 messages from the mailing list • Serious bugs
Example Bugs • Apache • Long URL causes overflow. • MySQL • Lack of file descriptors. • Gnome • Race condition between a request for action from an applet and its removal. • Race condition between an image viewer and a property editor.
Limitations & Discussions • May differ for other applications • Only 3 applications • Only manually studied reported severe bugs (50/5220, 45/500, 44/44,000) • Use automated tools? • Better to implement a general recovery approach and verify the results.
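To put the sample sizes in perspective, the fractions quoted on this slide can be computed directly. The counts are the ones given in the slide; the dictionary layout and the name `sampled_fraction` are just for illustration.

```python
# (bugs studied, reports or messages examined), as quoted on the slide.
SAMPLES = {
    "Apache": (50, 5220),
    "Gnome": (45, 500),
    "MySQL": (44, 44_000),
}

# Fraction of the available reports that ended up in the study.
sampled_fraction = {
    app: studied / total for app, (studied, total) in SAMPLES.items()
}

for app, fraction in sampled_fraction.items():
    print(f"{app}: {fraction:.2%} of reports studied")
```

The study covers roughly 1% of the Apache reports, 9% of the Gnome reports, and 0.1% of the MySQL messages, which is why the slide asks whether automated tools could widen the coverage.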
Limitations & Discussions • Why so few transient faults? • People tend not to report transient bugs? • The study ignores how often each bug occurs • More reliable systems have more transient bugs?
Related Work • 5-13%: timing or synchronization related in the MVS OS and the DB2 and IMS databases. [Sullivan91, Sullivan92] • 14%: timing and race conditions in the Tandem GUARDIAN OS. [Lee and Iyer 93] • 29%: transient and could be recovered by the Tandem process-pair mechanism. [Lee and Iyer 93]
Conclusions • Classical application-generic recovery techniques, such as process pairs, used without application-specific information, will NOT be sufficient to enable these applications to survive most software faults.
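The conclusion can be illustrated with a toy retry loop. This is a minimal sketch and not the process-pair mechanism itself: plain re-execution stands in for generic recovery, and `URL_LIMIT`, `handle_url`, `FlakyEnvironment`, and `recover_by_retry` are hypothetical names invented for the example.

```python
URL_LIMIT = 8  # hypothetical limit standing in for a fixed-size buffer


def handle_url(url):
    """Environment-independent fault: the same long input always fails."""
    if len(url) > URL_LIMIT:
        raise ValueError("URL too long")
    return "served " + url


class FlakyEnvironment:
    """Transient fault: fails until the simulated environment condition clears."""

    def __init__(self, failures_before_clearing):
        self.remaining_failures = failures_before_clearing

    def handle(self, request):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("transient condition")
        return "served " + request


def recover_by_retry(operation, request, max_attempts=3):
    """Generic recovery: re-execute the same request, knowing nothing about the app."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation(request)
        except Exception as exc:
            last_error = exc
    # Every attempt failed, so the fault survives generic recovery.
    raise last_error
```

Retrying `FlakyEnvironment(1).handle` succeeds on the second attempt, while retrying `handle_url` with an over-long URL fails on every attempt because re-execution replays the same input. In the study's terms, only the first kind of fault is helped by application-generic recovery.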