Software Defects and their Impact on System Availability -- A study of field failures in operating systems IBM T.J. Watson 1991 Presenter: Shan Lu
Why software defects? • More severe than hardware defects • Software causes 60% of outages [Gray’90] • Not well understood or studied • Different characteristics from hardware • A bug cannot be compared with a faulty hardware component
Why ‘field’ failures? • Field failures • Failures that happen in production runs • Different from defects detected in development & testing • Reflect the real-world ‘impact’
Overview • Analyze field failures in operating systems • Get statistics on • Impact of errors • Error type breakdown • Error triggering breakdown • Failure symptom distribution • Others • Use these results to guide future research
Outline • Motivation • Overview • Data source • Design • Analysis results • These results indicate … • Related work
Data source • RETAIN database • Remote Technical Assistance Information Network • APAR (Authorized Program Analysis Report) • Contains: symptoms, context & environment, how to fix • Standard attributes: severity (1-4), HIPER, IPL • Manually extract • Error type • Error trigger • Symptom • Sample the APARs
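A minimal sketch of how one sampled APAR could be held in memory for this kind of analysis. The class name `AparRecord` and all field names are hypothetical; they only mirror the attributes listed on the slide and are not RETAIN's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AparRecord:
    """Hypothetical record for one APAR; field names are illustrative only."""
    apar_id: str
    text: str        # free text: symptoms, context & environment, how to fix
    severity: int    # standard attribute, 1 to 4
    hiper: bool      # standard attribute: report carries the HIPER flag
    ipl: bool        # standard attribute: report carries the IPL flag
    # The three fields below are filled in by hand after reading the text.
    error_type: Optional[str] = None
    error_trigger: Optional[str] = None
    symptom: Optional[str] = None

    def __post_init__(self):
        # Reject severities outside the 1..4 range named on the slide.
        if not 1 <= self.severity <= 4:
            raise ValueError("severity must be between 1 and 4")

    def is_classified(self) -> bool:
        """True once all three manually extracted fields are present."""
        return None not in (self.error_type, self.error_trigger, self.symptom)
```

Keeping the manually extracted fields optional makes it easy to see which sampled reports still need to be read and classified.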
Overlay errors and general errors • Overlay errors • Errors that cause a storage overlay (memory corruption) • Hard to find and fix • Big impact on availability • Sample set obtained by keyword search • General errors • All errors, including overlay errors • Sample set obtained by random sampling • The two sets are compared
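The two sampling strategies on this slide can be sketched as follows. The report texts, the keywords, and the helper names `keyword_sample` and `random_sample` are all made up for illustration; the slides do not say which keywords were actually used.

```python
import random


def keyword_sample(reports, keywords):
    """Return the reports whose text mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [r for r in reports if any(k in r["text"].lower() for k in lowered)]


def random_sample(reports, k, seed=0):
    """Return up to k reports drawn uniformly without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(reports, min(k, len(reports)))


# Toy reports, invented for this example (not data from the study).
reports = [
    {"id": "R1", "text": "Storage overlay after buffer was freed"},
    {"id": "R2", "text": "Job loops forever on restart"},
    {"id": "R3", "text": "Control block overlaid by adjacent copy"},
    {"id": "R4", "text": "Wrong message printed"},
]

overlay_set = keyword_sample(reports, ["overlay", "overlaid"])
general_set = random_sample(reports, 2)
```

Because the general set is drawn from all reports, it can also contain overlay errors, which matches the slide's definition of general errors.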
Error Type • Classes are orthogonal and large enough for confidence • 13 types in total • Overlay: 8 • Allocation management • Pointer management • Copy overrun • Regular: plus 6 • Semantic errors • Synchronization error • Unclassified
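To make the "copy overrun" type concrete, here is a small simulation. It is only an analogy: a single `bytearray` stands in for storage, with two adjacent 8-byte buffers inside it, and the helper names are hypothetical. The study's operating-system code is not shown in the slides.

```python
def unchecked_copy(arena, dest_offset, data):
    """Copy data into the arena without checking the destination size (the bug)."""
    arena[dest_offset:dest_offset + len(data)] = data


def checked_copy(arena, dest_offset, dest_size, data):
    """Same copy, but refuse data that does not fit the destination buffer."""
    if len(data) > dest_size:
        raise ValueError("copy would overrun the destination buffer")
    arena[dest_offset:dest_offset + len(data)] = data


# Buffer A occupies bytes 0..7 and buffer B occupies bytes 8..15.
arena = bytearray(16)
arena[8:16] = b"NEIGHBOR"

# Ten bytes are copied into the 8-byte buffer A; the last two land in buffer B.
unchecked_copy(arena, 0, b"0123456789")
corrupted_neighbor = bytes(arena[8:16])
```

After the unchecked copy, buffer B no longer holds what its owner wrote, and nothing fails at the moment of the copy. That delay between the error and its symptom is one reason such overlays are hard to find.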
Error triggering events • Boundary conditions • Bug fixes • Client code • Recovery or error handling • Timing • Unknown
Symptom codes • ABEND • Addressing error (may restart) • Endless wait • Incorrect output • Incorrect output without detecting the failure • Loop • OS goes into an infinite loop; needs restart • Message • Error message printed; local recovery, no ABEND
Impact • Do overlay errors have more impact?
Error Type of Overlay Errors • Which is most common? • Copying Overrun (20%) • Allocation Mgmt. (19%) • Which has the most impact? • Allocation Mgmt. (31% HIPERs, 17% IPLs) • Pointer Mgmt. (16% HIPERs, 27% IPLs) • More about copying overrun • Less impact (13% HIPERs, 5% IPLs) • Why?
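A rough sketch of how per-type figures like these could be tallied from classified reports. It assumes each percentage is read as "the fraction of reports of that type carrying the flag", which is one plausible reading of the slide rather than something it states. The function name and the four toy reports are invented and are not the study's data.

```python
from collections import Counter


def impact_by_type(reports):
    """Per error type: share of the sample, plus HIPER and IPL rates within the type."""
    totals = Counter(r["error_type"] for r in reports)
    hipers = Counter(r["error_type"] for r in reports if r["hiper"])
    ipls = Counter(r["error_type"] for r in reports if r["ipl"])
    n = len(reports)
    return {
        t: {
            "share": totals[t] / n,
            "hiper_rate": hipers[t] / totals[t],
            "ipl_rate": ipls[t] / totals[t],
        }
        for t in totals
    }


# Toy classified reports, invented for this example.
sample = [
    {"error_type": "Allocation Mgmt.", "hiper": True, "ipl": False},
    {"error_type": "Allocation Mgmt.", "hiper": False, "ipl": True},
    {"error_type": "Copying Overrun", "hiper": False, "ipl": False},
    {"error_type": "Copying Overrun", "hiper": False, "ipl": False},
]

stats = impact_by_type(sample)
```

Separating "share" from the two rates keeps the slide's two questions apart: which type occurs most often, and which type hurts most when it does occur.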
Error Type of Regular Errors • Who will dominate? • [Chart: regular errors by type; labels: Overlay Error, Administrative Err. (Semantic Err.), Synchr. Error (?), Others] • Impact • HIPERs: Overlay 14%; Undefined State 49% • IPLs: Overlay 4%; Synchr. 70% • [Chart labels: Copying Overrun, Type mismatch, Undefined State]
Error Triggering Events • What’s your guess? • Mostly timing-related problems? (Heisenbugs) • Breakdown • What does it tell us?
What else can we do? • Dig more information out of RETAIN • Do better classification • Ask more interesting questions • Do similar analysis on different applications • Try similar things for open-source code
What does the data tell us? • Test case design • Test boundary conditions • Test recovery code • Bug detection • Memory bug detectors • Synchronization bugs • Tools that help fix bugs
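The two test-design suggestions above can be illustrated with a small sketch. `FixedBuffer` is a hypothetical component written only for this example: one test probes sizes at and around the capacity limit, and the other exercises the error path and checks that the component is still usable afterwards.

```python
class FixedBuffer:
    """Hypothetical component under test: an append-only buffer with a hard capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def write(self, chunk):
        # Reject the whole write if it would not fit; state is left untouched.
        if len(self.data) + len(chunk) > self.capacity:
            raise OverflowError("write would exceed capacity")
        self.data += chunk


def test_boundaries():
    """Writes of size 0, capacity - 1 and exactly capacity must all succeed."""
    for size in (0, 7, 8):
        buf = FixedBuffer(8)
        buf.write(b"x" * size)
        assert len(buf.data) == size


def test_recovery_path():
    """A rejected write must leave the buffer unchanged and still usable."""
    buf = FixedBuffer(8)
    buf.write(b"abcd")
    try:
        buf.write(b"x" * 5)  # 4 + 5 = 9, one byte past capacity
    except OverflowError:
        pass
    else:
        raise AssertionError("expected OverflowError")
    assert bytes(buf.data) == b"abcd"
    buf.write(b"efgh")  # exact fit after the failed attempt
    assert len(buf.data) == 8
```

Both functions use plain `assert`, so they can be called directly or collected by a test runner such as pytest.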
Something Related • National Vulnerability Database • Bugzilla (Mozilla, 1998)