
Shimin Chen LBA Reading Group Presentation

Triage: Diagnosing Production Run Failures at the User’s Site. J. Tucek, S. Lu, C. Huang, S. Xanthos, Y. Zhou, SOSP’07.



  1. Triage: Diagnosing Production Run Failures at the User’s Site. J. Tucek, S. Lu, C. Huang, S. Xanthos, Y. Zhou, SOSP’07. Shimin Chen, LBA Reading Group Presentation

  2. Motivation • Software failures at end users’ sites • Released SW still contains bugs • Major contributor to downtime and security holes

  3. Previous work: offsite diagnosis (at the development site, with programmers) • Cannot send programmers onsite to debug each failure • Privacy concerns limit the release of information (e.g. core dumps) to programmers • Difficult to reproduce failures in house for diagnosis • Cannot provide timely guidance to choose a recovery strategy or a security defense against attacks → Automatically diagnose software failures that occur during production runs at end-user sites

  4. Difference between Detection and Diagnosis • Detection • “blindly” screens for possible problems • sees bug manifestation • Diagnosis • aims to understand a particular failure • finds root causes

  5. Failure Diagnosis

  6. More on Related Work • Diagnosis • Interactive debuggers: gdb • Program slicing: tools for removing unrelated source code lines • PSE: offline partial execution path constructor from a core dump • These incur 100X overhead or rely on human guidance • Onsite SW failure diagnosis is primitive: • Dr. Watson, Mozilla Quality Feedback Agent • Collect core dumps and other simple raw information • Extracting more detailed information: • Traces of network connections, system call traces, traces of predicated values • Deterministic replay tools for uniprocessor systems

  7. Challenges for Onsite Diagnosis • Efficiently reproduce the occurred failure • Impose little overhead during normal execution • Require no human involvement • Require no prior knowledge

  8. Contributions: • Conduct just-in-time diagnosis with checkpoint-reexecution support • New diagnosis techniques: delta generation and delta analysis • Automated, top-down, human-like diagnosis protocol • Leverage previous failure analysis techniques for onsite and post-hoc diagnosis • Real system experiments: Linux with 9 applications (including MySQL, Apache, etc.) • User study with 15 programmers

  9. Outline • Introduction • Triage Architecture Overview • Diagnosis Protocol • Delta Generation and Analysis • Other diagnosis Techniques • Evaluations • Limitations and Extensions • Conclusion

  10. Architecture

  11. Architecture (figure; labels: Rx, assertions & exceptions)

  12. Outline • Introduction • Triage Architecture Overview • Diagnosis Protocol • Delta Generation and Analysis • Other diagnosis Techniques • Evaluations • Limitations and Extensions • Conclusion

  13. Diagnosis Information • Failure type and nature • Failure-triggering input and environmental conditions • Failure-related code/variables and fault propagation chain

  14. Extensions • Above is the implemented protocol • Can have many extensions • Bug diagnosis techniques, diagnosis order • Automatically fixing the bug • Filter failure-triggering inputs • Generate a patch? Dynamically deleting code or changing variable values?

  15. Outline • Introduction • Triage Architecture Overview • Diagnosis Protocol • Delta Generation and Analysis • Other diagnosis Techniques • Evaluations • Limitations and Extensions • Conclusion

  16. Intuition • Identify what differs between failing and non-failing runs • Automatic, repetitive delta replays with controlled variation and manipulation to execution environments

  17. Goals of Delta Generation • Generate many similar failing and non-failing replays • Collect info with DBI for delta analysis • Identify signatures of failure-triggering inputs and execution environments • Send to programmers • Guide online recovery and security defense

  18. Delta Generation: changing inputs • At failure time, Triage has a list of the requests seen so far • Figure out which subset causes the failure • Then try deleting or randomly changing characters in those requests
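The subset search on this slide can be illustrated with a small sketch. The slide does not say which search strategy Triage uses, so the code below swaps in a simple greedy one-request-at-a-time removal. The function name `minimize_requests` and the `fails` oracle are hypothetical; `fails` stands in for "replay these requests from a checkpoint and report whether the failure recurs".

```python
from typing import Callable, List, Sequence


def minimize_requests(
    requests: Sequence[str],
    fails: Callable[[List[str]], bool],
) -> List[str]:
    """Greedily drop requests that are not needed to reproduce the failure.

    `fails` is a hypothetical oracle: it replays the given requests and
    returns True if the failure still occurs. This is an illustrative
    strategy, not necessarily the one Triage implements.
    """
    current = list(requests)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if fails(candidate):
            # The failure still reproduces without request i, so drop it.
            current = candidate
        else:
            # Request i is needed to trigger the failure; keep it.
            i += 1
    return current
```

Each call to `fails` corresponds to one replay, so the sketch needs at most one replay per seen request. The character-level step on the slide could reuse the same loop over the characters of a single request.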

  19. Delta Generation: changing environments • (similar to Rx and DieHard) • Pad or zero-fill new allocations, change message order, drop messages, manipulate thread scheduling, modify system environment • Based on prior knowledge of the error type, Triage can focus on a subset of these changes

  20. Delta Generation: speculative changes (preliminary) • Force a non-taken branch • Forcefully change data value • Still being explored

  21. Results of Delta Generation • Path: Basic block sequence • Basic block vector (counts)

  22. Delta Analysis • Basic block vector comparison • Path comparison • Intersection with backward slice

  23. Basic Block Vector (BBV) Comparison • A BBV contains the dynamic execution count of each basic block • First calculate the Manhattan distance between each pair of failing and non-failing BBVs • Then select the failing/non-failing pair with the minimal distance
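A minimal sketch of the BBV comparison, assuming each BBV is stored as a dictionary from a basic-block label to its dynamic count. The representation and the names `manhattan` and `closest_pair` are illustrative assumptions, not taken from the paper.

```python
from itertools import product
from typing import Dict, List, Tuple

BBV = Dict[str, int]  # basic-block label -> dynamic execution count


def manhattan(a: BBV, b: BBV) -> int:
    """Sum of absolute count differences; a missing block counts as 0."""
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in a.keys() | b.keys())


def closest_pair(failing: List[BBV], passing: List[BBV]) -> Tuple[int, int, int]:
    """Return (distance, failing_index, passing_index) of the nearest pair."""
    return min(
        (manhattan(f, p), i, j)
        for (i, f), (j, p) in product(enumerate(failing), enumerate(passing))
    )
```

The nearest pair is the interesting one because the two runs differ as little as possible, so whatever differs between them is more likely to be related to the failure.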

  24. Path Comparison • Given the two selected BBVs, find the minimum edit distance (insertion/deletion/substitution) between the two corresponding paths
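The path comparison can be sketched as a standard Levenshtein edit distance over sequences of basic-block labels. This is the textbook dynamic program, shown under the assumption that a path is a list of labels; the slide does not describe how Triage computes the distance.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn path `a` into path `b`."""
    prev: List[int] = list(range(len(b) + 1))  # distances from empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]  # turning a[:i] into the empty path takes i deletions
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

For example, a failing path `A B C D` and a non-failing path `A B D` are one deletion apart, which points at block `C` as the difference.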

  25. Backward Slicing and Result Intersection
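As a rough illustration of the intersection step, the sketch below keeps only those basic blocks that both appear in the path difference and lie in the backward slice of the failure. Both inputs are assumed to be collections of basic-block labels, and the name `rank_suspects` is hypothetical.

```python
from typing import Iterable, List, Set


def rank_suspects(path_diff: Iterable[str], backward_slice: Set[str]) -> List[str]:
    """Blocks that differ between the runs AND can influence the failure,
    in the order they first appear in the path difference."""
    seen: Set[str] = set()
    suspects: List[str] = []
    for block in path_diff:
        if block in backward_slice and block not in seen:
            seen.add(block)
            suspects.append(block)
    return suspects
```

The intersection narrows the report: blocks that differ but cannot affect the failing instruction, and blocks in the slice that behave identically in both runs, are both dropped.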

  26. Data Delta Analysis (unimplemented) • Compare the values of key variables

  27. Outline • Introduction • Triage Architecture Overview • Diagnosis Protocol • Delta Generation and Analysis • Other diagnosis Techniques • Evaluations • Limitations and Extensions • Conclusion

  28. Other Diagnosis Techniques Used • Core dump analysis: • Register state, signal, basic summaries of the stack and heap • Unwind stack for call-chain • Walk malloc’s internal data structure for heap problems • Dynamic bug detection during replay • Memory bug detector • Data race detector (PIN + happens-before)

  29. Outline • Introduction • Triage Architecture Overview • Diagnosis Protocol • Delta Generation and Analysis • Other diagnosis Techniques • Evaluations • Limitations and Extensions • Conclusion

  30. Implementation • Linux 2.4.22 • PIN: dynamically attached to the target program at the beginning of every reexecution attempt

  31. Machine Environments • Single-processor 2.4 GHz Pentium 4, 512 KB cache, 1 GB memory • For server applications, two such machines connected with 100 Mbps Ethernet • Takes a checkpoint every 200 ms and keeps 20 checkpoints
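With a checkpoint every 200 ms and 20 checkpoints kept, the oldest available rollback point is roughly 20 × 200 ms = 4 s in the past. The sketch below models that retention policy with a bounded deque. The class `CheckpointRing` and its methods are hypothetical and say nothing about how the real checkpoint mechanism stores state.

```python
from collections import deque
from typing import Any, Deque, Tuple

CHECKPOINT_INTERVAL_MS = 200  # from the slide
MAX_CHECKPOINTS = 20          # from the slide


class CheckpointRing:
    """Keeps only the most recent `capacity` checkpoints."""

    def __init__(self, capacity: int = MAX_CHECKPOINTS) -> None:
        # A deque with maxlen silently evicts the oldest entry when full.
        self._ring: Deque[Tuple[int, Any]] = deque(maxlen=capacity)

    def take(self, timestamp_ms: int, state: Any) -> None:
        self._ring.append((timestamp_ms, state))

    def oldest(self) -> Tuple[int, Any]:
        return self._ring[0]

    def __len__(self) -> int:
        return len(self._ring)


def rollback_window_ms(
    interval_ms: int = CHECKPOINT_INTERVAL_MS,
    capacity: int = MAX_CHECKPOINTS,
) -> int:
    """Approximate span of execution covered by the retained checkpoints."""
    return interval_ms * capacity
```

This window is why failures that take a long time to manifest are hard to diagnose: the root cause may predate the oldest retained checkpoint.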

  32. Applications and Failures

  33. Results • Successfully found fault-triggering input • Delta analysis reduces dynamic basic blocks: avg 63%

  34. Case Study: Apache

  35. Normal Execution Overhead

  36. Diagnosis Efficiency

  37. User Study • 5 bugs in 3 toy programs and 2 real programs • 15 programmers × 5 bugs; for each bug, a programmer is randomly given the Triage output or not

  38. Outline • Introduction • Triage Architecture Overview • Diagnosis Protocol • Delta Generation and Analysis • Other diagnosis Techniques • Evaluations • Limitations and Extensions • Conclusion

  39. Limitations and Extensions • Privacy policy: the info in a Triage report is easier to understand, which makes it easier to specify a privacy policy • Automatic patch generation: limited success (patching a buffer overflow for a particular allocation point) • Difficult bugs: those that do not crash or that take a long time to manifest (checkpoints may not be kept long enough)

  40. Limitations and Extensions • Deterministic replay on multiprocessors • Deployment on highly loaded machines • Cannot afford full-fledged Triage analysis • Background? On a separate machine? Deferred? Simplified (better than nothing)? • Handle false positives • Never encountered in the experiments • But may be solved by more sophisticated consistency checks among results produced by different diagnosis techniques

  41. Conclusion • Onsite SW failure diagnosis • Lightweight checkpoint and recovery • Failure diagnosis protocol • Delta generation and analysis
