Triage: Diagnosing Production Run Failures at the User’s Site • J. Tucek, S. Lu, C. Huang, S. Xanthos, Y. Zhou, SOSP’07 • Shimin Chen, LBA Reading Group Presentation
Motivation • Software failures at end users’ sites • Released software still contains bugs • Major contributor to downtime and security holes
Previous work: offsite diagnosis (at the development site, with programmers) • Cannot send programmers onsite to debug each failure • Privacy concerns limit the release of information (e.g. core dumps) to programmers • Difficult to reproduce failures in house for diagnosis • Cannot provide timely guidance for choosing a recovery strategy or a security defense against attacks • Goal: automatically diagnose software failures occurring in production runs at end-user sites
Difference between Detection and Diagnosis • Detection: “blindly” screens for possible problems; sees the bug manifestation • Diagnosis: aims to understand a particular failure; finds root causes
More on Related Work • Diagnosis: interactive debuggers (gdb); program slicing (tools for removing unrelated source code lines); PSE (offline construction of a partial execution path from a core dump) • These incur 100X overhead or rely on human guidance • Onsite SW failure diagnosis is primitive: Dr. Watson and the Mozilla Quality Feedback Agent collect core dumps and other simple raw information • Extracting more detailed information: traces of network connections, system call traces, traces of predicated values • Deterministic replay tools for uniprocessor systems
Challenges for Onsite Diagnosis • Efficiently reproduce the failure that occurred • Impose little overhead during normal execution • Require no human involvement • Require no prior knowledge
Contributions: • Conduct just-in-time diagnosis with checkpoint-reexecution support • New diagnosis techniques: delta generation and delta analysis • Automated, top-down, human-like diagnosis protocol • Leverage previous failure analysis techniques for onsite and post-hoc diagnosis • Real system experiments: Linux with 9 applications (including MySQL, Apache, etc.) • User study with 15 programmers
Outline • Introduction • Triage Architecture Overview • Diagnosis Protocol • Delta Generation and Analysis • Other Diagnosis Techniques • Evaluations • Limitations and Extensions • Conclusion
Architecture • [Figure: Triage architecture diagram; surviving labels: Rx, assertions & exceptions]
Diagnosis Information • Failure type and nature • Failure-triggering input and environmental conditions • Failure-related code/variables and fault propagation chain
Extensions • Above is the implemented protocol • Can have many extensions • Bug diagnosis techniques, diagnosis order • Automatically fixing the bug • Filter failure-triggering inputs • Generate a patch? Dynamically deleting code or changing variable values?
Intuition • Identify what differs between failing and non-failing runs • Automatic, repeated delta replays with controlled variation and manipulation of the execution environment
Goals of Delta Generation • Generate many similar failing and non-failing replays • Collect info with DBI for delta analysis • Identify signatures of failure-triggering inputs and execution environments • Send to programmers • Guide online recovery and security defense
Delta Generation: changing inputs • At failure time, Triage has a list of the requests it has seen • Figure out which subset of them causes the failure • Then try deleting or randomly changing characters in those requests
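The input-narrowing steps on this slide can be sketched as a greedy minimizer. This is only an illustration, not the search Triage actually implements: `greedy_minimize` and the `still_fails` callback are hypothetical names, and `still_fails` stands in for "replay from a checkpoint with this input and check whether the failure recurs". The same routine is applied first to the request list and then to the characters of a single request.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def greedy_minimize(items: Sequence[T],
                    still_fails: Callable[[List[T]], bool]) -> List[T]:
    """Drop items one at a time, keeping a removal only if the replay still fails.

    `still_fails` is a hypothetical oracle: in a real system it would
    reexecute the program from a checkpoint with the candidate input.
    """
    current = list(items)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if still_fails(candidate):
            current = candidate   # item i was not needed to trigger the failure
        else:
            i += 1                # item i is needed; keep it and move on
    return current

# Toy failure model (an assumption for this example): the program
# fails when any request contains four consecutive 'A' characters.
def replay_fails(requests: List[str]) -> bool:
    return any("AAAA" in r for r in requests)

seen = ["GET /a", "GET /" + "A" * 10, "GET /b"]
culprits = greedy_minimize(seen, replay_fails)

# Second step: shrink the culprit request character by character.
shrunk = "".join(greedy_minimize(culprits[0], lambda cs: "AAAA" in "".join(cs)))
```

Each candidate costs one reexecution, so a real implementation would likely remove larger chunks first and only then fall back to single items.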
Delta Generation: changing environments • (similar to Rx and DieHard) • Pad or zero-fill new allocations, change message order, drop messages, manipulate thread scheduling, modify the system environment • Prior knowledge of the error type lets Triage focus on a subset of these changes
Delta Generation: speculative changes (preliminary) • Force a non-taken branch • Forcibly change a data value • Still being explored
Results of Delta Generation • Path: Basic block sequence • Basic block vector (counts)
Delta Analysis • Basic block vector comparison • Path comparison • Intersection with backward slice
Basic Block Vector (BBV) Comparison • A BBV contains the dynamic execution count of each basic block • First calculate the Manhattan distance between each pair of failing and non-failing BBVs • Then select the failing/non-failing pair with the minimal distance
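A minimal sketch of the BBV comparison, assuming each BBV is stored as a dictionary from a basic block id to its dynamic count, with missing blocks counted as zero. The names `manhattan` and `closest_pair` are illustrative, not taken from Triage.

```python
from itertools import product
from typing import Dict, List, Tuple

BBV = Dict[int, int]  # basic block id -> dynamic execution count

def manhattan(a: BBV, b: BBV) -> int:
    """Sum of absolute count differences over every block seen in either run."""
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in a.keys() | b.keys())

def closest_pair(failing: List[BBV], passing: List[BBV]) -> Tuple[int, int, int]:
    """Return (failing index, passing index, distance) of the most similar pair."""
    i, j = min(product(range(len(failing)), range(len(passing))),
               key=lambda ij: manhattan(failing[ij[0]], passing[ij[1]]))
    return i, j, manhattan(failing[i], passing[j])

failing_runs = [{1: 3, 2: 1}, {1: 1, 3: 5}]
passing_runs = [{1: 3, 2: 2}, {4: 9}]
best = closest_pair(failing_runs, passing_runs)
```

The most similar failing and non-failing runs are the useful ones to compare further, because few differences remain between them.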
Path Comparison • Given the two BBVs, find the minimum edit distance (insertion/deletion/substitution) between the two corresponding paths
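The path comparison can be illustrated with the standard dynamic-programming edit distance over two basic-block sequences. This is a sketch under the assumption that a path is a list of block ids; it computes only the distance, and a tool reporting the differing blocks would also need to keep the alignment.

```python
from typing import Hashable, Sequence

def edit_distance(p: Sequence[Hashable], q: Sequence[Hashable]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning p into q."""
    prev = list(range(len(q) + 1))          # distances from the empty prefix of p
    for i, a in enumerate(p, start=1):
        cur = [i]                            # deleting the first i blocks of p
        for j, b in enumerate(q, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute, or match for free
        prev = cur
    return prev[-1]

failing_path = [1, 2, 3, 4]
passing_path = [1, 2, 5, 4]
distance = edit_distance(failing_path, passing_path)
```

The table costs time proportional to the product of the two path lengths, which is one reason to run it only on the closest pair of runs rather than on every pair.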
Data Delta Analysis (unimplemented) • Compare the values of key variables
Other Diagnosis Techniques Used • Core dump analysis: register state, signal, basic summaries of the stack and heap; unwind the stack for the call chain; walk malloc’s internal data structures for heap problems • Dynamic bug detection during replay: memory bug detector; data race detector (PIN + happens-before)
Implementation • Linux 2.4.22 • PIN: dynamically attached to the target program at the beginning of every reexecution attempt
Machine Environments • Single-processor 2.4GHz Pentium-4, 512KB cache, 1GB memory • For server applications, two such machines connected with 100Mbps Ethernet • Takes a checkpoint every 200ms and keeps 20 checkpoints
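With a checkpoint every 200 ms and 20 checkpoints kept, the rollback history spans roughly 20 × 200 ms = 4 s. The sketch below models that retention policy with a bounded queue. `CheckpointRing` and its methods are hypothetical names for illustration, and the stored "state" is a placeholder string rather than a real process snapshot.

```python
from collections import deque
from typing import Any, Deque, Optional, Tuple

class CheckpointRing:
    """Keeps only the most recent `capacity` checkpoints (illustrative model)."""

    def __init__(self, interval_ms: int = 200, capacity: int = 20) -> None:
        self.interval_ms = interval_ms
        self.ring: Deque[Tuple[int, Any]] = deque(maxlen=capacity)

    def take(self, t_ms: int, state: Any) -> None:
        # Appending to a full deque silently evicts the oldest checkpoint.
        self.ring.append((t_ms, state))

    def window_ms(self) -> int:
        """Approximate span of execution history covered when the ring is full."""
        return self.interval_ms * self.ring.maxlen

    def latest_before(self, t_ms: int) -> Optional[Tuple[int, Any]]:
        """Newest checkpoint taken at or before t_ms, or None if already evicted."""
        for ts, state in reversed(self.ring):
            if ts <= t_ms:
                return ts, state
        return None

ring = CheckpointRing()
for n in range(30):                      # simulate 6 seconds of execution
    ring.take(n * 200, f"snapshot-{n}")
```

A failure whose root cause lies further back than this window cannot be replayed from a retained checkpoint, which is the concern behind bugs that take a long time to manifest.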
Results • Successfully found fault-triggering input • Delta analysis reduces dynamic basic blocks: avg 63%
User Study • 5 bugs in 3 toy programs and 2 real programs • 15 programmers × 5 bugs, each randomly given the Triage output or not
Limitations and Extensions • Privacy policy: the information in a Triage report is easier to understand, which helps in specifying a privacy policy • Automatic patch generation: limited success, e.g. patching a buffer overflow at a particular allocation point • Difficult bugs: those that do not crash or that take a long time to manifest (checkpoints may not be kept long enough)
Limitations and Extensions • Deterministic replay on multiprocessors • Deployment on highly loaded machines • Cannot afford full-fledged Triage analysis • Run in the background? On a separate machine? Deferred? Simplified (better than nothing)? • Handling false positives • Never encountered in the experiments • But may be solved by more sophisticated consistency checks among the results produced by different diagnosis techniques
Conclusion • Onsite SW failure diagnosis • Lightweight checkpoint and recovery • Failure diagnosis protocol • Delta generation and analysis