CS598YYZ Paper Presentation
Configuration Debugging as Search: Finding the Needle in the Haystack
Andrew Whitaker, Rick Cox, Steve Gribble
The University of Washington
OSDI 2004, San Francisco, CA
Presenter: Xiao Ma (xiaoma2@cs.uiuc.edu)
*Some figures are borrowed from the authors' OSDI 2004 slides.
CS598YYZ (Fall 2005)
Outline
1. Authors
2. Motivation, Goals and Assumptions
3. Chronus Tool Design
4. Debugging Examples
5. Quantitative Evaluation
6. Shortcomings and Future Work
7. Concluding Remarks
8. Discussion
CS598YYZ(Fall 05) Paper Presentation – Finding the Needle in the Haystack (11/17/2005) Xiao Ma (xiaoma2@cs.uiuc.edu)
Authors
• Andrew Whitaker
  • Ph.D. CSE, UW (200?)
  • Now at Amazon.com
  • Denali and lightweight virtual machines
  • Safe, Extensible Networking
• Steven D. Gribble
  • Ph.D. EECS, UCB (2000)
  • Associate Professor in CSE at U of Washington
  • Denali: Lightweight virtual machines for distributed and networked systems
  • Internet Systems Measurement and Analysis Projects
Motivation - WYNOT Error
• WYNOT errors: the system worked yesterday, not today
• Continual change is a fact of life, and it causes most misconfiguration problems
  • Install/update applications
  • Change security policies
  • Alter system configuration options
  • ……
• WYNOT problems are hard to troubleshoot
  • Google search? – Too much noise
  • Help documents? – Can they cover all possibilities?
  • Reinstall the application? – Not guaranteed to solve the problem
WYNOT configuration errors are difficult.
Motivation – Human Cost vs. Hardware Cost
[Figure: Total Ownership Cost (TOC) breakdown, 1970s (hardware costs) vs. 2000s (people costs)]
• Today, the diagnosis of misconfiguration problems relies heavily on human expertise. However:
  • Human experts are scarce and expensive
  • Complexity is growing over time
Can we substitute hardware effort for human effort?
Goals
Automate the diagnosis of WYNOT errors, even at the cost of a reasonable amount of hardware resources!
Assumption
• Assume that the problem was caused by a change to the system over time
[Figure: timeline; the system was working before the failure transition and was NOT working after it]
Design – Overview
• Chronus tool: search across time for the instant the system transitioned into a failed state.
[Figure: going directly from "System failure" to "Why?" is difficult; Chronus answers "When?", and external analysis tools then answer "Why?"]
Design – Normal Model
[Figure: normal-operation architecture. Labels: child virtual machine running normal user applications; disk requests; Chronus in the parent virtual machine; time-travel disk; Denali virtual machine monitor]
• The child VM runs the normal user applications
• The parent VM records disk writes to a time-travel disk (TTDisk)
• Each block write represents an instant in time
Design – Debug Model
[Figure: debug architecture. Labels: Chronus search engine in the parent virtual machine; child virtual machine running normal user applications and the probe ("Was the system correct?"); disk requests; COW disk; time-travel disk (Tbegin); Denali virtual machine monitor]
• The child VM runs the normal user applications
• The parent VM instantiates, boots and tests historical snapshots to search for the failure transition point, i.e. the moment the system went bad.
Design – Component (1): Time Travel Disk
• Problem: Record previous system state
• Solution: Use a time-travel disk to log each disk write
[Figure: time-travel disk layout with a checkpoint region and a log region; labels: block writes, block reads, meta-data]
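To make the logging idea concrete, here is a minimal Python sketch of a time-travel disk as an append-only log of block writes. The names (TimeTravelDisk, write, snapshot) are hypothetical, and the sketch deliberately ignores the checkpoint region and meta-data; it only illustrates how logging every write lets an earlier disk state be rebuilt.

```python
from dataclasses import dataclass, field


@dataclass
class TimeTravelDisk:
    """Toy model (hypothetical names): an append-only log of block writes."""
    log: list = field(default_factory=list)  # entries are (block_number, data)

    def write(self, block: int, data: bytes) -> int:
        """Append a write and return its log index, used here as the instant."""
        self.log.append((block, data))
        return len(self.log) - 1

    def snapshot(self, instant: int) -> dict:
        """Rebuild the block map as it stood just after write number `instant`."""
        blocks = {}
        for block, data in self.log[: instant + 1]:
            blocks[block] = data  # later writes overwrite earlier ones
        return blocks


disk = TimeTravelDisk()
t0 = disk.write(7, b"User www")
t1 = disk.write(9, b"other data")
t2 = disk.write(7, b"User rick")
print(disk.snapshot(t1)[7])  # b'User www'  (block 7 before the last write)
print(disk.snapshot(t2)[7])  # b'User rick' (block 7 after it)
```

Replaying the whole log prefix is the simplest possible design; a real disk would presumably use checkpoints to avoid replaying from the start.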
Design – Component (2): VMM - Denali
• Problem: Instantiating a historical state
• Solution: Boot a virtual machine - Denali
  • More complete than application-layer restarts
  • More convenient than physical machine restarts
[Figure: a virtual machine booted from a COW disk layered on the time-travel disk]
Design – Component (3): How to test?
• Problem: How to test the re-running system
• Solution: A user-supplied “software probe”
  • Can be arbitrary code that tests system behavior
  • Output is an opaque string

    #!/bin/sh
    # Probe: report whether sshd on the target host accepts a login.
    TEMPFILE=./QXB50.tmp
    rm -f "${TEMPFILE}"
    # Run a trivial remote command; the file is non-empty only if sshd answered.
    ssh root@10.19.13.17 'date' > "${TEMPFILE}"
    if test -s "${TEMPFILE}"
    then
        echo "SSHD UP"
    else
        echo "SSHD DOWN"
    fi
    exit 0
Design – Component (4): Search over time
• Problem: Finding the failure transition quickly
• Solution: Binary search across time
• Zero transitions -> the system started out broken
[Figure: timeline; the system was working before the transition and NOT working after it]
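The binary search across time can be sketched in a few lines of Python. This is an illustrative sketch, not the Chronus implementation: find_transition and the probe callable are hypothetical names, and the code assumes a single working-to-failing transition inside the range. The simulated probe below reuses the instants from the Mozilla example and assumes the system breaks at instant 173553.

```python
def find_transition(probe, begin, end):
    """Return (last_good, first_bad) instants, or None if no transition is found.

    `probe(t)` is a hypothetical callable that returns True when the system
    booted from the snapshot at instant `t` behaves correctly.
    """
    if not probe(begin):
        return None  # zero transitions: the system started out broken
    if probe(end):
        return None  # nothing is failing at the end of the range
    good, bad = begin, end
    while bad - good > 1:
        mid = (good + bad) // 2
        if probe(mid):
            good = mid  # still working: the transition is later
        else:
            bad = mid   # already failing: the transition is earlier
    return good, bad


probed = []  # instants visited, in order


def simulated_probe(t):
    probed.append(t)
    return t < 173553  # assumption: the system breaks at instant 173553


print(find_transition(simulated_probe, 169354, 180025))  # (173552, 173553)
print(probed[:4])  # [169354, 180025, 174689, 172021]
```

With several transitions in the range, the search still converges on one adjacent working/failing pair, but not necessarily the earliest one.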
Design – Component (5): Problem Diagnosis
• Problem: Going from “when” to “why”
• Solution: Start with simple comparison commands, then hand over to other debugging tools
  • “diff” the file system before and after
  • Cross-reference with system log files
  • ……
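The before/after comparison can be sketched as follows. Snapshots are modeled here as hypothetical {path: contents} dictionaries instead of mounted file systems, and diff_snapshots is an illustrative name, not a Chronus command; the sample paths and contents are made up.

```python
def diff_snapshots(before, after):
    """Compare two {path: contents} maps.

    Returns three sorted lists: paths added, paths removed, paths whose
    contents changed between the two snapshots.
    """
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(
        path for path in set(before) & set(after) if before[path] != after[path]
    )
    return added, removed, changed


# Hypothetical state just before and just after the failure transition.
before = {"/etc/hosts": "10.0.0.1 www\n", "/etc/motd": "hello\n"}
after = {"/etc/hosts": "10.0.0.2 www\n", "/etc/resolv.conf": "nameserver 10.0.0.53\n"}

added, removed, changed = diff_snapshots(before, after)
print("added:", added)      # added: ['/etc/resolv.conf']
print("removed:", removed)  # removed: ['/etc/motd']
print("changed:", changed)  # changed: ['/etc/hosts']
```

Because the two snapshots are adjacent instants, the list of differences should be short, which is what makes a plain diff useful as a first diagnostic step.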
Inject Faults
• Problem: How to inject misconfiguration faults into a system
• Solution: A fault-injection tool: etc-smasher
  • Every second, etc-smasher randomly chooses a file from the /etc directory
  • 10% of the time, etc-smasher adds, removes or modifies a single character in that file
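The behavior described above can be sketched in Python. This is not the authors' etc-smasher: the function names are hypothetical, the character set used for insertions is an assumption, and the sketch works on an in-memory {path: contents} dictionary so that it never touches a real /etc directory.

```python
import random
import string


def smash(text, rng):
    """Add, remove, or modify exactly one character of `text`."""
    action = rng.choice(["add", "remove", "modify"])
    if action == "add" or not text:  # an empty file can only grow
        pos = rng.randrange(len(text) + 1)
        return text[:pos] + rng.choice(string.printable) + text[pos:]
    pos = rng.randrange(len(text))
    if action == "remove":
        return text[:pos] + text[pos + 1:]
    # modify: pick a replacement that differs from the current character
    replacement = rng.choice([c for c in string.printable if c != text[pos]])
    return text[:pos] + replacement + text[pos + 1:]


def tick(files, rng, mutate_probability=0.10):
    """One simulated second: choose a file, and mutate it 10% of the time.

    `files` is an in-memory {path: contents} stand-in for /etc.
    Returns the chosen path if it was mutated, else None.
    """
    path = rng.choice(sorted(files))
    if rng.random() < mutate_probability:
        files[path] = smash(files[path], rng)
        return path
    return None


rng = random.Random(2005)  # fixed seed so the run is repeatable
etc = {"/etc/hosts": "127.0.0.1 localhost\n", "/etc/motd": "welcome\n"}
mutated = [tick(etc, rng) for _ in range(60)]  # one simulated minute
print(sum(path is not None for path in mutated), "mutations in 60 ticks")
```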
Example (1) – Mozilla Hang
• Application: Mozilla Web Browser on the NetBSD OS
• Methodology: Install several extensions
• Symptom: Mozilla freezes on startup
  • Fails to respond to user input
Example (1) – Mozilla Hang (Step 1)
• Write a probe to test the behavior

    #!/bin/sh
    mozilla &
    sleep 5
    # The next command blocks if Mozilla hangs, so SUCCESS is never written.
    mozilla -remote "ping()"
    echo 'SUCCESS' > /TTOUTPUT
Example (1) – Mozilla Hang (Step 2)
• Invoke search over a time range

    % search -begin 169354 -end 180025
    169354: SUCCESS
    180025: FAILURE
    174689: FAILURE
    172021: SUCCESS
    173355: SUCCESS
    174022: FAILURE
    173688: FAILURE
    173521: SUCCESS
    173604: FAILURE
    173562: FAILURE
    173541: SUCCESS
    173551: SUCCESS
    173556: FAILURE
    173553: FAILURE
    173552: SUCCESS

Transition point: between 173552 (SUCCESS) and 173553 (FAILURE)
Example (1) – Mozilla Hang (Step 3)
• Compute the change

    % attach time-travel-disk 173552 173553
    % diff -r /before /after
    file /.mozilla/default/zc1irw5u.slt/chrome/chrome.rdf differs:
    <RDF:Description about="urn:mozilla:package:stockticker" ...
        c:author="Jeremy Gillick"
        c:authorURL="http://jgillick.nettripper.com/"
        c:description="Shows your favorite stocks in a customized ticker."
        c:displayName="StockTicker 0.4.2"

The “Stockticker” package is the problem.
Example (2) – Graphical application: Mozilla Browser
• Although the result looks reasonable, I personally believe that Chronus will not work well on graphical applications.
Example (3) – Complex problem: an Apache error
• The problem
  • Failure: an Apache server returns a “Forbidden” error
  • Root cause: the CGI script must connect to the back-end database as the user www
• Two choices of probe
  • Success-directed probe – prone to false positives
  • Failure-directed probe – more precise
Example (3) – Complex problem: an Apache error
[Figure: timeline from working to failing; events labeled old IP address, Apache upgrade and suexec error; one is marked as the true error and two as false positives; annotations contrast the success-directed probe with the failure-directed probe]

    #!/bin/sh
    curl http://www/../ -D temp.txt
    grep -i 'forbidden' temp.txt

    % diff -r /child-before /child-after
    file /usr/apache1/conf/httpd.conf differs:
    < User www
    > User rick
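The difference between the two probe styles can be sketched in Python. The response strings and the labels in `history` are hypothetical stand-ins for what a server might return at different instants; they are not taken from the paper, and the function names are illustrative only.

```python
def success_directed(response):
    """Report SUCCESS only when the page loads; anything else counts as FAILURE."""
    return "SUCCESS" if response.startswith("HTTP/1.1 200") else "FAILURE"


def failure_directed(response):
    """Report FAILURE only for the specific symptom under investigation."""
    return "FAILURE" if "forbidden" in response.lower() else "SUCCESS"


# Hypothetical server responses at four historical instants, oldest first.
history = [
    ("working", "HTTP/1.1 200 OK"),
    ("unrelated breakage", ""),  # e.g. no response at all
    ("working again", "HTTP/1.1 200 OK"),
    ("reported symptom", "HTTP/1.1 403 Forbidden"),
]

for label, response in history:
    print(f"{label:20} {success_directed(response):8} {failure_directed(response)}")
```

In this toy history the success-directed probe flags the unrelated breakage as a failure, so a search could stop at the wrong transition. The failure-directed probe only reacts to the "Forbidden" symptom, at the price of treating every other kind of breakage as "not this failure".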
Quantitative Evaluation – TTD performance
• Acceptable overhead for common cases
• Except in some rare situations
Quantitative Evaluation – Log inflation
• Major overhead of Chronus
• Modifying the directory structure -> excessive log growth
• Solutions:
  • Log data compression
  • Temporarily deactivate versioning
Quantitative Evaluation – Execution time
• Execution time grows logarithmically with log length
• Requires ~20 seconds per probe
  • ~12 seconds devoted to File System Consistency Check (FSCK) operations
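As a back-of-the-envelope check on the logarithmic growth, the sketch below bounds the number of midpoint probes of a binary search by ceil(log2(width)) and multiplies it by the ~20 seconds per probe quoted above. The function name is hypothetical, the bound assumes a plain binary search, and the two probes at the ends of the range are not counted.

```python
import math


def estimate_search(width, seconds_per_probe=20.0):
    """Return (max midpoint probes, max seconds) for a range of `width` instants."""
    probes = math.ceil(math.log2(width)) if width > 1 else 0
    return probes, probes * seconds_per_probe


# Range used in the Mozilla example: -begin 169354 -end 180025.
print(estimate_search(180025 - 169354))        # (14, 280.0)
# Doubling the range adds only one more probe.
print(estimate_search(2 * (180025 - 169354)))  # (15, 300.0)
```

The Mozilla search shown earlier used 13 midpoint probes for this range, which is within the bound.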
Shortcomings
• Overhead
  • Storage overhead
  • VMM overhead
  • Debugging time could be very long (it depends on the log size and the execution time of the probe)
  • Human workload overhead (writing an effective probe is hard)
• Only a limited class of errors can be diagnosed by Chronus
  • The errors must be captured by persistent storage
  • The errors must be reproducible (most Heisenbugs are not)
  • The errors must be testable
    • What about performance-related problems? How would one write probes for them?
Shortcomings (cont'd)
• False positives
  • Multiple failures during a single time line
  • Multiple changes at the same “failure transition point”
• “Semantic Gap”
  • Most of the time, the relationship between low-level events and high-level semantics is not obvious.
  • Can we add high-level content to the log disk?
• The experimental results are not general enough
  • The fault injection is too simple; real-world problems could be much more complex
  • The evaluation should involve more than just researchers
Future Work
• Extend Chronus beyond a UNIX environment
• Exploit different time-travel storage mechanisms
• Make Chronus more automatic
  • Automatically generate the probe
  • Automatically map from low-level changes to high-level problems
• More complete evaluation
  • Real-world misconfiguration problems (maybe our DSN paper would help)
  • Different levels of users
Concluding Remarks
• Chronus can reveal when a misconfiguration error happened, which simplifies the diagnosis process.
• Chronus combines time-travel storage, virtual machine monitors (VMM), software testing and search.
• A prototype of Chronus can successfully diagnose many common configuration problems.
Semi-automatic diagnosis of misconfiguration is possible!!
Discussion - STRIDER
• Advantage
  • Simpler to use, lower overhead
  • STRIDER can also exploit spatial comparisons
• Disadvantage
  • False positives
  • Registry-limited. STRIDER needs registry-specific heuristics to prune the search space.
  • Misses indirect dependencies
Discussion - PeerPressure
• PeerPressure debugs across space instead of time
• Advantage
  • Simpler to use, lower overhead
  • Comprehensive
• Disadvantage
  • Single-entry problem
  • Registry-limited. PeerPressure also needs registry-specific knowledge.
  • Misses cross-application dependencies.
Discussion - Others
• Can Chronus be used on other platforms?
• How to merge low-level behavior and high-level semantics?
• Can we combine Chronus with PeerPressure?
• How to maintain consistency with the data in memory?
• How to decide the granularity of blocks?
  • Too fine -> false positives due to temporary changes
  • Too coarse -> too many differences between two adjacent instants
THE END
Thank you! Wish you guys a GREAT Thanksgiving Break!