Using Time Travel to Diagnose Computer Problems
Andrew Whitaker, Rick Cox, Steve Gribble
The University of Washington
Example Scenario
• Mozilla Web browser locks up after installing an extension
• Current approaches are inadequate:
    • Google search reveals too many possibilities: bad extension, HTTP pipelining enabled, glibc update, invalid hostname, “some upgrade of some gnome or GTK package”, Mozilla bugs
    • Help menus cannot anticipate all error cases
    • Reinstalling Mozilla does not fix the problem
General Problem
• WYNOT errors: system worked yesterday, not today
• Other examples:
    • Misconfigured Internet servers
        • Administrator mistakes are the largest source of downtime
    • Conflicts between applications
        • Registry corruption, “DLL hell”
    • Security policy
        • Over-zealous firewall
    • Spyware, adware, viruses
Goal: automate the diagnosis of change-induced errors
Chronus Overview
• Use search to identify the transition from a working to a failing state:
    [Figure: timeline of system states labeled "working" and "failing", with the fault point marked at the transition between them]
• Search requirements:
    • Time-travel mechanism
    • Testing mechanism
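The search step can be sketched as a binary search over timestamps. This is a minimal illustration, not the Chronus implementation: the function name `find_fault_point`, the floor-midpoint rule, and the stand-in `fake_probe` are assumptions made for the example, and it assumes the system works at every point before the fault and fails at every point after it. The time range is the one used in the Mozilla debugging example later in the deck.

```python
from typing import Callable, List, Tuple


def find_fault_point(
    probe: Callable[[int], bool], begin: int, end: int
) -> Tuple[int, int, List[int]]:
    """Binary-search [begin, end] for the adjacent pair (last_good, first_bad).

    `probe(t)` returns True if the system works at timestamp t.
    Also returns the timestamps tested, in the order they were tested.
    """
    if not probe(begin):
        raise ValueError("system is already failing at the start of the range")
    if probe(end):
        raise ValueError("system is still working at the end of the range")

    good, bad = begin, end
    tested = [begin, end]
    while bad - good > 1:
        mid = (good + bad) // 2  # assumption: floor of the midpoint
        tested.append(mid)
        if probe(mid):
            good = mid
        else:
            bad = mid
    return good, bad, tested


# Hypothetical stand-in for a real probe: pretend the fault was
# introduced immediately after timestamp 173552.
def fake_probe(timestamp: int) -> bool:
    return timestamp <= 173552


good, bad, tested = find_fault_point(fake_probe, 169354, 180025)
print(good, bad)    # 173552 173553
print(len(tested))  # number of probe runs, including the two endpoints
```

Each probe run halves the remaining interval, so the number of historical boots grows only logarithmically with the length of the time range.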
Usage Model
• User-written software probe: Is the system working?
• Chronus: When did the system stop working?
• Analysis tools (diff, regdiff, log files): Why did the system stop working?
Outline
• Introduction
• Design and implementation
• Debugging Experience
Time Travel Mechanism
• Log state changes using a time-travel disk
• Boot a historical virtual machine
    • Captures boot-time configuration parameters
• Capture state changes onto a copy-on-write disk
    • Avoids tampering with the system timeline
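The copy-on-write idea can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class name `CowDisk`, the dictionary-based block store, and the block contents are hypothetical and do not describe the actual Chronus disk format.

```python
from types import MappingProxyType
from typing import Dict, Mapping


class CowDisk:
    """Copy-on-write view over a historical disk snapshot (illustrative only)."""

    def __init__(self, snapshot: Mapping[int, bytes]) -> None:
        # Read-only view: the historical snapshot can never be modified here.
        self._base = MappingProxyType(dict(snapshot))
        # Writes made while the historical VM runs land in this overlay.
        self._overlay: Dict[int, bytes] = {}

    def read(self, block_no: int) -> bytes:
        # Prefer the overlay so the VM sees its own writes.
        if block_no in self._overlay:
            return self._overlay[block_no]
        return self._base.get(block_no, b"")

    def write(self, block_no: int, data: bytes) -> None:
        self._overlay[block_no] = data

    def discard(self) -> None:
        # Throw away everything the historical boot changed.
        self._overlay.clear()


snapshot = {0: b"boot sector", 1: b"old config"}
disk = CowDisk(snapshot)
disk.write(1, b"scribbled by the probe run")
print(disk.read(1))   # the VM sees its own write
disk.discard()
print(disk.read(1))   # the snapshot itself is unchanged
```

Because writes never reach the snapshot, the same historical state can be booted repeatedly by different probe runs.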
Time Travel Implementation
• Functionality split across two VMs
    • Parent implements time-travel functionality
    • Child executes normal user programs
[Figure: the child VM sends disk requests to the parent VM, which manages the COW disk and the time-travel disk; both VMs run on the Denali VMM]
Software Probes
• Probe is arbitrary code that evaluates system correctness
• Two varieties of probes:
    • Internal probes run inside the child VM
    • External probes run on a remote machine
• Strategies for obtaining probes:
    • Pre-packaged libraries
    • Written on the fly by expert user or administrator
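Because a probe can block when the system under test hangs, whatever drives it needs a timeout. The following is a minimal sketch of such a driver, not the Chronus harness: `run_probe` is a hypothetical helper, and it assumes the probe reports by printing SUCCESS to standard output, whereas the probes shown in this deck write SUCCESS to /TTOUTPUT.

```python
import subprocess
import sys
from typing import Sequence


def run_probe(command: Sequence[str], timeout_s: float) -> bool:
    """Run a probe command; treat a hang, a crash, or missing SUCCESS as failure."""
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return False  # the probe blocked, so count it as a failure
    return result.returncode == 0 and "SUCCESS" in result.stdout


if __name__ == "__main__":
    # Stand-in probe: a tiny Python one-liner instead of a real system test.
    ok = run_probe([sys.executable, "-c", "print('SUCCESS')"], timeout_s=10.0)
    print("SUCCESS" if ok else "FAILURE")
```

Treating a timeout as an ordinary failure keeps the probe itself simple: it only has to report success, which matches the deck's point that a probe need only test for correctness.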
Debugging the Mozilla Hang
• Step 1: write a probe that tests the behavior:
    #!/bin/sh
    mozilla &
    sleep 5
    mozilla -remote "ping()"    # blocks if Mozilla hangs
    echo 'SUCCESS' > /TTOUTPUT
• Step 2: invoke search over a time range:
    % search -begin 169354 -end 180025
    169354: SUCCESS
    180025: FAILURE
    174689: FAILURE
    172021: SUCCESS
    173355: SUCCESS
    174022: FAILURE
    173688: FAILURE
    173521: SUCCESS
    173604: FAILURE
    173562: FAILURE
    173541: SUCCESS
    173551: SUCCESS
    173556: FAILURE
    173553: FAILURE
    173552: SUCCESS
Mozilla Hang, Continued
• Step 3: compute the change:
    % attach child-disk 173552 173553
    % diff -r /child-before /child-after
    file /.mozilla/default/zc1irw5u.slt/chrome/chrome.rdf differs:
    <RDF:Description about="urn:mozilla:package:stockticker"
        c:baseURL="jar:file:///root/.mozilla/default/zc1irw5u.slt/chrome/stockticker.jar!/content/"
        c:locType="profile"
        c:author="Jeremy Gillick"
        c:authorURL="http://jgillick.nettripper.com/"
        c:description="Shows your favorite stocks in a customized ticker."
        c:displayName="StockTicker 0.4.2"
        c:extension="true"
        c:name="stockticker"
        c:settingsURL="chrome://stockticker/content/options.xul" />
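The comparison in step 3 amounts to a recursive directory diff of the two attached disk states. As a rough sketch of the same idea using only the Python standard library (the helper `changed_files` is hypothetical and is not part of Chronus):

```python
import filecmp
import os
from typing import List


def changed_files(before: str, after: str) -> List[str]:
    """Return relative paths that differ or exist on only one side."""
    changes: List[str] = []

    def walk(cmp: filecmp.dircmp, prefix: str) -> None:
        # diff_files: present on both sides but different
        # left_only / right_only: removed or added between the two states
        # funny_files: could not be compared
        for name in cmp.diff_files + cmp.left_only + cmp.right_only + cmp.funny_files:
            changes.append(os.path.join(prefix, name))
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(before, after), "")
    return sorted(changes)
```

Called as `changed_files("/child-before", "/child-after")`, this would list the candidate files to inspect. One caveat: `filecmp.dircmp` compares files shallowly by default (type, size, and modification time), so a same-size edit that preserves the modification time can be missed; `diff -r` compares contents.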
Summary
• Chronus uses search to find a failure-inducing state change
• User-supplied probe need only test for correctness
• “Time travel” built on a logging disk and a virtual machine monitor
• Chronus can diagnose many common configuration errors
More details to appear at OSDI 2004
Emerging Challenge: Evaluation in the Post-performance Era
• How do we demonstrate “correctness”?
    • Conventional benchmarking cannot account for the “human factor”
• Alternate approaches:
    • Proximate metrics
        • Bug count
    • User studies (Aaron Brown’s work)
    • Proofs
• Research directions
    • Validating proximate metrics
    • Designing systems with evaluation in mind
[Figure: timeline with the fault point marked; before it the system was working, after it the system was NOT working]
Why it works
• Testing complexity does not scale with system complexity:
    [Figure: a client issues an HTTP GET through a firewall to a server stack of Apache, Perl, MySQL, and Linux, and receives "Error 400: Bad Request"]
Chronus in Action
• 1) Notice a failure:
    mount_nfs: rpcbind on server: RPC Port mapper failure - RPC: Timed out
• 2) Write a probe to test for failure:
    #!/bin/sh
    echo 'SUCCESS' > /TTOUTPUT
• 3) Use Chronus to locate the failure in time
• 4) Use diff to extract the result:
    file /etc/rc.conf differs:
    ipfilter=YES
Motivation
• Can we substitute HW effort for human effort?
[Figure: total ownership cost breakdown, hardware costs versus people costs, compared between the 1970’s and the 2000’s]
Time travel disks
• Log a window of recent disk block changes:
    [Figure: disk layout with a checkpoint region and a log region; block writes go to the log, and block reads go through an index]
Motivation
• Complex systems require expert users
    • Performance tuning
    • Security policy specification
    • Software upgrades
• Implications:
    • High cost
        • System administration is 60-80% of TCO
    • Poor quality
        • e.g., unpatched home user machines
How can we use computer power to simplify system administration tasks?
Rollback-based Recovery
• Key challenge: recovering lost work
• Requires application assistance:
    • Windows XP: application-specific rollback
    • Operator Undo: application-specific state repair
• May corrupt system state
[Figure: timeline from a known good configuration to a failing configuration; the span between them is labeled "Lost work"]
STRIDER
• Configuration debugging by observing program side effects
• Disadvantages relative to Chronus
    • False positives
    • Registry-specific
    • Requires heuristics to prune the search space
    • Misses indirect dependencies
• Advantages:
    • Only requires one program invocation
    • Can compare configurations across space
        • e.g., a Registry on a remote machine
Time Travel Disk Implementation
• Capture and record block updates to a log region:
    [Figure: disk layout with a checkpoint region and a log region ordered by time; block writes are appended to the log, and block reads go through an index]
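A toy in-memory model may help make the log, checkpoint, and index concrete. It is a sketch under assumptions, not the on-disk format used by Chronus: the class `TimeTravelDisk`, its method names, and the use of the log position as the timestamp are all invented for illustration.

```python
import bisect
from typing import Dict, List, Mapping, Tuple


class TimeTravelDisk:
    """Append-only block log with a per-block index (illustrative only)."""

    def __init__(self, checkpoint: Mapping[int, bytes]) -> None:
        # Block contents at the start of the logged window.
        self._checkpoint: Dict[int, bytes] = dict(checkpoint)
        # Append-only log of (timestamp, block_no, data).
        self._log: List[Tuple[int, int, bytes]] = []
        # Index: block_no -> positions in the log, in write order.
        self._index: Dict[int, List[int]] = {}

    def write(self, block_no: int, data: bytes) -> int:
        """Append a block update; return its timestamp (log position + 1)."""
        position = len(self._log)
        timestamp = position + 1
        self._log.append((timestamp, block_no, data))
        self._index.setdefault(block_no, []).append(position)
        return timestamp

    def read(self, block_no: int, as_of: int) -> bytes:
        """Return the block contents as they were at timestamp `as_of`."""
        positions = self._index.get(block_no, [])
        # The entry at log position p has timestamp p + 1, so the updates
        # visible at `as_of` are exactly those at positions < as_of.
        visible = bisect.bisect_left(positions, as_of)
        if visible == 0:
            return self._checkpoint.get(block_no, b"")
        return self._log[positions[visible - 1]][2]


disk = TimeTravelDisk({0: b"old"})
t1 = disk.write(0, b"v1")
t2 = disk.write(1, b"x")
t3 = disk.write(0, b"v2")
print(disk.read(0, as_of=0))   # contents from the checkpoint
print(disk.read(0, as_of=t2))  # still the first update to block 0
print(disk.read(0, as_of=t3))  # the latest update
```

Reading "as of" a timestamp never mutates the log, so any point in the window can be revisited in any order, which is what the search needs.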
Chronus Design Choices
• Time-travel disks
    • Pro: Captures all state changes without OS/app support
    • Pro: Simple (~1200 lines of code)
    • Con: Lack of semantic knowledge
    • Con: Inconsistent results from raw disk snapshot
• Virtual machine restarts
    • Pro: More complete than application-level restarts
    • Pro: Faster and safer than physical machine restarts
    • Con: Requires that all devices have been virtualized
    • Con: Misses changes in the hardware-abstraction layer