R. Barrett, P. Maglio, E. Kandogan, J. Bailey, Usable Autonomic Computing Systems: the Administrators' Perspective, ICAC 2004 • A. Brown and J. Hellerstein, Reducing the Cost of IT Operations - Is Automation Always the Answer?, HotOS 2005 • Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Automatic Misconfiguration Troubleshooting with PeerPressure, OSDI ’04
R. Barrett, P. Maglio, E. Kandogan, J. Bailey, Usable Autonomic Computing Systems: the Administrators' Perspective, ICAC 2004
Motivation • The problem: administering highly complex systems • Autonomic computing (AC) manages complexity through automation • from low-level configuration settings to high-level business-oriented policies • The risk: making management harder • systems change more rapidly • administrator controls affect more systems • so administrator controls will be both more powerful and more dangerous • Goal: inform the design of AC • Methodology: ethnographic field study
What do system administrators do? • rehearsal and planning • maintaining situation awareness • managing multitasking, interruptions and diversions
Tools • Command-line based console • command-line interfaces (CLIs) • multitasking, history, scripting • fast and reliable probing of disparate parts of the system • easy to customize • Standalone graphical applications • graphical user interfaces (GUIs) • good for unfamiliar tasks and novice users • depend on graphics support; insufficient support for multitasking • Web-based management tools • don’t depend on graphics support • can be integrated to provide an organized suite
Analysis and Guidelines for AC • Phases • rehearsal and planning • maintaining situation awareness • managing multitasking, interruptions and diversions
Rehearsing and Planning • Necessary for critical systems because of both the chance of human error and the danger of unforeseen consequences • AC may increase both of these dangers • as the scale and degree of coupling within complex systems increase, new patterns of failure may develop through a series of smaller failures • as autonomic managers automatically reconfigure subsystems, the effects on the overall system may be difficult to predict • Guidelines • test systems should be easy to build • AC systems should be designed so that changes can be undone quickly
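The "quickly undo changes" guideline can be illustrated with a small sketch. This is only one possible reading of the guideline, not something taken from the paper: the class `UndoableConfig`, its methods, and the setting names are all hypothetical. Every change records the previous value so the most recent change can be reverted in one step.

```python
class UndoableConfig:
    """Hypothetical in-memory configuration store with an undo log."""

    _MISSING = object()  # marks a key that did not exist before a change

    def __init__(self, initial=None):
        self._values = dict(initial or {})
        self._undo_log = []  # stack of (key, previous value) pairs

    def set(self, key, value):
        # Remember the old value (or its absence) before overwriting it.
        self._undo_log.append((key, self._values.get(key, self._MISSING)))
        self._values[key] = value

    def undo(self):
        """Revert the most recent change; return False if there is none."""
        if not self._undo_log:
            return False
        key, previous = self._undo_log.pop()
        if previous is self._MISSING:
            self._values.pop(key, None)
        else:
            self._values[key] = previous
        return True

    def snapshot(self):
        """Return a copy of the current settings."""
        return dict(self._values)


# Illustrative setting names only.
cfg = UndoableConfig({"max_connections": 100})
cfg.set("max_connections", 500)
cfg.set("cache", "on")
cfg.undo()  # removes "cache"
cfg.undo()  # restores max_connections to 100
```

A real autonomic manager would also have to undo side effects on coupled subsystems, which is exactly where the slide expects prediction to become hard.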
Situation Awareness • Administrators deal with dynamic and complex processes at many different levels of abstraction • They need to be aware of systems that are not only complex but also change frequently • Each system has its own management interface, so gaining overall situation awareness is very difficult • Guidelines • Automation has made operators more passive • Automated systems typically hide details from operators • Consequently, operator workload decreases during normal operating conditions but increases during critical conditions • AC systems must provide facilities for rapidly gaining deeper situation awareness when problems arise
Multitasking, Interruptions, Diversions • Conventional systems • administrators work with many components, but each component operates relatively independently • Guidelines • Because each level affects a component’s operation, it will be difficult to design a general workflow for debugging • Therefore, AC interfaces should allow multiple simultaneous views of system components and aggregates to support interaction at multiple levels
A. Brown and J. Hellerstein, Reducing the Cost of IT Operations - Is Automation Always the Answer?, HotOS 2005
Is Automation Always the Answer? No! Why?
Helen J. Wang, John C. Platt, Yu Chen, Ruyun Zhang, and Yi-Min Wang, Automatic Misconfiguration Troubleshooting with PeerPressure, OSDI ’04
Misconfiguration Diagnosis • Technical support accounts for 17% of total cost of ownership (TCO) [Tolly2000] • Many application malfunctions stem from misconfigurations • Why? • Shared configuration data (e.g., the Registry) with uncoordinated access and updates from different applications • How about maintaining the golden config state? • Very hard [Larsson2001] • Complex software components and compositions • Third-party applications • …
Outline • Motivation • Goals • Design • Prototype • Evaluation results • Future work • Concluding remarks
Goals • Effectiveness • Small set of sick configuration candidates that contain the root-cause entries • Automation • No second party involvement • No need to remember or identify what is healthy
Intuition behind PeerPressure • Assumption • Applications function correctly on most machines -- a malfunction is an anomaly • Succumb to the peer pressure
An Example • Is R1 sick? Most likely • Is R2 sick? Probably not • Is R3 sick? Maybe not • R3 looks like an operational state • We use Bayesian statistics to estimate the sick probability of a suspect -- our ranking metric
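The intuition behind the three cases can be sketched in a few lines. The score below is a deliberately simple heuristic chosen for illustration, not the Bayesian estimator from the paper; the function names `sick_score` and `rank_suspects` and all entry values are assumptions. A suspect scores high when its value disagrees with the peers, and the score is discounted when the peers themselves disagree with one another, since such an entry looks like operational state.

```python
from collections import Counter


def sick_score(suspect_value, peer_values):
    """Illustrative heuristic (not the paper's estimator).

    High when the suspect's value is rare among peers; divided by the
    number of distinct peer values so that naturally varying entries
    are ranked lower.
    """
    if not peer_values:
        raise ValueError("need at least one peer sample")
    counts = Counter(peer_values)
    match_fraction = counts[suspect_value] / len(peer_values)
    diversity = len(counts)  # number of distinct values seen among peers
    return (1.0 - match_fraction) / diversity


def rank_suspects(suspects, peer_snapshots):
    """Return (entry, score) pairs sorted from most to least suspicious."""
    scored = []
    for entry, value in suspects.items():
        peer_values = [snap.get(entry) for snap in peer_snapshots]
        scored.append((entry, sick_score(value, peer_values)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical values for the three entries on the sick machine.
suspects = {"R1": 0, "R2": "on", "R3": 17}

# Hypothetical peer snapshots: R1 is uniform and differs from the suspect,
# R2 is uniform and matches it, R3 differs on every machine.
peer_snapshots = [
    {"R1": 1, "R2": "on", "R3": 3},
    {"R1": 1, "R2": "on", "R3": 8},
    {"R1": 1, "R2": "on", "R3": 21},
    {"R1": 1, "R2": "on", "R3": 42},
    {"R1": 1, "R2": "on", "R3": 5},
]

ranking = rank_suspects(suspects, peer_snapshots)
for entry, score in ranking:
    print(entry, round(score, 3))
```

With these made-up values R1 ranks first, R3 second and R2 last, which mirrors the "most likely", "maybe not" and "probably not" answers on the slide.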
PeerPressure System Overview • Run the faulty app under the App Tracer to collect Registry entry suspects (Entry: Data) • HKLM\Software\Msft\...: On • HKLM\System\Setup\...: 0 • HKCU\%\Software\...: null • Canonicalizer • Search & Fetch from the Database or the Peer-to-Peer Troubleshooting Community • Statistical Analyzer • Troubleshooting Result (Entry: Prob.) • HKLM\Software\Msft\...: 0.6 • HKLM\System\Setup\...: 0.2 • HKCU\%\Software\...: 0.003
Evaluation Data Set • 87 live Windows XP registry snapshots (in the database) • Half of these snapshots are from three diverse organizations within Microsoft: Operations and Technology Group (OTG) Helpdesk in Colorado, MSR-Asia, and MSR-Redmond. • The other half are from machines across Microsoft that were reported to have potential Registry problems • 20 real-world troubleshooting cases with known root-causes
Response Time • Number of suspects: 8 to 26,308, with a median of 1,171 • 45 seconds on average, with SQL Server hosted on a 2.4 GHz CPU workstation with 1 GB RAM • Sequential database queries dominate
Troubleshooting Effectiveness • Metric: rank of the root cause among the suspects • Results: • Rank = 1 for 12 cases • Rank = 2 for 3 cases • Rank = 3, 9, 12, 16 for 4 cases, respectively • One case could not be solved
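The rank counts above add up to the 20 troubleshooting cases, and a short calculation makes the headline fractions explicit. The variable names are illustrative; the numbers are the ones listed on the slide.

```python
# Root-cause ranks for the 19 solved cases, as listed above.
ranks = [1] * 12 + [2] * 3 + [3, 9, 12, 16]
unsolved = 1

total = len(ranks) + unsolved               # all troubleshooting cases
top1 = sum(r == 1 for r in ranks) / total   # root cause ranked first
top3 = sum(r <= 3 for r in ranks) / total   # root cause within the top three

print(f"cases: {total}")          # cases: 20
print(f"rank 1: {top1:.0%}")      # rank 1: 60%
print(f"rank <= 3: {top3:.0%}")   # rank <= 3: 80%
```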
Concluding Remarks • Automatic misconfiguration diagnosis is possible • Statistics from the mass of peer machines automate the otherwise manual identification of the healthy state • Initial results are promising