STRIDER: A Black-box, State-based Approach to Change and Configuration Management and Support (CCMS)
Yi-Min Wang, Chad Verbowski, John Dunagan, Yu Chen, Helen J. Wang, Chun Yuan, & Zheng Zhang
Microsoft Research, Redmond & Beijing
The Problem: Computer Fragility
• “It worked yesterday, but not today.”
• “It worked for that user, but not this user.”
• “It worked on that machine, but not this machine.”
• “I restarted the application, rebooted the machine, but still can’t fix the problem!”
• We focus on Registry-related problems in this paper
Inspired by the Human Genome Project
• PC: 200,000 Registry values. Human: 3 billion DNA base pairs.
• Desktop last week vs. desktop today: 99% the same. Human #1 vs. Human #2: 99.9% the same.
• Desktop vs. laptop: 65% similarity. Human vs. mouse: 70% - 90% similarity.
• PC: >11% “junk” entries. Human: 50% “junk” DNA.
• PC: < 5% code for configuration changes. Human: < 2% code for proteins.
• Registry entries for “garbage fonts disease”: found at the Fonts key under HKLM\Software\Microsoft\Windows NT\CurrentVersion. Gene for Huntington's disease: found at the tip of the short arm of Chromosome 4.
Contributions of STRIDER
• Strider Principles
  • Key to handling complexity in CCMS
  • Problem decomposition into 7 Strider components
• Strider Process
  • Conceptual use of Strider components to solve a particular CCMS problem
• Strider Toolkit
  • Implementation of Strider components as command-line building blocks
• Strider Troubleshooter
  • UI root-cause analysis tool that strings together command-line tools for troubleshooting
Principle #1: State-Based Analysis
• First-level decomposition: Mechanical, Statistical, & Database
• Symptom-based analysis: knowledge, experience, & support database; imprecise, nondeterministic search
• State-based analysis: mechanical & statistical; precise database lookup in the PC Genomics Database
  • “Is this a junk entry?”
  • “Who owns this entry?”
  • “Are there known problems with this entry?”
[Figure: an app or action A acting on state and leading to a persistent failure. Other diagram labels: B, C, Y, Z, Latency.]
Principle #2: Attack The Mess With The Mass
• Second-level decomposition: Diff, Trace, & Intersection
• The Mess: the number of different configurations grows with the number of machines (freedom & flexibility, large install base)
• The Mass: the number of data points grows with the number of machines (large install base)
[Figure: mechanical narrowing of a WinXP Registry (labels: 200,000 and 77,000) by diffing good and bad states from System Restore checkpoints, tracing, and intersecting the diff with the trace.]
Principle #3: Complexity-Noise Filtering
• Self-filtering of complexity as noise
  • A lot of the differences are not significant for systems management and troubleshooting
• Registry entries that are constantly changing are less important; they are simply “operational states”
  • Inverse Change Frequency (ICF) ranking
• Registry entries that are always different on different machines constitute natural diversity among Windows machines
• Start with deterministic bad state, end with deterministic bad behavior
  • Nondeterministic activities in-between are often less important
  • Intersection of multiple traces can filter out such noise
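The two filters named on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the Strider Toolkit: the IDF-style formula `log((intervals + 1) / (changes + 1))` is an assumption (the slides name ICF but do not define it), and the function names and the dict-per-checkpoint representation are hypothetical.

```python
import math


def change_counts(checkpoints):
    """Count, per entry, how many consecutive checkpoint pairs differ."""
    counts = {}
    for prev, curr in zip(checkpoints, checkpoints[1:]):
        for key in prev.keys() | curr.keys():
            if prev.get(key) != curr.get(key):
                counts[key] = counts.get(key, 0) + 1
    return counts


def icf_scores(checkpoints):
    """Assumed ICF score: entries that change often score low."""
    intervals = len(checkpoints) - 1
    counts = change_counts(checkpoints)
    entries = set().union(*checkpoints)  # every entry seen in any checkpoint
    return {
        key: math.log((intervals + 1) / (counts.get(key, 0) + 1))
        for key in entries
    }


def rank_by_icf(checkpoints):
    """Order entries from highest ICF (rarely changed) to lowest."""
    scores = icf_scores(checkpoints)
    return sorted(scores, key=lambda key: (-scores[key], key))


def common_trace_entries(traces):
    """Keep only entries accessed in every trace of the failing action."""
    return set.intersection(*(set(trace) for trace in traces))
```

Each checkpoint is modeled as a dict from entry name to value, and each trace as a list of accessed entry names. An entry that changes in every interval gets a score of 0 and sinks to the bottom of the ranking, which matches the slide's point that constantly changing "operational states" matter less.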
[Figure: the mechanical pipeline (diff of good and bad System Restore checkpoints, trace, intersection) over a WinXP Registry (labels: 200,000 and 77,000), extended to mechanical + statistical analysis with a global state-snapshot repository. Local cross-time analysis and global cross-machine analysis each provide noise filtering & state ranking.]
Registry Change-Behavior Analysis
• Four machines, each with 84 days of checkpoints
• Percentage ever changed: 4.7% - 13.2%
• Percentage operational: 1.9% - 5.6%
• Percentage installation/configuration: 2.1% - 11.3%
• Median # changes/day = 302 (raw), 29 (noise filtered)
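Two of the statistics above, "percentage ever changed" and "median # changes/day", can be computed from a series of daily checkpoints roughly as follows. This is a sketch under assumptions: each checkpoint is a dict from entry name to value, the denominator is every entry seen in any checkpoint, and the function names are hypothetical. The slides do not say how the original measurements were taken.

```python
from statistics import median


def daily_change_sets(checkpoints):
    """For each consecutive pair of checkpoints, the set of entries that differ."""
    return [
        {key for key in prev.keys() | curr.keys() if prev.get(key) != curr.get(key)}
        for prev, curr in zip(checkpoints, checkpoints[1:])
    ]


def change_behavior_summary(checkpoints):
    """Summarize change behavior; needs at least two non-empty checkpoints."""
    per_day = daily_change_sets(checkpoints)
    all_entries = set().union(*checkpoints)  # assumed denominator
    ever_changed = set().union(*per_day)
    return {
        "percent_ever_changed": 100.0 * len(ever_changed) / len(all_entries),
        "median_changes_per_day": median(len(changed) for changed in per_day),
    }
```

Splitting the changed entries into "operational" and "installation/configuration" would need an extra rule, such as a change-frequency threshold, which the slides do not specify and this sketch leaves out.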
Strider Components
• Mechanical
  • State Diff: diff “bad state” against “last known working state”
  • Tracing: failing app execution or booting
  • Intersection: diff & trace
• Statistical
  • State Ranking:
    • Inverse Change Frequency (ICF) ranking: states with high change frequencies are less likely to be the root cause
    • Order ranking: states accessed later are more likely to be the result of execution divergence caused by the earlier root-cause entry
• Database
  • PC Genomics Database: state functional & failure info
    • 5.1. “Is this a junk entry?” – Noise Filtering
    • 5.2. “Who owns this entry?” – Ownership Mapping
    • 5.3. “Are there known problems with this entry?” – Support Database Lookup
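The mechanical components and order ranking compose naturally, as the following Python sketch suggests. It assumes a Registry snapshot is a dict from entry name to value and a trace is the ordered list of entries accessed by the failing run. The function names and the optional `noise` set are hypothetical, not the Strider Toolkit's command-line interface.

```python
def state_diff(good, bad):
    """Entries whose value differs between the last known working state and the bad state."""
    return {key for key in good.keys() | bad.keys() if good.get(key) != bad.get(key)}


def ranked_candidates(good, bad, trace, noise=frozenset()):
    """Intersect diff with trace, drop known noise, and keep first-access order.

    Returning candidates in first-access order is one reading of order
    ranking: earlier accesses are listed first as likelier root causes.
    """
    changed = state_diff(good, bad)
    seen = set()
    candidates = []
    for key in trace:
        if key in changed and key not in noise and key not in seen:
            seen.add(key)
            candidates.append(key)
    return candidates
```

For example, with `good = {"Fonts": "arial", "Cache": "a", "Run": "1"}`, `bad = {"Fonts": "", "Cache": "b", "Run": "1", "New": "z"}`, and `trace = ["Run", "Fonts", "Cache", "Fonts"]`, the diff contains `Fonts`, `Cache`, and `New`, but only `Fonts` and `Cache` were accessed by the failing run, so the candidate list is `["Fonts", "Cache"]`. Passing `noise={"Cache"}` leaves only `Fonts`.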
Strider Process for Troubleshooting
• User: “It was working. Now it doesn’t work. The program keeps failing.”
• Narrow-down phase: State Diff, Tracing, Intersection, Noise Filtering, and State Ranking produce a filtered & ranked candidate set
• Solution-query phase: Support Database Lookup and Ownership Mapping against the PC Genomics Database
[Figure: process diagram. Other labels: User, Tool, Support Articles, Config Action UI, App Info Doc.]
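Strider Troubleshooter: Cross-restore-point Results
[Figure: chart comparing the average Registry size with the candidate set after state diff, after diff & trace intersection, and after noise filtering, plus the order-ranking of the root cause. Annotations: “Two Orders of Magnitude” and “Another Two Orders of Magnitude”.]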
Cross-machine Results
[Figure: chart of the number of Registry values at each stage: average Registry size, after state diff, after diff & trace intersection, and after noise filtering, plus the order-ranking of the root cause.]
Summary
• Think outside the white-box
  • Derive “black-box manifests” through PC Genomics (tracing, diffing, & behavior modeling) and show their benefits for CCMS
  • White-box & black-box approaches complement each other
• State+Symptom-based troubleshooting
  • State-based support articles can be retrieved by symptom-based search
  • Symptom-based search can be enhanced with additional state-based strings
  • Symptom-based matching can help state ranking
Future Work
• Long-term goal: develop new abstractions for systems management
• Configuration Change Audits
  • “What has changed on my machine since last week, and who did it?”
• Impact Analysis
  • “Is applying this patch going to break my apps?”
• Server Drift
  • “What’s causing my server machines’ configurations to diverge?”