360 likes | 480 Views
Automated Known Problem Diagnosis with Event Traces. Chun Yuan , Ni Lao, Ji-Rong Wen, Jiwei Li, Zheng Zhang, Yi-Min Wang, Wei-Ying Ma Microsoft Research , Tsinghua, USTC. Motivation . Computer problem diagnosis: inefficient & inaccurate Too much human involvement
E N D
Automated Known Problem Diagnosis with Event Traces Chun Yuan, Ni Lao, Ji-Rong Wen, Jiwei Li, Zheng Zhang, Yi-Min Wang, Wei-Ying Ma Microsoft Research, Tsinghua, USTC
Motivation • Computer problem diagnosis: inefficient & inaccurate • Too much human involvement • Product feedback channel becoming mature • Diagnosis process remaining labor intensive • Solutions need manual inspection • Automating diagnosis process for known problems • Text-based symptom matching • Leveragingsystem information • State-based diagnosis + root cause ranking [Strider] • Online crash analysis • Mechanized troubleshooting steps • How to represent a known problem? How to recognize a known problem?
Our Approach • Known problems are annotated with relevant system behavior • New behavior is classified to some known problem based on similarity to existing behavior • Event traces as typical exhibitions of system behavior
Labeling Known Problem DB Workflow (2) Troubleshooter Classifier (3) (1) (4) Apps Tracer
Tracer • What events to collect? • System calls • What attributes to collect? • Process/thread ID • Process/thread name • System call name, parameters, return value
Trace Example # process thread syscall paramaters & return value ... 18419 iexplore.exe 3892 CreateThread Process: 3888, Thread: 3896 SUCCESS 18420 iexplore.exe 3892 PostMessage WM_USER+0x300 1 18421 iexplore.exe 3892 OpenKey HKCU\SOFTWARE\Microsoft\Internet Explorer\Main SUCCESS 18422 iexplore.exe 3892 QueryValue HKCU\SOFTWARE\Microsoft\Internet Explorer\Main\Enable Browser Extensions NOTFOUND 18423 iexplore.exe 3892 OpenKey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings SUCCESS 18424 iexplore.exe 3892 QueryValue HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\DontUseDNSLoadBalancing NOTFOUND 18425 InoRT.exe 256 OpenKey HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\eTrustAntivirus\CurrentVersion\Component SUCCESS 18426 InoRT.exe 256 OpenKey HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\eTrustAntivirus\CurrentVersion\Contact SUCCESS 18427 InoRT.exe 256 OpenKey HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\eTrustAntivirus\CurrentVersion\Distribute SUCCESS 18428 InoRT.exe 256 OpenKey HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\eTrustAntivirus\CurrentVersion\InocDll Views SUCCESS 18429 iexplore.exe 3892 CloseKey HKCU\SOFTWARE\Microsoft\Internet Explorer\Main SUCCESS 18430 iexplore.exe 3892 CloseKey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings SUCCESS 18431 iexplore.exe 3892 PeekMessage WM_LBUTTONDOWN 1 18433 CcmExec.exe 3068 OpenKey HKCU SUCCESS 18434 CcmExec.exe 3068 OpenKey HKCU\Software\Policies\Microsoft\Control Panel\Desktop NOTFOUND 18435 InoTask.exe 2024 ReadFile \\??\C:\Program Files\MSDN\2003FEB\1033\dnmscs.hxi SUCCESS 18436 CcmExec.exe 3068 OpenKey HKCU\Control Panel\Desktop SUCCESS 18437 CcmExec.exe 3068 QueryValue HKCU\Control Panel\Desktop\MultiUILanguageId NOTFOUND 18438 CcmExec.exe 3068 CloseKey HKCU\Control Panel\Desktop SUCCESS 18439 CcmExec.exe 3068 CloseKey HKCU SUCCESS 18440 CcmExec.exe 3068 OpenFile \\??\C:\WINDOWS\System32\dwwin.exe SUCCESS 18441 CcmExec.exe 3068 CreateFile \\??\C:\WINDOWS\System32\dwwin.exe SUCCESS 18442 CcmExec.exe 3068 QueryValue HKCU\Control Panel\Desktop\MultiUILanguageId NOTFOUND 18443 CcmExec.exe 3068 CloseKey HKCU\Control Panel\Desktop SUCCESS ...
Classifier • Classification • Training: learn a model from annotated training data • Testing: predict the class a novel one belong to • Accuracy • Percentage of true positive data • Cross-validation • Feature: N-gram • Any N successive elements in a sequence • Feature vector • Given a set of sequences, let {g1, g2, …, g_K} be all unique N-grams occurring in any of the sequences • A sequence is represented by the vector (b1,…,b_K) • b_k = 1 if g_k is in the sequnce • b_k = 0 otherwise
Support Vector Machines • The decision boundary should be as far away from the data of both classes as possible, that is, maximizing the margin, m Class 2 m Class 1
Comparing System Call Traces • Understand system call variations • Sequence alignment [abcdb] [abcd b ] [adabe] => [a dabe ] [bcdae] [ bcda e ] [abcabf] [abc ab f] • Uniform ordering • Group by process/thread • Sort by process/thread name
A B C System Call Variation • Cross-time comparison • Noise filtering • Patterns occurring at less than Th_t% times are discarded • Object name canonicalization • Registry path discarded • Case-by-case rules
System Call Variation • Cross-machine comparison • Noise filtering • Patterns occurring on less than Th_m% machines are discarded • Canonicalization • File path discarded
Evaluation • Select target problems • Collect traces with fault injection • Parameter sensitivity (cross-time & cross-machine) • Noise filtering threshold • Canonicalization or not • Pattern length • System call attributes availability
Data Collection • For each machine • For each round • For each problem • Inject the fault into the machine • Repeat a number of times • Start tracer • Replay the UI script to reproduce the symptom • Stop tracer • Remove the fault • Cross-time: 2 cases, 5 machines, 6 rounds (3 for training) • Cross-machine: 4 cases, 20 machines (10 for training)
Noise Filtering Threshold • Impact on accuracy IeDisplay SharedFolder
Noise Filtering Threshold • Impact on feature space dimension and training time SharedFolder IeDisplay
Canonicalization IeDisplay SharedFolder
Pattern Length IeDisplay SharedFolder
Attributes Availability • No thread name, system call parameters, return value IeDisplay SharedFolder
Summary of Cross-time Results • Noise filtering substantially reduces feature space and training time • Canonicalization has no effect • Longer patterns only helpful when fewer attributes are available
Noise Filtering Threshold • Feature space dimension and training time
Pattern Length • Sort traces by system call name
Attributes Availability • Process name + system call name
Convergence (a) 1-gram, full system call record (b) 2-gram, process name + system call name
Best Results (a) 1-gram, full system call record (b) 2-gram, process name + system call name
Summary of Cross-machine Results • Canonicalization is very effective • 1-gram is good enough; 2-gram is useful when only process name and system call name are available • Accuracy converges very quickly with more training machines
Related Work • Using computers to diagnose computer problems [Redstone et al.] • Capturing, Indexing, Clustering, and Retrieving System History [Cohen et al.] • Pinpoint [Chen et al.] • Magpie [Barham et al.] • Host-based intrusion detection • Fault localization in computer networks
Conclusion • Automatic root cause recognition using statistical learning on system behavior similarity • For the 4 cases, nearly 90% recognition accuracy can be achieved with appropriate preprocessing of system call traces • Sensitivity study shows the method is robust to available event details