
Automated Known Problem Diagnosis with Event Traces




  1. Automated Known Problem Diagnosis with Event Traces Chun Yuan, Ni Lao, Ji-Rong Wen, Jiwei Li, Zheng Zhang, Yi-Min Wang, Wei-Ying Ma Microsoft Research, Tsinghua, USTC

  2. Motivation • Computer problem diagnosis: inefficient & inaccurate • Too much human involvement • Product feedback channel becoming mature • Diagnosis process remaining labor intensive • Solutions need manual inspection • Automating diagnosis process for known problems • Text-based symptom matching • Leveraging system information • State-based diagnosis + root cause ranking [Strider] • Online crash analysis • Mechanized troubleshooting steps • How to represent a known problem? How to recognize a known problem?

  3. Problem Representation and Recognition Methods

  4. Problem Representation and Recognition Methods

  5. Our Approach • Known problems are annotated with relevant system behavior • New behavior is classified to some known problem based on similarity to existing behavior • Event traces as typical exhibitions of system behavior

  6. Workflow (architecture diagram: Apps, Tracer, Known Problem DB with labeling, Classifier, Troubleshooter; arrows numbered 1–4)

  7. Tracer • What events to collect? • System calls • What attributes to collect? • Process/thread ID • Process/thread name • System call name, parameters, return value

  8. Trace Example
# process thread syscall parameters & return value
...
18419 iexplore.exe 3892 CreateThread Process: 3888, Thread: 3896 SUCCESS
18420 iexplore.exe 3892 PostMessage WM_USER+0x300 1
18421 iexplore.exe 3892 OpenKey HKCU\SOFTWARE\Microsoft\Internet Explorer\Main SUCCESS
18422 iexplore.exe 3892 QueryValue HKCU\SOFTWARE\Microsoft\Internet Explorer\Main\Enable Browser Extensions NOTFOUND
18423 iexplore.exe 3892 OpenKey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings SUCCESS
18424 iexplore.exe 3892 QueryValue HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings\DontUseDNSLoadBalancing NOTFOUND
18425 InoRT.exe 256 OpenKey HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\eTrustAntivirus\CurrentVersion\Component SUCCESS
18426 InoRT.exe 256 OpenKey HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\eTrustAntivirus\CurrentVersion\Contact SUCCESS
18427 InoRT.exe 256 OpenKey HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\eTrustAntivirus\CurrentVersion\Distribute SUCCESS
18428 InoRT.exe 256 OpenKey HKEY_LOCAL_MACHINE\SOFTWARE\ComputerAssociates\eTrustAntivirus\CurrentVersion\InocDll Views SUCCESS
18429 iexplore.exe 3892 CloseKey HKCU\SOFTWARE\Microsoft\Internet Explorer\Main SUCCESS
18430 iexplore.exe 3892 CloseKey HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Internet Settings SUCCESS
18431 iexplore.exe 3892 PeekMessage WM_LBUTTONDOWN 1
18433 CcmExec.exe 3068 OpenKey HKCU SUCCESS
18434 CcmExec.exe 3068 OpenKey HKCU\Software\Policies\Microsoft\Control Panel\Desktop NOTFOUND
18435 InoTask.exe 2024 ReadFile \\??\C:\Program Files\MSDN\2003FEB\1033\dnmscs.hxi SUCCESS
18436 CcmExec.exe 3068 OpenKey HKCU\Control Panel\Desktop SUCCESS
18437 CcmExec.exe 3068 QueryValue HKCU\Control Panel\Desktop\MultiUILanguageId NOTFOUND
18438 CcmExec.exe 3068 CloseKey HKCU\Control Panel\Desktop SUCCESS
18439 CcmExec.exe 3068 CloseKey HKCU SUCCESS
18440 CcmExec.exe 3068 OpenFile \\??\C:\WINDOWS\System32\dwwin.exe SUCCESS
18441 CcmExec.exe 3068 CreateFile \\??\C:\WINDOWS\System32\dwwin.exe SUCCESS
18442 CcmExec.exe 3068 QueryValue HKCU\Control Panel\Desktop\MultiUILanguageId NOTFOUND
18443 CcmExec.exe 3068 CloseKey HKCU\Control Panel\Desktop SUCCESS
...
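
As a sketch, the fixed-position fields of such a trace record can be parsed as follows (the record layout and the `TraceEvent`/`parse_event` names are assumptions based on the example above, not the paper's actual tracer format):

```python
from typing import NamedTuple

class TraceEvent(NamedTuple):
    seq: int        # event sequence number
    process: str    # process image name, e.g. "iexplore.exe"
    thread: int     # thread ID
    syscall: str    # system call name
    detail: str     # parameters and return value, left unparsed here

def parse_event(line: str) -> TraceEvent:
    # The first four whitespace-separated fields are fixed; everything
    # after the system call name is free-form parameter/return detail.
    seq, process, thread, syscall, detail = line.split(None, 4)
    return TraceEvent(int(seq), process, int(thread), syscall, detail)
```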

  9. Classifier • Classification • Training: learn a model from annotated training data • Testing: predict the class a novel instance belongs to • Accuracy • Percentage of correctly classified data • Cross-validation • Feature: N-gram • Any N successive elements in a sequence • Feature vector • Given a set of sequences, let {g_1, g_2, …, g_K} be all unique N-grams occurring in any of the sequences • A sequence is represented by the vector (b_1, …, b_K) • b_k = 1 if g_k is in the sequence • b_k = 0 otherwise
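
The N-gram feature construction above can be sketched in a few lines (a minimal illustration; `ngram_features` is a hypothetical helper, not code from the paper):

```python
def ngram_features(sequences, n=2):
    """Map each event sequence to a binary vector over all n-grams
    g_1 .. g_K observed anywhere in the given sequences."""
    vocab = sorted({tuple(seq[i:i + n])
                    for seq in sequences
                    for i in range(len(seq) - n + 1)})
    index = {g: k for k, g in enumerate(vocab)}
    vectors = []
    for seq in sequences:
        b = [0] * len(vocab)
        for i in range(len(seq) - n + 1):
            b[index[tuple(seq[i:i + n])]] = 1   # b_k = 1 iff g_k occurs
        vectors.append(b)
    return vocab, vectors
```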

  10. Support Vector Machines • The decision boundary should be as far away from the data of both classes as possible, that is, maximizing the margin m
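
A one-function sketch of the geometric margin m being maximized (the SVM training itself is omitted; `margin` is an illustrative helper, assuming labels y in {-1, +1}):

```python
import math

def margin(w, b, points):
    """Geometric margin of the hyperplane w·x + b = 0 over labelled
    points (x, y); positive only if every point is on its correct side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in points)
```

An SVM chooses the w and b that make this quantity as large as possible on the training data.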

  11. Comparing System Call Traces • Understand system call variations • Sequence alignment, e.g.
[abcdb]      [abcd b  ]
[adabe]  =>  [a dabe  ]
[bcdae]      [ bcda e ]
[abcabf]     [abc ab f]
• Uniform ordering • Group by process/thread • Sort by process/thread name
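
A minimal sketch of the uniform-ordering step, assuming each event is a (process name, thread ID, system call) tuple (the tuple layout and the name `uniform_order` are illustrative):

```python
def uniform_order(events):
    # Python's sort is stable, so sorting by (process name, thread ID)
    # groups each thread's events together and preserves the original
    # per-thread call order, while making the interleaving of threads
    # deterministic across runs.
    return sorted(events, key=lambda e: (e[0], e[1]))
```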

  12. System Call Variation • Cross-time comparison • Noise filtering • Patterns occurring in less than Th_t% of runs are discarded • Object name canonicalization • Registry path discarded • Case-by-case rules
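
The noise filter can be sketched as follows, assuming each repeated trace has already been reduced to its set of patterns (Th_t, or Th_m for cross-machine, corresponds to `threshold`; the helper name is illustrative):

```python
from collections import Counter

def filter_noise(traces, threshold=0.5):
    """Keep only patterns occurring in at least `threshold` fraction of
    the repeated traces; rarer patterns are discarded as noise."""
    counts = Counter(p for trace in traces for p in set(trace))
    n = len(traces)
    return {p for p, c in counts.items() if c / n >= threshold}
```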

  13. System Call Variation • Cross-machine comparison • Noise filtering • Patterns occurring on less than Th_m% of machines are discarded • Canonicalization • File path discarded
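
Canonicalization can be sketched as dropping the run- and machine-specific object name from an event (the system call list and tuple layout are assumptions for illustration; the paper applies case-by-case rules):

```python
# System calls whose object name is a registry or file path and therefore
# varies across runs and machines (an assumed, partial list).
PATH_SYSCALLS = {"OpenKey", "QueryValue", "CloseKey",
                 "OpenFile", "CreateFile", "ReadFile"}

def canonicalize(event):
    process, syscall, obj, status = event
    if syscall in PATH_SYSCALLS:
        obj = ""  # discard the variable path, keep the call pattern
    return (process, syscall, obj, status)
```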

  14. Evaluation • Select target problems • Collect traces with fault injection • Parameter sensitivity (cross-time & cross-machine) • Noise filtering threshold • Canonicalization or not • Pattern length • System call attributes availability

  15. Selected Problems

  16. Data Collection • For each machine • For each round • For each problem • Inject the fault into the machine • Repeat a number of times • Start tracer • Replay the UI script to reproduce the symptom • Stop tracer • Remove the fault • Cross-time: 2 cases, 5 machines, 6 rounds (3 for training) • Cross-machine: 4 cases, 20 machines (10 for training)
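
The collection protocol above can be sketched as nested loops over machines, rounds, and problems (all helper callables in `ops` are hypothetical placeholders, not the paper's tooling):

```python
def collect_traces(machines, rounds, problems, ops, repeats=3):
    """Drive one fault-injection collection run; `ops` supplies the
    hypothetical actions (inject, start_tracer, replay, stop_tracer,
    remove), invoked in the order the slide describes."""
    for machine in machines:
        for _round in range(rounds):
            for problem in problems:
                ops["inject"](machine, problem)
                for _ in range(repeats):
                    ops["start_tracer"](machine)
                    ops["replay"](machine, problem)   # reproduce the symptom
                    ops["stop_tracer"](machine)
                ops["remove"](machine, problem)
```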

  17. Cross-time Evaluation

  18. Noise Filtering Threshold • Impact on accuracy (charts: IeDisplay, SharedFolder)

  19. Noise Filtering Threshold • Impact on feature space dimension and training time (charts: SharedFolder, IeDisplay)

  20. Canonicalization (charts: IeDisplay, SharedFolder)

  21. Pattern Length (charts: IeDisplay, SharedFolder)

  22. Attributes Availability • No thread name, system call parameters, or return value (charts: IeDisplay, SharedFolder)

  23. Summary of Cross-time Results • Noise filtering substantially reduces feature space and training time • Canonicalization has no effect • Longer patterns are only helpful when fewer attributes are available

  24. Cross-machine Evaluation

  25. Noise Filtering Threshold

  26. Noise Filtering Threshold • Feature space dimension and training time

  27. Canonicalization

  28. Pattern Length

  29. Pattern Length • Sort traces by system call name

  30. Attributes Availability • Process name + system call name

  31. Convergence (a) 1-gram, full system call record (b) 2-gram, process name + system call name

  32. Best Results (a) 1-gram, full system call record (b) 2-gram, process name + system call name

  33. Summary of Cross-machine Results • Canonicalization is very effective • 1-gram is good enough; 2-gram is useful when only process name and system call name are available • Accuracy converges very quickly with more training machines

  34. Related Work • Using computers to diagnose computer problems [Redstone et al.] • Capturing, Indexing, Clustering, and Retrieving System History [Cohen et al.] • Pinpoint [Chen et al.] • Magpie [Barham et al.] • Host-based intrusion detection • Fault localization in computer networks

  35. Conclusion • Automatic root cause recognition using statistical learning on system behavior similarity • For the 4 cases, nearly 90% recognition accuracy can be achieved with appropriate preprocessing of system call traces • Sensitivity study shows the method is robust to the level of event detail available

  36. Questions?
