
Ranking the Importance of Alerts for Problem Determination in Large Computer System

This study outlines a method to rank alerts for problem determination in large computer systems by using a collaborative peer-review mechanism. By extracting system invariants and utilizing value propagation with an ARX model, operators can prioritize and analyze alerts efficiently. The proposed approach aims to improve problem determination processes in complex systems with hidden dependencies. Experimental results show the effectiveness of this method in alert ranking.

minowa


Presentation Transcript


  1. Ranking the Importance of Alerts for Problem Determination in Large Computer System Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena NEC Laboratories America, Princeton

  2. Outline • Introduction • Motivation & Goal • System Invariants • Invariants extraction • Value propagation • Collaborative peer review mechanism • Rules & Fault model • Ranking alerts • Experiment result • Conclusion

  3. Motivation • Large, complex systems are built by integrating many heterogeneous components: servers, routers, storage, and software from multiple vendors. • These components have hidden dependencies. • Components produce log and performance data. • Operators set many rules to check this data and trigger alerts, e.g. CPU% @ Web > 70%. • Rules are set independently and in isolation, based on each operator’s own system knowledge.

  4. Goal • Which alerts should we analyze first? • Alert 1: CPU% @Web > 70% • Alert 2: DiskUsg@Web > 150 • Alert 3: CPU% @DB > 60% • Alert 4: Network@AP > 35k • We introduce a “peer-review” mechanism to rank the importance of alerts, so that operators can prioritize the problem determination process. • Example ranked order: Alert 3, Alert 4, Alert 1, Alert 2. • The mechanism gets more consensus from others and blends system management knowledge from multiple operators.

  5. Alerts Ranking Process • Offline: 1. Extract invariants from the monitoring data of the large system (full automation; invariants model: [ICAC 2006], [TDSC 2006], [TKDE 2007], [DSN 2006]). 2. Operators with domain knowledge define alert rules (Alert 1: CPU% @Web > 70%; Alert 2: DiskUsg@Web > 150; Alert 3: CPU% @DB > 60%; Alert 4: Network@AP > 35k). 3. Sort alert rules. • Online, at the time alerts are received: 4. Rank the real alerts.

  6. System Invariants • [Figure: user requests flow into the target system, where measurements m1, m2, ..., mn are collected at various points; is there any constant relationship among them?] • Flow intensity: the intensity with which internal monitoring data reacts to the volume of user requests. • User requests flow through the system continuously, and many internal measurements react to the volume of user requests accordingly. • We search for relationships among these internal measurements collected at various points. • If a modeled relationship continues to hold all the time, it can be regarded as an invariant of the system.

  7. Invariant Examples • We check implicit relationships rather than the real values of flow intensities, which are always changing. However, many relationships are constant. • Example: x and y are changing, but the equation y = f(x) is constant. • Database server: packet volume V1 and SQL query number N1 give the invariant V1 = f(N1). • Load balancer: input I1 and outputs O1, O2, O3 give the invariant I1 = O1 + O2 + O3.

  8. Automated Invariants Search • Monitoring the target system yields observation data for successive windows [t0-t1], [t1-t2], ..., [tk-tk+1]. • Using a template from the model library, pick any two measurements i, j and learn f_ij from the first window; the learned f_ij are invariant candidates with confidence score P0. • Sequential validation: with new data from [t1-t2], does f_ij hold? If yes, keep it with confidence score P1; if no, drop f_ij. The same check is repeated for every later window up to [tk-tk+1], giving confidence score Pk. • Pi: confidence score.
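The sequential validation loop on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions the slide does not state: each window is summarized by one fitness score, a candidate "holds" in a window when that score exceeds `fit_threshold`, the confidence score is the fraction of windows seen so far in which it held, and the candidate is dropped once confidence falls below `min_confidence`. The function name and both threshold values are hypothetical.

```python
def sequential_validation(fitness_per_window, fit_threshold=0.5, min_confidence=0.9):
    """Return (kept, confidence) for one invariant candidate f_ij.

    fitness_per_window: one fitness score per validation window, in time order.
    Assumption: confidence = windows where the model held / windows seen so far.
    """
    held = 0
    confidence = 1.0
    for k, score in enumerate(fitness_per_window, start=1):
        if score > fit_threshold:
            held += 1
        confidence = held / k
        if confidence < min_confidence:
            # The candidate stopped holding often enough: drop it.
            return False, confidence
    return True, confidence


# A candidate that holds in every window is kept as an invariant.
kept, confidence = sequential_validation([0.9, 0.8, 0.95])
```

Dropping a candidate as soon as its confidence falls too low keeps the candidate set shrinking, so later windows only re-check relationships that have held so far.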

  9. One example in model library • We use an AutoRegressive model with eXogenous inputs (ARX) to learn the relationship between two flow intensity measurements. • Define [model equation not preserved in this transcript] • Given a sequence of real observations, we learn the model with LMS by minimizing the error. • A fitness function can be used to evaluate how well the learned model fits the real data.
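As a rough illustration of learning such a model, the sketch below fits a first-order ARX form, y(t) = a*y(t-1) + b0*x(t) + c, to two synthetic series. Several things here are assumptions rather than the authors' method: the model order, ordinary least squares via the normal equations standing in for the LMS fit named on the slide, the fitness score (one minus the normalized one-step prediction error), and all function names.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a * theta = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    theta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (m[i][n] - tail) / m[i][i]
    return theta


def fit_arx(x, y):
    """Least-squares estimate of (a, b0, c) in y(t) = a*y(t-1) + b0*x(t) + c."""
    rows = [(y[t - 1], x[t], 1.0) for t in range(1, len(y))]
    targets = y[1:]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * v for r, v in zip(rows, targets)) for i in range(3)]
    return solve_linear(ata, aty)


def fitness(x, y, theta):
    """Assumed fitness score: 1 - normalized one-step prediction error."""
    a, b0, c = theta
    predicted = [a * y[t - 1] + b0 * x[t] + c for t in range(1, len(y))]
    actual = y[1:]
    mean = sum(actual) / len(actual)
    err = sum((p - v) ** 2 for p, v in zip(predicted, actual))
    spread = sum((v - mean) ** 2 for v in actual)
    return 1.0 - math.sqrt(err / spread)


# Synthetic flow intensities that follow a known relationship exactly.
x = [10 + 5 * math.sin(0.3 * t) + 3 * math.cos(0.11 * t) for t in range(200)]
y = [0.0]
for t in range(1, 200):
    y.append(0.5 * y[t - 1] + 2.0 * x[t] + 1.0)

theta = fit_arx(x, y)
score = fitness(x, y, theta)
```

On data generated from the model itself the fit recovers the coefficients and the fitness score is close to 1; real monitoring data would score lower, which is what a fitness threshold would screen for.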

  10. Value Propagation with Invariants • Extract invariants, e.g. y = f(x), z = g(y), u = h(x), v = s(u). • With the ARX model: set a value at x and take the converged output, y = f(x). • Multiple hops: z = g(f(x)) and v = s(h(x)).
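Multi-hop propagation amounts to composing invariants along paths in the invariant network. The sketch below assumes each invariant has already been reduced to a plain steady-state function of its input; the four linear functions and the `propagate` helper are made up for illustration and are not taken from the slides.

```python
from collections import deque


def propagate(invariants, source, value):
    """Propagate a value from `source` to every node reachable via invariants.

    invariants: dict mapping (input_name, output_name) -> steady-state function.
    """
    values = {source: value}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for (src, dst), fn in invariants.items():
            if src == node and dst not in values:
                values[dst] = fn(values[node])
                queue.append(dst)
    return values


# Hypothetical invariants matching the slide's topology: x -> y -> z and x -> u -> v.
invariants = {
    ("x", "y"): lambda x: 2 * x + 1,   # y = f(x)
    ("y", "z"): lambda y: 0.5 * y,     # z = g(y)
    ("x", "u"): lambda x: x + 3,       # u = h(x)
    ("u", "v"): lambda u: 3 * u,       # v = s(u)
}

values = propagate(invariants, "x", 10)
```

Breadth-first traversal reaches each node by the fewest hops, so every propagated value passes through as few invariants as possible.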

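  11. Rules and Fault Model • Rule: predicate and action. • Fault model for each rule: the probability of fault occurrence (from 0 to 1) plotted against the measurement value x, with the rule's threshold at xT. • Ideal model: the probability steps from 0 to 1 at xT. • Realistic model: the probability rises gradually, so a single threshold produces both false positives and false negatives.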

  12. Probability of Reporting a True Positive Alert • A very small false positive (FP) rate leads to a large number of false positive reports. • Ex. One measurement is checked every minute and its FP rate is 0.1% => 60 x 24 x 365 x 0.1% ≈ 526 FP reports per year. What if there are thousands of measurements? • Ex. Real operation support system: 80% of reports are FPs. • Importance of an alert: the Probability of Reporting a True Positive (PRTP) generated by value x.
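The false-positive arithmetic on this slide is easy to reproduce and to scale to more measurements. The helper name and its parameters below are illustrative only.

```python
def expected_false_positives(fp_rate, checks_per_day=60 * 24, days=365, measurements=1):
    """Expected number of false-positive reports over the given period."""
    return fp_rate * checks_per_day * days * measurements


# One measurement checked every minute with a 0.1% FP rate.
per_year = round(expected_false_positives(0.001))  # 526, as on the slide

# The same rate across a thousand measurements.
per_year_large = expected_false_positives(0.001, measurements=1000)
```

With a thousand measurements the same 0.1% rate yields on the order of half a million false reports a year, which is the slide's argument for ranking alerts rather than treating them all equally.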

  13. Local Context Mapping to Global Context • Components: Web, AP, DB. Rules set in local contexts have different semantics; the invariants map them into one global context. • Invariants: CPU%Web = fa(Network@AP); CPU%Web = fb(CPU%@DB); CPU%Web = fc(DiskUsg%@Web). • Fault model of CPU%Web (PRTP from 0 to 1 against x): Alert 1's own threshold is xT, and the other thresholds map to xNetwork@AP = fa(Network@AP), xCPU@DB = fb(CPU%@DB), xDiskUsg@WEB = fc(DiskUsg@WEB). • Ranking: Prob(true|xCPU@DB) > Prob(true|xT) > Prob(true|xDiskUsg@Web) > Prob(true|xNetwork@AP), i.e. Alert 3 (CPU% @DB > 60%) > Alert 1 (CPU% @Web > 70%) > Alert 2 (DiskUsg@Web > 150) > Alert 4 (Network@AP > 35k).

  14. Local Context Mapping to Global Context • Components: Web, AP, DB. • Fault model of Network%AP (PRTP from 0 to 1 against x): Alert 4's own threshold is xT, and the other thresholds map to xCPU@DB, xCPU@WEB, xDiskUsg@WEB. • Ranking: Prob(true|xCPU@DB) > Prob(true|xCPU@WEB) > Prob(true|xDiskUsg@Web) > Prob(true|xT), i.e. Alert 3 (CPU% @DB > 60%) > Alert 1 (CPU% @Web > 70%) > Alert 2 (DiskUsg@Web > 150) > Alert 4 (Network@AP > 35k). • Alert ranking: no change.

  15. Alerts Ranking Process • Online, at the time alerts are received: 4. Rank the real alerts.

  16. Ranking Alerts (Case I) • Case I: receive ONLY ALERTS, no monitoring data from components. • Alert rules are sorted using the system invariants network together with operators’ knowledge and configuration. Sorted alert rules: Alert 3, Alert 7, Alert 2, Alert 6, Alert 1, Alert 9, Alert 5, Alert 4, Alert 8. • 5 alerts generated: Alerts 3, 7, 2, 1, 5. • Alerts ranking: 1. Alert 3; 2. Alert 7; 3. Alert 2; 4. Alert 1; 5. Alert 5.
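In Case I the online step reduces to filtering the pre-sorted rule list down to the alerts that actually fired. The sketch below uses the alert numbers from this slide; the reading of which five alerts were generated is inferred from the slide layout, and the function name is hypothetical.

```python
def rank_received_alerts(sorted_rules, received):
    """Keep the offline rule order, restricted to alerts that were received."""
    received = set(received)
    return [rule for rule in sorted_rules if rule in received]


# Rule order produced offline (most important first), as listed on the slide.
sorted_rules = [3, 7, 2, 6, 1, 9, 5, 4, 8]

# The five alerts assumed to have fired, in arbitrary arrival order.
ranking = rank_received_alerts(sorted_rules, [1, 5, 2, 7, 3])
```

No monitoring data is needed at ranking time because all of the invariant-based mapping was done when the rules were sorted.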

  17. Ranking Alerts (Case II) • Case II: receive both alerts and monitoring data from components. • Number of Threshold Violations (NTV): the number of thresholds in a measurement's fault model that the observed value exceeds. • Fault model of CPU%Web: thresholds xT, xNetwork@AP = fa(Network@AP), xCPU@DB = fb(CPU%@DB), xDiskUsg@WEB = fc(DiskUsg@WEB); the observed value X(CPU%Web) gives NTV = 3. • Fault model of Network%AP: thresholds xT, xCPU@DB, xCPU@WEB, xDiskUsg@WEB; the observed value X(Network%AP) gives NTV = 2. • The alert from CPU%Web is more important than the one from Network%AP.

  18. Index • Introduction • Motivation & Goal • System Invariants • Invariants extraction • Value propagation • Collaborative peer review mechanism • Rules & Fault model • Ranking alerts • Experiment result • Conclusion

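  19. Experimental system • Flow intensities (symbols not preserved in this transcript): the number of EJBs created at time t; the JVM processing time at time t; the number of SQL queries at time t. • Invariant examples: [equations relating measurement points A, B, C, D; not preserved in this transcript]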

  20. Extracted Invariants Network • [Figure: invariant network linking measurements m1, m2, m3, m4, m5, m6]

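  21. Thresholds of Measurements • Each rule's local threshold T, and the same threshold mapped into the context of m1 with its resulting rank:

     | Threshold | Local value | Value in m1 context | Rank |
     |---|---|---|---|
     | T(m1) | 70 | 70 | 4 |
     | T(m2) | 30000 | 63.6 | 5 |
     | T(m3) | 80 | 70.2 | 3 |
     | T(m4) | 30000 | 70.5 | 2 |
     | T(m5) | 70 | 77.0 | 1 |
     | T(m6) | 20000 | 59.8 | 6 |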
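Sorting the rules then amounts to ordering their thresholds once they are expressed in a single context: a larger mapped threshold sits further along the fault model and so has a higher PRTP. The sketch below uses the m1-context values from this slide; which value belongs to which of m1..m6 is inferred from the slide layout and should be treated as an assumption, as should the function name.

```python
def sort_rules(mapped_thresholds):
    """Order rules from most to least important by mapped threshold value."""
    return sorted(mapped_thresholds, key=mapped_thresholds.get, reverse=True)


# Thresholds of the six rules expressed in the context of m1 (assumed assignment).
mapped_into_m1 = {
    "m1": 70.0,
    "m2": 63.6,
    "m3": 70.2,
    "m4": 70.5,
    "m5": 77.0,
    "m6": 59.8,
}

order = sort_rules(mapped_into_m1)
ranks = {rule: position for position, rule in enumerate(order, start=1)}
```

Under this assignment the computed ranks agree with the rank numbers shown on the slide (1 for the 77.0 entry down to 6 for the 59.8 entry).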

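  22. Thresholds of Measurements • Each row gives one rule's local threshold T mapped into the context of every measurement; the diagonal holds the local thresholds.

     | Threshold | m1 | m2 | m3 | m4 | m5 | m6 | Rank |
     |---|---|---|---|---|---|---|---|
     | T(m1) | 70 | 32726 | 78.0 | 29540 | 62.8 | 23208 | 4 |
     | T(m2) | 63.6 | 30000 | 71.4 | 27018 | 57.4 | 21200 | 5 |
     | T(m3) | 70.2 | 33006 | 80 | 29646 | 63.0 | 23291 | 3 |
     | T(m4) | 70.5 | 33212 | 81.0 | 30000 | 63.7 | 23509 | 2 |
     | T(m5) | 77.0 | 36316 | 86.4 | 32613 | 70 | 25688 | 1 |
     | T(m6) | 59.8 | 28207 | 66.9 | 25469 | 54.1 | 20000 | 6 |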

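  23. Ranking Alerts with NTVs (1) • Mapped thresholds, observed values, and the resulting NTVs and alert ranks:

     | | m1 | m2 | m3 | m4 | m5 | m6 |
     |---|---|---|---|---|---|---|
     | T(m1) | 70 | 32726 | 78.0 | 29540 | 62.8 | 23208 |
     | T(m2) | 63.6 | 30000 | 71.4 | 27018 | 57.4 | 21200 |
     | T(m3) | 70.2 | 33006 | 80 | 29646 | 63.0 | 23291 |
     | T(m4) | 70.5 | 33212 | 81.0 | 30000 | 63.7 | 23509 |
     | T(m5) | 77.0 | 36316 | 86.4 | 32613 | 70 | 25688 |
     | T(m6) | 59.8 | 28207 | 66.9 | 25469 | 54.1 | 20000 |
     | Observed value | 73.6 | 34319 | 81.6 | 30621 | 71.4 | 22620 |
     | NTVs | 5 | 5 | 5 | 5 | 6 | 2 |
     | Rank | 2 | 2 | 2 | 2 | 1 | 6 |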
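The NTV ranking can be reproduced from the numbers on this slide. The sketch below assumes that each column of mapped thresholds belongs to the measurement whose local threshold appears on the diagonal, that an alert fires only when a measurement exceeds its own local threshold, and that ties share a rank. Those readings, and the function names, are inferred from the slide rather than stated on it.

```python
CONTEXTS = ["m1", "m2", "m3", "m4", "m5", "m6"]

# ROWS[rule]: that rule's threshold expressed in the context of m1..m6.
# Values are from the slide; the column assignment is an assumption.
ROWS = {
    "m1": [70, 32726, 78.0, 29540, 62.8, 23208],
    "m2": [63.6, 30000, 71.4, 27018, 57.4, 21200],
    "m3": [70.2, 33006, 80, 29646, 63.0, 23291],
    "m4": [70.5, 33212, 81.0, 30000, 63.7, 23509],
    "m5": [77.0, 36316, 86.4, 32613, 70, 25688],
    "m6": [59.8, 28207, 66.9, 25469, 54.1, 20000],
}
MAPPED = {rule: dict(zip(CONTEXTS, row)) for rule, row in ROWS.items()}


def ntv(context, value):
    """Number of mapped thresholds in `context` that `value` exceeds."""
    return sum(value > MAPPED[rule][context] for rule in MAPPED)


def rank_by_ntv(observed):
    """Map each alerting measurement to (NTV, rank); ties share a rank."""
    alerts = {m: ntv(m, v) for m, v in observed.items() if v > MAPPED[m][m]}
    return {
        m: (n, 1 + sum(other > n for other in alerts.values()))
        for m, n in alerts.items()
    }


# Observed values from the slide, under the same column assumption.
case_1 = {"m1": 73.6, "m2": 34319, "m3": 81.6, "m4": 30621, "m5": 71.4, "m6": 22620}
result = rank_by_ntv(case_1)
```

Under these assumptions the computed NTVs (one 6, four 5s, one 2) and ranks (1, a four-way tie at 2, and 6) match the values printed on the slide.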

  24. Ranking Alerts with NTVs (1)

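  25. Ranking Alerts with NTVs (2) • Mapped thresholds, observed values, and the resulting NTVs and alert ranks ("-" means no alert was generated):

     | | m1 | m2 | m3 | m4 | m5 | m6 |
     |---|---|---|---|---|---|---|
     | T(m1) | 70 | 32726 | 78.0 | 29540 | 62.8 | 23208 |
     | T(m2) | 63.6 | 30000 | 71.4 | 27018 | 57.4 | 21200 |
     | T(m3) | 70.2 | 33006 | 80 | 29646 | 63.0 | 23291 |
     | T(m4) | 70.5 | 33212 | 81.0 | 30000 | 63.7 | 23509 |
     | T(m5) | 77.0 | 36316 | 86.4 | 32613 | 70 | 25688 |
     | T(m6) | 59.8 | 28207 | 66.9 | 25469 | 54.1 | 20000 |
     | Observed value | 73.5 | 31478 | 54.6 | 22712 | 46.1 | 18564 |
     | NTVs | 5 | 2 | - | - | - | - |
     | Rank | 1 | 2 | - | - | - | - |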

  26. Ranking Alerts with NTVs (2) Inject a problem (SCP copy) to Web server

  27. Conclusion • We introduce a peer review mechanism to rank alerts from heterogeneous components • By mapping local thresholds of various rules into their equivalent values in a global context • Based on system invariants network model • To support operators’ consultation for prioritization of problem determination.

  28. Thank You! • Questions?
