Ranking the Importance of Alerts for Problem Determination in Large Computer Systems Guofei Jiang, Haifeng Chen, Kenji Yoshihira, Akhilesh Saxena NEC Laboratories America, Princeton
Outline • Introduction • Motivation & Goal • System Invariants • Invariant extraction • Value propagation • Collaborative peer-review mechanism • Rules & Fault model • Ranking alerts • Experimental results • Conclusion
Motivation • Large & complex systems are deployed by integrating many heterogeneous components: • servers, routers, storage & software from multiple vendors • with hidden dependencies among them. • Components emit log/performance data, and operators set many rules to check these data and trigger alerts, e.g., CPU% @Web > 70%. • Rule setting is independent & isolated, relying on each operator's own system knowledge.
Goal • Which alerts should we analyze first? • Alert 1: CPU% @Web > 70% • Alert 2: DiskUsg @Web > 150 • Alert 3: CPU% @DB > 60% • Alert 4: Network @AP > 35k • We introduce a "peer-review" mechanism to rank the importance of alerts (e.g., Alert 3 > Alert 4 > Alert 1 > Alert 2) so that operators can prioritize the problem-determination process. • Alerts that get more consensus from their peers rank higher, blending system-management knowledge from multiple operators.
Alerts Ranking Process • Offline: • 1. Extract invariants from the large system's monitoring data — fully automated, using the invariants model [ICAC 2006] [TDSC 2006] [TKDE 2007] [DSN 2006]. • 2. Define alert rules — operators with domain knowledge supply the domain information (Alert 1: CPU% @Web > 70%; Alert 2: DiskUsg @Web > 150; Alert 3: CPU% @DB > 60%; Alert 4: Network @AP > 35k). • 3. Sort the alert rules, e.g., Alert 3 > Alert 4 > Alert 2 > Alert 1. • Online, at the time alerts are received: • 4. Rank the real alerts.
System Invariants • Flow intensity: the intensity with which internal monitoring data react to the volume of user requests. • User requests flow through the system endlessly, and many internal measurements m1, m2, …, mn, collected at various points over time t, react to the request volume accordingly. • We search for constant relationships among these internal measurements. • If a modeled relationship continues to hold all the time, it can be regarded as an invariant of the system. • (Diagram: user requests entering the target system, with measurements m1 … mn collected at internal points.)
Invariant Examples • We check implicit relationships between flow intensities, not their real values, which are always changing. Many relationships, however, are constant! • Example: x and y keep changing, but the equation y = f(x) stays constant. • Database server: packet volume V1 and SQL query number N1 yield the invariant V1 = f(N1). • Load balancer: input I1 and outputs O1, O2, O3 yield the invariant I1 = O1 + O2 + O3.
Automated Invariants Search • Monitor the target system and collect observation data over successive windows [t0–t1], [t1–t2], …, [tk–tk+1]. • Pick any two measurements i, j and learn a model f_ij from templates in a model library. • Sequential validation: with each new window of data, test whether f_ij still holds. If yes, f_ij remains an invariant candidate and its confidence score P_k grows; if no, f_ij is dropped as a variant. A sketch of this loop follows.
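A minimal sketch of the sequential-validation loop, assuming each candidate model exposes a hypothetical predict method and using a simple survival-ratio confidence score (illustrative only; the cited papers define the confidence score P_k more carefully):

```python
import numpy as np

def fitness(y_true, y_pred):
    """Normalized fitness score; 1.0 means a perfect fit."""
    err = np.linalg.norm(y_true - y_pred)
    base = np.linalg.norm(y_true - y_true.mean())
    return 1.0 - err / base

def sequential_validation(candidates, windows, min_fitness=0.9):
    """Keep only the pairwise models f_ij that hold on every new window.

    candidates: dict (i, j) -> model with a .predict(x) method (assumed API)
    windows:    list of dicts, measurement id -> np.ndarray for that window
    Returns the surviving invariants and a confidence score per pair.
    """
    confidence = {}
    for k, window in enumerate(windows, start=1):
        for (i, j), model in list(candidates.items()):
            if fitness(window[j], model.predict(window[i])) < min_fitness:
                del candidates[(i, j)]          # relationship broke: a variant
                confidence.pop((i, j), None)
            else:
                confidence[(i, j)] = k / len(windows)  # survived k windows so far
    return candidates, confidence
```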
One example in the model library • We use an AutoRegressive model with eXogenous inputs (ARX) to learn the relationship between two flow-intensity measurements x and y. • Define y(t) + a1·y(t−1) + … + an·y(t−n) = b0·x(t−k) + … + bm·x(t−k−m), with parameter vector θ = [a1, …, an, b0, …, bm]^T. • Given a sequence of real observations, we learn θ with least mean squares (LMS) by minimizing the prediction error Σt (y(t) − ŷ(t))². • A normalized fitness score, e.g., F(θ) = 1 − ‖y − ŷ‖ / ‖y − ȳ‖, evaluates how well the learned model fits the real data.
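A self-contained sketch of fitting this ARX model by ordinary least squares (the slides do not give an implementation; function and parameter names here are illustrative):

```python
import numpy as np

def fit_arx(x, y, n=2, m=2, k=0):
    """Least-squares fit of the ARX relationship
    y(t) + a1*y(t-1) + ... + an*y(t-n) = b0*x(t-k) + ... + bm*x(t-k-m)
    between two 1-D flow-intensity series x and y.
    Returns theta = [a1..an, b0..bm] and the normalized fitness score."""
    start = max(n, k + m)
    rows = [
        [-y[t - i] for i in range(1, n + 1)] +
        [x[t - k - j] for j in range(m + 1)]
        for t in range(start, len(y))
    ]
    Phi, target = np.asarray(rows), np.asarray(y[start:], dtype=float)
    theta, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    y_hat = Phi @ theta
    fitness = 1.0 - (np.linalg.norm(target - y_hat)
                     / np.linalg.norm(target - target.mean()))
    return theta, fitness
```

Pairs whose fitness score stays above a chosen bar (e.g., 0.9) across validation windows become invariant candidates.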
Value Propagation with Invariants • Once the extracted ARX model set has converged, values propagate across multiple hops by composing invariants. • Example: from y = f(x) and z = g(y) we get z = g(f(x)); likewise, from u = h(x) and v = s(u) we get v = s(h(x)).
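A minimal sketch of multi-hop propagation by function composition; the invariants f and g below are invented placeholders, not models from the paper:

```python
def propagate(value, chain):
    """Propagate a measurement value along a chain of invariants,
    e.g. chain = [f, g] realizes z = g(f(x))."""
    for invariant in chain:
        value = invariant(value)
    return value

# Hypothetical invariants (coefficients invented for illustration):
f = lambda x: 0.8 * x + 5.0   # y = f(x)
g = lambda y: 1.2 * y - 3.0   # z = g(y)
print(propagate(70.0, [f, g]))  # z = g(f(70.0)) = 70.2
```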
Rules and Fault Model • Each rule is a predicate–action pair: when the predicate on a measurement value x is violated, the action triggers an alert. • Fault model for each rule: the probability of fault occurrence as a function of x. • The ideal model is a step function that jumps from 0 to 1 at the threshold xT; a realistic model rises smoothly around xT, so alerts fired above xT can be false positives and faults below xT can be false negatives.
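The slides only draw the realistic curve; a sigmoid is one plausible functional form for it, sketched here as an assumption:

```python
import math

def fault_probability(x, x_t, spread=1.0):
    """A sigmoid-shaped 'realistic' fault model: the probability of fault
    occurrence given measurement value x. As spread -> 0 it approaches the
    ideal step function at the threshold x_t. (Assumed form; the slides
    draw the curve without giving its equation.)"""
    return 1.0 / (1.0 + math.exp(-(x - x_t) / spread))
```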
Probability of Reporting a True Positive • Even a very small false-positive rate leads to a large number of false-positive reports. • Example: one measurement checked every minute with an FP rate of 0.1% produces 60 × 24 × 365 × 0.1% ≈ 526 FP reports in a year — and what if there are thousands of measurements? • Example: in a real operation-support system, 80% of reports are FPs. • Importance of an alert: the Probability of Reporting a True Positive (PRTP) for an alert generated by value x.
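The back-of-the-envelope arithmetic, spelled out:

```python
checks_per_year = 60 * 24 * 365   # one check per minute
fp_rate = 0.001                   # 0.1% false-positive rate
print(checks_per_year * fp_rate)  # 525.6 -> ~526 FP reports per measurement per year
```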
Local Context Mapping to Global Context • The rules live on different tiers (Web, AP, DB) and have different semantics. Invariants map them into one global context, e.g., the fault model of CPU%@Web: • CPU%@Web = fa(Network@AP) • CPU%@Web = fb(CPU%@DB) • CPU%@Web = fc(DiskUsg@Web) • Mapping each rule's threshold into this fault model orders the PRTP values: Prob(true | x_CPU@DB) > Prob(true | xT) > Prob(true | x_DiskUsg@Web) > Prob(true | x_Network@AP). • Resulting ranking: Alert 3 > Alert 1 > Alert 2 > Alert 4.
Local Context Mapping to Global Context (cont.) • Choosing a different global context, e.g., the fault model of Network@AP, orders the PRTP values as Prob(true | x_CPU@DB) > Prob(true | x_CPU@Web) > Prob(true | x_DiskUsg@Web) > Prob(true | xT). • Alert ranking: no change — still Alert 3 > Alert 1 > Alert 2 > Alert 4, so the ranking does not depend on which global context is chosen. A sketch of this mapping follows.
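A minimal sketch of threshold mapping and PRTP ranking; the invariant coefficients and the sigmoid fault model below are invented for illustration and merely reproduce the ordering shown on the slide:

```python
import math

def rank_alerts_by_prtp(local_thresholds, to_global, fault_model):
    """Map each rule's local threshold into the global context via its
    invariant, evaluate PRTP there, and rank alerts by descending PRTP."""
    prtp = {alert: fault_model(to_global[alert](x))
            for alert, x in local_thresholds.items()}
    return sorted(prtp, key=prtp.get, reverse=True)

# Hypothetical invariants into the CPU%@Web context (coefficients invented):
fa = lambda v: v / 600.0      # Network@AP  -> CPU%@Web
fb = lambda v: 1.25 * v       # CPU%@DB     -> CPU%@Web
fc = lambda v: 0.45 * v       # DiskUsg@Web -> CPU%@Web
same = lambda v: v            # Alert 1 is already in the global context

sigmoid = lambda v: 1.0 / (1.0 + math.exp(-(v - 70.0) / 2.0))  # assumed PRTP curve

print(rank_alerts_by_prtp(
    {"Alert 1": 70.0, "Alert 2": 150.0, "Alert 3": 60.0, "Alert 4": 35000.0},
    {"Alert 1": same, "Alert 2": fc, "Alert 3": fb, "Alert 4": fa},
    sigmoid,
))  # ['Alert 3', 'Alert 1', 'Alert 2', 'Alert 4']
```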
Alerts Ranking Process (recap) • Online, at the time alerts are received, step 4 ranks the real alerts.
Ranking Alerts (Case I) • Case I: only alerts are received — no monitoring data from components. • Offline, the nine alert rules are sorted with the system-invariants network plus the operator's knowledge & configuration: Alert 3 > Alert 7 > Alert 2 > Alert 6 > Alert 1 > Alert 9 > Alert 5 > Alert 4 > Alert 8. • When five alerts are generated (3, 7, 2, 1, 5), they are ranked by that pre-sorted order: 1. Alert 3, 2. Alert 7, 3. Alert 2, 4. Alert 1, 5. Alert 5. A sketch follows.
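A minimal sketch of Case I ranking — filter the offline-sorted rule list down to the alerts that actually fired (rule IDs taken from the slide's example):

```python
def rank_alerts_case1(fired_alerts, sorted_rules):
    """Case I: no monitoring data, only alert IDs. Rank the fired alerts
    by the offline ordering of their rules."""
    fired = set(fired_alerts)
    return [rule for rule in sorted_rules if rule in fired]

# Rule order and fired alerts from the slide's example:
sorted_rules = [3, 7, 2, 6, 1, 9, 5, 4, 8]
print(rank_alerts_case1({1, 2, 3, 5, 7}, sorted_rules))  # [3, 7, 2, 1, 5]
```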
Ranking Alerts (Case II) • Case II: both alerts and monitoring data are received from components. • Using the invariants, each observed value is compared against all equivalent thresholds mapped into its measurement's scale, and its Number of Threshold Violations (NTV) is counted. • Example: the observed X(CPU%@Web) exceeds 3 mapped thresholds (NTV = 3) while the observed X(Network@AP) exceeds 2 (NTV = 2), so the alert from CPU%@Web is more important than the one from Network@AP. A sketch follows.
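A minimal sketch of Case II ranking by NTV; the observed values and mapped thresholds below are hypothetical, chosen only to echo the slide's NTV = 3 vs. NTV = 2 example:

```python
def ntv(observed, equivalent_thresholds):
    """Number of Threshold Violations: count how many rule thresholds
    (mapped into this measurement's scale via invariants) the observed
    value exceeds."""
    return sum(observed > t for t in equivalent_thresholds)

def rank_alerts_case2(observed_by_alert, thresholds_by_alert):
    """Case II: rank fired alerts by descending NTV."""
    scores = {a: ntv(observed_by_alert[a], thresholds_by_alert[a])
              for a in observed_by_alert}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical numbers: CPU%@Web violates 3 mapped thresholds,
# Network@AP violates 2, so CPU%@Web ranks first.
print(rank_alerts_case2(
    {"CPU%@Web": 81.6, "Network@AP": 30621.0},
    {"CPU%@Web": [70.0, 77.0, 80.0, 86.4],
     "Network@AP": [29540.0, 30000.0, 32726.0]},
))  # ['CPU%@Web', 'Network@AP']
```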
Index • Introduction • Motivation & Goal • System Invariants • Invariant extraction • Value propagation • Collaborative peer-review mechanism • Rules & Fault model • Ranking alerts • Experimental results • Conclusion
Experimental system • Flow intensities measured on the testbed: the number of EJBs created at time t; the JVM processing time at time t; the number of SQL queries at time t. • Invariant examples relate these measurements at points A, B, C and D. • (System diagram and example-invariant figures omitted.)
Extracted Invariants • (Figure omitted: the extracted invariant network connecting measurements m1–m6.)
Thresholds of Measurements • Each of rules 1–6 sets a local threshold on its own measurement m1–m6; the local thresholds include values such as 70, 63.6, 70.2, 70.5, 77.0, 59.8, 30000 and 20000. • (Threshold table omitted.)
Thresholds of Measurements (mapped) • Using the invariant network, each rule's local threshold is propagated into equivalent thresholds for all six measurements, filling a 6 × 6 threshold matrix. • (Table omitted.)
Ranking Alerts with NTVs (1) • At runtime, the observed value of each measurement (e.g., 81.6, 30621, 71.4, 22620, 73.6, 34319) is compared against its column of equivalent thresholds, and each fired rule's NTV is counted (values such as 5, 5, 6 and 2 in this run); alerts are ranked by descending NTV. • (Table omitted.)
Ranking Alerts with NTVs (2) • A second run with observed values 54.6, 22712, 46.1, 18564, 73.5, 31478 produces different NTVs and hence a different ranking. • (Table omitted.)
Ranking Alerts with NTVs (3) • Fault injection: a real problem (an SCP file copy) is injected into the Web server, and the resulting alerts are ranked as above.
Conclusion • We introduce a peer-review mechanism to rank alerts from heterogeneous components: • it maps the local thresholds of various rules into their equivalent values in a global context, • builds on a network model of system invariants, • and supports operators in prioritizing problem determination.
Thank You! • Questions?