G-RCA: A Generic Root Cause Analysis Platform for Service Quality Management in Large IP Networks He Yan, Lee Breslau, Zihui Ge, Dan Massey, Dan Pei, Jennifer Yates
Abstract • Best effort networks --> QoS • Manage end-to-end service quality as a whole • Generic Root Cause Analysis (G-RCA) • Service Quality Management (SQM) • FCAPS
Introduction • Finding the root cause of errors • Including transient errors • Gathers information for network operators • Supports Service Quality Management (SQM) for ISPs.
G-RCA Architecture • Consists of five main components. • G-RCA determines where and when to look for diagnostic events. • Used to: • Troubleshoot ongoing network problems • Investigate past behavior.
Data Collection and Management • Proactively collects data from network, such as alarms, logs and performance measurements. • Uses a data collector and database to store data • “Events” • event-name, location type, retrieval process and information
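The four fields of an event definition listed above can be pictured as a small record type. This is only a minimal sketch: the class names, the in-memory store, and the `fetch_link_down` helper are hypothetical and are not taken from G-RCA itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EventInstance:
    """One occurrence of an event (hypothetical shape)."""
    name: str
    location: str
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class EventDefinition:
    """The four fields named on the slide."""
    name: str                # event-name
    location_type: str       # e.g. "router:interface"
    retrieve: Callable[[float, float], List[EventInstance]]  # retrieval process
    information: str         # free-text description


# Hypothetical in-memory store standing in for the collector database.
_STORE = [
    EventInstance("link-down", "r1:if0", 100.0, 160.0),
    EventInstance("link-down", "r2:if3", 900.0, 930.0),
]


def fetch_link_down(t0: float, t1: float) -> List[EventInstance]:
    """Return stored instances whose interval intersects [t0, t1]."""
    return [e for e in _STORE if e.start <= t1 and e.end >= t0]


link_down = EventDefinition(
    name="link-down",
    location_type="router:interface",
    retrieve=fetch_link_down,
    information="Interface went down, as reported by a log message",
)
```

Calling `link_down.retrieve(0, 200)` returns only the instance on `r1:if0`, since the other one lies outside the queried interval.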
Service Dependency Model • Figure 2 used to include network elements associated with a problem • Hard to realize theory • Traffic sampling data • Snapshots of router configs
Spatial-Temporal Correlation (1) • How do we relate what has happened to the service problem? • G-RCA defines a temporal joining rule and a spatial joining rule • Temporal Joining Rule • Defines a time window that allows a symptom event and a diagnostic event to be joined. • 6 parameters: 3 for the symptom event and 3 for the diagnostic event • Left expansion margin • Right expansion margin • Expanding option (Start/End, Start/Start or End/End)
Spatial-Temporal Correlation (2) • A symptom event and a diagnostic event are joined when their expanded time windows overlap.
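A minimal sketch of the temporal joining rule follows. It assumes one reading of the expanding option: the first word names the timestamp that the left margin is subtracted from, and the second names the timestamp that the right margin is added to. The names `Expansion`, `expand`, and `temporally_joined` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Expansion:
    left: float   # left expansion margin, seconds
    right: float  # right expansion margin, seconds
    option: str   # "start/end", "start/start" or "end/end"


def expand(start: float, end: float, rule: Expansion) -> Tuple[float, float]:
    """Expand an event's [start, end] interval into its joining window."""
    if rule.option == "start/end":
        lo, hi = start, end
    elif rule.option == "start/start":
        lo, hi = start, start
    elif rule.option == "end/end":
        lo, hi = end, end
    else:
        raise ValueError(f"unknown expanding option: {rule.option}")
    return lo - rule.left, hi + rule.right


def temporally_joined(symptom: Tuple[float, float],
                      diagnostic: Tuple[float, float],
                      symptom_rule: Expansion,
                      diagnostic_rule: Expansion) -> bool:
    """True when the two expanded windows overlap."""
    s_lo, s_hi = expand(symptom[0], symptom[1], symptom_rule)
    d_lo, d_hi = expand(diagnostic[0], diagnostic[1], diagnostic_rule)
    return s_lo <= d_hi and d_lo <= s_hi
```

For example, a symptom at (1000, 1010) with `Expansion(180, 5, "start/start")` gets the window [820, 1005], so a diagnostic event at (900, 905) with `Expansion(5, 5, "start/end")` is joined, while one at (1100, 1110) is not.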
Spatial-Temporal Correlation (3) • Spatial Joining Rule • Symptom event location type • Diagnostic event location type • Joining level: links symptom locations and diagnostic event locations together • Diagnostic signatures are modeled using a diagnosis graph • A symptom event and diagnostic event pair is called a diagnosis rule • G-RCA evaluates the time and location conditions against the collected data • Determines whether the diagnostic signature is present
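One way to picture the spatial joining rule is to project both locations onto the joining level and check whether the projections share an element. The sketch below assumes a tiny hand-written lookup table in place of the service dependency model; `TOPOLOGY`, `project`, and `spatially_joined` are hypothetical names.

```python
from typing import Dict, Set, Tuple

Location = Tuple[str, str]  # (location type, location)

# Hypothetical stand-in for the service dependency model: for each location,
# the set of elements it maps to at a given joining level.
TOPOLOGY: Dict[Location, Dict[str, Set[str]]] = {
    ("interface", "r1:if0"): {"router": {"r1"}},
    ("path", "r1->r3"): {"router": {"r1", "r2", "r3"}},
    ("router", "r4"): {"router": {"r4"}},
}


def project(location: Location, joining_level: str) -> Set[str]:
    """Map a location to the elements it touches at the joining level."""
    return TOPOLOGY.get(location, {}).get(joining_level, set())


def spatially_joined(symptom: Location, diagnostic: Location,
                     joining_level: str) -> bool:
    """True when both locations share an element at the joining level."""
    return bool(project(symptom, joining_level) & project(diagnostic, joining_level))
```

Here a symptom on the path `r1->r3` joins with a diagnostic event on interface `r1:if0` at the router level, because both project onto router `r1`. A diagnostic event on router `r4` does not join.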
Reasoning Logic: Rule-Based Reasoning Module • Priority value in the diagnosis graph • Assigned by the operator • A higher value means more confidence that the diagnostic event is the real root cause • Can be examined in G-RCA’s Result Browser • How does rule-based reasoning work?
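As a rough illustration of priority-based selection: among the diagnostic events that joined with a symptom, report the one whose diagnosis rule has the highest priority. This sketch assumes a flat priority table and ignores the structure of the diagnosis graph. The event names, the priority values, and the "unknown" fallback are hypothetical.

```python
from typing import Dict, Iterable


def rule_based_root_cause(joined_events: Iterable[str],
                          priorities: Dict[str, int],
                          default: str = "unknown") -> str:
    """Pick the joined diagnostic event with the highest operator priority."""
    candidates = [e for e in joined_events if e in priorities]
    if not candidates:
        return default  # no diagnosis rule matched this symptom
    return max(candidates, key=lambda e: priorities[e])


# Hypothetical operator-assigned priorities (higher = more confidence).
PRIORITIES = {"router-reboot": 90, "interface-flap": 60, "cpu-spike": 30}

print(rule_based_root_cause(["cpu-spike", "interface-flap"], PRIORITIES))
```

With these priorities, a symptom joined with both `cpu-spike` and `interface-flap` is attributed to `interface-flap`.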
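Bayesian Inference • Potential root causes are the classes: a set of $r$ • The presence or absence of the diagnostic evidence and of the symptom events themselves are the features: $e = (e_1, \dots, e_n)$ • Determining the root cause means identifying the $r$ that produces the maximum likelihood ratio: $\arg\max_r \frac{P(r \mid e)}{P(\bar{r} \mid e)} = \arg\max_r \frac{P(r)}{P(\bar{r})} \cdot \frac{P(e \mid r)}{P(e \mid \bar{r})}$, where $\frac{P(r)}{P(\bar{r})}$ is the first term and $\frac{P(e \mid r)}{P(e \mid \bar{r})}$ is the second term • When the features are conditionally independent, the second term can be decoupled to $\prod_{i=1}^{n} \frac{P(e_i \mid r)}{P(e_i \mid \bar{r})}$ • Parameter configuration (ratios of $\frac{P(r)}{P(\bar{r})}$ and $\frac{P(e_i \mid r)}{P(e_i \mid \bar{r})}$) • Bootstrap using the rule-based reasoning • Define a fuzzy type of discrete values • Low, Medium, and High, which correspond to the values 2, 100, and 20 000.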
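To make the scoring concrete, here is a minimal naive Bayes sketch that uses the Low/Medium/High values 2, 100, and 20 000 as per-feature likelihood ratios. It makes two simplifying assumptions: the prior ratio is passed as a plain number, and a feature that is absent contributes a ratio of 1. The candidate root causes and feature names are hypothetical.

```python
import math
from typing import Dict, Set, Tuple

# Fuzzy levels and their values, as given on the slide.
FUZZY = {"Low": 2.0, "Medium": 100.0, "High": 20000.0}

# For each candidate root cause: (prior ratio, {feature: fuzzy ratio level}).
Model = Dict[str, Tuple[float, Dict[str, str]]]


def log_ratio(prior_ratio: float, levels: Dict[str, str],
              observed: Set[str]) -> float:
    """Log of prior ratio times the ratios of the features that are present."""
    score = math.log(prior_ratio)
    for feature, level in levels.items():
        if feature in observed:  # assumption: absent features contribute 1
            score += math.log(FUZZY[level])
    return score


def most_likely_root_cause(model: Model, observed: Set[str]) -> str:
    """Return the candidate with the largest log likelihood ratio."""
    return max(model, key=lambda r: log_ratio(model[r][0], model[r][1], observed))


# Hypothetical configuration for illustration only.
MODEL: Model = {
    "link-congestion": (1.0, {"high-utilization": "High"}),
    "cpu-overload": (1.0, {"cpu-spike": "Medium", "high-utilization": "Low"}),
}

print(most_likely_root_cause(MODEL, {"high-utilization", "cpu-spike"}))
```

Working in log space keeps the product of ratios numerically stable and turns it into a sum. With both features observed, `link-congestion` scores log(20000) against log(200) for `cpu-overload`, so it is selected.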
Comparison • In operational practice, rule-based reasoning logic is often preferred over Bayesian inference • Easier to configure • Gives a simple and direct association between the diagnosed root cause and the evidence • Effective in most applications • However, there are a few cases where Bayesian inference is preferred • The root cause condition is unobservable
Domain Knowledge Building • Issue: The diagnosis graph that an operator specifies for an SQM application, especially the initial version, can be inaccurate and incomplete. • G-RCA addresses the incomplete diagnosis graph through iterative use of the Correlation Tester and Result Browser. • First, the operator filters out the symptom events with known root causes, using the root cause classification capability of the Result Browser. • Second, the operator focuses on the remaining symptom events by comparing them with other suspected diagnostic events that occur at the same time and are spatially related to the service problem.
Domain Knowledge Building • The second step can be done through the manual drill-down and data exploration capability of the Result Browser. • Alternatively, operators can run the Correlation Tester blindly between the symptom events without known root causes and each type of suspected diagnostic event. • As G-RCA emphasizes usability, newly uncovered diagnosis rules must be verified by operators before they are incorporated into the diagnosis graph.
Introduction of G-RCA Applications • The key advantage of G-RCA in SQM is that it can be rapidly customized into different RCA applications in the ISP’s network. • The following three case studies demonstrate the effectiveness of G-RCA: • 1) customer BGP flaps • 2) end-to-end throughput management in a CDN service • 3) network PIM flaps in multicast VPN
BGP Flaps Root Cause Analysis • Purpose: understanding the root cause of BGP flaps. • This is achieved with G-RCA by constructing application-specific events and rules. • Start by constructing the BGP flap-specific events. • Add a few application-specific diagnosis rules. • Specify priorities for the different diagnosis rules for BGP flaps RCA. (Please refer to the figure “Diagnosis graph for BGP flaps root cause analysis” shown in the previous slides.) • Application-specific events for BGP flaps root cause analysis
Conclusion 1. G-RCA captures the layered network model in its knowledge library and implements • temporal/spatial correlation, • rule-based reasoning, and • Bayesian inference. 2. Domain knowledge in an existing RCA application can be refined through the interaction between the RCA engine and the Correlation Tester. 3. To analyse a large number of service quality issues and to classify and trend their root causes, it proactively collects all types of data from different sources and normalizes them in real time.