Effective Diagnosis of Routing Disruptions from End Systems Ying Zhang Z. Morley Mao Ming Zhang
Routing disruptions impact application performance • More applications today have high QoS requirements • Routing events can cause high loss and long delays AS A AS B AS C AS D AS E Internet Dst Src
Existing approaches to diagnose routing disruptions are ISP-centric • Require routing data from many routers in ISPs [Feldmann04, Teixeira04, Wu05] • Passive and accurate BGP collectors AS A AS D AS C AS B Internet
Limitations of ISP-centric approaches • Difficult to gain access to data from many ISPs • BGP data reflects “expected” data-plane paths ISP ? ? ? End-systems AS A AS D AS C AS B ? ? ? ? Internet
Can we diagnose entirely from end systems? • Goal: infer data-plane paths of many routers Probing host AS C ISP A AS B AS D Dst
Our approach: end-system-based monitoring • Requires only probing from end hosts • Covers all the PoPs of a target ISP Probing host AS C Target ISP AS B AS D Dst
Our approach: end-system-based monitoring • Covers most of the destinations on the Internet Probing host Dst Dst AS C ISP A AS B AS D Dst Dst
Our approach: end-system-based monitoring • Identifies routing changes by comparing consecutively measured paths Probing host AS C ISP A AS B AS D Dst
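The comparison of consecutively measured paths can be sketched as follows. This is a minimal illustration, not the authors' implementation: the snapshot layout (destination mapped to a tuple of hops), the function name `detect_routing_events`, and the sample prefixes and PoP names are all assumptions.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]       # hop sequence, e.g. PoP names (assumed representation)
Snapshot = Dict[str, Path]   # destination -> path measured in one probing round


def detect_routing_events(prev: Snapshot, curr: Snapshot) -> List[Tuple[str, Path, Path]]:
    """Return (destination, old_path, new_path) for every destination whose path changed."""
    events = []
    for dst, new_path in curr.items():
        old_path = prev.get(dst)
        # Only destinations measured in both rounds can be compared.
        if old_path is not None and old_path != new_path:
            events.append((dst, old_path, new_path))
    return events


# Hypothetical data: two consecutive probing rounds toward two destinations.
prev = {"10.0.0.0/8": ("pop1", "pop2", "pop3"), "192.0.2.0/24": ("pop1", "pop4")}
curr = {"10.0.0.0/8": ("pop1", "pop2", "pop5"), "192.0.2.0/24": ("pop1", "pop4")}
events = detect_routing_events(prev, curr)
```

Here only the first destination produces an event, because its last hop differs between the two rounds.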
Advantages and challenges • Advantages: • No need for access to ISP-proprietary data • Identify actual data-plane paths • Monitor data-plane performance • Challenges: • Limited resources to probe • Coverage of probed paths • Timing granularity • Measurement noise
System architecture Collaborative probing Target ISP Event identification and classification Event correlation and inference Event impact analysis Target ISP Target ISP Reports
Outline • Collaborative probing • Event identification and classification • Event correlation and inference • Result and validation
Collaborative probing • Using a set of hosts • To learn the routing state • To improve coverage • To reduce overhead Probing host AS C ISP A AS B AS D
Outline • Collaborative probing • Event identification and classification • Event correlation and inference • Result and validation
Event classification • Classify events according to ingress/egress changes Type2: Ingress PoP same, egress PoP different Type1: Ingress PoP changes Type3: Ingress PoP same, egress PoP same Destination Prefix P Target ISP Probing host
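The three event types above reduce to two comparisons. The sketch below assumes each measured path has already been mapped to its ingress and egress PoP inside the target ISP; the `IspPath` type and the PoP names are hypothetical.

```python
from typing import NamedTuple


class IspPath(NamedTuple):
    ingress_pop: str   # PoP where the path enters the target ISP
    egress_pop: str    # PoP where the path leaves the target ISP


def classify_event(old: IspPath, new: IspPath) -> int:
    """Return 1, 2, or 3 following the slide's event types."""
    if old.ingress_pop != new.ingress_pop:
        return 1   # Type 1: ingress PoP changes
    if old.egress_pop != new.egress_pop:
        return 2   # Type 2: ingress PoP same, egress PoP different
    return 3       # Type 3: ingress and egress PoP both unchanged


# Hypothetical PoP names for illustration.
type_a = classify_event(IspPath("NYC", "LAX"), IspPath("CHI", "LAX"))
type_b = classify_event(IspPath("NYC", "LAX"), IspPath("NYC", "SEA"))
type_c = classify_event(IspPath("NYC", "LAX"), IspPath("NYC", "LAX"))
```

A Type 3 event still represents a path change; the change is simply somewhere other than the ingress or egress PoP.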
Outline • Collaborative probing • Event identification and classification • Event correlation and inference • Result and validation
Likely causes: link failures Neighbor AS Destination Prefix P Old egress PoP New egress PoP Old path New path Target ISP Probing host
Likely causes: internal distance changes • Hot potato changes • Cost of old internal path increases • Cost of new internal path decreases Neighbor AS Old egress PoP New egress PoP distance: 120 distance: 80 distance: 100 distance: 120 Probing host
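Hot-potato routing can be illustrated by picking the egress PoP with the smallest internal distance. The sketch below is an assumption-laden toy: the distance values loosely echo the numbers in the figure, but which value belongs to which path is a guess, and `pick_egress` is a hypothetical helper.

```python
from typing import Dict


def pick_egress(internal_distance: Dict[str, int]) -> str:
    """Hot-potato choice: the egress PoP with the lowest internal (IGP) distance."""
    return min(internal_distance, key=internal_distance.get)


# Assumed scenario: the cost of the old internal path increases from 80 to 120,
# while the path to the other egress stays at 100.
before = {"old_egress": 80, "new_egress": 100}
after = {"old_egress": 120, "new_egress": 100}

egress_before = pick_egress(before)
egress_after = pick_egress(after)
```

The egress moves from `old_egress` to `new_egress` even though nothing changed outside the ISP, which is what makes this a purely internal cause.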
Event correlation • Spatial correlation: a single network failure often affects multiple routers • Temporal correlation: routing events occurring close together are likely due to only a few causes
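Temporal correlation can be approximated by grouping events whose timestamps fall close together. This is only a sketch: the gap-based grouping rule and the 5-second window are assumptions, not parameters taken from the talk.

```python
from typing import List


def cluster_by_time(timestamps: List[float], window: float) -> List[List[float]]:
    """Group timestamps so that consecutive events in a cluster are at most `window` apart."""
    clusters: List[List[float]] = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= window:
            clusters[-1].append(t)   # close to the previous event: same cluster
        else:
            clusters.append([t])     # large gap: start a new cluster
    return clusters


# Hypothetical event times in seconds.
clusters = cluster_by_time([100.0, 103.0, 104.5, 300.0, 302.0], window=5.0)
```

Each resulting cluster is a candidate group of events to be explained by a small number of causes.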
Inference methodology • Evidence: an event that supports the cause Destination prefix P Link L Cause: Link L is down New egress New path Probing host Target ISP Probing host
Inference methodology • A conflict: a measurement trace that conflicts with the cause Destination prefix P Link L Cause: Link L is down New egress New path Probing host Target ISP Probing host
Inference methodology Evidence node [1,2,3]->[1,2,4] AS 3 AS 4 Withdrawal AS 2 Cause: node 3 withdraws the route AS 1 Cause: link 2-3 down
Inference methodology Evidence Graph Evidence node [1,2,3]->[1,2,4] Evidence node [0,2,3]->[0,2,4] AS 3 AS 4 Withdrawal AS 2 Cause: node 3 withdraws the route AS 1 AS 0 Cause: link 2-3 down
Inference methodology Conflict Graph AS 6 Conflict node [1,2,3,6] Conflict node [0,2,3,6] Conflict node [0,2,3] AS 3 AS 2 Cause: link 2-3 down Cause: node 3 withdraws the route AS 1 AS 0
Inference methodology Evidence Graph Conflict Graph • Greedy algorithm: minimum set of causes that can explain all the evidence while minimizing conflicts Conflict node [1,2,3,6] Conflict node [0,2,3,6] Conflict node [0,2,3] Evidence node [1,2,3]->[1,2,4] Evidence node [0,2,3]->[0,2,4] Evidence: 2 Conflicts: 3 Evidence: 2 Conflicts: 0
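The greedy selection over the two bipartite graphs can be sketched as below. The ranking rule (most unexplained evidence first, then fewest conflicts) is one plausible reading of the slide, not necessarily the authors' exact scoring, and the assignment of the three conflict nodes to "link 2-3 down" is likewise an interpretation of the figure.

```python
from typing import Dict, List, Set


def greedy_infer(evidence: Dict[str, Set[str]], conflicts: Dict[str, Set[str]]) -> List[str]:
    """Pick causes until all evidence nodes are explained.

    evidence[cause]  = evidence nodes (events) that support the cause
    conflicts[cause] = conflict nodes (traces) that contradict the cause
    """
    unexplained: Set[str] = set().union(*evidence.values())
    chosen: List[str] = []
    while unexplained:
        candidates = [c for c in sorted(evidence) if evidence[c] & unexplained]
        # Assumed ranking: explain the most remaining evidence, then minimize conflicts.
        best = max(
            candidates,
            key=lambda c: (len(evidence[c] & unexplained), -len(conflicts.get(c, set()))),
        )
        chosen.append(best)
        unexplained -= evidence[best]
    return chosen


events = {"[1,2,3]->[1,2,4]", "[0,2,3]->[0,2,4]"}
evidence = {"link 2-3 down": set(events), "node 3 withdraws the route": set(events)}
conflicts = {
    "link 2-3 down": {"[1,2,3,6]", "[0,2,3,6]", "[0,2,3]"},
    "node 3 withdraws the route": set(),
}
causes = greedy_infer(evidence, conflicts)
```

Both causes explain the same two events, so the conflict count breaks the tie in favor of the withdrawal.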
Outline • Collaborative probing • Event identification and classification • Event correlation and inference • Result and validation
Results of event classification • Many events are internal changes • Abilene has many ingress changes
Validation with BGP-based approach [Wu05] • Hot potato changes: egress point changes due to internal distance changes Number of incidents identified by both Number of incidents identified by our method Number of incidents identified by BGP method False negatives, false positives
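The comparison quantities named above are simple set operations. In this sketch the BGP-based method is treated as the reference, which is an assumption, and the incident identifiers are made up purely for illustration; they are not results from the talk.

```python
from typing import Dict, Set


def compare_incidents(ours: Set[str], bgp: Set[str]) -> Dict[str, int]:
    """Count overlaps between the two methods, using the BGP-based result as reference."""
    return {
        "both": len(ours & bgp),             # identified by both methods
        "ours_total": len(ours),             # identified by the end-system method
        "bgp_total": len(bgp),               # identified by the BGP-based method
        "false_positives": len(ours - bgp),  # reported by us, absent from BGP result
        "false_negatives": len(bgp - ours),  # in BGP result, missed by us
    }


# Hypothetical incident identifiers.
summary = compare_incidents({"i1", "i2", "i3"}, {"i2", "i3", "i4", "i5"})
```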
Validation with BGP-based approach • Session resets: peering link up/down • Reasons for inaccuracy: • Limited coverage • Coarse-grained probing • Measurement noise
System performance • Can keep up with generated routing state • Applicable for real-time diagnosis and mitigation • Reactive: construct alternate paths to bypass the problem • Proactive: avoid paths with many historical routing disruptions
Conclusion • Developed the first system to diagnose routing disruptions purely from end systems • Used a simple greedy algorithm on two bipartite graphs to infer causes • Comprehensively validated the accuracy
Thank you! Questions?
Performance impact analysis • End-to-end latency changes caused by different types of routing events
Validation with BGP data • BGP feeds from RouteViews, RIPE, Abilene, and 29 BGP feeds from a Tier-1 ISP • The destination prefix coverage and the routing event detection rate
Event classification: same ingress PoP, different egress PoP • Policy changes • Local preference in the old route decreases • Local preference in the new route increases Neighbor AS Local Pref : 60->110 Local Pref : 100->50 Old egress PoP New egress PoP Old path New path Target ISP Probing host
Event classification: same ingress PoP, different egress PoP • External routing changes • Old route worsens due to external factors (withdrawal, longer AS path) • New route improves due to external factors AS A AS B ABCD->ABEFD BCEFD->BEFD Old egress PoP New egress PoP Old path New path Target ISP Probing host
Event classification: same ingress PoP, same egress PoP • Internal PoP path changes • Cost of old internal path increases • Cost of new internal path decreases • External AS path changes Destination Prefix P New path Old path Target ISP Probing host
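Either policy change above is enough to move the egress, because BGP prefers the route with the highest local preference. The sketch below uses the values shown in the figure (100->50 on the old route, 60->110 on the new route); the `best_egress` helper and the dictionary layout are assumptions for illustration.

```python
from typing import Dict


def best_egress(local_pref: Dict[str, int]) -> str:
    """Return the egress whose route has the highest local preference."""
    return max(local_pref, key=local_pref.get)


initial = {"old_egress": 100, "new_egress": 60}
old_decreases = {"old_egress": 50, "new_egress": 60}     # old route: 100 -> 50
new_increases = {"old_egress": 100, "new_egress": 110}   # new route: 60 -> 110

egress_initial = best_egress(initial)
egress_after_decrease = best_egress(old_decreases)
egress_after_increase = best_egress(new_increases)
```

Both changes produce the same observable event from the probing host, which is why the inference step has to weigh them as alternative causes.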
Results of cause inference • Effectiveness of inference algorithm • Clusters: a group of events with the same root cause
Event identification • A routing event: path changes • Event identification: comparing consecutive routing snapshots Probing host AS C ISP A AS B AS D Dst