
Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network

Jian Wu (University of Michigan), Z. Morley Mao (University of Michigan), Jennifer Rexford (Princeton University), Jia Wang (AT&T Labs Research)


Presentation Transcript


  1. Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University of Michigan) Jennifer Rexford (Princeton University) Jia Wang (AT&T Labs Research)

  2. Motivation • A backbone network is vulnerable to routing changes that occur in other domains. [Figure: path from source to destination across AS1, AS2, AS3, and AS4 through the backbone's border routers (BR); labels: Failure, Mitigation, Congestion, Disruption]

  3. Goal • Identify important routing anomalies: lost reachability, persistent flapping, large traffic shifts • Contributions: • Build a tool that identifies a small number of important routing disruptions from a large volume of raw BGP updates in real time • Use the tool to characterize routing disruptions in an operational network

  4. Interdomain Routing: Border Gateway Protocol • Prefix-based: one route per prefix • Path-vector: list of ASes in the path • Incremental: every update indicates a change • Policy-based: local ranking of routes [Figure: AS 1, AS 2, and AS 3 exchange announcements over eBGP and iBGP (“I can reach 12.34.158.0/24”, “I can reach 12.34.158.0/24 via AS 1”); data traffic flows toward 12.34.158.5 in 12.34.158.0/24]

  5. Capturing Routing Changes • A large operational network (8/16/2004 – 10/10/2004) [Figure: border routers receive updates over eBGP and send their best routes over iBGP to a BGP monitor; labels: CPE, eBGP, iBGP]

  6. Challenges • Large volume of BGP updates • Millions daily, very bursty • Too much for an operator to manage • Different from root-cause analysis • Identify changes and their effects • Focus on actionable events rather than diagnosis • Diagnose causes in/near the AS

  7. System Architecture: from millions of updates to a few dozen reports • BGP Updates (10^6) → BGP Update Grouping → Events (10^5) → Event Classification → “Typed” Events → Event Correlation → Clusters (10^3) → Traffic Impact Prediction, using Netflow data → Large Disruptions (10^1) • Side outputs: Persistent Flapping Prefixes (10^1), Frequent Flapping Prefixes (10^1)

  8. Grouping BGP Updates into Events • Challenge: a single routing change leads to multiple update messages and affects routing decisions at multiple routers • Approach: • Group together all updates for a prefix with inter-arrival time < 70 seconds • Flag prefixes with changes lasting > 10 minutes [Figure: BGP Updates → BGP Update Grouping → Events, Persistent Flapping Prefixes]
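The grouping rule on this slide can be sketched in a few lines of Python. This is a minimal offline illustration, not the authors' implementation: the `(timestamp, prefix)` input format and the function name `group_updates` are assumptions made for the example, and the real tool works on a live update stream.

```python
from collections import defaultdict

EVENT_TIMEOUT = 70          # seconds between updates, from the slide
CONVERGENCE_TIMEOUT = 600   # 10 minutes, from the slide

def group_updates(updates):
    """Group (timestamp_seconds, prefix) pairs into per-prefix events.

    Hypothetical input format. Returns (events, persistent), where each
    event is (prefix, first_update_time, last_update_time) and persistent
    is the set of prefixes with an event longer than 10 minutes.
    """
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)

    events = []
    for prefix, times in by_prefix.items():
        times.sort()
        start = prev = times[0]
        for ts in times[1:]:
            # A gap of 70 seconds or more closes the current event.
            if ts - prev >= EVENT_TIMEOUT:
                events.append((prefix, start, prev))
                start = ts
            prev = ts
        events.append((prefix, start, prev))

    # Flag prefixes whose changes last longer than the convergence timeout.
    persistent = {p for p, start, end in events
                  if end - start > CONVERGENCE_TIMEOUT}
    return events, persistent
```

For example, a prefix that is updated once a minute for eleven minutes ends up as a single event and is flagged as persistently flapping, while two updates 170 seconds apart become two separate events.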

  9. Grouping Thresholds • Based on our understanding of BGP and data analysis • Event timeout: 70 seconds • 2 * MRAI timer + 10 seconds (with the common 30-second MRAI default, 2 * 30 + 10 = 70) • 98% of inter-arrival times < 70 seconds • Convergence timeout: 10 minutes • BGP usually converges within a few minutes • 99.9% of events last < 10 minutes

  10. Persistent Flapping Prefixes • Types of persistent flapping • Conservative damping parameters (78.6%) • Protocol oscillations due to MED (18.3%) • Unstable interfaces or BGP sessions (3.0%) • A surprising finding: 15.2% of updates were caused by persistent-flapping prefixes even though flap damping is enabled.

  11. Example: Unstable eBGP Session • Flap damping parameters are session-based • Damping is not implemented for iBGP sessions [Figure labels: routers A, B, C, D; Peer; ISP; Customer; p]

  12. Event Classification • Challenge: major concerns in network management • Changes in reachability • Heavy load of routing messages on the routers • Changes in the flow of traffic through the network • Solution: classify events by the severity of their impact [Figure: Events → Event Classification → “Typed” Events, e.g., Loss/Gain of Reachability]

  13. Event Category – “No Disruption” • “No Disruption”: no border router has any traffic shift (50.3%) [Figure labels: routers A, B, C, D; ISP; AS1; AS2; p; No Traffic Shift]

  14. Event Category – “Internal Disruption” • “Internal Disruption”: all traffic shifts are internal (15.6%) [Figure labels: routers A, B, C, D; ISP; AS1; AS2; p; Internal Traffic Shift]

  15. Event Category – “Single External Disruption” • “Single External Disruption”: only one of the traffic shifts is external (20.7%) [Figure labels: routers A, B, C, D; ISP; AS1; AS2; p; External Traffic Shift]
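The three categories on slides 13 to 15 can be expressed as a small decision function. The sketch below assumes a hypothetical per-router summary (`"none"`, `"internal"`, or `"external"`) that the transcript does not define; the three categories cover 86.6% of events on the slides, so the sketch returns `"other"` for anything they do not describe.

```python
def classify_event(shifts):
    """Classify one event from per-border-router traffic shifts.

    shifts: dict mapping router name -> "none" | "internal" | "external"
    (a hypothetical representation chosen for this example).
    """
    active = [s for s in shifts.values() if s != "none"]
    if not active:
        # No border router has any traffic shift.
        return "no disruption"
    externals = sum(1 for s in active if s == "external")
    if externals == 0:
        # Every shift stays inside the network.
        return "internal disruption"
    if externals == 1:
        # Exactly one of the shifts is external.
        return "single external disruption"
    # Categories not covered in this transcript.
    return "other"
```

Calling `classify_event({"A": "none", "B": "internal", "C": "external"})` returns `"single external disruption"`.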

  16. Statistics on Event Classification • The first 3 categories have significant day-to-day variations • The number of updates per event depends on the type of event and the number of affected routers

  17. Event Correlation • Challenge: a single routing change affects multiple destination prefixes • Solution: group events of the same type that occur close together in time [Figure: “Typed” Events → Event Correlation → Clusters]
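One way to read "same type, close in time" is a per-type sweep over events sorted by start time. The sketch below is only an illustration: the `(start_time, event_type, prefix)` tuple format and the 10-second `window` are assumptions, since the slide gives no threshold for this step.

```python
from collections import defaultdict

def correlate(events, window=10.0):
    """Cluster same-type events whose start times are close together.

    events: iterable of (start_time, event_type, prefix) tuples
    (hypothetical format). window: maximum gap in seconds between
    consecutive events in a cluster (an assumed value, not from the slides).
    Returns a list of (event_type, [prefixes]) clusters.
    """
    by_type = defaultdict(list)
    for start, etype, prefix in events:
        by_type[etype].append((start, prefix))

    clusters = []
    for etype, items in by_type.items():
        items.sort()
        current = [items[0]]
        for item in items[1:]:
            if item[0] - current[-1][0] <= window:
                current.append(item)       # still part of the same burst
            else:
                clusters.append((etype, [p for _, p in current]))
                current = [item]           # start a new cluster
        clusters.append((etype, [p for _, p in current]))
    return clusters
```

Comparing each event with the previous one in its cluster, rather than with the first, lets a long burst of closely spaced events stay in a single cluster.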

  18. eBGP Session Reset • Caused most of the “single external disruption” events • Check whether the number of prefixes using that session as the best route changes dramatically • Validation with Syslog router report (95%) [Figure: number of prefixes over time, with session failure and session recovery marked]
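The check described here, a dramatic change in the number of prefixes whose best route uses a session, might look like the following sketch. The sampled `(time, count)` input and the 50% `threshold` are assumptions for illustration; the slide does not say how "dramatically" is quantified.

```python
def detect_session_changes(counts, threshold=0.5):
    """Flag sharp drops or jumps in the prefix count for one eBGP session.

    counts: list of (time, num_prefixes_using_session) samples in time
    order (hypothetical input). threshold: fraction of the larger of two
    consecutive samples that must change (an assumed value).
    Returns a list of (time, "failure" | "recovery").
    """
    changes = []
    for (t0, c0), (t1, c1) in zip(counts, counts[1:]):
        base = max(c0, c1)
        if base == 0:
            continue  # session carries no prefixes in either sample
        if (c0 - c1) / base >= threshold:
            changes.append((t1, "failure"))
        elif (c1 - c0) / base >= threshold:
            changes.append((t1, "recovery"))
    return changes
```

With samples of 1000, 990, 10, 12, and 1005 prefixes at times 0 through 4, the function reports a failure at time 2 and a recovery at time 4.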

  19. Hot-Potato Changes • “Hot-potato routing” = route to the closest egress point • Caused “internal disruption” events • Validation with OSPF measurement (95%) [Teixeira et al – SIGMETRICS ’04] [Figure labels: ISP; egress routers A, B, C; P; path costs 11, 9, 10]

  20. Traffic Impact Prediction • Challenge: routing changes have different impacts on the network, depending on the popularity of the destinations • Solution: weigh each cluster by traffic volume [Figure: Clusters and Netflow Data → Traffic Impact Prediction → Large Disruptions]

  21. Traffic Impact Prediction • Traffic weight • Per-prefix measurement from Netflow • 10% of prefixes account for 90% of traffic • Traffic weight of a cluster • The sum of the “traffic weight” of its prefixes • A small number of large clusters have large traffic weight • Mostly session resets and hot-potato changes
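The cluster weight defined on this slide is a plain sum over prefixes, which the sketch below makes concrete. The dictionary inputs, the function names, and the `min_weight` cutoff are assumptions for the example; the slides do not state what weight makes a cluster a "large disruption".

```python
def cluster_weight(prefixes, traffic_share):
    """Sum the per-prefix traffic weights of a cluster.

    traffic_share: dict prefix -> fraction of total traffic, e.g. derived
    from Netflow (hypothetical format). Unknown prefixes count as 0.
    """
    return sum(traffic_share.get(p, 0.0) for p in prefixes)

def rank_clusters(clusters, traffic_share, min_weight=0.01):
    """Return (weight, name) pairs for heavy clusters, heaviest first.

    clusters: dict cluster_name -> list of prefixes (hypothetical format).
    min_weight: reporting cutoff (an assumed value, not from the slides).
    """
    weighted = [(cluster_weight(prefixes, traffic_share), name)
                for name, prefixes in clusters.items()]
    return sorted((w for w in weighted if w[0] >= min_weight), reverse=True)
```

Because a small share of prefixes carries most of the traffic, a ranking like this tends to surface only a few clusters even when there are thousands of them.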

  22. Performance Evaluation • Memory • Static memory: “current routes”, 600 MB • Dynamic memory: “clusters”, 300 MB • Speed • 99% of 1-second intervals of updates can be processed within 1 second • Occasional execution lag • Every 70-second interval of updates can be processed within 70 seconds • Measurements were based on a 900 MHz CPU

  23. Conclusion • BGP troubleshooting system • Fast, online fashion • Operators’ concerns (reachability, flapping, traffic) • Significant information reduction • Millions of updates → a few dozen large disruptions • Uncovered important network behavior • Hot-potato changes • Session resets • Persistent-flapping prefixes
