1 / 22

BGP Anomaly Detection in an ISP

This paper presents a tool to identify important BGP anomalies in real-time, focusing on lost reachability, persistent flapping, and large traffic shifts. The tool utilizes BGP updates and netflow data to classify events and predict their impact on network traffic.

silas-flynn
Download Presentation

BGP Anomaly Detection in an ISP

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. BGP Anomaly Detection in an ISP Jian Wu (U. Michigan) Z. Morley Mao (U. Michigan) Jennifer Rexford (Princeton) Jia Wang (AT&T Labs) http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf

  2. Goal • Identify important anomalies • Lost reachability • Persistent flapping • Large traffic shifts • Contributions: • Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time. • Use the tool to characterize routing disruptions in an operational network

  3. BR BR BR BR BR BR BR BR BR BR BR BR C C C C C C C C C C C C Capturing Routing Changes Large operational network (8/16/2004 – 10/10-2004) eBGP eBGP Updates Updates eBGP iBGP iBGP Best routes Best routes iBGP BGP Monitor CPE iBGP iBGP iBGP eBGP eBGP eBGP

  4. Challenges • Large volume of BGP updates • Millions daily, very bursty • Too much for an operator to manage • Different than root-cause analysis • Identify changes and their effects • Focus on actionable events • Diagnose causes only in/near the AS

  5. BGP Updates (106) Large Disruptions “Typed” Events Clusters (103) BR BR BR BR BR BR E E E E E E E E E E E E Events (105) (101) BGP Update Grouping Event Classification Event Correlation Traffic Impact Prediction Netflow Data Persistent Flapping Prefixes Frequent Flapping Prefixes (101) (101) System Architecture

  6. BR BR BR E E E E E E BGP Update Grouping BGP Updates Events Grouping BGP Update into Events Challenge: A single routing change • leads to multiple update messages • affects routing decisions at multiple routers • Solution: • Group all updates for a prefix with inter-arrival < 70 seconds • Flag prefixes with changes lasting > 10 minutes. Persistent Flapping Prefixes

  7. Grouping Thresholds • Based on data analysis and our understanding of BGP • Event timeout: 70 seconds • 2 * MRAI timer + 10 seconds • 98% inter-arrival time < 70 seconds • Convergence timeout: 10 minutes • BGP usually converges within minutes • 99.9% events < 10 minutes

  8. Persistent Flapping Prefixes • Causes of persistent flapping • Conservative damping parameters (78.6%) • Protocol oscillations due to MED (18.3%) • Unstable interface or BGP session (3.0%) Surprising finding: 15.2% of updates were caused by persistent flapping prefixes, even though flap damping was enabled!

  9. C B A D E E E E Example: Unstable eBGP Session • Flap damping parameters are session-based • Damping not implemented for iBGP sessions Peer ISP p Customer

  10. Event Classification Challenge: Major concerns in network management • Changes in reachability • Heavy load of routing messages on the routers • Change of flow of traffic through the network Event Classification “Typed” Events Events Solution: classify events by severity of their impacts

  11. C D A B E E E E E E Event Category – “No Disruption” p AS2 AS1 No Traffic Shift “No Disruption”: each of the border routers has no traffic shift. (50.3%) ISP

  12. C D A B E E E E E E Event Category – “Internal Disruption” p AS2 AS1 “Internal Disruption”: all of the traffic shifts are internal traffic shift. (15.6%) ISP Internal Traffic Shift

  13. C D A B E E E E E E Event Category – “Single External Disruption” p AS2 AS1 external Traffic Shift “Single External Disruption”: only one of the traffic shifts is external traffic shift. (20.7%) ISP

  14. Statistics on Event Classification • First 3 categories have significant variations from day to day • Updates per event depends on the type of events and the number of affected routers

  15. Event Correlation Challenge: A single routing change • affects multiple destination prefixes Event Correlation “Typed” Events Clusters Solution: group events of same type that occur close in time

  16. EBGP Session Reset • Caused most “single external disruption” events • Check if the number of prefixes using that session as the best route changes dramatically • Validation with Syslog router report (95%) Number of prefixes session recovery session failure time

  17. A B C E E E Hot-Potato Changes • Hot-Potato Changes • Caused “internal disruption” events • Validation with OSPF measurement (95%) [Teixeira et al – SIGMETRICS’ 04] P “Hot-potato routing” = route to closest egress point 11 9 10 ISP

  18. BR BR BR E E E E E E Traffic Impact Prediction Challenge: Routing changes have different impacts on the network which depends on the popularity of the destinations Traffic Impact Prediction Large Disruptions Clusters Netflow Data Solution: weigh each cluster by traffic volume

  19. Traffic Impact Prediction • Traffic weight • Per-prefix measurement from Netflow • 10% prefixes accounts for 90% of traffic • Traffic weight of a cluster • Sum of “traffic weight” of the prefixes • A few clusters have large traffic weight • Mostly session resets & hot-potato changes

  20. Performance Evaluation • Memory • Static memory: “current routes”, 600 MB • Dynamic memory: “clusters”, 300 MB • Speed • 99% of intervals of 1 second of updates can be process within 1 second • Occasional execution lag • Every interval of 70 seconds of updates can be processed within 70 seconds Measurements were based on 900MHz CPU

  21. Conclusion • BGP anomaly detection • Fast, online fashion • Operator concerns (reachability, flapping, traffic) • Significant information reduction • Uncovered important network behaviors • Persistent flapping prefixes • Hot-potato changes • Session resets and interface failures

  22. Detecting Peering Violations • Consistent export requirement • Peer should advertise prefixes at all peering points, with the same AS path length • Allows the AS to do hot-potato routing • Detecting violations • Using iBGP feeds from the border routers • Some inference tricks to identify inconsistencies • Results of the study • http://www.nanog.org/mtg-0410/feamster.html • http://www.cs.princeton.edu/~jrex/papers/imc04.pdf

More Related