Internet Routing (COS 598A)
Today: Detecting Anomalies Inside an AS
Jennifer Rexford
http://www.cs.princeton.edu/~jrex/teaching/spring2005
Tuesdays/Thursdays 11:00am-12:20pm
Outline • Traffic • SNMP link statistics • Packet and flow monitoring • Network topology • IP routers and links • Fault data, layer-2 topology, and configuration • Intradomain route monitoring • Interdomain routes • BGP route monitoring • Analysis of BGP update data • Conclusions
Why is Traffic Measurement Important? • Billing the customer • Measure usage on links to/from customers • Applying billing model to generate a bill • Traffic engineering and capacity planning • Measure the traffic matrix (i.e., offered load) • Tune routing protocol or add new capacity • Denial-of-service attack detection • Identify anomalies in the traffic • Configure routers to block the offending traffic • Analyze application-level issues • Evaluate benefits of deploying a Web caching proxy • Quantify fraction of traffic that is P2P file sharing
Collecting Traffic Data: SNMP • Simple Network Management Protocol • Standard Management Information Base (MIB) • Protocol for querying the MIBs • Advantage: ubiquitous • Supported on all networking equipment • Multiple products for polling and analyzing data • Disadvantages: dumb • Coarse granularity of the measurement data • E.g., number of bytes/packets per interface per 5 minutes • Cannot express complex queries on the data • Unreliable delivery of the data using UDP
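To make the "coarse granularity" concrete, here is a minimal Python sketch of what those 5-minute SNMP byte counters support: turning two successive `ifInOctets` polls into an average link utilization. The function name and the single-wrap handling are illustrative assumptions; real Counter32 objects on fast links can wrap more than once between polls, which is why 64-bit counters exist.

```python
# Minimal sketch: average link utilization from two SNMP ifInOctets polls.
# Assumes a 32-bit counter that wraps at most once between polls.

COUNTER32_MAX = 2**32

def utilization(octets_t0, octets_t1, interval_s, link_capacity_bps):
    """Average utilization over one poll interval from two byte counters."""
    delta = (octets_t1 - octets_t0) % COUNTER32_MAX  # handles a single wrap
    bits = delta * 8
    return bits / (interval_s * link_capacity_bps)

# Example: 150 MB transferred in 5 minutes on a 100 Mb/s link -> 4% utilization
u = utilization(1_000_000, 151_000_000, 300, 100_000_000)
```

Note what the sketch cannot do: it yields only an interval average, so any burstiness inside the 5-minute window is invisible, which is exactly the limitation the slide calls "dumb."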
Collecting Traffic Data: Packet Monitoring • Packet monitoring • Passively collecting IP packets on a link • Recording IP, TCP/UDP, or application-layer traces • Advantages: details • Fine-grain timing information • E.g., can analyze the burstiness of the traffic • Fine-grain packet contents • Addresses, port numbers, TCP flags, URLs, etc. • Disadvantages: overhead • Hard to keep up with high-speed links • Often requires a separate monitoring device
Collecting Traffic Data: Flow Statistics • Flow monitoring (e.g., Cisco Netflow) • Statistics about groups of related packets (e.g., same IP/TCP headers and close in time) • Recording header information, counts, and time • Advantages: detail with less overhead • Almost as good as packet monitoring, except no fine-grain timing information or packet contents • Often implemented directly on the interface card • Disadvantages: trade-off detail and overhead • Less detail than packet monitoring • Less ubiquitous than SNMP statistics
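The "groups of related packets" idea can be sketched in a few lines. This is not Cisco's NetFlow implementation, just an illustrative Python model: packets sharing a 5-tuple are folded into one flow record (counts plus first/last timestamps), and a flow is expired when the gap since its last packet exceeds an inactivity timeout (the 15-second value here is an assumption; real timeouts are configurable).

```python
# Sketch of NetFlow-style flow aggregation: fold packets sharing a 5-tuple
# into one record, expiring a flow after an idle gap.

from collections import namedtuple

Packet = namedtuple("Packet", "ts src dst sport dport proto size")
INACTIVE_TIMEOUT = 15.0  # seconds (assumed value)

def aggregate(packets):
    active, finished = {}, []
    for p in sorted(packets, key=lambda p: p.ts):
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        rec = active.get(key)
        if rec and p.ts - rec["last"] > INACTIVE_TIMEOUT:
            finished.append(rec)          # expire the idle flow
            rec = None
        if rec is None:
            rec = {"key": key, "first": p.ts, "last": p.ts,
                   "packets": 0, "bytes": 0}
            active[key] = rec
        rec["last"] = p.ts
        rec["packets"] += 1
        rec["bytes"] += p.size
    return finished + list(active.values())
```

The record keeps header fields, counts, and times, but not per-packet timestamps or payloads, matching the trade-off on the slide: most of the detail of packet monitoring at a fraction of the overhead.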
Using the Traffic Data in Network Operations • SNMP byte/packet counts: everywhere • Tracking link utilizations and detecting anomalies • Generating bills for traffic on customer links • Inference of the offered load (i.e., traffic matrix) • Packet monitoring: selected locations • Analyzing the small time-scale behavior of traffic • Troubleshooting specific problems on demand • Flow monitoring: selective, e.g., the network edge • Tracking the application mix • Direct computation of the traffic matrix • Input to denial-of-service attack detection
IP Topology • Topology information • Routers • Links, and their capacities • Internal links inside the AS • Edge links connecting to neighboring domains • Ways to learn the topology • Inventory database • SNMP polling/traps • Traceroute • Route monitoring • Router configuration data
Below IP • Layer-2 paths • ATM virtual circuits • Frame Relay virtual circuits • Mapping to lower layers • Specific fibers • Shared optical amplifiers • Shared conduits • Physical length (propagation delay) • Information not visible to IP • Stored in an inventory database • Not necessarily generated/updated automatically
Intradomain Monitoring: OSPF Protocol • Link-state protocol • Routers flood Link State Advertisements (LSAs) • Routers compute shortest paths based on weights • Routers identify next-hop to reach other routers [Figure: example topology with OSPF link weights]
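The per-router computation on this slide is Dijkstra's shortest-path algorithm over the flooded link weights, followed by extracting the next hop toward each destination. A minimal Python sketch (the topology in the test is illustrative, not the slide's figure):

```python
# Sketch of the OSPF route computation: Dijkstra over link weights,
# recording the first hop on each shortest path.

import heapq

def spf(graph, source):
    """graph: {node: {neighbor: weight}}. Returns {dest: (cost, next_hop)}."""
    dist, nexthop = {source: 0}, {}
    pq = [(0, source, None)]
    while pq:
        d, u, nh = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue          # stale queue entry
        if nh is not None:
            nexthop[u] = nh
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                # inherit the next hop, or v itself if u is the source
                heapq.heappush(pq, (nd, v, nh if nh else v))
    return {v: (dist[v], nexthop[v]) for v in nexthop}
```

Because every router runs this same computation on the same flooded LSAs, a monitor that collects the LSAs can reproduce each router's forwarding decisions, which is what makes the route-monitoring slides that follow possible.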
Intradomain Route Monitoring • Construct continuous view of topology • Detect when equipment goes up or down • Input to traffic-engineering and planning tools • Detect routing anomalies • Identify failures, LSA storms, and route flaps • Verify that LSA load matches expectations • Flag strange weight settings as misconfigurations • Analyze convergence delay • Monitor LSAs in multiple locations • Compare the times when LSAs arrive • Detect router implementation mistakes
Passive Collection of LSAs • OSPF is a flooding protocol • Every LSA sent on every participating link • Very helpful for simplifying the monitor • Can participate in the protocol • Shared media (e.g., Ethernet) • Join multicast group and listen to LSAs • Point-to-point links • Establish an adjacency with a router • … or passively monitor packets on a link • Tap a link and capture the OSPF packets
Reducing the Volume of Information • Prioritizing the messages • Router failure over router recovery • Link failure or weight change over a refresh • Informational messages about weight settings • Grouping related messages • Link failure: group messages for the two ends • Router failure: group the affected links • Common failure: group links failing close in time
Anomalies Found in the Shaikh04 paper • Intermittent hardware problem • Router periodically losing OSPF adjacencies • Risk of network partition if 2nd failure occurred • External link flaps • Congestion on edge link causing lost messages • Lost adjacency leading to flapping routes • Configuration errors • Two routers assigned the same IP address • Inefficient config leading to duplicate LSAs • Vendor implementation bug • More frequent refreshing of LSAs than specified
Motivation for BGP Monitoring • Visibility into external destinations • What neighboring ASes are telling you • How you are reaching external destinations • Detecting anomalies • Increases in number of destination prefixes • Lost reachability to some destinations • Route hijacking • Instability of the routes • Input to traffic-engineering tools • Knowing the current routes in the network • Workload for testing routers • Realistic message traces to play back to routers
BGP Monitoring: A Wish List • Ideally: knowing what the router knows • All externally-learned routes • Before policy has modified the attributes • Before a single best route is picked • How to achieve this • Special monitoring session on routers that tells everything they have learned • Packet monitoring on all links with BGP sessions • If you can’t do that, you could always do… • Periodic dumps of routing tables • BGP session to learn best route from router
Using Routers to Monitor BGP • Establish a "passive" BGP session (eBGP or iBGP) from a workstation running BGP software • (+) BGP table dumps do not burden operational routers • (+) Update dynamics captured • (+) Not restricted to interfaces provided by vendors • (-) Receives only best routes from each BGP neighbor • Talk to operational routers using SNMP or telnet at the command line • (+) Table dumps show all alternate routes • (-) BGP table dumps are expensive • (-) Update dynamics lost • (-) Restricted to interfaces provided by vendors
Collect BGP Data From Many Routers • BGP is not a flooding protocol, so the route monitor must peer with many routers [Figure: U.S. backbone map showing a route monitor collecting BGP data from routers in many cities]
Detecting Important Routing Changes • Large volume of BGP update messages • Around 2 million/day, and very bursty • Too much for an operator to manage • Identify important anomalies • Lost reachability • Persistent flapping • Large traffic shifts • Not the same as root-cause analysis • Identify changes and their effects • Focus on mitigation, rather than diagnosis • Diagnose causes if they occur in/near the AS
Challenge #1: Excess Update Messages • A single routing change • Leads to multiple update messages • Affects routing decision at multiple routers • BGP update grouping: group updates for a prefix with inter-arrival < 70 seconds into events, and flag prefixes with changes lasting > 10 minutes as persistently flapping [Figure: pipeline from BGP updates at the border routers to events and persistently flapping prefixes]
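The grouping rule above is simple enough to state in code. This Python sketch applies the slide's two thresholds (70-second inter-arrival timeout, 10-minute flapping cutoff) to the sorted update arrival times for a single prefix; the function name and tuple layout are my own.

```python
# Sketch of the update-grouping rule: updates for one prefix separated by
# less than 70 s fall into the same event; an event lasting more than
# 600 s marks the prefix as persistently flapping.

EVENT_TIMEOUT = 70     # seconds (from the BGP-beacon analysis)
FLAP_THRESHOLD = 600   # seconds (10 minutes)

def group_events(update_times):
    """update_times: sorted arrival times of updates for one prefix.
    Returns a list of (start, end, is_flapping) events."""
    events = []
    start = last = update_times[0]
    for t in update_times[1:]:
        if t - last < EVENT_TIMEOUT:
            last = t                      # same event continues
        else:
            events.append((start, last, last - start > FLAP_THRESHOLD))
            start = last = t              # gap >= timeout: new event
    events.append((start, last, last - start > FLAP_THRESHOLD))
    return events
```

The two slides that follow justify the constants: 98% of same-prefix inter-arrival times fall under 70 seconds, while only 0.1% of events run past 600 seconds, so the long tail is flagged as persistent flapping rather than treated as ordinary convergence.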
Determine "Event Timeout" [Figure: cumulative distribution of BGP update inter-arrival time, with BGP beacon prefixes shown separately; 98% of inter-arrival times fall below 70 seconds]
Event Duration: Persistent Flapping [Figure: complementary cumulative distribution of event duration; only 0.1% of events ("long events") last longer than 600 seconds]
Detecting Persistent Flapping • Significant persistent flapping • 15.2% of all BGP update messages • … though a small number of destination prefixes • Surprising, especially since flap damping is used • Types of persistent flapping • Conservative flap-damping parameters (78.6%) • Protocol oscillations, e.g., MED oscillation (18.3%) • Unstable interface or BGP session (3.0%)
Example: Unstable eBGP Session • Flap-damping parameters are session-based • Damping not implemented for iBGP sessions [Figure: AT&T backbone with a flapping eBGP session to a customer announcing prefix p, alongside a stable peer]
Challenge #2: Identify Important Events • Major concerns of network operators • Changes in reachability • Heavy load of routing messages on the routers • Flow of the traffic through the network • Event classification: classify each event by the type of impact it has on the network (no disruption, internal disruption, single external disruption, multiple external disruption, loss/gain of reachability)
Event Category: "No Disruption" • No traffic shift at any of the border routers [Figure: AT&T backbone reaching prefix p through AS1 and AS2, with no traffic shift]
Event Category: "Internal Disruption" • All of the traffic shifts are internal to the AS [Figure: AT&T backbone with an internal traffic shift for prefix p]
Event Type: "Single External Disruption" • Traffic at one exit point shifts to other exit points [Figure: AT&T backbone where traffic for prefix p shifts from one external exit point to others]
Challenge #3: Multiple Destinations • A single routing change • Affects multiple destination prefixes • Event correlation: group events of the same type that occur close in time into clusters
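The correlation step above can be sketched as a streaming pass over typed events. The 60-second window below is an illustrative choice, not the paper's exact value, and the event layout is my own: events sharing a type and arriving within the window are merged into one cluster on the assumption that they share a root cause.

```python
# Sketch of event correlation: merge same-type events that occur close in
# time into clusters. The 60 s window is an assumed parameter.

CLUSTER_WINDOW = 60  # seconds (assumed)

def cluster_events(events):
    """events: list of (time, event_type, prefix), sorted by time.
    Returns a list of clusters, each a list of events."""
    open_clusters = {}   # event_type -> most recent cluster of that type
    out = []
    for t, etype, prefix in events:
        cur = open_clusters.get(etype)
        if cur is None or t - cur[-1][0] > CLUSTER_WINDOW:
            cur = []                     # start a new cluster for this type
            open_clusters[etype] = cur
            out.append(cur)
        cur.append((t, etype, prefix))
    return out
```

This is why a single eBGP session reset, which disrupts thousands of prefixes at once, collapses into one "single external disruption" cluster rather than thousands of separate alarms.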
Main Causes of Large Clusters • External BGP session resets • Failure/recovery of external BGP session • E.g., session to another large tier-1 ISP • Caused “single external disruption” events • Validated by looking at syslog reports on routers • Hot-potato routing changes • Failure/recovery of an intradomain link • E.g., leads to changes in IGP path costs • Caused “internal disruption” events • Validated by looking at OSPF measurements
Challenge #4: Popularity of Destinations • Impact of an event on traffic • Depends on the popularity of the destinations • Traffic impact prediction: weight each cluster of destinations by its traffic volume (from Netflow data) to flag large disruptions
Traffic Impact Prediction • Traffic weight • Per-prefix measurements from Netflow • 10% of prefixes account for 90% of the traffic • Traffic weight of a cluster • The sum of the traffic weights of its prefixes • Flag clusters with heavy traffic • A few large clusters have large traffic weight • Mostly session resets and hot-potato changes
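The weighting step reduces to a sum and a threshold; a minimal Python sketch, where the prefix volumes, cluster layout, and threshold are all made-up illustrations rather than measured values:

```python
# Sketch of traffic-impact prediction: weight each cluster by the Netflow
# traffic volume of its prefixes and flag the heavy ones.

def heavy_clusters(clusters, prefix_traffic, threshold):
    """clusters: {cluster_id: [prefix, ...]};
    prefix_traffic: {prefix: bytes observed by Netflow}.
    Returns the ids of clusters whose total weight meets the threshold."""
    flagged = []
    for cid, prefixes in clusters.items():
        weight = sum(prefix_traffic.get(p, 0) for p in prefixes)
        if weight >= threshold:
            flagged.append(cid)
    return flagged
```

Because 10% of prefixes carry 90% of the traffic, most clusters fall far below any reasonable threshold, leaving operators with the short list of genuinely disruptive events (largely session resets and hot-potato changes).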
Conclusions • Network troubleshooting from the inside • Traffic, topology, and routing data • Easier to understand what’s going on • … though still challenging to collect/analyze data • Traffic measurement • SNMP, packet monitoring, and flow monitoring • Routing monitors • Track network state and identify anomalies • Intradomain monitor capturing LSAs • BGP monitor capturing BGP updates
Next Time: BGP Routing Table Size • Three papers • "On characterizing BGP routing table growth" • "An empirical study of router response to large BGP routing table load" • "A framework for interdomain route aggregation" • Review only of the first paper • Summary • Why accept • Why reject • Avenues for future work • Optional • Vannevar Bush on "As We May Think" (1945)