
PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services



Presentation Transcript


  1. PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services • Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang • Princeton University

  2. Motivation • Routing anomalies are common on the Internet • Maintenance • Power outage • Fiber cut • Misconfiguration • … • Anomalies can affect end-to-end performance • Packet losses • Packet delays • Loss of connectivity

  3. Background • Anomaly detection and diagnosis are nontrivial • Asymmetric paths • Failure information propagation • Highly varied durations • Limited coverage

  4. Contributions • New techniques for • Anomaly detection • Anomaly isolation • Anomaly classification • Large-scale study of anomalies • Broad coverage • High detection rate, low overhead • Characterization of anomalies • End-to-end effects • Benefits to host service

  5. Outline • State of the Art • PlanetSeer Components • MonD – passive monitoring • ProbeD – active probing • Anomaly Analysis • Loop-based anomaly • Non-loop anomaly • Bypassing Anomalies • Summary

  6. State of the Art • Routing messages • BGP: AS-level diagnosis • IS-IS, OSPF: Within single ISP • Router/link traffic statistics • SNMP, NetFlow: proprietary • End-to-end measurement • Ping, traceroute

  7. End-to-End Probing • All-pairs probes among n nodes • O(n^2) measurement cost • Not scalable as n grows
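To make the scaling concern concrete, here is a minimal sketch of the all-pairs probe count. The function name is hypothetical, and the node counts are illustrative; 353 is the ProbeD deployment size quoted later in the talk.

```python
def all_pairs_probe_count(n: int) -> int:
    """Directed probe paths when each of n nodes probes every other node."""
    return n * (n - 1)


# Illustrative sizes only; the quadratic growth is the point.
for n in (10, 100, 353):
    print(f"{n} nodes -> {all_pairs_probe_count(n)} probe paths")
```

At 353 nodes this already exceeds 120,000 paths per probing round, which is the cost that motivates using passive monitoring to decide where to probe.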

  8. Key Observation • Combine passive monitoring with active probing • Peer-to-Peer (P2P), Content Distribution Network (CDN) • Large client population • Geographically distributed nodes • Large traffic volume • Highly diverse paths • The traffic generated by the services reveals information about the network.

  9. Our Approach • Host service • CDN • Components • Passive monitoring • Active probing • Advantages • Low overhead • Wide coverage • [Figure labels: Client C, R1, R2, B, A]

  10. MonD: Anomaly Detection • Anomaly indicators • Time-to-live (TTL) change • Routing change • n consecutive timeouts (n = 4 in current system) • Idling period of 3 to 16 seconds • most congestion periods < 220ms
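The two indicators on this slide can be sketched as a small per-flow state machine. This is an illustration under assumptions, not MonD's actual code: the class, method names, and flow identifiers are hypothetical, and only the threshold of 4 consecutive timeouts comes from the slide.

```python
from dataclasses import dataclass
from typing import Dict, Optional

TIMEOUT_THRESHOLD = 4  # "n = 4 in current system", per the slide


@dataclass
class FlowState:
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0


class PassiveMonitor:
    """Hypothetical per-flow tracker for TTL changes and timeout runs."""

    def __init__(self, timeout_threshold: int = TIMEOUT_THRESHOLD) -> None:
        self.timeout_threshold = timeout_threshold
        self.flows: Dict[str, FlowState] = {}

    def _state(self, flow_id: str) -> FlowState:
        return self.flows.setdefault(flow_id, FlowState())

    def on_packet(self, flow_id: str, ttl: int) -> bool:
        """Record a received packet; True if its TTL differs from the previous one."""
        st = self._state(flow_id)
        changed = st.last_ttl is not None and st.last_ttl != ttl
        st.last_ttl = ttl
        st.consecutive_timeouts = 0  # a received packet ends the timeout run
        return changed

    def on_timeout(self, flow_id: str) -> bool:
        """Record a retransmission timeout; True exactly when the run reaches the threshold."""
        st = self._state(flow_id)
        st.consecutive_timeouts += 1
        return st.consecutive_timeouts == self.timeout_threshold
```

Either method returning True would mark the path as a candidate anomaly to hand to active probing; confirmation is a separate step.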

  11. ProbeD Operation • Baseline probes • When a new IP appears • From local node • Forward probes • When a possible anomaly detected • From multiple nodes (including local node) • Reprobes • At 0.5, 1.5, 3.5 and 7.5 hours later • From local node
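The reprobe offsets on this slide (0.5, 1.5, 3.5 and 7.5 hours) match a schedule whose gaps double each time (0.5, 1, 2, 4 hours). Whether the system computes them this way is an assumption; the sketch below, with a hypothetical function name, simply reproduces the listed offsets.

```python
from typing import List


def reprobe_offsets(first_gap_hours: float = 0.5, count: int = 4) -> List[float]:
    """Offsets (hours after detection) when each gap is twice the previous one."""
    offsets: List[float] = []
    elapsed, gap = 0.0, first_gap_hours
    for _ in range(count):
        elapsed += gap
        offsets.append(elapsed)
        gap *= 2
    return offsets


print(reprobe_offsets())
```

A doubling schedule spends most probes soon after detection while still covering anomalies that last for hours.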

  12. ProbeD Groups • 353 nodes, 145 sites, 30 groups • According to geographic location • One traceroute per group
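One traceroute per group means each geographic group contributes a single vantage point per anomaly. A minimal sketch of that selection follows; choosing the node at random is an assumption (the slide does not say how the node is picked), and the node and group names are made up.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def pick_probers(node_groups: Dict[str, str],
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Return one node per group; node_groups maps node name -> group name."""
    rng = rng or random.Random()
    by_group: Dict[str, List[str]] = defaultdict(list)
    for node, group in node_groups.items():
        by_group[group].append(node)
    # Random choice within each group is an assumption, not the documented policy.
    return {group: rng.choice(sorted(nodes)) for group, nodes in by_group.items()}


nodes = {"n1": "us-east", "n2": "us-east", "n3": "europe", "n4": "asia"}
print(pick_probers(nodes, random.Random(0)))
```

With 30 groups this caps the forward-probe cost at 30 traceroutes per suspected anomaly rather than one from each of the 353 nodes.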

  13. Estimating Scope • Which routers might be affected? • Routers that may have changed their next hops • Traceroutes from multiple locations can narrow the scope • [Figure labels: Local ProbeD, Remote ProbeD, Client, routers ra, rb, rc, rd]

  14. Path Diversity • Monitoring Period: 02/2004 – 05/2004 • Unique IPs: 887,521 • Traversed ASes: 10,090 • [Figure: AS hierarchy between Core and Edge; labels: 215 ASes, 22 ASes, 1392 ASes, 1420 ASes, 13872 ASes]

  15. Confirming Anomalies • Reported anomalies • 2,259,588 • Conditions • Loops • Route change • Partial unreachability • ICMP unreachable • Very conservative confirmation • [Chart: Anomaly 12%, Undecided 22%, Non-anomaly 66%]

  16. Confirmed Anomaly Breakdown • Confirmed anomalies • 271,898 • 2 per minute • 100x more • Temp anomalies • Inconsistent probes • [Chart: Path Change 44%, Other Outage 23%, Temp Anomalies 16%, Fwd Outage 9%, Persist Loop 7%, Temp Loop 1%]
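The headline numbers on the last two slides can be cross-checked with a few lines of arithmetic. The 90-day duration is an assumption standing in for the roughly three-month monitoring period; the two counts are taken from the slides.

```python
reported = 2_259_588   # reported anomalies (slide 15)
confirmed = 271_898    # confirmed anomalies (slide 16)
days = 90              # assumption: about three months of monitoring

confirmed_share = confirmed / reported
per_minute = confirmed / (days * 24 * 60)

print(f"confirmed share: {confirmed_share:.1%}")
print(f"confirmed anomalies per minute: {per_minute:.1f}")
```

Under that assumption the confirmed share comes out at about 12% and the rate at about 2 per minute, consistent with the figures quoted on the slides.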

  17. Scope of Loops • How many routers or ASes are involved? • Temp loops involve more routers than persistent loops • 97% persistent loops and 51% temp loops contain 2 hops • [Figure callouts: 1% persist loops cross ASes; 15% temp loops cross ASes]

  18. Distribution of Loops • Many persistent loops in tier-3, few in tier-1 • Worst 10% of tier-1 ASes – implications for largest ISPs • 20% traffic • 35% persistent loops

  19. Duration of Persistent Loops • How long do persistent loops last? • Either resolve quickly or last for an extended period

  20. Scope of Forward Anomalies • How many routers or ASes are affected? • 60% outages within 1 hop • 75% outages and 68% changes within 4 hops • [Figure callouts: 78% outages within 2 ASes; 57% changes within 2 ASes]

  21. Location of Forward Anomalies • How close are the anomalies to the edges of the network? • 44% outages at the last hop • 72% outages and 40% changes within 4 hops

  22. Distribution of Forward Anomalies • Which ASes are affected? • Tier-1 ASes most stable • Tier-3 ASes most likely to be affected

  23. Overlay Routing • Use alternate path when default path fails • [Figure labels: source, intermediate, destination]

  24. Bypassing Anomalies • How useful is overlay routing for bypassing failures? • Effective in 43% of 62,815 failures, lower than previous studies • 32% bypass paths inflate RTTs by more than a factor of two
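A one-hop overlay detour and its RTT inflation can be sketched as follows. This is an illustration under assumptions, not the study's evaluation code: the function names, the idea of picking the lowest-RTT intermediate, and the sample RTT values are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# intermediate name -> (RTT source->intermediate, RTT intermediate->destination), in ms.
# None marks a leg that is currently unreachable.
Legs = Dict[str, Tuple[Optional[float], Optional[float]]]


def best_detour(legs: Legs) -> Optional[Tuple[str, float]]:
    """Pick the intermediate with the lowest total RTT among those with both legs up."""
    usable = {name: a + b for name, (a, b) in legs.items()
              if a is not None and b is not None}
    if not usable:
        return None  # no intermediate can bypass the failure
    name = min(usable, key=usable.get)
    return name, usable[name]


def rtt_inflation(detour_rtt_ms: float, direct_rtt_ms: float) -> float:
    """Ratio of detour RTT to the RTT of the direct path before it failed."""
    return detour_rtt_ms / direct_rtt_ms


legs = {"A": (40.0, 70.0), "B": (30.0, None), "C": (90.0, 100.0)}
choice = best_detour(legs)
if choice is not None:
    name, rtt = choice
    print(name, rtt, rtt_inflation(rtt, direct_rtt_ms=50.0))
```

In this made-up example the detour works but more than doubles the RTT, the kind of case the 32% figure on the slide counts.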

  25. Summary • Confirm 272,000 anomalies in 3 months • Persistent and temporary loops • Persistent loops narrower scope, either resolve quickly or last for a long time • Path outages and changes • Outages closer to edge, narrower scope • Anomaly distribution • Skewed. Tier-1 most stable. Tier-3 most problematic. • Overlay routing • Bypasses 43% failures, latency inflation

  26. More Information • In the paper • More details about anomaly characteristics • End-to-end impacts • Classification methodology • Optimizations to reduce overheads & improve confirmation rate • mzhang@cs.princeton.edu • http://www.cs.princeton.edu/nsg/infoplane

  27. Classifying Anomalies • Temporary vs. persistent loops • Whether probes exit the loop by the maximum hop count • Path changes vs. outages • Changes: follow different paths to clients • Outages: stop at intermediate hops • [Figure labels: ProbeD, Client]
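The classification rules on this slide can be sketched over traceroute output. This is one reading of the slide, not the paper's methodology: the function names, the 30-hop limit, and the treatment of a loop as "persistent" when the probe is still looping at the hop limit are assumptions. A trace is a list of router names, with None for hops that did not answer.

```python
from typing import List, Optional, Sequence

Hop = Optional[str]


def responders(hops: Sequence[Hop]) -> List[str]:
    """Hops that answered, in order."""
    return [h for h in hops if h is not None]


def has_loop(hops: Sequence[Hop]) -> bool:
    """True if a router reappears after a different router was seen in between."""
    path: List[str] = []
    for hop in responders(hops):
        if not path or path[-1] != hop:  # collapse consecutive repeats
            path.append(hop)
    return len(set(path)) < len(path)


def classify(hops: Sequence[Hop], baseline: Sequence[Hop],
             destination: str, max_hops: int = 30) -> str:
    """Label a trace using the slide's distinctions (assumed interpretation)."""
    if has_loop(hops):
        still_looping = len(hops) >= max_hops and destination not in hops
        return "persistent loop" if still_looping else "temporary loop"
    seen = responders(hops)
    if not seen or seen[-1] != destination:
        return "outage"  # probe stopped at an intermediate hop
    return "path change" if seen != responders(baseline) else "no anomaly"


baseline = ["r1", "r2", "c"]
print(classify(["r1", "r5", "c"], baseline, "c"))
print(classify(["r1", "r2", None, None], baseline, "c"))
```

The baseline argument corresponds to the traceroute taken when the client IP first appeared; without it, a path change cannot be told apart from a normal path.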

  28. Non-anomalies • Non-anomalies • Ultrashort anomalies • Path-based TTL • Aggressive timeout

  29. Identifying Forward Outages • Forward outages • Route change • ICMP dest unreachable • Forward timeout

  30. Loop Effect on RTT • How do loops affect RTTs? • Loops can incur high latency inflation

  31. Loop Effect on Loss Rate • How do loops affect loss rates? • 65% temporary and 55% persistent loops preceded by loss rates exceeding 30%

  32. Forward Anomaly Effect on RTT • How do forward anomalies affect RTTs? • Outages and changes can incur latency inflation • Outages have more negative effect on RTTs

  33. Forward Anomaly Effect on Loss Rate • How do forward anomalies affect loss rates? • 45% outages and 40% changes preceded by loss rates exceeding 30%

  34. Reducing Measurement Overhead • Can we reduce the number of probes? • 15 probes can achieve the same accuracy in 80% cases • Flow-based TTL

  35. Traffic Breakdown By Tiers
