PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services. Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang. Princeton University.
Motivation • Routing anomalies are common on the Internet • Maintenance • Power outage • Fiber cut • Misconfiguration • … • Anomalies can affect end-to-end performance • Packet losses • Packet delays • Loss of connectivity
Background • Anomaly detection and diagnosis are nontrivial • Asymmetric paths • Failure information propagation • Highly varied durations • Limited coverage
Contributions • New techniques for • Anomaly detection • Anomaly isolation • Anomaly classification • Large-scale study of anomalies • Broad coverage • High detection rate, low overhead • Characterization of anomalies • End-to-end effects • Benefits to host service
Outline • State of the Art • PlanetSeer Components • MonD – passive monitoring • ProbeD – active probing • Anomaly Analysis • Loop-based anomaly • Non-loop anomaly • Bypassing Anomalies • Summary
State of the Art • Routing messages • BGP: AS-level diagnosis • IS-IS, OSPF: Within single ISP • Router/link traffic statistics • SNMP, NetFlow: proprietary • End-to-end measurement • Ping, traceroute
End-to-End Probing • All-pairs probes among n nodes • O(n^2) measurement cost • Not scalable as n grows
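To make the scaling argument concrete, here is a minimal sketch that counts the directed probes needed when every node probes every other node. The function name is hypothetical; the 353-node figure is the ProbeD deployment size quoted later in the deck.

```python
def all_pairs_probe_count(n: int) -> int:
    """Directed probes needed when each of n nodes probes every other node."""
    return n * (n - 1)


if __name__ == "__main__":
    # 353 is the node count from the ProbeD Groups slide; 10 and 100 are
    # arbitrary sizes chosen only to show the quadratic growth.
    for n in (10, 100, 353):
        print(f"{n} nodes -> {all_pairs_probe_count(n)} probes per round")
```

The count grows with the square of n, which is why the deck turns to passive monitoring to decide where active probes are worth sending.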
Key Observation • Combine passive monitoring with active probing • Peer-to-Peer (P2P), Content Distribution Network (CDN) • Large client population • Geographically distributed nodes • Large traffic volume • Highly diverse paths • The traffic generated by the services reveals information about the network.
Our Approach • Host service • CDN • Components • Passive monitoring • Active probing • Advantages • Low overhead • Wide coverage [Figure: nodes A, B, and C, routers R1 and R2, and a client]
MonD: Anomaly Detection • Anomaly indicators • Time-to-live (TTL) change • Routing change • n consecutive timeouts (n = 4 in current system) • Idling period of 3 to 16 seconds • most congestion periods < 220ms
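The two indicators above can be sketched as a small per-flow state machine. This is an illustration under stated assumptions, not MonD's actual code: the class and function names are hypothetical, and resetting the timeout counter whenever a packet arrives is an assumption about what "consecutive" means.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_THRESHOLD = 4  # n = 4, as stated on the slide


@dataclass
class FlowState:
    """Hypothetical per-flow record kept by a passive monitor."""
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0


def observe_packet(state: FlowState, ttl: int) -> bool:
    """Record an incoming packet; return True if its TTL differs from the last one."""
    changed = state.last_ttl is not None and ttl != state.last_ttl
    state.last_ttl = ttl
    state.consecutive_timeouts = 0  # assumption: any arrival breaks the timeout run
    return changed


def observe_timeout(state: FlowState) -> bool:
    """Record a timeout; return True once the run reaches the threshold."""
    state.consecutive_timeouts += 1
    return state.consecutive_timeouts >= TIMEOUT_THRESHOLD
```

A `True` from either function would mark the flow's path as a possible anomaly, the point at which active probing takes over.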
ProbeD Operation • Baseline probes • When a new IP appears • From local node • Forward probes • When a possible anomaly detected • From multiple nodes (including local node) • Reprobes • At 0.5, 1.5, 3.5 and 7.5 hours later • From local node
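The reprobe offsets follow a doubling pattern: the gaps between successive reprobes are 0.5, 1, 2, and 4 hours. The sketch below reproduces the schedule. The function names are hypothetical, and measuring the offsets from the moment the anomaly is detected is an assumption, since the slide only says "later".

```python
from datetime import datetime, timedelta

REPROBE_OFFSETS_HOURS = (0.5, 1.5, 3.5, 7.5)  # values from the slide


def doubling_offsets(first_gap_hours: float = 0.5, count: int = 4) -> list:
    """Cumulative offsets when each gap is twice the previous one."""
    offsets, gap, total = [], first_gap_hours, 0.0
    for _ in range(count):
        total += gap
        offsets.append(total)
        gap *= 2
    return offsets


def reprobe_times(detected_at: datetime) -> list:
    """Absolute reprobe times, assuming offsets count from detection."""
    return [detected_at + timedelta(hours=h) for h in REPROBE_OFFSETS_HOURS]


if __name__ == "__main__":
    for t in reprobe_times(datetime(2004, 2, 1, 12, 0)):
        print(t.isoformat())
```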
ProbeD Groups • 353 nodes, 145 sites, 30 groups • According to geographic location • One traceroute per group
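One way to read "one traceroute per group" is that a single node per geographic group issues the probe. The sketch below picks one prober per group. The random choice, the function name, and the example node and group names are all assumptions made for illustration; the slide does not say how the node is selected.

```python
import random
from collections import defaultdict
from typing import Dict, Optional


def pick_probers(node_to_group: Dict[str, str],
                 rng: Optional[random.Random] = None) -> Dict[str, str]:
    """Return one node per group (chosen at random, which is an assumption)."""
    rng = rng or random.Random()
    groups = defaultdict(list)
    for node, group in node_to_group.items():
        groups[group].append(node)
    # Sort members so that a seeded rng gives a reproducible choice.
    return {group: rng.choice(sorted(nodes)) for group, nodes in groups.items()}


if __name__ == "__main__":
    # Hypothetical node and group names.
    nodes = {"node1": "us-east", "node2": "us-east", "node3": "europe", "node4": "asia"}
    print(pick_probers(nodes, random.Random(0)))
```

With 30 groups, an anomaly then triggers 30 traceroutes rather than 353.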
Estimating Scope • Which routers might be affected? • Routers that may have changed their next hops • Traceroutes from multiple locations can narrow the scope [Figure: local ProbeD, remote ProbeD, and client, connected through routers ra, rb, rc, rd]
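The narrowing idea can be sketched as follows. This is a simplified illustration, not the procedure from the paper: a router on the old path is a candidate if the new local traceroute does not show it forwarding to the same next hop, and it is cleared if any remote traceroute shows that next hop unchanged. All function names are hypothetical, as are routers rx, ry, and rz.

```python
def next_hops(path):
    """Map each router on a path to the hop that follows it."""
    return {a: b for a, b in zip(path, path[1:])}


def candidate_routers(old_path, new_path):
    """Routers on old_path not observed using the same next hop on new_path."""
    old, new = next_hops(old_path), next_hops(new_path)
    return {router for router, hop in old.items() if new.get(router) != hop}


def narrow_scope(old_path, new_path, remote_paths):
    """Drop candidates that a remote traceroute shows with an unchanged next hop."""
    old = next_hops(old_path)
    candidates = candidate_routers(old_path, new_path)
    for path in remote_paths:
        seen = next_hops(path)
        candidates -= {r for r in candidates if seen.get(r) == old[r]}
    return candidates


if __name__ == "__main__":
    old = ["ra", "rb", "rc", "rd", "client"]
    new = ["ra", "rb", "rx", "ry", "client"]   # local traceroute after the change
    remote = [["rz", "rc", "rd", "client"]]    # a remote node still crosses rc and rd
    print(sorted(candidate_routers(old, new)))
    print(sorted(narrow_scope(old, new, remote)))
```

In this example the local view alone leaves rb, rc, and rd as candidates; the remote traceroute clears rc and rd, leaving rb.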
Path Diversity • Monitoring Period: 02/2004 – 05/2004 • Unique IPs: 887,521 • Traversed ASes: 10,090 [Figure labels: Core, Edge; 215 ASes, 22 ASes, 1392 ASes, 1420 ASes, 13872 ASes]
Confirming Anomalies • Reported anomalies • 2,259,588 • Conditions • Loops • Route change • Partial unreachability • ICMP unreachable • Very conservative confirmation [Pie chart: Anomaly 12%, Undecided 22%, Non-anomaly 66%]
Confirmed Anomaly Breakdown • Confirmed anomalies • 271,898 • 2 per minute • 100x more • Temp anomalies • Inconsistent probes [Pie chart: Path Change 44%, Other Outage 23%, Temp Anomalies 16%, Fwd Outage 9%, Persist Loop 7%, Temp Loop 1%]
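The headline numbers can be cross-checked with a few lines of arithmetic. The counts come from the slides; treating the monitoring period as 90 days is an assumption, based on the deck's "3 months".

```python
REPORTED = 2_259_588   # reported anomalies (from the slides)
CONFIRMED = 271_898    # confirmed anomalies (from the slides)
ASSUMED_DAYS = 90      # assumption: "3 months" taken as 90 days

confirmed_share = CONFIRMED / REPORTED
per_minute = CONFIRMED / (ASSUMED_DAYS * 24 * 60)

if __name__ == "__main__":
    print(f"confirmed share of reports: {confirmed_share:.1%}")
    print(f"confirmed anomalies per minute: {per_minute:.2f}")
```

Under that assumption the results agree with the roughly 12% confirmation share and the "2 per minute" rate quoted on the slides.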
Scope of Loops • How many routers or ASes are involved? • Temporary loops involve more routers than persistent loops • 97% of persistent loops and 51% of temporary loops contain 2 hops • 1% of persistent loops and 15% of temporary loops cross ASes
Distribution of Loops • Many persistent loops in tier-3, few in tier-1 • Worst 10% of tier-1 ASes – implications for largest ISPs • 20% traffic • 35% persistent loops
Duration of Persistent Loops • How long do persistent loops last? • Either resolve quickly or last for an extended period
Scope of Forward Anomalies • How many routers or ASes are affected? • 60% of outages within 1 hop • 75% of outages and 68% of changes within 4 hops • 78% of outages and 57% of changes within 2 ASes
Location of Forward Anomalies • How close are the anomalies to the edges of the network? • 44% outages at the last hop • 72% outages and 40% changes within 4 hops
Distribution of Forward Anomalies • Which ASes are affected? • Tier-1 ASes most stable • Tier-3 ASes most likely to be affected
Overlay Routing • Use an alternate path when the default path fails [Figure: source, intermediate, and destination nodes]
Bypassing Anomalies • How useful is overlay routing for bypassing failures? • Effective in 43% of 62,815 failures, lower than previous studies • 32% of bypass paths inflate RTTs by more than a factor of two
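A minimal sketch of one-hop bypass selection and the RTT inflation it can cause. The function names and RTT values are hypothetical, and estimating the bypass RTT as the sum of the two leg RTTs is an assumption made for illustration.

```python
from typing import Dict, Optional, Tuple


def best_bypass(leg_rtts_ms: Dict[str, Optional[Tuple[float, float]]]):
    """Pick the intermediate with the lowest summed leg RTT.

    leg_rtts_ms maps an intermediate node to (source->node, node->destination)
    RTTs in milliseconds, or to None when the node cannot reach both ends.
    Returns (node, estimated_rtt_ms), or None if no intermediate works.
    """
    totals = {node: sum(legs) for node, legs in leg_rtts_ms.items() if legs is not None}
    if not totals:
        return None
    node = min(totals, key=totals.get)
    return node, totals[node]


def rtt_inflation(bypass_rtt_ms: float, direct_rtt_ms: float) -> float:
    """Ratio of the bypass RTT to the RTT of the original direct path."""
    return bypass_rtt_ms / direct_rtt_ms


if __name__ == "__main__":
    legs = {"A": (40.0, 70.0), "B": None, "C": (90.0, 80.0)}  # made-up values
    node, rtt = best_bypass(legs)
    print(node, rtt, rtt_inflation(rtt, direct_rtt_ms=50.0))
```

In this made-up case the best bypass still more than doubles the RTT, the kind of inflation the slide reports for 32% of bypass paths.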
Summary • Confirmed about 272,000 anomalies in 3 months • Persistent and temporary loops • Persistent loops have narrower scope and either resolve quickly or last for a long time • Path outages and changes • Outages are closer to the edge and have narrower scope • Anomaly distribution • Skewed. Tier-1 most stable. Tier-3 most problematic. • Overlay routing • Bypasses 43% of failures, at the cost of latency inflation
More Information • In the paper • More details about anomaly characteristics • End-to-end impacts • Classification methodology • Optimizations to reduce overheads & improve confirmation rate • mzhang@cs.princeton.edu • http://www.cs.princeton.edu/nsg/infoplane
Classifying Anomalies • Temporary vs. persistent loops • Whether the probe exits the loop before the maximum hop • Path changes vs. outages • Changes: follow different paths to clients • Outages: stop at intermediate hops [Figure: ProbeD probing a client]
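The classification rules above can be sketched as a single function over traceroute hop lists. This is an illustration under stated assumptions rather than the authors' classifier: the function name is hypothetical, the default hop limit of 30 is an assumed value, and unresponsive hops are ignored.

```python
def classify(baseline, probe, client, max_hops=30):
    """Classify a probe path against the baseline path to the same client.

    baseline and probe are lists of hop identifiers ending, when the client
    is reached, with the client itself. max_hops is an assumed hop limit.
    """
    has_loop = len(set(probe)) < len(probe)
    if has_loop:
        # Persistent: still revisiting a router when the hop limit is reached.
        still_looping = len(probe) >= max_hops and probe[-1] in probe[:-1]
        return "persistent loop" if still_looping else "temporary loop"
    if probe and probe[-1] == client:
        # Reached the client: a change only if the route differs.
        return "path change" if probe != baseline else "no anomaly"
    # Stopped at an intermediate hop without reaching the client.
    return "outage"


if __name__ == "__main__":
    base = ["r1", "r2", "r3", "client"]
    print(classify(base, ["r1", "r4", "r3", "client"], "client"))
    print(classify(base, ["r1", "r2"], "client"))
    print(classify(base, ["r1", "r2", "r3", "r2", "r3", "client"], "client"))
```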
Non-anomalies • Non-anomalies • Ultrashort anomalies • Path-based TTL • Aggressive timeout
Identifying Forward Outages • Forward outages • Route change • ICMP dest unreachable • Forward timeout
Loop Effect on RTT • How do loops affect RTTs? • Loops can incur high latency inflation
Loop Effect on Loss Rate • How do loops affect loss rates? • 65% temporary and 55% persistent loops preceded by loss rates exceeding 30%
Forward Anomaly Effect on RTT • How do forward anomalies affect RTTs? • Outages and changes can incur latency inflation • Outages have more negative effect on RTTs
Forward Anomaly Effect on Loss Rate • How do forward anomalies affect loss rates? • 45% outages and 40% changes preceded by loss rates exceeding 30%
Reducing Measurement Overhead • Can we reduce the number of probes? • 15 probes can achieve the same accuracy in 80% cases • Flow-based TTL