Magellan: A Tool for Unicast Fault Isolation Cengiz Alaettinoglu Packet Design LLC Ramesh Govindan Information Sciences Institute John Mehringer Information Sciences Institute
Motivation • Why can't I reach www.cnn.com? • Why is the Internet soooo slow today? • It was fine yesterday!
Goals • User's perspective • What is of interest to the user • Internet-wide routing monitoring • not just an AS • History of route changes • not just a snapshot • Fault diagnosis • link/router failure/repair
Challenges • Scaling • Directed search by correlating destinations • Shared learning • Automated heuristics for fault isolation • Route change • Location of link/router failure/repair • Oscillations • Others?
Data Collection • Select targets interesting to the user • tcpdump/libpcap • Weighting / aging (not implemented) • Initial path to targets • traceroute • Monitoring paths • Carefully constructed ICMP probes
Monitoring • Construct a routing graph • Nodes: routers • Links: (to, from, source, destination, hop, statistics...) • Probe each link • Send two ICMP Echo Request packets to destination • One with ttl = hop - 1 and one with ttl = hop, to verify the incident routers (from, to)
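The link probe above can be sketched as follows. This is a minimal illustration, not Magellan's implementation: the `Link` fields mirror the tuple on the slide, while `send_probe` is a hypothetical callback that sends one TTL-limited probe and returns the address of the router that answered (or `None` on timeout). It also assumes `hop` is the TTL at which the downstream router answers and that `hop >= 2`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Link:
    from_router: str  # upstream router of the link
    to_router: str    # downstream router of the link
    destination: str  # target whose path crosses this link
    hop: int          # TTL at which to_router answers (assumed convention)


# Hypothetical probe callback: (destination, ttl) -> responding router or None.
Prober = Callable[[str, int], Optional[str]]


def probe_link(link: Link, send_probe: Prober) -> bool:
    """Return True if both routers incident to the link answer as expected."""
    expected = {link.hop - 1: link.from_router, link.hop: link.to_router}
    return all(
        send_probe(link.destination, ttl) == router
        for ttl, router in expected.items()
    )


def make_fake_prober(path):
    """Build a prober that answers from a fixed path (path[0] is TTL 1)."""
    def prober(destination: str, ttl: int) -> Optional[str]:
        return path[ttl - 1] if 1 <= ttl <= len(path) else None
    return prober


link = Link("r1", "r2", "www.example.com", hop=2)
still_up = probe_link(link, make_fake_prober(["r1", "r2", "r3"]))
rerouted = probe_link(link, make_fake_prober(["r1", "r9", "r3"]))
```

Injecting the prober keeps the check testable without raw sockets; a real prober would need privileges to send ICMP packets with a chosen TTL.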
Scheduling Probes • Weighted round-robin (WRR) schedules a probe for each link • Limits the rate of probe packets • Weights: some links are more important/interesting • Distance to link • Number of destinations using it • History of volatility • Exponentially averaged
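One way to picture the scheduling is the sketch below. It uses the "smooth" weighted round-robin variant, which is a swapped-in, well-known technique and not necessarily the one Magellan uses. The slides do not say how the three weight factors are combined, so the weights here are simply given; `ewma` and its `alpha` value are illustrative assumptions for the exponentially averaged volatility history.

```python
from itertools import islice
from typing import Dict, Iterator


def ewma(previous: float, sample: float, alpha: float = 0.25) -> float:
    """Exponentially weighted moving average; alpha is an assumed smoothing factor."""
    return alpha * sample + (1 - alpha) * previous


def smooth_wrr(weights: Dict[str, float]) -> Iterator[str]:
    """Yield link names forever, each in proportion to its weight."""
    current = {link: 0.0 for link in weights}
    total = sum(weights.values())
    while True:
        # Every link earns credit equal to its weight on each round.
        for link, weight in weights.items():
            current[link] += weight
        # The link with the most credit is probed next and pays the total back.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        yield chosen


# Hypothetical weights: link "a" is three times as interesting as link "b".
schedule = list(islice(smooth_wrr({"a": 3.0, "b": 1.0}), 8))
```

Because exactly one probe is emitted per iteration, the caller can cap the probe rate by pacing how often it pulls from the generator.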
Test Result • Positive • Do nothing • Negative • Determine new path • Incremental traceroute from the link upstream and downstream • Determine cause • Based on automatic heuristics
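The incremental traceroute on a negative result might look like the sketch below. The slides give no details, so this is one plausible reading: walk upstream from the failed link until a probe agrees with the old path, then walk downstream until the new path rejoins the old one. `send_probe` is a hypothetical callback returning the router that answers at a given TTL, and the code assumes the old path is unchanged after the rejoin point.

```python
from typing import Callable, List, Optional

Prober = Callable[[str, int], Optional[str]]  # hypothetical: (destination, ttl) -> router


def incremental_traceroute(old_path: List[str], hop: int, destination: str,
                           send_probe: Prober, max_ttl: int = 30) -> List[str]:
    """Re-trace only the part of the path around the failed link at TTL `hop`."""
    ttl = min(hop, len(old_path))
    # Upstream: back up until the router at this TTL matches the old path.
    while ttl >= 1 and send_probe(destination, ttl) != old_path[ttl - 1]:
        ttl -= 1
    new_path = list(old_path[:ttl])
    # Downstream: record new routers until one is already on the old path.
    for t in range(ttl + 1, max_ttl + 1):
        router = send_probe(destination, t)
        if router is None:  # timeout: stop rather than guess
            break
        new_path.append(router)
        if router in old_path[ttl:]:
            # Assumption: past the rejoin point the old path still holds.
            new_path.extend(old_path[old_path.index(router) + 1:])
            break
    return new_path


actual = ["a", "x", "y", "d"]  # the path after the route change


def fake(destination: str, ttl: int) -> Optional[str]:
    return actual[ttl - 1] if 1 <= ttl <= len(actual) else None


new_path = incremental_traceroute(["a", "b", "c", "d"], 2, "www.example.com", fake)
```

Compared with a full traceroute, only the hops between the divergence and rejoin points are probed, which matters when many destinations share the affected link.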
Active Fault Isolation • Link failure • Probe the link using other destinations that use it • Correlate results • Router failure • Generalize on link failure • Oscillations • History of old routes • Back and forth between a set of routes
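The three heuristics above could be sketched as below. The slides do not give the exact rules, so the thresholds and function names are illustrative assumptions: a link is treated as failed when probes through it fail for every destination that uses it, a router as failed when all of its incident links have failed, and a path as oscillating when it returns to a previously seen route at least `min_returns` times.

```python
from typing import Dict, Sequence, Tuple

LinkId = Tuple[str, str]  # (from_router, to_router)


def link_failed(results: Dict[str, bool]) -> bool:
    """results maps destination -> probe succeeded; fail only if all probes failed."""
    return bool(results) and not any(results.values())


def router_failed(router: str, link_results: Dict[LinkId, Dict[str, bool]]) -> bool:
    """Generalize on link failure: every link touching the router has failed."""
    incident = [r for (a, b), r in link_results.items() if router in (a, b)]
    return bool(incident) and all(link_failed(r) for r in incident)


def is_oscillating(history: Sequence[Tuple[str, ...]], min_returns: int = 2) -> bool:
    """Count route changes that land on a route already seen in the history."""
    seen = set()
    returns = 0
    previous = None
    for route in history:
        if route != previous:
            if route in seen:
                returns += 1
            seen.add(route)
        previous = route
    return returns >= min_returns


probes = {
    ("r1", "r2"): {"cnn": False, "yahoo": False},
    ("r2", "r3"): {"cnn": False},
    ("r4", "r5"): {"cnn": True, "aol": False},
}
route_a, route_b = ("r1", "r2", "r3"), ("r1", "r4", "r3")
flapping = is_oscillating([route_a, route_b, route_a, route_b])
```

A mixed result such as link `("r4", "r5")`, where only some destinations fail, is left unclassified here; it may indicate a route change for one destination rather than a link failure.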
Magellan Components • [Diagram: Magellan, Perl Script, Nam] • Nam • Visualization • Offline or real-time • Great for debugging/tuning
Snapshot • Link or router failure • I want the nam buttons, etc...
Effectiveness through Measurement • Picked 500 popular web sites • Yahoo, MSN, AOL, CNN, ... • www.web100.com • Monitored routes to these destinations for 7 days
Measurements • Number of Link Probes: 839694 • Probes per second: 1.39 • Total Failures: 2078 • Router Failures: 334 • Link Failures: 951 • Unknown cause: 793 • Transients • Number of Oscillations: 541
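The figures above are internally consistent, as a quick check shows. The sketch assumes the monitoring period was exactly 7 days, as stated on the previous slide; the variable names are illustrative.

```python
# Probe rate over the monitoring period (assumed to be exactly 7 days).
probes = 839_694
seconds = 7 * 24 * 60 * 60          # 604800 seconds in 7 days
rate = probes / seconds             # about 1.39 probes per second

# The three failure categories should add up to the reported total.
failures = {"router": 334, "link": 951, "unknown": 793}
total = sum(failures.values())      # 2078

# Share of each category, in percent of all failures.
shares = {cause: round(100 * count / total, 1) for cause, count in failures.items()}
print(f"{rate:.2f} probes/s, {total} failures, shares: {shares}")
```

Link failures make up a little under half of all failures, and more than a third of failures could not be attributed to either a link or a router.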
Future work: Distributed Magellan • Weight to probe inversely proportional to ratio of distances • Shared learning • [Diagram: Magellan 1 and Magellan 2]
Related Work • Topology Maps • Router/AS level interconnections • Mercator, Skitter, AT&T • Not all links are usable (routing policy/metrics) • Routing Topology • Effect of policy/metrics • NPD (Vern Paxson's work) • Focus is on measurement
Conclusions • Unicast fault isolation • User's perspective • Automated heuristics • History of changes • http://www.isi.edu/scan