Internet Routing (COS 598A) Today: Root-Cause Analysis Jennifer Rexford http://www.cs.princeton.edu/~jrex/teaching/spring2005 Tuesdays/Thursdays 11:00am-12:20pm
Outline • Network troubleshooting • Motivation for network troubleshooting • Investigating from the edge vs. inside • Active probing • Traceroute • Mapping IP addresses to AS numbers • Passive monitoring • Analyzing BGP update streams • Identifying location and cause of routing change • Limitations of the approach
Network Troubleshooting • “Why can’t I reach www.cnn.com?” • “Why is the performance bad?” [Figure: the Internet and www.cnn.com]
Reachability Problems: What Could be Wrong? • End-host problem • Web server down • DNS server down, or misconfigured • Forwarding-path problem • Packet filter or firewall restricting access • Mismatch in Maximum Transmission Unit (MTU) • Routing problem • User or server disconnected from Internet • Blackhole dropping all packets • Persistent loop
Performance Problem: What Could be Wrong? • End-host problems • Overloaded Web server • Overloaded DNS server • Overloaded user machine • Forwarding-path problem • High round-trip time • Link congestion • Routing problem • Long-term routing instability • Transient disruption during convergence
Motivation for Troubleshooting • Improving performance • Detect, diagnose, and fix the problem • Pick a path through another provider • Pick a different path in an overlay network • Establishing accountability • Enforce Service Level Agreements • Rate service providers • Characterizing the Internet • Understand causes of performance problems • Understand challenges of troubleshooting
Troubleshooting Outside vs. Inside • Outside: from network edge (today's focus) • Who: users and researchers, and operators troubleshooting problems outside their network • Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms • Challenges: inference from very limited data • Inside: from inside the network • Who: operators running a network • Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files • Challenges: collecting and joining the data
Pros and Cons of Active Probing • Advantages • Can run from any end system • Measure the actual forwarding path • See black-holes, loops, and delays directly • Disadvantages • Effects of routing changes, not the cause • Current path, not the path used in the past • Requires frequent probes to observe the changes • Shows only properties of round-trip path • Hard to tell if problem is on forward vs. reverse
Traceroute: Measuring the Forwarding Path • Time-To-Live field in IP packet header • Source sends a packet with a TTL of n • Each router along the path decrements the TTL • ICMP “time exceeded” message sent back to the source when TTL reaches 0 • Traceroute tool exploits this TTL behavior • Send packets with TTL=1, 2, 3, … and record source of each “time exceeded” message [Figure: source sends probes with TTL=1, TTL=2, … toward the destination; routers along the path reply “Time exceeded”]
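The TTL mechanism above can be sketched as a small simulation. This is a minimal illustration, not the real tool: it sends no packets, and the function names `simulate_probe` and `simulate_traceroute` are hypothetical. The router addresses are the first three hops of the example output on the next slide.

```python
from typing import List, Optional


def simulate_probe(path: List[str], ttl: int) -> Optional[str]:
    """Return the router that would send "time exceeded" for this TTL.

    `path` lists the routers between source and destination, in order.
    Each router decrements the TTL; the router that drops it to 0 replies.
    Returns None when the probe survives every router (destination reached).
    """
    for router in path:
        ttl -= 1
        if ttl == 0:
            return router
    return None


def simulate_traceroute(path: List[str], max_ttl: int = 30) -> List[str]:
    """Send probes with TTL=1, 2, 3, ... and record who answers each one."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder = simulate_probe(path, ttl)
        if responder is None:  # probe got past the last router
            break
        hops.append(responder)
    return hops


routers = ["169.229.62.1", "169.229.59.225", "128.32.255.169"]
discovered = simulate_traceroute(routers)
```

Here `discovered` lists the routers in path order, one per TTL value, which is exactly the information a traceroute run collects.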
Example Traceroute Output (Berkeley to CNN)
Hop  IP address        DNS name
1    169.229.62.1      inr-daedalus-0.CS.Berkeley.EDU
2    169.229.59.225    soda-cr-1-1-soda-br-6-2
3    128.32.255.169    vlan242.inr-202-doecev.Berkeley.EDU
4    128.32.0.249      gigE6-0-0.inr-666-doecev.Berkeley.EDU
5    128.32.0.66       qsv-juniper--ucb-gw.calren2.net
6    209.247.159.109   POS1-0.hsipaccess1.SanJose1.Level3.net
7    *                 ? (no response from router)
8    64.159.1.46       ? (no name resolution)
9    209.247.9.170     pos8-0.hsa2.Atlanta2.Level3.net
10   66.185.138.33     pop2-atm-P0-2.atdn.net
11   *                 ? (no response from router)
12   66.185.136.17     pop1-atl-P4-0.atdn.net
13   64.236.16.52      www4.cnn.com
Example Troubleshooting Results • No packets go beyond your gateway • Gateway’s connection to Internet is dead • Traceroute stops at intermediate point • Perhaps a blackhole • Traceroute path has a loop • Transient or persistent forwarding loop • Traceroute shows a very long path • Routing anomaly, route hijacking, etc. • Traceroute shows very long delays • Delay or congestion on forward or reverse path
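The rules of thumb above can be written down as a rough classifier. This is only a sketch under stated assumptions: the function name `diagnose`, the returned labels, and the `long_path` threshold of 30 hops are all made up for illustration, and delay-based diagnosis is left out.

```python
from typing import List, Optional


def diagnose(hops: List[Optional[str]], destination: str,
             long_path: int = 30) -> str:
    """Classify a traceroute result.

    `hops` holds the responding address per TTL, or None for "*".
    `long_path` is an arbitrary threshold chosen for illustration.
    """
    answered = [h for h in hops if h is not None]
    if len(answered) != len(set(answered)):
        return "loop"              # same router answered twice
    if answered and answered[-1] == destination:
        return "long path" if len(hops) >= long_path else "reached"
    if len(answered) <= 1:
        return "gateway"           # nothing responds beyond the first hop
    return "stops midway"          # perhaps a blackhole
```

For example, `diagnose(["gw", "a", "b", "a", "b"], "d")` reports a loop, while `diagnose(["gw", "a", None, None], "d")` reports a path that stops midway. A real tool would also have to cope with the missing and misleading responses discussed on the next slide.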
Problems with Traceroute • Missing responses • Routers might not send “Time-Exceeded” • Firewalls may drop the probe packets • “Time-Exceeded” reply may be dropped • Misleading responses • Probes taken while the path is changing • Name not in DNS, or DNS entry misconfigured • Mapping IP addresses • Mapping interfaces to a common router • Mapping interface/router to Autonomous System
Map Traceroute Hops to ASes
Hop  IP address        AS
1    169.229.62.1      AS25 (Berkeley)
2    169.229.59.225    AS25 (Berkeley)
3    128.32.255.169    AS25 (Berkeley)
4    128.32.0.249      AS25 (Berkeley)
5    128.32.0.66       AS11423 (Calren)
6    209.247.159.109   AS3356 (Level3)
7    *                 AS3356 (Level3)
8    64.159.1.46       AS3356 (Level3)
9    209.247.9.170     AS3356 (Level3)
10   66.185.138.33     AS1668 (AOL)
11   *                 AS1668 (AOL)
12   66.185.136.17     AS1668 (AOL)
13   64.236.16.52      AS5662 (CNN)
Need accurate IP-to-AS mappings (for network equipment).
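Once each hop has an AS number, the hop-level list collapses into an AS-level path by dropping repeats. A minimal sketch, assuming unresponsive hops are recorded as `None` and simply skipped (the helper name `as_level_path` is hypothetical):

```python
from typing import List, Optional


def as_level_path(hop_ases: List[Optional[int]]) -> List[int]:
    """Collapse per-hop AS numbers into an AS-level path.

    None marks a hop that did not respond ("*"); it is skipped, so two
    runs of the same AS separated only by "*" merge into one entry.
    """
    path: List[int] = []
    for asn in hop_ases:
        if asn is None:
            continue
        if not path or path[-1] != asn:
            path.append(asn)
    return path


# Per-hop ASes for the Berkeley-to-CNN example (hops 7 and 11 did not respond).
hop_ases = [25, 25, 25, 25, 11423, 3356, None, 3356, 3356,
            1668, None, 1668, 5662]
cnn_path = as_level_path(hop_ases)
```

For the example, `cnn_path` is the five-AS sequence Berkeley, Calren, Level3, AOL, CNN.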
Candidate Ways to Get IP-to-AS Mapping • Routing address registry • Voluntary public registry such as whois.radb.net • Used by prtraceroute and “NANOG traceroute” • Incomplete and quite out-of-date • Mergers, acquisitions, delegation to customers • Origin AS in BGP paths • Public BGP routing tables such as RouteViews • Used to translate traceroute data to an AS graph • Incomplete and inaccurate… but usually right • Multiple Origin ASes, no mapping, wrong mapping
Example: BGP Table (“show ip bgp” at RouteViews)
    Network          Next Hop         Metric LocPrf Weight Path
*   3.0.0.0/8        205.215.45.50                       0 4006 701 80 i
*                    167.142.3.6                         0 5056 701 80 i
*                    157.22.9.7                          0 715 1 701 80 i
*                    195.219.96.239                      0 8297 6453 701 80 i
*                    195.211.29.254                      0 5409 6667 6427 3356 701 80 i
*>                   12.127.0.249                        0 7018 701 80 i
*                    213.200.87.254      929             0 3257 701 80 i
*   9.184.112.0/20   205.215.45.50                       0 4006 6461 3786 i
*                    195.66.225.254                      0 5459 6461 3786 i
*>                   203.62.248.4                        0 1221 3786 i
*                    167.142.3.6                         0 5056 6461 6461 3786 i
*                    195.219.96.239                      0 8297 6461 3786 i
*                    195.211.29.254                      0 5409 6461 3786 i
AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T, AS 3786 is DACOM (Korea), AS 1221 is Telstra
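The origin AS is the last AS number in each path, and an address maps to the origin AS of the longest matching prefix. The sketch below illustrates that idea with Python's standard `ipaddress` module. The helper names (`origin_as`, `build_map`, `lookup`) are hypothetical, and the two routes are the best paths (`*>`) from the table above.

```python
import ipaddress
from typing import Dict, Optional


def origin_as(as_path: str) -> int:
    """Return the last AS number in a path such as "7018 701 80 i"."""
    tokens = [t for t in as_path.split() if t.isdigit()]
    return int(tokens[-1])


def build_map(routes: Dict[str, str]) -> dict:
    """Map each announced prefix to the origin AS of its path."""
    return {ipaddress.ip_network(prefix): origin_as(path)
            for prefix, path in routes.items()}


def lookup(ip_to_as: dict, address: str) -> Optional[int]:
    """Longest-prefix match; None when no announced prefix covers the address."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in ip_to_as if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return ip_to_as[best]


routes = {"3.0.0.0/8": "7018 701 80 i", "9.184.112.0/20": "1221 3786 i"}
table = build_map(routes)
```

With this table, `lookup(table, "3.1.2.3")` gives 80 and `lookup(table, "9.184.113.1")` gives 3786, while an address outside both prefixes gives `None`, the "no mapping" case. A linear scan is fine for a sketch; a full routing table would call for a trie.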
Why Would IP-to-AS Mapping Be Wrong? • IP addresses of equipment • Interfaces on the routers, not end hosts • Identifies equipment in routing protocols • Doesn’t need to be globally visible or consistent • Three reasons the mappings may be “wrong” • Addresses of Internet Exchange Points • Sibling ASes that share address space • ASes that don’t announce their addresses • Look at traceroute path vs. BGP AS path • Traceroute path after IP-to-AS mapping • BGP AS path taken from the BGP table
Extra AS due to Internet eXchange Points • IXP: shared place where providers meet • E.g., Mae-East, Mae-West, PAIX • Large number of fan-in and fan-out ASes • Ignore extra traceroute AS hop with high fan-in and fan-out [Figure: traceroute AS path vs. BGP AS path among ASes A through G]
Extra AS due to Sibling ASes • Sibling: organizations with multiple ASes • E.g., Sprint AS 1239 and AS 1791 • One AS numbers its equipment with addresses of another • Merge sibling ASes that “belong together,” treating them as one AS [Figure: traceroute AS path vs. BGP AS path among ASes A through H]
Unannounced Infrastructure Addresses • C does not announce part of its address space in BGP (e.g., 12.1.2.0/24) • Fix the IP-to-AS map to associate 12.1.2.0/24 with C [Figure: ASes A, B, and C with prefix 12.0.0.0/8; traceroute AS paths vs. BGP AS paths]
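The fix amounts to adding a more-specific entry to the IP-to-AS map so that longest-prefix match picks C. In the sketch below, attributing the covering 12.0.0.0/8 to B is an assumption made only for illustration, and the `lookup` helper and the address 12.1.2.7 are hypothetical.

```python
import ipaddress


def lookup(ip_to_as, address):
    """Longest-prefix match over a {network: AS} dictionary."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in ip_to_as if addr in net]
    if not matches:
        return None
    return ip_to_as[max(matches, key=lambda net: net.prefixlen)]


# Assumption for illustration: the covering /8 is attributed to AS "B".
ip_to_as = {ipaddress.ip_network("12.0.0.0/8"): "B"}
before = lookup(ip_to_as, "12.1.2.7")

# The fix: associate the unannounced /24 with AS "C".
ip_to_as[ipaddress.ip_network("12.1.2.0/24")] = "C"
after = lookup(ip_to_as, "12.1.2.7")
```

Under that assumption the router address maps to B before the fix and to C afterwards, while other addresses in the /8 keep their original mapping.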
Refining Initial IP-to-AS Mapping • Start with initial IP-to-AS mapping • Mapping from BGP tables is usually correct • Good starting point for computing the mapping • Collect many BGP and traceroute paths • Signaling and forwarding AS path usually match • Good way to identify mistakes in IP-to-AS map • Successively refine the IP-to-AS mapping • Find add/change/delete that makes big difference • Base these “edits” on operational realities http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf
Research Areas • Better version of traceroute • Router support for active measurement • IPPM (IP Performance Measurement) • http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html • Peer-to-peer troubleshooting [Figure: www.cnn.com with “Yes” and “No” responses]
Limitations of Active Measurements • Active measurements: traceroute-like tools • Can’t probe in the past • Shows the effect, not the cause [Figure: topology with User (s), Web Server (d), and AS 1 through AS 4]
Appealing to Peek Inside • Passive measurements: public BGP data [Figure: BGP update feeds, data collection (RouteViews, RIPE), data correlation, root cause]
Inspect BGP Routing Changes • Changes in paths to reach destination d • AS 1: “1 3 4” → “1 2 4” • AS 2: “2 4” (no change) • AS 3: “3 4” → “3 1 2 4” • AS 4: “4” (no change) [Figure: topology with User (s), Web Server (d), and AS 1 through AS 4]
Idea #1: ASes in Paths Undergoing Change • Key assumption • “The AS responsible for the change appears in the old and/or the new AS path to the destination.” • If an AS has a routing change • All ASes in old and new paths may be responsible • Call these ASes the “suspect set” • Combining across vantage points • Consider all ASes that had a routing change • Perform the intersection across the suspect sets
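Idea #1 reduces to a union per vantage point followed by an intersection across vantage points. A minimal sketch, with hypothetical function names and illustrative paths observed at AS 1 and AS 2:

```python
from typing import Dict, List, Set, Tuple


def suspect_set(old: List[int], new: List[int]) -> Set[int]:
    """Every AS on the old or the new path may be responsible."""
    return set(old) | set(new)


def intersect_suspects(
        changes: Dict[int, Tuple[List[int], List[int]]]) -> Set[int]:
    """Intersect the suspect sets of all vantage points that saw a change."""
    sets = [suspect_set(old, new) for old, new in changes.values()]
    return set.intersection(*sets) if sets else set()


# Vantage point -> (old AS path, new AS path); values are illustrative.
changes = {
    1: ([1, 2, 4], [1, 2, 3, 4]),
    2: ([2, 4], [2, 3, 4]),
}
suspects = intersect_suspects(changes)
```

Here AS 1 contributes {1, 2, 3, 4} and AS 2 contributes {2, 3, 4}, so `suspects` is {2, 3, 4}. The intersection can only shrink as more vantage points report a change.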
Idea #2: Excluding ASes in Non-Changing Paths • Key assumption • “If an AS has no routing change, the ASes in the path are not responsible and can be excluded.” • Example • AS 1: “1 2 4” → “1 2 3 4”: suspects {1, 2, 3, 4} • AS 2: “2 4” → “2 3 4”: suspects {2, 3, 4} • AS 3: “3 4” (no change): non-suspects {3, 4} [Figure: topology of AS 1 through AS 4]
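The example works out as an intersection followed by a set difference. A minimal sketch using the slide's paths (the function name `remaining_suspects` is hypothetical):

```python
from typing import List, Set, Tuple


def remaining_suspects(changed: List[Tuple[List[int], List[int]]],
                       unchanged: List[List[int]]) -> Set[int]:
    """Intersect suspect sets of changed paths, then drop every AS
    that appears on a path that did not change."""
    if not changed:
        return set()
    suspects = set.intersection(
        *[set(old) | set(new) for old, new in changed])
    for path in unchanged:
        suspects -= set(path)
    return suspects


changed = [([1, 2, 4], [1, 2, 3, 4]),   # seen at AS 1
           ([2, 4], [2, 3, 4])]         # seen at AS 2
unchanged = [[3, 4]]                    # seen at AS 3
result = remaining_suspects(changed, unchanged)
```

The intersection of {1, 2, 3, 4} and {2, 3, 4} is {2, 3, 4}; removing the non-suspects {3, 4} leaves only AS 2.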
Idea #3: Blaming the ASes in the Better Path • Key assumption • “The better path is the one that contains the AS responsible for the change.” • Example • “1 2 4” → “1 2 3 4”: better path to worse path, with ASes {1,2,4} as the suspects (not AS 3) • Heuristics for identifying the “better” path • E.g., the shorter AS path [Figure: topology of AS 1 through AS 4]
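With "shorter AS path" as the heuristic, Idea #3 can be sketched as below. The function name is hypothetical, and falling back to the union of both paths on a tie is a choice made for this sketch, not something the slide specifies.

```python
from typing import List, Set


def better_path_suspects(old: List[int], new: List[int]) -> Set[int]:
    """Blame only the ASes on the better (here: shorter) path.

    When both paths have the same length the heuristic cannot decide,
    so this sketch falls back to the union of both paths.
    """
    if len(old) < len(new):
        return set(old)
    if len(new) < len(old):
        return set(new)
    return set(old) | set(new)


suspects = better_path_suspects([1, 2, 4], [1, 2, 3, 4])
```

For the slide's example the old path is shorter, so `suspects` is {1, 2, 4} and AS 3 is not blamed.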
Idea #4: Combining Across Destinations • Key assumption • “All destinations experiencing routing changes in a short period of time have a common cause.” • Exploiting the observation • Form suspect sets for each destination • Perform intersections of the sets across the destinations
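One way to read "a short period of time" is to group changes into bursts separated by quiet gaps and intersect the suspect sets within each burst. The sketch below does that. The 60-second `window`, the function name, and the destinations `d1` through `d3` with their suspect sets are all hypothetical.

```python
from typing import List, Set, Tuple

# (timestamp in seconds, destination, suspect set for that destination)
Event = Tuple[float, str, Set[int]]


def common_suspects(events: List[Event],
                    window: float = 60.0) -> List[Set[int]]:
    """Split events into bursts (gap between consecutive events larger
    than `window` starts a new burst) and intersect suspect sets per burst."""
    results: List[Set[int]] = []
    current = None
    last_time = None
    for ts, _dest, suspects in sorted(events, key=lambda e: e[0]):
        if last_time is None or ts - last_time > window:
            if current is not None:
                results.append(current)
            current = set(suspects)      # start a new burst
        else:
            current &= suspects          # same burst: intersect
        last_time = ts
    if current is not None:
        results.append(current)
    return results


events = [(0.0, "d1", {2, 3, 4}), (5.0, "d2", {2, 5, 6}),
          (500.0, "d3", {7, 8})]
bursts = common_suspects(events)
```

The first two events fall in one burst and share only AS 2; the third arrives much later and forms its own burst. The choice of window matters: too long and unrelated events get merged, which can empty the intersection.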
Difficulties With Root-Cause Analysis • Misleading BGP routing changes • Responsible AS not on old or new path • Looking across destinations doesn’t resolve • Missing routing changes • Some routers in an AS don’t have a change • Some subnets are not visible in BGP • Some internal changes are not visible in BGP
Misleading BGP Changes • Myth: The AS responsible for the change appears in the old or the new AS path. • Old path: 1, 2, 8, 9, 10 • New path: 1, 4, 5, 6, 7, 10 [Figure: topology of nodes 1 through 11 with a BGP data collection point]
Misleading BGP Changes • Myth: Looking at routing changes across prefixes resolves causes • Changes for d2, but not for d1 and d3 [Figure: AS 1, AS 2, and AS 3, routers A, B, and C, and destinations d1, d2, and d3, with a BGP data collection point]
Missing Routing Changes • Myth: The BGP updates from a single router accurately represent the AS [Figure: routers A, B, C, and D in AS 1 and AS 2 on paths to dst; the BGP data collection point sees no change]
Missing Routing Changes • Myth: BGP data from a router accurately represents changes on that router. [Figure: router A with prefixes 12.1.1.0/24 and 12.1.0.0/16 and a BGP data collection point]
Missing Routing Changes • Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope. [Figure: routers A, B, C, and D in AS 1 and AS 2 on paths to dst, with a BGP data collection point]
Hybrid of Active and Passive Monitoring [Figure: Omni 1 through Omni 4 alongside AS 1 through AS 4, between User (s) and Web Server (d); queries (i,s,d,t) and (j,s,d,t’), each annotated “failure link (3,4)”]
Research Questions • Understanding if root-cause analysis can work • How many vantage points are needed? • Do the assumptions usually hold? • Can algorithms tolerate occasional violations? • Can some additional information help? • Distributed algorithms for root-cause analysis • Can ASes cooperate in distributed fashion? • How to prevent or detect ASes that cheat? • Do all ASes have to participate? • Other hybrids of active and passive monitoring?
Conclusions • Troubleshooting is important • Detect, diagnose, and fix problems • Accountability and service-level agreements • Troubleshooting is hard • Active measurement (e.g., traceroute) not enough • Root-cause analysis techniques are not enough • Innovation is necessary • Hybrid active/passive approaches • Router support for active measurement • Routing protocol extensions for troubleshooting
For Next Time: From Inside an AS • Two papers • “OSPF monitoring: Architecture, design, and deployment experience” • “Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network” • Optional reading • Materials from Packet Design and Ipsum Networks • Review of the first paper only • Summary • Why accept • Why reject • Future work