Proactive Techniques for Correct and Predictable Internet Routing Nick Feamster
Internet Routing • Large-scale: Thousands of autonomous networks • Self-interest: Independent economic and performance objectives • But, must cooperate for global connectivity (Figure: The Internet, with example networks Abilene, Comcast, MIT, AT&T, Cogent)
Internet Routing Protocol: BGP (Figure: Autonomous Systems (ASes) connected by BGP sessions, with arrows labeled Route Advertisement and Traffic)
  Destination   Next-hop        AS Path
  18.0.0.0/8    192.5.89.89     10578..3
  18.0.0.0/8    66.250.252.44   174… 3
Configuration Defines BGP Behavior. Flexibility for realizing goals in complex business landscape: • Which neighboring networks can send traffic • Where traffic enters and leaves the network • How routers within the network learn routes to external destinations (Figure labels: Traffic, Route, No Route; Flexibility, Complexity)
Problem: Configuring routers is like writing a distributed program. Operators make mistakes, often with catastrophic results.
Catastrophic Configuration Faults
“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.” -- news.com, April 25, 1997
“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.” -- wired.com, January 25, 2001
“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue."” -- cnn.com, October 3, 2002
“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).” -- dslreports.com, February 23, 2004
Operator Mailing List Note: Only includes problems openly discussed on this list. Feamster et al., “An Empirical Study of ‘Bogon’ Route Advertisements”, SIGCOMM CCR 2005
Why Correctness is Hard • Operators make mistakes: configuration is difficult; complex policies, distributed configuration • Interactions cause unintended consequences: each network independently configured; unintended policy interactions
Goal Correctness and predictability of the global routing system, examining only local configurations
Today: Reactive Operation • Problems cause downtime • Problems often not immediately apparent (Figure: reactive loop. Configure, then observe. Desired effect? If no, revert and reconfigure; if yes, wait for the next problem. Operator's question: “What happens if I tweak this policy…?”)
Thesis: Proactive Operation • Idea: Analyze configuration before deployment • Many faults can be detected with static analysis. (Figure: Configure, then apply proactive techniques (rcc) to detect faults and predict traffic flow, then deploy.)
Contributions • Correctness specification and constraints • rcc (“router configuration checker”) • Static configuration analysis tool for fault detection • Used by operators of large backbone networks • Analysis of real-world network configurations from 17 autonomous systems (ASes) • Route prediction using static configuration analysis • Sufficient and necessary conditions for safe routing • http://nms.csail.mit.edu/rcc/ • About 100 downloads (70 network operators)
Take-home lessons • Configuration can be factored into a few operations • Static configuration analysis uncovers errors • Major causes of error: • Distributed configuration • Intra-AS dissemination is too complex • Mechanistic expression of policy • Guaranteeing safety while preserving autonomy requires tight restrictions on expressiveness
Outline • Correctness specification and constraints • Path visibility • Route validity • Safety • Proactive fault detection with rcc • Path visibility and route validity • Implementation and findings • Safety v. policy expressiveness • Local guarantees for safety • Implications
Paths, Routes, and Policy
Path: $(v_1, v_2, \ldots, v_n)$ toward destination $d$
Route: $(d \to v_i)$, for example $(d \to v_2)$, $(d \to v_3)$
Policy: $P(v_{i-1}, v_i, v_{i+1}, d) \to \{0, 1\}$
Routes induce paths.
Consistency: All induced paths along a path to the destination are subpaths of the original path.
Policy-conformance: All nodes along an induced path have $P = 1$.
Factoring Routing Configuration. Hundreds of thousands of lines of configuration in hundreds of routers, factored into three operations: • Ranking: route selection • Dissemination: internal route advertisement • Filtering: route advertisement (Figure labels: Customer, Competitor, Primary, Backup)
Path Visibility: If there exists a path, then there exists a route. If there is at least one policy-conformant path to the destination, then routers should select routes that induce one of them. Path visibility faults: • Dissemination: Partition in graph that disseminates routes (next 2 slides) • Filtering: Filtering routes for usable paths
Path Visibility: Internal BGP (iBGP) • Default: “Full mesh” iBGP. Doesn’t scale. • Large ASes use “Route reflection” • Route reflector: advertises non-client routes over client sessions; client routes over all sessions • Client: doesn’t re-advertise iBGP routes.
iBGP Signaling: Static Check Theorem. Suppose the iBGP reflector-client relationship graph contains no cycles. Then, path visibility is satisfied if, and only if, the set of routers that are not route reflector clients forms a clique. Condition is easy to check with static analysis.
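The condition in the theorem can be sketched as a small static check. This is a minimal illustration, not rcc's implementation: the function names, the `client_to_reflectors` mapping, and the representation of sessions as unordered pairs are all assumptions made for the example.

```python
from itertools import combinations

def has_cycle(client_to_reflectors):
    """Depth-first search for a cycle in the reflector-client relationship graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for reflector in client_to_reflectors.get(node, ()):
            state = color.get(reflector, WHITE)
            if state == GRAY or (state == WHITE and visit(reflector)):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n)
               for n in list(client_to_reflectors))

def missing_top_level_sessions(routers, sessions, client_to_reflectors):
    """Return pairs of non-client routers that lack an iBGP session.

    routers: iterable of router names (hypothetical labels).
    sessions: set of frozenset({a, b}) iBGP sessions.
    client_to_reflectors: {client: [its route reflectors]}.
    An empty result means the non-client routers form a clique.
    """
    if has_cycle(client_to_reflectors):
        raise ValueError("reflector-client graph has a cycle; theorem does not apply")
    top = sorted(r for r in routers if not client_to_reflectors.get(r))
    return [(a, b) for a, b in combinations(top, 2)
            if frozenset((a, b)) not in sessions]
```

For example, with routers A and B as reflectors for C and D respectively, the check passes only if A and B share a session; removing that session makes the function report the pair (A, B).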
Route Validity: If there exists a route, then there exists a path. Routers should select routes that induce only consistent, policy-conformant paths. Route validity faults: • Filtering (next slide): Advertising routes that violate higher-level policy; originating routes for private (or unowned) address space • Dissemination: Loops and “deflections” along internal routing path. Must form beliefs about high-level policy.
Route Validity: Consistent Export • Settlement-free peering rules: advertise routes at all peering points; advertised routes must have equal “AS path length” • That is, “equally good” routes on all BGP sessions • Some ASes routinely violate this constraint. [IMC 2004] (Figure: peering between Sprint and AT&T)
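One of the filtering faults named on this slide, originating routes for private address space, is easy to illustrate. The sketch below checks a list of originated IPv4 prefixes against the RFC 1918 private blocks using Python's standard `ipaddress` module. The function name and input format are assumptions for the example, and the check for "unowned" space is left out because it needs allocation data the slides do not describe.

```python
import ipaddress

# RFC 1918 private IPv4 blocks.
RFC1918 = [ipaddress.ip_network(p)
           for p in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def private_originations(originated_prefixes):
    """Return the originated IPv4 prefixes that fall inside RFC 1918 space."""
    faults = []
    for text in originated_prefixes:
        net = ipaddress.ip_network(text)
        if net.version == 4 and any(net.subnet_of(block) for block in RFC1918):
            faults.append(text)
    return faults
```

Called on a list such as 18.0.0.0/8, 10.1.0.0/16, and 192.168.5.0/24, it flags the last two.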
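The two peering rules can be sketched as a check over the routes a peer advertises at each peering point. This is an illustrative sketch only: it inspects advertised routes rather than analyzing configuration as rcc does, and the data layout (`{session: {prefix: as_path}}`) and function name are assumptions.

```python
def inconsistent_exports(adverts):
    """Find prefixes a peer does not advertise 'equally well' everywhere.

    adverts: {peering_point: {prefix: as_path_tuple}} (hypothetical layout).
    A prefix is flagged if it is missing at some peering point or if its
    AS path length differs between peering points.
    """
    all_prefixes = set().union(*(routes.keys() for routes in adverts.values()))
    faults = {}
    for prefix in sorted(all_prefixes):
        missing = sorted(s for s, routes in adverts.items() if prefix not in routes)
        lengths = {len(routes[prefix])
                   for routes in adverts.values() if prefix in routes}
        if missing or len(lengths) > 1:
            faults[prefix] = {"missing_at": missing,
                              "path_lengths": sorted(lengths)}
    return faults
```

The AS numbers and prefixes used to exercise such a check should come from documentation ranges (for example 64500 and 192.0.2.0/24) rather than real networks.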
Safety The protocol does not oscillate The protocol computes a stable path assignment for every initial state and message ordering. Depends on the interactions of rankings and filters of multiple ASes Challenge: Guarantee safety with only “local” information (Preserve the autonomy of each AS.)
Outline • Correctness specification and constraints • Path visibility • Route validity • Safety • Proactive fault detection with rcc • Path visibility and route validity • Implementation and findings • Safety v. policy expressiveness • Local guarantees for safety • Implications
rcc Overview. Challenges: • Analyzing complex, distributed configuration • Defining a correctness specification • Mapping specification to constraints (Figure: distributed router configurations from a single AS are converted to a normalized representation; “rcc” checks it against constraints derived from the correctness specification and reports faults.)
rcc Implementation (http://nms.csail.mit.edu/rcc/). Distributed router configurations (Cisco, Avici, Juniper, Procket, etc.) → Preprocessor → Parser → Relational Database (MySQL) → Verifier (applies constraints) → Faults
Path Visibility Faults in Practice. Analysis of configuration from 17 ASes: 420 sessions (8 ASes); 133 routers (7 ASes); 11 partitions (6 ASes)
Route Validity Faults in Practice. Analysis of configuration from 17 ASes: 233 sessions (9 ASes); 196 sessions (6 ASes); 117 sessions (7 ASes); 45 sessions (7 ASes); 6 sessions (1 AS)
Causes of Error • Every AS had faults, regardless of network size • Most faults can be attributed to distributed configuration (Figure panels: Route Validity, Path Visibility)
Feedback From Network Operators
“That’s wicked!” -- Nicolas Strina, ip-man.net
“Thanks again for a great tool.” -- Paul Piecuch, IT Manager
“...good to finally see more coverage of routing as distributed programming. From my experience, the principles of software engineering eliminate a vast majority of errors.” -- Joe Provo, rcn.com
“I find your approach useful, it is really not fun (but critical for the health of the network) to keep track of the inconsistencies among different routers…a configuration verifier like yours can give the operator a degree of confidence that the sky won't fall on his head real soon now.” -- Arnaud Le Tallanter, clara.net
rcc: Take-home lessons • Static configuration analysis uncovers many errors • Major causes of error: • Distributed configuration • Intra-AS dissemination is too complex • Mechanistic expression of policy • http://nms.csail.mit.edu/rcc/ • About 100 downloads (70 network operators)
Outline • Correctness specification and constraints • Path visibility • Route validity • Safety • Proactive fault detection with rcc • Path visibility and route validity • Implementation and findings • Safety v. policy expressiveness • Local guarantees for safety • Implications
Safety The protocol does not oscillate If the protocol computes a stable path assignment for every initial state and message ordering, then safety is satisfied. Depends on the interactions of rankings and filters of multiple ASes Challenge: Guarantee safety with only “local” information (Preserve the autonomy of each AS.)
Safety: No Persistent Oscillation. Depends on the interactions of rankings and filters of multiple ASes. Dispute wheel: global, cyclic relationship among rankings. (Figure: nodes 1, 2, 3 and destination 0, each node listing its ranked paths, most preferred first. Node 1: 1 3 0, then 1 0. Node 2: 2 1 0, then 2 0. Node 3: 3 2 0, then 3 0.) Varadhan, Govindan, & Estrin, “Persistent Route Oscillations in Interdomain Routing”, 1996 Griffin, Shepherd, & Wilfong, “The Stable Paths Problem and Interdomain Routing”, ToN, 2002
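To see why such rankings cannot settle, one can search exhaustively for a stable path assignment. The sketch below assumes the figure labels on this slide are ranked path lists (node 1 prefers 1 3 0 over 1 0, node 2 prefers 2 1 0 over 2 0, node 3 prefers 3 2 0 over 3 0, with 0 as the destination); that reading, and all function names, are assumptions for illustration. An assignment is treated as stable when every node holds its most preferred path among those consistent with its next hop's current choice.

```python
from itertools import product

# Ranked permitted paths per node, most preferred first; 0 is the destination.
RANKINGS = {
    1: [(1, 3, 0), (1, 0)],
    2: [(2, 1, 0), (2, 0)],
    3: [(3, 2, 0), (3, 0)],
}

def is_available(path, assignment):
    """A path is usable if it is direct or extends the next hop's current path."""
    next_hop = path[1]
    return next_hop == 0 or assignment.get(next_hop) == path[1:]

def best_available(node, rankings, assignment):
    """Most preferred usable path for node, or None if no path is usable."""
    for path in rankings[node]:
        if is_available(path, assignment):
            return path
    return None

def stable_assignments(rankings):
    """Enumerate every assignment in which no node wants to change its path."""
    nodes = sorted(rankings)
    options = [list(rankings[n]) + [None] for n in nodes]
    stable = []
    for choice in product(*options):
        assignment = dict(zip(nodes, choice))
        if all(best_available(n, rankings, assignment) == assignment[n]
               for n in nodes):
            stable.append(assignment)
    return stable
```

Under these assumptions the search finds no stable assignment for `RANKINGS`, so the protocol has nothing to converge to. If node 3 instead prefers its direct path, the same search finds exactly one stable assignment.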
First Necessary Condition for Safety. Dispute ring: a dispute wheel in which each node appears only once. We show: the presence of a dispute ring implies the system is not safe under filtering. Problem: “No dispute ring” is still a global condition. (Figure: relationships among the properties Safe, Safe under Filtering, No Dispute Wheel, and No Dispute Ring)
Goal: Local Constraints for Safety. Given no restrictions on filtering or topology, what are the local restrictions on rankings to guarantee global safety under filtering?
Autonomy: ARC Function. Rankings (from a single AS) → ARC Function → Accept/Reject
ARC Function Properties • Permutation Invariance: Node labels don’t matter • Scale Invariance: Adding new nodes does not force a node to change its rankings over old paths. (Figure: Node 1’s rankings (1 2 0, then 1 0) and Node 2’s rankings (2 3 0, then 2 0) are each passed to the ARC function; both are accepted.)
Examples of ARC Functions • Accept only next-hop rankings: captures most routing policies. Problem: system may not be safe (see Section 6.4 for proof) • Accept only shortest hop count rankings: guarantees safety under filtering. Problem: not expressive (Figure: nodes 1, 2, 3 with next-hop rankings. Node 1: 3*, 2*, 0*. Node 2: 1*, 3*, 0*. Node 3: 2*, 1*, 0*.)
What ARC Functions Violate Safety? Theorem. Permitting paths of length n+2 over paths of length n will violate safety under filtering. Theorem. Permitting paths of length n+1 over paths of length n will result in a dispute wheel. Proof Idea: Use the ARC function to construct a dispute ring (respectively, wheel). See Section 6.6.
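The two example ARC functions can be sketched as predicates over a single node's ranking. This is one possible reading, not the thesis's formal definition: a ranking is assumed to be a list of paths, most preferred first, where each path is a tuple starting at the node itself and ending at destination 0. A "next-hop ranking" is interpreted as one where all paths through the same next hop sit together in the list, and a "shortest hop count ranking" as one where paths never get shorter further down the list.

```python
def accepts_next_hop(ranking):
    """Accept iff preference depends only on the next hop.

    Paths sharing a next hop must be contiguous in the ranking.
    """
    seen = []
    for path in ranking:
        hop = path[1]
        if seen and seen[-1] == hop:
            continue
        if hop in seen:
            return False  # next hop reappears after another one intervened
        seen.append(hop)
    return True

def accepts_shortest_hop_count(ranking):
    """Accept iff no longer path is ranked above a shorter one."""
    lengths = [len(path) for path in ranking]
    return lengths == sorted(lengths)
```

Both functions look only at one node's ranking, which is what keeps the check local. Neither depends on the specific node labels, in the spirit of permutation invariance.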
Outline • Correctness specification and constraints • Path visibility • Route validity • Safety • Proactive fault detection with rcc • Path visibility and route validity • Implementation and findings • Safety v. policy expressiveness • Local guarantees for safety • Implications
Static Analysis in the Workflow • Many faults can be detected with static analysis. • Challenge: Adoption (Figure: Configure, then apply proactive techniques (rcc) to detect faults and predict traffic flow, then deploy.)
Preventing Errors in the First Place • Before: conventional iBGP • After: RCP gets “best” iBGP routes (and IGP topology) (Figure labels: RCP, iBGP, eBGP) Feamster et al., “The Case for Separating Routing from Routers”, SIGCOMM FDNA, 2004 Caesar et al., “Design and Implementation of a Routing Control Platform”, NSDI, 2005
Safety: Possible Steps Forward • Add constraints on filtering • Relax autonomy of rankings • Restrict expressiveness: Shortest paths routing with autonomy for setting edge weights • Routing protocol converges on a fast timescale • Policy disputes (“tussle”) resolved on a slower timescale
Summary of Contributions • Correctness specification and constraints • rcc (“router configuration checker”) • Static configuration analysis tool for fault detection • Used by operators of large backbone networks • Analysis of real-world network configurations from 17 autonomous systems (ASes) • Route prediction using static configuration analysis • Sufficient and necessary conditions for safe routing
Known Constraints are Too Restrictive • Only three types of business relationships: Customer: filter none, rank highest; Peer: filter other peers and providers, rank second; Provider: filter other peers and providers, rank last. Problems: • Requires acyclic hierarchy (global condition) • Too restrictive to express important business relationships (Figure: Sprint, Abovenet, Verio, and PSINet, with a Customer relationship marked) Gao & Rexford, “Stable Internet Routing without Global Coordination”, IEEE/ACM ToN, 2001
Correctness Specification • Path Visibility: Every destination with a usable path has a route advertisement. If there exists a path, then there exists a route. Example violation: Network partition • Route Validity: Every route advertisement corresponds to a usable path. If there exists a route, then there exists a path. Example violation: Routing loop • Safety: The protocol does not oscillate. The protocol converges to a stable path assignment for every possible initial state and message ordering. Example violation: Oscillation
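The three relationship rules on this slide can be sketched as an export filter and a ranking function. The names and data layout below are hypothetical, and breaking ties by AS path length is an added assumption that the slide does not state.

```python
# Lower value means more preferred, following the slide's ordering.
RANK = {"customer": 0, "peer": 1, "provider": 2}

def may_export(learned_from, export_to):
    """Export rule: customer routes go to everyone; everyone's routes go to customers.

    Routes learned from peers or providers are filtered toward other
    peers and providers.
    """
    return learned_from == "customer" or export_to == "customer"

def rank_routes(routes):
    """Order routes by neighbor relationship, then (assumed) by AS path length.

    routes: list of (relationship, as_path_tuple) pairs.
    """
    return sorted(routes, key=lambda route: (RANK[route[0]], len(route[1])))
```

With these rules a longer customer route still outranks a shorter peer route, which illustrates why the scheme cannot express relationships outside the three fixed types.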
(Un)Related Work • Integrity & Consistency of Route Advertisements • S-BGP [Kent 2000], soBGP [White 2003], SPV [Hu 2004], Listen/Whisper [Subramanian 2004] • Model Checking/Formal Methods • Network protocols [Hajek 1978, Bhargavan 2002] • Large programs [Musuvathi 2003] • Traffic Engineering • Intradomain [Fortz 2002] • Interdomain [Feamster 2003, Mahajan 2005] • Convergence Speed • Path exploration [Labovitz 1999], Route Flap Damping [Mao 2002]
Sprint AT&T Route Validity: Consistent Export • Settlement-free peering rules: • Advertise routes at all peering points • Advertised routes must have equal “AS path length” “equally good” routes
Inconsistent Export Observed at AT&T. 15% of destinations inconsistent for >4 days. (Figure: plot with axes “Percentage of destinations with inconsistent routes” and “Percentage of time”) Feamster et al., “BorderGuard: Detecting Cold Potatoes from Peers”, ACM IMC, October 2004.