230 likes | 318 Views
Infrastructure-based Resilient Routing. Ben Y. Zhao, Ling Huang, Jeremy Stribling, Anthony Joseph and John Kubiatowicz University of California, Berkeley ICSI Lunch Seminar, January 2004. Motivation. Network connectivity is not reliable
E N D
Infrastructure-basedResilient Routing Ben Y. Zhao, Ling Huang, Jeremy Stribling, Anthony Joseph and John Kubiatowicz University of California, Berkeley ICSI Lunch Seminar, January 2004
Motivation • Network connectivity is not reliable • Disconnections frequent in the Internet (UMichTR98,IMC02) • 50% of backbone links have MTBF < 10 days • 20% of faults last longer than 10mins • IP-level repair relatively slow • Wide-area: BGP 3 mins • Local-area: IS-IS 5 seconds • Next generation wide-area network applications • Streaming media, VoIP, B2B transactions • Low tolerance of delay, jitter and faults ravenben@eecs.berkeley.edu
The Challenge • Routing failures are diverse • Many causes • Misconfigurations, cut fiber, planned downtime, software bugs • Occur anywhere with local or global impact: • Single fiber cut can disconnect AS pairs • One event can lead to complex protocol interactions • Isolating failures is difficult • End user symptoms often dynamic or intermittent • WAN measurement research is ongoing (Rocketfuel, etc) • Observations: • Fault detection from multiple distributed vantage points • In-network decision making necessary for timely responses ravenben@eecs.berkeley.edu
Talk Overview • Motivation • A structured overlay approach • Mechanisms and policy • Evaluation • Some questions ravenben@eecs.berkeley.edu
An Infrastructure Approach • Our goals • Resilient overlay to route around failures • Respond in milliseconds (not seconds) • Our approach (data & control plane) • Nodes are observation points (similar to Plato’s NEWS service) • Nodes are also points of traffic redirection(forwarding path determination and data forwarding) • No edge node involvement • Fast response time, security focused on infrastructure • Fully transparent, no application awareness necessary ravenben@eecs.berkeley.edu
Why Structured Overlays • Resilient Overlay Networks (MIT) • Fully connected mesh • Each node has full knowledge of network • Fast, independent calculation of routes • Nodes can construct any path, maximum flexibility • Cost of flexibility • Protocol needs to choose the “right” route/nodes • Per node O(n) state • Monitors n - 1 paths • O(n2) total path monitoring is expensive D S ravenben@eecs.berkeley.edu
O V E R L A Y v v v v v v v v v v v v v The Big Picture • Locate nearby overlay proxy • Establish overlay path to destination host • Overlay traffic routes traffic resiliently Internet ravenben@eecs.berkeley.edu
register register get (hash(B)) P’(B) put (hash(B), P’(B)) put (hash(A), P’(A)) Traffic Tunneling Legacy Node B Legacy Node A B P’(B) A, B are IP addresses Proxy Proxy Structured Peer to Peer Overlay • Store mapping from end host IP to its proxy’s overlay ID • Similar to approach in Internet Indirection Infrastructure (I3) ravenben@eecs.berkeley.edu
Pros and Cons • Leverage small neighbor sets • Less neighbor paths to monitor: O(n) O(log(n)) • Reduction in probing bandwidth • Faster fault detection • Actively maintain static route redundancy • Manageable for “small” # of paths • Redirect traffic immediately when a failure is detectedEliminate on-the-fly calculation of new routes • Restore redundancy in background after failure • Fast fault detection + precomputed paths = more responsiveness • Cons: overlay imposes routing stretch (mostly < 2) ravenben@eecs.berkeley.edu
In-network Resiliency Details • Active periodic probes for fault-detection • Exponentially weighted moving average link quality estimation • Avoid route flapping due to short term loss artifacts • Loss rate Ln = (1 - ) Ln-1 + p • Simple approach taken, much ongoing research • Smart fault-detection / propagation (Zhuang04) • Intelligent and cooperative path selection (Seshardri04) • Maintaining backup paths • Create and store backup routes at node insertion • Query neighbors after failures to restore redundancy • Ask any neighbor at or above routing level of faulty nodee.g. ABCD sees ABDE failed, can ask any AB?? node for info • Simple policies to choose among redundant paths ravenben@eecs.berkeley.edu
First Reachable Link Selection (FRLS) • Use link quality estimation to choose shortest “usable” path • Use shortest path withminimal quality > T • Correlated failures • Reduce with intelligent topology construction • Goal: leverage redundancy available ravenben@eecs.berkeley.edu
Evaluation • Metrics for evaluation • How much routing resiliency can we exploit? • How fast can we adapt to faults (responsiveness)? • Experimental platforms • Event-based simulations on transit stub topologies • Data collected over multiple 5000-node topologies • PlanetLab measurements • Microbenchmarks on responsiveness ravenben@eecs.berkeley.edu
Exploiting Route Redundancy (Sim) • Simulation of Tapestry, 2 backup paths per routing entry • Transit-stub topology shown, results from TIER and AS graphs similar ravenben@eecs.berkeley.edu
660 300 Responsiveness to Faults (PlanetLab) • = 0.2 • = 0.4 • Two reasonable values for filter constant • Response time scales linearly to probe period ravenben@eecs.berkeley.edu
Link Probing Bandwidth (Planetlab) • Bandwidth increases logarithmically with overlay size • Medium sized routing overlays incur low probing bandwidth ravenben@eecs.berkeley.edu
Conclusion • Trading flexibility for scalability and responsiveness • Structured routing has low path maintenance costs • Allows “caching” of backup paths for quick failover • Can no longer construct arbitrary paths • But simple policy exploits available redundancy well • Fast enough for most interactive applications • 300ms beacon period response time < 700ms • ~300 nodes, b/w cost = 7KB/s ravenben@eecs.berkeley.edu
Ongoing Questions • Is this the right approach? • Is there a lower bound on desired responsiveness? • Is this responsive enough for VoIP? • If not, is multipath routing the solution? • What about deployment issues? • How does inter-domain deployment happen? • A third-party approach? (Akamai for routing) ravenben@eecs.berkeley.edu
Related Work • Redirection overlays • Detour (IEEE Micro 99) • Resilient Overlay Networks (SOSP 01) • Internet Indirection Infrastructure (SIGCOMM 02) • Secure Overlay Services (SIGCOMM 02) • Topology estimation techniques • Adaptive probing (IPTPS 03) • Internet tomography (IMC 03) • Routing underlay (SIGCOMM 03) • Many, many other structured peer-to-peer overlays Thanks to Dennis Geels / Sean Rhea for their work on BMark ravenben@eecs.berkeley.edu
Backup Slides ravenben@eecs.berkeley.edu
Another Perspective on Reachability Portion of all pair-wise paths where no failure-free paths remain A path exists, but neither IP nor FRLS can locate the path Portion of all paths where IP and FRLS both route successfully FRLS finds path, where short-term IP routing fails ravenben@eecs.berkeley.edu
Constrained Multicast • Used only when all paths are below quality threshold • Send duplicate messages on multiple paths • Leverage route convergence • Assign unique messageIDs • Mark duplicates • Keep moving window of IDs • Recognize and drop duplicates • Limitations • Assumes loss not from congestion • Ideal for local area routing 2225 2299 2274 2286 2046 2281 2530 ? ? ? 1111 ravenben@eecs.berkeley.edu
Latency Overhead of Misrouting ravenben@eecs.berkeley.edu
Bandwidth Cost of Constrained Multicast ravenben@eecs.berkeley.edu