
Infrastructure-based Resilient Routing


Presentation Transcript


  1. Infrastructure-based Resilient Routing Ben Y. Zhao, Ling Huang, Jeremy Stribling, Anthony Joseph and John Kubiatowicz University of California, Berkeley ICSI Lunch Seminar, January 2004

  2. Motivation • Network connectivity is not reliable • Disconnections frequent in the Internet (UMichTR98, IMC02) • 50% of backbone links have MTBF < 10 days • 20% of faults last longer than 10 minutes • IP-level repair is relatively slow • Wide-area: BGP ~3 minutes • Local-area: IS-IS ~5 seconds • Next-generation wide-area network applications • Streaming media, VoIP, B2B transactions • Low tolerance of delay, jitter and faults ravenben@eecs.berkeley.edu

  3. The Challenge • Routing failures are diverse • Many causes • Misconfigurations, cut fiber, planned downtime, software bugs • Occur anywhere with local or global impact: • Single fiber cut can disconnect AS pairs • One event can lead to complex protocol interactions • Isolating failures is difficult • End user symptoms often dynamic or intermittent • WAN measurement research is ongoing (Rocketfuel, etc) • Observations: • Fault detection from multiple distributed vantage points • In-network decision making necessary for timely responses ravenben@eecs.berkeley.edu

  4. Talk Overview • Motivation • A structured overlay approach • Mechanisms and policy • Evaluation • Some questions ravenben@eecs.berkeley.edu

  5. An Infrastructure Approach • Our goals • Resilient overlay to route around failures • Respond in milliseconds (not seconds) • Our approach (data & control plane) • Nodes are observation points (similar to Plato’s NEWS service) • Nodes are also points of traffic redirection (forwarding path determination and data forwarding) • No edge node involvement • Fast response time, security focused on infrastructure • Fully transparent, no application awareness necessary ravenben@eecs.berkeley.edu

  6. Why Structured Overlays • Resilient Overlay Networks (MIT) • Fully connected mesh • Each node has full knowledge of network • Fast, independent calculation of routes • Nodes can construct any path, maximum flexibility • Cost of flexibility • Protocol needs to choose the “right” route/nodes • Per-node O(n) state • Monitors n − 1 paths • O(n²) total path monitoring is expensive ravenben@eecs.berkeley.edu
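The state and probing trade-off on this slide can be made concrete with a back-of-the-envelope calculation. The sketch below is my own illustration (not from the talk); the base-16 digit size is an assumption that mirrors Tapestry-style prefix routing, and it compares the total number of monitored paths in a full mesh versus a structured overlay.

```python
# Rough comparison of path-monitoring cost (illustrative only):
# a full-mesh overlay probes every other node, while a structured
# prefix-routing overlay probes only O(log n) neighbors per node.
import math

def full_mesh_paths(n: int) -> int:
    """Each of n nodes monitors n - 1 paths: O(n^2) total."""
    return n * (n - 1)

def structured_paths(n: int, base: int = 16) -> int:
    """Tapestry-style routing table: about log_base(n) levels of
    (base - 1) neighbors per node, i.e. O(n log n) total."""
    levels = math.ceil(math.log(n, base))
    return n * levels * (base - 1)

for n in (100, 1000, 10000):
    print(f"n={n:>6}: full mesh {full_mesh_paths(n):>12,}  "
          f"structured {structured_paths(n):>10,}")
```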

  7. The Big Picture [figure: end hosts attach to nearby overlay proxies; traffic crosses the overlay nodes over the Internet] • Locate nearby overlay proxy • Establish overlay path to destination host • Overlay routes traffic resiliently ravenben@eecs.berkeley.edu

  8. Traffic Tunneling [figure: legacy nodes A and B register with their proxies; the proxies put (hash(A), P’(A)) and (hash(B), P’(B)) into the structured peer-to-peer overlay, and a lookup get(hash(B)) returns P’(B); A, B are IP addresses] • Store mapping from end host IP to its proxy’s overlay ID • Similar to approach in Internet Indirection Infrastructure (I3) ravenben@eecs.berkeley.edu
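The registration and lookup steps in the figure reduce to a put/get interface keyed by the hash of the legacy host's IP address. A minimal sketch, assuming a local dictionary as a stand-in for the overlay's distributed storage; the class and identifiers below are hypothetical, not the Tapestry or I3 API.

```python
# Sketch of proxy registration and resolution over a DHT-like interface.
import hashlib
from typing import Optional

class OverlayDirectory:
    """Stand-in for the structured overlay's put/get storage."""
    def __init__(self):
        self._store = {}

    def put(self, key: bytes, value: str) -> None:
        self._store[key] = value

    def get(self, key: bytes) -> Optional[str]:
        return self._store.get(key)

def host_key(ip: str) -> bytes:
    """Key the mapping by a hash of the end host's IP address."""
    return hashlib.sha1(ip.encode()).digest()

overlay = OverlayDirectory()

# Legacy node B registers with its proxy; the proxy publishes P'(B).
overlay.put(host_key("10.0.0.2"), "proxy-overlay-id-0xB7")

# A's proxy resolves B's proxy before tunneling A -> B traffic.
print(overlay.get(host_key("10.0.0.2")))   # -> proxy-overlay-id-0xB7
```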

  9. Pros and Cons • Leverage small neighbor sets • Fewer neighbor paths to monitor: O(n) → O(log n) • Reduction in probing bandwidth • Faster fault detection • Actively maintain static route redundancy • Manageable for “small” # of paths • Redirect traffic immediately when a failure is detected • Eliminate on-the-fly calculation of new routes • Restore redundancy in background after failure • Fast fault detection + precomputed paths = more responsiveness • Cons: overlay imposes routing stretch (mostly < 2) ravenben@eecs.berkeley.edu

  10. In-network Resiliency Details • Active periodic probes for fault detection • Exponentially weighted moving average (EWMA) link quality estimation • Avoid route flapping due to short-term loss artifacts • Loss rate Lₙ = (1 − α) · Lₙ₋₁ + α · p • Simple approach taken, much ongoing research • Smart fault detection / propagation (Zhuang04) • Intelligent and cooperative path selection (Seshardri04) • Maintaining backup paths • Create and store backup routes at node insertion • Query neighbors after failures to restore redundancy • Ask any neighbor at or above routing level of faulty node, e.g. ABCD sees ABDE failed, can ask any AB?? node for info • Simple policies to choose among redundant paths ravenben@eecs.berkeley.edu
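As a concrete reading of the loss-rate formula, the sketch below applies the EWMA update once per probe, where p is 1 for a lost probe and 0 otherwise. The class name and probe sequence are illustrative; α = 0.2 matches one of the filter constants used in the PlanetLab evaluation.

```python
# Minimal sketch of the EWMA link-quality estimator:
#   L_n = (1 - alpha) * L_{n-1} + alpha * p
class LinkQuality:
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha   # filter constant (slides use 0.2 / 0.4)
        self.loss = 0.0      # smoothed loss rate L_n

    def update(self, probe_lost: bool) -> float:
        p = 1.0 if probe_lost else 0.0
        self.loss = (1.0 - self.alpha) * self.loss + self.alpha * p
        return self.loss

link = LinkQuality(alpha=0.2)
for lost in [False, False, True, True, False, True]:
    print(f"smoothed loss = {link.update(lost):.3f}")
```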

  11. First Reachable Link Selection (FRLS) • Use link quality estimation to choose shortest “usable” path • Use shortest path with minimal quality > T • Correlated failures • Reduce with intelligent topology construction • Goal: leverage redundancy available ravenben@eecs.berkeley.edu
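One way to express FRLS: among the precomputed primary and backup paths, pick the shortest one whose estimated quality clears the threshold T. A minimal sketch, assuming a simple Path record with hop count and quality; the type, threshold value and sample numbers are illustrative, not taken from the Tapestry implementation.

```python
# Sketch of First Reachable Link Selection over precomputed paths.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Path:
    hops: int        # overlay path length
    quality: float   # 1.0 - smoothed loss rate

def frls(paths: list[Path], threshold: float = 0.9) -> Optional[Path]:
    """Return the shortest path whose quality exceeds the threshold."""
    usable = [p for p in paths if p.quality > threshold]
    if not usable:
        return None  # no usable path; fall back (e.g. constrained multicast)
    return min(usable, key=lambda p: p.hops)

routes = [Path(hops=3, quality=0.55),   # primary, currently lossy
          Path(hops=4, quality=0.97),   # first backup
          Path(hops=5, quality=0.99)]   # second backup
print(frls(routes))  # -> Path(hops=4, quality=0.97)
```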

  12. Evaluation • Metrics for evaluation • How much routing resiliency can we exploit? • How fast can we adapt to faults (responsiveness)? • Experimental platforms • Event-based simulations on transit stub topologies • Data collected over multiple 5000-node topologies • PlanetLab measurements • Microbenchmarks on responsiveness ravenben@eecs.berkeley.edu

  13. Exploiting Route Redundancy (Sim) • Simulation of Tapestry, 2 backup paths per routing entry • Transit-stub topology shown, results from TIER and AS graphs similar ravenben@eecs.berkeley.edu

  14. Responsiveness to Faults (PlanetLab) [figure: response time vs. probe period for the two settings] • Two reasonable values for filter constant α: α = 0.2 and α = 0.4 • Response time scales linearly with probe period ravenben@eecs.berkeley.edu

  15. Link Probing Bandwidth (PlanetLab) • Bandwidth increases logarithmically with overlay size • Medium-sized routing overlays incur low probing bandwidth ravenben@eecs.berkeley.edu

  16. Conclusion • Trading flexibility for scalability and responsiveness • Structured routing has low path maintenance costs • Allows “caching” of backup paths for quick failover • Can no longer construct arbitrary paths • But simple policy exploits available redundancy well • Fast enough for most interactive applications • 300 ms beacon period → response time < 700 ms • ~300 nodes, bandwidth cost = 7 KB/s ravenben@eecs.berkeley.edu

  17. Ongoing Questions • Is this the right approach? • Is there a lower bound on desired responsiveness? • Is this responsive enough for VoIP? • If not, is multipath routing the solution? • What about deployment issues? • How does inter-domain deployment happen? • A third-party approach? (Akamai for routing) ravenben@eecs.berkeley.edu

  18. Related Work • Redirection overlays • Detour (IEEE Micro 99) • Resilient Overlay Networks (SOSP 01) • Internet Indirection Infrastructure (SIGCOMM 02) • Secure Overlay Services (SIGCOMM 02) • Topology estimation techniques • Adaptive probing (IPTPS 03) • Internet tomography (IMC 03) • Routing underlay (SIGCOMM 03) • Many, many other structured peer-to-peer overlays Thanks to Dennis Geels / Sean Rhea for their work on BMark ravenben@eecs.berkeley.edu

  19. Backup Slides ravenben@eecs.berkeley.edu

  20. Another Perspective on Reachability [figure legend] • Portion of all pair-wise paths where no failure-free paths remain • A path exists, but neither IP nor FRLS can locate the path • Portion of all paths where IP and FRLS both route successfully • FRLS finds a path where short-term IP routing fails ravenben@eecs.berkeley.edu

  21. Constrained Multicast • Used only when all paths are below quality threshold • Send duplicate messages on multiple paths • Leverage route convergence • Assign unique message IDs • Mark duplicates • Keep moving window of IDs • Recognize and drop duplicates • Limitations • Assumes loss not from congestion • Ideal for local-area routing ravenben@eecs.berkeley.edu
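The duplicate-recognition step can be sketched as a bounded, ordered set of recently seen message IDs: copies of a message sent over several paths converge, and anything already in the window is dropped rather than forwarded again. This is an illustrative sketch under that assumption, not the actual Tapestry implementation; the window size and IDs are made up.

```python
# Sketch of duplicate suppression for constrained multicast.
from collections import OrderedDict

class DuplicateFilter:
    def __init__(self, window: int = 1024):
        self.window = window
        self.seen = OrderedDict()   # message ID -> None, in arrival order

    def accept(self, msg_id: int) -> bool:
        """Return True if the message is new and should be forwarded."""
        if msg_id in self.seen:
            return False            # duplicate arriving via another path
        self.seen[msg_id] = None
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)   # evict the oldest ID
        return True

f = DuplicateFilter(window=4)
print([f.accept(i) for i in (2225, 2299, 2225, 2274)])
# -> [True, True, False, True]
```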

  22. Latency Overhead of Misrouting ravenben@eecs.berkeley.edu

  23. Bandwidth Cost of Constrained Multicast ravenben@eecs.berkeley.edu
