RON is a fault-tolerant network enabling efficient routing around failures, improving availability by 10x through path monitoring and rerouting. Experimental data shows significant resilience and rapid recovery from outages.
RON: Resilient Overlay Networks David Andersen, Hari Balakrishnan, Frans Kaashoek, Robert Morris MIT Laboratory for Computer Science http://nms.lcs.mit.edu/ron/
Fault-tolerant Networking [Figure: a network connecting nodes A, B, C, and D] Any-to-any communication, routing around failures
The Internet [Figure: Autonomous Systems (ASes) linked by BGP4; labels: mom-and-pop ISP, big ISP, really-big ISP everyone’s afraid of, peering, transit] • Scalability via aggressive aggregation and information hiding • Commercial reality via peering & transit relationships
How Robust is Internet Routing? • Slow outage detection and recovery • Inability to detect badly performing paths • Inability to efficiently leverage redundant paths • Inability to perform application-specific routing • Inability to express sophisticated routing policy
Our Goal • To improve communication availability for small groups by at least a factor of 10 • Many applications • Collaboration and conferencing • Virtual Private Networks (VPNs) across public Internet • Overlay Internet Service
RON: Routing Using Overlays • Cooperating end-systems in different routing domains can conspire to do better than scalable wide-area protocols • Reliability via path monitoring and re-routing [Figure: overlay nodes above the scalable BGP-based IP routing substrate] • Types of failures • Outages: Configuration/operational errors, backhoes, etc. • Performance failures: Severe congestion, denial-of-service attacks, etc.
RON Design [Figure: nodes in different routing domains (ASes), each running the RON library: Conduit, Forwarder, Router, Prober] • Performance Database • Application-specific routing tables • Policy routing module • Link-state routing protocol, disseminates info using RON!
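The prober and performance database on this slide can be illustrated with a minimal sketch. It is not the RON library's API: the class name, the method names, and the rule of declaring an outage after four consecutive lost probes are all assumptions chosen for illustration.

```python
class PerformanceDatabase:
    """Hypothetical per-node store of probe results (not the RON library's API)."""

    def __init__(self, outage_threshold=4):
        # Assumed rule: a peer counts as down after this many consecutive lost probes.
        self.outage_threshold = outage_threshold
        self.sent = {}
        self.lost = {}
        self.consecutive_losses = {}

    def record_probe(self, peer, received):
        """Store the outcome of one probe sent to `peer`."""
        self.sent[peer] = self.sent.get(peer, 0) + 1
        if received:
            self.consecutive_losses[peer] = 0
        else:
            self.lost[peer] = self.lost.get(peer, 0) + 1
            self.consecutive_losses[peer] = self.consecutive_losses.get(peer, 0) + 1

    def is_down(self, peer):
        """True once the consecutive-loss count reaches the threshold."""
        return self.consecutive_losses.get(peer, 0) >= self.outage_threshold

    def loss_rate(self, peer):
        """Fraction of probes to `peer` that were lost (0.0 if none were sent)."""
        sent = self.sent.get(peer, 0)
        return self.lost.get(peer, 0) / sent if sent else 0.0


db = PerformanceDatabase()
for received in [True, False, False, False, False]:
    db.record_probe("B", received)
down_after_losses = db.is_down("B")   # four losses in a row reach the threshold
observed_loss = db.loss_rate("B")     # 4 lost out of 5 sent
db.record_probe("B", True)
down_after_recovery = db.is_down("B") # one successful probe resets the counter
```

In this sketch a router module would consult `is_down` and `loss_rate` when building its routing table. The threshold trades detection speed against false alarms.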
Many Research Questions • Does the RON approach work at all? • Each RON is small in size, no more than 50 or 100 nodes • How fast can failure detection & recovery happen? • Policy routing • Doesn’t RON violate AUPs and other policies? • Routing behavior • Can stable routing be achieved? • Implementing efficient multi-criteria routing • Is it safe to deploy a large number of (small) interacting RONs on the Internet?
RON Deployment (19 sites) [Map: US sites, with links to vu.nl, lulea.se, ucl.uk, kaist.kr, and .ve] • Sites: .com (ca), .com (ca), dsl (or), cci (ut), aros (ut), utah.edu, .com (tx), cmu (pa), dsl (nc), nyu, cornell, cable (ma), cisco (ma), mit, vu.nl, lulea.se, ucl.uk, kaist.kr, univ-in-venezuela
RON Experiments • Measure loss, latency, and throughput with and without RON • 13 hosts in the US and Europe • 3 days of measurements from data collected in March 2001 • 30-minute average loss rates • A 30 minute outage is very serious! • Note: Experiments done with “No-Internet2-for-commercial-use” policy
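The 30-minute average loss rates used in these experiments can be computed from raw probe samples as sketched below. The sample format `(timestamp_seconds, path, received)` and the function name are assumptions for illustration, not the authors' measurement code.

```python
from collections import defaultdict

WINDOW_SECONDS = 30 * 60  # 30-minute averaging window, as on the slide


def windowed_loss_rates(samples, window=WINDOW_SECONDS):
    """Group probe samples into fixed windows and return the loss rate per window.

    samples: iterable of (timestamp_seconds, path, received) tuples (assumed format).
    Returns a dict mapping (path, window_index) to the fraction of probes lost.
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for timestamp, path, received in samples:
        key = (path, int(timestamp // window))
        sent[key] += 1
        if not received:
            lost[key] += 1
    return {key: lost[key] / sent[key] for key in sent}


# Made-up samples for one path: four probes in the first window, one in the second.
samples = [
    (0, ("A", "B"), True),
    (600, ("A", "B"), False),
    (1200, ("A", "B"), True),
    (1799, ("A", "B"), False),
    (1800, ("A", "B"), True),
]
rates = windowed_loss_rates(samples)
```

Here the first window for path `("A", "B")` has a loss rate of 0.5 and the second has 0.0. Each (path, window) pair represents half a "path hour".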
RON greatly improves loss rate [Figure: 30-min average loss rate on the Internet plotted against 30-min average loss rate with RON; 13,000 samples] • RON loss rate never more than 30%
An order-of-magnitude fewer failures (30-minute average loss rates) • 6,825 “path hours” represented here • 12 “path hours” of essentially complete outage • 76 “path hours” of TCP outage • RON routed around all of these! • One indirection hop provides almost all the benefit!
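A minimal sketch of single-hop indirection follows: compare the direct path with every path through one intermediate overlay node and pick the one with the lowest estimated loss. The node names and loss values are made up. The sketch also assumes that losses on the two overlay links are independent and that unmeasured links count as down. It is an illustration, not the RON router's actual algorithm.

```python
def path_loss(loss, hops):
    """Estimated loss along `hops`, assuming independent loss on each overlay link."""
    delivered = 1.0
    for src, dst in zip(hops, hops[1:]):
        # Assumption: a link with no measurement is treated as fully lossy.
        delivered *= 1.0 - loss.get((src, dst), 1.0)
    return 1.0 - delivered


def best_path(loss, nodes, src, dst):
    """Choose between the direct path and every path via one intermediate node."""
    candidates = [[src, dst]]
    candidates += [[src, via, dst] for via in nodes if via not in (src, dst)]
    # min() keeps the first minimum, so ties favor the direct path.
    return min(candidates, key=lambda hops: path_loss(loss, hops))


# Hypothetical measurements: the direct A->B path is in outage, C is healthy.
loss = {
    ("A", "B"): 1.0,
    ("A", "C"): 0.02,
    ("C", "B"): 0.01,
}
best = best_path(loss, ["A", "B", "C"], "A", "B")
print(best)  # ['A', 'C', 'B']
```

Limiting the search to one intermediate node keeps the candidate set linear in the number of overlay nodes, which fits the small overlays (no more than 50 or 100 nodes) described earlier.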
Conclusion • Improved availability of Internet communication paths using small overlays • Layered above scalable IP substrate • RON provides a set of libraries and programs to facilitate this application-specific routing • Experimental data suggest that this approach works • Over 10X availability • Outage detection and recovery in about 15 seconds • Able to route around certain denial-of-service attacks • Many interesting questions remain… http://nms.lcs.mit.edu/ron/