Resilient Overlay Networks

Resilient Overlay Networks • CS294-4 Presentation • Nikita Borisov • Sep 15, 2003

Internet Routing Inefficient • BGP is designed for scalability, sacrificing performance • Link outages are common, but routing tables take minutes to update • Summarized (aggregated) routing data creates inefficient paths • No response to congestion

Network Redundancies • Multiple paths exist between most hosts • Many are not advertised due to private peering • Link outages lead to non-transitive reachability • E.g. A and C can’t reach each other, but B can reach them both • Indirect paths often offer better performance • (though they may violate acceptable use policies, AUPs)

RON goals • Fast failure detection and recovery • Seconds, not minutes • Integration with application • Optimize routes for latency, throughput, etc. • Fine-grained policy specification • E.g. keep commercial traffic off Internet2

Overlay Network • Small network: 3-50 nodes • Continuous measurement of each pairwise link • Connectivity/performance stats distributed globally • Pick best path out of direct and indirect ones • Restrict search to one indirect hop
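The path choice described on this slide can be sketched as follows. This is a minimal illustration, not RON's actual code: the `best_path` function, the nested-dictionary latency table, and the node names are assumptions made for the example, and a missing entry stands for a link that is currently unreachable.

```python
import math

def best_path(latency, src, dst):
    """Return (path, cost) with the lowest estimated latency,
    considering the direct link and every single-intermediate path."""
    # Start with the direct virtual link; a missing entry means "unreachable".
    best = ([src, dst], latency[src].get(dst, math.inf))
    for hop in latency:
        if hop in (src, dst):
            continue
        cost = latency[src].get(hop, math.inf) + latency[hop].get(dst, math.inf)
        if cost < best[1]:
            best = ([src, hop, dst], cost)
    return best

# Hypothetical three-node overlay: the direct A-C link is slow.
latency = {
    "A": {"B": 10, "C": 100},
    "B": {"A": 10, "C": 20},
    "C": {"A": 100, "B": 20},
}
path, cost = best_path(latency, "A", "C")
```

With N nodes there are only N-2 one-hop candidates per destination, which is why an exhaustive search stays cheap at the 3-50 node scale.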

Failure Detection • Active monitoring • Send probes on each virtual link • One probe every 14s • Send rapid follow-up probes if one is lost • Detect failure in under 20s • Faster than any TCP timeout • Fast enough even on a human time scale
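A small state machine illustrates the probing scheme. Only the 14 s probe interval comes from the slide; the 2 s fast-probe spacing, the threshold of three consecutive losses, and the `LinkMonitor` class itself are hypothetical choices picked so that the sketch stays within the "under 20s" bound.

```python
class LinkMonitor:
    """Tracks one virtual link and decides when to call it down."""

    def __init__(self, probe_interval=14.0, fast_interval=2.0, outage_threshold=3):
        self.probe_interval = probe_interval      # from the slide
        self.fast_interval = fast_interval        # assumed value
        self.outage_threshold = outage_threshold  # assumed value
        self.consecutive_losses = 0
        self.down = False

    def record(self, got_reply):
        """Record one probe result; return seconds to wait before the next probe."""
        if got_reply:
            self.consecutive_losses = 0
            self.down = False
            return self.probe_interval
        self.consecutive_losses += 1
        if self.consecutive_losses >= self.outage_threshold:
            self.down = True
        return self.fast_interval

def worst_case_detection_time(monitor):
    """Seconds to detect a link that dies right after a successful probe."""
    elapsed = monitor.probe_interval  # the next regular probe is the first to be lost
    while True:
        wait = monitor.record(False)
        if monitor.down:
            return elapsed
        elapsed += wait
```

Under these assumed parameters the worst case is 14 + 2 + 2 = 18 seconds; different fast-probe settings would shift that figure.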

Performance Metrics • Estimate latency based on RTT of probes • Weighted moving average • Assume latency is symmetric • Estimate loss rate based on probes received • Average of last 100 samples • Estimate TCP throughput • Model TCP performance based on latency and loss rate
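The three estimators might look like the sketch below. The 100-sample loss window and the symmetric-latency assumption come from the slide. The smoothing weight `alpha=0.9`, the `LinkStats` class, and the use of the simplified square-root TCP model (throughput proportional to 1 / (RTT · √p)) are assumptions for illustration; the slide does not say which TCP model is used.

```python
import math
from collections import deque

class LinkStats:
    """Per-link estimates built from probe results."""

    def __init__(self, alpha=0.9, window=100):
        self.alpha = alpha                    # assumed smoothing weight
        self.latency = None                   # one-way latency estimate, seconds
        self.samples = deque(maxlen=window)   # True = probe answered

    def on_probe(self, rtt=None):
        """Feed one probe result: RTT in seconds, or None if the probe was lost."""
        self.samples.append(rtt is not None)
        if rtt is None:
            return
        one_way = rtt / 2  # latency assumed symmetric
        if self.latency is None:
            self.latency = one_way
        else:
            self.latency = self.alpha * self.latency + (1 - self.alpha) * one_way

    @property
    def loss_rate(self):
        """Fraction of lost probes among the most recent samples."""
        if not self.samples:
            return 0.0
        return 1 - sum(self.samples) / len(self.samples)

    def throughput_score(self, mss=1460):
        """Rough TCP throughput in bytes/s from the square-root model (assumed)."""
        if self.latency is None:
            return 0.0
        if self.loss_rate == 0:
            return math.inf
        rtt = 2 * self.latency
        return mss * math.sqrt(1.5) / (rtt * math.sqrt(self.loss_rate))
```

The score is mainly useful for ranking paths against each other, not as an absolute throughput prediction.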

Path Selection • Always route around outages • Application can optimize for latency, loss rate, throughput • Throughput hard to optimize • Avoid bad-throughput routes instead • Exhaustively search all one-hop paths • Introduce hysteresis to prevent “route flapping”
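Hysteresis can be illustrated with a simple switching rule: stay on the current route unless a competitor is better by more than some margin, but always abandon a route that is down. The `choose_route` function, the 5% margin, and the route labels are hypothetical; the slide does not give the actual threshold.

```python
import math

def choose_route(current, costs, margin=0.05):
    """Pick a route from `costs` (route -> cost, lower is better).

    Keeps `current` unless the best alternative beats it by more than
    `margin` (a fraction), or `current` is unreachable (infinite cost).
    """
    best = min(costs, key=costs.get)
    if current not in costs or costs[current] == math.inf:
        return best  # outage: always route around it
    if costs[best] < costs[current] * (1 - margin):
        return best  # improvement is large enough to justify switching
    return current   # small differences are ignored to avoid flapping

# A 3% improvement is not enough to switch; a 20% improvement is.
stay = choose_route("direct", {"direct": 100, "via B": 97})
switch = choose_route("direct", {"direct": 100, "via B": 80})
```

A larger margin makes routes more stable at the price of reacting later to real improvements.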

Routing Policy • Policies specify which virtual links to use • Separate routing tables per policy • Packets classified with policy tag and routed accordingly • Sample policy: exclusive clique • Only members of clique can use links between each other • E.g. Internet2 hosts
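An exclusive-clique policy can be sketched as a filter over virtual links, applied once per policy tag to build that policy's routing table. The `allowed_links` function, the tag names, and the host names are made up for the example.

```python
from itertools import permutations

def allowed_links(nodes, clique, is_clique_traffic):
    """Return the set of directed virtual links a traffic class may use.

    Links whose endpoints are both in the clique are reserved for
    clique traffic; all other links are open to everyone.
    """
    links = set()
    for a, b in permutations(nodes, 2):
        inside = a in clique and b in clique
        if inside and not is_clique_traffic:
            continue
        links.add((a, b))
    return links

# Hypothetical overlay: two Internet2 hosts and two commercial hosts.
nodes = ["i2_a", "i2_b", "com_a", "com_b"]
internet2 = {"i2_a", "i2_b"}

# One link set (and hence one routing table) per policy tag.
tables = {
    "internet2": allowed_links(nodes, internet2, True),
    "commercial": allowed_links(nodes, internet2, False),
}
```

A packet tagged `commercial` is then routed only over `tables["commercial"]`, so it can never traverse the link between the two Internet2 hosts.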

Measurements • Two studies (RON1 and RON2) • RON recovers from 100% (RON1) or 60% (RON2) of outages and periods of high loss • Routes around throughput failures • Doubles TCP throughput in 5% of all samples • Reduces loss rate by 0.05 in 5% of samples

Performance Problems • RON is worse in some cases • Measurement inaccuracies • Information propagation delays • Hysteresis • But … • RON wins in most cases • RON's losses are never very large • RON's wins, though, can be dramatic

Overhead • Probing traffic - grows O(N) • Routing state traffic - grows O(N^2) • Total BW consumed • 2.2Kbps with 10 nodes • 33Kbps with 50 nodes • A limiting factor for scaling
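The two bandwidth figures can be checked against the stated growth rates with a little arithmetic. The sketch below assumes, purely for illustration, that total overhead has the form a·N + b·N^2 and fits the two constants through the two data points on the slide; the `fit_overhead` helper and the model form are assumptions, not measurements.

```python
def fit_overhead(n1, bw1, n2, bw2):
    """Solve bw = a*n + b*n**2 through two (nodes, Kbps) points."""
    # Dividing by n gives bw/n = a + b*n, a straight line in n.
    b = (bw2 / n2 - bw1 / n1) / (n2 - n1)
    a = bw1 / n1 - b * n1
    return a, b

def overhead_kbps(n, a, b):
    """Overhead predicted by the fitted model for n nodes."""
    return a * n + b * n ** 2

# Data points from the slide: 2.2 Kbps at 10 nodes, 33 Kbps at 50 nodes.
a, b = fit_overhead(10, 2.2, 50, 33.0)
growth = 33.0 / 2.2  # observed growth for a 5x increase in nodes
```

The observed growth factor of about 15 for five times as many nodes lies between linear (5x) and purely quadratic (25x), which is consistent with a mix of O(N) and O(N^2) terms. Extrapolating this fit beyond 50 nodes would be speculative.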

Question • Is this overhead excessive? • Less than 10% of a broadband link • What if RONs become more popular? • Is using a RON “cheating”?

Applications • Videoconferencing • Cooperating ISPs • Branch offices of companies • Others?

Discussion
