Improving the Reliability of Internet Paths with One-hop Source Routing Krishna P. Gummadi, Univ. of Washington, OSDI 2004
Outline • One-line comment • Problem • Measurement study • Approach • Detailed policies • Real-world implementation • Critique
One-line comment • Improving the reliability of Internet paths, simply, with Scalable One-hop Source Routing (SOSR) [Figure: a source reaching a destination through a one-hop intermediary]
Problem • Demands on Internet reliability are increasing • However, reliability falls FAR short of the "five 9s" (99.999%) • Probability of encountering a path failure: 1.5~3.3% • Recovery times are long • Suggested solutions • Server replication • Expensive; limited to high-end web sites • Multi-homing • BGP fail-over time is LONG [Labovitz 00] • Overlay routing networks (e.g., RON) • Monitoring/selecting paths incurs high overhead • Any simple, scalable solutions? [Figure: nodes A, B, C, D connected through the network]
Measurement study: environment • Characterize the realities of Internet path failures • Measure the availability of Internet paths broadly • Frequency and duration of failures • Assess the potential of their approach, SOSR • Two factors important to its performance • Location of failures, and success rate of alternate paths
Methodology • Requesters: 67 PlanetLab nodes (monitoring Internet paths) • Destinations: three different sets of hosts, 3,153 in total = 378 popular web servers + 1,139 broadband hosts + 1,636 randomly selected IP addresses (for comparison)
Methodology • Each observer probes its assigned destination every 15 seconds • Probe every 5 seconds after a loss (no response within 3 s) • 3 consecutive losses = a path failure • On failure, all other observers probe the destination too (to check alternative paths) • Path recovery: 10 consecutive responses after the failure (see the sketch below) [Figure: a PlanetLab observer probing its assigned destination]
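The probing rules above form a small state machine. Below is a minimal Python sketch of one observer's loop; `send_probe` and `notify_other_observers` are hypothetical callables standing in for the probe primitive and the observer coordination step, not code from the paper.

```python
import time

PROBE_INTERVAL = 15      # seconds between probes on a healthy path
LOSS_INTERVAL = 5        # probe faster once a loss is suspected
PROBE_TIMEOUT = 3        # no response within 3 s counts as a loss
FAIL_THRESHOLD = 3       # 3 consecutive losses declare a path failure
RECOVER_THRESHOLD = 10   # 10 consecutive responses end the failure

def monitor_path(dst, send_probe, notify_other_observers):
    """Monitor one observer->destination path with the rules above."""
    losses, responses, failed = 0, 0, False
    while True:
        if send_probe(dst, timeout=PROBE_TIMEOUT):   # True if a response arrived
            losses, responses = 0, responses + 1
            if failed and responses >= RECOVER_THRESHOLD:
                failed = False                       # path has recovered
        else:
            responses, losses = 0, losses + 1
            if not failed and losses >= FAIL_THRESHOLD:
                failed = True                        # declare a path failure
                notify_other_observers(dst)          # others probe alternate paths
        time.sleep(LOSS_INTERVAL if losses > 0 else PROBE_INTERVAL)
```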
Measured facts • Availability (frequency, duration) • "The 7-day study saw more failures than RON saw in 9 months!" • On average, each path failed at least once per week • Only 20% of server paths and 12% of broadband host paths were fault-free • Web server paths: 99.6% availability • Broadband host paths: 94.4% availability
Measured facts • Location of failures • 4 different parts of an Internet path: src_side, backbone, dst_side, last_hop • Location affects the number of alternative paths • Backbone: high path diversity • Last hop: no choice [Figure: observer-to-destination path divided into src_side, backbone (core), dst_side, and last hop]
Measured facts • Success rate of other observers during path failures • If another observer can still reach the destination, the failure can be routed around through its path • Select that node as an intermediary [Figure: the direct path from one PlanetLab observer fails while another observer still reaches the destination]
Approach: one-hop source routing • Route around a failure by sending the request through an intermediary node • How should we select intermediaries? [Figure: a requester reaching the destination through one of several candidate observers]
Which intermediary? • Number of useful intermediaries: for about 80% of failures (100% − 20%), 21 or more nodes could serve as useful intermediaries • So let's pick k intermediary nodes randomly! No state maintenance needed!
How many intermediaries? • The knee of the recovery-rate curve is at k = 4 • 4 intermediaries are enough: low overhead, high recovery rate (an illustrative model follows)
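One back-of-the-envelope way to see why a small k suffices (an illustrative independence model, not the paper's measured analysis): if each randomly chosen intermediary routes around the failure independently with probability p, the chance that at least one of k succeeds is 1 − (1 − p)^k, which saturates quickly.

```python
def recovery_probability(p: float, k: int) -> float:
    # Chance that at least one of k independent intermediaries succeeds.
    return 1 - (1 - p) ** k

for k in range(1, 9):
    print(k, round(recovery_probability(0.5, k), 3))
# With the assumed p = 0.5: k=1 -> 0.5, k=4 -> 0.938, k=8 -> 0.996;
# gains beyond k = 4 are marginal, matching the knee in the measured curve.
```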
Approach: one-hop source routing, random-4 • Select 4 intermediaries at random and retry the request through them when the direct path fails (sketched below) [Figure: requester retrying through 4 randomly chosen observers toward the destination]
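A minimal sketch of the random-4 recovery step. `send_direct` and `send_via` are hypothetical primitives for the direct request and the one-hop indirection, illustration-only names rather than the authors' API:

```python
import random

def fetch_with_sosr(dst, candidates, send_direct, send_via, k=4):
    """Try the default route; on failure, retry through k intermediaries
    chosen uniformly at random (no per-path state is maintained)."""
    reply = send_direct(dst)
    if reply is not None:
        return reply
    for hop in random.sample(candidates, k):   # random-k selection
        reply = send_via(hop, dst)             # one-hop indirection
        if reply is not None:
            return reply
    return None                                # alternate paths failed too
```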
Result of random-4 • For server paths, it recovered from • 50% of near-source failures • 89% of core (backbone) failures! • 72% of destination-side failures! • 40% of last-hop failures
Improving random-k • Assumption: a disjoint path, one that doesn't share the failed link, can recover from the failure • 1. history-k • (k−1) random picks + the most recently successful node (assumed to be disjoint) • 2. BGP-paths-k • Try the most disjoint path for recovery • Select paths with the fewest ASes in common with the direct path • Requires sorting intermediaries by the number of common ASes (see the selection sketch below)
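The two policies could be sketched as selection functions like the ones below. The `last_success` bookkeeping and the `as_path` lookup are assumptions about how the needed state might be obtained, not the paper's implementation.

```python
import random

def history_k(candidates, last_success, k=4):
    # history-k: (k-1) random picks plus the most recently successful
    # intermediary, which is assumed to give a disjoint path.
    if last_success is None:
        return random.sample(candidates, k)
    rest = [n for n in candidates if n != last_success]
    return [last_success] + random.sample(rest, k - 1)

def bgp_paths_k(candidates, dst, as_path, k=4):
    # BGP-paths-k: prefer intermediaries whose path to the destination
    # shares the fewest ASes with the direct path (most disjoint first).
    direct = set(as_path(dst))
    def overlap(n):
        return len(direct & set(as_path(dst, via=n)))
    return sorted(candidates, key=overlap)[:k]
```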
Real-world implementation (an illustrative relay sketch follows) [Figure: the SOSR prototype, with a requester and an intermediary node]
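The slide's figure shows only the requester and the intermediary. As a rough illustration of what a one-hop intermediary must do, forward the request toward the destination and relay the reply back, here is a minimal UDP relay; the `dst_host:dst_port|payload` encapsulation is invented for this sketch and is not the prototype's actual mechanism.

```python
import socket

def run_intermediary(listen_port=4000):
    """Accept 'dst_host:dst_port|payload' datagrams from requesters,
    forward the payload to the destination, and relay the reply back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", listen_port))
    while True:
        data, requester = sock.recvfrom(65535)
        header, payload = data.split(b"|", 1)
        host, port = header.decode().rsplit(":", 1)
        out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        out.settimeout(3)
        out.sendto(payload, (host, int(port)))   # toward the destination
        try:
            reply, _ = out.recvfrom(65535)
            sock.sendto(reply, requester)        # relay the reply back
        except socket.timeout:
            pass                                 # destination unreachable
        out.close()
```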
The test • 3 machines running wget at U. of Washington • 982 web servers, 1 web page fetched per second, for 3 days • 273,000 total requests • All machines fetched the same page at the same time, so they would share any path failure • 3 techniques compared: wget, wget-sosr, and wget-aggressiveTCP (a rough sketch of wget-sosr follows)
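A rough sketch of the wget-sosr behavior per request: fetch directly, and on failure retry through the intermediaries. Treating each intermediary as an HTTP proxy is an assumption made for illustration; wget's `-e use_proxy`/`http_proxy` options are standard, but the paper's wget-sosr uses its own SOSR indirection rather than a proxy.

```python
import subprocess

def wget_fetch(url, proxy=None, timeout=30):
    # One wget fetch; success = exit code 0.
    cmd = ["wget", "-q", "-T", str(timeout), "-O", "/dev/null"]
    if proxy:  # route the request through an intermediary (assumed proxy-style)
        cmd += ["-e", "use_proxy=yes", "-e", f"http_proxy={proxy}"]
    return subprocess.run(cmd + [url]).returncode == 0

def wget_sosr(url, intermediaries):
    # wget-sosr, roughly: direct fetch first, then retry via intermediaries.
    if wget_fetch(url):
        return True
    return any(wget_fetch(url, proxy=hop) for hop in intermediaries)
```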
Result • The failure rate was only 0.18% • wget-sosr recovered from 56% of network-level failures • However, because of application-level failures, the overall recovery rate was only 20%
Critique • Strong points • A new approach that needs no overlay network • Stateless, simple, scalable • Weak points • Latency • The simple approach doesn't consider latency, yet the response time of a web page is critical! • Supporting stateful connections • Many web-based communications are stateful; recovering stateful connections needs further work • Limited to only a few applications • Users have to install support separately for each application