Improving the Reliability of Internet Paths with One-hop Source Routing Krishna P. Gummadi, Univ. of Washington, OSDI 2004
Outline • One-line comment • Problem • Measurement study • Approach • Detailed policies • Real-world implementation • Critique
One-line comment • Improving the reliability of Internet paths, simply, with Scalable One-hop Source Routing (SOSR) [Figure: a source reaching a destination through a one-hop intermediary]
Problem • Demands on Internet reliability are increasing • However, reliability falls FAR short of the "five 9s" (99.999%) • Probability of encountering a path failure: 1.5~3.3% • Recovery times are long • Suggested solutions • Server replication • Expensive; limited to high-end web sites • Multi-homing • BGP fail-over time is LONG [Labovitz 00] • Overlay routing networks (e.g., RON) • Monitoring/selecting paths incurs high overhead • Any simple, scalable solutions? [Figure: nodes A, B, C, D connected through the network]
Measurement study: environment • Characterize the realities of Internet path failures • Measure the availability of Internet paths broadly • Frequency and duration of failures • Assess the potential of their approach, SOSR • Two factors important to its performance • Location of failures, and success rate of alternate paths
Methodology • Requesters: 67 PlanetLab nodes (monitoring Internet paths) • Destinations: three different sets of hosts, 3,153 in total = 378 popular web servers + 1,139 broadband hosts + 1,636 randomly selected IP addresses (for comparison)
Methodology • Each observer probes its assigned destination every 15 seconds • Probe every 5 seconds after a loss (no response within 3 s) • 3 consecutive losses = a path failure • On failure, all other observers probe the destination too (to check alternative paths) • Path recovery: 10 consecutive responses after the failure (see the sketch below) [Figure: a PlanetLab observer probing its assigned destination]
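The probing rules above form a small state machine. Below is a minimal Python sketch of one observer's loop; `send_probe` and `notify_other_observers` are hypothetical callables standing in for the probe primitive and the observer coordination step, not code from the paper.

```python
import time

PROBE_INTERVAL = 15      # seconds between probes on a healthy path
LOSS_INTERVAL = 5        # probe faster once a loss is suspected
PROBE_TIMEOUT = 3        # no response within 3 s counts as a loss
FAIL_THRESHOLD = 3       # 3 consecutive losses declare a path failure
RECOVER_THRESHOLD = 10   # 10 consecutive responses end the failure

def monitor_path(dst, send_probe, notify_other_observers):
    """Monitor one observer->destination path with the rules above."""
    losses, responses, failed = 0, 0, False
    while True:
        if send_probe(dst, timeout=PROBE_TIMEOUT):   # True if a response arrived
            losses, responses = 0, responses + 1
            if failed and responses >= RECOVER_THRESHOLD:
                failed = False                       # path has recovered
        else:
            responses, losses = 0, losses + 1
            if not failed and losses >= FAIL_THRESHOLD:
                failed = True                        # declare a path failure
                notify_other_observers(dst)          # others probe alternate paths
        time.sleep(LOSS_INTERVAL if losses > 0 else PROBE_INTERVAL)
```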
Measured facts • Availability (frequency, duration) • "The 7-day study saw more failures than RON saw in 9 months!" • On average, each path failed at least once per week • Only 20% of server paths and 12% of broadband host paths were fault-free • Web server paths: 99.6% availability • Broadband host paths: 94.4% availability
Measured facts • Location of failures • 4 different parts of an Internet path: src_side, backbone, dst_side, last_hop • Location affects the number of alternative paths • Backbone: high path diversity • Last hop: no choice [Figure: observer-to-destination path divided into src_side, backbone (core), dst_side, and last hop]
Measured facts • Success rate of other observers during path failures • If another observer can still reach the destination, the failure can be routed around through its path • Select that node as an intermediary [Figure: the direct path from one PlanetLab observer fails while another observer still reaches the destination]
Approach: one-hop source routing • Route around a failure by sending the request through an intermediary node • How should we select intermediaries? [Figure: a requester reaching the destination through one of several candidate observers]
Which intermediary? • Number of useful intermediaries: for about 80% of failures (100% − 20%), 21 or more nodes could serve as useful intermediaries • So let's pick k intermediary nodes randomly! No state maintenance needed!
How many intermediaries? • The knee of the recovery-rate curve is at k = 4 • 4 intermediaries are enough: low overhead, high recovery rate (an illustrative model follows)
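One back-of-the-envelope way to see why a small k suffices (an illustrative independence model, not the paper's measured analysis): if each randomly chosen intermediary routes around the failure independently with probability p, the chance that at least one of k succeeds is 1 − (1 − p)^k, which saturates quickly.

```python
def recovery_probability(p: float, k: int) -> float:
    # Chance that at least one of k independent intermediaries succeeds.
    return 1 - (1 - p) ** k

for k in range(1, 9):
    print(k, round(recovery_probability(0.5, k), 3))
# With the assumed p = 0.5: k=1 -> 0.5, k=4 -> 0.938, k=8 -> 0.996;
# gains beyond k = 4 are marginal, matching the knee in the measured curve.
```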
Approach: one-hop source routing, random-4 • Select 4 intermediaries at random and retry the request through them when the direct path fails (sketched below) [Figure: requester retrying through 4 randomly chosen observers toward the destination]
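A minimal sketch of the random-4 recovery step. `send_direct` and `send_via` are hypothetical primitives for the direct request and the one-hop indirection, illustration-only names rather than the authors' API:

```python
import random

def fetch_with_sosr(dst, candidates, send_direct, send_via, k=4):
    """Try the default route; on failure, retry through k intermediaries
    chosen uniformly at random (no per-path state is maintained)."""
    reply = send_direct(dst)
    if reply is not None:
        return reply
    for hop in random.sample(candidates, k):   # random-k selection
        reply = send_via(hop, dst)             # one-hop indirection
        if reply is not None:
            return reply
    return None                                # alternate paths failed too
```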
Result of random-4 • For server paths, it recovered from • 50% of near-source failures • 89% of core (backbone) failures! • 72% of destination-side failures! • 40% of last-hop failures
Improving random-k • Assumption: a disjoint path, one that doesn't share the failed link, can recover from the failure • 1. history-k • (k−1) random picks + the most recently successful node (assumed to be disjoint) • 2. BGP-paths-k • Try the most disjoint path for recovery • Select paths with the fewest ASes in common with the direct path • Requires sorting intermediaries by the number of common ASes (see the selection sketch below)
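The two policies could be sketched as selection functions like the ones below. The `last_success` bookkeeping and the `as_path` lookup are assumptions about how the needed state might be obtained, not the paper's implementation.

```python
import random

def history_k(candidates, last_success, k=4):
    # history-k: (k-1) random picks plus the most recently successful
    # intermediary, which is assumed to give a disjoint path.
    if last_success is None:
        return random.sample(candidates, k)
    rest = [n for n in candidates if n != last_success]
    return [last_success] + random.sample(rest, k - 1)

def bgp_paths_k(candidates, dst, as_path, k=4):
    # BGP-paths-k: prefer intermediaries whose path to the destination
    # shares the fewest ASes with the direct path (most disjoint first).
    direct = set(as_path(dst))
    def overlap(n):
        return len(direct & set(as_path(dst, via=n)))
    return sorted(candidates, key=overlap)[:k]
```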
Real-world implementation (an illustrative relay sketch follows) [Figure: the SOSR prototype, with a requester and an intermediary node]
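The slide's figure shows only the requester and the intermediary. As a rough illustration of what a one-hop intermediary must do, forward the request toward the destination and relay the reply back, here is a minimal UDP relay; the `dst_host:dst_port|payload` encapsulation is invented for this sketch and is not the prototype's actual mechanism.

```python
import socket

def run_intermediary(listen_port=4000):
    """Accept 'dst_host:dst_port|payload' datagrams from requesters,
    forward the payload to the destination, and relay the reply back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", listen_port))
    while True:
        data, requester = sock.recvfrom(65535)
        header, payload = data.split(b"|", 1)
        host, port = header.decode().rsplit(":", 1)
        out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        out.settimeout(3)
        out.sendto(payload, (host, int(port)))   # toward the destination
        try:
            reply, _ = out.recvfrom(65535)
            sock.sendto(reply, requester)        # relay the reply back
        except socket.timeout:
            pass                                 # destination unreachable
        out.close()
```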
The test • 3 machines running wget at U. of Washington • 982 web servers, 1 web page fetched per second, for 3 days • 273,000 total requests • All machines fetched the same page at the same time, so they would share any path failure • 3 techniques compared: wget, wget-sosr, and wget-aggressiveTCP (a rough sketch of wget-sosr follows)
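A rough sketch of the wget-sosr behavior per request: fetch directly, and on failure retry through the intermediaries. Treating each intermediary as an HTTP proxy is an assumption made for illustration; wget's `-e use_proxy`/`http_proxy` options are standard, but the paper's wget-sosr uses its own SOSR indirection rather than a proxy.

```python
import subprocess

def wget_fetch(url, proxy=None, timeout=30):
    # One wget fetch; success = exit code 0.
    cmd = ["wget", "-q", "-T", str(timeout), "-O", "/dev/null"]
    if proxy:  # route the request through an intermediary (assumed proxy-style)
        cmd += ["-e", "use_proxy=yes", "-e", f"http_proxy={proxy}"]
    return subprocess.run(cmd + [url]).returncode == 0

def wget_sosr(url, intermediaries):
    # wget-sosr, roughly: direct fetch first, then retry via intermediaries.
    if wget_fetch(url):
        return True
    return any(wget_fetch(url, proxy=hop) for hop in intermediaries)
```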
Result • The failure rate was only 0.18% • wget-sosr recovered from 56% of network-level failures • However, because of application-level failures, the overall recovery rate was only 20%
Critique • Strong points • A new approach that needs no overlay network • Stateless, simple, scalable • Weak points • Latency • The simple approach doesn't consider latency, yet the response time of a web page is critical! • Supporting stateful connections • Many web-based communications are stateful; recovering stateful connections needs further work • Limited to only a few applications • Users have to install support separately for each application