Availability in Wide-Area Service Composition

Availability in Wide-Area Service Composition Bhaskaran Raman and Randy H. Katz SAHARA, EECS, U.C.Berkeley

10% of paths have only 95% availability Problem Statement Poor availability of wide-area (inter-domain) Internet paths [Labovitz, FTCS’99] BGP recovery can take several 10s of seconds [Labovitz, SIGCOMM’00]

Internet Source Composed services Destination Application plane Peering: exchange perf. info. Service cluster: compute cluster capable of running services Functionalities at the Cluster-Manager Peering relations, Overlay network Service-Level Path Creation, Maintenance, and Recovery Logical platform Link-State Propagation Finding Overlay Entry/Exit Location of Service Replicas Service clusters Hardware platform At-least -once UDP Perf. Meas. Liveness Detection Architecture

Time Timeout for failure detection Time Timeout period “Failure” detection in the Wide-Area • Two important characteristics: • Distbn. of outage periods • Rate of occurrence • Wide-Area traces • 12 pairs of hosts: Berkeley, Stanford, UIUC, CMU, TU-Berlin, UNSW • 300ms heart-beat • Approx. 2sec timeout • Low rate of occurrence (once an hour) • Good for many real-time applications

Key Design Points • Overlay size: how many nodes? • A comparison: Akamai cache servers • O(10,000) servers for Internet-wide operation • Probably a lesser number of data-center locations • Link-state floods: • Twice for each failure • For a 1,000-node graph; estimate #edges = 10,000 • Failures (>1.8 sec outage): O(once an hour) in the worst case • Only about 6 floods/second in the entire network! • Graph computation: • Modified version of Dijkstra’s for service composition • O(k*E*log(N)) computation time; k = #services composed • For 6,510-node network, this takes 50ms • Huge overhead, but: path caching helps • Memory: a few MB

Wide-Area experiments: setup • 8 nodes: • Berkeley, Stanford, UCSD, CMU • Cable modem (Berkeley) • DSL (San Francisco) • UNSW (Australia), TU-Berlin (Germany) • Text-to-speech composed sessions • Half with destinations at Berkeley, CMU • Half with recovery algo enabled, other half disabled • 4 paths in system at any time • Duration of session: 2min 30sec • Run for 4 days • Metric: loss-rate measured in 5sec intervals

Loss-rate for a pair of paths

CDF of loss-rates of all paths failed

CDF of gaps seen at client

Summary • Failure detection makes sense in ~2sec • Improvement in availability for real-time applications • Text-to-speech composed application • About 3.5-4 sec recovery time • 2,000ms failure detection timeout • 1,000ms recovery signaling • 500-1,000ms state restoration (re-process current text sentence) • Of the 2,872 paths, 18 were recovered (0.63%) • Availability: Number of 5sec periods with >10% outage: • Other issues: stability, scaling, load-balancing • Studied using Millennium emulation platform

Availability in Wide-Area Service Composition

Availability in Wide-Area Service Composition

Presentation Transcript

Wide Area Networking

Wide Area Networking

IB in the Wide Area

Wide Area Networks

Area-wide acaricides

S3fs in the wide area

Service Composition

Wide Area Networks

Performance and Availability in Wide-Area Service Composition

Service composition

Wide Area Networks

Wide Area Wireless

Wide Area Networking

Wide Area InfiniBand

Wide Area Networks

Wide-Area Service Composition: Evaluation of Availability and Scalability

IB in the Wide Area

Area Wide Optimization

Wide Area InfiniBand