100 likes | 185 Views
Availability in Wide-Area Service Composition. Bhaskaran Raman and Randy H. Katz SAHARA, EECS, U.C.Berkeley. 10% of paths have only 95% availability. Problem Statement. Poor availability of wide-area (inter-domain) Internet paths. [Labovitz, FTCS’99].
E N D
Availability in Wide-Area Service Composition Bhaskaran Raman and Randy H. Katz SAHARA, EECS, U.C.Berkeley
10% of paths have only 95% availability Problem Statement Poor availability of wide-area (inter-domain) Internet paths [Labovitz, FTCS’99] BGP recovery can take several 10s of seconds [Labovitz, SIGCOMM’00]
Internet Source Composed services Destination Application plane Peering: exchange perf. info. Service cluster: compute cluster capable of running services Functionalities at the Cluster-Manager Peering relations, Overlay network Service-Level Path Creation, Maintenance, and Recovery Logical platform Link-State Propagation Finding Overlay Entry/Exit Location of Service Replicas Service clusters Hardware platform At-least -once UDP Perf. Meas. Liveness Detection Architecture
Time Timeout for failure detection Time Timeout period “Failure” detection in the Wide-Area • Two important characteristics: • Distbn. of outage periods • Rate of occurrence • Wide-Area traces • 12 pairs of hosts: Berkeley, Stanford, UIUC, CMU, TU-Berlin, UNSW • 300ms heart-beat • Approx. 2sec timeout • Low rate of occurrence (once an hour) • Good for many real-time applications
Key Design Points • Overlay size: how many nodes? • A comparison: Akamai cache servers • O(10,000) servers for Internet-wide operation • Probably a lesser number of data-center locations • Link-state floods: • Twice for each failure • For a 1,000-node graph; estimate #edges = 10,000 • Failures (>1.8 sec outage): O(once an hour) in the worst case • Only about 6 floods/second in the entire network! • Graph computation: • Modified version of Dijkstra’s for service composition • O(k*E*log(N)) computation time; k = #services composed • For 6,510-node network, this takes 50ms • Huge overhead, but: path caching helps • Memory: a few MB
Wide-Area experiments: setup • 8 nodes: • Berkeley, Stanford, UCSD, CMU • Cable modem (Berkeley) • DSL (San Francisco) • UNSW (Australia), TU-Berlin (Germany) • Text-to-speech composed sessions • Half with destinations at Berkeley, CMU • Half with recovery algo enabled, other half disabled • 4 paths in system at any time • Duration of session: 2min 30sec • Run for 4 days • Metric: loss-rate measured in 5sec intervals
Summary • Failure detection makes sense in ~2sec • Improvement in availability for real-time applications • Text-to-speech composed application • About 3.5-4 sec recovery time • 2,000ms failure detection timeout • 1,000ms recovery signaling • 500-1,000ms state restoration (re-process current text sentence) • Of the 2,872 paths, 18 were recovered (0.63%) • Availability: Number of 5sec periods with >10% outage: • Other issues: stability, scaling, load-balancing • Studied using Millennium emulation platform