Latency as a Performability Metric for Internet Services Pete Broadwell pbwell@cs.berkeley.edu
Outline • Performability background/review • Latency-related concepts • Project status • Initial test results • Current issues
Motivation • A goal of the ROC project: develop metrics to evaluate new recovery techniques • Problem: the basic concept of availability assumes a system is either “up” or “down” at a given time • “Nines” describe only the fraction of uptime over a certain interval
Why Is Availability Insufficient? • Availability doesn’t describe the durations or frequencies of individual outages • Both can strongly influence user perception of a service, as well as revenue • Availability doesn’t capture a system’s capacity to support degraded service • degraded performance during failures • reduced data quality during high load (Web)
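The first bullet can be made concrete with a small sketch (the outage patterns are invented for illustration): two services with identical “three nines” availability over a month, but radically different user experiences.

```python
# Illustrative sketch, not from the talk: two services share the same
# availability yet have very different outage patterns.
MONTH_SEC = 30 * 24 * 3600  # 2,592,000 s in a 30-day month
# Both services are down 0.1% of the month (2592 s total).
outages_a = [2592]      # service A: one ~43-minute outage
outages_b = [3] * 864   # service B: 864 three-second blips

def availability(outage_durations_sec, period_sec=MONTH_SEC):
    """Fraction of the period the service was up."""
    return 1 - sum(outage_durations_sec) / period_sec

# Identical availability ("three nines") despite very different
# outage durations and frequencies.
print(availability(outages_a), availability(outages_b))
```

Availability alone cannot distinguish these two cases, which is exactly the gap performability metrics try to fill.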
What is “performability”? • Combination of performance and dependability measures • Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults1 • Concept from the traditional fault-tolerant systems community, ca. 1978 • Has since been applied to other areas, but still not in widespread use 1 J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Performability Example Discrete-time Markov chain (DTMC) model of a RAID-5 disk array1 • D = number of data disks • pi(t) = probability that the system is in state i at time t • wi(t) = reward (disk I/O operations/sec) • μ = disk repair rate • λ = failure rate of a single disk drive 1 Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
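A minimal sketch of this kind of model, using assumed rates and rewards rather than the values from Kari’s thesis (and a continuous-time chain for simplicity): three states — all disks up, one disk failed (degraded), array failed — with the expected steady-state reward Σᵢ pᵢ·wᵢ as the performability measure. The transition structure and the rebuild-from-failure rate are my own assumptions.

```python
# Hedged sketch of a 3-state Markov performability model of a RAID-5
# array. All numbers are assumptions for illustration only.
import numpy as np

D = 4          # number of data disks (assumed)
lam = 1e-5     # per-disk failure rate, 1/hour (assumed)
mu = 0.5       # repair rate, 1/hour (assumed)

# States: 0 = all disks up, 1 = one disk failed (degraded), 2 = failed.
# Generator matrix Q; rows sum to zero. The failed->up "rebuild" rate
# of mu is an assumption, not part of the cited model.
Q = np.array([
    [-(D + 1) * lam, (D + 1) * lam,   0.0],
    [mu,            -(mu + D * lam),  D * lam],
    [mu,             0.0,            -mu],
])

# Steady-state probabilities p solve p Q = 0 with sum(p) = 1.
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
p, *_ = np.linalg.lstsq(A, b, rcond=None)

w = np.array([1000.0, 600.0, 0.0])  # reward: IOPS per state (assumed)
performability = p @ w              # expected steady-state throughput
print(performability)
```

The point of the exercise: the metric blends performance (the per-state rewards wᵢ) with dependability (the state probabilities pᵢ) into one number.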
Visualizing Performability [Figure: throughput (I/O operations/sec) over time — normal throughput until a FAILURE, degraded throughput between DETECT and REPAIR, then RECOVER back to normal; a line marks the average throughput]
Metrics for Web Services • Throughput – requests/sec • Latency – render time, time to first byte • Data quality • harvest (response completeness) • yield (% of queries answered)1 1 E. Brewer, Lessons from Giant-Scale Internet Services, 2001
Applications of Metrics • Modeling the expected failure-related performance of a system, prior to deployment • Benchmarking the performance of an existing system during various recovery phases • Comparing the reliability gains offered by different recovery strategies
Related Projects • HP: Automating Data Dependability • uses “time to data access” as one objective for storage systems • Rutgers: PRESS/Mendosus • evaluated throughput of PRESS server during injected failures • IBM: Autonomic Storage • Numerous ROC projects
Arguments for Using Latency as a Metric • Originally, performability metrics were meant to capture the end-user experience1 • Latency better describes the experience of an end user of a web site • response time > 8 sec = site abandonment = lost revenue2 • Throughput describes the raw processing capacity of a service • best used to quantify expenses 1 J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994 2 Zona Research and Keynote Systems, The Need for Speed II, 2001
Current Progress • Using the Mendosus fault-injection system on a 4-node PRESS web server (both from Rutgers) • Running latency-based performability tests on the cluster • Injecting faults during load tests • Recording page-load times before, during and after each fault
Test Setup [Diagram: test clients send page requests through an emulated switch to the PRESS web server + Mendosus, which returns responses; the server nodes exchange caching info] • Normal version: cooperative caching • HA version: cooperative caching + heartbeat monitoring
Effect of Component Failure on Performability Metrics [Figure: performability metric (throughput and latency curves) over time, between FAILURE and REPAIR]
Observations • Below saturation, throughput is more dependent on load than latency is • Above saturation, latency is more dependent on load [Figure: measured points over time — throughput 3/s at latency 0.14 s; 6/s at 0.14 s; 7/s at 0.4 s]
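This saturation behavior can be illustrated with a textbook M/M/1 queue (a modeling sketch, not the actual PRESS measurements — the service rate below is assumed): mean response time is 1/(μ − λ) for arrival rate λ below the service rate μ, so latency stays nearly flat at low load and grows sharply near saturation, while throughput simply tracks the offered load.

```python
# Hedged illustration of latency vs. load using an M/M/1 queue.
# mu (service rate) is an assumed value, not a PRESS measurement.
def mm1_latency(lam, mu=10.0):
    """Mean time in system for an M/M/1 queue, arrival rate lam < mu."""
    assert lam < mu, "queue is unstable at or above saturation"
    return 1.0 / (mu - lam)

# Below saturation latency barely moves; near saturation it blows up.
for lam in (3.0, 6.0, 9.0, 9.9):
    print(f"load {lam:4.1f}/s -> mean latency {mm1_latency(lam):.2f} s")
```

This is why latency and throughput can rank the same failure very differently depending on how close the system runs to saturation.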
How to Represent Latency? • Average response time over a given time period • Make a distinction between “render time” and “time to first byte”? • Deviation from a baseline latency • Impose a greater penalty for deviations toward longer wait times?
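Two of these candidates can be sketched directly (the function names, sample values, and the 2× slowdown penalty are my own illustrative choices, not from the talk):

```python
# Sketch of two candidate latency representations from the slide above.
def mean_latency(samples):
    """Average response time over a measurement window (seconds)."""
    return sum(samples) / len(samples)

def baseline_deviation(samples, baseline, slow_penalty=2.0):
    """Mean deviation from a baseline latency, where deviations toward
    longer wait times are penalized slow_penalty times more heavily."""
    total = 0.0
    for t in samples:
        d = t - baseline
        total += slow_penalty * d if d > 0 else -d
    return total / len(samples)

samples = [0.12, 0.15, 0.14, 0.90]  # made-up page-load times (s)
print(mean_latency(samples))                  # dominated by the outlier
print(baseline_deviation(samples, 0.14))      # weights the slowdown
```

The asymmetric penalty captures the intuition that a page loading 0.8 s slower hurts users far more than one loading 0.02 s faster helps them.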
Response Time with Load Shedding Policy [Figure: response time (sec) vs. time between FAILURE and REPAIR — response time is held below the 8 s abandonment threshold by a load-shedding threshold; X users get a “server too busy” msg]
Load Shedding Issues • Load shedding means returning 0% data quality – a different kind of performability metric • To combine load shedding and latency, define a “demerit” system: • “server too busy” msg – 3 demerits • 8 sec response time – 1 demerit/sec • Such systems quickly lose generality, however
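One possible reading of that demerit scheme as code (the exact scoring rule is ambiguous on the slide, so this interpretation — 3 demerits per shed request, 1 demerit per second of response time at or past the 8 s threshold — is an assumption):

```python
# Hedged sketch of a "demerit" scoring system combining load shedding
# and latency. The scoring rule is one interpretation of the slide.
BUSY_DEMERITS = 3.0   # per "server too busy" rejection
THRESHOLD = 8.0       # abandonment threshold, seconds

def demerits(responses):
    """responses: latency in seconds per request, or None if the
    request was shed (user saw 'server too busy')."""
    score = 0.0
    for t in responses:
        if t is None:              # load shed: 0% data quality
            score += BUSY_DEMERITS
        elif t >= THRESHOLD:       # 1 demerit/sec at/past the threshold
            score += t - THRESHOLD + 1
    return score

print(demerits([None, 2.0, 9.0]))
```

The arbitrariness of the constants (why 3 demerits? why linear past 8 s?) is exactly the loss of generality the slide warns about.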
Further Work • Collect more experimental results! • Compare throughput-based and latency-based results for the normal and high-availability versions of PRESS • Evaluate the usefulness of “demerit” systems for describing the user experience (latency and data quality)