Latency as a Performability Metric: Experimental Results Pete Broadwell pbwell@cs.berkeley.edu
Outline • Motivation and background • Performability overview • Project summary • Test setup • PRESS web server • Mendosus fault injection system • Experimental results & analysis • How to represent latency • Questions for future research
Performability overview • Goal of ROC project: develop metrics to evaluate new recovery techniques • Performability – a class of metrics that describe how a system performs in the presence of faults • First used in the fault-tolerant computing field¹ • Now being applied to online services ¹ J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Example: microbenchmark RAID disk failure
Project motivation • Rutgers study: performability analysis of a web server, using throughput • Other studies (esp. from HP Labs Storage group) also use response time as a metric • Assertion: latency and data quality are better than throughput for describing user experience • How best to represent latency in performability reports?
Project overview • Goals: • Replicate PRESS/Mendosus study with response time measurements • Discuss how to incorporate latency into performability statistics • Contributions: • Provide a latency-based analysis of a web server’s performability (currently rare) • Further the development of more comprehensive dependability benchmarks
Experiment components • The Mendosus fault injection system • From Rutgers (Rich Martin) • Goal: low-overhead emulation of a cluster of workstations, injection of likely faults • The PRESS web server • Cluster-based, uses cooperative caching. Designed by Carreira et al. (Rutgers) • Perf-PRESS: basic version • HA-PRESS: incorporates heartbeats and a master node for automated cluster management • Client simulators • Submit a set number of requests/sec, based on real traces
Mendosus design (architecture diagram): a Java global controller coordinates workstations (real or VMs) over an emulated LAN; each node runs a user-level Java daemon, the applications, and fault-injection hooks (modified NIC driver, SCSI module, proc module), driven by apps, LAN emulation, and fault config files
Test case timeline • Warm-up time: 30-60 seconds • Time to repair: up to 90 seconds
Simplifying assumptions • Operator repairs any non-transient failure after 90 seconds • Web page size is constant • Faults are independent • Each client request is independent of all others (no sessions!) • Request arrival times are determined by a Poisson process (not self-similar) • Simulated clients abandon a connection attempt after 2 seconds and give up on a page load after 8 seconds (see the sketch below)
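To make the client-side assumptions concrete, here is a minimal Python sketch of how a simulator might generate Poisson arrivals and apply the 2-second and 8-second give-up rules. The function and constant names are illustrative, not taken from the actual Mendosus/PRESS test harness.

```python
# Minimal sketch of the client-arrival assumptions above (hypothetical helper,
# not the study's actual client simulator).
import random

def generate_arrivals(rate_per_sec, duration_sec, seed=42):
    """Poisson arrival process: exponentially distributed inter-arrival times."""
    random.seed(seed)
    t, arrivals = 0.0, []
    while True:
        t += random.expovariate(rate_per_sec)  # inter-arrival ~ Exp(rate)
        if t > duration_sec:
            break
        arrivals.append(t)
    return arrivals

# Client give-up thresholds from the assumptions above
CONNECT_ABANDON_SEC = 2.0   # abandon connection attempt after 2 s
PAGE_GIVEUP_SEC = 8.0       # give up on page load after 8 s

def classify_response(connect_time_sec, total_time_sec):
    """Label a simulated request according to the client timeout rules."""
    if connect_time_sec > CONNECT_ABANDON_SEC:
        return "aborted_connection"
    if total_time_sec > PAGE_GIVEUP_SEC:
        return "user_timeout"
    return "served"
```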
Sample result: app crash (charts: throughput and latency over time, Perf-PRESS vs. HA-PRESS)
Sample result: node hang (charts: throughput and latency over time, Perf-PRESS vs. HA-PRESS)
Representing latency • Total seconds of wait time • Not good for comparing cases with different workloads • Average (mean) wait time per request • OK, but requires that expected (normal) response time be given separately • Variance of wait time • Not very intuitive to describe. Also, read-only workload means that all variance is toward longer wait times anyway
Representing latency (2) • Consider "goodput"-based availability: (total responses served) / (total requests) • Idea: latency-based "punctuality": (ideal total latency) / (actual total latency) • Like goodput, the maximum value is 1 • "Ideal" total latency: average latency for non-fault cases × total # of requests (shouldn't be 0)
Representing latency (3) • Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience) • Can capture these in a separate statistic (e.g., 1% of 100,000 responses took >8 sec); see the sketch below
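As a concrete illustration, the following Python sketch computes goodput-based availability, the punctuality ratio defined above, and a separate tail statistic for responses slower than 8 seconds. The function name and return fields are assumptions made for this example, not part of the study.

```python
# Hypothetical latency summary along the lines sketched above.

def summarize_latency(latencies_sec, ideal_mean_sec, total_requests,
                      spike_threshold_sec=8.0):
    """latencies_sec: per-response wait times for the responses actually served."""
    actual_total = sum(latencies_sec)
    # "Ideal" total latency: non-fault mean latency x total number of requests
    ideal_total = ideal_mean_sec * total_requests
    # Clip at 1 so that, like goodput, punctuality never exceeds 1
    punctuality = min(1.0, ideal_total / actual_total) if actual_total > 0 else 1.0
    # Separate statistic for brief, severe spikes (e.g. responses over 8 s)
    spike_fraction = sum(1 for l in latencies_sec if l > spike_threshold_sec) / total_requests
    return {
        "availability": len(latencies_sec) / total_requests,  # goodput-based
        "punctuality": punctuality,
        "spike_fraction": spike_fraction,
    }
```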
Other metrics • Data quality, latency and throughput are interrelated • Is a 5-second wait for a response "worse" than waiting 1 second to get a "try back later"? • To combine DQ, latency and throughput, can use a "demerit" system (proposed by Keynote)¹ • These can be very arbitrary, so it's important that the demerit formula be straightforward and publicly available ¹ Zona Research and Keynote Systems, The Need for Speed II, 2001
Sample demerit system • Rules: • Each aborted (2 s) connection: 2 demerits • Each connection error: 1 demerit • Each user timeout (8 s): 8 demerits • Each second of total latency above the ideal level: (1 demerit / total # of requests) × scaling factor • (Chart of demerits by fault type: app hang, app crash, node crash, node freeze, link down)
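A minimal sketch of how the sample demerit rules above could be scored; the scaling factor is an assumed parameter, since the study's exact value is not given here.

```python
# Sketch of the sample demerit rules above (latency_scale is assumed).

def demerits(aborted_conns, conn_errors, user_timeouts,
             total_latency_sec, ideal_latency_sec, total_requests,
             latency_scale=100.0):
    score = 0.0
    score += 2 * aborted_conns   # each aborted (2 s) connection: 2 demerits
    score += 1 * conn_errors     # each connection error: 1 demerit
    score += 8 * user_timeouts   # each user timeout (8 s): 8 demerits
    # each second of total latency above the ideal level:
    # (1 demerit / total # of requests) x scaling factor
    excess_sec = max(0.0, total_latency_sec - ideal_latency_sec)
    score += excess_sec * (1.0 / total_requests) * latency_scale
    return score
```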
Online service optimization (tradeoff diagram): axes are cost of operations & components, performance metrics (throughput, latency & data quality), and environment (workload & faults); the corners range from cheap, robust & fast (optimal) through cheap, fast & flaky; expensive, robust and fast; cheap & robust, but slow; expensive, fast & flaky; and expensive & robust, but slow
Conclusions • Latency-based punctuality and throughput-based availability give similar results for a read-only web workload • The applied workload is very important • Reliability metrics do not (and should not) reflect maximum performance/workload! • Latency did not degrade gracefully in proportion to workload • At high loads, PRESS "oscillates" between full service and 100% load shedding
Further Work • Combine test results & predicted component failure rates to get long-term performability estimates (are these useful?) • Further study will benefit from more sophisticated client & workload simulators • Services that generate dynamic content should lead to more interesting data (ex: RUBiS)
Example: long-term model Discrete-time Markov chain (DTMC) model of a RAID-5 disk array¹ • D = number of data disks • p_i(t) = probability that the system is in state i at time t • w_i(t) = reward (disk I/O operations/sec) in state i • μ = disk repair rate • λ = failure rate of a single disk drive ¹ Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
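For illustration only, here is a small Python sketch of a three-state RAID-5 reward model in the spirit of this backup slide: it iterates a DTMC transition matrix and computes the expected reward Σ_i p_i(t)·w_i(t). The per-step probabilities, reward values, and absorbing data-loss state are assumptions for this sketch, not figures from Kari's thesis.

```python
# Illustrative 3-state RAID-5 performability reward model (all numbers assumed).
import numpy as np

lam, mu = 1e-5, 0.1   # per-step disk failure / repair probabilities (assumed)
D = 4                 # number of data disks (assumed)

# States: 0 = all disks up, 1 = one disk failed (degraded), 2 = data loss
P = np.array([
    [1 - (D + 1) * lam, (D + 1) * lam,    0.0],
    [mu,                1 - mu - D * lam, D * lam],
    [0.0,               0.0,              1.0],    # data loss treated as absorbing
])
w = np.array([200.0, 120.0, 0.0])  # reward: disk I/O operations/sec per state (assumed)

p = np.array([1.0, 0.0, 0.0])      # start with all disks operational
for _ in range(24 * 365):          # one simulated year of hourly steps (assumed step size)
    p = p @ P

expected_reward = p @ w            # E[W(t)] = sum_i p_i(t) * w_i(t)
print(f"State probabilities after one year: {p}")
print(f"Expected I/O throughput (ops/sec): {expected_reward:.1f}")
```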