
Latency as a Performability Metric: Experimental Results




  1. Latency as a Performability Metric: Experimental Results Pete Broadwell pbwell@cs.berkeley.edu

  2. Outline • Motivation and background • Performability overview • Project summary • Test setup • PRESS web server • Mendosus fault injection system • Experimental results & analysis • How to represent latency • Questions for future research

  3. Performability overview • Goal of ROC project: develop metrics to evaluate new recovery techniques • Performability – class of metrics to describe how a system performs in the presence of faults • First used in the fault-tolerant computing field [1] • Now being applied to online services [1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994

  4. Example: microbenchmark RAID disk failure

  5. Project motivation • Rutgers study: performability analysis of a web server, using throughput • Other studies (esp. from HP Labs Storage group) also use response time as a metric • Assertion: latency and data quality are better than throughput for describing user experience • How best to represent latency in performability reports?

  6. Project overview • Goals: • Replicate PRESS/Mendosus study with response time measurements • Discuss how to incorporate latency into performability statistics • Contributions: • Provide a latency-based analysis of a web server’s performability (currently rare) • Further the development of more comprehensive dependability benchmarks

  7. Experiment components • The Mendosus fault injection system • From Rutgers (Rich Martin) • Goal: low-overhead emulation of a cluster of workstations, injection of likely faults • The PRESS web server • Cluster-based, uses cooperative caching. Designed by Carreira et al. (Rutgers) • Perf-PRESS: basic version • HA-PRESS: incorporates heartbeats and a master node for automated cluster management • Client simulators • Submit a set number of requests/sec, based on real traces

  8. Mendosus design [Diagram labels: Workstations (real or VMs); Global Controller (Java); Modified NIC driver; SCSI module; proc module; Apps config file; LAN emu config file; Fault config file; User-level daemon (Java); apps; Emulated LAN]

  9. Experimental setup

  10. Fault types

  11. Test case timeline - Warm-up time: 30-60 seconds - Time to repair: up to 90 seconds

  12. Simplifying assumptions • Operator repairs any non-transient failure after 90 seconds • Web page size is constant • Faults are independent • Each client request is independent of all others (no sessions!) • Request arrival times are determined by a Poisson process (not self-similar) • Simulated clients abandon connection attempt after 2 secs, give up on page load after 8 secs
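
The arrival and timeout assumptions above can be sketched in a few lines. This is a minimal illustration, not the client simulator used in the study: the function names, the seed, and the way the two timeouts are applied to a single request are all assumptions made here for clarity. A Poisson arrival process is generated by drawing exponentially distributed gaps between requests.

```python
import random

# Timeouts taken from the slide; the names are hypothetical.
CONNECT_TIMEOUT_SEC = 2.0   # client abandons the connection attempt
PAGE_TIMEOUT_SEC = 8.0      # client gives up on the page load


def poisson_arrival_times(rate_per_sec, duration_sec, seed=0):
    """Return request arrival times in [0, duration_sec) for a Poisson process.

    Inter-arrival gaps of a Poisson process are exponentially distributed
    with mean 1 / rate_per_sec.
    """
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(rate_per_sec)
        if t >= duration_sec:
            return times
        times.append(t)


def classify_request(connect_delay_sec, response_time_sec):
    """Classify one independent request (assumed semantics, no sessions)."""
    if connect_delay_sec > CONNECT_TIMEOUT_SEC:
        return "aborted"
    if response_time_sec > PAGE_TIMEOUT_SEC:
        return "timeout"
    return "served"
```

Because each request is assumed independent, every arrival can be classified on its own, with no per-client state.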

  13. Sample result: app crash [Plots: throughput and latency for Perf-PRESS and HA-PRESS]

  14. Sample result: node hang [Plots: throughput and latency for Perf-PRESS and HA-PRESS]

  15. Representing latency • Total seconds of wait time • Not good for comparing cases with different workloads • Average (mean) wait time per request • OK, but requires that expected (normal) response time be given separately • Variance of wait time • Not very intuitive to describe. Also, read-only workload means that all variance is toward longer wait times anyway

  16. Representing latency (2) • Consider “goodput”-based availability: (total responses served) / (total requests) • Idea: latency-based “punctuality”: (ideal total latency) / (actual total latency) • Like goodput, maximum value is 1 • “Ideal” total latency: (average latency for non-fault cases) × (total # of requests) (shouldn’t be 0)
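
The two ratios on this slide can be computed as follows. This is a sketch under stated assumptions: the function names are hypothetical, latencies are per-request wait times in seconds, and the result is clamped to 1 to match the stated maximum (the slide does not say how a run faster than the "ideal" should be handled).

```python
def availability(responses_served, total_requests):
    """Goodput-based availability: responses served / requests issued."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return responses_served / total_requests


def punctuality(actual_latencies_sec, ideal_mean_latency_sec):
    """Latency-based punctuality: ideal total latency / actual total latency.

    The ideal total is the fault-free mean latency times the number of
    requests. Clamping to 1.0 is an assumption made for this sketch.
    """
    if ideal_mean_latency_sec <= 0:
        raise ValueError("ideal mean latency must be positive")
    actual_total = sum(actual_latencies_sec)
    if actual_total <= 0:
        raise ValueError("actual total latency must be positive")
    ideal_total = ideal_mean_latency_sec * len(actual_latencies_sec)
    return min(1.0, ideal_total / actual_total)
```

For example, four requests with waits of 0.1, 0.1, 0.3 and 0.5 seconds against a fault-free mean of 0.1 seconds give an ideal total of 0.4 seconds, an actual total of 1.0 second, and a punctuality of about 0.4.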

  17. Representing latency (3) • Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience) • Can capture these in a separate statistic (e.g., 1% of 100k responses took > 8 sec)
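
A separate spike statistic of this kind is simple to compute. The sketch below assumes a list of per-request wait times in seconds; the function name and the default 8-second threshold (borrowed from the client timeout) are illustrative choices, not part of the original study.

```python
def fraction_over_threshold(latencies_sec, threshold_sec=8.0):
    """Return the fraction of responses whose wait exceeded threshold_sec."""
    if not latencies_sec:
        raise ValueError("need at least one latency sample")
    slow = sum(1 for latency in latencies_sec if latency > threshold_sec)
    return slow / len(latencies_sec)
```

Reporting this value alongside aggregate punctuality keeps short, severe spikes visible even when the totals look healthy.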

  18. Availability and punctuality

  19. Other metrics • Data quality, latency and throughput are interrelated • Is a 5-second wait for a response “worse” than waiting 1 second to get a “try back later”? • To combine DQ, latency and throughput, can use a “demerit” system (proposed by Keynote) [1] • These can be very arbitrary, so it’s important that the demerit formula be straightforward and publicly available [1] Zona Research and Keynote Systems, The Need for Speed II, 2001

  20. Sample demerit system • Rules: • Each aborted (2 s) connection: 2 demerits • Each connection error: 1 demerit • Each user timeout (8 s): 8 demerits • Each second of total latency above the ideal level: (1 demerit / total # of requests) × scaling factor [Chart categories: app hang, app crash, node crash, node freeze, link down]
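
The rules above translate directly into a scoring function. This is one reading of the slide, not a definitive implementation: the function and parameter names are hypothetical, the scaling factor has no value in the source and defaults to 1.0 here, and latency below the ideal level is assumed to earn no credit.

```python
def demerit_score(aborted_conns, conn_errors, user_timeouts,
                  actual_total_latency_sec, ideal_total_latency_sec,
                  total_requests, scaling_factor=1.0):
    """Sum demerits for one test run using the sample rules."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Only latency above the ideal level is penalized (assumption).
    excess_sec = max(0.0, actual_total_latency_sec - ideal_total_latency_sec)
    latency_demerits = excess_sec * scaling_factor / total_requests
    return (2 * aborted_conns      # aborted after the 2 s connect timeout
            + 1 * conn_errors      # connection errors
            + 8 * user_timeouts    # gave up after the 8 s page timeout
            + latency_demerits)
```

With 3 aborted connections, 5 connection errors, 2 user timeouts, 50 seconds of excess latency over 1000 requests and a scaling factor of 10, the score is 6 + 5 + 16 + 0.5 = 27.5 demerits.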

  21. Online service optimization [Diagram labels: Cost of operations & components; Performance metrics: throughput, latency & data quality; Environment: workload & faults. Regions: cheap, robust & fast (optimal); cheap, fast & flaky; expensive, robust and fast; cheap & robust, but slow; expensive, fast & flaky; expensive & robust, but slow]

  22. Conclusions • Latency-based punctuality and throughput-based availability give similar results for a read-only web workload • Applied workload is very important • Reliability metrics do not (and should not) reflect maximum performance/workload! • Latency did not degrade gracefully in proportion to workload • At high loads, PRESS “oscillates” between full service and 100% load shedding

  23. Further Work • Combine test results & predicted component failure rates to get long-term performability estimates (are these useful?) • Further study will benefit from more sophisticated client & workload simulators • Services that generate dynamic content should lead to more interesting data (e.g., RUBiS)

  24. Latency as a Performability Metric: Experimental Results Pete Broadwell pbwell@cs.berkeley.edu

  25. Example: long-term model Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1]; D = number of data disks; p_i(t) = probability that the system is in state i at time t; w_i(t) = reward (disk I/O operations/sec); μ = disk repair rate; λ = failure rate of a single disk drive [1] Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
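
A reward model of this kind is usually evaluated as the expected reward at time t, the sum over states of p_i(t) × w_i(t). The sketch below is an illustration only and does not reproduce the model in the cited thesis. It assumes a three-state chain (all disks working, one disk failed, data lost), an array of D data disks plus one parity disk, per-step failure and repair probabilities standing in for the rates, and reward values chosen by the caller.

```python
def raid5_transition_matrix(data_disks, fail_prob, repair_prob):
    """Build an assumed 3-state DTMC: 0 = healthy, 1 = degraded, 2 = data lost."""
    first_failure = (data_disks + 1) * fail_prob   # any of D + 1 disks fails
    second_failure = data_disks * fail_prob        # another disk fails before repair
    if first_failure > 1.0 or second_failure + repair_prob > 1.0:
        raise ValueError("per-step probabilities are too large")
    return [
        [1.0 - first_failure, first_failure, 0.0],
        [repair_prob, 1.0 - repair_prob - second_failure, second_failure],
        [0.0, 0.0, 1.0],                           # data loss is absorbing
    ]


def step(distribution, matrix):
    """Advance the state distribution by one time step."""
    n = len(distribution)
    return [sum(distribution[i] * matrix[i][j] for i in range(n))
            for j in range(n)]


def expected_reward(distribution, rewards):
    """Expected reward: sum over states of p_i(t) * w_i(t)."""
    return sum(p * w for p, w in zip(distribution, rewards))
```

Starting from the distribution [1, 0, 0] and calling step repeatedly gives p_i(t) for each later step; expected_reward then yields the expected I/O rate at that step.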
