Latency as a Performability Metric for Internet Services Pete Broadwell pbwell@cs.berkeley.edu
Outline • Performability background/review • Latency-related concepts • Project status • Initial test results • Current issues
Motivation • A goal of the ROC project: develop metrics to evaluate new recovery techniques • Problem: the basic concept of availability assumes a system is either “up” or “down” at a given time • “Nines” describe only the fraction of uptime over a certain interval
Why Is Availability Insufficient? • Availability doesn’t describe the durations or frequencies of individual outages • Both can strongly influence user perception of a service, as well as revenue • Availability doesn’t capture a system’s capacity to support degraded service • degraded performance during failures • reduced data quality during high load (Web)
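The first bullet can be made concrete with a small sketch (the outage patterns are invented for illustration): two services with identical “three nines” availability over a month, but radically different user experiences.

```python
# Illustrative sketch, not from the talk: two services share the same
# availability yet have very different outage patterns.
MONTH_SEC = 30 * 24 * 3600  # 2,592,000 s in a 30-day month
# Both services are down 0.1% of the month (2592 s total).
outages_a = [2592]      # service A: one ~43-minute outage
outages_b = [3] * 864   # service B: 864 three-second blips

def availability(outage_durations_sec, period_sec=MONTH_SEC):
    """Fraction of the period the service was up."""
    return 1 - sum(outage_durations_sec) / period_sec

# Identical availability ("three nines") despite very different
# outage durations and frequencies.
print(availability(outages_a), availability(outages_b))
```

Availability alone cannot distinguish these two cases, which is exactly the gap performability metrics try to fill.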
What is “performability”? • Combination of performance and dependability measures • Classical definition: a probabilistic (model-based) measure of a system’s “ability to perform” in the presence of faults1 • Concept from the traditional fault-tolerant systems community, ca. 1978 • Has since been applied to other areas, but still not in widespread use 1 J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
Performability Example Discrete-time Markov chain (DTMC) model of a RAID-5 disk array1 • D = number of data disks • pi(t) = probability that the system is in state i at time t • wi(t) = reward (disk I/O operations/sec) • μ = disk repair rate • λ = failure rate of a single disk drive 1 Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
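A minimal sketch of this kind of model, using assumed rates and rewards rather than the values from Kari’s thesis (and a continuous-time chain for simplicity): three states — all disks up, one disk failed (degraded), array failed — with the expected steady-state reward Σᵢ pᵢ·wᵢ as the performability measure. The transition structure and the rebuild-from-failure rate are my own assumptions.

```python
# Hedged sketch of a 3-state Markov performability model of a RAID-5
# array. All numbers are assumptions for illustration only.
import numpy as np

D = 4          # number of data disks (assumed)
lam = 1e-5     # per-disk failure rate, 1/hour (assumed)
mu = 0.5       # repair rate, 1/hour (assumed)

# States: 0 = all disks up, 1 = one disk failed (degraded), 2 = failed.
# Generator matrix Q; rows sum to zero. The failed->up "rebuild" rate
# of mu is an assumption, not part of the cited model.
Q = np.array([
    [-(D + 1) * lam, (D + 1) * lam,   0.0],
    [mu,            -(mu + D * lam),  D * lam],
    [mu,             0.0,            -mu],
])

# Steady-state probabilities p solve p Q = 0 with sum(p) = 1.
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
p, *_ = np.linalg.lstsq(A, b, rcond=None)

w = np.array([1000.0, 600.0, 0.0])  # reward: IOPS per state (assumed)
performability = p @ w              # expected steady-state throughput
print(performability)
```

The point of the exercise: the metric blends performance (the per-state rewards wᵢ) with dependability (the state probabilities pᵢ) into one number.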
Visualizing Performability [Figure: throughput (I/O operations/sec) over time — normal throughput until a FAILURE, degraded throughput between DETECT and REPAIR, then RECOVER back to normal; a line marks the average throughput]
Metrics for Web Services • Throughput – requests/sec • Latency – render time, time to first byte • Data quality • harvest (response completeness) • yield (% of queries answered)1 1 E. Brewer, Lessons from Giant-Scale Internet Services, 2001
Applications of Metrics • Modeling the expected failure-related performance of a system, prior to deployment • Benchmarking the performance of an existing system during various recovery phases • Comparing the reliability gains offered by different recovery strategies
Related Projects • HP: Automating Data Dependability • uses “time to data access” as one objective for storage systems • Rutgers: PRESS/Mendosus • evaluated throughput of PRESS server during injected failures • IBM: Autonomic Storage • Numerous ROC projects
Arguments for Using Latency as a Metric • Originally, performability metrics were meant to capture the end-user experience1 • Latency better describes the experience of an end user of a web site • response time > 8 sec = site abandonment = lost revenue2 • Throughput describes the raw processing capacity of a service • best used to quantify expenses 1 J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994 2 Zona Research and Keynote Systems, The Need for Speed II, 2001
Current Progress • Using the Mendosus fault-injection system on a 4-node PRESS web server (both from Rutgers) • Running latency-based performability tests on the cluster • Injecting faults during load tests • Recording page-load times before, during and after each fault
Test Setup [Diagram: test clients send page requests through an emulated switch to the PRESS web server + Mendosus, which returns responses; the server nodes exchange caching info] • Normal version: cooperative caching • HA version: cooperative caching + heartbeat monitoring
Effect of Component Failure on Performability Metrics [Figure: performability metric (throughput and latency curves) over time, between FAILURE and REPAIR]
Observations • Below saturation, throughput is more dependent on load than latency is • Above saturation, latency is more dependent on load [Figure: measured points over time — throughput 3/s at latency 0.14 s; 6/s at 0.14 s; 7/s at 0.4 s]
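This saturation behavior can be illustrated with a textbook M/M/1 queue (a modeling sketch, not the actual PRESS measurements — the service rate below is assumed): mean response time is 1/(μ − λ) for arrival rate λ below the service rate μ, so latency stays nearly flat at low load and grows sharply near saturation, while throughput simply tracks the offered load.

```python
# Hedged illustration of latency vs. load using an M/M/1 queue.
# mu (service rate) is an assumed value, not a PRESS measurement.
def mm1_latency(lam, mu=10.0):
    """Mean time in system for an M/M/1 queue, arrival rate lam < mu."""
    assert lam < mu, "queue is unstable at or above saturation"
    return 1.0 / (mu - lam)

# Below saturation latency barely moves; near saturation it blows up.
for lam in (3.0, 6.0, 9.0, 9.9):
    print(f"load {lam:4.1f}/s -> mean latency {mm1_latency(lam):.2f} s")
```

This is why latency and throughput can rank the same failure very differently depending on how close the system runs to saturation.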
How to Represent Latency? • Average response time over a given time period • Make a distinction between “render time” and “time to first byte”? • Deviation from a baseline latency • Impose a greater penalty for deviations toward longer wait times?
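Two of these candidates can be sketched directly (the function names, sample values, and the 2× slowdown penalty are my own illustrative choices, not from the talk):

```python
# Sketch of two candidate latency representations from the slide above.
def mean_latency(samples):
    """Average response time over a measurement window (seconds)."""
    return sum(samples) / len(samples)

def baseline_deviation(samples, baseline, slow_penalty=2.0):
    """Mean deviation from a baseline latency, where deviations toward
    longer wait times are penalized slow_penalty times more heavily."""
    total = 0.0
    for t in samples:
        d = t - baseline
        total += slow_penalty * d if d > 0 else -d
    return total / len(samples)

samples = [0.12, 0.15, 0.14, 0.90]  # made-up page-load times (s)
print(mean_latency(samples))                  # dominated by the outlier
print(baseline_deviation(samples, 0.14))      # weights the slowdown
```

The asymmetric penalty captures the intuition that a page loading 0.8 s slower hurts users far more than one loading 0.02 s faster helps them.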
Response Time with Load Shedding Policy [Figure: response time (sec) vs. time between FAILURE and REPAIR — response time is held below the 8 s abandonment threshold by a load-shedding threshold; X users get a “server too busy” msg]
Load Shedding Issues • Load shedding means returning 0% data quality – a different kind of performability metric • To combine load shedding and latency, define a “demerit” system: • “server too busy” msg – 3 demerits • 8 sec response time – 1 demerit/sec • Such systems quickly lose generality, however
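One possible reading of that demerit scheme as code (the exact scoring rule is ambiguous on the slide, so this interpretation — 3 demerits per shed request, 1 demerit per second of response time at or past the 8 s threshold — is an assumption):

```python
# Hedged sketch of a "demerit" scoring system combining load shedding
# and latency. The scoring rule is one interpretation of the slide.
BUSY_DEMERITS = 3.0   # per "server too busy" rejection
THRESHOLD = 8.0       # abandonment threshold, seconds

def demerits(responses):
    """responses: latency in seconds per request, or None if the
    request was shed (user saw 'server too busy')."""
    score = 0.0
    for t in responses:
        if t is None:              # load shed: 0% data quality
            score += BUSY_DEMERITS
        elif t >= THRESHOLD:       # 1 demerit/sec at/past the threshold
            score += t - THRESHOLD + 1
    return score

print(demerits([None, 2.0, 9.0]))
```

The arbitrariness of the constants (why 3 demerits? why linear past 8 s?) is exactly the loss of generality the slide warns about.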
Further Work • Collect more experimental results! • Compare throughput-based and latency-based results for the normal and high-availability versions of PRESS • Evaluate the usefulness of “demerit” systems for describing the user experience (latency and data quality)