Toward Recovery-Oriented Computing • Armando Fox, Stanford University • David Patterson, UC Berkeley • and a cast of tens
Outline • Whither recovery-oriented computing? • research/industry agenda of last 15 years • today’s pressing problem: availability (we knew that) - but what is new/different compared to previous F/T work, databases, etc? • Recovery-Oriented Computing as an approach to availability • Motivation and philosophy • sampling of research avenues • what ROC is not
Reevaluating goals & assumptions • Goals of last 15 years • Goal #1: Improve performance • Goal #2: Improve performance • Goal #3: Improve cost-performance • Assumptions • Humans are perfect (they don’t make mistakes during installation, wiring, upgrade, maintenance or repair) • Software will eventually be bug free (good programmers will write bug-free code, debugging works) • Hardware MTBF is already very large (~100 years between failures), and will continue to increase
Results of this successful agenda • Good news: faster computers, denser disks, lower cost • computation faster by >3 orders of magnitude • disk capacity greater by >3 orders of magnitude • Result: TCO dominated by administration, not hardware cost • Bad news: complex, brittle systems that fail frequently • 65% of IT managers report that their websites were unavailable to customers over a 6-month period (25%: 3 or more outages) [Internet Week, 4/3/2000] • outage costs: negative press, “click overs” to competitor, stock price, market cap… • Yet availability is the key metric for online services!
Direct Downtime Costs (per Hour)
• Brokerage operations: $6,450,000
• Credit card authorization: $2,600,000
• Ebay (22-hour outage): $225,000
• Amazon.com: $180,000
• Package shipping services: $150,000
• Home shopping channel: $113,000
• Catalog sales center: $90,000
• Airline reservation center: $89,000
• Cellular service activation: $41,000
• On-line network fees: $25,000
• ATM service fees: $14,000
Sources: InternetWeek 4/3/2000 and Fibre Channel: A Comprehensive Introduction, R. Kembel, 2000, p. 8 (“...based on a survey done by Contingency Planning Research.”)
So, what are today’s challenges? • We all seem to agree on goals • Dave Patterson, IPTS 2002: ACME “availability, change, maintenance, evolution” • Jim Gray, HPTS 2001: FAASM “functionality, availability, agility, scalability, manageability” • Butler Lampson, SOSP 1999: “Always available, evolving while they run, growing without practical limit” • John Hennessy, FCRC 1999: “Availability, maintainability and ease of upgrades, scalability” • Fox & Brewer, HotOS 1997: BASE “best-effort service, availability, soft state, eventual consistency” • We’re all singing the same tune, but what is new?…
What’s New and Different • Evolution and change are integral • not true of many “traditional” five nines systems: long design cycle, changes incur high overhead for design/spec/testing • Last version of space shuttle software: 1 bug in 420 KLOC, cost $35M/yr to maintain (good quality commercial SW: 1 bug/KLOC) • But, recent upgrade for GPS support required generating 2,500 pages of specs before changing anything in 6.3 KLOC (1.5%) • Performance still important, but focus changed • Interactive performance and availability to end users is key • Users appear willing to occasionally tolerate temporary degradation (“service quality”) in exchange for improved availability • How to capture this tradeoff: soft/stale state, partial performance degradation, imprecise answers…
ROC Philosophy • ROC philosophy (“Peres’s Law”): “If a problem has no solution, it may not be a problem, but a fact; not to be solved, but to be coped with over time” (Shimon Peres) • Failures (hardware, software, operator-induced) are a fact; recovery is how we cope with them over time • Availability = MTTF/MTBF = MTTF / (MTTF + MTTR). Rather than just making MTTF very large, make MTTR << MTTF • Why? • Human errors will still cause outages => minimize recovery time • Recovery time is directly measurable, and directly captures the impact on users of a specific outage incident (MTTF doesn’t) • Rapid evolution makes exhaustive testing/validation impossible => unexpected/transient failures will still occur
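A quick worked example of the formula above (the numbers are illustrative, not from the talk): cutting MTTR by a factor of ten buys exactly the same availability as raising MTTF by a factor of ten, and the MTTR improvement is far easier to measure and verify.

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
\qquad
\underbrace{\frac{1000\ \mathrm{h}}{1000\ \mathrm{h} + 1\ \mathrm{h}}}_{\approx\,0.9990}
\qquad
\underbrace{\frac{1000\ \mathrm{h}}{1000\ \mathrm{h} + 0.1\ \mathrm{h}}}_{\approx\,0.9999}
\;=\;
\underbrace{\frac{10000\ \mathrm{h}}{10000\ \mathrm{h} + 1\ \mathrm{h}}}_{\approx\,0.9999}
```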
1. Human Error Is Inevitable • Human error major factor in downtime… • PSTN: Half of all outage incidents and outage-minutes from 1992-1994 were due to human error (including errors by phone company maintenance workers) • Oracle: up to half of DB failures due to human error (1999) • Microsoft blamed human error for ~24-hour outage in Jan 2001 • Approach: • Learn from psychology of human error and disaster case studies • Build in system support for recovery from human errors • Use tools such as error injection, virtual machine technology to provide “flight simulator” training for operators
The 3R undo model • Undo == time travel for system operators • Three R’s for recovery • Rewind: roll system state backwards in time • Repair: change system to prevent failure • e.g., edit history, fix latent error, retry unsuccessful operation, install preventative patch • Replay: roll system state forward, replaying end-user interactions lost during rewind • All three R’s are critical • rewind enables undo • repair lets user/administrator fix problems • replay preserves updates, propagates fixes forward
Example e-mail scenario • Before undo: • virus-laden message arrives • user copies it into a folder without looking at it • Operator invokes undo (rewind) to install virus filter (repair) • During replay: • message is redelivered but now discarded by virus filter • copy operation is now unsafe (source message doesn’t exist) • compensating action: insert placeholder for message • now copy command can be executed, making history replay-acceptable
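A minimal sketch of how the three R's could fit together in code, under stated assumptions; this is not the Stanford/Berkeley implementation, and the names (`UndoLog`, the `is_safe`/`compensate`/`apply` hooks on operations, the `restore` method on the system) are illustrative. Rewind restores a pre-failure snapshot, repair happens out of band (e.g., installing the virus filter), and replay re-applies logged end-user operations, substituting a compensating action, such as the placeholder message above, when an operation is no longer safe.

```python
import copy
import time


class UndoLog:
    """Illustrative 3R log: a state snapshot plus timestamped end-user operations."""

    def __init__(self, system_state):
        self.snapshot = copy.deepcopy(system_state)   # non-overwriting copy used for rewind
        self.ops = []                                 # (timestamp, operation) pairs

    def record(self, op):
        # Called by the wrapper as end-user operations arrive.
        self.ops.append((time.time(), op))

    def rewind(self, system):
        # Rewind: roll system state back to the snapshot taken before the failure.
        system.restore(copy.deepcopy(self.snapshot))

    def replay(self, system):
        # Replay: re-execute end-user operations lost during rewind.
        # Repair (e.g., installing a virus filter) happens between rewind and replay.
        for _, op in self.ops:
            if not op.is_safe(system):
                op = op.compensate(system)            # e.g., copy a placeholder message instead
            op.apply(system)
```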
First implementation attempt • Undo wrapper for open source IMAP email store • [Architecture diagram: a 3R layer, consisting of a 3R proxy, state tracker, undo log, and non-overwriting storage, interposes on the SMTP and IMAP traffic to the email server; the wrapped state includes user state, mailboxes, the application, and the operating system]
3. Handling Transient Failures via Restart • Many failures are either (a) transient and fixable through reboot, or (b) non-transient, but reboot is the lowest-MTTR fix • Recursive Restarts: to minimize MTTR, restart the minimal set of subsystems that could cure a failure; if that doesn’t help, restart the next-higher containing set, etc. • Partial restarts/reboots • Return system (mostly) to well-tested, well-understood start state • High-confidence way to reclaim stale/leaked resources • Unlike true checkpointing, reboot more likely to avoid repeated failure due to corrupted state • We focus on reactive (curative) restarts; restarts can also be proactive (SW rejuvenation) • “Easier to run a system 365 times for 1 day than 365 days” • Goals: • What is the software structure that can best accommodate such failure management while still preserving all other requirements (functionality, performance, consistency, etc.)? • Develop methodology for building and managing RR systems (concrete engineering methods) • Develop the tools for building, testing, deploying, and managing RR systems • Design for fast restartability in online-service building blocks
A Hierarchy of Restartable Units • Siblings highly fault-isolated • low level: by high-confidence, low-level, HW-assisted machinery (e.g., MMU, physical isolation) • higher level: by VM-level abstractions based on the above machinery (e.g., JVM, HW VM, process) • R-map (= hierarchy of restartable component groups) captures restart dependencies • Groups of restart units can be restarted by common parent • Restarting a node restarts everything in its subtree • A failure is minimally curable at a specific node • Restarts farther up the tree are more expensive, but higher confidence for curing transients
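A sketch of how an r-map and the recursive-restart policy from the two slides above might look in code; the names (`RestartUnit`, `recursive_restart`) are illustrative, and the restart and health-check callbacks are assumed to be supplied by the surrounding system. A failure is first treated at the smallest unit suspected of causing it, then escalated to larger, more expensive, higher-confidence restarts up the tree.

```python
class RestartUnit:
    """One node of an r-map: a fault-isolated group of components that can be
    restarted together; restarting a node restarts everything in its subtree."""

    def __init__(self, name, restart_fn, children=()):
        self.name = name
        self._restart_fn = restart_fn          # restarts this node's whole subtree
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def restart(self):
        print(f"restarting {self.name}")
        self._restart_fn()


def recursive_restart(unit, is_healthy):
    """Restart the minimal suspected unit; escalate to its ancestors until the
    failure is cured or even restarting the root fails to help."""
    while unit is not None:
        unit.restart()
        if is_healthy():
            return unit                        # the unit whose restart cured the failure
        unit = unit.parent
    raise RuntimeError("failure not cured even by restarting the root")
```

In a ground-station r-map, for example, a telemetry decoder might be a leaf under the data-pipeline subtree, so a wedged decoder costs a sub-second component restart rather than a full system reboot that loses the satellite pass.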
RR-ifying a satellite ground station • Biggest improvement: MTTF/MTTR-based boundary redrawing • Ability to isolate unstable components without penalizing whole system • Achieve a balanced MTTF/MTTR ratio across components at the same level • Lower MTTR may be strictly better than higher MTTF • unplanned downtime is more expensive than planned downtime, and downtime under a heavy/critical workload (e.g., satellite pass) is more expensive than downtime under a light/non-critical workload • high MTTF doesn’t guarantee a failure-free operation interval, but sufficiently low MTTR may mitigate the impact of a failure • Current work is applying RR to a ubiquitous computing environment, a J2EE application server, and an OSGi-based platform for cars; new lessons will emerge (e.g., the r-tree needs to be an r-DAG) • Most of these lessons are not surprising, but RR provides a uniform framework within which to discuss them
MTTR Captures Outage Costs • Recent software-related outages at Ebay: 4.5 hours in Apr02, 22 hours Jun99, 7 hours May99, 9 hours Dec98 • Assume two 4-hour (“newsworthy”) outages/year • A=(182*24 hours)/(182*24 + 4 hours) = 99.9% • Dollar cost: Ebay policy for >2 hour outage, fees credited to all affected users (US$3-5M for Jun99) • Customer loyalty: after Jun99 outage, Yahoo Auctions reported statistically significant increase in users • Ebay’s market cap dropped US$4B after Jun99 outage, stock price dropped 25% • Newsworthy due to number of users affected, given length of outage
Outage costs, cont. • What about a 10-minute outage once per week? • A=(7*24 hours)/(7*24 + 1/6 hours) = 99.9% - the same • Can we quantify “savings” over the previous scenario? • Shorter outages affect fewer users at a time • Typical AOL email “outage” affects 1-2% of users • Many short outages may affect different subsets of users • Shorter outages typically not news-worthy
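The arithmetic behind the two scenarios, as a quick check (figures taken from the two slides above):

```python
def availability(uptime_hours, downtime_hours):
    return uptime_hours / (uptime_hours + downtime_hours)

# Two "newsworthy" 4-hour outages per year, i.e. one per 182-day half-year:
print(round(availability(182 * 24, 4), 5))      # 0.99909 -> ~99.9%

# One 10-minute outage every week:
print(round(availability(7 * 24, 1 / 6), 5))    # 0.99901 -> ~99.9%, the same headline figure
```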
When Low MTTR Trumps High MTTF • MTTR is directly measurable; MTTF usually not • Component MTTF’s -> tens of years • Software MTTF ceiling -> ~30 yrs (Gray, HDCC 01) • Result: “measuring” MTTF requires 100’s of system-years • But, MTTR’s are minutes to hours, even for complex SW components • MTTR more directly captures impact of a specific outage • Very low MTTR (~10 seconds) achievable with redundancy and failover • Keeps response time below user threshold of distraction [Miller 1968, Bhatti et al 2001, Zona Research 1999]
Degraded Service vs. Outage • How about longer MTTR’s (minutes or hours)? • Can service be designed so that “short” outages appear to users as temporary degradation instead? • How much degradation will users tolerate? • For how long (until they abandon the site because it feels like a true outage - abandonment can be measured) • How frequently? • Even if above thresholds can be deduced, how to design service so that transient failures can be mapped onto degraded quality?
Examples of degraded service • Goal: derive a set of service “primitives” that directly reflect parameterizable degradation due to transient failure (“theory” is too strong…)
Two Frequently Asked Questions • Is ROC the same as autonomic computing™? • Are you saying we should build lousy hardware and software and mask all those failures with ROC mechanisms?
1. Does ROC==autonomic computing? • Self-administering? • For now, focus on empowering administrators, not eliminating them • Humans are good at detecting and learning from own mistakes, so why not? (avoiding automation irony) • We’re not sure we understand sysadmins’ current techniques well enough to think about automation • Self-healing, self-reprovisioning, self-load-balancing…? • Sure - Web services and datacenters already do this for many situations; many techniques and tools are “well known” • But - do we know how (“theory”) to design the app software to make these techniques possible • Digital immune system - it’s in WinXP
2. What ROC is not • We do not advocate for… • producing buggy software • building lousy hardware • slacking on design, testing, or careful administration • discarding existing useful techniques or tools • We do advocate for… • an increased focus on lowering MTTR specifically • increased examination of when some guarantees can be traded for lower MTTR • systematic exploration of “design for fast recovery” in the context of a variety of applications • stealing great ideas from systems, Internet protocols, psychology, safety-critical systems design
Summary: ROC and Online Services • Current software realities lead to new foci • Rapid evolution => traditional FT methodologies difficult to apply • Human error inevitable, but humans are good at identifying own errors => provide facilities to allow recovery from these • HW and SW failure inevitable => use redundancy and designed-in ability to substitute temporary degradation for outages (“design for recovery”) • Trying to stay relevant via direct contact with designers/operators of large systems • Need real data on how large systems fail • Need real data on how different kinds of failures are perceived by users
Interested in ROCing? • Are you willing to anonymously share failure data? • Already great relationships (and in some cases data-sharing agreements) with BEA, IBM, HP, Keynote, Microsoft, Oracle, Tellme, Yahoo!, others • See http://roc.stanford.edu or http://roc.cs.berkeley.edu for publications, talks, research areas, etc. • Contact Armando Fox (fox@cs.stanford.edu) or Dave Patterson (patterson@cs.berkeley.edu)
Discussion Question • [For discussion] So what if you pick the low hanging fruit? The challenge is in reaching the highest leaves.