Why Recovery Should Be Free, And Often Can Be. Armando Fox, Stanford University. June 2003 ROC Retreat. Recovery Should Be Free, and Can Be. Already espouse arguments about lowering MTTR: Mitigates impact on service as a whole [Fox & Patterson, 2002]
Why Recovery Should Be Free, And Often Can Be • Armando Fox, Stanford University • June 2003 ROC Retreat
Recovery Should Be Free, and Can Be • We already espouse arguments for lowering MTTR: • Mitigates impact on service as a whole [Fox & Patterson, 2002] • Results in higher end-user-perceived availability, given same overall availability [Xie et al. 2002] • etc. • Tim Chou, Oracle: it may be more important to make recovery predictable (so one can plan provisioning, anticipate the impact of an outage, etc.); if we understand it, we can optimize its speed
Real win: Recovery management is hard • Determining when to recover is hard • How to detect that something’s wrong? • How do you know when recovery is really necessary? (fail-stutter, etc.) • Will recovery make things worse? (cascading recovery) • Knowing what happens when you recover is hard • Will a particular recovery technique work? (the machinery needed to perform the recovery may also be broken) • What is the effect on online performance? (recovery can be expensive) • What if you needlessly “over-recover”? (cost of making a mistake is high) • If recovery were predictable and fast, it would simplify both failure detection and recovery management.
Simplifying Recovery Management: Crash-Only Software • Goal: enforce simple invariants on recovery behavior, from outside the component(s) being recovered • Crash-only component provides PWR switch: stop = crash: • clean shutdown = loss of power = kernel panic = ... • One way to go down, one way to come up: start = recover • Power switch is external, hence uniform behavior • kill -9, “turning off” (process kill) a VM, pulling the power cord • Intuition: the “infrastructure” supporting the power switch is usually simpler than the applications using it, and common across all those applications • Can crash-only software actually be built, and if so, how? • (a) provide building blocks • (b) formalize C/O definition and provide developer
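The "stop = crash, start = recover" invariant can be illustrated with a minimal Python sketch. It is not taken from JAGR, SSM, or DStore: the class name `CrashOnlyStore`, the append-only JSON log, and the string-key restriction are all assumptions made for illustration. The component has no shutdown method at all; every update is made durable before it is acknowledged, and every start replays the log.

```python
import json
import os
import tempfile


class CrashOnlyStore:
    """Toy crash-only key-value component (hypothetical; keys must be strings)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._recover()                        # start = recover, every time
        self._log = open(log_path, "ab")

    def _recover(self):
        """Replay the log, then cut off any torn record left by a crash."""
        if not os.path.exists(self.log_path):
            return
        good_end = 0
        with open(self.log_path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break                      # record was cut short mid-write
                try:
                    key, value = json.loads(raw.decode("utf-8"))
                except (ValueError, TypeError):
                    break                      # corrupt record: stop replaying
                self.data[key] = value
                good_end += len(raw)
        with open(self.log_path, "r+b") as f:
            f.truncate(good_end)               # drop everything after the last good record

    def put(self, key, value):
        record = json.dumps([key, value]).encode("utf-8") + b"\n"
        self._log.write(record)
        self._log.flush()
        os.fsync(self._log.fileno())           # durable before acknowledging
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "store.log")
    store = CrashOnlyStore(path)
    store.put("cart", 3)
    del store                                  # simulated crash: no shutdown call exists
    with open(path, "ab") as f:
        f.write(b'["cart", ')                  # simulated torn write from a crash
    print(CrashOnlyStore(path).get("cart"))    # prints 3
```

Because the recovery path runs on every start, it is exercised constantly rather than only after rare failures. The price in this sketch is a synchronous write per update, which is the performance cost that clean-shutdown paths were historically meant to avoid.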
Crash-only Building Blocks • JAGR/ROC-2, a self-recovering J2EE app server [Candea et al., WIAPP 2003] • Micro-reboots used for recovery, application-generic failure-path inference used for determining recovery strategy • Significantly improves performability relative to whole-app redeploy • SSM: a CO session state manager [Ling, Fox, AMS 2003] • DStore: a CO persistent single-key state manager [Huang, Fox, submitted to SRDS 2003] • Similar in spirit to HP Labs FAB [Frolund, Saito et al., 2003] • Common features of both SSM and DStore: • Redundancy used for persistence • Workload semantics exploited to simplify consistency model & recovery • Recovery=restart, safe to reboot any node at any time • Safe to coerce any failure to a crash (fail-stop) at any time
Building blocks, cont. • Pinpoint, statistical-anomaly-based failure detection • Standard tension: accuracy vs. precision (false positives problem) • Different clustering techniques seem to be good at detecting different kinds of problems • Surprising result from a CS241 project: character-frequency histograms are a good app-generic way to detect end-user-visible failures • Mostly integrated with JAGR and SSM • On burner: discussions with BEA Systems for integrating into WebLogic Server • Insight: if cost of “over-recovering” is low, aggressive statistics-based failure detection becomes more appealing
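The character-frequency idea can be sketched in a few lines of Python. The slide only states that such histograms work well as an app-generic detector; the Euclidean distance, the threshold of 0.1, and the function names below are assumptions for illustration, not Pinpoint's actual method.

```python
import math
from collections import Counter


def char_histogram(text):
    """Relative frequency of each character in text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def histogram_distance(p, q):
    """Euclidean distance between two frequency histograms."""
    chars = set(p) | set(q)
    return math.sqrt(sum((p.get(c, 0.0) - q.get(c, 0.0)) ** 2 for c in chars))


def looks_anomalous(response, baseline, threshold=0.1):
    """Flag a response whose histogram is far from the known-good baseline."""
    return histogram_distance(char_histogram(response), baseline) > threshold


if __name__ == "__main__":
    good = "<html><body><h1>Cart</h1><p>3 items, total $42.50</p></body></html>"
    also_good = "<html><body><h1>Cart</h1><p>5 items, total $17.25</p></body></html>"
    error = "java.lang.NullPointerException at com.example.Cart.total(Cart.java:42)"
    baseline = char_histogram(good)
    print(looks_anomalous(also_good, baseline))  # prints False
    print(looks_anomalous(error, baseline))      # prints True
```

A detector this crude will produce false positives. That is tolerable only under the slide's premise that over-recovering is cheap.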
Toward a crash-only formalism • Component frameworks force you into certain app-writing patterns • Inter-EJB calls through runtime-managed level of indirection • Restrictions on how persistent state mgt can be expressed • Restrictions on state sharing: difficult to do without using explicit external store • Hypothesis: these are the elements that allow C/O to work • Ongoing work: formalize crash-only SW • One possibility: observational equivalence with respect to a request stream • Can be expressed using a design pattern or denotational semantics • Ideally, will lead to a tool (“co-lint”) telling you whether your component is crash-only
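One way to picture "observational equivalence with respect to a request stream" is a bounded test: replay the same requests with a crash and restart injected at each possible point, and compare the responses with a crash-free run. The Python sketch below is only that, a test harness under stated assumptions (crashes happen between requests, one crash per run, all names hypothetical). It is not the formalism or the "co-lint" tool the slide describes as ongoing work.

```python
def run(factory, store, requests, crash_before=None):
    """Feed requests to a component, optionally crashing and restarting it once."""
    component = factory(store)
    responses = []
    for i, request in enumerate(requests):
        if i == crash_before:
            component = factory(store)  # crash: discard the instance; restart = recover
        responses.append(component.handle(request))
    return responses


def observationally_crash_only(factory, requests):
    """True if no single injected crash changes the observed responses."""
    reference = run(factory, {}, requests)
    return all(run(factory, {}, requests, crash_before=i) == reference
               for i in range(len(requests)))


class ExternalStateCounter:
    """Keeps all state in an explicit external store that survives restarts."""

    def __init__(self, store):
        self.store = store

    def handle(self, request):
        self.store[request] = self.store.get(request, 0) + 1
        return self.store[request]


class InMemoryCounter:
    """Keeps state inside the component, so a crash silently loses it."""

    def __init__(self, store):
        self.counts = {}

    def handle(self, request):
        self.counts[request] = self.counts.get(request, 0) + 1
        return self.counts[request]


if __name__ == "__main__":
    stream = ["a", "b", "a"]
    print(observationally_crash_only(ExternalStateCounter, stream))  # prints True
    print(observationally_crash_only(InMemoryCounter, stream))       # prints False
```

The component that passes is the one that follows the framework restriction listed on the slide: state is shared only through an explicit external store.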
Summary: Toward a Crash-only World • Goal: simplify recovery management • diagnosis: statistical methods even more appealing if the cost of making a mistake is low • recovery: crash-only enforces invariants about what happens when recovery is attempted • allows aggressive use of fault model enforcement [Martin et al 2002] • Good progress on providing building blocks for app writers • JAGR: J2EE app server that allows fast recovery via micro-reboots and application-generic fault injection • SSM: a crash-only session state store (in process of integrating with JAGR) • DStore: a crash-only persistent single-key store • Pinpoint: statistics-based failure detection (integrated with JAGR, mostly integrated with SSM)
Xie et al: MTTR and End-User Availability • Let AU = user-perceived unavailability, AS = system unavailability • Hypothesis: if users retry failed requests, and retry succeeds because system had fast recovery, they will perceive higher availability • When retry rate is sufficiently frequent, AU approaches AS (for AS = 99.3%, this threshold is 200-300 sec) • Method: model user retry behavior and system failure/recovery using Markov models; solve using numerical methods • Finding: Given 2 systems with same AS, the one with shorter MTTR (even though it also has lower MTTF) appears better to the user. • Goal of this project: validate that result empirically (Jeff Raymakers, Yee-Jiun Song, Wendy Tobagus)
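The hypothesis can be illustrated with a deliberately simple model in Python. It is not Xie et al.'s Markov model. It assumes a deterministic cycle of `mttf` seconds up followed by `mttr` seconds down, users arriving uniformly at random, and a user who keeps retrying for `patience` seconds before giving up. All three parameter names are hypothetical. A user perceives a failure only when they arrive during an outage whose remaining duration exceeds their patience.

```python
def system_unavailability(mttf, mttr):
    """Fraction of time the system is down in a deterministic up/down cycle."""
    return mttr / (mttf + mttr)


def perceived_unavailability(mttf, mttr, patience):
    """Fraction of arrivals that still see the system down after retrying
    for `patience` seconds (toy model, not the Markov model on the slide)."""
    stuck = max(0.0, mttr - patience)  # part of each outage that outlasts the user
    return stuck / (mttf + mttr)


if __name__ == "__main__":
    # Two systems with the same system unavailability (0.1) but different MTTR.
    for mttf, mttr in [(900.0, 100.0), (90.0, 10.0)]:
        print(mttf, mttr,
              system_unavailability(mttf, mttr),
              perceived_unavailability(mttf, mttr, patience=5.0))
```

With 5 seconds of patience, the slow-recovery system leaves 9.5% of arrivals stuck and the fast-recovery one 5%, although both are down 10% of the time. This toy model is monotone in MTTR, so it cannot reproduce the reversal at low MTTR that the empirical study reports later in the deck.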
User-perceived unavailability vs. retry rate • Beyond the “sweet spot”, higher user retry rates yield little improvement in perceived availability.
Surprise! MTTF eventually catches up with you • At low MTTR, lowering MTTR and MTTF at the same time results in worse user-perceived unavailability! • Plot: variable MTTR at fixed system availability (low MTTR -> low MTTF), with the “sweet spot” marked
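The coupling "low MTTR -> low MTTF" follows from the standard steady-state availability formula. The symbol $A$ below denotes system availability, not the unavailability $AS$ defined on the earlier slide, and the 60-second MTTR is an illustrative assumption rather than a figure from the study.

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
\quad\Longrightarrow\quad
\mathrm{MTTF} = \frac{A}{1 - A}\,\mathrm{MTTR}

% Illustration with A = 0.993 and an assumed MTTR of 60 s:
\mathrm{MTTF} = \frac{0.993}{0.007} \times 60\,\mathrm{s}
\approx 141.9 \times 60\,\mathrm{s}
\approx 8511\,\mathrm{s} \approx 2.4\,\mathrm{h}
```

At fixed $A$, MTTF is proportional to MTTR: halving recovery time also halves the time between failures, so users meet outages twice as often.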
Optimization Choices • [Figure: axes are user-perceived unavailability and system unavailability, with curves for fixed MTTF and fixed MTTR]
Results Summary • We can find a “sweet spot” (for a given system availability) beyond which higher user retry rates yield little benefit. • For two systems of a given availability, the one with lower MTTR does not always yield better user perceived availability. • For a given system, we can determine whether improving MTTR or MTTF will yield more user-visible benefits.
“Clean” shutdown vs. restart? • Impractical to guarantee zero crashes, so robust systems must be crash-safe anyway • In that case, why support any other kind of shutdown? • Historically, for performance (avoid synchronous writes, do buffering/caching, etc) - leads to replicated/mirrored state, more code, special recovery code paths... • Total recovery time may be shorter even if crash is forced • WinXP can be (mostly) crash-rebooted for upgrades • VMS sysadmins would sometimes crash the system rather than shut it down (if no users were logged on) • Crash-only software must: (a) be crash-safe & (b) recover quickly
Why Crash-Only Simplifies Recovery • “Hardware works, software doesn’t” • Hardware interlocks, timers, etc. have small state spaces of behavior, hence high confidence they will work as designed • Crash-only PWR switch is a way to approach that same property for software • Crash-only makes recovery policies easier to reason about • Opportunity to aggressively apply SW rejuvenation • “Recovery” code exercised on every restart; no exotic-but-rarely-used code paths • “Over-recovery” may be OK from performability standpoint: if recovery is free (performance & correctness), you stop thinking about it as recovery and start thinking about it as a normal aspect of operation
Towards a Crash-Only World • Existing software that is crash-only or near-crash-only • Stateless apps: most Web servers • Most RDBMS’s: crash-safe, but long recovery • Postgres, BerkeleyDB/Sleepycat: “recovery” codepath is the main codepath • Some appliance storage devices: separate but pretty fast recovery path • Our goals... • Focus on Internet (“3 tier”) applications; already “crash-mostly” except for persistence tier(s) • Make the app server, middle-tier persistence, and back-end tier (to the extent possible) truly crash-only • Deploy application-generic failure detection techniques (which may over-recover, but the goal is to make that OK) • Quantify improvement (we hope!) in performability resulting from these changes • By doing it in the middleware, any app on that middleware can benefit