Exploring Failure Transparency and the Limits of Generic Recovery • Dave Lowell, Compaq Western Research Lab • Subhachandra Chandra and Peter M. Chen, University of Michigan
Introduction • Failure transparency: abstraction of failure-free operation • OS recovers app after hardware, OS, and application failures • No programmer help • No slowdown • Will explore theory, performance, and limitations
Consistent recovery • Visible output equivalent to failure-free run • equivalence: allows duplicates • avoids “exactly once” problem • Failure transparency: consistent recovery with generic techniques
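The equivalence on this slide can be illustrated with a small check. This is a minimal sketch under a simplifying assumption that is not spelled out on the slide: a recovered run is treated as equivalent when its visible output is the failure-free output plus repeats of events already emitted. The function name `is_equivalent` and the string-valued events are hypothetical.

```python
def is_equivalent(recovered, failure_free):
    """Return True if `recovered` is `failure_free` plus duplicates.

    Assumed model: each recovered event is either the next event the
    failure-free run would emit, or a repeat of one already emitted.
    """
    next_new = 0  # index of the next not-yet-emitted failure-free event
    for event in recovered:
        if next_new < len(failure_free) and event == failure_free[next_new]:
            next_new += 1  # new output: progress along the failure-free run
        elif event in failure_free[:next_new]:
            continue  # duplicate of earlier output: tolerated
        else:
            return False  # output the failure-free run would never show here
    # Every failure-free event must eventually have been emitted.
    return next_new == len(failure_free)


failure_free = ["a", "b", "c"]
print(is_equivalent(["a", "b", "a", "b", "c"], failure_free))  # rollback re-emits a, b
print(is_equivalent(["a", "c"], failure_free))                 # b is missing
```

Tolerating duplicates is what lets recovery sidestep the “exactly once” problem: the system may re-emit output after a rollback rather than having to prove the output was delivered.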
Guaranteeing consistent recovery • Key players: non-deterministic events, visible events, commit events • Save-work invariant (simplified): • There’s a commit after each non-deterministic event that happens-before a visible event. • Full theorem handles liveness, distinguishes causality and ordering
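The simplified Save-work invariant can be checked mechanically on an event trace. The sketch below assumes a single process, so that happens-before reduces to program order, and assumes the commit must fall between the non-deterministic event and the visible event. The event labels (`"nd"`, `"visible"`, `"commit"`, `"other"`) and the function name are hypothetical, and the liveness and causality distinctions of the full theorem are not modeled.

```python
def satisfies_save_work(trace):
    """Check the simplified Save-work invariant on a single-process trace.

    `trace` is a list of event labels: "nd", "visible", "commit", "other".
    Assumption: program order stands in for happens-before.
    """
    uncommitted_nd = False  # True while some ND event has no later commit
    for event in trace:
        if event == "nd":
            uncommitted_nd = True
        elif event == "commit":
            uncommitted_nd = False  # a commit covers all earlier ND events
        elif event == "visible" and uncommitted_nd:
            return False  # visible event depends on an uncommitted ND event
    return True


print(satisfies_save_work(["nd", "commit", "visible"]))  # commit covers the ND event
print(satisfies_save_work(["nd", "visible", "commit"]))  # commit comes too late
```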
Save-work protocol space (figure) • Axes: effort to commit only visible events; effort to identify/convert ND events • Protocols: CAND, CAND-LOG, CPVS, CBNDVS, CBNDVS-LOG, CBNDV-2PC, CPV-2PC • Label: Commit All • Existing systems placed in the space: Manetho, Coord. Checkpointing, Optimistic Logging, Targon/32, Hypervisor, SBL • Trends marked across the space: increasing simplicity, application failure recovery, increasing recovery time, increasing performance
Performance study • Discount Checking: fast checkpoints to reliable memory (Rio) • Logging and two-phase commit • Disk version • Mostly interactive applications • Localized and distributed
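Nvi Text Editor • Overhead per protocol (reliable memory / disk) • CBNDVS: 1% / 42% • CBNDVS-LOG: 0% / 12% • CPVS: 1% / 44% • CAND: 1% / 43% • CAND-LOG: 0% / 13% • Figure axes: effort to commit only visible events; effort to identify/convert ND events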
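TreadMarks Barnes-Hut • Overhead per protocol (reliable memory / disk) • CPV-2PC: 12% / 319% • CBNDV-2PC: 12% / 252% • CBNDVS: 101% / 5743% • CBNDVS-LOG: 73% / 4973% • CPVS: 129% / 7346% • CAND: 199% / 11499% • CAND-LOG: 126% / 7700% • Figure axes: effort to commit only visible events; effort to identify/convert ND events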
Have only considered “stop” failures • Committing everything is okay • Save-work: when we must commit • Some failures affect application state • Can we commit too much?
Lose-work invariant • To recover from propagation failure, never commit on a “dangerous path”. • Save-work and Lose-work conflict! • Visible event on dangerous path • Can’t guarantee consistent recovery from propagation failures • Do we see this conflict in practice?
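The conflict between the two invariants can be shown on toy traces. This sketch rests on assumptions that go beyond the slide: a single process, a dangerous path modeled as everything between a `"fault"` event and the following `"crash"`, and a brute-force search over where commits could be inserted. All names and event labels are hypothetical.

```python
from itertools import combinations


def save_work_ok(trace):
    """Simplified Save-work: no visible event after an uncommitted ND event."""
    pending_nd = False
    for event in trace:
        if event == "nd":
            pending_nd = True
        elif event == "commit":
            pending_nd = False
        elif event == "visible" and pending_nd:
            return False
    return True


def lose_work_ok(trace):
    """Assumed Lose-work model: no commit between a fault and its crash."""
    on_dangerous_path = False
    for event in trace:
        if event == "fault":
            on_dangerous_path = True
        elif event == "crash":
            on_dangerous_path = False
        elif event == "commit" and on_dangerous_path:
            return False
    return True


def commit_placements(trace):
    """Yield every trace formed by inserting commits into gaps of `trace`."""
    gaps = list(range(len(trace) + 1))
    for count in range(len(gaps) + 1):
        for chosen in combinations(gaps, count):
            placed = []
            for index, event in enumerate(trace):
                if index in chosen:
                    placed.append("commit")
                placed.append(event)
            if len(trace) in chosen:
                placed.append("commit")
            yield placed


def both_satisfiable(trace):
    """True if some commit placement satisfies both invariants."""
    return any(save_work_ok(t) and lose_work_ok(t)
               for t in commit_placements(trace))


# ND event before the fault: a commit can go before the dangerous path.
print(both_satisfiable(["nd", "fault", "visible", "crash"]))
# ND and visible events both on the dangerous path: no placement works.
print(both_satisfiable(["fault", "nd", "visible", "crash"]))
```

In the second trace, Save-work demands a commit between the ND event and the visible event, while the assumed Lose-work model forbids any commit there, which is the conflict the slide describes.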
Measuring Lose-work violations • Fault-injection study: OS crashes • injected faults into running kernel • induced 350 OS crashes • recovered nvi and postgres using Discount Checking • Results • nvi: 15% of crashes violate Lose-work • postgres: 3% of crashes violate Lose-work
Application crashes • Fault-injection study: ND bugs • nvi: 37% violate Lose-work • postgres: 33% violate Lose-work • Published bug distributions: 85-95% of application bugs are deterministic • intrinsically violate Lose-work • Perhaps > 90% of app crashes violate Lose-work!
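One way to arrive at the "perhaps > 90%" estimate is to combine the two figures on this slide. The combination rule below is an assumption, not something the slide states: every deterministic bug is counted as a violation, and the measured ND-bug violation rate is applied to the remaining share. The function name is hypothetical.

```python
def violating_fraction(deterministic_share, nd_violation_rate):
    """Estimated fraction of application crashes that violate Lose-work.

    Assumption: all deterministic bugs violate Lose-work, and the
    non-deterministic remainder violates it at `nd_violation_rate`.
    """
    nd_share = 1.0 - deterministic_share
    return deterministic_share + nd_share * nd_violation_rate


# Slide figures: 85-95% deterministic; 33% (postgres) and 37% (nvi) for ND bugs.
for deterministic_share in (0.85, 0.95):
    for nd_violation_rate in (0.33, 0.37):
        estimate = violating_fraction(deterministic_share, nd_violation_rate)
        print(f"{deterministic_share:.0%} deterministic, "
              f"{nd_violation_rate:.0%} ND violations: {estimate:.1%}")
```

Under these assumptions the estimate ranges from roughly 90% to 97%, consistent with the figure on the slide.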
Conclusions • Save-work and Lose-work invariants • Save-work protocol space • Invariants fundamentally conflict • Failure transparency performance: • 0-12% overhead on reliable memory • 13-40% overhead on disk (interactive apps) • > 90% of application failures violate Lose-work