ROC@Stanford Progress Report

Armando Fox, with George Candea, James Cutler, Ben Ling, Andy Huang

Philosophical Direction
• Use only dynamic, observed behavior to determine recovery technique/policy
• Application-independent recovery techniques
• Specialize designs for fast recovery
• Putting it all together: all software should be crash-only

Dynamic, Observed Behavior
• A priori fault models are suspect. Base recovery strategy only on dynamically observed behavior.
• Behavior may change as system or workload evolves => addresses a key difference between Internet-oriented ROC systems and traditional mission-critical systems
• Kinds of observations
  • PinPoint: use statistical analysis to determine which groups of components are correlated with observed external faults
  • Automatic failure-propagation inference: use fault injection and tracing to determine propagation paths and extent of different kinds of faults
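The PinPoint bullet above can be illustrated with a small sketch. It assumes each request is logged as the set of components it touched plus a flag saying whether an external fault was observed. The scoring used here (fraction of a component's requests that failed) is a deliberately simple stand-in, not PinPoint's actual statistical analysis, and the names `rank_suspects` and `traces` are hypothetical.

```python
from collections import defaultdict

def rank_suspects(traces):
    """traces: iterable of (components_used, failed) pairs.

    Returns (component, failure_fraction) pairs, most suspicious first.
    Ties are broken alphabetically so the result is deterministic.
    """
    seen = defaultdict(int)       # requests that touched the component
    failed_in = defaultdict(int)  # of those, how many failed
    for components, failed in traces:
        for c in set(components):
            seen[c] += 1
            if failed:
                failed_in[c] += 1
    scores = {c: failed_in[c] / seen[c] for c in seen}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Illustrative request log (made-up component names).
traces = [
    ({"web", "cart", "db"}, True),
    ({"web", "catalog", "db"}, False),
    ({"web", "cart"}, True),
    ({"web", "catalog"}, False),
]
ranking = rank_suspects(traces)
print(ranking)
```

No fault model is supplied anywhere: the ranking comes only from observed request outcomes, so it follows the system as components or workload change.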

Making techniques application-generic
• True application-generic recovery is hard [Lowell & Chen]
• But that’s because “generic” applications are too unconstrained
• Idea: if an application uses a particular “rich runtime”, that runtime may constrain application structure
• Example: J2EE, a widely used enterprise app. framework
  • Modular Java applications, well-defined component boundaries
  • Rich runtime system (“application server”) provides services for deployment/undeployment, naming, load balancing, integration with Web servers & databases, etc.
• Instrument the platform with generic methods for fault injection and recovery (e.g., using Recursive Restartability)
  • Generic mechanisms: timeouts, exception propagation
  • Parametrizable mechanisms: progress counters, application-level pings
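As an illustration of a parametrizable mechanism, the sketch below shows one way a progress-counter check might look. The platform owns the monitor, and the only per-application parameter is the stall timeout. The class name `ProgressMonitor` and its interface are assumptions for illustration, not part of J2EE or any application server. Timestamps are passed in explicitly so the example is deterministic.

```python
class ProgressMonitor:
    """Flags a component as stalled when its progress counter stops advancing."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # per-application parameter
        self.last_count = None      # last counter value seen
        self.last_change = None     # time at which the counter last changed

    def observe(self, count, now):
        """Record a counter sample taken at time `now` (seconds).

        Returns True if the counter has been unchanged for at least
        timeout_s seconds, i.e. the component looks stalled.
        """
        if count != self.last_count:
            self.last_count = count
            self.last_change = now
            return False
        return (now - self.last_change) >= self.timeout_s

monitor = ProgressMonitor(timeout_s=5)
samples = [(0, 10), (2, 11), (4, 12), (6, 12), (12, 12)]  # (time, counter)
verdicts = [monitor.observe(count, now) for now, count in samples]
print(verdicts)
```

The monitor never inspects what the counter means, which is what keeps the mechanism generic: the application only has to expose a number that grows while it is doing useful work.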

Example: Automatic Failure Propagation Inference
• When a failure occurs in a particular software component of an application, how far does it propagate?
  • i.e., what part(s) of the application must be recovered
• Traditionally, failure propagation information is derived by hand
• Our approach: modify J2EE application server to allow capture of failure-propagation information in any J2EE app
• Automatic Failure-Propagation Inference (AFPI) for JBoss:
  + automatically and dynamically generates f-maps with no performance overhead
  + no application knowledge required
  + finds dependencies that other analyses might miss, and omits “false” dependencies that don’t result in actual failure propagation
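The idea of building a failure-propagation map by injection can be sketched as follows. This is a toy under stated assumptions, not the JBoss implementation: `infer_fmap`, `toy_inject`, and the `CALLERS` table are hypothetical names, and the map here records the full set of components seen to fail after each injection, which may differ from what the real f-maps record.

```python
def infer_fmap(components, inject_and_observe):
    """Build a failure-propagation map by fault injection.

    inject_and_observe(c) must inject a fault into component c and return
    the set of components observed to fail as a result. The application
    is treated as a black box; no knowledge of its structure is used.
    """
    fmap = {}
    for c in components:
        failed = set(inject_and_observe(c))
        failed.discard(c)  # keep only the components the fault spread to
        fmap[c] = failed
    return fmap

# Toy stand-in for an instrumented application server (hypothetical):
# CALLERS[x] lists the components that fail when x fails.
CALLERS = {
    "db": {"cart", "catalog"},
    "cart": {"checkout"},
    "catalog": set(),
    "checkout": set(),
}

def toy_inject(component):
    """Simulate an injected fault spreading to callers, transitively."""
    failed, frontier = {component}, [component]
    while frontier:
        for caller in CALLERS[frontier.pop()]:
            if caller not in failed:
                failed.add(caller)
                frontier.append(caller)
    return failed

fmap = infer_fmap(CALLERS, toy_inject)
print(fmap)
```

Because the map is built from observed failures only, a declared dependency that never actually propagates a failure would simply not appear in it.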

Design for Fast Recovery
• Recursive Restartability as a technique for recovery assumes...
  • For correctness: all components are independent and restartable (i.e., no data loss or other bad effects)
  • For performance: restarts are relatively fast
• For stateless components, this is “easy”; what about stateful components?
  • Correctness: e.g., filesystems may suffer data loss if the OS is not cleanly shut down
  • Performance: e.g., commercial RDBMSs are crash-safe, but take a long time (minutes to hours) to recover
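A minimal sketch of the recursive-restart idea, assuming components are arranged in a restart tree: restart the smallest subtree containing the suspect component, and escalate to the parent only if the system is still unhealthy. The names `RestartNode` and `recover`, and the `restart`/`healthy` callbacks, are hypothetical placeholders for whatever the platform provides.

```python
class RestartNode:
    """A node in a restart tree; leaves are restartable components."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def leaves(self):
        """Names of all components in this subtree."""
        if not self.children:
            return [self.name]
        return [n for c in self.children for n in c.leaves()]

def recover(node, restart, healthy):
    """Restart node's subtree; escalate to the parent while still unhealthy.

    Returns the names of the nodes whose subtrees were restarted, in order.
    """
    attempts = []
    while node is not None:
        for name in node.leaves():
            restart(name)
        attempts.append(node.name)
        if healthy():
            break
        node = node.parent
    return attempts

cache, store = RestartNode("cache"), RestartNode("store")
backend = RestartNode("backend", [cache, store])
root = RestartNode("system", [RestartNode("frontend"), backend])

# Toy scenario: the fault shows up in "cache" but is only cured once
# "store" has been restarted as well.
restarted = []
attempts = recover(cache, restarted.append, lambda: "store" in restarted)
print(attempts, restarted)
```

The sketch relies on both assumptions listed above: each restart must be safe on its own, and restarts must be cheap enough that trying the small subtree first is worthwhile.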

Fast-Recovering State Stores
• Isolate state exclusively in state store components; make all other “application logic” components stateless
• Instead of building a general state store, specialize it for its intended use
• Goal: identify combination of specializations that facilitates construction of a very-large-scale state store (O(10^3) requests/sec on O(10^6) entries) with near-zero recovery time
• Possible axes for specialization…
  • Is state shared across clients or not? (user profile/session state vs. updating a message board)
  • How powerful must the query API be? (single-key lookup, free-text search, fully relational…)
  • What is the intended lifetime of state? (short/session, long/forever)

Putting it together: crash-only software
• Already assumed: software must be able to recover from a crash rapidly and correctly
• But if it can do that…then why include separate code paths for “clean shutdown”?
• All software should be crash-only; this makes it robust, easy to administer/upgrade, and amenable to RR as a recovery technique (among others)
• Current explorations:
  • RR-ifying the platform (J2EE appserver) vs. individual applications
  • Improving ability to detect anomalies and failure correlations using path-based statistical analysis
  • Designing crash-only state stores for both session state and persistent state
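To make the "no separate clean-shutdown path" point concrete, here is a minimal sketch of a crash-only key-value store. It is an assumption-laden toy, not the state store design under exploration: `CrashOnlyStore` is a hypothetical name, the log format (one JSON record per line) is chosen for brevity, and the store has no `close()` or `shutdown()` method at all. Starting it always means recovering from the log.

```python
import json
import os
import tempfile

class CrashOnlyStore:
    """Key-value store whose only start path is recovery from its log."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._recover()
        self.log = open(path, "ab")

    def _recover(self):
        """Replay complete records; cut off a torn record left by a crash."""
        if not os.path.exists(self.path):
            return
        good = 0  # byte offset just past the last complete record
        with open(self.path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # record was cut short mid-write
                try:
                    key, value = json.loads(raw)
                except ValueError:
                    break  # unparseable record: stop replaying here
                self.data[key] = value
                good += len(raw)
        os.truncate(self.path, good)

    def put(self, key, value):
        """Make the update durable before acknowledging it."""
        self.log.write(json.dumps([key, value]).encode("utf-8") + b"\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value

path = os.path.join(tempfile.mkdtemp(), "store.log")
first = CrashOnlyStore(path)
first.put("user", "alice")
first.put("cart", [1, 2])

# Simulate a crash: the OS closes the handle when the process dies,
# and a half-written record is left at the end of the log.
first.log.close()
with open(path, "ab") as f:
    f.write(b'["cart", [1, 2, ')

second = CrashOnlyStore(path)  # restart == recovery
recovered = dict(second.data)
second.put("cart", [3])
second.log.close()             # simulated crash again

third = CrashOnlyStore(path)
print(recovered, third.data)
```

Since every start goes through `_recover`, the recovery path is exercised on each restart rather than only after rare failures, which is part of the robustness argument made above.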

Outrageous Opinions session tomorrow
• Tomorrow after dinner: controversial ideas/opinions, open challenges, predicting the future, ...
• Please sign up on easel (coming this afternoon)
• ~5-8 minutes per person to pound the pulpit and stimulate later discussion
• Retreat proceedings, slides, etc. (mostly) online
  • Internet keyword “retreat” :-) or http://retreat or 10.0.0.1
