ROC@Stanford Progress Report
Armando Fox with George Candea, James Cutler, Ben Ling, Andy Huang
Philosophical Direction
• Use only dynamic, observed behavior to determine recovery technique/policy
• Application-independent recovery techniques
• Specialize designs for fast recovery
• Putting it all together: all software should be crash-only
Dynamic, Observed Behavior
• A priori fault models are suspect. Base recovery strategy only on dynamically observed behavior.
• Behavior may change as system or workload evolves => addresses a key difference between Internet-oriented ROC systems and traditional mission-critical systems
• Kinds of observations
  • PinPoint: use statistical analysis to determine which groups of components are correlated with observed external faults
  • Automatic failure-propagation inference: use fault injection and tracing to determine propagation paths and extent of different kinds of faults
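A minimal sketch of the first kind of observation, assuming each request is logged as the set of components it touched plus a pass/fail flag. The scoring rule (fraction of a component's requests that failed) is an illustrative stand-in and not PinPoint's actual analysis, which looks at groups of components. All names and the sample traces are hypothetical.

```python
from collections import defaultdict

def failure_scores(traces):
    """traces: iterable of (components_used, failed) pairs.
    Returns {component: fraction of its requests that failed}."""
    seen = defaultdict(int)
    failed_count = defaultdict(int)
    for components, failed in traces:
        for c in set(components):
            seen[c] += 1
            if failed:
                failed_count[c] += 1
    return {c: failed_count[c] / seen[c] for c in seen}

# Hypothetical request traces: (components touched, did the request fail?)
traces = [
    ({"web", "cart", "db"}, True),
    ({"web", "catalog", "db"}, False),
    ({"web", "cart"}, True),
    ({"web", "catalog"}, False),
]
scores = failure_scores(traces)
# Components most correlated with observed failures come first.
suspects = sorted(scores, key=scores.get, reverse=True)
```

Nothing here relies on an a priori fault model: the ranking comes only from observed requests, so it shifts as the workload or the system changes.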
Making techniques application-generic
• True application-generic recovery is hard [Lowell & Chen]
  • But that’s because “generic” applications are too unconstrained
• Idea: if an application uses a particular “rich runtime”, that runtime may constrain application structure
• Example: J2EE, a widely used enterprise application framework
  • Modular Java applications, well-defined component boundaries
  • Rich runtime system (“application server”) provides services for deployment/undeployment, naming, load balancing, integration with Web servers & databases, etc.
• Instrument the platform with generic methods for fault injection and recovery (e.g., using Recursive Restartability)
  • Generic mechanisms: timeouts, exception propagation
  • Parametrizable mechanisms: progress counters, application-level pings
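One way to picture a parametrizable mechanism is a progress-counter monitor whose only application-specific input is a timeout. The sketch below is an assumption-laden illustration, not the instrumentation described in the talk. The class name, method names, and the explicit `now` argument (used to keep the example deterministic) are all hypothetical.

```python
class ProgressMonitor:
    """Flags a component as stalled when its progress counter stops advancing."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # the per-application parameter
        self.last = {}               # component -> (counter, time it last advanced)

    def report(self, component, counter, now):
        prev = self.last.get(component)
        # Record the time only when the counter actually moves forward.
        if prev is None or counter > prev[0]:
            self.last[component] = (counter, now)

    def stalled(self, now):
        return sorted(c for c, (_, t) in self.last.items()
                      if now - t > self.timeout_s)

mon = ProgressMonitor(timeout_s=5.0)
mon.report("cart", 1, now=0.0)
mon.report("catalog", 1, now=0.0)
mon.report("cart", 2, now=4.0)      # cart advances
mon.report("catalog", 1, now=4.0)   # catalog is alive but makes no progress
stalled_at_6 = mon.stalled(now=6.0)
```

The second `catalog` report shows why a progress counter is stronger than a plain ping: the component still answers, but the monitor treats it as stalled because its counter has not moved within the timeout.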
Example: Automatic Failure Propagation Inference
• When a failure occurs in a particular software component of an application, how far does it propagate?
  • i.e., what part(s) of the application must be recovered
• Traditionally, failure propagation information is derived by hand
• Our approach: modify J2EE application server to allow capture of failure-propagation information in any J2EE app
• Automatic Failure-Propagation Inference (AFPI) for JBoss:
  + automatically and dynamically generates f-maps (failure-propagation maps) with no performance overhead
  + no application knowledge required
  + finds dependencies that other analyses might miss, omits “false” dependencies that don’t result in actual failure propagation
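The idea of deriving an f-map from injected faults can be sketched with a toy simulation. This is not the JBoss instrumentation: the call graph, the `MASKS` set, and the function names are hypothetical, and in a real system the observations would come from the instrumented application server and not from a simulated propagation rule.

```python
# Hypothetical toy application: component -> components it calls.
CALLS = {
    "web": ["cart", "catalog"],
    "cart": ["db"],
    "catalog": ["cache"],
    "db": [],
    "cache": [],
}
# Components that mask callee failures (e.g., catalog falls back when the cache is down).
MASKS = {"catalog"}

def observe_failures(injected):
    """Simulate injecting a fault into one component; return every component observed failing."""
    failed = {injected}
    changed = True
    while changed:
        changed = False
        for comp, callees in CALLS.items():
            if comp in failed or comp in MASKS:
                continue
            if any(c in failed for c in callees):
                failed.add(comp)
                changed = True
    return failed

def build_f_map():
    """f-map: injected component -> other components its failure propagated to."""
    return {c: observe_failures(c) - {c} for c in CALLS}

f_map = build_f_map()
```

In this toy setup `catalog` calls `cache`, yet a `cache` fault propagates nowhere because `catalog` masks it. That is the kind of "false" dependency a static call-graph analysis would report and an observation-based f-map omits.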
Design for Fast Recovery
• Recursive Restartability as a technique for recovery assumes...
  • For correctness: all components are independent and restartable (i.e., no data loss or other bad effects)
  • For performance: restarts are relatively fast
• For stateless components, this is “easy”; what about stateful components?
  • Correctness: e.g., filesystems may suffer data loss if the OS is not cleanly shut down
  • Performance: e.g., commercial RDBMSs are crash-safe, but take a long time (minutes to hours) to recover
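A rough sketch of how restart-based recovery can escalate, assuming components are arranged in a restart tree where restarting a node restarts its whole subtree. The `Node` class, the `recover` function, and the health check are hypothetical illustrations, not the authors' implementation.

```python
class Node:
    """A node in a hypothetical restart tree; restarting a node restarts its whole subtree."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for ch in self.children:
            ch.parent = self

    def leaves(self):
        if not self.children:
            return [self.name]
        return [n for ch in self.children for n in ch.leaves()]

def recover(node, is_healthy, restart_log):
    """Restart the smallest subtree first; escalate to the parent while the failure persists."""
    while node is not None:
        restart_log.append(node.leaves())   # components restarted at this step
        if is_healthy(node):
            return node.name
        node = node.parent
    return None

cart, db, web = Node("cart"), Node("db"), Node("web")
backend = Node("backend", [cart, db])
root = Node("app", [Node("frontend", [web]), backend])

log = []
# Assume the fault shows up in cart but is only cured once db is restarted as well.
cured_at = recover(cart, lambda n: "db" in n.leaves(), log)
```

The cost of each escalation step is the restart time of a larger subtree, which is why the performance assumption matters: if any component in the subtree recovers slowly, every escalation through it does too.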
Fast-Recovering State Stores
• Isolate state exclusively in state store components; make all other “application logic” components stateless
• Instead of building a general state store, specialize it for its intended use
• Goal: identify combination of specializations that facilitates construction of a very-large-scale state store (O(10^3) requests/sec on O(10^6) entries) with near-zero recovery time
• Possible axes for specialization…
  • Is state shared across clients or not? (user profile/session state vs. updating a message board)
  • How powerful must the query API be? (single-key lookup, free-text search, fully relational…)
  • What is the intended lifetime of state? (short/session, long/forever)
Putting it together: crash-only software
• Already assumed: software must be able to recover from a crash rapidly and correctly
• But if it can do that…then why include separate code paths for “clean shutdown”?
• All software should be crash-only; this makes it robust, easy to administer/upgrade, and amenable to RR as a recovery technique (among others)
• Current explorations:
  • RR-ifying the platform (J2EE appserver) vs. individual applications
  • Improving ability to detect anomalies and failure correlations using path-based statistical analysis
  • Designing crash-only state stores for both session state and persistent state
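To make the "no separate clean-shutdown path" point concrete, here is a minimal sketch of a crash-only component, assuming its state fits in one small file and that an atomic file rename is an acceptable commit step. The class, the file layout, and the use of `del` to stand in for a crash are all hypothetical.

```python
import json
import os
import tempfile

class CrashOnlyCounter:
    """Hypothetical crash-only component: there is no shutdown(); start-up is always recovery."""

    def __init__(self, path):
        self.path = path
        self.value = self._recover()

    def _recover(self):
        # The same code path runs on first start and after any crash.
        try:
            with open(self.path) as f:
                return json.load(f)["value"]
        except FileNotFoundError:
            return 0

    def increment(self):
        new_value = self.value + 1
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"value": new_value}, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic rename on POSIX: the state file holds the old or the new value, never a partial write.
        os.replace(tmp, self.path)
        self.value = new_value

state_file = os.path.join(tempfile.mkdtemp(), "counter.json")
c = CrashOnlyCounter(state_file)
c.increment()
c.increment()
del c   # "crash": the object is dropped with no shutdown step
recovered = CrashOnlyCounter(state_file).value
```

Because every update is committed before it is acknowledged, stopping the component by killing it loses nothing, and a leftover `.tmp` file from an interrupted write is simply ignored on the next start.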
Outrageous Opinions session tomorrow
• Tomorrow after dinner: controversial ideas/opinions, open challenges, predicting the future, ...
• Please sign up on easel (coming this afternoon)
• ~5-8 minutes per person to pound the pulpit and stimulate later discussion
• Retreat proceedings, slides, etc. (mostly) online
  • Internet keyword “retreat” :-) or http://retreat or 10.0.0.1