Intro & Overview of RADS goals Armando Fox & Dave Patterson CS 444A/CS 294-6, Stanford/UC Berkeley Fall 2004
Administrivia • Course logistics & registration • Project expectations and other deliverables • Background and motivation for RADS • ROC and its relationship to RADS • Early case studies • Discussion: projects, research directions, etc.
Administrivia/goals • Stanford enrollment vs. Axess • SLT and CT tutorial VHS/DVDs available to view • SLT and CT lab/assignments grading policy • Stanford and Berkeley meeting/transportation logistics • Format of course
RADS in One Slide • Philosophy of ROC (Recovery-Oriented Computing): focus on lowering MTTR (mean time to recovery) to improve overall availability • ROC achievements: two levels of lowering MTTR • "Microrecovery": fine-grained, generic recovery techniques that recover only the failed part(s) of the system, at much lower cost than whole-system recovery • Undo: sophisticated tools that help human operators selectively back out destructive actions/changes to a system • General approach: use microrecovery as the "first line of defense"; when it fails, give human operators support so they can avoid "reinstalling the world" • RADS insight: cheap recovery can be combined with statistical anomaly detection techniques
Hence, (at least) 2 parts to RADS • Investigating other microrecovery methods • Investigating analysis techniques • What to capture/represent in a model • Addressing fundamental open challenges • stability • systematic misdiagnosis • subversion by attackers • etc. • General insight: “different is bad” • “law of large numbers” arguments support this for large services
Why RADS • Motivation • 5 9’s availability => 5 down-minutes/year => must recover from (or mask) most failures without human intervention • a principled way to design “self-*” systems • Technology • High-traffic large-scale distributed/replicated services => large datasets • Analysis is CPU-intensive => a way to trade extra CPU cycles for dependability • Large logs/datasets for models => storage is cheap and getting cheaper • RADS addresses a clear need while exploiting demonstrated technology trends
Complex systems of black boxes • “...our ability to analyze and predict the performance of the enormously complex software systems that lies at the core of our economy is painfully inadequate.” (Choudhury & Weikum, 2000 PITAC Report) • Networked services too complex and rapidly-changing to test exhaustively: “collections of black boxes” • Weekly or biweekly code drops not uncommon • Market activities lead to integration of whole systems • Need to get humans out of loop for at least some monitoring/recovery loops • hence interest in “autonomic” approaches • fast detection is often at odds with false alarms
Consequences • Complexity breeds increased bug counts and bug impact • Heisenbugs, race conditions, and environment-dependent, hard-to-reproduce bugs still account for the majority of SW bugs in live systems • up to 80% of bugs found in production are those for which a fix is not yet available* • some application-level failures cause user-visible bad behavior before site monitors detect them • Tellme Networks: up to 75% of downtime is "detection" (sometimes by user complaints), followed by localization • Amazon, Yahoo: gross metrics track second-order effects of bugs, but lag the actual bug by minutes or tens of minutes • Result: downtime and increased management costs * A.P. Wood, Software reliability from the customer view, IEEE Computer, Aug. 2003
“Always adapting, always recovering” • Build statistical models of the “acceptable” operating envelope by measurement & analysis on the live system • Control theory, statistical correlation, anomaly detection... • Detect runtime deviations from the model • typical tradeoff is between detection rate & false-positive rate • Rely on external control, using inexpensive and simple mechanisms that respect the black box, to keep the system within its acceptable operating envelope • invariant: attempting recovery won’t make things worse • makes inevitable false positives tolerable • can then reduce false negatives by tuning algorithms to be more aggressive and/or deploying multiple detectors
Toward recovery management invariants • Observation: instrumentation and analysis • collect and analyze data from running systems • rely on “most systems work most of the time” to automatically derive baseline models • Analysis: detect and localize anomalous behavior • Action: close the loop automatically with “micro-recovery” • “Salubrious”: returns some part of the system to a known state • Reclaim resources (memory, DB conns, sockets, DHCP lease...), throw away corrupt transient state, set up to retry the operation if appropriate • Safe: no effect on correctness, minimal effect on performance • Localized: parts not being microrecovered aren’t affected • Fast recovery simplifies failure detection and recovery management.
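To make the loop concrete, here is a minimal sketch of the observe/analyze/act cycle. The interface names and the "too quiet" detector in main are illustrative assumptions, not the actual RADS implementation.

```java
import java.util.List;
import java.util.Map;

/** Sketch of the observe -> analyze -> act loop described above.
 *  The interfaces, names, and thresholds are illustrative assumptions,
 *  not the actual RADS implementation. */
public class RecoveryLoopSketch {
    interface Monitor { Map<String, Double> collectMetrics(String componentId); }
    interface AnomalyDetector { boolean isAnomalous(String componentId, Map<String, Double> metrics); }
    interface Microrecovery { void microrecover(String componentId); }   // e.g., microreboot, reclaim resources

    private final Monitor monitor;
    private final AnomalyDetector detector;
    private final Microrecovery recovery;
    private final List<String> components;

    RecoveryLoopSketch(Monitor m, AnomalyDetector d, Microrecovery r, List<String> components) {
        this.monitor = m; this.detector = d; this.recovery = r; this.components = components;
    }

    /** One pass: collect data, detect/localize anomalies, close the loop with micro-recovery.
     *  Because micro-recovery is safe and localized, acting on a false positive costs little. */
    void runOnce() {
        for (String c : components) {
            Map<String, Double> metrics = monitor.collectMetrics(c);
            if (detector.isAnomalous(c, metrics)) {
                recovery.microrecover(c);                                // salubrious, localized, fast
            }
        }
    }

    public static void main(String[] args) {
        RecoveryLoopSketch loop = new RecoveryLoopSketch(
            c -> Map.of("reqPerSec", Math.random() * 100),               // fake monitor
            (c, m) -> m.get("reqPerSec") < 5.0,                          // fake detector: "too quiet" is anomalous
            c -> System.out.println("microrecovering " + c),             // fake recovery action
            List.of("EJB-A", "EJB-B", "EJB-C"));
        loop.runOnce();
    }
}
```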
Non-goals/complementary work All of the following are being capably studied by others, and directly compose with our own efforts... • Byzantine fault tolerance • In-place repair of persistent data structures • Hard-real-time response guarantees • Adding checkpointing to legacy non-componentized applications • Source code bug finding • Advancing the state of the art in SLT (analysis algorithms)
Outline • Micro-recoverable systems • Concept of microrecovery • A microrecoverable application server & session state store • Application-generic SLT-based failure detection • Path and component analysis and localization for appserver • Simple time series analyses for purpose-built state store • Combining SLT detection with microrecoverable systems • Discussion, related work, implications & conclusions
Microrebooting: one kind of microrecovery • 60+% of software failures in the field* are reboot-curable, even if the root cause is unknown... why? • Rebooting discards bad temporary data (corrupted data structures that can be rebuilt) and (usually) reclaims used resources • reestablishes control flow in a predictable way (breaks deadlocks/livelocks, returns a thread or process to its start state) • To avoid imperiling correctness, we must... • Separate data recovery from process recovery • Safeguard the data • Reclaim resources with high confidence • Goal: get the same benefits as rebooting but at much finer grain (hence faster and less disruptive) - microrebooting * D. Oppenheimer et al., Why do Internet services fail and what can be done about it?, USITS 2003
Write example: “Write to Many, Wait for Few” • The AppServer stub tries to write the session state to W random bricks (W = 4), but must wait for only WQ bricks to reply (WQ = 2) • A brick that has crashed or is merely slow is simply not waited for • The browser’s cookie holds the metadata recording which bricks acknowledged the write (e.g., Bricks 1 and 4) • (A code sketch of the write and read paths follows the read example below)
Read example • Using the cookie metadata, the AppServer stub tries to read from the bricks holding the state (Bricks 1 and 4) • If Brick 1 crashes, the reply from Brick 4 satisfies the read; no special-case recovery is needed
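A compact sketch of the stub logic behind the two examples above: write to W random bricks and wait for only WQ replies, then read from the bricks named in the cookie. The Brick class, timeouts, and thread-pool plumbing are illustrative stand-ins, not SSM's actual code.

```java
import java.util.*;
import java.util.concurrent.*;

/** Minimal sketch of the SSM stub's "write to many, wait for few" logic.
 *  Brick, W, WQ, and the in-memory "network" below are illustrative
 *  stand-ins, not the real SSM implementation. */
public class SsmStubSketch {
    static final int W = 4;   // bricks written per request
    static final int WQ = 2;  // replies required before returning

    /** A brick is just a replicated in-memory hash table here. */
    static class Brick {
        final int id;
        final Map<String, String> store = new ConcurrentHashMap<>();
        volatile boolean crashed = false;
        Brick(int id) { this.id = id; }
        void write(String key, String value) {
            if (crashed) throw new IllegalStateException("brick down");
            store.put(key, value);
        }
        String read(String key) {
            if (crashed) throw new IllegalStateException("brick down");
            return store.get(key);
        }
    }

    final List<Brick> bricks = new ArrayList<>();
    final ExecutorService pool = Executors.newCachedThreadPool();

    SsmStubSketch(int n) { for (int i = 1; i <= n; i++) bricks.add(new Brick(i)); }

    /** Write to W random bricks; return the ids of the first WQ that reply.
     *  The returned list is the "cookie" metadata. */
    List<Integer> write(String key, String value) throws Exception {
        List<Brick> targets = new ArrayList<>(bricks);
        Collections.shuffle(targets);
        targets = targets.subList(0, W);
        CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);
        for (Brick b : targets)
            cs.submit(() -> { b.write(key, value); return b.id; });
        List<Integer> cookie = new ArrayList<>();
        for (int i = 0; i < W && cookie.size() < WQ; i++) {
            Future<Integer> f = cs.poll(100, TimeUnit.MILLISECONDS);   // slow/crashed bricks are ignored
            if (f == null) continue;
            try { cookie.add(f.get()); } catch (ExecutionException ignored) { }
        }
        if (cookie.size() < WQ) throw new IllegalStateException("write failed: fewer than WQ replies");
        return cookie;
    }

    /** Read from the bricks named in the cookie; the first successful reply wins. */
    String read(String key, List<Integer> cookie) {
        for (int id : cookie) {
            Brick b = bricks.get(id - 1);
            try { return b.read(key); } catch (IllegalStateException crashedOrDown) { /* try next brick */ }
        }
        return null; // all replicas of this session lost
    }

    public static void main(String[] args) throws Exception {
        SsmStubSketch stub = new SsmStubSketch(5);
        List<Integer> cookie = stub.write("session42", "cartContents");
        stub.bricks.get(cookie.get(0) - 1).crashed = true;             // kill one replica
        System.out.println(stub.read("session42", cookie));            // still readable
        stub.pool.shutdown();
    }
}
```

In the real system the bricks are separate processes reached over the network and the cookie travels with the browser; the quorum logic, however, is the same shape as above.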
SSM: Failure and Recovery • Failure of single node • No data loss, WQ-1 remain • State is available for R/W during failure • Recovery • Restart – No special case recovery code • State is available for R/W during brick restart • Session state is self-recovering • User’s access pattern causes data to be rewritten
Backpressure and Admission Control • AppServer stubs spread writes across Bricks 1–5 • Under heavy flow to one brick (here, Brick 3), that brick drops requests rather than queueing without bound • Because a stub waits for only WQ of W replies, an overloaded brick can “say no” without causing request failures
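A minimal sketch of the brick-side "OK to say no" idea: a bounded inbox that sheds load once full instead of queueing without limit. The queue capacity and the Request type are invented for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Illustrative sketch of brick-side admission control: a bounded inbox
 *  that drops requests instead of queueing without limit. The capacity
 *  and Request type are invented for illustration. */
public class AdmissionControlSketch {
    record Request(String key, String value) {}

    private final BlockingQueue<Request> inbox = new ArrayBlockingQueue<>(1024);
    private long dropped = 0;

    /** Called on the request-receive path. Returning false is the brick
     *  "saying no"; the stub tolerates this because it needs only WQ of W replies. */
    public synchronized boolean offer(Request r) {
        boolean accepted = inbox.offer(r);   // non-blocking: fails immediately when the inbox is full
        if (!accepted) dropped++;            // tracked as the NumDropped statistic
        return accepted;
    }

    public synchronized long droppedSoFar() { return dropped; }

    public static void main(String[] args) {
        AdmissionControlSketch brick = new AdmissionControlSketch();
        for (int i = 0; i < 2000; i++) brick.offer(new Request("k" + i, "v"));
        System.out.println("dropped " + brick.droppedSoFar() + " of 2000");   // everything past capacity
    }
}
```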
Statistical Monitoring • Each brick (Bricks 1–5) periodically reports statistics to Pinpoint • Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
SSM Monitoring • N replicated bricks handle read/write requests • Cannot do structural anomaly detection! • Use alternative features instead (performance, memory usage, etc.) • Activity statistics: how often did a brick do something? • Msgs received/sec, dropped/sec, etc. • Same across all peers, assuming a balanced workload • Treat anomalies as likely failures • State statistics: current state of the system • Memory usage, queue length, etc. • Similar pattern across peers, but may not be in phase • Look for patterns in the time series; differences in patterns indicate failure at a node.
Detecting Anomalous Conditions • Metrics compared against those of “peer” bricks • Basic idea: changes in workload tend to affect all bricks equally • Underlying (weak) assumption: “most bricks are doing mostly the right thing most of the time” • Anomaly in 6 or more (out of 9) metrics => reboot the brick • Use different techniques for different stats • “Activity” – median absolute deviation • “State” – Tarzan time-series analysis
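A sketch of the peer-comparison test for the "activity" statistics using median absolute deviation; the 3x threshold is an illustrative choice, while the 6-of-9 vote comes from the slide above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Peer comparison of one activity statistic across bricks using
 *  median absolute deviation (MAD). The threshold constant is illustrative. */
public class MadPeerDetector {
    static final double THRESHOLD = 3.0;   // "more than 3 MADs from the median" counts as anomalous

    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    /** Returns the indices of bricks whose value deviates anomalously from their peers. */
    static List<Integer> anomalousBricks(double[] valuesPerBrick) {
        double med = median(valuesPerBrick);
        double[] absDev = new double[valuesPerBrick.length];
        for (int i = 0; i < valuesPerBrick.length; i++) absDev[i] = Math.abs(valuesPerBrick[i] - med);
        double mad = median(absDev);
        List<Integer> anomalous = new ArrayList<>();
        for (int i = 0; i < valuesPerBrick.length; i++)
            if (mad > 0 && absDev[i] / mad > THRESHOLD) anomalous.add(i);
        return anomalous;
    }

    public static void main(String[] args) {
        // msgs received/sec on 5 bricks under a balanced workload; brick index 2 has stalled
        double[] msgsPerSec = { 410, 405, 12, 398, 402 };
        System.out.println("anomalous bricks: " + anomalousBricks(msgsPerSec));   // [2]
        // In SSM, a brick is rebooted only if 6 or more of its 9 statistics look anomalous.
    }
}
```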
Network Fault – 70% packet loss in SAN • [Timeline plot: network fault injected → fault detected and brick killed → brick restarts]
J2EE as a platform for uRB-based recovery • Java 2 Enterprise Edition, a component framework for Internet request-reply style apps • App is a collection of components (“EJBs”) created by subclassing a managed container class • application server provides component creation, thread management, naming/directory services, abstractions for database and HTTP sessions, etc. • Web pages with embedded servlets and Java Server Pages invoke EJB methods • potential to improve all apps by modifying the appserver • J2EE has a strong following, encourages modular programming, and there are open source appservers
Separating data recovery from process recovery • For HTTP workloads, session state ≈ application checkpoint • Store session state in a microrebootable session state subsystem: SSM, an N-way RAM-based replicated state store behind the existing J2EE API [NSDI ’04] • Recovery == non-state-preserving process restart; redundancy gives probabilistic durability • Response-time cost of externalizing session state: ~25% • Microreboot of EJBs: • destroy all instances of the EJB and associated threads • release appserver-level resources (DB connections, etc.) • discard appserver metadata about the EJBs • session state is preserved across the uRB
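The microreboot steps above, as a sketch; the EjbContainer interface is an invented stand-in for the modified appserver internals, not JBoss's real API.

```java
/** Illustrative microreboot of a single EJB inside an appserver container.
 *  The EjbContainer interface is an invented stand-in for the modified
 *  JBoss internals, not its real API. */
public class MicrorebootSketch {
    interface EjbContainer {
        void destroyInstancesAndThreads(String ejbName);   // kill pooled instances + associated threads
        void releaseResources(String ejbName);             // DB connections, sockets, etc.
        void discardMetadata(String ejbName);              // appserver-level metadata about the EJB
        void redeploy(String ejbName);                     // re-create the component from its descriptor
    }

    /** Session state lives in SSM, outside the appserver, so it survives the microreboot untouched. */
    static void microreboot(EjbContainer container, String ejbName) {
        container.destroyInstancesAndThreads(ejbName);
        container.releaseResources(ejbName);
        container.discardMetadata(ejbName);
        container.redeploy(ejbName);
        // No session-state recovery step: HTTP session state is in the external store (SSM).
    }

    public static void main(String[] args) {
        EjbContainer logOnly = new EjbContainer() {        // stand-in container that just logs each step
            public void destroyInstancesAndThreads(String n) { System.out.println("destroy instances/threads of " + n); }
            public void releaseResources(String n) { System.out.println("release resources held by " + n); }
            public void discardMetadata(String n) { System.out.println("discard metadata for " + n); }
            public void redeploy(String n) { System.out.println("redeploy " + n); }
        };
        microreboot(logOnly, "SB_PutBid");                 // a RUBiS-style EJB name, for illustration
    }
}
```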
JBoss + uRBs + SSM + fault injection • Fault injection: null references, deadlocks/infinite loops, corruption of volatile EJB metadata, resource leaks, Java runtime errors/exceptions • RUBiS: online auction app (132K items, 1.5M bids, 100K subscribers) • 150 simulated users/node, 35-45 req/sec/node • Workload mix based on a commercial auction site • Client-based failure detection
uRB vs. full RB – action-weighted goodput • Example: corrupt JNDI database entry, RuntimeException, Java error; measure G_aw in 1-second buckets • Localization is crude: static analysis associates each failed URL with a set of EJBs, and an EJB’s score is incremented whenever it is implicated • With uRBs, 89% reduction in failed requests and 9% more successful requests compared to full RB, despite 6 false positives
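A sketch of the crude localization just described: a static URL-to-EJB map, with each implicated EJB's score bumped on every failed request. The map contents, EJB names, and threshold are illustrative.

```java
import java.util.*;

/** Crude failure localization: statically map each URL to the EJBs it touches,
 *  and bump an EJB's score every time a failed request implicates it.
 *  The URL->EJB map and the reboot threshold are illustrative. */
public class FaultLocalizerSketch {
    private final Map<String, List<String>> urlToEjbs;      // from static analysis of the app
    private final Map<String, Integer> scores = new HashMap<>();

    FaultLocalizerSketch(Map<String, List<String>> urlToEjbs) { this.urlToEjbs = urlToEjbs; }

    void recordFailedRequest(String url) {
        for (String ejb : urlToEjbs.getOrDefault(url, List.of()))
            scores.merge(ejb, 1, Integer::sum);
    }

    /** EJBs whose score exceeds the threshold become microreboot candidates. */
    List<String> candidates(int threshold) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : scores.entrySet())
            if (e.getValue() >= threshold) out.add(e.getKey());
        return out;
    }

    public static void main(String[] args) {
        FaultLocalizerSketch loc = new FaultLocalizerSketch(Map.of(
            "/PutBid", List.of("BidEJB", "ItemEJB"),
            "/ViewItem", List.of("ItemEJB")));
        for (int i = 0; i < 5; i++) loc.recordFailedRequest("/PutBid");
        System.out.println(loc.candidates(3));   // [BidEJB, ItemEJB] (order may vary)
    }
}
```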
Performance overhead of JAGR • 150 clients/node: latency=38 msec (3 -> 7 nodes) • Human-perceptible delay: 100-200 msec • Real auction site: 41 req/sec, 33-300 msec latency
Improving availability from user’s point of view • uRB improves user-perceived availability vs. full reboot • uRB complements failover • (a) Initially, excess load on 2nd node brought it down immediately after failover • (b) uRB results in some failed requests (96% fewer) from temporary overload • (c,d) Full reboot vs. uRB without failover • For small clusters, should always try uRB first
uRB Tolerates Lax Failure Detection • Tolerates lag in detection latency (up to 53s in our microbenchmark) and high false-positive rates • Our naive detection algorithm had up to a 60% false-positive rate in terms of what to uRB • we injected 97% false positives before the reduction in overall availability equaled the cost of a full RB • Always safe to use as a “first line of defense”, even when failover is possible • cost(uRB + other recovery) ≈ cost(other recovery) • success rate of uRB on reboot-curable failures is comparable to a whole-appserver reboot
Performance penalties • Baseline workload mix modeled on commercial site • 150 simulated clients per node, ~40-45 reqs/sec per node • system at ~70% utilization • Throughput ~1% worse due to instrumentation • worst-case response latency increases from 800 to 1200ms • Average case: 45ms to 80ms; compare to 35-300ms for commercial service • Well within “human tolerance” thresholds • Entirely due to factoring out of session state • Performance penalty is tolerable & worth it
Microrecovery for Maintenance Operations • Capacity discovery in SSM • TCP-inspired flow control keeps system from falling off a cliff • “OK to say no” is essential for this backpressure to work • Microrejuvenation in JAGR (proactively microreboot to fix localized memory leaks) • Splitting/coalescing in Dstore • Split = failure + reappearance of failed node • Same safe/non-disruptive recovery mechanisms are used to lazily repair inconsistencies after new node appears • Consequently, performance impact small enough to do this as an online operation
Using microrecovery for maintenance • Capacity discovery in SSM • redundancy mechanism used for recovery (“write many, wait few”) also used to “say no” while gracefully degrading performance
Splitting/coalescing in Dstore • Split = failure + reappearance of failed node • Same mechanisms used to lazily repair inconsistencies
Summary: microrecoverable systems • Separation of data from process recovery • Special-purpose data stores can be made microrecoverable • OK to initiate microrecovery anytime for any reason • no loss of correctness, tolerable loss of performance • likely (but not guaranteed) to fix an important class of transients • won’t make things worse; can always try “full” recovery afterward • inexpensive enough to tolerate “sloppy” fault detection • low-cost first line of defense • some “maintenance” ops can be cast as microrecovery • due to low cost, “proactive” maintenance can be done online • can often convert unplanned long downtime into planned shorter performance hit
Example: Anomaly Finding Techniques • Question: does anomaly == bug? • [Comparison table of anomaly-finding techniques not reproduced] • * Includes design time and build time • ** Includes both offline (invasive) and online detection techniques
Examples of Badness Inference • Sometimes we can detect badness by looking for inconsistencies in runtime behavior • We can observe program-specific properties (through automated methods) as well as program-generic properties • Often, we must first be able to observe the program operating “normally” • Eraser: detecting data races [Savage et al. 1997] • Observe lock/unlock patterns around shared variables • If a variable that is usually protected by a lock/mutex is accessed without that lock held, report a violation • DIDUCE: inferring invariants, then detecting violations [Hangal & Lam 2002] • Start with the strictest invariant (“x is always = 3”) • Relax it as other values are seen (“x is in [0,10]”) • Increase confidence in the invariant as more observations are seen • Report violations of invariants that have reached a threshold confidence
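A much-simplified sketch of the DIDUCE-style idea for one numeric variable: start strict, relax as values are seen, and report violations only once the invariant is well supported. The confidence rule here is an assumption, not DIDUCE's actual hypothesis encoding.

```java
/** Simplified DIDUCE-style invariant for one numeric program variable:
 *  start with the strictest hypothesis (the first value seen), relax the
 *  range as new values arrive, and report a violation only once the
 *  invariant has enough supporting observations. The confidence rule
 *  here is a simplification, not DIDUCE's actual hypothesis encoding. */
public class RangeInvariantSketch {
    private double lo, hi;
    private long observations = 0;
    private static final long MIN_CONFIDENCE = 1000;   // observations needed before we trust the invariant

    /** Returns true if this value violates an invariant we are confident in. */
    public boolean observe(double x) {
        observations++;
        if (observations == 1) { lo = hi = x; return false; }   // strictest invariant: "x is always = first value"
        if (x >= lo && x <= hi) return false;                   // consistent with the current hypothesis
        boolean confident = observations > MIN_CONFIDENCE;
        lo = Math.min(lo, x);                                   // relax the invariant to cover the new value
        hi = Math.max(hi, x);
        return confident;                                       // report only if the old invariant was well supported
    }

    public static void main(String[] args) {
        RangeInvariantSketch inv = new RangeInvariantSketch();
        for (int i = 0; i < 5000; i++) inv.observe(Math.sin(i));   // training: values stay in [-1, 1]
        System.out.println(inv.observe(42.0));                     // true: a confident invariant was violated
    }
}
```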
Generic runtime monitoring techniques • What conditions are we monitoring for? • Fail-stop vs. fail-silent vs. fail-stutter • Byzantine failures • Generic methods • Heartbeats (what does loss of a heartbeat mean? Who monitors them?) • Resource monitoring (what is “abnormal”?) • Application-specific monitoring: ask a question you know the answer to • Fault-model enforcement • coerce all observed faults into an “expected faults” subset • if necessary, take additional actions to completely “induce” the fault • Simplifies recovery, since there are fewer distinct cases • Avoids potential misdiagnosis of faults that have common symptoms • Note: may sometimes appear to make things “worse” (coercing a less severe fault into a more severe one) • Doesn’t exercise all parts of the system
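A sketch combining heartbeats with fault-model enforcement: a node that misses enough heartbeats is coerced to clean fail-stop by killing it, so recovery faces one failure mode instead of many. The timeout and the kill() action are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Heartbeat monitoring plus fault-model enforcement: a node whose heartbeats
 *  stop (or merely stutter) is coerced to a clean fail-stop by killing it,
 *  so recovery handles one failure mode instead of many.
 *  The timeout and the kill() action are illustrative. */
public class HeartbeatMonitorSketch {
    interface Cluster { void kill(String nodeId); }          // e.g., power-cycle or kill -9 the process

    private static final long TIMEOUT_MS = 3000;             // missing ~3 heartbeats => declare the node failed
    private final Map<String, Long> lastBeat = new HashMap<>();
    private final Cluster cluster;

    HeartbeatMonitorSketch(Cluster cluster) { this.cluster = cluster; }

    /** Nodes call this periodically (e.g., once a second). */
    public synchronized void heartbeat(String nodeId) {
        lastBeat.put(nodeId, System.currentTimeMillis());
    }

    /** Monitor loop: coerce any silent or stuttering node to fail-stop. */
    public synchronized void sweep() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastBeat.entrySet()) {
            if (now - e.getValue() > TIMEOUT_MS) {
                cluster.kill(e.getKey());                     // completely "induce" the expected fault
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatMonitorSketch mon = new HeartbeatMonitorSketch(n -> System.out.println("killing " + n));
        mon.heartbeat("brick-3");
        Thread.sleep(3500);                                   // brick-3 goes silent
        mon.sweep();                                          // prints: killing brick-3
    }
}
```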
Internet performance failure detection • Various approaches, all of which exploit the law of large numbers and (sort of) the Central Limit Theorem (which is?) • Establish a “baseline” for the quantity to be monitored • Take observations, factor out data from known failures • Normalize to workload? • Look for “significant” deviations from the baseline • What to measure? • Coarse grain: number of reqs/sec • Finer grain: number of TCP connections in ESTABLISHED, SYN_SENT, SYN_RCVD states • Even finer: additional internal request “milestones” • Hard to do in an application-generic way... but frameworks can save us
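A sketch of the coarse-grained variant: keep a rolling baseline of reqs/sec and flag samples that deviate by more than a few standard deviations. The window size and 3-sigma rule are illustrative choices, not a prescribed RADS algorithm.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Coarse-grained performance-failure detection: maintain a rolling baseline of
 *  requests/sec and flag values that fall far outside it. The window size and
 *  3-sigma rule are illustrative choices, not a prescribed RADS algorithm. */
public class ReqRateBaselineSketch {
    private static final int WINDOW = 300;       // last 5 minutes of per-second samples
    private static final double SIGMAS = 3.0;
    private final Deque<Double> window = new ArrayDeque<>();

    /** Feed one reqs/sec sample; returns true if it deviates significantly from the baseline. */
    public boolean observe(double reqsPerSec) {
        boolean anomalous = false;
        if (window.size() >= 30) {               // need some history before judging
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double var = window.stream().mapToDouble(x -> (x - mean) * (x - mean)).average().orElse(0);
            double std = Math.sqrt(var);
            anomalous = std > 0 && Math.abs(reqsPerSec - mean) > SIGMAS * std;
        }
        if (!anomalous) {                        // factor out data from known-bad periods
            window.addLast(reqsPerSec);
            if (window.size() > WINDOW) window.removeFirst();
        }
        return anomalous;
    }

    public static void main(String[] args) {
        ReqRateBaselineSketch d = new ReqRateBaselineSketch();
        for (int i = 0; i < 100; i++) d.observe(40 + Math.random() * 5);   // normal traffic
        System.out.println(d.observe(3));                                  // true: traffic fell off a cliff
    }
}
```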
Example 1: Detection & recovery in SSM • 9 “state” statistics collected per second from each replica • Tarzan time-series analysis* compares the relative frequencies of substrings in the discretized time series • A replica is flagged “anomalous” when at least 6 of its stats are anomalous; works for aperiodic or irregular-period signals • Robust against workload changes that affect all replicas equally and against highly correlated metrics *Keogh et al., Finding surprising patterns in a time series database in linear time and space, SIGKDD 2002
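A much-simplified sketch of the Tarzan-style check: discretize each replica's time series into a symbol string, count short substrings, and score a replica by how far its substring frequencies sit from its peers'. The alphabet, substring length, and scoring rule are simplifications of the actual algorithm (Keogh et al.).

```java
import java.util.*;

/** Much-simplified version of the Tarzan-style check used on SSM "state" statistics:
 *  discretize each brick's time series into symbols, count substrings of length K,
 *  and score each brick by how far its substring frequencies sit from its peers'.
 *  The alphabet, K, and the scoring rule are simplifications of the real algorithm. */
public class TarzanLikeSketch {
    static final int K = 3;   // substring length

    /** Discretize by comparing each sample to the series median: 'a' = below, 'b' = at/above. */
    static String discretize(double[] series) {
        double[] sorted = series.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];
        StringBuilder sb = new StringBuilder();
        for (double x : series) sb.append(x < median ? 'a' : 'b');
        return sb.toString();
    }

    static Map<String, Double> substringFreqs(String s) {
        Map<String, Double> freq = new HashMap<>();
        int total = s.length() - K + 1;
        for (int i = 0; i < total; i++) freq.merge(s.substring(i, i + K), 1.0 / total, Double::sum);
        return freq;
    }

    /** Score = L1 distance between this brick's substring frequencies and the peer average. */
    static double anomalyScore(Map<String, Double> mine, List<Map<String, Double>> peers) {
        Set<String> keys = new HashSet<>(mine.keySet());
        for (Map<String, Double> p : peers) keys.addAll(p.keySet());
        double score = 0;
        for (String k : keys) {
            double avg = peers.stream().mapToDouble(p -> p.getOrDefault(k, 0.0)).average().orElse(0);
            score += Math.abs(mine.getOrDefault(k, 0.0) - avg);
        }
        return score;
    }

    public static void main(String[] args) {
        // Queue length over time: bricks 1 and 2 oscillate normally; brick 3 is stuck high.
        double[] b1 = { 1, 5, 1, 6, 2, 5, 1, 6, 2, 5, 1, 6 };
        double[] b2 = { 2, 6, 1, 5, 1, 6, 2, 5, 1, 6, 1, 5 };
        double[] b3 = { 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9 };
        List<Map<String, Double>> all = new ArrayList<>();
        for (double[] s : List.of(b1, b2, b3)) all.add(substringFreqs(discretize(s)));
        for (int i = 0; i < all.size(); i++) {
            List<Map<String, Double>> peers = new ArrayList<>(all);
            peers.remove(i);
            System.out.printf("brick %d score = %.2f%n", i + 1, anomalyScore(all.get(i), peers));
        }
        // Brick 3 scores highest; in SSM a brick is rebooted once enough of its
        // statistics look anomalous like this.
    }
}
```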