1 / 73

Intro & Overview of RADS goals

Intro & Overview of RADS goals. Armando Fox & Dave Patterson CS 444A/CS 294-6, Stanford/UC Berkeley Fall 2004. Administrivia Course logistics & registration Project expectations and other deliverables Background and motivation for RADS ROC and its relationship to RADS Early case studies

kbean
Download Presentation

Intro & Overview of RADS goals

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Intro & Overview of RADS goals Armando Fox & Dave Patterson CS 444A/CS 294-6, Stanford/UC Berkeley Fall 2004

  2. Administrivia • Course logistics & registration • Project expectations and other deliverables • Background and motivation for RADS • ROC and its relationship to RADS • Early case studies • Discussion: projects, research directions, etc.

  3. Administrivia/goals • Stanford enrollment vs. Axess • SLT and CT tutorial VHS/DVD’s available to view • SLT and CT Lab/assignments grading policy • Stanford and Berkeley meeting/transportation logistics • Format of course

  4. Background & motivation for RADS

  5. RADS in One Slide • Philosophy of ROC: focus on lowering MTTR to improve overall availability • ROC achievements: two levels of lowering MTTR • “Microrecovery”: fine-grained generic recovery techniques recover only the failed part(s) of the system, at much lower cost than whole-system recovery • Undo: sophisticated tools to help human operators selectively back out destructive actions/changes to a system • General approach: use microrecovery as “first line of defense”; when it fails, provide support to human operators to avoid having to “reinstall the world” • RADS insight: can combine cheap recovery with statistical anomaly detection techniques

  6. Hence, (at least) 2 parts to RADS • Investigating other microrecovery methods • Investigating analysis techniques • What to capture/represent in a model • Addressing fundamental open challenges • stability • systematic misdiagnosis • subversion by attackers • etc. • General insight: “different is bad” • “law of large numbers” arguments support this for large services

  7. Why RADS • Motivation • 5 9’s availability => 5 down-minutes/year => must recover from (or mask) most failures without human intervention • a principled way to design “self-*” systems • Technology • High-traffic large-scale distributed/replicated services => large datasets • Analysis is CPU-intensive => a way to trade extra CPU cycles for dependability • Large logs/datasets for models => storage is cheap and getting cheaper • RADS addresses a clear need while exploiting demonstrated technology trends

  8. Cheap Recovery

  9. Complex systems of black boxes • “...our ability to analyze and predict the performance of the enormously complex software systems that lies at the core of our economy is painfully inadequate.” (Choudhury & Weikum, 2000 PITAC Report) • Networked services too complex and rapidly-changing to test exhaustively: “collections of black boxes” • Weekly or biweekly code drops not uncommon • Market activities lead to integration of whole systems • Need to get humans out of loop for at least some monitoring/recovery loops • hence interest in “autonomic” approaches • fast detection is often at odds with false alarms

  10. Consequences • Complexity breeds increased bug counts and bug impact • Heisenbugs, race conditions, environment-dependent and hard-to-reproduce bugs still account for majority of SW bugs in live systems • up to 80% of bugs found in production are those for which a fix is not yet available* • some application-level failures result in user-visible bad behavior before they are detected by site monitors • Tellme Networks: up to 75% of downtime is “detection” (sometimes by user complaints), followed by localization • Amazon, Yahoo: gross metrics track second-order effect of bugs, but lags actual bug by minutes or tens of minutes • Result: downtime and increased management costs * A.P. Wood, Software reliability from the customer view, IEEE Computer, Aug. 2003

  11. “Always adapting, always recovering” • Build statistical models of “acceptable” operating envelope by measurement & analysis on live system • Control theory, statistical correlation, anomaly detection... • Detect runtime deviations from model • typical tradeoff is between detection rate & false positive rate • Rely on external control using inexpensive and simple mechanisms that respect the black box, to keep system within its acceptable operating envelope • invariant: attempting recovery won’t make things worse • makes inevitable false positives tolerable • can then reduce false negatives by “tuning” algo’s to be more aggressive and/or deploying multiple detectors Systems that are “always adapting, always recovering”

  12. Toward recovery management invariants • Observation: instrumentation and analysis • collect and analyze data from running systems • rely on “most systems work most of the time” to automatically derive baseline models • Analysis: detect and localize anomalous behavior • Action: close loop automatically with “micro-recovery” • “Salubrious”: returns some part of system to known state • Reclaim resources (memory, DB conns, sockets, DHCP lease...), throw away corrupt transient state, setup to retry operation if appropriate • Safe: no effect on correctness, minimal effect on performance • Localized: parts not being microrecovered aren’t affected • Fast recovery simplifies failure detection and recovery management.

  13. Non-goals/complementary work All of the following are being capably studied by others, and directly compose with our own efforts... • Byzantine fault tolerance • In-place repair of persistent data structures • Hard-real-time response guarantees • Adding checkpointing to legacy non-componentized applications • Source code bug finding • Advancing the state of the art in SLT (analysis algorithms)

  14. Outline • Micro-recoverable systems • Concept of microrecovery • A microrecoverable application server & session state store • Application-generic SLT-based failure detection • Path and component analysis and localization for appserver • Simple time series analyses for purpose-built state store • Combining SLT detection with microrecoverable systems • Discussion, related work, implications & conclusions

  15. Microrebooting: one kind of microrecovery • 60+% of software failures in the field* are reboot-curable, even if root cause is unknown... why? • Rebooting discards bad temporary data (corrupted data structures that can be rebuilt) and (usually) reclaims used resources • reestablishes control flow in a predictable way (breaks deadlocks/livelocks, returns thread or process to its start state) • To avoid imperiling correctness, we must... • Separate data recovery from process recovery • Safeguard the data • Reclaim resources with high confidence • Goal: get same benefits of rebooting but at much finer grain (hence faster and less disruptive) - microrebooting * D. Oppenheimer et al., Why do Internet services fail and what can be done about it? , USITS 2003

  16. Write example: “Write to Many, Wait for Few” AppServer STUB Try to write to W random bricks, W = 4Must wait for WQ bricks to reply, WQ = 2 Brick 1 Brick 2 Browser Brick 3 Brick 4 Brick 5

  17. Write example: “Write to Many, Wait for Few” AppServer STUB Try to write to W random bricks, W = 4Must wait for WQ bricks to reply, WQ = 2 Brick 1 Brick 2 Browser Brick 3 Brick 4 Brick 5

  18. Write example: “Write to Many, Wait for Few” AppServer STUB Try to write to W random bricks, W = 4Must wait for WQ bricks to reply, WQ = 2 Brick 1 Brick 2 Browser Brick 3 Brick 4 Brick 5

  19. Write example: “Write to Many, Wait for Few” AppServer STUB Try to write to W random bricks, W = 4Must wait for WQ bricks to reply, WQ = 2 Brick 1 Brick 2 Browser Brick 3 Brick 4 Brick 5

  20. Write example: “Write to Many, Wait for Few” AppServer STUB Crashed? Slow? Try to write to W random bricks, W = 4Must wait for WQ bricks to reply, WQ = 2 Brick 1 Brick 2 Browser 14 Brick 3 Brick 4 Cookie holds metadata Brick 5

  21. Read example: AppServer STUB Try to read from Bricks 1, 4 Brick 1 14 Brick 2 Browser Brick 3 Brick 4 Brick 5

  22. Read example: AppServer STUB 14 Brick 1 Brick 2 Browser Brick 3 Brick 4 Brick 5

  23. Read example: AppServer STUB Brick 1 crashes Brick 1 Brick 2 Browser Brick 3 Brick 4 Brick 5

  24. Read example: AppServer STUB Brick 2 Browser Brick 3 Brick 4 Brick 5

  25. SSM: Failure and Recovery • Failure of single node • No data loss, WQ-1 remain • State is available for R/W during failure • Recovery • Restart – No special case recovery code • State is available for R/W during brick restart • Session state is self-recovering • User’s access pattern causes data to be rewritten

  26. Backpressure and Admission Control AppServer AppServer STUB STUB Brick 1 Brick 2 Drop Requests Brick 3 Brick 4 Brick 5 Heavy flow to Brick 3

  27. Statistical Monitoring Pinpoint Pinpoint Statistics Statistics NumElementsMemoryUsedInboxSizeNumDroppedNumReadsNumWrites Brick 1 Brick 2 Brick 3 Brick 4 Brick 5

  28. SSM Monitoring • N replicated bricks handle read/write requests • Cannot do structural anomaly detection! • Alternative features (performance, mem usage, etc) • Activity statistics: How often did a brick do something? • Msgs received/sec, dropped/sec, etc. • Same across all peers, assuming balanced workload • Use anomalies as likely failures • State statistics: Current state of system • Memory usage, queue length, etc. • Similar pattern across peers, but may not be in phase • Look for patterns in time-series; differences in patterns indicate failure at a node.

  29. Detecting Anomalous Conditions • Metrics compared against those of “peer” bricks • Basic idea: Changes in workload tend to affect all bricks equally • Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time” • Anomaly in 6 or more (out of 9) metrics => reboot brick • Use different techniques for different stats • “Activity” – absolute median deviation • “State” – Tarzan time-series analysis

  30. Network Fault – 70% packet loss in SAN Brick restarts Fault detectedBrick killed Network fault injected

  31. J2EE as a platform for uRB-based recovery • Java 2 Enterprise Edition, a component framework for Internet request-reply style apps • App is a collection of components (“EJBs”) created by subclassing a managed container class • application server provides component creation, thread management, naming/directory services, abstractions for database and HTTP sessions, etc. • Web pages with embedded servlets and Java Server Pages invoke EJB methods • potential to improve all apps by modifying the appserver • J2EE has a strong following, encourages modular programming, and there are open source appservers

  32. Separating data recovery from process recovery • For HTTP workloads, session state  app checkpoint • Store session state in a microrebootable session state subsystem (NSDI’04) • Recovery==non-state-preserving process restart, redundancy gives probabilistic durability • Response time cost of externalizing session state: ~25% • SSM, an N-way RAM-based state replication [NSDI 04] behind existing J2EE API • Microreboot EJB’s: • destroy all instances of EJB and associated threads • releases appserver-level resources (DB connections, etc) • discards appserver metadata about EJB’s • session state preserved across uRB

  33. JBoss+uRB’s+SSM + fault injection Fault injection: null refs, deadlocks/infinite loop, corruption of volatile EJB metadata, resource leaks, Java runtime errors/exc RUBiS: online auction app (132K items, 1.5M bids, 100K subscribers) 150 simulated users/node35-45 req/sec/node Workload mix based on a commercial auction site Client-based failure detection

  34. uRB vs. full RB - action weighted goodput • Example: corrupt JNDI database entry, RuntimeException, Java error; measure G_aw in 1-second buckets • Localization is crude: static analysis to associate failed URL with set of EJB’s, incrementing an EJB’s score whenever it’s implicated • With uRB’s, 89% reduction in failed requests and 9% more successful requests compared to full RB, despite 6 false positives

  35. Performance overhead of JAGR • 150 clients/node: latency=38 msec (3 -> 7 nodes) • Human-perceptible delay: 100-200 msec • Real auction site: 41 req/sec, 33-300 msec latency

  36. Improving availability from user’s point of view • uRB improves user-perceived availability vs. full reboot • uRB complements failover • (a) Initially, excess load on 2nd node brought it down immediately after failover • (b) uRB results in some failed requests (96% fewer) from temporary overload • (c,d) Full reboot vs. uRB without failover • For small clusters, should always try uRB first

  37. uRB Tolerates Lax Failure Detection • Tolerates lag in detection latency (up to 53s in our microbenchmark) and high false positive rates • Our naive detection algorithm had up to 60% false positive rate in terms of what to uRB • we injected 97% false positives before reduction in overall availability equaled cost of full RB • Always safe to use as “first line of defense”, even when failover is possible • cost(uRB+other recovery)  cost(other recovery) • success rate of uRB on reboot-curable failures is comparable to whole-appserver reboot

  38. Performance penalties • Baseline workload mix modeled on commercial site • 150 simulated clients per node, ~40-45 reqs/sec per node • system at ~70% utilization • Throughput ~1% worse due to instrumentation • worst-case response latency increases from 800 to 1200ms • Average case: 45ms to 80ms; compare to 35-300ms for commercial service • Well within “human tolerance” thresholds • Entirely due to factoring out of session state • Performance penalty is tolerable & worth it

  39. Recovery and maintenance

  40. Microrecovery for Maintenance Operations • Capacity discovery in SSM • TCP-inspired flow control keeps system from falling off a cliff • “OK to say no” is essential for this backpressure to work • Microrejuvenation in JAGR (proactively microreboot to fix localized memory leaks) • Splitting/coalescing in Dstore • Split = failure + reappearance of failed node • Same safe/non-disruptive recovery mechanisms are used to lazily repair inconsistencies after new node appears • Consequently, performance impact small enough to do this as an online operation

  41. Using microrecovery for maintenance • Capacity discovery in SSM • redundancy mechanism used for recovery (“write many, wait few”) also used to “say no” while gracefully degrading performance

  42. Full rejuvenation vs. microrejuvenation 76%

  43. Splitting/coalescing in Dstore • Splitting/coalescing in Dstore • Split = failure + reappearance of failed node • Same mechanisms used to lazily repair inconsistencies

  44. Summary: microrecoverable systems • Separation of data from process recovery • Special-purpose data stores can be made microrecoverable • OK to initiate microrecovery anytime for any reason • no loss of correctness, tolerable loss of performance • likely (but not guaranteed) to fix an important class of transients • won’t make things worse; can always try “full” recovery afterward • inexpensive enough to tolerate “sloppy” fault detection • low-cost first line of defense • some “maintenance” ops can be cast as microrecovery • due to low cost, “proactive” maintenance can be done online • can often convert unplanned long downtime into planned shorter performance hit

  45. Anomaly detection as failure detection

  46. Example: Anomaly Finding Techniques Question: does anomaly == bug? * Includes design time and build time ** Includes both offline (invasive) and online detection techniques

  47. Examples of Badness Inference • Sometimes can detect badness by looking for inconsistencies in runtime behavior • We can observe program-specific properties (though using automated methods) as well as program-generic properties • Often, we must be able to first observe program operating “normally” • Eraser: detecting data races [Savage et al. 2000] • Observe lock/unlock patterns around shared variables • If a variable usually protected by lock/unlock or mutex is observed to have interleaved reads, report a violation • DIDUCE: inferring invariants, then detecting violations [Hangal & Lam 2002] • Start with strict invariant (“x is always =3”) • Relax it as other values seen (“x is in [0,10]”) • Increase confidence in invariant as more observations seen • Report violations of invariants that have threshold confidence

  48. Generic runtime monitoring techniques • What conditions are we monitoring for? • Fail-stop vs. Fail-silent vs. Fail-stutter • Byzantine failures • Generic methods • Heartbeats (what does loss of heartbeat mean? Who monitors them?) • Resource monitoring (what is “abnormal”?) • Application-specific monitoring: ask a question you know the answer to • Fault model enforcement • coerce all observed faults to an “expected faults” subset • if necessary, take additional actions to completely “induce” the fault • Simplifies recovery since fewer distinct cases • Avoids potential misdiagnosis of faults that have common symptoms • Note, may sometimes appear to make things “worse” (coerce a less-severe fault to a more-severe fault) • Doesn’t exercise all parts of the system

  49. Internet performance failure detection • Various approaches, all of which exploit the law of large numbers and (sort of) Central Limit Theorem (which is?) • Establish “baseline” of quantity to be monitored • Take observations, factor out data from known failures • Normalize to workload? • Look for “significant” deviations from baseline • What to measure? • Coarse-grain: number of reqs/sec • Finer-grain: Number of TCP connections in Established, Syn_sent, Syn_rcvd state • Even finer: additional internal request “milestones” • Hard to do in an application-generic way...but frameworks can save us

  50. Example 1: Detection & recovery in SSM • 9 “State” statistics collected per second from each replica • Tarzan time series analysis* compares relative frequencies of substrings corresponding to discretized time series • “anomalous” => at least 6 stats “anomalous”; works for aperiodic or irregular-period signals • robust against workload changes that affect all replicas equally and against highly-correlated metrics *Keogh et al., Finding surprising patterns in a time series database in linear time and space, SIGKDD 2002

More Related