
Fast Recovery + Statistical Anomaly Detection = Self-*


Presentation Transcript


  1. Fast Recovery + Statistical Anomaly Detection = Self-* RADS/KATZ CATS Panel June 2004 ROC Retreat

  2. Outline • Motivation & approach: complex systems of black boxes • Measurements that respect black boxes • Box-level Micro-recovery cheap enough to survive false positives • Differences from related efforts • Early case studies • Research agenda

  3. Complex Systems of Black Boxes • “...our ability to analyze and predict the performance of the enormously complex software systems that lies at the core of our economy is painfully inadequate.” (Choudhury & Weikum, 2000 PITAC Report) • Build model of “acceptable” operating envelope by measurement & analysis • Control theory, statistical correlation, anomaly detection... • Rely on external control, using inexpensive and simple mechanisms that respect the black box, to keep system in its acceptable operating envelope • “Increase the size of the DB connection pool” [Hellerstein et al] • “Reallocate one or more whole machines” [Lassettre et al] • “Rejuvenate/reboot one or more machines” [Trivedi, Fox, others] • “Shoot one of the blocked txns” [everyone] • “Induce memory pressure on other apps” [Waldspurger et al]

  4. Differences from some existing problems • intrusion detection (Hofmeyr et al 98, others) • Detections must be actionable in a way that is likely to improve the system (sacrificing availability for safety is unacceptable) • bug finding via anomaly detection (Engler, others) • Human-level monitoring/verification of detections not feasible, due to number of observations and short timescales for reaction • Can separate recovery from diagnosis/repair (don’t always need to know root cause to recover) • modeling/predicting SLO violations (Hellerstein, Goldszmidt, others) • Labeled training set not necessarily available

  5. Many other examples, but the point is... • Statistical techniques identify “interesting” features and relationships from large datasets, but there is a frequent tradeoff between detection rate (or detection time) and false positives • Make “micro-recovery” so inexpensive that occasional false positives don’t matter • Granularity of black box should match granularity of available external control mechanisms

  6. “Micro-recovery” to survive false positives • Goal: provide “recovery management invariants” • “Salubrious”: returns some part of system to known state • Reclaim resources (memory, DB conns, sockets, DHCP lease...) • Throw away corrupt transient state • Possibly set up to retry operation, if appropriate • Safe: affects only performance, not correctness • Non-disruptive: performance impact is “small” • Predictable: impact and time-to-complete are stable. Observe, Analyze, Act: not recovery, but continuous adaptation
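
A minimal sketch of how these invariants might be captured in code; the class and method names are hypothetical, not from the talk:

```python
# Illustrative only: one way to express the recovery-management invariants
# (salubrious, safe, non-disruptive, predictable) as an interface for
# micro-recovery actions. Names and structure are assumptions.
import time
from abc import ABC, abstractmethod


class MicroRecoveryAction(ABC):
    # Predictable: a stable upper bound on time-to-complete.
    max_duration_s: float = 1.0

    @abstractmethod
    def apply(self, target: str) -> None:
        """Salubrious: return `target` to a known state (reclaim resources,
        discard corrupt transient state, possibly set up a retry).
        Safe: must affect only performance, never correctness."""


class RebootBrick(MicroRecoveryAction):
    max_duration_s = 5.0

    def apply(self, target: str) -> None:
        start = time.time()
        print(f"kill -9 and restart {target}")               # stand-in for the real actuator
        assert time.time() - start <= self.max_duration_s    # predictability check
```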

  7. Crash-Only Building Blocks • Control points are safe, predictable, non-disruptive • Crash-only design: shutdown=crash, recover=restart • Makes state-management subsystems as easy to manage as stateless Web servers

  8. Example: Managing DStore and SSM • Rebooting is the only control mechanism • Has predictable effect and takes predictable time, regardless of what the process is doing • Like kill -9, “turning off” a VM, or pulling power cord • Intuition: the “infrastructure” supporting the power switch is simpler than the applications using it • Due to slight overprovisioning inherent in replication, rebooting can have minimal effect on throughput & latency • Relaxed consistency guarantees allow this to work • Activity and state statistics collected per brick every second; any deviation => reboot brick • Makes it as easy as managing a stateless server farm • Backpressure at many design points prevents saturation
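
A sketch of the implied control loop, with `collect_stats`, `deviates_from_peers`, and `reboot` as assumed placeholder hooks rather than the real SSM/DStore interfaces:

```python
# Illustrative observe/analyze/act loop: statistics are gathered per brick
# once per second, and any brick whose statistics deviate from its peers'
# is rebooted -- reboot being the only control mechanism.
import time


def monitor(bricks, collect_stats, deviates_from_peers, reboot, period_s=1.0):
    while True:
        stats = {b: collect_stats(b) for b in bricks}        # observe
        for brick in bricks:
            peers = [stats[b] for b in bricks if b is not brick]
            if deviates_from_peers(stats[brick], peers):     # analyze
                reboot(brick)                                # act: coerce to a crash/restart
        time.sleep(period_s)
```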

  9. Design Lessons Learned So Far • “A spectrum of cleaning operations” (Eric Anderson, HP Labs) • Consequence: as t → ∞, all problems will converge to “repair of corrupted persistent data” • Trade “unnecessary” consistency for faster recovery • spread recovery actions out incrementally/lazily (read repair) rather than doing them all at once (log replay) • gives predictable return-to-service time and acceptable variation in performance after recovery • keeps data available for reads and writes throughout “recovery” • Use single-phase ops to avoid coupling/locking and the issues they raise, and justify the cost in consistency • It’s OK to say no (backpressure) • Several places our design got it wrong in SSM • But even those mistakes could have been worked around by guard timers
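
A minimal sketch of the read-repair idea (lazy, incremental recovery on the read path instead of an up-front log replay); the replica `get`/`put` API is assumed for illustration:

```python
# Illustrative read repair: read from all reachable replicas, return the
# newest value, and lazily write it back to stale replicas. Recovery work is
# spread across ordinary reads, so return-to-service time stays predictable
# and the data remains readable and writable throughout "recovery".
def read_with_repair(key, replicas):
    results = {}
    for r in replicas:
        try:
            results[r] = r.get(key)              # assumed to return (value, timestamp)
        except ConnectionError:
            continue                             # skip unreachable replicas
    if not results:
        raise KeyError(key)
    newest_value, newest_ts = max(results.values(), key=lambda vt: vt[1])
    for r, (value, ts) in results.items():
        if ts < newest_ts:
            r.put(key, newest_value, newest_ts)  # lazy repair of a stale replica
    return newest_value
```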

  10. Potential Limitations and Challenges • Hard failures • Configuration failures • Although similar approach has been used to troubleshoot those • Corruption of persistent state • Data structure repair work (Rinard et al.) may be combinable with automatic inference (Lam et al.) • Challenges • Stability and the “autopilot problem” • The base-rate fallacy • Multilevel learning • Online implementations of SLT techniques • Nonintrusive data collection and storage

  11. An Architecture for Observe, Analyze, Act • [Figure: datacenter architecture. Client requests reach application servers and components inside the datacenter boundary; a collection stage feeds observations to online algorithms and a short-term store; offline algorithms work from a long-term store; recovery synthesis turns analysis into recovery actions; observations and recovery actions are exchanged with other datacenters.] • Separates systems concerns from algorithm development • Programmable network elements provide extension of approach to other layers • Consistent with technology trends • Explicit parallelism in CPU usage • Lots of disk storage with limited bandwidth
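
A skeleton of the observe/analyze/act split in the figure, under assumed class names (the real architecture's interfaces are not shown in the transcript):

```python
# Illustrative skeleton: collection feeds online algorithms and a short-term
# store; offline algorithms would mine the long-term store; recovery synthesis
# turns detections into recovery actions. All names are placeholders.
from collections import deque


class ShortTermStore:
    def __init__(self, maxlen=10_000):
        self.buf = deque(maxlen=maxlen)

    def append(self, obs):
        self.buf.append(obs)


class ObserveAnalyzeAct:
    def __init__(self, online_algos, recovery_synthesis, store):
        self.online_algos = online_algos            # cheap, incremental detectors
        self.recovery_synthesis = recovery_synthesis
        self.store = store

    def observe(self, obs):
        self.store.append(obs)                      # kept for offline algorithms
        detections = [algo.update(obs) for algo in self.online_algos]
        for action in self.recovery_synthesis([d for d in detections if d]):
            action()                                # e.g. reboot a brick, microreboot an EJB
```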

  12. Conclusion • “...Ultimately, these aspects [of autonomic systems] will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance.” (The Vision of Autonomic Computing) The real reason to reduce MTTR is to tolerate false positives: recovery → adaptation

  13. Breakout sessions? • [James H] Reserve some resources to deal with problems (by filtering or pre-reservation) • [Joe H] How black is the black box? What “gray box” prior knowledge can you exploit (so you don’t ignore the obvious)? • [Joe H] Human role: humans can make statements about how the system should act, so training doesn’t have to be completely hands-off. Similarly, during training, a human can give feedback about which anomalies are actually relevant (labeling). • [Lakshmi] What kinds of apps is this intended to apply to? Where do ROC-like and OASIS-like apps differ? • [Mary Baker] People can learn to game the system -> randomness can be your friend. If behaviors have a small number of modes, you just have to look for behaviors in the “valleys”

  14. Breakouts • 19 - “golden nuggets” to guide architecture, e.g., persistent identifiers for path-based analysis... what else? • 8 - act: what {safe, fast, predictable} behaviors of the system should we expose (other than, e.g., rebooting)? Esp. those that contribute to security as well as dependability? • 11 - architectures for different types of stateful systems - what kinds of persistent/semi-persistent state need to be factored out of apps, and how to store it; interfaces, etc. • 20 - Given your goal of “generic” techniques for distributed systems, how will you know when you’ve succeeded / how do you validate the techniques? (What are the “proof points” you can hand to others to convince them you’ve succeeded, including but not limited to metrics?) • [Aaron/Dave] Metrics: How do you know you’re observing the right things? What benchmarks will be needed?

  15. Open Mic • James Hamilton - The Security Economy

  16. Conclusion • Toward “new science” in autonomic computing • “...Ultimately, these aspects [of autonomic systems] will be emergent properties of a general architecture, and distinctions will blur into a more general notion of self-maintenance.” (The Vision of Autonomic Computing) The real reason to reduce MTTR is to tolerate false positives: recovery → adaptation

  17. Autonomic & Technology Trends • CPU speed increases slowing down, need more explicit parallelism • Use extra CPU to collect and locally analyze data; exploit temporal locality • Disk space is free (though bandwidth and disaster-recovery aren’t) • Can keep history: parallel as well as historical models for regression analysis, trending, etc. • VMs being used as unit of software distribution • Fault isolation • Opportunity for nonintrusive observation • Action that is independent of the hosted app

  18. Data collection & monitoring • Component frameworks allow for non-intrusive data collection without modifying the applications • Inter-EJB calls go through a runtime-managed level of indirection • Slightly coarser grain of analysis: restrictions on “legal” paths make it more likely we can spot anomalies • Aspect-oriented programming allows further monitoring without perturbing application logic • Virtual machine monitors provide additional observation points • Already used by ASPs for load balancing, app migration, etc. • Transparent to applications and hosted OSes • Likely to become the unit of software distribution (intra- and inter-cluster)
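
A rough stand-in for that runtime-managed level of indirection, written as a Python decorator; the real systems intercept inter-EJB calls inside the application server, so everything here is illustrative:

```python
# Illustrative interposition: a decorator records (caller, callee) edges of the
# execution path without touching application logic, mimicking what the
# component framework's indirection layer provides for free.
import functools
import threading

_call_stack = threading.local()


def traced_component(name, observed_edges):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            stack = getattr(_call_stack, "stack", [])
            caller = stack[-1] if stack else "client"
            observed_edges.append((caller, name))   # one observed inter-component call
            _call_stack.stack = stack + [name]
            try:
                return fn(*args, **kwargs)
            finally:
                _call_stack.stack = stack
        return inner
    return wrap
```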

  19. Optimizing for Specialized State Types • Two single-key (“Berkeley DB”) get/set state stores • Used for user session state, application workflow state, persistent user profiles, merchandise catalogs, ... • Replication to a set of N bricks provides durability • Write to subset, wait for subset, remember subset • DStore: state persists “forever” as long as N/2 bricks survive • SSM: If client loses cookie, state is lost; otherwise, persists for time t with probability p, where t, p = F(N, node MTBF) • Recovery==restart, takes seconds or less • Efficacy doesn’t depend on whether replica is behaving correctly • SSM: node state not preserved (in-memory only) • DStore: node state preserved, read-repair fixes
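
A minimal sketch of "write to subset, wait for subset, remember subset"; the brick `put` API and the parameters are assumptions, not the actual DStore/SSM protocol:

```python
# Illustrative quorum-style write: pick a random write-set of bricks, wait for
# all of them to acknowledge, and return the set so the caller can remember it
# (e.g. in the client's cookie, as SSM does). Randomness spreads load and
# avoids deterministic hotspots.
import random


def write_to_subset(key, value, bricks, write_set_size):
    chosen = random.sample(bricks, write_set_size)
    acked = [b for b in chosen if b.put(key, value)]     # wait for each chosen brick
    if len(acked) < write_set_size:
        raise IOError("incomplete write-set; caller may retry with a new subset")
    return acked                                         # "remember subset"
```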

  20. Detection & recovery in SSM • 9 “State” statistics collected once per second from each brick • Tarzan time series analysis: keep N-length time series, discretize each data point • count relative frequencies of all substrings of length k or shorter • compare against peer bricks; reboot if at least 6 stats “anomalous”; works for aperiodic or irregular-period signals • Remember! We are not SLT/ML researchers!
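
A simplified sketch of that pipeline (discretize, count short substrings, compare against peers); the alphabet size, substring length, and scoring rule are made-up stand-ins for the real Tarzan parameters:

```python
# Simplified Tarzan-style check: turn each statistic's recent time series into
# a symbol string, count all substrings of length <= k, and score a brick by
# how far its substring frequencies sit from the average of its peers.
from collections import Counter


def discretize(series, lo, hi, bins=4):
    width = (hi - lo) / bins or 1.0
    return "".join(chr(ord("a") + max(0, min(bins - 1, int((x - lo) / width))))
                   for x in series)


def substring_counts(symbols, k):
    counts = Counter()
    for length in range(1, k + 1):
        for i in range(len(symbols) - length + 1):
            counts[symbols[i:i + length]] += 1
    return counts


def anomaly_score(brick_series, peer_series_list, lo, hi, k=3):
    mine = substring_counts(discretize(brick_series, lo, hi), k)
    peers = Counter()
    for s in peer_series_list:
        peers += substring_counts(discretize(s, lo, hi), k)
    n_peers = max(1, len(peer_series_list))
    keys = set(mine) | set(peers)
    return sum(abs(mine[key] - peers[key] / n_peers) for key in keys)
```

In the slide's terms, a brick is rebooted once at least 6 of its 9 statistics score as anomalous relative to its peers.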

  21. Detection & recovery in DStore • Metrics and algorithm comparable to those used in SSM • We inject “fail-stutter” behavior by increasing request latency • Bottom case: more aggressive detection also results in 2 “unnecessary” reboots • But they don’t matter much • Currently some voodoo constants for thresholds in both SSM and DStore. Trade-off of fast detection vs. false positives

  22. What faults does this handle? • Substantially all non-Byzantine faults we injected: • Node crash, hang/timeout/freeze • Fail-stutter: network loss (drop up to 70% of packets randomly) • Periodic slowdown (e.g. from garbage collection) • Persistent slowdown (one node lags the others) • Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time” • All anomalies can be safely “coerced” to crash faults • If that turned out to be the wrong thing, it didn’t cost you much to try it • Human notified after threshold number of restarts. These systems are “always recovering”
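
A sketch of the escalation policy those last two bullets imply; the threshold, window, and hook names are assumptions:

```python
# Illustrative policy: every anomaly is first coerced to a crash fault (cheap
# restart); a human is only notified once a brick has been restarted more than
# `threshold` times within `window_s` seconds, i.e. once cheap recovery has
# stopped helping.
import time
from collections import defaultdict


class RestartEscalator:
    def __init__(self, restart, notify_human, threshold=3, window_s=600):
        self.restart = restart
        self.notify_human = notify_human
        self.threshold = threshold
        self.window_s = window_s
        self.history = defaultdict(list)

    def handle_anomaly(self, brick):
        now = time.time()
        recent = [t for t in self.history[brick] if now - t < self.window_s] + [now]
        self.history[brick] = recent
        if len(recent) > self.threshold:
            self.notify_human(brick)     # restarts aren't fixing it; escalate
        else:
            self.restart(brick)          # coerce the anomaly to a crash fault
```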

  23. Path-based analysis + Microreboots • [Figure: detection rate vs. false positive rate. Across all expts: 80% detection rate with 1.8% FP rate; across 92% of expts: 40% detection rate with 0.2% FP rate] • Pinpoint captures execution paths through EJBs as dynamic call trees (intra-method calls hidden) • Build probabilistic context-free grammar from these • Detect trees that correspond to very low probability parses • Respond by micro-rebooting (uRB) suspected-faulty EJBs • uRB takes 100s of msecs, vs. whole-app restart (8-10 sec) • Component interaction analysis currently finds 55-75% of failures • Path shape analysis detects >90% of failures, but correctly localizes fewer
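
A much-simplified stand-in for the Pinpoint analysis: it learns probabilities of caller-to-callees "productions" from observed call trees and flags components appearing in very low-probability productions as micro-reboot candidates. The tree encoding, threshold, and scoring rule are assumptions, not Pinpoint's actual PCFG machinery:

```python
# Illustrative path-shape check. A call tree is a (component, [subtrees]) pair.
# Training counts how often each component fans out to each tuple of callees;
# detection flags trees containing productions the model considers improbable.
from collections import Counter, defaultdict


def productions(tree):
    name, children = tree
    yield (name, tuple(child[0] for child in children))
    for child in children:
        yield from productions(child)


def learn(trees):
    counts, totals = defaultdict(Counter), Counter()
    for tree in trees:
        for head, body in productions(tree):
            counts[head][body] += 1
            totals[head] += 1
    return {head: {body: n / totals[head] for body, n in bodies.items()}
            for head, bodies in counts.items()}


def microreboot_candidates(tree, model, threshold=0.01):
    suspects = set()
    for head, body in productions(tree):
        if model.get(head, {}).get(body, 0.0) < threshold:
            suspects.add(head)           # candidate for a micro-reboot (uRB)
    return suspects
```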

  24. Crash-Only Design Lessons from SSM • Eliminate coupling • No dependence on any specific brick, just on a subset of minimum size -- even at the granularity of individual requests • Not even across phases of an operation: single-phase nonblocking ops only => predictable amount of work/request • Use randomness to avoid deterministic worst cases and hotspots • We initially violated this guideline by using an off-the-shelf JMS implementation that was centralized • Make parts interchangeable • Any replica in a write-set is as good as any other • Unlike erasure coding, only need 1 replica to survive • Cost is higher storage overhead, but we’re willing to pay that to get the self-* properties

  25. Enterprise Service Workloads 3. We can continuously extract models from the production system orthogonally to the application

  26. Building models through measurement • Finding bugs using distributed assertion sampling [Liblit et al, 2003] • Instrument source code with assertions on pairs of variables (“features”) • Use sampling so that any given run of program exercises only a few assertions (to limit performance impact) • Use classification algorithm to identify which features are most predictive of faults (observed program crashes) • Goal: bug finding
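
A rough sketch of the sampling-plus-ranking idea (not the Liblit et al. implementation); the predicate representation, sample rate, and the crude score are all assumptions:

```python
# Illustrative sampled-assertion scheme: each run evaluates only a small random
# subset of predicates over program variables (to bound overhead), and a simple
# score ranks predicates by how often their being true coincides with a crash.
import random


def sample_predicates(variables, predicates, sample_rate=0.01):
    report = {}
    for name, predicate in predicates.items():
        if random.random() < sample_rate:            # keep per-run cost low
            report[name] = bool(predicate(variables))
    return report


def rank_predicates(runs):
    # runs: list of (report, crashed) pairs gathered from many executions
    names = {n for report, _ in runs for n in report}
    scores = {}
    for n in names:
        true_and_crashed = sum(1 for report, crashed in runs
                               if report.get(n) and crashed)
        observed = sum(1 for report, _ in runs if n in report) or 1
        scores[n] = true_and_crashed / observed      # crude "predictiveness"
    return sorted(scores.items(), key=lambda kv: -kv[1])
```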

  27. JAGR: JBoss with Micro-reboots • performability of RUBiS (goodput/sec vs. time) • vanilla JBoss w/manual restarting of app-server, vs. JAGR w/automatic recovery and micro-rebooting • JAGR/RUBiS does 78% better than JBoss/RUBiS • Maintains 20 req/sec, even in the face of faults • Lower steady-state after recovery in first graph: class reloading, recompiling, etc., which is not necessary with micro-reboots • Also used to fix memory leaks without rebooting whole appserver

  28. Fast Recovery + Statistical Anomaly Detection = Self-* Armando Fox and Emre Kiciman, Stanford University; Michael Jordan, Randy Katz, David Patterson, Ion Stoica, University of California, Berkeley. SoS Workshop, Bertinoro, Italy
