RM3G: Next Generation Recovery Manager

Steve Zhang and Armando Fox, Stanford University

Design Goals
• Overall goal: manage the detection of and recovery from system failures
• New in 3G: focus on online Statistical Learning Theory (SLT) algorithms for application-generic failure detection
  • The previous generation used end-to-end and exception monitors
• Avoid tying ourselves to any particular algorithm, and make new algorithms easy to plug in
• Standardize the APIs for observation, analysis, and control of system components
• Provide common services and abstractions to SLT algorithms
• RM itself must also be resilient to failures
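The plug-in goal above could be sketched roughly as follows. This is only an illustration: the names `SLTPlugin`, `PLUGINS`, `register`, and `ThresholdDetector`, and the convention of returning an anomaly score between 0 and 1, are assumptions made for this example and do not come from RM3G itself.

```python
from abc import ABC, abstractmethod
from typing import Dict, Mapping, Type


class SLTPlugin(ABC):
    """Hypothetical interface that an SLT algorithm implements to plug in."""

    @abstractmethod
    def update(self, observation: Mapping[str, float]) -> float:
        """Consume one observation and return an anomaly score in [0, 1]."""


# Hypothetical registry mapping plug-in names to plug-in classes.
PLUGINS: Dict[str, Type[SLTPlugin]] = {}


def register(name: str):
    """Class decorator that adds a plug-in to the registry under `name`."""
    def decorator(cls: Type[SLTPlugin]) -> Type[SLTPlugin]:
        PLUGINS[name] = cls
        return cls
    return decorator


@register("threshold")
class ThresholdDetector(SLTPlugin):
    """Toy detector (not a real SLT algorithm): flags a metric above a limit."""

    def __init__(self, metric: str = "memory_mb", limit: float = 512.0):
        self.metric = metric
        self.limit = limit

    def update(self, observation: Mapping[str, float]) -> float:
        return 1.0 if observation.get(self.metric, 0.0) > self.limit else 0.0
```

With an interface of this kind, the rest of the system only looks algorithms up by name (for example `PLUGINS["threshold"]()`), so adding a new algorithm means adding one class rather than changing the manager.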

RADS Architecture
[Figure: RADS architecture diagram. Labels: Client, Server, Distributed Middleware, User, Operator, SLT Services (RM3G), Application-Specific Overlay Network, PNE, Edge Network, Router, Commodity Internet & IP networks]

Design Diagram
[Figure: RM3G design diagram. Labels: Comp A, Comp B, Comp C, Observation Points, Control Points, Ctrl/Obsrv point descriptors, Control policies, SLT Plug-ins, SLT Processes (spawned by SLT Proc Srv), Data Store Srv, SLT Select Srv, Ctrl Srv, RM Proc Srv, Name & Reg Srv, RMDB]

Collaboration with ACME
• ACME is an infrastructure for monitoring, analyzing, and controlling Internet-scale systems
  • Sensors = Observation Points
  • Actuators = Control Points
• RM potentially benefits from two ACME features
  • An in-network aggregator that combines data from sensors as it is routed through an overlay network
  • A configuration language that specifies under what conditions to trigger actuators
• ACME could benefit from more powerful sensor data analysis using SLTs

Observation Points
• We want to avoid requiring every component to be individually instrumented
• Components may directly provide their own observation data if they wish (e.g. D-store and SSM provide their own data for monitoring with Pinpoint)
• Several types of observation data can be collected in an application-generic way
  • The OS can provide application-level data (e.g. memory usage, number of open files) and system-level data (e.g. size of swap space, network ports used)
  • Middleware can provide intra-application data (e.g. interactions between different components of an application)
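As a rough illustration of application-generic observation data from the OS, the sketch below samples a few process-level metrics using only the Python standard library. The function name `sample_process_observations` and the key names are hypothetical. For simplicity it inspects the calling process rather than an external component, and the open-file count relies on the Linux-specific `/proc/self/fd` directory.

```python
import os
import time

try:
    import resource  # available on Unix-like systems only
except ImportError:
    resource = None


def sample_process_observations() -> dict:
    """Collect a few application-generic metrics for the current process."""
    sample = {"timestamp": time.time(), "pid": os.getpid()}
    if resource is not None:
        # Peak resident set size; the unit is platform-dependent
        # (kilobytes on Linux, bytes on macOS).
        sample["max_rss"] = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    fd_dir = "/proc/self/fd"  # Linux-specific; skipped where it is absent
    if os.path.isdir(fd_dir):
        sample["open_files"] = len(os.listdir(fd_dir))
    return sample
```

None of these metrics requires the observed application to be instrumented, which is the property the slide asks for.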

SLT Data Services
• Abstract information from observation points
• SLT algorithms are spawned for each component in the system as the component is instantiated
• Observation data is stored by the SLT Data Server, possibly in a streaming database
• Listen for feedback from SLT algorithms to adjust the data stream as necessary
  • Increase the data sampling rate if an anomaly is suspected
  • Stop reporting data that is deemed irrelevant
• Provide persistent data storage for SLT algorithms
  • Remember properties learned from previous analysis of observation data
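The feedback loop described above (sample faster when an anomaly is suspected, drop metrics judged irrelevant) might look like the minimal sketch below. The class `ObservationStream`, its method names, and the choice to halve the sampling interval are all assumptions for illustration, not the RM3G design.

```python
class ObservationStream:
    """Hypothetical per-component data stream that SLT algorithms can tune."""

    def __init__(self, metrics, interval_s=10.0, min_interval_s=1.0):
        self.metrics = set(metrics)        # metrics currently reported
        self.interval_s = interval_s       # seconds between samples
        self.min_interval_s = min_interval_s

    def suspect_anomaly(self):
        """Feedback: halve the sampling interval, down to a fixed floor."""
        self.interval_s = max(self.min_interval_s, self.interval_s / 2)

    def mark_irrelevant(self, metric):
        """Feedback: stop reporting a metric the algorithm does not need."""
        self.metrics.discard(metric)

    def filter(self, sample):
        """Keep only the metrics that are still being reported."""
        return {k: v for k, v in sample.items() if k in self.metrics}
```

For example, after `suspect_anomaly()` a stream created with `interval_s=10.0` samples every 5 seconds, and after `mark_irrelevant("cpu")` the `cpu` value is filtered out of later samples.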

Control Points
• Assume crash-only components
  • Components can be reliably restarted through external means (we can’t rely on components restarting themselves cleanly)
• Initially, only restart control points are supported
  • Instrument the application server (JBoss) to restart applications and application components
  • The OS can restart application servers
  • IP-addressable power strips can restart entire nodes
• Components can specify a custom control policy
  • Leverage ACME’s configuration language
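One way to combine the three restart control points listed above is to try them from the narrowest scope to the widest and stop as soon as the system looks healthy again. The slide does not state this escalation policy, so it is an assumption of this sketch, as are the names `recover`, `ControlPoint`, and `is_healthy`.

```python
from typing import Callable, List, Optional, Tuple

# A control point is a (name, restart action) pair; both are hypothetical.
ControlPoint = Tuple[str, Callable[[], None]]


def recover(control_points: List[ControlPoint],
            is_healthy: Callable[[], bool]) -> Optional[str]:
    """Invoke restart actions in order until the health check passes.

    Returns the name of the control point that restored health,
    or None if every restart was tried without success.
    """
    for name, restart in control_points:
        restart()            # external restart; the component is crash-only
        if is_healthy():
            return name
    return None
```

A caller would pass the control points in order of increasing scope, for example component restart via the application server, then application server restart via the OS, then node restart via the power strip.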

Future Work
• “Master” SLT
  • Multiple SLTs are run for each component; choosing which SLTs to believe is itself an interesting SLT problem
• Support additional types of control points
  • Multi-level settings that tune component parameters (e.g. filter level)
• Support additional types of observation points
  • Use programming language techniques (e.g. source code transformation) to instrument applications in a generic way
• Online SLT algorithms for anomaly detection are not yet mature
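To make the “Master” SLT idea concrete, one possible starting point is the classic weighted majority scheme: each detector votes, and detectors that turn out to be wrong lose weight. The slides do not name this or any other technique, so the sketch below is only an example of the kind of approach that could be explored; the class name and the `beta` penalty are assumptions.

```python
class WeightedMajority:
    """Combine boolean anomaly votes from several detectors."""

    def __init__(self, n_detectors, beta=0.5):
        self.weights = [1.0] * n_detectors
        self.beta = beta  # multiplicative penalty for a wrong vote

    def predict(self, votes):
        """Return True (anomaly) if the weighted 'yes' votes are not outweighed."""
        yes = sum(w for w, v in zip(self.weights, votes) if v)
        no = sum(w for w, v in zip(self.weights, votes) if not v)
        return yes >= no

    def learn(self, votes, truth):
        """Shrink the weight of every detector whose vote disagreed with truth."""
        self.weights = [
            w * self.beta if v != truth else w
            for w, v in zip(self.weights, votes)
        ]
```

This scheme needs ground truth after the fact (for instance, whether a restart actually cured the problem), which is itself an assumption about what feedback the recovery manager can obtain.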
