Autonomous Recovery in Componentized Internet Applications • Candea et al. • Vikram Negi
Introduction • Autonomic Problem • Approach • Results • Discussion
The Autonomic Problem • To allow the application to recover automatically from transient and intermittent software failures.
The Approach • Introduce three ideas: • Macroanalysis (fault detection and localization) • Microrebooting (rapid recovery) • External management (of recovery actions) • Integrate and test with JBoss
Design Overview • Autonomous Process • Monitoring • Java probes • Fault detection • Generate Anomaly report • Recovery • Takes action • Total time to recovery.
J2EE Review • J2EE enterprise apps = collections of reusable Java modules • JSPs / servlets invoke EJBs, which invoke other EJBs, ... • EJB = Java component that complies with a certain interface and provides a service • Deployment descriptor (per-bean XML file) conveys run-time characteristics and dependencies; used in deploying the application
JBoss Design • Open-source J2EE app server • Written entirely in Java • Microkernel with components held together by JMX (Mgmt Support)
JAGR = ROC-ified JBoss with Application-Generic Recovery • 3-tier architecture • Key components • Macroanalysis engine • Microreboot hook • Recovery manager
Pinpoint: Detection and Localization • Stored observations • IP address of machine, timestamp • Globally unique request ID • Number of calls to and returns from EJBs • Association between sender and receiver • Collected SQL queries (updates and reads)
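A minimal sketch of what one stored observation could look like, and how the globally unique request ID lets calls be grouped back into per-request paths. The `Observation` record, its field names, and `calls_per_request` are hypothetical names chosen for illustration; they are not Pinpoint's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One monitoring record (hypothetical layout, not Pinpoint's schema)."""
    host_ip: str       # IP address of the machine that logged the event
    timestamp: float   # time the event was observed
    request_id: str    # globally unique ID of the end-user request
    sender: str        # calling component
    receiver: str      # called component
    kind: str          # "call" or "return"


def calls_per_request(observations):
    """Group call events by request ID, in timestamp order."""
    paths = defaultdict(list)
    for obs in sorted(observations, key=lambda o: o.timestamp):
        if obs.kind == "call":
            paths[obs.request_id].append((obs.sender, obs.receiver))
    return dict(paths)


if __name__ == "__main__":
    log = [
        Observation("10.0.0.1", 2.0, "req-1", "SB_ViewItem", "Item", "call"),
        Observation("10.0.0.1", 1.0, "req-1", "servlet", "SB_ViewItem", "call"),
        Observation("10.0.0.1", 3.0, "req-1", "SB_ViewItem", "Item", "return"),
    ]
    print(calls_per_request(log))
```

Grouping by request ID is what ties the sender/receiver associations to a single user request, so that a failed request can be traced back to the components it touched.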
Pinpoint: Analysis • Analysis engine • Centralized engine • Plugin-based architecture • Modeling components • Assume that present component behavior and historical (normal) behavior follow the same probability distribution. • Chi-square (χ²) test to detect when the two distributions differ.
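The chi-square comparison can be sketched as follows: a component's current call counts to its peers are compared against the counts expected if it still followed its historical distribution. This is a simplified illustration, not Pinpoint's implementation; the function names, the example counts, and the choice of a 0.05 significance level are assumptions.

```python
def chi_square_statistic(observed, historical):
    """Pearson chi-square statistic of current call counts against the
    distribution implied by historical (normal) call counts.

    Both arguments map a peer component name to a call count. Peers that
    never appear in the historical data are ignored in this sketch.
    """
    total_observed = sum(observed.values())
    total_historical = sum(historical.values())
    statistic = 0.0
    for peer, historical_count in historical.items():
        expected = total_observed * historical_count / total_historical
        if expected == 0:
            continue  # no expectation to compare against
        difference = observed.get(peer, 0) - expected
        statistic += difference * difference / expected
    return statistic


def is_anomalous(statistic, critical_value):
    """Flag the component when the statistic exceeds the critical value."""
    return statistic > critical_value


if __name__ == "__main__":
    # Hypothetical counts of calls from one component to three peers.
    historical = {"Item": 800, "User": 150, "Bid": 50}
    current = {"Item": 40, "User": 15, "Bid": 45}
    # 5.991 is the chi-square critical value for 2 degrees of freedom
    # at the 0.05 significance level (an assumed threshold).
    stat = chi_square_statistic(current, historical)
    print(stat, is_anomalous(stat, critical_value=5.991))
```

A component whose current behavior matches its history yields a statistic near zero; a large statistic indicates that the "same distribution" assumption no longer holds.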
Recovery: a microreboot is not expensive • State segregation • Store important state outside the application, in a database. • Persistent state • CMP (container-managed persistence, J2EE) is a requirement for the prototype. • Session state • Stored in a modified SSM (external session state store) • Containment and reintegration • Microreboot the transitive closure of all inter-EJB references • XML deployment descriptors determine the grouping for the closure • Complete reboot or microreboot
Recovery • Enabling microreboots • Method in the JBoss EJB container • Preserve the class loader
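The grouping by transitive closure can be sketched as a graph traversal. This sketch assumes that references are followed in both directions (so a bean holding a reference to the rebooted bean is restarted with it) and that the reference map has already been extracted from the deployment descriptors; the function name and the example map are hypothetical.

```python
from collections import deque


def recovery_group(references, failed):
    """Return every component that must be microrebooted with `failed`.

    `references` maps a component to the components it references.
    Assumption: references are treated as undirected edges, so the
    result is the connected group containing the failed component.
    """
    neighbors = {}
    for source, targets in references.items():
        neighbors.setdefault(source, set())
        for target in targets:
            neighbors[source].add(target)
            neighbors.setdefault(target, set()).add(source)

    group = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for peer in neighbors.get(current, ()):
            if peer not in group:
                group.add(peer)
                queue.append(peer)
    return group


if __name__ == "__main__":
    # Hypothetical inter-EJB references.
    refs = {
        "CommitBid": ["Item", "User"],
        "Item": ["User"],
        "BrowseCategories": ["Category"],
    }
    print(sorted(recovery_group(refs, "User")))
    print(sorted(recovery_group(refs, "Category")))
```

Components outside the group keep running, which is what keeps a microreboot cheaper than a complete reboot.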
Manage Recovery • Recovery policy • Read the failure report; consider components with a score > 1.0 • Microreboot the top n, or all components scoring > 1.0 • Allow a delay (~30 sec) • If the error is still present, retry a few times or reboot completely • Finally, report to the sysadmin
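The policy steps above can be sketched as a small escalation loop. All names here are hypothetical, and the actions (microreboot, full reboot, failure check, admin notification, and the roughly 30-second delay) are passed in as callables so the sketch stays independent of any real application server.

```python
def run_recovery_policy(report, microreboot, full_reboot, still_failing,
                        notify_admin, threshold=1.0, top_n=None,
                        max_attempts=3, settle=lambda: None):
    """Escalate from microreboots to a full reboot to a human.

    `report` maps component name to anomaly score. `settle` stands in
    for the delay (~30 sec) allowed before rechecking for the failure.
    Returns "no_action", "microreboot", "full_reboot", or "escalated".
    """
    # Keep components scoring above the threshold, highest score first.
    suspects = sorted((name for name, score in report.items()
                       if score > threshold),
                      key=lambda name: report[name], reverse=True)
    if top_n is not None:
        suspects = suspects[:top_n]
    if not suspects:
        return "no_action"

    for _ in range(max_attempts):
        for component in suspects:
            microreboot(component)
        settle()
        if not still_failing():
            return "microreboot"

    full_reboot()
    settle()
    if not still_failing():
        return "full_reboot"

    notify_admin()
    return "escalated"


if __name__ == "__main__":
    actions = []
    checks = iter([True, False])  # first check fails, second succeeds
    outcome = run_recovery_policy(
        {"A": 2.5, "B": 0.4, "C": 1.2},
        microreboot=actions.append,
        full_reboot=lambda: actions.append("FULL"),
        still_failing=lambda: next(checks),
        notify_admin=lambda: actions.append("ADMIN"),
    )
    print(outcome, actions)
```

Passing the delay in as `settle` means a real deployment could supply `lambda: time.sleep(30)` while tests run instantly.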
Evaluation: Test Framework • Applications • Petstore 1.1 (12 components, 233 Java files, 11K LOC) • Petstore 1.3.1 (47 components, 310 Java files, 10K LOC) • RUBiS (21 components, 500 Java files, 25K LOC) • Workload • Client simulators driven by a transition table • 350 clients (maximum-utilization principle) • Faultload • Based on industry experience • No low-level hardware or OS faults
Evaluation: Detection • Results similar to other detectors • No discussion of absolute numbers? • Injected faults: forced Java runtime/declared exceptions, call omission, and source code bugs • Two questions: (1) how well was each fault detected, and (2) how well were major outages detected?
Evaluation: Localization • Localization % per algorithm and fault type • CIA > 85% • No absolute data again?
Evaluation: Recovery • Introduce faults in SSM-RUBiS. • Restart SSM-RUBiS or microreboot the component. • Observations from 10 trials, each with 350 concurrent clients.
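A transition-table client simulator can be sketched as a random walk over pages, where each row of the table gives the probability of moving to the next page. The page names and probabilities below are invented for illustration and are not the workload used in the paper.

```python
import random


def simulate_session(transitions, start, steps, rng):
    """Walk the transition table for `steps` requests from `start`.

    `transitions` maps a page to a dict of {next_page: probability}.
    `rng` is a random.Random instance, so runs are reproducible.
    """
    state = start
    visited = [state]
    for _ in range(steps):
        next_pages = list(transitions[state].keys())
        weights = list(transitions[state].values())
        state = rng.choices(next_pages, weights=weights, k=1)[0]
        visited.append(state)
    return visited


if __name__ == "__main__":
    # Hypothetical auction-site pages and transition probabilities.
    table = {
        "Home": {"Browse": 0.7, "Login": 0.3},
        "Browse": {"ViewItem": 0.8, "Home": 0.2},
        "ViewItem": {"Bid": 0.4, "Browse": 0.6},
        "Login": {"Home": 1.0},
        "Bid": {"Home": 1.0},
    }
    print(simulate_session(table, "Home", steps=10, rng=random.Random(42)))
```

Running many such sessions concurrently (350 in the evaluation) approximates a population of users with realistic navigation patterns.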
Full vs. Micro Reboot • Injected a null-reference fault in SB CommitBid, then a corrupt User-Item, SB BrowseCategories, and SB CommitUserFeedback. • Microreboot maintains a steady response. • 425 vs. 3916 failed requests (microreboot vs. full reboot) • 61527 vs. 56028 successful requests (microreboot vs. full reboot) • What error conditions did the other trials have?
Total Recovery Time • Corrupt SB_ViewItem by setting it to NULL. • 19.4 sec TRT • 18.5 sec of that in analysis • Pinpoint is the bottleneck in a microreboot.
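A short calculation puts the request counts in perspective. It assumes the smaller failure count (425) and the larger success count (61527) belong to the microreboot run, which matches the "steady response" observation on the slide; the helper name is hypothetical.

```python
def failure_stats(failed, succeeded):
    """Total requests and the fraction of them that failed."""
    total = failed + succeeded
    return {"total": total, "failure_rate": failed / total}


# Assumed mapping of the slide's numbers to the two recovery modes.
micro = failure_stats(failed=425, succeeded=61527)
full = failure_stats(failed=3916, succeeded=56028)

# Relative reduction in failed requests when microrebooting.
reduction = 1 - 425 / 3916

if __name__ == "__main__":
    print(f"microreboot failure rate: {micro['failure_rate']:.2%}")
    print(f"full reboot failure rate: {full['failure_rate']:.2%}")
    print(f"reduction in failed requests: {reduction:.1%}")
```

Under that assumption, microrebooting fails roughly 89% fewer requests than a full reboot while also serving more requests overall.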
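The bottleneck claim follows directly from the two reported times. As a worked check (the symbols below are introduced here for illustration only):

```latex
\frac{t_{\text{analysis}}}{t_{\text{TRT}}} = \frac{18.5\,\text{s}}{19.4\,\text{s}} \approx 0.954,
\qquad
t_{\text{TRT}} - t_{\text{analysis}} = 19.4\,\text{s} - 18.5\,\text{s} = 0.9\,\text{s}
```

About 95% of the total recovery time is spent in analysis, leaving under a second for everything else, including the microreboot itself.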
Is Pinpoint app-generic? • Upgrade to Petstore v.1.3.2 • Works for the confidence interval • How different was the updated version?
Performance Overhead • Results for a 30-min fault-free run with 350 clients • In-memory vs. external (SSM) session state • Marshalling costs
Assumptions • Well-defined interfaces for components (.NET, J2EE) • Deterministic call paths between components • No critical service requests • Training data for the statistical model • Guidelines (crash-only software)
Discussion • Overall one of the good papers, though maybe a bit verbose in the introduction • Integrating framework for earlier work by Candea • Limitations of the present statistical model • Shared EJB state • Modify JIT, disable microreboots (references, static variables) • Application-global data is not scrubbed • Cost-benefit: microreboot vs. total reboot
Supplementary • Application server = operating system for Internet applications (instantiates app components in containers, provides runtime system services, integrates with the web server to make the app web-accessible) • http://people.epfl.ch/george.candea