Recursive Restartability In a Networked Ground Station (RRINGS) Rushabh Doshi and Rakesh Gowda Computer Science Department, Stanford University
Introduction • Hypothesis: • In conjunction with fault detection, enabling the ground station (GS) for Recursive Restartability (RR) will increase system availability • Approach: • Verify the applicability of RR to a single GS node. • Design a framework for enabling RR in new/existing GS modules and systems. • Integrate with Fault Detection (FD) component. Fall 2001 - CS444A
Current State of the Art • The restart scalpel is a novel approach (Candea, Fox). • Sledgehammer restarts • Microsoft Cluster Server (formerly Wolfpack) uses clustering and application-level restarts to achieve higher availability. • An unnamed Internet portal does prophylactic restarts on Apache • However, none of the above uses an RR scalpel • We are developing RR scalpel techniques
Program Flow • Wait for a fault message from the Fault Detector • Consult an oracle to decide what to restart • Restart those components • A decision tree is the oracle • Construct the decision tree • Capture restart dependency information
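The flow above can be sketched as a small loop. This is only an illustration in Python, not the authors' implementation: the tree layout, the component names (`root`, `a`, `a1`, ...), the `None` shutdown sentinel and the `restart` callback are all assumptions made for the example.

```python
from queue import Queue

# Hypothetical restart tree (parent -> children); not the ground station's real layout.
TREE = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": [],
    "a1": [],
    "a2": [],
}

def restart_set(tree, node):
    """Oracle: the faulty node followed by all of its descendants, parents first."""
    order = [node]
    for child in tree.get(node, []):
        order.extend(restart_set(tree, child))
    return order

def run(faults, tree, restart):
    """Handle fault messages until a None sentinel arrives (sentinel is an assumption)."""
    while True:
        faulty = faults.get()  # blocks until the fault detector reports something
        if faulty is None:
            break
        for component in restart_set(tree, faulty):
            restart(component)

if __name__ == "__main__":
    faults = Queue()
    faults.put("a")
    faults.put(None)
    restarted = []
    run(faults, TREE, restarted.append)
    print(restarted)  # ['a', 'a1', 'a2']
```

Passing the restart action in as a callback keeps the loop independent of how components are actually killed and relaunched.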
RR Tree • RR Tree captures restart dependency information • Parents must be able to restart children [Figure: RR tree over the nodes Pipeline, ise, istr, istu, pbcom, fedr] Legend: ise: IServiceEstimator; istr: IserviceTracker; istu: IServiceTuner; fedr: FedRadio; pbcom: PipelineByteCOMPort
From RR Trees to Decision Trees • Components have different restart times • Components have different failure rates • Use this information to augment Decision Tree • Preserve dependencies • Reduce MTTR • Move slower components up, push faster components down • Capture historical information: Groups of components that fail together • Move high-failure components to single nodes
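A small Python sketch may help show why moving slow components up can reduce MTTR. It assumes that restarting a node restarts its whole subtree in parallel, so recovery takes as long as the slowest restart in that subtree. The two component names, the restart times and the failure counts are invented for illustration and are not measurements from the ground station.

```python
def subtree(tree, node):
    """Return `node` followed by all of its descendants."""
    nodes = [node]
    for child in tree.get(node, []):
        nodes.extend(subtree(tree, child))
    return nodes

def restart_cost(tree, times, node):
    """Recovery time when `node` fails: slowest restart in its subtree (assumed parallel)."""
    return max(times[n] for n in subtree(tree, node))

def mean_recovery_time(tree, times, failures):
    """Average recovery time, weighted by how often each component fails."""
    total = sum(count * restart_cost(tree, times, node)
                for node, count in failures.items())
    return total / sum(failures.values())

# Hypothetical numbers: restart times in seconds, failure counts over some period.
TIMES = {"fast": 2, "slow": 20}
FAILURES = {"fast": 8, "slow": 2}

FAST_ON_TOP = {"fast": ["slow"], "slow": []}  # every fast failure also restarts slow
SLOW_ON_TOP = {"slow": ["fast"], "fast": []}  # fast can be restarted on its own

if __name__ == "__main__":
    print(mean_recovery_time(FAST_ON_TOP, TIMES, FAILURES))  # 20.0
    print(mean_recovery_time(SLOW_ON_TOP, TIMES, FAILURES))  # 5.6
```

With these made-up numbers, placing the slow component above the fast one means the frequent, cheap failures no longer pay for the slow restart.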
Restructuring helps! • Sample Restart times for different components
Better RR Tree [Figure: restructured RR tree over the nodes Pipeline, ise, istr, istu, pbcom, fedr] Legend: ise: IServiceEstimator; istr: IserviceTracker; istu: IServiceTuner; fedr: FedRadio; pbcom: PipelineByteCOMPort
Making the Decision • Algorithm: • Get a fault, restart the node and children • May not be able to kill the node • Restart may not solve the problem • If this does not fix the problem • Retry a constant number of times • Go up one level • Repeat • Log all faults and restarts
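The retry-and-escalate algorithm above could look roughly like the following Python sketch. The `parent` map, the `restart_subtree` and `is_healthy` callbacks, and the default of three retries are assumptions for the example; the slides only say that the retry count is a constant.

```python
def recover(parent, faulty, restart_subtree, is_healthy, max_retries=3, log=print):
    """Restart `faulty`; if the fault persists, retry, then escalate to the parent.

    Returns the node whose restart cleared the fault, or None if even the
    root's restarts did not help.
    """
    node = faulty
    while node is not None:
        for attempt in range(1, max_retries + 1):
            log(f"restarting {node} and its children (attempt {attempt})")
            restart_subtree(node)
            if is_healthy():
                return node
        node = parent.get(node)  # go up one level; the root has no parent
    return None

if __name__ == "__main__":
    # Hypothetical three-level chain in which only a restart of "mid" helps.
    parent = {"leaf": "mid", "mid": "root", "root": None}
    calls = []
    fixed_at = recover(parent, "leaf", calls.append,
                       lambda: calls[-1] == "mid", max_retries=2)
    print(fixed_at, calls)  # mid ['leaf', 'leaf', 'mid']
```

Routing every attempt through `log` mirrors the requirement to record all faults and restarts, which is also the history the decision tree can later draw on.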
Kill – Restart mechanism • Need for a softer kill • Not all components may be misbehaving • Give components a chance to free resources • If soft kill fails, follow with hard kill • kill -9 (SIGKILL) on Linux • Restart implemented as a Java Runtime.exec(…) call
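The soft-then-hard kill can be sketched with Python's standard `subprocess` module (the slides describe a Java implementation; Python is used here only for illustration). The five-second grace period and the function names `stop` and `restart` are assumptions.

```python
import subprocess
import sys

def stop(proc, grace_seconds=5.0):
    """Ask the process to exit; force it if it has not exited within the grace period."""
    proc.terminate()  # soft kill: SIGTERM on POSIX, lets the process free resources
    try:
        proc.wait(timeout=grace_seconds)
        return "soft"
    except subprocess.TimeoutExpired:
        proc.kill()   # hard kill: SIGKILL on POSIX, the equivalent of kill -9
        proc.wait()   # reap the process so it does not linger as a zombie
        return "hard"

def restart(proc, argv, grace_seconds=5.0):
    """Stop the old process, then launch a fresh one from the same command line."""
    stop(proc, grace_seconds)
    return subprocess.Popen(argv)

if __name__ == "__main__":
    # Stand-in component: a child Python process that just sleeps.
    argv = [sys.executable, "-c", "import time; time.sleep(60)"]
    component = subprocess.Popen(argv)
    component = restart(component, argv)
    print(stop(component))
```

On Windows, `terminate()` and `kill()` behave the same, so the soft/hard distinction in this sketch only applies on POSIX systems such as Linux.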
Designing a system for RR • Goal is to decrease MTTR • Decompose components into smaller pieces • Advantages • Fault isolation • Move slow-restart pieces up (and fast-restart pieces down) • Significantly decreases MTTR • Example: fedr and pbcom • Disadvantages • Some components may not be decomposable • IPC can make things difficult (they were together for a reason): the coordination aspect • State management
State Management • Stateful components need to resynchronize after restart • Resynchronization complexity is a function of system design • GS Resynchronization • All components keep soft state • “Hard state” lives in the control GUI, which we are not modeling here. • Future GS Resynchronization • Protect system goal state in “safe” stable storage. • Components refresh from this stable storage • Details not yet defined.
Results • Increased reliability in GS through RR • Developed framework for enabling new GS modules • Future work: • Develop protected stable storage techniques • Extend framework for a multi-component GS • Extend framework to a federated Virtual GS