
Recursive Restartability In a Networked Ground Station (RRINGS)


Presentation Transcript


  1. Recursive Restartability In a Networked Ground Station (RRINGS)
  Rushabh Doshi and Rakesh Gowda, Computer Science Department, Stanford University
  Fall 2001 - CS444A

  2. Introduction
  • Hypothesis: in conjunction with fault detection, enabling the ground station (GS) for Recursive Restartability (RR) will increase system availability.
  • Approach:
    • Verify the applicability of RR to a single GS node.
    • Design a framework for enabling RR in new and existing GS modules and systems.
    • Integrate with the Fault Detection (FD) component.

  3. Current State of the Art
  • The restart scalpel is a novel approach (Candea, Fox).
  • Sledgehammer restarts:
    • MS Cluster Server (formerly Wolfpack) uses clustering and application-level restarts to achieve higher availability.
    • An unnamed Internet portal performs prophylactic restarts on Apache.
  • However, none of the above use an RR scalpel.
  • We are developing RR scalpel techniques.

  4. Program Flow
  • Wait for a fault message from the Fault Detector.
  • Consult an oracle to tell you what to restart.
  • Restart those components.
  • The oracle is a decision tree:
    • Construct the decision tree by capturing restart dependency information.
  (A minimal Java sketch of this loop follows.)
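  The loop above is small enough to sketch directly. The sketch below is illustrative Java, not the actual RRINGS code: FaultDetector, DecisionTree, and Restarter are assumed interfaces standing in for the real fault-detection and restart machinery.

      import java.util.Set;

      // Assumed interfaces standing in for the real RRINGS components.
      interface FaultDetector { String nextFault() throws InterruptedException; }
      interface DecisionTree  { Set<String> componentsToRestart(String fault); }
      interface Restarter     { void restart(String componentName); }

      public class RestartLoop {
          public static void run(FaultDetector fd, DecisionTree oracle, Restarter r)
                  throws InterruptedException {
              while (true) {
                  String fault = fd.nextFault();                  // block until a fault arrives
                  for (String c : oracle.componentsToRestart(fault)) {
                      r.restart(c);                               // restart what the oracle picked
                  }
              }
          }
      }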

  5. RR Tree
  • The RR tree captures restart dependency information.
  • Parents must be able to restart their children.
  [Figure: RR tree with Pipeline at the root and ise, istr, istu, pbcom, and fedr beneath it.]
  Legend: ise: IServiceEstimator, istr: IServiceTracker, istu: IServiceTuner, fedr: FedRadio, pbcom: PipelineByteCOMPort.
  (A possible Java encoding of such a tree follows.)
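  One straightforward Java encoding of such a tree is sketched below; the class name and fields are assumptions for illustration. The invariant from this slide is baked in: restarting a node restarts its entire subtree, so a parent can always restart its children.

      import java.util.ArrayList;
      import java.util.List;
      import java.util.function.Consumer;

      // Illustrative RR tree node (not the actual RRINGS data structure).
      class RRNode {
          final String name;
          final RRNode parent;                        // null at the root (e.g. Pipeline)
          final List<RRNode> children = new ArrayList<>();

          RRNode(String name, RRNode parent) {
              this.name = name;
              this.parent = parent;
              if (parent != null) parent.children.add(this);
          }

          // Restarting a node always restarts the whole subtree beneath it.
          void restartSubtree(Consumer<String> restart) {
              restart.accept(name);
              for (RRNode c : children) c.restartSubtree(restart);
          }
      }

  Building the tree from this slide would then be, for example, RRNode pipeline = new RRNode("pipeline", null), followed by new RRNode("ise", pipeline) and so on for each child.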

  6. From RR Trees to Decision Trees
  • Components have different restart times.
  • Components have different failure rates.
  • Use this information to augment the decision tree:
    • Preserve dependencies.
    • Reduce MTTR: move slower components up, push faster components down.
  • Capture historical information: groups of components that fail together.
    • Move high-failure components to single nodes.
  (An illustrative cost heuristic for this restructuring is sketched below.)
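  One way to drive such restructuring is to score each component by its contribution to expected downtime. The heuristic below is an illustration, not the authors' algorithm: components with a high failure-rate x restart-time product are the ones worth isolating, and cheap, frequently failing components belong near the leaves where their restarts stay contained. The sample figures are made up.

      import java.util.Arrays;
      import java.util.Comparator;

      // Illustrative heuristic: expected downtime = failure rate x restart time.
      class ComponentCost {
          final String name;
          final double restartSeconds;   // measured restart time
          final double failuresPerDay;   // observed failure rate

          ComponentCost(String name, double restartSeconds, double failuresPerDay) {
              this.name = name;
              this.restartSeconds = restartSeconds;
              this.failuresPerDay = failuresPerDay;
          }

          double expectedDailyDowntime() { return failuresPerDay * restartSeconds; }

          public static void main(String[] args) {
              ComponentCost[] cs = {                 // made-up sample figures
                  new ComponentCost("fedr", 30.0, 0.5),
                  new ComponentCost("pbcom", 2.0, 4.0),
              };
              // Sort ascending: the cheapest components are candidates for deep leaves.
              Arrays.sort(cs, Comparator.comparingDouble(ComponentCost::expectedDailyDowntime));
              for (ComponentCost c : cs)
                  System.out.println(c.name + ": " + c.expectedDailyDowntime() + " s/day");
          }
      }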

  7. Restructuring Helps!
  [Chart: sample restart times for the different components.]

  8. Better RR Tree
  [Figure: the restructured RR tree over the same five components.]
  Legend: ise: IServiceEstimator, istr: IServiceTracker, istu: IServiceTuner, fedr: FedRadio, pbcom: PipelineByteCOMPort.

  9. Making the Decision
  • Algorithm:
    • Get a fault; restart the node and its children.
      • We may not be able to kill the node.
      • A restart may not solve the problem.
    • If this does not fix the problem:
      • Retry a constant number of times.
      • Go up one level and repeat.
  • Log all faults and restarts.
  (A Java sketch of this escalation loop follows.)
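  Reusing the RRNode sketch from the RR Tree slide, the algorithm above might look like the following. MAX_RETRIES, the faultCleared check, and the log format are assumptions; the slide only says to retry "a constant number of times".

      import java.util.function.BooleanSupplier;
      import java.util.function.Consumer;

      class RestartEscalation {
          static final int MAX_RETRIES = 3;   // "a constant number of times"; value assumed

          // Restart the faulty node's subtree; if the fault persists, retry, then
          // climb one level toward the root and repeat. Returns false if even a
          // root restart did not clear the fault.
          static boolean recover(RRNode faulty, Consumer<String> restart,
                                 BooleanSupplier faultCleared) {
              for (RRNode n = faulty; n != null; n = n.parent) {
                  for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                      n.restartSubtree(restart);
                      System.out.println("LOG restart subtree=" + n.name
                                         + " attempt=" + attempt);   // log all restarts
                      if (faultCleared.getAsBoolean()) return true;
                  }
              }
              return false;
          }
      }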

  10. Kill-Restart Mechanism
  • Need for a softer kill:
    • Not all components may be misbehaving.
    • Give components a chance to free resources.
  • If the soft kill fails, follow with a hard kill:
    • the kill -9 command on Linux (SIGKILL).
  • Restart is implemented as a Java Runtime.exec(...) call.
  (A sketch of the two-phase kill follows.)
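  A modern-Java sketch of that two-phase kill is below. It uses today's java.lang.Process API, whose destroy() and destroyForcibly() map to SIGTERM and SIGKILL on Linux, rather than the 2001-era code; the 5-second grace period is an assumption.

      import java.io.IOException;
      import java.util.concurrent.TimeUnit;

      class KillRestart {
          // Soft kill first (lets the component free resources); if the process
          // is still alive after the grace period, hard kill (kill -9 equivalent);
          // then relaunch the component.
          static Process killAndRestart(Process p, String... command)
                  throws IOException, InterruptedException {
              p.destroy();                                  // soft kill (SIGTERM)
              if (!p.waitFor(5, TimeUnit.SECONDS)) {        // grace period: assumed 5 s
                  p.destroyForcibly().waitFor();            // hard kill (SIGKILL), await exit
              }
              return new ProcessBuilder(command).start();   // restart the component
          }
      }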

  11. Designing a System for RR
  • Goal is to decrease MTTR.
  • Decompose components into smaller pieces.
  • Advantages:
    • Fault isolation.
    • Slow-restart pieces can move up the tree (and fast-restart pieces down), which significantly decreases MTTR.
    • Example: fedr and pbcom.
  • Disadvantages:
    • Some components may not be decomposable.
    • IPC can make things difficult (the pieces were together for a reason): coordination becomes an issue.
    • State management.

  12. State Management
  • Stateful components need to resynchronize after a restart.
  • Resynchronization complexity is a function of system design.
  • GS resynchronization today:
    • All components keep soft state.
    • "Hard state" lives in the control GUI, which we are not modeling here.
  • Future GS resynchronization:
    • Protect the system's goal state in "safe" stable storage.
    • Components refresh from this stable storage.
    • Details not yet defined. (A sketch of the idea follows.)
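  Since the slide leaves the future design open, the following is only a sketch of the direction it points at: after every restart, a component rebuilds its soft state by reloading goal state from a protected store. The Properties file format and the method names are assumptions.

      import java.io.FileReader;
      import java.io.IOException;
      import java.util.Properties;

      class SoftStateComponent {
          private final Properties goalState = new Properties();

          // Called at startup and again after every restart: rebuild soft state
          // from the protected stable store instead of resynchronizing with peers.
          void refreshFromStableStorage(String storePath) throws IOException {
              try (FileReader in = new FileReader(storePath)) {
                  goalState.clear();
                  goalState.load(in);
              }
          }
      }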

  13. Results
  • Increased reliability in the GS through RR.
  • Developed a framework for enabling RR in new GS modules.
  • Future work:
    • Develop protected stable-storage techniques.
    • Extend the framework to a multi-component GS.
    • Extend the framework to a federated Virtual GS.
