Fault Tolerance BOF
• Possible CBHPC paper
  • Co-authors wanted
  • Tammy, Rob, Bruce, Daniel, Nanbor, Sameer, Jim, Doug, David
• What infrastructure is needed to enable application-level fault tolerance (FT) in (component) applications?
• Little experience with anything beyond checkpoint/restart (C/R) in general
• Assume an FT-friendly lower-level environment
  • Event service for awareness of faults
  • Ability to request certain behavior from lower-level software
  • Example: scheduler shouldn’t automatically kill an FT job
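The "event service for awareness of faults" bullet can be pictured as a small publish/subscribe service. The sketch below is only an illustration under stated assumptions: the class `FaultEventService`, the topic name `"node_failure"`, and the event layout are all hypothetical and do not come from any existing framework.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class FaultEventService:
    """Hypothetical publish/subscribe service for fault notifications."""

    def __init__(self) -> None:
        # Maps a topic name to the callbacks subscribed to it.
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        """Register a callback to be invoked for every event on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to each subscriber; return how many were notified."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(event)
        return len(callbacks)


# Usage: a component records node failures reported by lower-level software.
seen: List[dict] = []
service = FaultEventService()
service.subscribe("node_failure", seen.append)
delivered = service.publish("node_failure", {"node": 3})
```

A real service would also have to deliver events across processes and survive the faults it reports; this sketch only shows the shape of the interface.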
Use Cases
• MCMD app, 1 node fails
  • Recovery: restart the failed task, or ignore the failure and continue (self-healing)
• MCMD C/R
Checkpoint/Restart Taxonomy
• System-level
  • E.g., BLCR, Cray XT (site option?), but not universally available
  • Store (complete) memory image to stable storage
  • Daemon schedules checkpoints
  • No application (or framework) involvement
  • Possibly app can request checkpoint
  • Potential problems: open files, driver state, in-flight messages, etc.
  • Component interface to system C/R API
  • Component support for intelligent reduced checkpointing (MyState interface)
Application-level
• Coordinated
• Uncoordinated
• Causal, message logging, etc.
• Incremental checkpointing support
• Capture component assembly
• Checkpoint data component
• In-memory (copy or RAID), disk, write-behind, etc., special system services
• Quality of Fault Tolerance
• What does the interface look like? Like RMI
• Components to detect faults
• Reduced storage (satisfying stability criteria, but not all available data)
• Checkpoint-free FT data holders
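As one possible reading of "incremental checkpointing support", a checkpoint data component could write only the data blocks that changed since the previous checkpoint. The sketch below assumes state is handed over as named byte blocks and uses content hashes to detect changes. The class `IncrementalCheckpointer` and its in-memory `storage` dictionary are hypothetical stand-ins for a real storage back end.

```python
import hashlib
from typing import Dict, List


class IncrementalCheckpointer:
    """Hypothetical checkpoint data component that stores only changed blocks."""

    def __init__(self) -> None:
        self._digests: Dict[str, str] = {}  # last stored digest per block
        self.storage: Dict[str, bytes] = {}  # stand-in for stable storage

    def checkpoint(self, blocks: Dict[str, bytes]) -> List[str]:
        """Store blocks whose content changed; return the names written."""
        written = []
        for name, data in blocks.items():
            digest = hashlib.sha256(data).hexdigest()
            if self._digests.get(name) != digest:
                self.storage[name] = data
                self._digests[name] = digest
                written.append(name)
        return written


# Usage: the second checkpoint writes only the block that changed.
ckpt = IncrementalCheckpointer()
first = ckpt.checkpoint({"mesh": b"static", "field": b"t=0"})
second = ckpt.checkpoint({"mesh": b"static", "field": b"t=1"})
```

Hashing every block still costs a full pass over memory; system-level approaches typically track dirty pages instead, which this sketch does not attempt.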
How to capture/restore state of black-box components?
• Components implement a SaveYourself method
  • Central service invokes SaveYourself on all components that implement it
  • Serialized data sent to central service for storage
• How to restart and restore state?
• Not all components will implement SaveYourself
  • May not have state to store
  • May be an error
  • Check at start of execution and notify user
• RestoreYourself
• Specify state (in SIDL file?) and auto-generate serialize/unserialize methods
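The SaveYourself/RestoreYourself idea above can be sketched as follows. This is a minimal illustration, not a proposed API: the Python method names `save_yourself` and `restore_yourself`, the `CheckpointService` class, and the use of `pickle` for serialization are all assumptions made for the example.

```python
import pickle
from typing import Dict, Iterable, List


class Component:
    """Base component with no checkpointable state."""

    def __init__(self, name: str) -> None:
        self.name = name


class Counter(Component):
    """Example stateful component that implements both methods."""

    def __init__(self, name: str) -> None:
        super().__init__(name)
        self.count = 0

    def save_yourself(self) -> bytes:
        # Serialize only the state needed to resume.
        return pickle.dumps({"count": self.count})

    def restore_yourself(self, blob: bytes) -> None:
        self.count = pickle.loads(blob)["count"]


class CheckpointService:
    """Hypothetical central service that collects serialized state."""

    def __init__(self, components: Iterable[Component]) -> None:
        self.components = list(components)
        self.store: Dict[str, bytes] = {}

    def missing_support(self) -> List[str]:
        """Names of components without save_yourself, for a start-up warning."""
        return [c.name for c in self.components if not hasattr(c, "save_yourself")]

    def checkpoint(self) -> None:
        for c in self.components:
            if hasattr(c, "save_yourself"):
                self.store[c.name] = c.save_yourself()

    def restore(self) -> None:
        for c in self.components:
            if c.name in self.store:
                c.restore_yourself(self.store[c.name])


# Usage: checkpoint, corrupt the state, then roll back.
counter = Counter("counter")
logger = Component("logger")  # stateless, does not implement save_yourself
svc = CheckpointService([counter, logger])
unsupported = svc.missing_support()
counter.count = 5
svc.checkpoint()
counter.count = 99
svc.restore()
```

The `missing_support` check corresponds to the "check at start of execution and notify user" bullet: the service cannot tell whether a component without the method is stateless or simply incomplete, so it reports the names and leaves the decision to the user.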
Another idea
• Components register their state data with a central service
• Breaks encapsulation and object orientation, but not a major violation
• Could be higher performance
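The registration alternative might look like the sketch below, in which the service holds references to component data and copies it without calling back into the component. The class `StateRegistry` and the key `"solver.field"` are hypothetical, and the sketch assumes every registered object is a plain `dict`.

```python
import copy
from typing import Dict


class StateRegistry:
    """Hypothetical central service that checkpoints registered data directly."""

    def __init__(self) -> None:
        self._registered: Dict[str, dict] = {}
        self._snapshot: Dict[str, dict] = {}

    def register(self, key: str, data: dict) -> None:
        # Keep a reference (not a copy) so later changes are visible here.
        self._registered[key] = data

    def checkpoint(self) -> None:
        self._snapshot = {k: copy.deepcopy(v) for k, v in self._registered.items()}

    def restore(self) -> None:
        # Restore in place so the component's own reference stays valid.
        for key, saved in self._snapshot.items():
            live = self._registered[key]
            live.clear()
            live.update(copy.deepcopy(saved))


# Usage: the component never serializes anything itself.
field = {"t": 0.0, "values": [1, 2]}
registry = StateRegistry()
registry.register("solver.field", field)
registry.checkpoint()
field["t"] = 1.0
field["values"].append(3)
registry.restore()
```

The encapsulation cost is visible here: the service reaches into the component's data directly, which is also why it could avoid a serialization round trip.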
Recovery
• Local restore vs. global restore?
• Rollback
  • Need to save the point in the execution path at which the checkpoint was taken?
  • Put responsibility on app components?
• Extend GoPort abstraction to include save/restore/restart?
• Framework tells components to restart
  • What order? Shouldn’t matter
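One way to picture an extended GoPort is an abstract port with `save`, `restore`, and `restart` next to `go`. The sketch below is an assumption-laden illustration: `FTGoPort`, `TimeStepper`, `run_with_restart`, the shared `store` dictionary, and the simulated failure are all hypothetical. Here the saved step counter plays the role of the "point in the execution path" at which the checkpoint was taken.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class FTGoPort(ABC):
    """Hypothetical GoPort extended with save/restore/restart."""

    @abstractmethod
    def go(self) -> int: ...

    @abstractmethod
    def save(self) -> dict: ...

    @abstractmethod
    def restore(self, state: dict) -> None: ...

    def restart(self, state: dict) -> int:
        """Roll back to the given state and resume execution."""
        self.restore(state)
        return self.go()


class TimeStepper(FTGoPort):
    def __init__(self, total_steps: int, store: Dict[str, dict],
                 fail_at: Optional[int] = None) -> None:
        self.total_steps = total_steps
        self.store = store  # stand-in for a checkpoint service
        self.step = 0
        self._fail_at = fail_at

    def go(self) -> int:
        while self.step < self.total_steps:
            if self.step == self._fail_at:
                self._fail_at = None  # inject the fault only once
                raise RuntimeError("simulated node failure")
            self.step += 1
            if self.step % 2 == 0:  # checkpoint every second step
                self.store["stepper"] = self.save()
        return 0

    def save(self) -> dict:
        return {"step": self.step}

    def restore(self, state: dict) -> None:
        self.step = state["step"]


def run_with_restart(port: FTGoPort, store: Dict[str, dict], key: str) -> Tuple[int, int]:
    """Framework side: run the port and restart once from the last checkpoint."""
    store[key] = port.save()  # initial checkpoint
    try:
        return port.go(), 0
    except RuntimeError:
        return port.restart(store[key]), 1


# Usage: the fault at step 3 triggers a rollback to the checkpoint at step 2.
store: Dict[str, dict] = {}
stepper = TimeStepper(total_steps=5, store=store, fail_at=3)
status, restarts = run_with_restart(stepper, store, "stepper")
```

This covers only a local restore of a single component; a global restore would need the framework to roll back every component to a mutually consistent checkpoint.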
Paper themes
• How components can help with FT
  • Abstracting FT services into reusable components
  • Abstracting FT requirements into ports
• How components make FT more complicated
  • Don’t have monolithic view of application (state)
• Going beyond C/R