System-Directed Resilience for Exascale Platforms LDRD Proposal 09-0016
System-Directed Resilience for Exascale Platforms (09-0016)
Ron Oldfield (1423), Neil Pundit (1423), FY09-11, Total $1500 Costs
• Problem
  • Current apps cannot survive a node failure
• Proposed Solution
  • Application-transparent resilience to node failures
• Approach
  • Design/develop system software to support:
    • Application quiescence,
    • Efficient state management,
    • Automatic fault recovery
• R&D Goals & Milestones
  • Investigate and develop new methods for quiescence that don’t hinder other apps.
  • Identify critical application state and develop efficient methods to manage state.
  • Identify system software requirements for
    • dynamic node allocation,
    • network/OS virtualization, and
    • MPI node recovery.
• Relationship to Other Work
  • Scalability and efficient resource utilization, particularly memory and storage, are key issues for this effort.
  • Our team has R&D experience in:
    • Scalable system software (LWK, Portals, LWFS),
    • Smart memory management techniques (Smartmap),
    • RAS systems
  • All efforts developed “lightweight” approaches that are both resource-efficient and scalable.
• Significance of Results
  • Represents a fundamental change in the way HPC systems support resilience.
  • Significant impact on performance: less defensive I/O overhead for checkpoints.
  • Higher levels of reliability.
  • Improved productivity: developers worry less about resilience, more on core science.
Resilience Challenges for Exascale
• Current application characteristics
  • Require large fractions of systems
  • Long running
  • Resource-constrained compute nodes
  • Cannot survive component failure
• Current options for fault tolerance
  • Application-directed checkpoints
  • System-directed checkpoints
  • System-directed incremental checkpoints
  • Checkpoint in memory
  • Others: virtualization, redundant computation, …
• We propose to develop systems software resilient to node failure
  • Support for application quiescence,
  • Efficient (diskless) state management,
  • Fast methods for fault recovery.
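To illustrate why checkpoint-based options become expensive as node counts grow, the following is a rough sketch using Young's first-order approximation for the optimal checkpoint interval. It is not taken from the proposal. The function names, the per-node MTBF, and the checkpoint write time are all hypothetical, and the model only holds while the checkpoint time is small relative to the system MTBF.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """System MTBF, assuming independent node failures (an assumption)."""
    return node_mtbf_hours / nodes


def young_interval_hours(checkpoint_hours: float, mtbf_hours: float) -> float:
    """Young's approximation for the checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def wasted_fraction(checkpoint_hours: float, mtbf_hours: float) -> float:
    """First-order fraction of time lost to checkpoint writes plus rework.

    Checkpoint cost per interval is delta / tau; expected rework after a
    failure is about tau / 2 per MTBF. The result is capped at 1.0 because
    the approximation breaks down once delta approaches the MTBF.
    """
    tau = young_interval_hours(checkpoint_hours, mtbf_hours)
    waste = checkpoint_hours / tau + tau / (2.0 * mtbf_hours)
    return min(waste, 1.0)


if __name__ == "__main__":
    NODE_MTBF_HOURS = 5 * 365 * 24   # hypothetical: 5-year per-node MTBF
    CHECKPOINT_HOURS = 0.25          # hypothetical: 15-minute checkpoint write
    for nodes in (1_000, 10_000, 100_000, 1_000_000):
        mtbf = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
        print(nodes, round(wasted_fraction(CHECKPOINT_HOURS, mtbf), 3))
```

Under these assumptions the wasted fraction grows with the square root of the node count, which is the trend that motivates looking beyond defensive checkpoint I/O.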
Application Quiescence
Goal: Develop methods to suspend application activity without hindering progress of other applications
• Requires
  • Methods for accurate and efficient fault detection
  • Mechanisms and interfaces for conveying node state to shared services (e.g., need a functional RAS system)
• Approach
  • Integrated system software for cooperation among shared services and applications
    • Network layer: deal with messages in transit
    • File system: isolate and suspend in-progress I/O operations
State Management
Goal: Efficient methods for extracting and managing state
Approach
• Identify critical state
• Characterize memory usage
• Investigate resource-efficient methods for logging modified memory.
• App guidance to identify unnecessary data (e.g., ghost cells, cache)
• System guidance for when to extract state
• Explore diskless methods to manage state
• Explore state compression to reduce resource reqs
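The ideas of logging only modified memory, keeping state off disk, and compressing it can be sketched together as follows. This is a minimal illustration, not the project's design. The class name and block size are hypothetical. Content hashing stands in for whatever dirty-page tracking the system software would actually provide, and an in-process dictionary stands in for memory on a partner node.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical tracking granularity, in bytes


class IncrementalCheckpointer:
    """Keeps a compressed, in-memory copy of only the blocks that changed."""

    def __init__(self, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self._digests = {}  # block index -> digest of the last saved content
        self._store = {}    # block index -> zlib-compressed block
        self._length = 0    # length of the most recently saved state

    def _block_count(self, length: int) -> int:
        return (length + self.block_size - 1) // self.block_size

    def checkpoint(self, state: bytes) -> int:
        """Save blocks whose content changed; return how many were written."""
        written = 0
        n_blocks = self._block_count(len(state))
        for index in range(n_blocks):
            start = index * self.block_size
            block = state[start:start + self.block_size]
            digest = hashlib.sha256(block).digest()
            if self._digests.get(index) != digest:
                self._digests[index] = digest
                self._store[index] = zlib.compress(block)
                written += 1
        # Discard blocks that lie beyond the end of a shrunken state.
        for index in [i for i in self._store if i >= n_blocks]:
            del self._store[index]
            del self._digests[index]
        self._length = len(state)
        return written

    def restore(self) -> bytes:
        """Rebuild the most recently checkpointed state."""
        n_blocks = self._block_count(self._length)
        return b"".join(zlib.decompress(self._store[i]) for i in range(n_blocks))
```

A short usage example: checkpointing a 16 KiB buffer writes four blocks the first time, and changing a single byte afterwards causes only the one affected block to be rewritten.

```python
cp = IncrementalCheckpointer()
state = bytearray(4 * BLOCK_SIZE)
cp.checkpoint(bytes(state))   # first checkpoint saves every block
state[5000] = 1               # touch one byte in the second block
cp.checkpoint(bytes(state))   # only the modified block is saved again
```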
Fault Recovery
Goal: Dynamically recover a failed node without restarting the whole application
Approach
• Explore changes to system software to support dynamic node allocation (for swap of failed node).
• Develop network virtualization to abstract physical node ID from software.
• Develop efficient methods for state recovery
• Investigate roll-back, roll-forward techniques
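One way to picture abstracting the physical node ID from software is a translation table that the application addresses by virtual ID, with a spare swapped in when a node fails. The sketch below is purely illustrative. The `NodeMap` class, its methods, and the node names are hypothetical and do not correspond to any existing system software interface.

```python
class NodeMap:
    """Maps stable virtual node IDs to whichever physical node currently backs them."""

    def __init__(self, active, spares):
        # Virtual IDs are assigned 0..n-1 in the order the active nodes are given.
        self._virt_to_phys = dict(enumerate(active))
        self._spares = list(spares)

    def physical(self, virtual_id: int) -> str:
        """Resolve a virtual ID to the current physical node."""
        return self._virt_to_phys[virtual_id]

    def fail(self, physical_id: str) -> int:
        """Swap a spare in for a failed node; return the affected virtual ID."""
        for virt, phys in self._virt_to_phys.items():
            if phys == physical_id:
                if not self._spares:
                    raise RuntimeError("no spare nodes available")
                self._virt_to_phys[virt] = self._spares.pop(0)
                return virt
        raise KeyError(f"{physical_id} is not an active node")


nodes = NodeMap(active=["n0", "n1", "n2"], spares=["s0"])
recovered = nodes.fail("n1")  # virtual ID 1 is now backed by spare "s0"
```

Because peers keep addressing virtual ID 1, only the replaced node needs its state restored. The rest of the application does not have to learn a new address.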
Summary
• Recovering from independent node failures is a critical issue for exascale systems
• We address that problem through modifications to system software
  • Support for application quiescence,
  • Efficient (diskless) state management,
  • Fast methods for fault recovery.
Our approach represents a fundamental change in how systems support resilience
Reviewer Questions
• Programmatic
  • Firm commitments from team if LDRD goes forward?
  • Why is funding flat for FY10 and FY11?
• Technical
  • Is the assertion that “checkpoint overhead will exceed 50% beyond 100K nodes” too modest?
  • Why use the term “components” instead of cores or processors?
• Technical/Programmatic
  • Can the project really address all of the proposed work?
  • With 10-11 technical topics, have we identified all the technical risks?