Performance Measures x.x, x.x, and x.x
Scalable Fault Tolerance for Petascale Systems
3/20/2008
Greg Bronevetsky, Bronis de Supinski, Peter Lindstrom, Adam Moody, Martin Schulz
CAR - CASC

Enabling Fault Tolerance for Petascale Systems
• Problem:
  • Reliability is a key concern for petascale systems
  • Current fault tolerance approaches scale poorly and use significant I/O bandwidth
• Deliverables:
  • Efficient application checkpointing software for upcoming petascale systems
  • High-performance I/O system designs for future petascale systems
• Ultimate objective: Reliable software on unreliable petascale hardware

Our team has extensive experience implementing scalable fault tolerance and compression techniques
• Funding Request: $500k/year (none from other directorates)
• Team members:
  • Peter Lindstrom (.25 FTE): Floating Point Compression
  • Adam Moody (.5 FTE): Checkpointing/HPC Systems
  • Martin Schulz (.25 FTE): Checkpointing/HPC Systems
  • Greg Bronevetsky (.25 FTE): Checkpointing/Soft Errors
• External collaborators (anticipated):
  • Sally McKee (Cornell University)

Checkpoints on current systems are limited by the I/O bottleneck
• Checkpoint times on current systems:
  • BG/L: 20 minutes per checkpoint (pre-upgrade)
  • Zeus: 26 minutes
  • Argonne BG/P: 30 minutes (target)
  • Thunder: 80 minutes to the parallel file system vs. 1 minute to local disks
• Current practice: drinking the ocean through a straw
• Alternative: flash or disks on the compute network or I/O nodes (see the sketch below)
  • Extra level of cache between the compute nodes and the parallel file system
(Diagram: current path from the compute network through the I/O nodes to the parallel file system, versus a proposed path with storage elements attached at the I/O nodes)

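To make the "extra level of cache" idea concrete, here is a minimal sketch of two-level checkpointing in C with MPI. It is an illustration under stated assumptions, not the project's software: the directory paths, the drain interval, and the function names are invented, and the copy to the parallel file system is done synchronously here where a production system would drain asynchronously.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Drain every Nth local checkpoint to the parallel file system
     * (interval and paths are assumptions for this sketch). */
    #define DRAIN_INTERVAL 10
    #define LOCAL_DIR  "/tmp"            /* node-local disk or flash */
    #define GLOBAL_DIR "/p/gscratch"     /* parallel file system     */

    static void write_file(const char *path, const void *buf, size_t n) {
        FILE *f = fopen(path, "wb");
        if (!f) { perror(path); MPI_Abort(MPI_COMM_WORLD, 1); }
        fwrite(buf, 1, n, f);
        fclose(f);
    }

    /* Fast path: every checkpoint goes to local storage; only every
     * DRAIN_INTERVAL-th copy crosses the I/O bottleneck to the
     * parallel file system (cf. the 1-minute vs. 80-minute Thunder
     * numbers above). */
    static void checkpoint(int rank, int id, const void *state, size_t n) {
        char path[256];
        snprintf(path, sizeof path, LOCAL_DIR "/ckpt.%d.%d", rank, id);
        write_file(path, state, n);                 /* fast local write */
        if (id % DRAIN_INTERVAL == 0) {
            snprintf(path, sizeof path, GLOBAL_DIR "/ckpt.%d.%d", rank, id);
            write_file(path, state, n);             /* slow global write */
        }
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        static double state[1 << 20];               /* stand-in app state */
        memset(state, 0, sizeof state);
        for (int id = 1; id <= 20; id++)
            checkpoint(rank, id, state, sizeof state);
        MPI_Finalize();
        return 0;
    }
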
Checkpoint scalability must be improved to support coming systems such as Sequoia
• Checkpoint Size Reduction
  • Incremental checkpointing: save only state that changed since the last checkpoint
    • Changes detected via runtime or compiler (see the sketch below)
  • Checkpoint compression
    • Floating point-specific
    • Sensitive to relationships between data
• Scalable Checkpoint Coordination
  • Subsets of processors checkpoint together
  • I/O pressure spread evenly over time

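One way a runtime can detect changed state is dirty-page tracking with mprotect; the sketch below shows that mechanism as an assumption, not as the project's implementation. Pages are write-protected after each checkpoint, the SIGSEGV handler records the first write to each page, and the next checkpoint saves only the recorded pages. (Linux-specific; all names are for illustration.)

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 16
    static char  *region;          /* tracked application state         */
    static size_t pagesize;
    static int    dirty[NPAGES];   /* 1 if page changed since last ckpt */

    /* First write to a protected page lands here: record the page,
     * unprotect it, and let the faulting store retry. */
    static void handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        char *addr = (char *)si->si_addr;
        if (addr < region || addr >= region + NPAGES * pagesize)
            _exit(1);              /* a real crash, not our tracking */
        size_t page = (size_t)(addr - region) / pagesize;
        dirty[page] = 1;
        mprotect(region + page * pagesize, pagesize, PROT_READ | PROT_WRITE);
    }

    /* Start a new checkpoint interval: clear the dirty set and
     * write-protect everything so each page faults once on first write. */
    static void begin_interval(void) {
        memset(dirty, 0, sizeof dirty);
        mprotect(region, NPAGES * pagesize, PROT_READ);
    }

    int main(void) {
        pagesize = (size_t)sysconf(_SC_PAGESIZE);
        region = mmap(NULL, NPAGES * pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = {0};
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        begin_interval();
        region[0] = 1;             /* application dirties pages 0 and 5 */
        region[5 * pagesize] = 2;

        for (size_t p = 0; p < NPAGES; p++)
            if (dirty[p])
                printf("incremental checkpoint saves page %zu\n", p);
        return 0;
    }
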
Application-specific APIs will enable novel fault tolerance solutions like those used in ddcMD
• Application semantics improve performance (a hypothetical API is sketched below)
• Programmers can identify:
  • Data that doesn't need to be saved
  • Types of data structures → key for high-performance compression
  • Matrix relationships → recomputation vs. storage
  • Fault detection algorithms
    • Critical for soft errors
    • Ex: ddcMD corrects cache errors on BG/L

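As a concrete but entirely hypothetical illustration of such an API, the sketch below lets an application label each data structure for a checkpointing runtime and register a soft-error detector. Every name in it (ckpt_register, the hint enum, ckpt_register_detector) is invented for this sketch and does not come from the project.

    #include <stdio.h>
    #include <stddef.h>

    /* Hints the application attaches to each registered region. */
    typedef enum {
        CKPT_SKIP,         /* scratch data: never saved               */
        CKPT_FLOAT_ARRAY,  /* use floating point-specific compression */
        CKPT_RECOMPUTE,    /* derived data: rebuild on restart        */
        CKPT_RAW           /* default: save verbatim                  */
    } ckpt_hint_t;

    /* Stub runtime: just records what the application declared. */
    static void ckpt_register(const char *name, void *buf, size_t bytes,
                              ckpt_hint_t hint) {
        static const char *how[] = {"skip", "fp-compress", "recompute", "raw"};
        (void)buf;
        printf("registered %-10s (%7zu bytes): %s\n", name, bytes, how[hint]);
    }

    /* Application-supplied soft-error detector (cf. ddcMD's cache-error
     * correction); a real runtime would call this periodically. */
    static int (*detector)(void);
    static void ckpt_register_detector(int (*is_state_valid)(void)) {
        detector = is_state_valid;
    }

    static int check_state(void) { return 1; }  /* stand-in detector */

    int main(void) {
        static double field[4096], scratch[4096], stiffness[4096];
        ckpt_register("field",     field,     sizeof field,     CKPT_FLOAT_ARRAY);
        ckpt_register("scratch",   scratch,   sizeof scratch,   CKPT_SKIP);
        ckpt_register("stiffness", stiffness, sizeof stiffness, CKPT_RECOMPUTE);
        ckpt_register_detector(check_state);
        return detector() ? 0 : 1;
    }
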
Our project will create a paradigm shift in LLNL application reliability
• Current LLNL practice: users write their own checkpointing code
  • Wastes programmer time
  • Checkpointing at global barriers is unscalable
• Current automated solutions do not scale
  • Very large checkpoints
  • No information about the application
• This project will:
  • Match I/O demands to I/O capacity
  • Minimize programmer effort
  • Scale checkpointing to petascale systems
  • Enable application-specific fault tolerance solutions

Fault tolerance is critical for Sequoia and all future platforms
• CAR S&T Strategy 1.1: “Perform the research to develop new algorithms that can best exploit likely HPC hardware characteristics, including … fault-tolerant algorithms that can withstand processor failure”
• Project enables application fault tolerance
  • Target audience: application developers
  • pf3d uses Adam Moody’s in-memory checkpointer
  • ddcMD implements complex error tolerance schemes
• Deliverables:
  • Efficient application checkpointing software for upcoming petascale systems (e.g. Sequoia)
  • High-performance I/O system designs for future petascale systems
  • Application-specific fault tolerance APIs