Performance Measures x.x, x.x, and x.x
Scalable Fault Tolerance for Petascale Systems
3/20/2008
Greg Bronevetsky, Bronis de Supinski, Peter Lindstrom, Adam Moody, Martin Schulz
CAR - CASC

Enabling Fault Tolerance for Petascale Systems
• Problem:
  • Reliability is a key concern for petascale systems
  • Current fault tolerance approaches scale poorly and use significant I/O bandwidth
• Deliverables:
  • Efficient application checkpointing software for upcoming petascale systems
  • High-performance I/O system designs for future petascale systems
• Ultimate objective: Reliable software on unreliable petascale hardware

Our team has extensive experience implementing scalable fault tolerance and compression techniques
• Funding Request: $500k/year (none from other directorates)
• Team members:
  • Peter Lindstrom (.25 FTE): Floating Point Compression
  • Adam Moody (.5 FTE): Checkpointing/HPC Systems
  • Martin Schulz (.25 FTE): Checkpointing/HPC Systems
  • Greg Bronevetsky (.25 FTE): Checkpointing/Soft Errors
• External collaborators (anticipated):
  • Sally McKee (Cornell University)

Checkpoints on current systems are limited by the I/O bottleneck
• Checkpoint times on current systems:
  • BG/L: 20 minutes per checkpoint (pre-upgrade)
  • Zeus: 26 minutes
  • Argonne BG/P: 30 minutes (target)
  • Thunder: 80 minutes to the parallel file system vs. 1 minute to local disks
• Current practice: drinking the ocean through a straw
• Alternative: flash or disks on the compute network or I/O nodes (see the sketch below)
  • Extra level of cache between the compute nodes and the parallel file system
(Diagram: current path from the compute network through the I/O nodes to the parallel file system, versus a proposed path with storage elements attached at the I/O nodes)

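To make the "extra level of cache" idea concrete, here is a minimal sketch of two-level checkpointing in C with MPI. It is an illustration under stated assumptions, not the project's software: the directory paths, the drain interval, and the function names are invented, and the copy to the parallel file system is done synchronously here where a production system would drain asynchronously.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Drain every Nth local checkpoint to the parallel file system
     * (interval and paths are assumptions for this sketch). */
    #define DRAIN_INTERVAL 10
    #define LOCAL_DIR  "/tmp"            /* node-local disk or flash */
    #define GLOBAL_DIR "/p/gscratch"     /* parallel file system     */

    static void write_file(const char *path, const void *buf, size_t n) {
        FILE *f = fopen(path, "wb");
        if (!f) { perror(path); MPI_Abort(MPI_COMM_WORLD, 1); }
        fwrite(buf, 1, n, f);
        fclose(f);
    }

    /* Fast path: every checkpoint goes to local storage; only every
     * DRAIN_INTERVAL-th copy crosses the I/O bottleneck to the
     * parallel file system (cf. the 1-minute vs. 80-minute Thunder
     * numbers above). */
    static void checkpoint(int rank, int id, const void *state, size_t n) {
        char path[256];
        snprintf(path, sizeof path, LOCAL_DIR "/ckpt.%d.%d", rank, id);
        write_file(path, state, n);                 /* fast local write */
        if (id % DRAIN_INTERVAL == 0) {
            snprintf(path, sizeof path, GLOBAL_DIR "/ckpt.%d.%d", rank, id);
            write_file(path, state, n);             /* slow global write */
        }
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        static double state[1 << 20];               /* stand-in app state */
        memset(state, 0, sizeof state);
        for (int id = 1; id <= 20; id++)
            checkpoint(rank, id, state, sizeof state);
        MPI_Finalize();
        return 0;
    }
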
Checkpoint scalability must be improved to support coming systems such as Sequoia
• Checkpoint Size Reduction
  • Incremental checkpointing: save only state that changed since the last checkpoint
    • Changes detected via runtime or compiler (see the sketch below)
  • Checkpoint compression
    • Floating point-specific
    • Sensitive to relationships between data
• Scalable Checkpoint Coordination
  • Subsets of processors checkpoint together
  • I/O pressure spread evenly over time

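One way a runtime can detect changed state is dirty-page tracking with mprotect; the sketch below shows that mechanism as an assumption, not as the project's implementation. Pages are write-protected after each checkpoint, the SIGSEGV handler records the first write to each page, and the next checkpoint saves only the recorded pages. (Linux-specific; all names are for illustration.)

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPAGES 16
    static char  *region;          /* tracked application state         */
    static size_t pagesize;
    static int    dirty[NPAGES];   /* 1 if page changed since last ckpt */

    /* First write to a protected page lands here: record the page,
     * unprotect it, and let the faulting store retry. */
    static void handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        char *addr = (char *)si->si_addr;
        if (addr < region || addr >= region + NPAGES * pagesize)
            _exit(1);              /* a real crash, not our tracking */
        size_t page = (size_t)(addr - region) / pagesize;
        dirty[page] = 1;
        mprotect(region + page * pagesize, pagesize, PROT_READ | PROT_WRITE);
    }

    /* Start a new checkpoint interval: clear the dirty set and
     * write-protect everything so each page faults once on first write. */
    static void begin_interval(void) {
        memset(dirty, 0, sizeof dirty);
        mprotect(region, NPAGES * pagesize, PROT_READ);
    }

    int main(void) {
        pagesize = (size_t)sysconf(_SC_PAGESIZE);
        region = mmap(NULL, NPAGES * pagesize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = {0};
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        begin_interval();
        region[0] = 1;             /* application dirties pages 0 and 5 */
        region[5 * pagesize] = 2;

        for (size_t p = 0; p < NPAGES; p++)
            if (dirty[p])
                printf("incremental checkpoint saves page %zu\n", p);
        return 0;
    }
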
Application-specific APIs will enable novel fault tolerance solutions like those used in ddcMD
• Application semantics improve performance (a hypothetical API is sketched below)
• Programmers can identify:
  • Data that doesn't need to be saved
  • Types of data structures → key for high-performance compression
  • Matrix relationships → recomputation vs. storage
  • Fault detection algorithms
    • Critical for soft errors
    • Ex: ddcMD corrects cache errors on BG/L

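As a concrete but entirely hypothetical illustration of such an API, the sketch below lets an application label each data structure for a checkpointing runtime and register a soft-error detector. Every name in it (ckpt_register, the hint enum, ckpt_register_detector) is invented for this sketch and does not come from the project.

    #include <stdio.h>
    #include <stddef.h>

    /* Hints the application attaches to each registered region. */
    typedef enum {
        CKPT_SKIP,         /* scratch data: never saved               */
        CKPT_FLOAT_ARRAY,  /* use floating point-specific compression */
        CKPT_RECOMPUTE,    /* derived data: rebuild on restart        */
        CKPT_RAW           /* default: save verbatim                  */
    } ckpt_hint_t;

    /* Stub runtime: just records what the application declared. */
    static void ckpt_register(const char *name, void *buf, size_t bytes,
                              ckpt_hint_t hint) {
        static const char *how[] = {"skip", "fp-compress", "recompute", "raw"};
        (void)buf;
        printf("registered %-10s (%7zu bytes): %s\n", name, bytes, how[hint]);
    }

    /* Application-supplied soft-error detector (cf. ddcMD's cache-error
     * correction); a real runtime would call this periodically. */
    static int (*detector)(void);
    static void ckpt_register_detector(int (*is_state_valid)(void)) {
        detector = is_state_valid;
    }

    static int check_state(void) { return 1; }  /* stand-in detector */

    int main(void) {
        static double field[4096], scratch[4096], stiffness[4096];
        ckpt_register("field",     field,     sizeof field,     CKPT_FLOAT_ARRAY);
        ckpt_register("scratch",   scratch,   sizeof scratch,   CKPT_SKIP);
        ckpt_register("stiffness", stiffness, sizeof stiffness, CKPT_RECOMPUTE);
        ckpt_register_detector(check_state);
        return detector() ? 0 : 1;
    }
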
Our project will create a paradigm shift in LLNL application reliability
• Current LLNL practice: users write their own checkpointing code
  • Wastes programmer time
  • Checkpointing at global barriers is unscalable
• Current automated solutions do not scale
  • Very large checkpoints
  • No information about the application
• This project will:
  • Match I/O demands to I/O capacity
  • Minimize programmer effort
  • Scale checkpointing to petascale systems
  • Enable application-specific fault tolerance solutions

Fault tolerance is critical for Sequoia and all future platforms
• CAR S&T Strategy 1.1: “Perform the research to develop new algorithms that can best exploit likely HPC hardware characteristics, including … fault-tolerant algorithms that can withstand processor failure”
• Project enables application fault tolerance
  • Target audience: application developers
  • pf3d uses Adam Moody’s in-memory checkpointer
  • ddcMD implements complex error tolerance schemes
• Deliverables:
  • Efficient application checkpointing software for upcoming petascale systems (e.g. Sequoia)
  • High-performance I/O system designs for future petascale systems
  • Application-specific fault tolerance APIs