On the Feasibility of Incremental Checkpointing for Scientific Computing Jose Carlos Sancho jcsancho@lanl.gov with Fabrizio Petrini, Greg Johnson, Juan Fernandez and Eitan Frachtenberg Performance and Architectures Lab (PAL)
Talk Overview • Goal • Fault-tolerance for Scientific Computing • Methodology • Characterization of Scientific Applications • Performance Evaluation of Incremental Checkpointing • Concluding Remarks
Goal Demonstrate the feasibility of incremental checkpointing • Frequent • Automatic • User-transparent • No changes to the application • No special hardware support
Large-Scale Computers • 133,120 processors • 608,256 DRAMs • Large component count and strongly coupled hardware drive up the failure rate
Scientific Computing • Runs for months • Demands high capability • Failures are therefore expected during the application's execution
Providing Fault Tolerance • Hardware replication: a high-cost solution! • Checkpointing and rollback recovery: periodically checkpoint the process state and recover on a spare node after a failure
Checkpointing and Recovery Saving the process state • Simple and easy to implement • Cost-effective: no additional hardware support • Critical aspect: bandwidth requirements
Reducing Bandwidth • Incremental checkpointing • Only the memory modified since the previous checkpoint is saved to stable storage, a subset of the full process state
New Challenges More bandwidth pressure • Frequent checkpoints: minimizing the rollback interval to increase system availability • Automatic and user-transparent • Autonomic computing: a new vision for managing the high complexity of large systems • Self-healing and self-repairing
Survey of Implementation Levels
• Application: CLIP, Dome, CCITF
• Run-time library: Ickp, CoCheck, Diskless
• Operating system: just a few!
• Hardware: ReVive, SafetyNet
Enabling Automatic Checkpointing
• Moving down from the application level through the run-time library and operating system to hardware, the amount of checkpoint data grows (low → high) while the required user intervention shrinks (high → low)
• Operating-system-level checkpointing is automatic
The Bandwidth Challenge Does current technology provide enough bandwidth for frequent, automatic checkpointing?
Methodology • Analyzing the memory footprint of scientific codes: text, static data, heap, mmap regions, and stack • A run-time library brackets the application's memory footprint with mprotect() calls to detect modified pages (see the sketch below)
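The page-level tracking the run-time library performs can be pictured with POSIX mprotect(): write-protect the footprint, catch the first write to each page, and record that page as dirty. Below is a minimal sketch, assuming Linux; the names (region_base, dirty[], take_incremental_checkpoint) are illustrative and not the actual PAL library API, error handling is omitted, and calling mprotect() inside a signal handler is accepted on Linux although not strictly async-signal-safe.

```c
/* Sketch of page-level dirty tracking via mprotect(), assuming Linux. */
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 1024
static char *region_base;            /* start of the tracked region */
static long  page_size;
static unsigned char dirty[NPAGES];  /* one flag per tracked page   */

/* The first write to a protected page raises SIGSEGV: mark the page
 * dirty and re-enable writes so the application can continue. */
static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    size_t page = ((uintptr_t)si->si_addr - (uintptr_t)region_base) / page_size;
    dirty[page] = 1;                 /* bounds check omitted in sketch */
    mprotect(region_base + page * page_size, page_size,
             PROT_READ | PROT_WRITE);
}

/* At each checkpoint, save only the dirty pages, clear the flags, and
 * re-protect the whole region for the next interval. */
static void take_incremental_checkpoint(void) {
    for (size_t i = 0; i < NPAGES; i++) {
        if (dirty[i]) {
            /* ... write page i to stable storage here ... */
            dirty[i] = 0;
        }
    }
    mprotect(region_base, NPAGES * page_size, PROT_READ);
}

int main(void) {
    page_size   = sysconf(_SC_PAGESIZE);
    region_base = mmap(NULL, NPAGES * page_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_flags     = SA_SIGINFO;
    sa.sa_sigaction = on_write_fault;
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(region_base, NPAGES * page_size, PROT_READ);
    region_base[0] = 42;              /* faults once, marks page 0 dirty */
    take_incremental_checkpoint();    /* saves only page 0               */
    return 0;
}
```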
Methodology • Quantifying the bandwidth requirements • Checkpoint intervals: 1 s to 20 s • Comparing with the currently available bandwidth: 900 MB/s sustained network bandwidth (Quadrics QsNet II) and 75 MB/s sustained bandwidth of a single disk (Ultra SCSI controller)
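The comparison amounts to dividing the data modified per interval by the interval length and checking it against the sustained rates above. A back-of-the-envelope sketch, using the 78.8 MB/s figure reported later in the talk for Sage-1000MB at a 1 s timeslice (the helper function is illustrative, not from the study):

```c
/* Required vs. available checkpoint bandwidth, back of the envelope. */
#include <stdio.h>

static const char *fits(double required, double available) {
    return required <= available ? "yes" : "no";
}

int main(void) {
    double modified_mb = 78.8;  /* MB dirtied during one interval (Sage-1000MB) */
    double interval_s  = 1.0;   /* checkpoint timeslice                         */
    double required    = modified_mb / interval_s;

    printf("required bandwidth:       %.1f MB/s\n", required);
    printf("fits QsNet II (900 MB/s): %s\n", fits(required, 900.0));
    printf("fits one disk (75 MB/s):  %s\n", fits(required, 75.0));
    return 0;
}
```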
Experimental Environment • 32-node Linux Cluster • 64 Itanium II processors • PCI-X I/O bus • Quadrics QsNet interconnection network • Parallel Scientific Codes • Sage • Sweep3D • NAS parallel benchmarks: SP, LU, BT and FT Representative of the ASCI production codes at LANL
Memory Footprint [Figure: memory footprint of each application and input; increasing memory footprint across configurations]
Talk Overview • Goal • Fault-tolerance for Scientific Computing • Methodology • Characterization of Scientific Applications • Performance Evaluation of Incremental Checkpointing • Bandwidth • Scalability
Characterization [Figure: Sage-1000MB memory-modification trace showing a data-initialization phase followed by regular processing bursts]
Communication [Figure: Sage-1000MB trace with regular communication bursts interleaved with computation]
Fraction of the Memory Footprint Overwritten during the Main Iteration [Figure: per-application overwritten fraction, which stays below the full memory footprint]
Bandwidth Requirements [Figure: Sage-1000MB bandwidth (MB/s) vs. timeslice (s); the required bandwidth decreases with the timeslice, from 78.8 MB/s down to 12.1 MB/s, because pages rewritten within a longer interval are saved only once]
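A quick worked example makes the amortization concrete, assuming 78.8 MB/s is the 1 s data point and 12.1 MB/s the 20 s end of the range studied:

```c
/* Why average bandwidth falls with longer timeslices (Sage-1000MB). */
#include <stdio.h>

int main(void) {
    double mb_per_ckpt_1s  = 78.8 * 1.0;   /* 78.8 MB saved per 1 s checkpoint */
    double mb_per_ckpt_20s = 12.1 * 20.0;  /* 242 MB saved per 20 s checkpoint */

    /* 242 MB is far below 20 * 78.8 = 1576 MB: most pages touched in a
     * 20 s window were already dirtied earlier in that same window, so
     * each is written to stable storage only once per checkpoint. */
    printf("1 s interval:  %.1f MB per checkpoint\n", mb_per_ckpt_1s);
    printf("20 s interval: %.1f MB per checkpoint\n", mb_per_ckpt_20s);
    return 0;
}
```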
Bandwidth Requirements for 1 Second [Figure: per-application bandwidth at the most demanding 1 s timeslice; requirements increase with the memory footprint, approaching single-SCSI-disk performance]
Increasing Memory Footprint Size [Figure: average bandwidth (MB/s) vs. timeslice (s); the required bandwidth increases sublinearly with footprint size]
Increasing Processor Count [Figure: average bandwidth (MB/s) vs. timeslice (s) under weak scaling; bandwidth decreases slightly with processor count]
Technological Trends [Figure: performance improvement per year; network and storage bandwidth increases at a faster pace, while the performance of applications is bounded by memory improvements]
Conclusions • Commodity cluster components impose no technological limitations on implementing automatic, frequent, and user-transparent incremental checkpointing • Current hardware technology can sustain the bandwidth requirements • These results can be generalized to future large-scale computers
Conclusions • Per-process bandwidth decreases slightly with processor count and increases sublinearly with the memory footprint size • Improvements in networking and storage will make incremental checkpointing even more effective in the future