On the Feasibility of Incremental Checkpointing for Scientific Computing Jose Carlos Sancho jcsancho@lanl.gov with Fabrizio Petrini, Greg Johnson, Juan Fernandez and Eitan Frachtenberg Performance and Architectures Lab (PAL)
Talk Overview • Goal • Fault-tolerance for Scientific Computing • Methodology • Characterization of Scientific Applications • Performance Evaluation of Incremental Checkpointing • Concluding Remarks
Goal Demonstrate the feasibility of incremental checkpointing • Frequent • Automatic • User-transparent • No changes to the application • No special hardware support
Large-Scale Computers • 133,120 processors • 608,256 DRAMs • Large component count and strongly coupled hardware drive up the failure rate
Scientific Computing • Runs for months • Demands high capability • Failures are therefore expected during the application's execution
Providing Fault Tolerance • Hardware replication: a high-cost solution! • Checkpointing and rollback recovery: periodically checkpoint the process state and recover on a spare node after a failure
Checkpointing and Recovery Saving the process state • Simple and easy to implement • Cost-effective: no additional hardware support • Critical aspect: bandwidth requirements
Reducing Bandwidth • Incremental checkpointing • Only the memory modified since the previous checkpoint is saved to stable storage, a subset of the full process state
New Challenges More bandwidth pressure • Frequent checkpoints: minimizing the rollback interval to increase system availability • Automatic and user-transparent • Autonomic computing: a new vision for managing the high complexity of large systems • Self-healing and self-repairing
Survey of Implementation Levels
• Application: CLIP, Dome, CCITF
• Run-time library: Ickp, CoCheck, Diskless
• Operating system: just a few!
• Hardware: ReVive, SafetyNet
Enabling Automatic Checkpointing
• Moving down from the application level through the run-time library and operating system to hardware, the amount of checkpoint data grows (low → high) while the required user intervention shrinks (high → low)
• Operating-system-level checkpointing is automatic
The Bandwidth Challenge Does current technology provide enough bandwidth for frequent, automatic checkpointing?
Methodology • Analyzing the memory footprint of scientific codes: text, static data, heap, mmap regions, and stack • A run-time library brackets the application's memory footprint with mprotect() calls to detect modified pages (see the sketch below)
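The page-level tracking the run-time library performs can be pictured with POSIX mprotect(): write-protect the footprint, catch the first write to each page, and record that page as dirty. Below is a minimal sketch, assuming Linux; the names (region_base, dirty[], take_incremental_checkpoint) are illustrative and not the actual PAL library API, error handling is omitted, and calling mprotect() inside a signal handler is accepted on Linux although not strictly async-signal-safe.

```c
/* Sketch of page-level dirty tracking via mprotect(), assuming Linux. */
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 1024
static char *region_base;            /* start of the tracked region */
static long  page_size;
static unsigned char dirty[NPAGES];  /* one flag per tracked page   */

/* The first write to a protected page raises SIGSEGV: mark the page
 * dirty and re-enable writes so the application can continue. */
static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    size_t page = ((uintptr_t)si->si_addr - (uintptr_t)region_base) / page_size;
    dirty[page] = 1;                 /* bounds check omitted in sketch */
    mprotect(region_base + page * page_size, page_size,
             PROT_READ | PROT_WRITE);
}

/* At each checkpoint, save only the dirty pages, clear the flags, and
 * re-protect the whole region for the next interval. */
static void take_incremental_checkpoint(void) {
    for (size_t i = 0; i < NPAGES; i++) {
        if (dirty[i]) {
            /* ... write page i to stable storage here ... */
            dirty[i] = 0;
        }
    }
    mprotect(region_base, NPAGES * page_size, PROT_READ);
}

int main(void) {
    page_size   = sysconf(_SC_PAGESIZE);
    region_base = mmap(NULL, NPAGES * page_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_flags     = SA_SIGINFO;
    sa.sa_sigaction = on_write_fault;
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(region_base, NPAGES * page_size, PROT_READ);
    region_base[0] = 42;              /* faults once, marks page 0 dirty */
    take_incremental_checkpoint();    /* saves only page 0               */
    return 0;
}
```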
Methodology • Quantifying the bandwidth requirements • Checkpoint intervals: 1 s to 20 s • Comparing with the currently available bandwidth: 900 MB/s sustained network bandwidth (Quadrics QsNet II) and 75 MB/s sustained bandwidth of a single disk (Ultra SCSI controller)
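The comparison amounts to dividing the data modified per interval by the interval length and checking it against the sustained rates above. A back-of-the-envelope sketch, using the 78.8 MB/s figure reported later in the talk for Sage-1000MB at a 1 s timeslice (the helper function is illustrative, not from the study):

```c
/* Required vs. available checkpoint bandwidth, back of the envelope. */
#include <stdio.h>

static const char *fits(double required, double available) {
    return required <= available ? "yes" : "no";
}

int main(void) {
    double modified_mb = 78.8;  /* MB dirtied during one interval (Sage-1000MB) */
    double interval_s  = 1.0;   /* checkpoint timeslice                         */
    double required    = modified_mb / interval_s;

    printf("required bandwidth:       %.1f MB/s\n", required);
    printf("fits QsNet II (900 MB/s): %s\n", fits(required, 900.0));
    printf("fits one disk (75 MB/s):  %s\n", fits(required, 75.0));
    return 0;
}
```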
Experimental Environment • 32-node Linux Cluster • 64 Itanium II processors • PCI-X I/O bus • Quadrics QsNet interconnection network • Parallel Scientific Codes • Sage • Sweep3D • NAS parallel benchmarks: SP, LU, BT and FT Representative of the ASCI production codes at LANL
Memory Footprint [Figure: memory footprint of each application and input; increasing memory footprint across configurations]
Talk Overview • Goal • Fault-tolerance for Scientific Computing • Methodology • Characterization of Scientific Applications • Performance Evaluation of Incremental Checkpointing • Bandwidth • Scalability
Characterization [Figure: Sage-1000MB memory-modification trace showing a data-initialization phase followed by regular processing bursts]
Communication [Figure: Sage-1000MB trace with regular communication bursts interleaved with computation]
Fraction of the Memory Footprint Overwritten during the Main Iteration [Figure: per-application overwritten fraction, which stays below the full memory footprint]
Bandwidth Requirements [Figure: Sage-1000MB bandwidth (MB/s) vs. timeslice (s); the required bandwidth decreases with the timeslice, from 78.8 MB/s down to 12.1 MB/s, because pages rewritten within a longer interval are saved only once]
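A quick worked example makes the amortization concrete, assuming 78.8 MB/s is the 1 s data point and 12.1 MB/s the 20 s end of the range studied:

```c
/* Why average bandwidth falls with longer timeslices (Sage-1000MB). */
#include <stdio.h>

int main(void) {
    double mb_per_ckpt_1s  = 78.8 * 1.0;   /* 78.8 MB saved per 1 s checkpoint */
    double mb_per_ckpt_20s = 12.1 * 20.0;  /* 242 MB saved per 20 s checkpoint */

    /* 242 MB is far below 20 * 78.8 = 1576 MB: most pages touched in a
     * 20 s window were already dirtied earlier in that same window, so
     * each is written to stable storage only once per checkpoint. */
    printf("1 s interval:  %.1f MB per checkpoint\n", mb_per_ckpt_1s);
    printf("20 s interval: %.1f MB per checkpoint\n", mb_per_ckpt_20s);
    return 0;
}
```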
Bandwidth Requirements for 1 Second [Figure: per-application bandwidth at the most demanding 1 s timeslice; requirements increase with the memory footprint, approaching single-SCSI-disk performance]
Increasing Memory Footprint Size [Figure: average bandwidth (MB/s) vs. timeslice (s); the required bandwidth increases sublinearly with footprint size]
Increasing Processor Count [Figure: average bandwidth (MB/s) vs. timeslice (s) under weak scaling; bandwidth decreases slightly with processor count]
Technological Trends [Figure: performance improvement per year; network and storage bandwidth increases at a faster pace, while the performance of applications is bounded by memory improvements]
Conclusions • Commodity cluster components impose no technological limitations on implementing automatic, frequent, and user-transparent incremental checkpointing • Current hardware technology can sustain the bandwidth requirements • These results can be generalized to future large-scale computers
Conclusions • Per-process bandwidth decreases slightly with processor count and increases sublinearly with the memory footprint size • Improvements in networking and storage will make incremental checkpointing even more effective in the future