
On the Feasibility of Incremental Checkpointing for Scientific Computing

Explore the feasibility of incremental checkpointing for fault-tolerance in scientific computing, evaluating performance and bandwidth requirements. Learn about methodology, challenges, and scalability of this innovative approach.




Presentation Transcript


  1. On the Feasibility of Incremental Checkpointing for Scientific Computing Jose Carlos Sancho jcsancho@lanl.gov with Fabrizio Petrini, Greg Johnson, Juan Fernandez and Eitan Frachtenberg Performance and Architectures Lab (PAL), Los Alamos National Laboratory

  2. Talk Overview • Goal • Fault-tolerance for Scientific Computing • Methodology • Characterization of Scientific Applications • Performance Evaluation of Incremental Checkpointing • Concluding Remarks

  3. Goal Prove the Feasibility of Incremental Checkpointing • Frequent • Automatic • User-transparent • No changes to application • No special hardware support

  4. Large Scale Computers • 133,120 processors and 608,256 DRAM chips • Large component count and strongly coupled hardware drive up the failure rate

  5. Scientific Computing • Applications run for months • Demand high capability • Failures are expected during the application’s execution

  6. Providing Fault-tolerance • Hardware replication: a high-cost solution! • Checkpointing and rollback recovery: checkpoint the application so a spare node can take over during recovery

  7. Checkpointing and Recovery: saving the process state • Simplicity: easy implementation • Cost-effective: no additional hardware support • Critical aspect: bandwidth requirements

  8. Reducing Bandwidth • Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage, instead of the full process state (a storage sketch follows)
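
  A minimal sketch of how such increments could be stored and replayed. The record layout, the PAGE constant, and the function names below are illustrative assumptions, not the authors' implementation; recovery restores the last full checkpoint and then applies each incremental file in order.

      /* Hypothetical on-disk format: one (page index, page contents)
       * record per modified page. Assumed layout, for illustration. */
      #include <stdio.h>
      #include <string.h>

      #define PAGE 4096

      struct page_record { size_t index; unsigned char data[PAGE]; };

      /* Append one modified page to the current incremental checkpoint. */
      void save_page(FILE *ckpt, unsigned char *mem, size_t index)
      {
          struct page_record rec = { .index = index };
          memcpy(rec.data, mem + index * PAGE, PAGE);
          fwrite(&rec, sizeof rec, 1, ckpt);
      }

      /* Replay one incremental checkpoint on top of already-restored
       * memory, page by page. */
      void apply_increment(FILE *ckpt, unsigned char *mem)
      {
          struct page_record rec;
          while (fread(&rec, sizeof rec, 1, ckpt) == 1)
              memcpy(mem + rec.index * PAGE, rec.data, PAGE);
      }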

  9. New Challenges: more bandwidth pressure • Frequent checkpoints: minimizing the rollback interval to increase system availability • Automatic and user-transparent: autonomic computing, a new vision for managing the high complexity of large systems through self-healing and self-repairing

  10. Survey of Implementation Levels • Application: CLIP, Dome, CCITF • Run-time library: Ickp, CoCheck, Diskless • Operating system: just a few! • Hardware: ReVive, SafetyNet

  11. Enabling Automatic Checkpointing • User intervention: high at the application level, shrinking to fully automatic at the operating-system and hardware levels • Checkpoint data: low at the application level, growing to high at the hardware level

  12. The Bandwidth Challenge • Does current technology provide enough bandwidth for frequent, automatic checkpointing?

  13. Methodology • Analyzing the memory footprint of scientific codes: text, static data, heap, mmap, and stack segments • A run-time library uses mprotect() to track writes to the application’s memory footprint (sketched below)
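
  A minimal user-level sketch of this mprotect()-based technique: write-protect the footprint at the start of each interval, let the first write to each page raise SIGSEGV, mark the page dirty in the handler, and save only the dirty pages at checkpoint time. The region setup, bookkeeping, and output format are simplified assumptions, not the paper's actual library (Linux/glibc).

      #include <signal.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      static uint8_t *region;      /* tracked memory region */
      static size_t   region_len;
      static long     page_size;
      static uint8_t *dirty;       /* one flag per page */

      /* First write to a protected page lands here: mark the page
       * dirty and re-enable writes so the application continues. */
      static void on_segv(int sig, siginfo_t *si, void *ctx)
      {
          (void)sig; (void)ctx;
          uintptr_t addr = (uintptr_t)si->si_addr;
          uintptr_t base = (uintptr_t)region;
          if (addr < base || addr >= base + region_len)
              abort();             /* a real fault, not ours */
          size_t page = (addr - base) / (size_t)page_size;
          dirty[page] = 1;
          mprotect(region + page * page_size, page_size,
                   PROT_READ | PROT_WRITE);
      }

      /* Start a checkpoint interval: clear the dirty flags and
       * write-protect the region so the next writes fault. */
      static void start_interval(void)
      {
          memset(dirty, 0, region_len / page_size);
          mprotect(region, region_len, PROT_READ);
      }

      /* Incremental checkpoint: save only the pages written since
       * the current interval began. */
      static void checkpoint(FILE *out)
      {
          for (size_t p = 0; p < region_len / page_size; p++)
              if (dirty[p])
                  fwrite(region + p * page_size, page_size, 1, out);
      }

      int main(void)
      {
          page_size  = sysconf(_SC_PAGESIZE);
          region_len = 64 * (size_t)page_size;
          region = mmap(NULL, region_len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          dirty = calloc(region_len / page_size, 1);

          struct sigaction sa = {0};
          sa.sa_sigaction = on_segv;
          sa.sa_flags = SA_SIGINFO;
          sigaction(SIGSEGV, &sa, NULL);

          start_interval();
          region[3 * page_size] = 42;   /* dirties exactly one page */

          FILE *out = fopen("ckpt.bin", "wb");
          checkpoint(out);              /* writes 1 page, not all 64 */
          fclose(out);
          return 0;
      }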

  14. Methodology • Quantifying the bandwidth requirements • Checkpoint intervals: 1 s to 20 s • Comparing with the bandwidth available today: 900 MB/s sustained network bandwidth (Quadrics QsNet II) vs. 75 MB/s sustained single-disk bandwidth (Ultra SCSI controller); a worked check follows
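
  A back-of-envelope check in the spirit of this comparison: the required bandwidth is the data modified per interval divided by the interval length, set against the two sustained rates quoted above. The dirty-data figure below is taken from the Sage-1000MB measurements later in the deck; treating it as fixed is a simplifying assumption.

      #include <stdio.h>

      int main(void)
      {
          double dirty_mb = 78.8;  /* MB modified per interval (Sage-1000MB, 1 s) */
          double interval = 1.0;   /* checkpoint interval in seconds */
          double need     = dirty_mb / interval;

          printf("required: %.1f MB/s, network (900 MB/s): %s, "
                 "single disk (75 MB/s): %s\n",
                 need,
                 need <= 900.0 ? "sufficient" : "exceeded",
                 need <= 75.0  ? "sufficient" : "exceeded");
          return 0;
      }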

  15. Experimental Environment • 32-node Linux Cluster • 64 Itanium II processors • PCI-X I/O bus • Quadrics QsNet interconnection network • Parallel Scientific Codes • Sage • Sweep3D • NAS parallel benchmarks: SP, LU, BT and FT Representative of the ASCI production codes at LANL

  16. Memory Footprint (chart: the applications’ memory footprints, increasing across the tested configurations)

  17. Talk overview • Goal • Fault-tolerance for scientific computing • Methodology • Characterization of scientific applications • Performance evaluation of Incremental Checkpointing • Bandwidth • Scalability

  18. Characterization (chart, Sage-1000MB: a data-initialization phase followed by regular processing bursts)

  19. Communication (chart, Sage-1000MB: regular communication bursts interleaved with computation)

  20. Fraction of the Memory Footprint Overwritten during the Main Iteration (chart: the fraction overwritten stays below the full memory footprint)

  21. Bandwidth Requirements (chart, Sage-1000MB: bandwidth in MB/s vs. timeslices in s; the requirement decreases with the timeslice, from 78.8 MB/s down to 12.1 MB/s)
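
  One plausible reading of why the curve falls, sketched under an assumed model: pages rewritten many times within one interval are saved only once, so the unique dirty set grows sublinearly with the interval length. The growth exponent below is illustrative, not measured data.

      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          double dirty_1s = 78.8;  /* MB of unique dirty pages in a 1 s slice (from the slide) */
          double exponent = 0.5;   /* assumed sublinear growth of the dirty set */

          for (int t = 1; t <= 20; t *= 2) {
              double unique_mb = dirty_1s * pow(t, exponent);
              printf("timeslice %2d s -> average %.1f MB/s\n",
                     t, unique_mb / t);
          }
          return 0;                /* compile with -lm */
      }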

  22. Bandwidth Requirements for a 1-second timeslice (chart: the most demanding case; requirements increase with the memory footprint, approaching single SCSI disk performance)

  23. Increasing Memory Footprint Size (chart: average bandwidth in MB/s vs. timeslices in s; the requirement increases sublinearly with the footprint)

  24. Increasing Processor Count (chart: average bandwidth in MB/s vs. timeslices in s, weak scaling; the requirement decreases slightly with the processor count)

  25. Technological Trends (chart: performance improvement per year; network and storage bandwidth increase at a faster pace, while the performance of applications is bounded by memory improvements)

  26. Conclusions • Commodity cluster components impose no technological limitation on implementing automatic, frequent, and user-transparent incremental checkpointing • Current hardware technology can sustain the bandwidth requirements • These results can be generalized to future large-scale computers

  27. Conclusions • The per-process bandwidth decreases slightly with the processor count and increases sublinearly with the memory footprint size • Improvements in networking and storage will make incremental checkpointing even more effective in the future
