Checkpoint I/O for Peta-scale Applications


Presentation Transcript


  1. Checkpoint I/O for Peta-scale Applications
  Xiaosong Ma (1,2), Sudharshan S. Vazhkudai (2)
  (1) Department of Computer Science, North Carolina State University
  (2) Computer Science and Mathematics Division, Oak Ridge National Laboratory
  Ma & Vazhkudai, Dagstuhl 2009

  2. Problem Space: Petascale Storage
  • Peta-scale system challenges for checkpointing
  • I/O bandwidth bottleneck issues
    • Tier 1 DOE applications desire 1 GB/s of I/O bandwidth for every TF of peak compute
    • A petaflop computer therefore needs ~1 TB/s of I/O bandwidth
    • Current Lustre peak I/O bandwidth for Jaguar (ORNL, #2 in the Top500): ~284 GB/s, and that is the best case!
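The gap behind those numbers is easy to make explicit; a trivial Python check using only the figures on the slide:

```python
# Sanity check of the slide's bandwidth rule of thumb:
# 1 GB/s of I/O bandwidth per TF of peak compute.
peak_tf = 1000                       # a 1 PF machine = 1000 TF
needed_gbs = peak_tf * 1.0           # -> 1000 GB/s = 1 TB/s
jaguar_best_case_gbs = 284           # Jaguar's Lustre peak, per the slide
print(f"gap: {needed_gbs / jaguar_best_case_gbs:.1f}x")   # ~3.5x short
```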

  3. More Problems…
  • File systems unable to cope with checkpoint requirements
    • Bursty write traffic
    • Files not necessarily meant to stay
  • Sub-optimal center performance
    • Apps unable to checkpoint at the desired frequency
    • Wasted scratch storage space, a significant fraction of the operations budget
    • Increased parallel file system workload -> more failures!
  • Need ways to
    • Improve application checkpoint performance
    • Reduce checkpoint size where possible

  4. Why Do We Need Application-level Checkpoints?
  • Given existing techniques such as
    • live process migration (plus the cost of restart)?
    • system-level checkpointing?
  • Complexity and portability
    • Running with spare processes adds complexity
    • Affects portability and software reliability
  • Large overlap between restart data and output data
  • Checkpoint data is useful beyond fault tolerance
    • Enabling new runs from a certain intermediate state
    • Incremental job submission

  5. Why Do We Need to Checkpoint to Disk?
  • A survey of Tier 1 apps (GTC, S3D, POP, Chimera) and job logs on Jaguar indicates that not all memory per core is used
    • A significant amount of residual memory per core
    • Checkpoint image sizes < 10% of total memory usage
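Those two observations compose into a quick feasibility argument for the memory-aggregation idea on the later slides; a back-of-the-envelope sketch, where every value is an illustrative assumption rather than a figure from the survey:

```python
# Back-of-the-envelope: do checkpoints fit in residual memory?
# All values below are illustrative assumptions, not survey data.
mem_per_core_gb = 2.0                    # installed memory per core
used_gb = 0.6 * mem_per_core_gb          # what the app actually touches
residual_gb = mem_per_core_gb - used_gb  # 0.8 GB left over per core
ckpt_gb = 0.10 * used_gb                 # image < 10% of memory usage
print(f"residual {residual_gb:.2f} GB vs. checkpoint {ckpt_gb:.2f} GB")
# The image is far smaller than the residual memory, which is what
# makes the memory-aggregation idea of the later slides attractive.
```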

  6. More Application Memory Footprint Data
  • Why not checkpoint to neighbors' memory?
    • Again, complexity
    • Interference with applications' performance optimizations
    • Checkpoint data is still needed on disk
  (Courtesy of Scott Klasky, ORNL)

  7. Additional Problems
  • Restart data treated the same as result data, despite
    • Different usage patterns
    • Different file persistence requirements
    • Different QoS requirements
  • No automatic configuration of output frequency
    • QoS specification (< x% of total time spent on I/O, at most y computation steps lost, …); see the sketch below
    • Adaptive to system stability
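The slide names the QoS knobs but not a policy for turning them into an output frequency. One plausible sketch combines the stated I/O-time and lost-work bounds with Young's classic interval approximation, which supplies the "adaptive to system stability" behavior; the function name and all parameter values are hypothetical:

```python
import math

def checkpoint_interval(ckpt_write_s, mtbf_s, max_io_fraction,
                        step_s, max_lost_steps):
    """Pick a checkpoint interval (seconds) honoring a QoS spec like
    the slide's: <= max_io_fraction of time in I/O, and at most
    max_lost_steps computation steps lost on a failure."""
    # Young's (1974) approximation balances checkpoint cost against
    # expected rework; it shrinks automatically as MTBF drops, which
    # is the "adaptive to system stability" knob.
    t_young = math.sqrt(2.0 * ckpt_write_s * mtbf_s)
    # I/O-budget constraint: write_time / interval <= max_io_fraction.
    t_io_floor = ckpt_write_s / max_io_fraction
    # Lost-work constraint caps the interval from above.
    t_loss_ceiling = max_lost_steps * step_s
    return min(max(t_young, t_io_floor), t_loss_ceiling)

# Illustrative numbers only: 2-minute writes, 24 h MTBF, 5% I/O budget.
print(checkpoint_interval(ckpt_write_s=120, mtbf_s=24 * 3600,
                          max_io_fraction=0.05, step_s=30,
                          max_lost_steps=200))   # ~4553 s
```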

  8. Opportunities
  • High aggregate memory and bisection bandwidth
    • ORNL Jaguar: 568 TB/s
  • Available residual memory
  • New storage hardware
    • Solid State Disks (SSDs)

  9. Aggregation at All Levels of the HEC Storage Hierarchy
  • A dedicated storage system, geared towards checkpointing
    • Aggregates memory resources from compute nodes
  • Can incorporate other levels of the HEC I/O hierarchy:
    • Node-local storage in mid-size clusters
    • Desktop grids (e.g., a Condor-like system)
    • Unused desktop storage in workstations
  • Potentially useful for system-level checkpointing as well

  10. stdchk Architecture [ICDCS 08]
  • Benefactors contribute memory or storage space
  • Manager
    • Aggregates contributions and maintains metadata
    • Provides the stripe map (benefactor-to-chunk mapping) to the checkpointing client
  • Write operations (see the sketch below)
    • Checkpoint images are split into chunks and written to benefactors
    • Small-scale testing: peak write throughput ~700 MB/s, sustained ~560 MB/s on a 10 Gb/s cluster interconnect
  • POSIX file system API
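A minimal sketch of the write path just described, with an in-memory stand-in for a benefactor; this is illustrative code for the striping idea, not the actual stdchk implementation, and the chunk size and class names are assumptions:

```python
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB chunks; the real chunk size is a tunable

class Benefactor:
    """In-memory stand-in for a node contributing memory/storage."""
    def __init__(self, name):
        self.name, self.chunks = name, {}

    def store(self, chunk_id, data):  # in a real system, an RPC
        self.chunks[chunk_id] = data

def write_checkpoint(image: bytes, benefactors):
    """Split a checkpoint image into chunks, stripe them round-robin
    over benefactors, and return a stripe map (chunk id -> benefactor,
    digest) of the kind the slide says the manager maintains."""
    stripe_map = {}
    for cid, off in enumerate(range(0, len(image), CHUNK_SIZE)):
        chunk = image[off:off + CHUNK_SIZE]
        target = benefactors[cid % len(benefactors)]
        target.store(cid, chunk)
        stripe_map[cid] = (target.name, hashlib.sha1(chunk).hexdigest())
    return stripe_map

nodes = [Benefactor(f"node{i}") for i in range(4)]
smap = write_checkpoint(b"x" * (3 * CHUNK_SIZE + 123), nodes)
print(len(smap), "chunks striped over", len(nodes), "benefactors")
```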

  11. Extended I/O Architecture
  [Slide diagram: simulation jobs on compute nodes reach the SAN over the interconnection network either (1) by traditional parallel I/O through the I/O nodes, or (2) by parallel I/O with multi-level data staging through a small number of dedicated staging nodes (main memory, SSDs, head nodes), followed by (3) a post-processing job on the user's local cluster (viz., data processing, analytics, etc.). Potential staging grounds provide remote mem/SSD access and I/O multiplexing & QoS control.]
  • The dedicated staging nodes enable:
    • In-job data multiplexing
    • Async I/O
    • In-situ visualization
    • Data cleansing/reduction
    • Data analytics
    • File format transition

  12. Optimizations
  • Reducing checkpoint size
    • Data compression, generic or domain-specific
    • Incremental checkpointing: detect similarity between successive checkpoint images (see the sketch below)
  • Coordination between multiple I/O operations
    • I/O scheduling
    • Storage virtualization
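Content hashing is one common way to realize the incremental-checkpointing bullet; the slide does not commit to a mechanism, so the chunk size, hash choice, and function name below are assumptions:

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB; the change-detection granularity is a tunable

def dirty_chunks(prev_digests, image: bytes):
    """Detect similarity between successive checkpoint images by
    hashing fixed-size chunks; only chunks whose digest changed since
    the previous checkpoint need to be written."""
    digests, changed = [], []
    for cid, off in enumerate(range(0, len(image), CHUNK)):
        d = hashlib.sha1(image[off:off + CHUNK]).hexdigest()
        digests.append(d)
        if cid >= len(prev_digests) or prev_digests[cid] != d:
            changed.append((cid, image[off:off + CHUNK]))  # must be written
    return digests, changed

base = bytearray(8 * CHUNK)
digests, _ = dirty_chunks([], bytes(base))   # first image: all chunks dirty
base[3 * CHUNK] ^= 0xFF                      # mutate one byte in chunk 3
digests, changed = dirty_chunks(digests, bytes(base))
print([cid for cid, _ in changed])           # -> [3]: one chunk to rewrite
```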

  13. Summary
  • Checkpointing will continue to be an important HPC I/O component
  • However, the existing HEC I/O stack was not initially created for parallel applications
  • I/O scalability problems make checkpointing more bottleneck-prone
  • Ample software/hardware opportunities exist
  • Making scientists' jobs easier is key

  14. More Optimizations (work in progress)
  • Draining to stable storage
    • Application writes proceed as fast as the staging area's ability to mask this operation
  • Pruning of checkpoint files (see the sketch below)
    • Purge images from the previous interval once the current image is stored safely
    • The file system is unable to perform such optimizations on its own
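A minimal sketch of the pruning rule, assuming a simple one-file-per-interval layout; the layout, naming, and keep count are illustrative choices, and fsync-before-rename stands in for "stored safely":

```python
import os, tempfile

def store_and_prune(image: bytes, step: int, scratch: str, keep: int = 1):
    """Write the current checkpoint image durably, then purge images
    from previous intervals -- the policy on the slide that a generic
    file system cannot apply on its own."""
    path = os.path.join(scratch, f"ckpt_{step:08d}.img")
    tmp = path + ".part"
    with open(tmp, "wb") as f:
        f.write(image)
        f.flush()
        os.fsync(f.fileno())      # "stored safely" = durable on disk
    os.rename(tmp, path)          # atomically publish the new image
    images = sorted(p for p in os.listdir(scratch) if p.endswith(".img"))
    for old in images[:-keep]:    # purge all but the newest `keep` images
        os.remove(os.path.join(scratch, old))

scratch = tempfile.mkdtemp()
for step in (100, 200, 300):
    store_and_prune(b"state" * 1024, step, scratch)
print(os.listdir(scratch))        # -> only ckpt_00000300.img remains
```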
