Checkpoint I/O for Peta-scale Applications • Xiaosong Ma 1,2 • Sudharshan S. Vazhkudai 2 • 1. Department of Computer Science, North Carolina State University • 2. Computer Science and Mathematics Division, Oak Ridge National Laboratory • Ma & Vazhkudai, Dagstuhl 2009
Problem Space: Petascale Storage • Petascale system challenges for checkpointing • I/O bandwidth is the bottleneck • Tier 1 DOE applications want 1 GB/s of I/O bandwidth for every TF of peak compute • A petaflop machine therefore needs ~1 TB/s of I/O bandwidth • Current Lustre peak I/O bandwidth on Jaguar (ORNL, #2 in the Top500): ~284 GB/s • And that is the best case!
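For context, the gap implied by these numbers can be worked out directly (a back-of-the-envelope calculation using only the figures above):

\[
B_{\mathrm{required}} = 1\,\mathrm{PF} \times \frac{1\,\mathrm{GB/s}}{1\,\mathrm{TF}} = 1000\,\mathrm{GB/s} = 1\,\mathrm{TB/s},
\qquad
\frac{B_{\mathrm{required}}}{B_{\mathrm{Jaguar}}} \approx \frac{1000}{284} \approx 3.5
\]

Even at its best-case peak, Jaguar's Lustre delivers less than a third of the bandwidth the 1 GB/s-per-TF guideline calls for.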
More Problems… • File systems are unable to cope with checkpoint requirements • Checkpoint I/O is bursty • Checkpoint files are not necessarily meant to stay • Sub-optimal center performance • Apps unable to checkpoint at the desired frequency • Wasted scratch storage space, a significant fraction of the operations budget • Increased parallel file system workload -> more failures! • Need ways to • Improve application checkpoint performance • Reduce checkpoint size where possible
Why Do We Need Application-level Checkpoints? • Why not rely on existing techniques such as • Live process migration (also factoring in the cost of restart)? • System-level checkpointing? • Complexity and portability • Running with spare processes adds complexity • Affects portability and software reliability • Large overlap between restart data and output data • Checkpoint data are useful beyond fault tolerance • Enable a new run starting from an intermediate state • Incremental job submission
Why Do We Need to Checkpoint to Disk? • A survey of Tier 1 apps (GTC, S3D, POP, Chimera) and job logs on Jaguar indicates that not all memory per core is used • Significant amount of residual memory per core • Checkpoint image sizes are < 10% of total memory usage
More Application Memory Footprint Data • Why not checkpoint to neighbors’ memory? • Again, complexity • Interference with the application’s own performance optimizations • Checkpoint data are still needed on disk (Courtesy of Scott Klasky, ORNL)
Additional Problems • Restart data treated the same as result data • Different usage pattern • Different file persistence requirement • Different QoS requirement • No automatic configuration of output frequency • QoS specification (< x% of total time spent on I/O, at most y computation steps lost, …), as sketched below • Adaptation to system stability
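To make the QoS-driven configuration concrete, below is a minimal sketch of how an automatic configurator might derive a checkpoint interval from such a specification. It is not from the talk: the function name, its parameters, and the use of Young's approximation (with the MTBF re-estimated online to track system stability) are illustrative assumptions.

import math

def choose_checkpoint_interval(ckpt_write_secs, mtbf_secs,
                               max_io_fraction, max_steps_lost, step_secs):
    """Pick a checkpoint interval (seconds) satisfying a simple QoS spec.

    ckpt_write_secs -- time to write one checkpoint image (C)
    mtbf_secs       -- observed mean time between failures (M); re-estimating
                       it online makes the interval adapt to system stability
    max_io_fraction -- "< x% of total time spent on I/O", e.g. 0.05
    max_steps_lost  -- "at most y computation steps lost on a failure"
    step_secs       -- wall-clock time of one computation step
    """
    # Young's approximation minimizes expected overhead: tau = sqrt(2*C*M).
    tau = math.sqrt(2.0 * ckpt_write_secs * mtbf_secs)
    # QoS bound 1: I/O fraction C/(tau + C) <= x  =>  tau >= C*(1 - x)/x.
    tau = max(tau, ckpt_write_secs * (1.0 - max_io_fraction) / max_io_fraction)
    # QoS bound 2: losing at most y steps  =>  tau <= y * step_secs.
    return min(tau, max_steps_lost * step_secs)

# Example: 5-minute checkpoint writes, 24 h MTBF, < 5% I/O time,
# at most 200 lost steps of 30 s each -> 6000 s interval.
print(choose_checkpoint_interval(300, 24 * 3600, 0.05, 200, 30))

In this example the loss bound is the binding constraint; if the system becomes less stable (MTBF drops), Young's term shrinks and the interval tightens automatically.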
Opportunities • High aggregate memory and bisection bandwidth • ORNL Jaguar: 568 TB/s • Available residual memory • New storage hardware • Solid State Disks (SSDs)
Aggregation at All Levels of the HEC Storage Hierarchy • A dedicated storage system geared towards checkpointing • Aggregates memory resources from compute nodes • Can incorporate other levels of the HEC I/O hierarchy: • Node-local storage in mid-size clusters • Desktop grids (e.g., a Condor-like system) • Unused desktop storage in workstations • Potentially useful for system-level checkpointing as well
stdchk Architecture [ICDCS 08] • Benefactors contribute memory or storage space • Manager • Aggregates contributions and maintains metadata • Provides the stripe map (benefactor-to-chunk mapping) to the checkpointing client • Write operations (see the sketch below) • Checkpoint images are split into chunks and written to benefactors • Small-scale testing: peak write throughput ~700 MB/s, sustained ~560 MB/s over a 10 Gb/s cluster interconnect • POSIX file system API
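As a rough illustration of this write path, the sketch below splits a checkpoint image into chunks and stripes them round-robin over the benefactors named in a manager-supplied stripe map. It is only a sketch, not the stdchk code: the 1 MiB chunk size, the send_chunk callable, and the returned chunk-location dictionary are assumptions standing in for the real transfer and metadata protocol.

CHUNK_SIZE = 1 << 20  # 1 MiB chunks (illustrative value)

def write_checkpoint(image_path, stripe_map, send_chunk):
    """Split a checkpoint image into chunks and stripe them over benefactors.

    stripe_map -- list of benefactor addresses obtained from the manager;
                  chunk i is sent to stripe_map[i % len(stripe_map)]
    send_chunk -- callable(benefactor, chunk_id, data) that ships one chunk
                  to a benefactor; stands in for the real transfer layer
    """
    chunk_locations = {}  # chunk_id -> benefactor, reported back to the manager
    with open(image_path, "rb") as img:
        chunk_id = 0
        while True:
            data = img.read(CHUNK_SIZE)
            if not data:
                break
            benefactor = stripe_map[chunk_id % len(stripe_map)]
            send_chunk(benefactor, chunk_id, data)
            chunk_locations[chunk_id] = benefactor
            chunk_id += 1
    return chunk_locations  # becomes the file's stripe metadata

Striping over many benefactors is what lets the client aggregate their memory and network bandwidth instead of funnelling the whole image through one server.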
Extended I/O Architecture • [Diagram: two panels contrasting traditional parallel I/O (the simulation job on compute nodes writing across the interconnection network through I/O nodes to the SAN) with parallel I/O using multi-level data staging; the staging path adds (1) creating the staging ground, (2) remote memory/SSD access with I/O multiplexing & QoS control, and (3) a post-processing job on the user’s local cluster (viz., data processing, analytics, etc.); potential staging grounds include main memory and SSDs on head/staging nodes] • Staging nodes: a small number of dedicated nodes for data processing • In-job data multiplexing • Async I/O • In-situ visualization • Data cleansing/reduction • Data analytics • File format transition
Optimizations • Reducing checkpoint size • Data compression • Generic or domain-specific • Incremental checkpointing • Detect similarity between successive checkpoint images (sketched below) • Coordination between multiple I/O operations • I/O scheduling • Storage virtualization
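One common way to detect similarity between successive images is content hashing at a fixed chunk granularity, so that only chunks whose digests changed are rewritten. The sketch below illustrates that idea; it is an assumption about how this could be done, not the authors' implementation, and the 1 MiB chunk size and SHA-1 digest are arbitrary choices.

import hashlib

CHUNK_SIZE = 1 << 20  # compare images in 1 MiB blocks (illustrative)

def chunk_digests(path):
    """Return per-chunk SHA-1 digests of a checkpoint image."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK_SIZE)
            if not block:
                break
            digests.append(hashlib.sha1(block).hexdigest())
    return digests

def changed_chunks(prev_image, curr_image):
    """Indices of chunks in curr_image that differ from prev_image.

    Only these chunks need to be written out; unchanged chunks can be
    referenced from the previously stored image.
    """
    prev = chunk_digests(prev_image)
    curr = chunk_digests(curr_image)
    return [i for i, d in enumerate(curr) if i >= len(prev) or d != prev[i]]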
Summary • Checkpointing will remain an important component of HPC I/O • However, the existing HEC I/O stack was not initially created with parallel applications in mind • I/O scalability problems make checkpointing even more bottleneck-prone • Ample software/hardware opportunities exist • Making scientists’ jobs easier is key
More Optimizations (Work in Progress) • Draining to stable storage (see the sketch below) • Writes can proceed as fast as the drain operation can be masked • Pruning of checkpoint files • Purge the image from the previous interval once the current image is stored safely • The file system by itself is unable to perform such optimizations
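A minimal sketch of the drain-then-prune cycle described above, assuming checkpoints land first in a fast staging area and are copied to the parallel file system in the background. The directory layout, the ckpt_<interval>.img naming scheme, and the shutil-based copy are illustrative assumptions.

import os
import shutil
import threading

def drain_and_prune(staging_dir, pfs_dir, interval_id):
    """Copy the current checkpoint to stable storage in the background,
    then purge the previous interval's image from the staging area."""
    curr = os.path.join(staging_dir, f"ckpt_{interval_id}.img")
    prev = os.path.join(staging_dir, f"ckpt_{interval_id - 1}.img")

    def worker():
        # Drain: push the current image to the parallel file system,
        # overlapped with the application's next computation phase.
        shutil.copy(curr, os.path.join(pfs_dir, os.path.basename(curr)))
        # Prune: the previous image is now redundant; reclaim staging space.
        if os.path.exists(prev):
            os.remove(prev)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t  # caller may join() before taking the next checkpoint

The application only pays for the fast write into the staging area; draining and pruning are hidden behind the next computation phase.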