“ mcr E ngine ” a Scalable Checkpointing System using Data-Aware Aggregation and Compression

“mcrEngine”aScalable Checkpointing System usingData-Aware Aggregation and Compression Tanzima Z. Islam, SaurabhBagchi, Rudolf Eigenmann– Purdue University Kathryn Mohror, Adam Moody, Bronis R. de Supinski – Lawrence Livermore National Lab.

Background • Checkpoint-restart widely used • MPI applications • Take globally coordinated checkpoints • Application-level checkpoint • High-level I/O format • HDF5, Adios, netCDF etc. • Checkpoint writing • HDF5 checkpoint{ • Group “/”{ • Group “ToyGrp”{ • DATASET “Temperature”{ • DATATYPE H5T_IEEE_F32LE • DATASPACE SIMPLE {(1024) / (1024)} • } • DATASET “Pressure” { • DATATYPE H5T_STD_U8LE • DATASPACE SIMPLE {(20,30) / (20,30)} • }}}} Parallel File System (PFS) Parallel File System (PFS) Parallel File System (PFS) Best compromise but complex Easiest but Contention on PFS Not scalable StructToyGrp{ 1. float Temperature[1024]; 2. short Pressure[20][30]; }; Application I/O Library NetCDF Data-Format API HDF5 NM NN N1 mcrEngine: Data-aware Aggregation & Compression

Impact of Load on PFS at Large Scale • IOR • 78MB of data per process • NN checkpoint transfer • Observations: (-) Large average write time less frequent checkpointing (-) Large average read time poor application performance Average Read Time (s) Average Write Time (s) # of Processes (N) mcrEngine: Data-aware Aggregation & Compression

What is the Problem? • Today’s checkpoint-restart systems will not scale • Increasing number of concurrent transfers • Increasing volume of checkpoint data mcrEngine: Data-aware Aggregation & Compression

Our Contributions • Data-aware aggregation • Reduces the number of concurrent transfers • Improves compressibility of checkpoints • Data-aware compression • Reduces data almost 2x more than simply concatenating them and compressing • Design and develop mcrEngine • NM checkpointing system • Decouples checkpoint transfer logic from applications • Improves application performance mcrEngine: Data-aware Aggregation & Compression

Overview • Background • Problem • Data aggregation & compression • Evaluation mcrEngine: Data-aware Aggregation & Compression

Data-Agnostic Schemes • Agnostic scheme – concatenate checkpoints • Agnostic-block scheme – interleave fixed-size blocks • Observations: (+) Easy (−) Low compression ratio First Phase C1 C1 C2 PFS PFS Gzip Gzip C2 C1 [1-B] C1 [B+1-2B] C1 [1-B] C2 [1-B] C1 [B+1-2B] C2 [B+1-2B] C2 [1-B] C2 [B+1-2B] mcrEngine: Data-aware Aggregation & Compression

Identify Similar Variables Across Processes Aware Scheme • Meta-data: • Name • Data-type • Class: • -- Array, Atomic P0 P1 Group ToyGrp{ float Temperature[1024]; int Pressure[20][30]; }; Group ToyGrp{ float Temperature[100]; int Pressure[10][50]; }; C1.T C1.P C2.T C2.P Concatenating similar variables C1.T C2.T C1.P C2.P mcrEngine: Data-aware Aggregation & Compression

Aware-Block Scheme • Meta-data: • Name • Data-type • Class: • -- Array, Atomic P0 P1 Group ToyGrp{ float Temperature[1024]; int Pressure[20][30]; }; Group ToyGrp{ float Temperature[100]; int Pressure[10][50]; }; C1.T C1.P C2.T C2.P First ‘B’ bytes of Temperature Next ‘B’ bytes of Temperature Interleave Pressure Interleaving similar variables mcrEngine: Data-aware Aggregation & Compression

Data-Aware Aggregation & Compression • Aware scheme – concatenate similar variables • Aware-block scheme – interleave similar variables C1.T C2.T C1.P C2.P First Phase Lempel-Ziv Data-type aware compression FPC T P H D Output buffer Gzip Second Phase PFS mcrEngine: Data-aware Aggregation & Compression

How mcrEngine Works • CNC : Compute node component • ANC: Aggregator node component • Rank-order groups, Group size = 4, NM checkpointing T P Meta-data Request T, P Identifies “similar” variables Applies data-aware aggregation and compression CNC CNC T P Request H, D CNC CNC ANC T P H D Group Meta-data Request T, P Gzip CNC H D CNC T P Request H, D CNC PFS Group CNC ANC Meta-data T P H D Request T, P Gzip CNC H D CNC CNC Request H, D Group CNC ANC T P H D Gzip mcrEngine: Data-aware Aggregation & Compression H D

Overview • Background • Problem • Data aggregation & compression • Evaluation mcrEngine: Data-aware Aggregation & Compression

Evaluation • Applications • ALE3D – 4.8GB per checkpoint set • Cactus – 2.41GB per checkpoint set • Cosmology – 1.1GB per checkpoint set • Implosion – 13MB per checkpoint set • Experimental test-bed • LLNL’s Sierra: 261.3 TFLOP/s, Linux cluster • 15,408 cores, 1.3 Petabyte Lustre file system • Compression algorithm • FPC [1] for double-precision float • Fpzip [2] for single-precision float • Lempel-Ziv for all other data-types • Gzip for general-purpose compression mcrEngine: Data-aware Aggregation & Compression

Evaluation Metrics • Effectiveness of data-aware compression • What is the benefit of multiple compression phases? • How does group size affect compression ratio? • How does compression ratio change as a simulation progresses? • Performance of mcrEngine • Overhead of the checkpointing phase • Overhead of the restart phase Uncompressed size Compression ratio = Compressed size mcrEngine: Data-aware Aggregation & Compression

No Benefit with Data-Agnostic Double Compression Multiple Phases of Data-Aware Compressionare Beneficial • Data-type aware compression improves compressibility • First phase changes underlying data format • Data-agnostic double compression is not beneficial • Because, data-format is non-uniform and uncompressible Data-Aware Compression Ratio Data-Agnostic mcrEngine: Data-aware Aggregation & Compression

Impact of Group Size on Compression Ratio • Different merging schemes better for different applications • Larger group size beneficial for certain applications • ALE3D: Improvement of 8% from group size 2 to 32 ALE3D Cactus Aware-Block Compression Ratio Aware Cosmology Implosion Group size mcrEngine: Data-aware Aggregation & Compression

Data-Aware Technique Always Wins over Data-Agnostic • Data-aware technique always yields better compression ratio than Data-Agnostic technique ALE3D Cactus 98-115% Aware-Block Compression Ratio Aware Cosmology Implosion Agnostic-Block Agnostic Group size mcrEngine: Data-aware Aggregation & Compression

Compression Ratio Follows Course of Simulation • Data-aware technique always yields better compression Cactus Aware-Block Aware Agnostic-Block Agnostic Compression Ratio Cosmology Implosion Simulation Time-steps mcrEngine: Data-aware Aggregation & Compression

Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme mcrEngine: Data-aware Aggregation & Compression

Impact of Aggregation on Scalability • Used IOR • NN: Each process transfers 78MB • NM: Group size 32, 1.21GB per aggregator Average Write Time (sec) N->N Write N->M Write Average Read Time (sec) N->N Read N->M Read mcrEngine: Data-aware Aggregation & Compression

Impact of Data-Aware Compression on Scalability • IOR with NM transfer, groups of 32 processes • Data-aware: 1.2GB, data-agnostic: 2.4GB • Data-aware compression improves I/O performance at large scale • Improvement during write 43% - 70% • Improvement during read 48% - 70% Agnostic-Read Agnostic-Write Agnostic Aware-Read Aware-Write Aware mcrEngine: Data-aware Aggregation & Compression

End-to-End Checkpointing Overhead • 15,408 processes • Group size of 32 for NM schemes • Each process takes a checkpoint • Converts network bound operation into CPU bound one Reduction in checkpointing overhead 87% 51% CPU Overhead Total Checkpointing Overhead (sec) Transfer Overhead to PFS mcrEngine: Data-aware Aggregation & Compression

End-to-End Restart Overhead • Reduced overall restart overhead • Reduced network load and transfer time Reduction in recovery overhead Reduction in I/O overhead 62% 64% Total Recovery Overhead (sec) CPU Overhead Transfer Overhead to PFS 43% 71% mcrEngine: Data-aware Aggregation & Compression

Conclusion • Developed data-aware checkpoint compression technique • Relative improvement in compression ratio up to 115% • Investigated different merging techniques • Evaluated effectiveness using real-world applications • Designed and developed a scalable framework • Implements NM checkpointing • Improves application performance • Transforms checkpointing into CPU bound operation mcrEngine: Data-aware Aggregation & Compression

Contact Information • Tanzima Islam (tislam@purdue.edu) • Website: web.ics.purdue.edu/~tislam Acknowledgement • Purdue: • SaurabhBagchi (sbagchi@purdue.edu) • Rudolf Eigenmann (eigenman@purdue.edu) • Lawrence Livermore National Laboratory • Kathryn Mohror (kathryn@llnl.gov) • Adam Moody (moody20@llnl.gov) • Bronis R. de Supinski (bronis@llnl.gov) mcrEngine: Data-aware Aggregation & Compression

mcrEngine: Data-aware Aggregation & Compression

Backup Slides mcrEngine: Data-aware Aggregation & Compression

[Backup Slide] Failures in HPC • “A Large-scale Study of Failures in High-performance Computing Systems, by Bianca Schroeder, Garth Gibson Breakdown of root causes of failures Breakdown of downtime into root causes mcrEngine: Data-aware Aggregation & Compression

Future Work • Analytical solution to group size selection? • Better way than rank-order grouping? • Variable streaming? • Engineering challenge mcrEngine: Data-aware Aggregation & Compression

References • M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed Compressor for Double-Precision Floating-Point Data”. • P. Lindstrom and M. Isenburg, “Fast and Efficient Compression of Floating-Point Data”. • L. Reinhold, “QuickLZ”. mcrEngine: Data-aware Aggregation & Compression

“ mcr E ngine ” a Scalable Checkpointing System using Data-Aware Aggregation and Compression