1 / 30

“ mcr E ngine ” a Scalable Checkpointing System using Data-Aware Aggregation and Compression

“ mcr E ngine ” a Scalable Checkpointing System using Data-Aware Aggregation and Compression. Tanzima Z. Islam, Saurabh Bagchi , Rudolf Eigenmann – Purdue University Kathryn Mohror , Adam Moody, Bronis R. de Supinski – Lawrence Livermore National Lab. Background.

lainey
Download Presentation

“ mcr E ngine ” a Scalable Checkpointing System using Data-Aware Aggregation and Compression

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. “mcrEngine”aScalable Checkpointing System usingData-Aware Aggregation and Compression Tanzima Z. Islam, SaurabhBagchi, Rudolf Eigenmann– Purdue University Kathryn Mohror, Adam Moody, Bronis R. de Supinski – Lawrence Livermore National Lab.

  2. Background • Checkpoint-restart widely used • MPI applications • Take globally coordinated checkpoints • Application-level checkpoint • High-level I/O format • HDF5, Adios, netCDF etc. • Checkpoint writing • HDF5 checkpoint{ • Group “/”{ • Group “ToyGrp”{ • DATASET “Temperature”{ • DATATYPE H5T_IEEE_F32LE • DATASPACE SIMPLE {(1024) / (1024)} • } • DATASET “Pressure” { • DATATYPE H5T_STD_U8LE • DATASPACE SIMPLE {(20,30) / (20,30)} • }}}} Parallel File System (PFS) Parallel File System (PFS) Parallel File System (PFS) Best compromise but complex Easiest but Contention on PFS Not scalable StructToyGrp{ 1. float Temperature[1024]; 2. short Pressure[20][30]; }; Application I/O Library NetCDF Data-Format API HDF5 NM NN N1 mcrEngine: Data-aware Aggregation & Compression

  3. Impact of Load on PFS at Large Scale • IOR • 78MB of data per process • NN checkpoint transfer • Observations: (-) Large average write time less frequent checkpointing (-) Large average read time poor application performance Average Read Time (s) Average Write Time (s) # of Processes (N) mcrEngine: Data-aware Aggregation & Compression

  4. What is the Problem? • Today’s checkpoint-restart systems will not scale • Increasing number of concurrent transfers • Increasing volume of checkpoint data mcrEngine: Data-aware Aggregation & Compression

  5. Our Contributions • Data-aware aggregation • Reduces the number of concurrent transfers • Improves compressibility of checkpoints • Data-aware compression • Reduces data almost 2x more than simply concatenating them and compressing • Design and develop mcrEngine • NM checkpointing system • Decouples checkpoint transfer logic from applications • Improves application performance mcrEngine: Data-aware Aggregation & Compression

  6. Overview • Background • Problem • Data aggregation & compression • Evaluation mcrEngine: Data-aware Aggregation & Compression

  7. Data-Agnostic Schemes • Agnostic scheme – concatenate checkpoints • Agnostic-block scheme – interleave fixed-size blocks • Observations: (+) Easy (−) Low compression ratio First Phase C1 C1 C2 PFS PFS Gzip Gzip C2 C1 [1-B] C1 [B+1-2B] C1 [1-B] C2 [1-B] C1 [B+1-2B] C2 [B+1-2B] C2 [1-B] C2 [B+1-2B] mcrEngine: Data-aware Aggregation & Compression

  8. Identify Similar Variables Across Processes Aware Scheme • Meta-data: • Name • Data-type • Class: • -- Array, Atomic P0 P1 Group ToyGrp{ float Temperature[1024]; int Pressure[20][30]; }; Group ToyGrp{ float Temperature[100]; int Pressure[10][50]; }; C1.T C1.P C2.T C2.P Concatenating similar variables C1.T C2.T C1.P C2.P mcrEngine: Data-aware Aggregation & Compression

  9. Aware-Block Scheme • Meta-data: • Name • Data-type • Class: • -- Array, Atomic P0 P1 Group ToyGrp{ float Temperature[1024]; int Pressure[20][30]; }; Group ToyGrp{ float Temperature[100]; int Pressure[10][50]; }; C1.T C1.P C2.T C2.P First ‘B’ bytes of Temperature Next ‘B’ bytes of Temperature Interleave Pressure Interleaving similar variables mcrEngine: Data-aware Aggregation & Compression

  10. Data-Aware Aggregation & Compression • Aware scheme – concatenate similar variables • Aware-block scheme – interleave similar variables C1.T C2.T C1.P C2.P First Phase Lempel-Ziv Data-type aware compression FPC T P H D Output buffer Gzip Second Phase PFS mcrEngine: Data-aware Aggregation & Compression

  11. How mcrEngine Works • CNC : Compute node component • ANC: Aggregator node component • Rank-order groups, Group size = 4, NM checkpointing T P Meta-data Request T, P Identifies “similar” variables Applies data-aware aggregation and compression CNC CNC T P Request H, D CNC CNC ANC T P H D Group Meta-data Request T, P Gzip CNC H D CNC T P Request H, D CNC PFS Group CNC ANC Meta-data T P H D Request T, P Gzip CNC H D CNC CNC Request H, D Group CNC ANC T P H D Gzip mcrEngine: Data-aware Aggregation & Compression H D

  12. Overview • Background • Problem • Data aggregation & compression • Evaluation mcrEngine: Data-aware Aggregation & Compression

  13. Evaluation • Applications • ALE3D – 4.8GB per checkpoint set • Cactus – 2.41GB per checkpoint set • Cosmology – 1.1GB per checkpoint set • Implosion – 13MB per checkpoint set • Experimental test-bed • LLNL’s Sierra: 261.3 TFLOP/s, Linux cluster • 15,408 cores, 1.3 Petabyte Lustre file system • Compression algorithm • FPC [1] for double-precision float • Fpzip [2] for single-precision float • Lempel-Ziv for all other data-types • Gzip for general-purpose compression mcrEngine: Data-aware Aggregation & Compression

  14. Evaluation Metrics • Effectiveness of data-aware compression • What is the benefit of multiple compression phases? • How does group size affect compression ratio? • How does compression ratio change as a simulation progresses? • Performance of mcrEngine • Overhead of the checkpointing phase • Overhead of the restart phase Uncompressed size Compression ratio = Compressed size mcrEngine: Data-aware Aggregation & Compression

  15. No Benefit with Data-Agnostic Double Compression Multiple Phases of Data-Aware Compressionare Beneficial • Data-type aware compression improves compressibility • First phase changes underlying data format • Data-agnostic double compression is not beneficial • Because, data-format is non-uniform and uncompressible Data-Aware Compression Ratio Data-Agnostic mcrEngine: Data-aware Aggregation & Compression

  16. Impact of Group Size on Compression Ratio • Different merging schemes better for different applications • Larger group size beneficial for certain applications • ALE3D: Improvement of 8% from group size 2 to 32 ALE3D Cactus Aware-Block Compression Ratio Aware Cosmology Implosion Group size mcrEngine: Data-aware Aggregation & Compression

  17. Data-Aware Technique Always Wins over Data-Agnostic • Data-aware technique always yields better compression ratio than Data-Agnostic technique ALE3D Cactus 98-115% Aware-Block Compression Ratio Aware Cosmology Implosion Agnostic-Block Agnostic Group size mcrEngine: Data-aware Aggregation & Compression

  18. Compression Ratio Follows Course of Simulation • Data-aware technique always yields better compression Cactus Aware-Block Aware Agnostic-Block Agnostic Compression Ratio Cosmology Implosion Simulation Time-steps mcrEngine: Data-aware Aggregation & Compression

  19. Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme mcrEngine: Data-aware Aggregation & Compression

  20. Impact of Aggregation on Scalability • Used IOR • NN: Each process transfers 78MB • NM: Group size 32, 1.21GB per aggregator Average Write Time (sec) N->N Write N->M Write Average Read Time (sec) N->N Read N->M Read mcrEngine: Data-aware Aggregation & Compression

  21. Impact of Data-Aware Compression on Scalability • IOR with NM transfer, groups of 32 processes • Data-aware: 1.2GB, data-agnostic: 2.4GB • Data-aware compression improves I/O performance at large scale • Improvement during write 43% - 70% • Improvement during read 48% - 70% Agnostic-Read Agnostic-Write Agnostic Aware-Read Aware-Write Aware mcrEngine: Data-aware Aggregation & Compression

  22. End-to-End Checkpointing Overhead • 15,408 processes • Group size of 32 for NM schemes • Each process takes a checkpoint • Converts network bound operation into CPU bound one Reduction in checkpointing overhead 87% 51% CPU Overhead Total Checkpointing Overhead (sec) Transfer Overhead to PFS mcrEngine: Data-aware Aggregation & Compression

  23. End-to-End Restart Overhead • Reduced overall restart overhead • Reduced network load and transfer time Reduction in recovery overhead Reduction in I/O overhead 62% 64% Total Recovery Overhead (sec) CPU Overhead Transfer Overhead to PFS 43% 71% mcrEngine: Data-aware Aggregation & Compression

  24. Conclusion • Developed data-aware checkpoint compression technique • Relative improvement in compression ratio up to 115% • Investigated different merging techniques • Evaluated effectiveness using real-world applications • Designed and developed a scalable framework • Implements NM checkpointing • Improves application performance • Transforms checkpointing into CPU bound operation mcrEngine: Data-aware Aggregation & Compression

  25. Contact Information • Tanzima Islam (tislam@purdue.edu) • Website: web.ics.purdue.edu/~tislam Acknowledgement • Purdue: • SaurabhBagchi (sbagchi@purdue.edu) • Rudolf Eigenmann (eigenman@purdue.edu) • Lawrence Livermore National Laboratory • Kathryn Mohror (kathryn@llnl.gov) • Adam Moody (moody20@llnl.gov) • Bronis R. de Supinski (bronis@llnl.gov) mcrEngine: Data-aware Aggregation & Compression

  26. mcrEngine: Data-aware Aggregation & Compression

  27. Backup Slides mcrEngine: Data-aware Aggregation & Compression

  28. [Backup Slide] Failures in HPC • “A Large-scale Study of Failures in High-performance Computing Systems, by Bianca Schroeder, Garth Gibson Breakdown of root causes of failures Breakdown of downtime into root causes mcrEngine: Data-aware Aggregation & Compression

  29. Future Work • Analytical solution to group size selection? • Better way than rank-order grouping? • Variable streaming? • Engineering challenge mcrEngine: Data-aware Aggregation & Compression

  30. References • M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed Compressor for Double-Precision Floating-Point Data”. • P. Lindstrom and M. Isenburg, “Fast and Efficient Compression of Floating-Point Data”. • L. Reinhold, “QuickLZ”. mcrEngine: Data-aware Aggregation & Compression

More Related