Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package

Presentation Transcript


  1. Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package
  Christian Chilan, Kent Yang, Albert Cheng, Quincey Koziol, Leon Arber
  National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign

  2. Outline
  • HDF5 and PnetCDF performance comparison
  • Performance of collective and independent I/O operations
  • Collective I/O optimizations inside HDF5

  3. Introduction
  • HDF5 and netCDF both provide data formats and programming interfaces
  • HDF5
    • Hierarchical file structures
    • High flexibility for data management
  • NetCDF
    • Linear data layout
    • Self-descriptive and minimizes overhead
  • HDF5 and PnetCDF (a parallel version of netCDF from ANL/Northwestern U.) provide support for parallel access on top of MPI-IO
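
  As background for the comparison, the sketch below shows the usual way an application opens an HDF5 file for parallel access through MPI-IO. It is a minimal illustration, not code from the study: the file name is made up and error checking is omitted.

      #include <mpi.h>
      #include <hdf5.h>

      int main(int argc, char *argv[])
      {
          MPI_Init(&argc, &argv);

          /* File access property list that routes HDF5 I/O through MPI-IO */
          hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
          H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

          /* All processes create/open the same file collectively */
          hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

          /* ... dataset creation and parallel writes go here ... */

          H5Fclose(file);
          H5Pclose(fapl);
          MPI_Finalize();
          return 0;
      }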

  4. HDF5 and PnetCDF performance comparison
  • Previous study: PnetCDF achieves higher performance than HDF5
  • NCAR Bluesky: IBM Cluster 1600, AIX 5.1, GPFS, Power4
  • LLNL uP: IBM SP, AIX 5.3, GPFS, Power5
  • PnetCDF 1.0.1 vs. HDF5 1.6.5

  5. HDF5 and PnetCDF performance comparison
  • The benchmark is the I/O kernel of FLASH.
  • FLASH I/O generates 3D blocks of size 8x8x8 on Bluesky and 16x16x16 on uP.
  • Each processor handles 80 blocks and writes them into 3 output files.
  • The performance metric given by FLASH I/O is the parallel execution time.
  • The total amount of data grows with the number of processors.
  • Collective vs. independent I/O operations are compared.

  6. HDF5 and PnetCDF performance comparison
  [Charts: FLASH I/O parallel execution times for HDF5 and PnetCDF on Bluesky and uP]

  7. HDF5 and PnetCDF performance comparison
  [Charts: FLASH I/O parallel execution times for HDF5 and PnetCDF on Bluesky and uP (continued)]

  8. Performance of collective and independent I/O operations
  • Effects of using contiguous and chunked layout storage.
  • Four test cases: writing a rectangular selection of double floats, with different access patterns and I/O types.
  • Data is stored in row-major order.
  • The tests were executed on NCAR Bluesky using HDF5 version 1.6.5.
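
  In HDF5, the choice between independent and collective I/O is made per transfer through a dataset transfer property list. Below is a minimal sketch of how the two modes used in these tests are selected; the dataset, dataspace, and buffer handles (dset, memspace, filespace, buf) are assumed to exist already.

      /* Transfer property list that selects the I/O mode for this write */
      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

      /* Collective I/O: all processes cooperate in one MPI-IO operation */
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
      /* For independent I/O, pass H5FD_MPIO_INDEPENDENT instead */

      H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
      H5Pclose(dxpl);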

  9. Contiguous storage (Case 1)
  [Figure: rows of the dataset with the selections of P0–P3 interleaved within each row]
  • Every processor has a noncontiguous selection.
  • Access requests are interleaved.
  • Write operation with 32 processors; each processor selection has 512K rows and 8 columns (32 MB/proc.)
  • Independent I/O: 1,659.48 s
  • Collective I/O: 4.33 s
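
  A hedged sketch of how a rank could describe this kind of pattern with an HDF5 hyperslab: each rank selects a narrow 8-column block spanning all rows, so its selection is noncontiguous in the row-major file and the ranks' requests interleave. NROWS, nprocs, and mpi_rank are illustrative names, not values taken from the study.

      #define NROWS (512 * 1024)              /* rows in the dataset; every rank selects all of them */

      /* File dataspace: NROWS rows, 8 columns per rank, row-major on disk */
      hsize_t dims[2]  = {NROWS, 8 * (hsize_t)nprocs};
      hid_t filespace  = H5Screate_simple(2, dims, NULL);

      /* This rank's selection: all rows, its own 8-column stripe */
      hsize_t start[2] = {0, 8 * (hsize_t)mpi_rank};
      hsize_t count[2] = {NROWS, 8};
      H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

      /* Matching contiguous selection in memory */
      hsize_t mdims[1] = {NROWS * 8};
      hid_t memspace   = H5Screate_simple(1, mdims, NULL);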

  10. Contiguous storage (Case 2)
  [Figure: selections of P0–P3 laid out contiguously in the dataset]
  • No interleaved requests to combine.
  • Contiguity is already established.
  • Write operation with 32 processors; each processor selection has 16K rows and 256 columns (32 MB/proc.)
  • Independent I/O: 3.64 s
  • Collective I/O: 3.87 s

  11. Chunked storage (Case 3)
  [Figure: dataset stored as chunks 0 and 1, with each processor's selection split across the chunk boundary]
  • The chunk size does not divide the array size exactly.
  • If information about the access pattern is given, MPI-IO can perform efficient independent access.
  • Write operation with 32 processors; each processor selection has 16K rows and 256 columns (32 MB/proc.)
  • Independent I/O: 43.00 s
  • Collective I/O: 22.10 s
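
  Chunked layout is requested at dataset creation time through a dataset creation property list. The sketch below uses the HDF5 1.8-style H5Dcreate2 call (the 1.6.x H5Dcreate takes five arguments); the chunk shape is a placeholder, not the one used in the study.

      /* Dataset creation property list requesting chunked storage */
      hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
      hsize_t chunk_dims[2] = {16384, 256};   /* placeholder chunk shape */
      H5Pset_chunk(dcpl, 2, chunk_dims);

      hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);
      H5Pclose(dcpl);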

  12. Chunked storage (Case 4)
  [Figure: array partitioned exactly into chunk 0 and chunk 1, each chunk selected by a group of processors]
  • Array partitioned exactly into two chunks.
  • Write operation with 32 processors; each chunk is selected by 4 processors, and each processor selection has 256K rows and 32 columns (64 MB/proc.)
  • Independent I/O: 7.66 s
  • Collective I/O: 8.68 s

  13. Conclusions
  • Performance is quite comparable when the two libraries are used in similar manners.
  • Collective access should be preferred in most situations when using HDF5 1.6.5.
  • Use of chunked storage may affect the contiguity of the data selection.

  14. Improvements of collective I/O support inside HDF5
  • Advanced HDF5 feature: non-regular selections
  • Performance optimizations: chunked storage
  • Provide several I/O options to achieve good collective I/O performance
  • Provide APIs for applications to participate in the optimization process

  15. Improvement 1: Non-regular selections
  • Only one HDF5 I/O call is needed for a non-regular selection
  • Good for collective I/O
  • Are there applications with such access patterns?
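
  One way to express a non-regular selection is to OR additional hyperslabs into the file dataspace and then issue a single collective write for the whole union. The sketch below uses made-up block coordinates; the memory dataspace is assumed to select the same total number of elements, and dset, memspace, buf, and dxpl_collective are assumed to exist already.

      /* Union of two disjoint rectangular blocks: a non-regular selection */
      hsize_t start1[2] = {0, 0},    count1[2] = {100, 50};
      hsize_t start2[2] = {200, 80}, count2[2] = {60, 20};

      H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start1, NULL, count1, NULL);
      H5Sselect_hyperslab(filespace, H5S_SELECT_OR,  start2, NULL, count2, NULL);

      /* One collective write covers the entire irregular selection */
      H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl_collective, buf);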

  16. Performance issues for chunked storage
  • Previous support: collective I/O could only be done per chunk
  • Problem: severe performance penalties with many small per-chunk I/O operations

  17. Improvement 2: One linked-chunk I/O
  [Figure: the selections of P0 and P1 across chunks 1–4 are combined into a single MPI-IO collective view]

  18. Improvement 3: Multi-chunk I/O optimization
  • The option to do collective I/O per chunk has to be kept because of:
    • Collective I/O bugs inside different MPI-IO packages
    • Limitations of system memory
  • Problem: bad performance caused by improper use of collective I/O
  [Figure: chunks accessed by different subsets of processors P0–P8]

  19. Improvement 4
  • Problem: HDF5 may make the wrong decision about how to do collective I/O
  • Solution: provide APIs that let applications participate in the decision-making process
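
  In later HDF5 releases this kind of user input is exposed through transfer-property-list calls such as H5Pset_dxpl_mpio_chunk_opt and its _num/_ratio variants. A hedged sketch follows; the threshold values are arbitrary examples, and the comments paraphrase the intent rather than the exact library semantics.

      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

      /* Force the one linked-chunk I/O path instead of letting HDF5 decide */
      H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_ONE_IO);

      /* Or keep the automatic decision but adjust its thresholds */
      H5Pset_dxpl_mpio_chunk_opt_num(dxpl, 512);   /* chunks-per-process threshold for linked-chunk I/O */
      H5Pset_dxpl_mpio_chunk_opt_ratio(dxpl, 50);  /* percent of processes per chunk for collective I/O */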

  20. Flow chart of collective chunking I/O improvements inside HDF5
  [Flow chart: in collective chunk mode, HDF5 first decides (with optional user input) whether to use one linked-chunk I/O. If yes, it issues one collective I/O call for all chunks. If no, it falls back to multi-chunk I/O and decides per chunk (again with optional user input) whether to perform collective I/O on that chunk or independent I/O.]

  21. Conclusions
  • HDF5 provides collective I/O support for non-regular selections.
  • Supporting collective I/O for chunked storage is not trivial; several I/O options are provided.
  • Users can participate in the decision-making process that selects among the I/O options.

  22. Acknowledgments • This work is funded by National Science Foundation Teragrid grants, the Department of Energy's ASC Program, the DOE SciDAC program, NCSA, and NASA.
