Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package
Christian Chilan, Kent Yang, Albert Cheng, Quincey Koziol, Leon Arber
National Center for Supercomputing Applications
University of Illinois, Urbana-Champaign
Outline
• HDF5 and PnetCDF performance comparison
• Performance of collective and independent I/O operations
• Collective I/O optimizations inside HDF5
Introduction
• HDF5 and netCDF provide data formats and programming interfaces
• HDF5
  • Hierarchical file structure
  • High flexibility for data management
• netCDF
  • Linear data layout
  • Self-describing format that minimizes overhead
• HDF5 and a parallel version of netCDF from ANL/Northwestern U. (PnetCDF) provide support for parallel access on top of MPI-IO
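For reference, the parallel access path in both libraries ultimately goes through MPI-IO. The sketch below shows the usual way an HDF5 application enables this, using the MPI-IO file driver; the file name is a placeholder and error checking is omitted.

```c
#include <mpi.h>
#include "hdf5.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File access property list that routes HDF5 I/O through the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* All ranks create/open the file together */
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... dataset creation and writes go here ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```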
HDF5 and PnetCDF performance comparison
• Previous study: PnetCDF achieves higher performance than HDF5
• Test platforms:
  • NCAR Bluesky: IBM Cluster 1600, Power4, AIX 5.1, GPFS
  • LLNL uP: IBM SP, Power5, AIX 5.3, GPFS
• PnetCDF 1.0.1 vs. HDF5 1.6.5
HDF5 and PnetCDF performance comparison
• The benchmark is the I/O kernel of FLASH.
• FLASH I/O generates 3D blocks of size 8x8x8 on Bluesky and 16x16x16 on uP.
• Each processor handles 80 blocks and writes them into 3 output files.
• The performance metric reported by FLASH I/O is the parallel execution time.
• The amount of data grows with the number of processors.
• Collective vs. independent I/O operations
[Charts: FLASH I/O write performance of HDF5 vs. PnetCDF on Bluesky and uP]
Performance of collective and independent I/O operations
• Effects of using contiguous and chunked storage layouts.
• Four test cases:
  • Writing a rectangular selection of double-precision floats.
  • Different access patterns and I/O types.
• Data is stored in row-major order.
• The tests were executed on NCAR Bluesky using HDF5 version 1.6.5.
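In HDF5 the collective/independent distinction is controlled per transfer through the dataset transfer property list. A minimal sketch follows; `dset`, `memspace`, `filespace`, and `buf` are assumed to have been created earlier.

```c
/* Dataset transfer property list selecting the MPI-IO transfer mode */
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

/* Collective transfer: all processes participate in the same H5Dwrite call */
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
/* For independent transfers, use H5FD_MPIO_INDEPENDENT instead. */

/* The write call itself is identical in both modes; only the property list differs */
H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

H5Pclose(dxpl);
```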
Contiguous storage (Case 1)
• Every processor has a noncontiguous selection.
• Access requests are interleaved.
• Write operation with 32 processors; each processor selection has 512K rows and 8 columns (32 MB/proc.)
  • Independent I/O: 1,659.48 s
  • Collective I/O: 4.33 s
[Diagram: rows of the dataset with the selections of P0-P3 interleaved within each row]
Contiguous storage (Case 2)
• No interleaved requests to combine.
• Contiguity is already established.
• Write operation with 32 processors; each processor selection has 16K rows and 256 columns (32 MB/proc.)
  • Independent I/O: 3.64 s
  • Collective I/O: 3.87 s
[Diagram: each processor's selection occupies a contiguous block of the file]
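The two access patterns above correspond to different hyperslab selections on the file dataspace. The sketch below is illustrative rather than the benchmark code; `filespace`, `dims`, `mpi_rank`, and `mpi_size` are assumed to be set up elsewhere, and the row/column counts are scaled-down examples.

```c
hsize_t start[2], count[2];

/* Case 1 pattern: each rank selects a narrow band of columns, so requests
 * from different ranks interleave in the row-major file layout */
start[0] = 0;               start[1] = (hsize_t)mpi_rank * 8;
count[0] = dims[0];         count[1] = 8;
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

/* Case 2 pattern: each rank selects a contiguous block of whole rows,
 * so each selection is already contiguous in the file */
hsize_t rows_per_proc = dims[0] / (hsize_t)mpi_size;
start[0] = (hsize_t)mpi_rank * rows_per_proc;   start[1] = 0;
count[0] = rows_per_proc;                       count[1] = dims[1];
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
```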
Chunked storage (Case 3)
• The chunk size does not evenly divide the array size.
• If information about the access pattern is given, MPI-IO can perform efficient independent access.
• Write operation with 32 processors; each processor selection has 16K rows and 256 columns (32 MB/proc.)
  • Independent I/O: 43.00 s
  • Collective I/O: 22.10 s
[Diagram: selections of P0-P3 split across chunk 0 and chunk 1 of the dataset]
Chunked storage (Case 4)
• The array is partitioned exactly into two chunks.
• Write operation with 32 processors; each chunk is selected by 4 processors, and each processor selection has 256K rows and 32 columns (64 MB/proc.)
  • Independent I/O: 7.66 s
  • Collective I/O: 8.68 s
[Diagram: chunks 0 and 1, each selected by a separate group of processors]
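Chunked storage in these cases is requested at dataset creation time. A minimal sketch, assuming `file` and `filespace` already exist; the dataset name and chunk shape are illustrative, and the 1.6-style H5Dcreate signature matches the library version used in the study (1.8+ adds extra property-list arguments).

```c
/* Dataset creation property list selecting a chunked layout */
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);

hsize_t chunk_dims[2] = {16384, 256};   /* illustrative chunk shape */
H5Pset_chunk(dcpl, 2, chunk_dims);

/* HDF5 1.6.x-style call, as used in this study */
hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace, dcpl);

H5Pclose(dcpl);
```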
Conclusions
• Performance is quite comparable when the two libraries are used in similar manners.
• Collective access should be preferred in most situations when using HDF5 1.6.5.
• Use of chunked storage may affect the contiguity of the data selection.
Improvements to collective I/O support inside HDF5
• Advanced HDF5 feature: non-regular selections
• Performance optimizations: chunked storage
• Provide several I/O options to achieve good collective I/O performance
• Provide APIs for applications to participate in the optimization process
Improvement 1: Non-regular selections
• Only one HDF5 I/O call
• Good for collective I/O
• Any such applications?
Performance issues for chunked storage
• Previous support
  • Collective I/O could only be done per chunk
• Problem
  • Severe performance penalties with many small per-chunk I/Os
Improvement 2: One linked-chunk I/O
[Diagram: chunks 1-4 combined into a single MPI-IO collective file view across P0 and P1]
Improvement 3: Multi-chunk I/O optimization
• The option to do collective I/O per chunk must be kept because of:
  • Collective I/O bugs inside different MPI-IO packages
  • Limited system memory
• Problem
  • Bad performance caused by improper use of collective I/O
Improvement 4
• Problem
  • HDF5 may make the wrong decision about how to do collective I/O
• Solution
  • Provide APIs for applications to participate in the decision-making process
Flow chart of collective chunked I/O improvements inside HDF5
• In collective chunk mode, HDF5 (with optional user input) first decides whether to use one linked-chunk I/O.
  • Yes: one collective I/O call for all chunks.
  • No: fall back to multi-chunk I/O, where HDF5 (again with optional user input) decides per chunk whether to use collective I/O.
    • Yes: collective I/O per chunk.
    • No: independent I/O.
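Later HDF5 releases (the 1.8 series) expose the optional user input in this flow as dataset transfer properties. The sketch below is a hedged illustration of those hints; their exact availability, semantics, and default thresholds depend on the HDF5 version.

```c
hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

/* Force the one-linked-chunk path instead of letting HDF5 decide */
H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_ONE_IO);

/* Alternatively, tune the automatic decisions:
 * number of chunks per process at which HDF5 switches to one linked-chunk I/O */
H5Pset_dxpl_mpio_chunk_opt_num(dxpl, 30);
/* percentage of processes selecting a chunk at which that chunk is written collectively */
H5Pset_dxpl_mpio_chunk_opt_ratio(dxpl, 60);
```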
Conclusions
• HDF5 provides collective I/O support for non-regular selections.
• Supporting collective I/O for chunked storage is not trivial; several I/O options are provided.
• Users can participate in the decision-making process that selects among the I/O options.
Acknowledgments
• This work is funded by National Science Foundation TeraGrid grants, the Department of Energy's ASC program, the DOE SciDAC program, NCSA, and NASA.