HDF5 collective chunk IO A Working Report
Motivation for this project • Found extremely bad performance of parallel HDF5 when implementing the WRF parallel HDF5 IO module with chunked storage. • Found that parallel HDF5 does not support MPI-IO collective write and read for chunked storage. • Had some time left in the MEAD project.
Why collective chunk IO? • Why use chunked storage? 1. Better performance when subsetting 2. Datasets with unlimited dimensions 3. Filters can be applied • Why collective IO? To take advantage of the performance optimizations provided by MPI-IO.
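For concreteness, here is a minimal sketch of creating a chunked dataset with an unlimited dimension through the HDF5 C API; the file name, dataset name, and chunk shape are illustrative only, and the H5Dcreate call assumes the HDF5 1.8+ signature.

#include "hdf5.h"

/* Minimal sketch: create a chunked dataset whose first dimension is
 * unlimited so it can grow later.  Names and sizes are illustrative. */
int main(void)
{
    hsize_t dims[2]    = {100, 200};
    hsize_t maxdims[2] = {H5S_UNLIMITED, 200};   /* unlimited rows */
    hsize_t chunk[2]   = {25, 200};              /* chunk shape */

    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, maxdims);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);   /* chunked layout is required for
                                       unlimited dimensions and for filters */

    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);  /* 1.8+ signature */

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}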
MPI-IO Basic Concepts • Collective IO: unlike independent IO, all processes must participate in the IO call. MPI-IO can optimize IO performance when collective IO is combined with per-process file views set by MPI_FILE_SET_VIEW.
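A minimal sketch of that combination, with an illustrative file name and an interleaved layout chosen only for demonstration: each rank describes its strided slice of the file with a derived datatype, installs it as its file view, and then all ranks write collectively.

#include <mpi.h>

/* Minimal sketch of collective MPI-IO: every rank sets a file view that
 * describes its (non-contiguous) portion of the file, then all ranks call
 * the collective write together.  File name and sizes are illustrative. */
int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Datatype filetype;
    const int blocklen = 4;                 /* ints written per rank */
    int buf[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < blocklen; i++) buf[i] = rank;

    /* Each rank owns an interleaved, strided slice of the file:
     * two blocks of 2 ints, separated by the other ranks' blocks. */
    MPI_Type_vector(2, blocklen / 2, (blocklen / 2) * nprocs,
                    MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* The file view tells MPI-IO which bytes belong to this rank; with
     * that global picture it can merge the per-rank pieces into a few
     * large accesses instead of many small ones. */
    MPI_File_set_view(fh, rank * (blocklen / 2) * sizeof(int),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);

    MPI_File_write_all(fh, buf, blocklen, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}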
An Example with 4 processes P0’s view P1’s view P2’s view P3’s view With independent IO, the worst case may require 8 individual IO accesses.
With collective IO P0 P1 P2 P3 It may need only one IO access to the disk. Check http://hdf.ncsa.uiuc.edu/apps/WRF-ROMS/parallel-netcdf.pdf and the references of that report for more information.
Challenges to support collective IO with chunked storage inside HDF5 • Have to fully understand how chunking is implemented inside HDF5. • Have to fully understand how MPI-IO is supported inside HDF5, especially how collective IO works with contiguous storage. • Have to find out how difficult it is to implement collective chunk IO inside HDF5.
Strategy to do the project • First, see whether we can implement collective chunk IO for special cases, such as one big chunk covering all singular hyperslab selections. • Then gradually increase the complexity of the problem until we can solve the general case.
Case 1: One chunk covers all singular hyperslabs P0 P1 P2 P3 All selections in one chunk
Progress made so far • Unexpectedly easy connection between the HDF5 chunk code and the collective IO code. • Found that the easy connection works for more general test cases than expected. • Wrote the test suite and checked it into HDF5 CVS in both the 1.6 and 1.7 branches. • Tackled more general cases.
Special cases to work with • One chunk covers all singular hyperslab selections for the different processes. • One chunk covers all regular hyperslab selections for the different processes. • All hyperslab selections are singular and the number of chunks inside each hyperslab selection is the same.
Case 1: One chunk covers all singular hyperslabs P0 P1 P2 P3 All selections in one chunk This case can be used in the WRF-PHDF5 module, and it was verified to work (a minimal sketch follows).
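The sketch below shows what such a collective write to a chunked dataset looks like through the parallel HDF5 C API: one chunk covers the whole dataset, each rank selects one row (a singular hyperslab), and the transfer property list requests collective IO. File and dataset names and the layout are illustrative, and H5Dcreate uses the 1.8+ signature.

#include <mpi.h>
#include "hdf5.h"

/* Minimal sketch of case 1: a chunked dataset whose single chunk covers the
 * whole dataset, each rank writing one contiguous (singular) hyperslab, with
 * the collective transfer mode requested. */
int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file with the MPI-IO file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("coll_chunk.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One chunk covering the full dataset: nprocs rows x 8 columns. */
    hsize_t dims[2]  = {(hsize_t)nprocs, 8};
    hsize_t chunk[2] = {(hsize_t)nprocs, 8};
    hid_t fspace = H5Screate_simple(2, dims, NULL);
    hid_t dcpl   = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, fspace,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Each rank selects one row of the chunk (a singular hyperslab). */
    hsize_t start[2] = {(hsize_t)rank, 0}, count[2] = {1, 8};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);

    int buf[8];
    for (int i = 0; i < 8; i++) buf[i] = rank;

    /* Request collective IO for the transfer. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset);
    H5Pclose(dcpl); H5Sclose(fspace); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}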
Case 2: One chunk covers all regular hyperslabs chunk P0 P1 P2 P3 Whether MPI collective chunk IO can optimize this pattern is another question and is outside the scope of this discussion.
Case 3: Multiple chunks cover singular hyperslabs each chunk size P0 P1 P2 P3 Condition for this case: number of chunks for each process must be equal.
More general cases • The hyperslab does not need to be singular. • One chunk does not need to cover all hyperslab selections for one process. • The number of chunks covering the hyperslab selections does NOT have to be the same across processes. • What about irregular hyperslab selections?
What does it look like? hyperslab selection CHUNK
More details In each chunk the overall selection becomes irregular, so we cannot use the existing collective IO code for contiguous storage to describe the above shape.
A little more thought • The current HDF5 implementation needs an individual IO access for the data stored in each chunk. With a large number of chunks, that causes bad performance. • Can we avoid this in the parallel HDF5 layer? • Is it possible to do some optimization and push the problem down into the MPI-IO layer?
What should we do • Build an MPI derived datatype to describe this pattern for each chunk; the hope is that once MPI-IO obtains the whole picture, it will recognize that this is a regular hyperslab selection and perform optimized IO. • To understand how MPI derived datatypes work, see “Derived Data Types with MPI” from http://www.msi.umn.edu/tutorial/scicomp/general/MPI/content6.html at the Supercomputing Institute of the University of Minnesota.
MPI Derived Datatype • Why? To provide a portable and efficient way to describe non-contiguous or mixed types in a message. • What? Built from the basic MPI datatypes: a sequence of basic datatypes and displacements.
How to construct the DDT • MPI_Type_contiguous • MPI_Type_vector • MPI_Type_indexed • MPI_Type_struct
MPI_TYPE_INDEXED • Parameters: count, blocklens[], offsets[], oldtype, *newtype count: number of blocks blocklens: number of elements in each block offsets: displacement of each block, in units of oldtype oldtype: datatype of each element newtype: handle (pointer) for the new derived type
MPI_TYPE_INDEXED count = 2; blocklengths[0] = 4; displacements[0] = 5; blocklengths[1] = 2; displacements[1] = 12; MPI_Type_indexed(count, blocklengths, displacements, MPI_INT, &indextype); (a runnable version follows below)
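Expanded into a self-contained program (with my own illustrative framing: rank 0 sends the two selected blocks of an int array to rank 1, so run it with at least two ranks), the example looks like this; note the MPI_Type_commit call required before the type can be used.

#include <mpi.h>
#include <stdio.h>

/* Runnable version of the slide's MPI_Type_indexed example: two blocks of an
 * int array (4 ints starting at offset 5, 2 ints starting at offset 12) are
 * described by one derived datatype and transferred in a single call. */
int main(int argc, char **argv)
{
    int rank;
    int blocklengths[2]  = {4, 2};
    int displacements[2] = {5, 12};
    MPI_Datatype indextype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_indexed(2, blocklengths, displacements, MPI_INT, &indextype);
    MPI_Type_commit(&indextype);          /* must commit before use */

    int data[16];
    for (int i = 0; i < 16; i++) data[i] = (rank == 0) ? i : -1;

    if (rank == 0) {
        MPI_Send(data, 1, indextype, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 1, indextype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Only elements 5..8 and 12..13 were transferred. */
        printf("data[5]=%d data[12]=%d\n", data[5], data[12]);
    }

    MPI_Type_free(&indextype);
    MPI_Finalize();
    return 0;
}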
Approach • We will build one MPI derived datatype for each process: use MPI_TYPE_STRUCT or MPI_TYPE_INDEXED to generate a derived datatype per chunk • Then use MPI_TYPE_STRUCT to generate the final MPI derived datatype for each process • Set the MPI file view • Let the MPI-IO layer figure out how to optimize this
Approach (continued) • Start by building a “basic” MPI derived datatype inside one chunk • Use the “basic” MPI derived datatypes to build an “advanced” MPI derived datatype for each process • Use MPI_File_set_view to glue this together. Done! Flow: obtain hyperslab selection information → build the “basic” MPI derived datatype PER CHUNK from the selection information → build the “advanced” MPI derived datatype PER PROCESS from the “basic” types → set the MPI file view based on the “advanced” type → send to the MPI-IO layer, done (see the sketch below).
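This is only a sketch of the two-level construction described above, not the actual HDF5 implementation: the selection_in_chunk helper, the chunk layout (8-int chunks assigned round-robin to ranks), the chunk count, and the file name are all made up for illustration, and MPI_Type_create_struct is used as the modern spelling of MPI_TYPE_STRUCT.

#include <mpi.h>

/* Sketch: one indexed ("basic") type per chunk, glued into one struct
 * ("advanced") type per process, which is then installed as the file view
 * for a collective write. */
#define NCHUNKS 2

static void selection_in_chunk(int chunk, int blocklens[2], int offsets[2])
{
    /* Placeholder for the real hyperslab-to-chunk intersection; here each
     * chunk simply contributes two small blocks of ints. */
    blocklens[0] = 2; offsets[0] = 0;
    blocklens[1] = 2; offsets[1] = 4;
    (void)chunk;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Datatype chunk_type[NCHUNKS], proc_type;
    int          lens[NCHUNKS];
    MPI_Aint     disps[NCHUNKS];

    for (int c = 0; c < NCHUNKS; c++) {
        int blocklens[2], offsets[2];
        selection_in_chunk(c, blocklens, offsets);
        /* "Basic" derived type: the selection inside chunk c. */
        MPI_Type_indexed(2, blocklens, offsets, MPI_INT, &chunk_type[c]);
        lens[c]  = 1;
        /* Byte displacement of chunk c in the file (illustrative layout:
         * chunks of 8 ints, assigned round-robin over the ranks). */
        disps[c] = ((MPI_Aint)c * nprocs + rank) * 8 * (MPI_Aint)sizeof(int);
    }

    /* "Advanced" derived type: all of this process's chunks together. */
    MPI_Type_create_struct(NCHUNKS, lens, disps, chunk_type, &proc_type);
    MPI_Type_commit(&proc_type);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "chunked.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* The file view hands the whole per-process picture to MPI-IO, which can
     * then merge and optimize the accesses across processes. */
    MPI_File_set_view(fh, 0, MPI_INT, proc_type, "native", MPI_INFO_NULL);

    int buf[NCHUNKS * 4];
    for (int i = 0; i < NCHUNKS * 4; i++) buf[i] = rank;
    MPI_File_write_all(fh, buf, NCHUNKS * 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    for (int c = 0; c < NCHUNKS; c++) MPI_Type_free(&chunk_type[c]);
    MPI_Type_free(&proc_type);
    MPI_Finalize();
    return 0;
}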
Schematic for MPI derived datatypes to support collective chunk IO inside parallel HDF5 (figure: P0 covers chunk 1, chunk 2, ..., chunk i, ..., chunk n; P1 covers chunk n+1, chunk n+2, ..., chunk n+i, ..., chunk n+m)
How to start • HDF5 uses a span tree to implement general hyperslab selections • The starting point is to build an MPI derived datatype for an irregular hyperslab selection with contiguous layout • After this step is finished, we will build an MPI derived datatype for chunked storage following the previous approach
Now a little off-track from the original project • We are trying to build an MPI derived datatype for irregular hyperslabs with contiguous storage. If this is solved, HDF5 can support collective IO for irregular hyperslab selections. • It may also improve the performance of independent IO. • Then we will build the advanced MPI derived datatype for chunked storage.
How to describe this hyperslab selection? The span tree should handle this well.
Span tree handling of overlapping hyperslab selections
Some Performance Hints • It was well known that performance with MPI derived datatypes was not very good; people used MPI_Pack and MPI_Unpack to gain performance in real applications. • A recent performance study shows that MPI derived datatypes can achieve performance comparable to MPI_Pack and MPI_Unpack (http://nowlab.cis.ohio-state.edu/publications/tech-reports/2004/TR19.pdf).