Non-Blocking Collective MPI I/O Routines Ticket #273
Introduction
• I/O is one of the main bottlenecks in HPC applications.
• Many applications and higher-level libraries rely on MPI-I/O for doing parallel I/O.
• Several optimizations have been introduced in MPI-I/O to meet the needs of applications:
  • Non-blocking individual I/O (sketched below)
  • Different collective I/O algorithms
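A minimal sketch of the existing non-blocking individual I/O path mentioned above: each rank starts its own write, overlaps it with computation, and completes it with MPI_Wait. The file name, offsets, and buffer size are illustrative assumptions, not part of the original slides.

    #include <mpi.h>

    #define COUNT 1024   /* assumed per-rank element count */

    int main(int argc, char **argv)
    {
        MPI_File    fh;
        MPI_Request req;
        double      buf[COUNT];
        int         rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < COUNT; i++)
            buf[i] = (double)rank;

        MPI_File_open(MPI_COMM_WORLD, "data.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Individual (non-collective) non-blocking write at this rank's
         * offset; the call returns immediately with a request handle. */
        MPI_File_iwrite_at(fh, (MPI_Offset)rank * COUNT * sizeof(double),
                           buf, COUNT, MPI_DOUBLE, &req);

        /* ... computation overlapped with the write ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the write */

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }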
Motivation
• Routines for non-blocking individual I/O operations already exist: MPI_File_i(read|write)(_at).
• Non-blocking point-to-point (existing) and collective (to be added) communication operations have demonstrated benefits.
• Split collective I/O operations have their restrictions and limitations (see the sketch below).
• What is keeping us from adding non-blocking collective I/O operations?
• Implementation.
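For contrast, a brief sketch of the split collective interface referenced above and the restrictions that motivate this proposal; the helper function name and parameters are illustrative.

    #include <mpi.h>

    /* Split collective write: the only overlap mechanism the current
     * standard offers for collective I/O.  'fh' is assumed to be an open
     * file handle and 'buf'/'count' the local contribution of this rank. */
    void write_with_split_collective(MPI_File fh, double *buf, int count)
    {
        MPI_Status status;

        /* Begin the collective write.  Restrictions: only one split
         * collective may be pending per file handle, and there is no
         * MPI_Request, so the operation cannot be tested or waited on
         * together with other non-blocking operations. */
        MPI_File_write_all_begin(fh, buf, count, MPI_DOUBLE);

        /* ... computation overlapped with the collective write ... */

        MPI_File_write_all_end(fh, buf, &status);   /* blocks until done */
    }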
Usecase (I)
• HDF5 operations that modify metadata:
  • They are collective, to keep the metadata cache synchronized across all processes.
  • The metadata cache uses an LRU eviction scheme.
  • Items at the bottom of the list are evicted in a collective write call to disk; the amount of data written is usually small (< 1 KB).
• Non-blocking collective I/O would allow HDF5 to fire off those writes and continue with other work, hiding the I/O overhead.
Usecase (II)
• HDF5 raw data operations:
  • Chunking data in the file is a key optimization HDF5 uses for parallel I/O.
  • If HDF5 can detect a pattern in the way chunks are accessed, it can pre-fetch those chunks from disk.
  • Asynchronous I/O operations would hide the cost of those pre-fetches.
• Chunk cache for writes (currently disabled for parallel HDF5):
  • Similar concept to the metadata cache.
New Routines
• MPI_File_iread_all(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Request *req);
• MPI_File_iwrite_all(MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Request *req);
• MPI_File_iread_at_all(MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype type, MPI_Request *req);
• MPI_File_iwrite_at_all(MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype type, MPI_Request *req);
• Ordered read/write (add non-blocking versions or deprecate the ordered routines)
• Deprecate split collectives
• Straw vote: 22 – 0 – 0
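A minimal usage sketch of the proposed non-blocking collective write, assuming a contiguous per-rank layout; the helper name, offsets, and overlap pattern are illustrative, but the calling convention (start, overlap, then MPI_Wait) is exactly what the proposal intends.

    #include <mpi.h>

    /* Proposed usage pattern: start the collective write on all ranks
     * without blocking, overlap computation, then complete it with a
     * plain MPI_Wait like any other request-based operation. */
    void checkpoint_async(MPI_File fh, double *buf, int count, int rank)
    {
        MPI_Request req;
        MPI_Offset  offset = (MPI_Offset)rank * count * sizeof(double);

        /* All ranks enter the collective, but none of them block here. */
        MPI_File_iwrite_at_all(fh, offset, buf, count, MPI_DOUBLE, &req);

        /* ... computation overlapped with the collective write ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completes the collective */
    }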
Challenges
• A major difference between collective communication and collective I/O operations:
  • Each process is allowed to provide a different volume of data to a collective I/O operation, without any knowledge of the data volumes provided by the other processes.
  • Collective I/O algorithms perform aggregation.
Implementation
• Needs non-blocking collective communication.
• Integrate with the progress engine:
  • Test/Wait on the request like other non-blocking operations.
  • Explicit or implicit progress?
• Different collective I/O algorithms.
Implementation
• A recent implementation was done within an Open MPI-specific I/O library (OMPIO) and uses LibNBC:
  • It leverages the same concept of a schedule used for non-blocking collective communication operations.
  • The work is still at a preliminary stage, so a large-scale evaluation is not yet available.
  • Done at the PSTL at the University of Houston (Edgar Gabriel) in collaboration with Torsten Hoefler.
  • Paper accepted at EuroMPI 2011: "Design and Evaluation of Nonblocking Collective I/O Operations".
Other MPI I/O Operations
• Besides the read/write functions, several MPI I/O functions are considered expensive:
  • Open/Close
  • Sync
  • Set view
  • Set/Get size
• It would be valuable to have non-blocking versions of some of those functions too.
Usecases
• Applications that open a file but do not touch it until a certain amount of computation has been done:
  • The cost of opening the file would be hidden.
• A non-blocking sync would also provide great advantages when data items are flushed to disk before returning to computation.
• The intention is to hide the cost (whenever possible) of all the expensive MPI I/O operations.
Proposed Routines
• MPI_File_iopen(MPI_Comm comm, char *filename, int amode, MPI_Info info, MPI_File *fh, MPI_Request *req);
• MPI_File_iclose(MPI_File fh, MPI_Request *req);
• MPI_File_isync(MPI_File file, MPI_Request *req);
• MPI_File_iset_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info, MPI_Request *req);
• MPI_File_iset_size(MPI_File fh, MPI_Offset size, MPI_Request *req);
• MPI_File_ipreallocate(MPI_File fh, MPI_Offset size, MPI_Request *req);
• MPI_File_iset_info(MPI_File fh, MPI_Info info, MPI_Request *req);
• Straw vote: 15 – 1 – [5 (need to think), 1 (doesn't care)]
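A hedged sketch of how an application could use two of the routines proposed above (MPI_File_iopen and MPI_File_isync); these are proposals, not existing standard routines, and the file name, flags, and overlap pattern are assumptions made for illustration.

    #include <mpi.h>

    /* Hypothetical usage of the PROPOSED MPI_File_iopen / MPI_File_isync
     * routines listed above; neither exists in the current standard, so
     * this only illustrates the intended pattern: start the expensive
     * operation, compute, then wait for it. */
    void open_write_flush_async(double *buf, int count)
    {
        MPI_File    fh;
        MPI_Request req;

        /* Proposed: the open returns immediately; fh becomes usable
         * after MPI_Wait. */
        MPI_File_iopen(MPI_COMM_WORLD, "results.out",
                       MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL,
                       &fh, &req);

        /* ... computation that does not yet need the file ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_File_write_all(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        /* Proposed: flush to disk in the background, then compute more. */
        MPI_File_isync(fh, &req);
        /* ... more computation ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }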
Conclusion
• The need for non-blocking collective I/O is fairly high.
• The implementation is the hard part.
• The performance benefits can be substantial.
• Users would also benefit from non-blocking versions of other MPI I/O operations that are fairly time-consuming.