Parallel I/O Performance Study

Christian Chilan, The HDF Group
SPEEDUP Workshop - HDF5 Tutorial

Introduction
Parallel performance is affected by the I/O access pattern, the file system, and the MPI communication mode. Determining how these elements interact provides hints for improving performance.
This study presents four test cases using h5perf and h5perf_serial.
• h5perf has been extended to support parallel testing of 2D datasets.
• h5perf_serial, based on h5perf, allows serial testing of n-dimensional datasets and various file drivers.
Testing includes various combinations of MPI communication modes and HDF5 storage layouts.
Finally, we make recommendations that can improve the I/O performance for specific patterns.

Testing Systems and Configuration

HDF5 Storage Layouts
[Figure: a dataset and its storage]
Contiguous
• HDF5 assigns a static contiguous region of storage for raw data.

HDF5 Storage Layouts
[Figure: dataset chunks C0, C1, C2, C3 and their order in storage, C0 C1 C2 C3]
Chunked
• HDF5 defines separate regions of storage for raw data, called chunks, which are pre-allocated in row-major order when a file is created in parallel.
• This layout holds only when the file is created and the chunks are pre-allocated. Further modification of the file may cause the chunks to be arranged differently.
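The row-major chunk order described above can be sketched in Python. This is an illustration only: the function names, their parameters, and the 1-byte element size are assumptions made for the example, and a real HDF5 file locates chunks through an index and stores metadata alongside them rather than relying on this arithmetic.

```python
def chunk_index(row, col, chunk_rows, chunk_cols, dataset_cols):
    """Return the row-major index of the chunk containing element (row, col)."""
    chunks_per_row = dataset_cols // chunk_cols  # chunks across one row of chunks
    return (row // chunk_rows) * chunks_per_row + (col // chunk_cols)


def chunk_file_offset(row, col, chunk_rows, chunk_cols, dataset_cols, elem_size=1):
    """Byte offset of the chunk, assuming chunks are packed in index order."""
    index = chunk_index(row, col, chunk_rows, chunk_cols, dataset_cols)
    return index * chunk_rows * chunk_cols * elem_size


# A 4x4 dataset split into 2x2 chunks gives a 2x2 grid of chunks C0..C3.
corners = [(0, 0), (0, 2), (2, 0), (2, 2)]
print([chunk_index(r, c, 2, 2, 4) for r, c in corners])  # [0, 1, 2, 3]
```

Under this simplified model, chunks that are neighbors along a row of the dataset are also neighbors in storage, while chunks that are neighbors along a column are separated by a full row of chunks.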

Test Cases
[Figure: 64K×64K dataset; column blocks 1K wide assigned to P0 P1 P2 P3 in repeating order]
Case A
• The transfer selections extend over entire columns, with a size of 64K×1K. If the storage is chunked, the chunk size is 1K×1K. The selections are interleaved horizontally with respect to the processors.

Test Cases
[Figure: 64K×64K dataset; blocks 32K high and 2K wide assigned to P0 P1 P2 P3 in repeating order]
Case B
• Each transfer selection spans only half the length of the columns, with a size of 32K×2K. If the storage is chunked, the chunk size is 2K×2K. The selections are interleaved horizontally with respect to the processors.

Test Cases
[Figure: 64K×64K dataset; blocks 2K high and 32K wide, with consecutive groups of blocks assigned to P0, P1, P2, P3]
Case C
• Each transfer selection spans only half the length of the rows, with a size of 2K×32K. If the storage is chunked, the chunk size is 2K×2K. The lower dimension (column) is evenly divided among the processors.

Test Cases
[Figure: 64K×64K dataset; row blocks 1K high, with consecutive groups of blocks assigned to P0, P1, P2, P3]
Case D
• The transfer selections extend over entire rows, with a size of 1K×64K. If the storage is chunked, the chunk size is 1K×1K. The lower dimension (column) is evenly divided among the processors.
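The two ways of assigning transfer blocks to processors in cases A-D can be sketched as follows. This is one reading of the slides, not h5perf code: both function names are hypothetical, and the example uses a scaled-down block count.

```python
def interleaved_owner(block, nprocs):
    """Cases A-B: blocks are handed out round-robin (P0 P1 P2 P3 P0 ...)."""
    return block % nprocs


def divided_owner(block, nblocks, nprocs):
    """Cases C-D: blocks are split into consecutive groups (P0 P0 ... P1 P1 ...)."""
    per_proc = nblocks // nprocs  # assumes nblocks is a multiple of nprocs
    return block // per_proc


nblocks, nprocs = 8, 4
print([interleaved_owner(b, nprocs) for b in range(nblocks)])       # [0, 1, 2, 3, 0, 1, 2, 3]
print([divided_owner(b, nblocks, nprocs) for b in range(nblocks)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```

For case A, 64K/1K gives 64 column blocks, so with the four processors P0-P3 drawn in the slides each processor would own 16 of them.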

Access Patterns
[Figure: storage regions assigned to P0, P1, P2, P3, without gaps and with gaps]
Contiguous
• Each processor accesses a separate region of contiguous storage. An example of this pattern is case D using contiguous storage.
Non-contiguous
• Separate regions are still assigned to each processor, but such regions contain gaps. Examples of this pattern include case C using contiguous storage, and collective cases C-D using chunked storage.
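Why case D is contiguous while case C has gaps can be checked by listing the byte runs that a rectangular selection occupies in row-major contiguous storage. The sketch below is illustrative: `contiguous_runs` is a hypothetical helper, elements are assumed to be 1 byte, and a 64×64 dataset stands in for the 64K×64K one.

```python
def contiguous_runs(row0, col0, nrows, ncols, dataset_cols, elem_size=1):
    """List the (offset, length) byte runs of a selection in row-major storage."""
    if col0 == 0 and ncols == dataset_cols:
        # Full rows follow one another in the file, so they merge into one run.
        return [(row0 * dataset_cols * elem_size, nrows * ncols * elem_size)]
    return [
        (((row0 + r) * dataset_cols + col0) * elem_size, ncols * elem_size)
        for r in range(nrows)
    ]


case_d = contiguous_runs(0, 0, 1, 64, 64)   # full rows: one run
case_c = contiguous_runs(0, 0, 2, 32, 64)   # half rows: runs separated by gaps
case_a = contiguous_runs(0, 0, 64, 1, 64)   # full columns: one tiny run per row
print(len(case_d), len(case_c), len(case_a))  # 1 2 64
```

The same arithmetic shows why cases A-B are the least favorable for contiguous storage: a column-shaped selection breaks into as many runs as it has rows.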

Access Patterns
[Figure: storage portions alternating among P0, P1, P2, P3]
Interleaved (or overlapped)
• Each processor writes into many portions that are interleaved with respect to the other processors. For example, contiguous storage with cases A-B generates this pattern.
• Another instance results from using chunked storage with collective cases A-B.

Performance Results and Analysis
The results correspond to the maximum throughput of Write Open-Close operations over 3 iterations.
Serial throughput is the performance baseline, since our objective is to determine how parallel access can improve performance.
Unlike GPFS and CXFS, Lustre does not stripe files by default. To enable parallel access, the directory or file must be striped using the lfs command.

I/O Performance in Lustre

I/O Performance in Lustre
Striping partitions the file space into stripes and assigns them to several Object Storage Targets (OSTs) in round-robin fashion.
Since each OST stores portions of the file that are different from those on the other OSTs, they can all access the file in parallel.
The default configuration on abe uses a stripe size of 4 MB and a stripe count of 16.
Striping improves performance when the I/O request of each processor spans several stripes (and OSTs) after any MPI aggregation.
When the processors make small independent I/O requests that are practically contiguous, as in cases A-B using chunked storage, a single OST can provide better performance due to asynchronous operations.
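The round-robin assignment of stripes to OSTs can be sketched as below, using the stripe size of 4 MB and stripe count of 16 quoted above as defaults. The function names are hypothetical, and the model is simplified: OST indices are relative to the first OST of the file, whereas a real Lustre file starts at whichever OST the file system selects.

```python
MB = 1024 * 1024


def ost_for_offset(offset, stripe_size=4 * MB, stripe_count=16):
    """Relative index of the OST that holds the byte at `offset`."""
    return (offset // stripe_size) % stripe_count


def osts_for_request(offset, length, stripe_size=4 * MB, stripe_count=16):
    """Sorted relative OST indices touched by a request (length must be > 0)."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return sorted({stripe % stripe_count for stripe in range(first, last + 1)})


print(osts_for_request(0, 1 * MB))        # [0]: a small request stays on one OST
print(len(osts_for_request(0, 64 * MB)))  # 16: a large request reaches every OST
```

In this model a request benefits from striping only once it is larger than one stripe, which matches the observation that small, nearly contiguous requests gain little from multiple OSTs.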

I/O Performance

Performance of Serial I/O
Access using contiguous storage shows the steepest performance trend as the cases change from A to D.
When using chunked storage, the throughput remains almost constant at the upper bound.
Allocating chunks at the time they are written makes the access pattern virtually contiguous regardless of the test case.
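The effect of allocating chunks at write time can be illustrated with a toy allocator that appends each chunk to the end of the file the first time it is written. This is a simplified model rather than HDF5's actual space allocation: `allocate_on_write` is a hypothetical name, and metadata and free-space reuse are ignored.

```python
def allocate_on_write(write_order, chunk_bytes):
    """Map each chunk to a file offset, appending chunks in the order first written."""
    offsets = {}
    for chunk in write_order:
        if chunk not in offsets:
            offsets[chunk] = len(offsets) * chunk_bytes
    return offsets


# A 2x2 grid of chunks written column by column: C0, C2, C1, C3.
write_order = [0, 2, 1, 3]
offsets = allocate_on_write(write_order, chunk_bytes=1024)
print([offsets[c] for c in write_order])  # [0, 1024, 2048, 3072]
```

Whatever order the test case writes the chunks in, the resulting file offsets increase sequentially, so the writes look contiguous to the file system.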

Performance of Independent I/O
Processors perform their I/O requests independently of each other.
For contiguous storage, performance improves as the tests move from A to D.
For chunked storage, throughput is high for interleaved cases A-B, since the written blocks (chunks) become larger and caching is exploited.
For cases C-D, the many write requests (one per chunk) multiply the overhead due to unnecessary locking and caching in Lustre and CXFS. Unlike these file systems, GPFS has shown better scalability [1,2].

Performance of Collective I/O
The participating processors coordinate and combine their many requests into fewer I/O operations, reducing latency.
Since the file space is evenly divided among the processors, no locking is needed, which reduces overhead [3].
For contiguous storage, performance is high overall, but there is still an increasing trend as the cases change from A to D.
For chunked storage, performance is even higher, with minor variations among the test cases, because several chunks can be written with a single I/O operation.
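The idea of combining many requests into fewer I/O operations can be sketched by merging adjacent byte ranges. This is only an illustration of request merging under assumed 1 KB blocks. It is not the collective I/O algorithm used by any MPI-IO implementation, and `coalesce` is a hypothetical helper.

```python
def coalesce(requests):
    """Merge (offset, length) requests that touch or overlap into larger ones."""
    merged = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            last_offset, last_length = merged[-1]
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged


# Sixteen 1 KB blocks, as four processors would issue them in an interleaved pattern.
requests = [(block * 1024, 1024) for block in range(16)]
print(coalesce(requests))  # [(0, 16384)]
```

Sixteen small requests collapse into one large one here, while requests separated by gaps stay separate, which is consistent with interleaved patterns gaining the most from collective mode.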

Conclusion
It is important to determine the access pattern by analyzing the I/O requirements of the application and the storage implementation.
For contiguous access patterns, independent access is preferable because it avoids the unnecessary overhead of collective calls.
For non-contiguous patterns, there is little difference between independent and collective access. However, writing many chunks in independent mode may be expensive in Lustre and CXFS if caching is not exploited.
For interleaved access patterns, collective mode is usually faster.
Across all the access patterns, collective mode with chunked storage is the combination that yields the highest average performance.

References
[1] J. Borrill, L. Oliker, J. Shalf, and H. Shan. Investigation of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark. In Proceedings of SC’07: High Performance Networking and Computing, Reno, NV, November 2007.
[2] W. Liao, A. Ching, K. Coloma, A. Choudhary, and L. Ward. An Implementation and Evaluation of Client-Side File Caching for MPI-IO. In Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2007, IEEE International Volume, Issue 26-30, pages 1-10, March 2007.
[3] R. Thakur, W. Gropp, and E. Lusk. Data Sieving and Collective I/O in ROMIO. In Proceedings of the 7th Symposium of the Frontiers of Massively Parallel Computation. IEEE Computer Society Press, February 1999.
