Design and Evaluation of Non-Blocking Collective I/O Operations
Vishwanath Venkatesan 1, Edgar Gabriel 1
1 Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston
<venkates, gabriel>@cs.uh.edu
Outline
• I/O Challenge in HPC
• MPI File I/O
• Non-blocking Collective Operations
• Non-blocking Collective I/O Operations
• Experimental Results
• Conclusions
I/O Challenge in HPC
• A 2005 paper from LLNL [1] states that applications on leadership-class machines require 1 GB/s of I/O bandwidth per teraflop of computing capability
• Jaguar at ORNL (fastest in 2008): in excess of 250 teraflops peak compute performance with a peak I/O performance of 72 GB/s [3]
• K computer (fastest in 2011): nearly 10 petaflops peak compute performance with a realized I/O bandwidth of 96 GB/s [2]
[1] R. Hedges, B. Loewe, T. McLarty, and C. Morrone. Parallel File System Testing for the Lunatic Fringe: the care and feeding of restless I/O Power Users. In Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005.
[2] S. Sumimoto. An Overview of Fujitsu's Lustre Based File System. Technical report, Fujitsu, 2011.
[3] M. Fahey, J. Larkin, and J. Adams. I/O performance on a massively parallel Cray XT3/XT4. In Parallel and Distributed Processing.
MPI File I/O
• MPI has been the de-facto standard for parallel programming over the last decade
• MPI I/O
  • File view: the portion of a file visible to a process
  • Individual and collective I/O operations
• Example to illustrate the advantage of collective I/O (see the sketch below)
  • 4 processes accessing a 2D matrix stored in row-major format
  • MPI I/O can detect this access pattern and issue one large I/O request followed by a distribution step for the data among the processes
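A minimal sketch (not taken from the talk) of the collective-write pattern described above, assuming a 2x2 block decomposition of an N x N row-major matrix across exactly four processes; the matrix size, file name, and fill values are illustrative:

#include <mpi.h>
#include <stdlib.h>

#define N 1024                       /* illustrative global matrix dimension */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* sketch assumes size == 4 */

    int half = N / 2;
    double *local = malloc((size_t)half * half * sizeof(double));
    for (int i = 0; i < half * half; i++)
        local[i] = rank;             /* dummy data */

    /* Each process owns one N/2 x N/2 block of the global row-major matrix,
       so its data is non-contiguous in the file. */
    int gsizes[2] = { N, N };
    int lsizes[2] = { half, half };
    int starts[2] = { (rank / 2) * half, (rank % 2) * half };
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* Collective write: the MPI-I/O layer can merge the four interleaved
       per-process requests into few large, contiguous file accesses plus
       a data-distribution step. */
    MPI_File_write_all(fh, local, half * half, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}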
Non-blocking Collective Operations
• Non-blocking point-to-point operations
  • Asynchronous data transfer operations
  • Hide communication latency by overlapping it with computation
  • Demonstrated benefits for a number of applications [1]
• Non-blocking collective communication operations were implemented in LibNBC [2]
  • Schedule-based design: a process-local schedule of point-to-point operations is created
  • Schedule execution is represented as a state machine (with dependencies)
  • State and schedule are attached to every request
• Non-blocking collective communication operations have been voted into the upcoming MPI-3 specification [2] (usage sketched below)
• Non-blocking collective I/O operations have not (yet) been added to the document
[1] D. Buettner, J. Kunkel, and T. Ludwig. Using Non-blocking I/O Operations in High Performance Computing to Reduce Execution Times. In Proceedings of the 16th European PVM/MPI Users' Group Meeting, 2009.
[2] T. Hoefler, A. Lumsdaine, and W. Rehm. Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI. Supercomputing 2007.
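As a brief illustration of the usage model (not part of the original slides), the following sketch overlaps a non-blocking allreduce with independent computation, using the names that were eventually standardized in MPI-3; LibNBC provides equivalent NBC_* entry points. The compute kernel and chunk count are placeholders.

#include <mpi.h>

static void compute_chunk(int c) { (void)c; /* placeholder for real work */ }

void overlap_allreduce(double *in, double *out, int n, int nchunks, MPI_Comm comm)
{
    MPI_Request req;
    int flag = 0;

    /* Start the collective; the call returns immediately. */
    MPI_Iallreduce(in, out, n, MPI_DOUBLE, MPI_SUM, comm, &req);

    for (int c = 0; c < nchunks; c++) {
        compute_chunk(c);                           /* work independent of 'out' */
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);   /* drive the schedule forward */
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);              /* 'out' is valid from here on */
}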
Non-blocking Collective I/O Operations
MPI_File_iwrite_all(MPI_File file, void *buf, int cnt, MPI_Datatype dt, MPI_Request *request)
• Different from non-blocking collective communication operations
  • Every process is allowed to provide a different amount of data per collective read/write operation
  • No process has a 'global' view of how much data is read/written
• Create a schedule for a non-blocking Allgather(v)
  • Determine the overall amount of data written across all processes
  • Determine the offsets for each data item within each group (sketched below)
• Upon completion:
  • Create a new schedule for the shuffle and I/O steps
  • The schedule can consist of multiple cycles
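The offset step can be pictured with the following sketch (not the library's actual code): each process contributes the number of bytes it will write, an allgather distributes all counts, and an exclusive prefix sum yields each process's starting offset. In the real implementation this exchange is part of the non-blocking schedule; a blocking MPI_Allgather is used here only for brevity.

#include <mpi.h>
#include <stdlib.h>

/* Returns the file offset at which this process should start writing,
   given that every process may contribute a different number of bytes. */
MPI_Offset compute_write_offset(MPI_Offset my_bytes, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Offset *all = malloc((size_t)size * sizeof(MPI_Offset));
    MPI_Allgather(&my_bytes, 1, MPI_OFFSET, all, 1, MPI_OFFSET, comm);

    /* Exclusive prefix sum: my data starts after everything written by
       the lower-ranked processes in this collective call. */
    MPI_Offset offset = 0;
    for (int i = 0; i < rank; i++)
        offset += all[i];

    free(all);
    return offset;
}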
Experimental Evaluation
• Crill cluster at the University of Houston
  • Distributed PVFS2 file system with 16 I/O servers
  • 4x SDR InfiniBand message passing network (2 ports per node)
  • Gigabit Ethernet I/O network
  • 18 nodes, 864 compute cores
• LibNBC integrated with Open MPI trunk rev. 24640
• Focus on collective write operations
Latency I/O Overlap Tests
• Overlap a non-blocking collective I/O operation with an equally expensive compute operation (sketched below)
  • Best case: overall time = max(I/O time, compute time)
• Strong dependence on the ability to make progress
  • Best case: time between subsequent calls to NBC_Test = time to execute one cycle of the collective I/O operation
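The micro-benchmark pattern can be sketched as follows, written against the MPI_File_iwrite_all prototype shown earlier; the compute kernel, block count, and timing granularity are illustrative, and progress is driven through MPI_Test in place of NBC_Test.

#include <mpi.h>

static void compute_block(int b) { (void)b; /* placeholder compute phase */ }

/* Returns the wall-clock time of one overlapped write + compute phase.
   In the best case this approaches max(I/O time, compute time). */
double overlapped_write(MPI_File fh, double *buf, int count, int nblocks)
{
    MPI_Request req;
    int flag = 0;
    double t0 = MPI_Wtime();

    MPI_File_iwrite_all(fh, buf, count, MPI_DOUBLE, &req);

    for (int b = 0; b < nblocks; b++) {
        compute_block(b);                           /* equally expensive computation */
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);   /* lets the coll. I/O progress */
    }

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return MPI_Wtime() - t0;
}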
Parallel Image Segmentation Application
• Used to assist in diagnosing thyroid cancer
• Based on microscopic images obtained through Fine Needle Aspiration (FNA) [1]
• Executes a convolution operation for different filters and writes the resulting data
• Code modified to overlap the write of iteration i with the computations of iteration i+1 (see the sketch below)
• Two code versions generated:
  • NBC: additional calls to the progress engine added between different code blocks
  • NBC w/FFTW: FFTW modified to insert further calls to the progress engine
[1] E. Gabriel, V. Venkatesan, and S. Shah. Towards High Performance Cell Segmentation in Multispectral Fine Needle Aspiration Cytology of Thyroid Lesions. Computer Methods and Programs in Biomedicine, 2009.
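A sketch of the pipelining pattern described above, assuming a simple double-buffering scheme; the function names, buffer handling, and per-iteration file positioning are hypothetical and are not the application's actual code.

#include <mpi.h>
#include <stdlib.h>

static void apply_filter(int f, double *out, int n)   /* placeholder convolution kernel */
{
    for (int j = 0; j < n; j++)
        out[j] = (double)(f + j);
}

void filter_pipeline(MPI_File fh, int nfilters, int count)
{
    /* Double buffering: compute into one buffer while the other is being written. */
    double *buf[2];
    buf[0] = malloc((size_t)count * sizeof(double));
    buf[1] = malloc((size_t)count * sizeof(double));

    MPI_Request req = MPI_REQUEST_NULL;
    for (int i = 0; i < nfilters; i++) {
        double *cur = buf[i % 2];

        /* Compute filter i while the write of iteration i-1 is still in flight;
           in the NBC w/FFTW version the kernel also calls the progress engine. */
        apply_filter(i, cur, count);

        MPI_Wait(&req, MPI_STATUS_IGNORE);                       /* finish write i-1 */
        MPI_File_iwrite_all(fh, cur, count, MPI_DOUBLE, &req);   /* start write i   */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);                           /* drain last write */

    free(buf[0]);
    free(buf[1]);
}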
Application Results
• 8192 x 8192 pixels, 21 spectral channels
• 1.3 GB input data, ~3 GB output data
• 32 aggregators with a 4 MB cycle buffer size
Conclusions
• Specification of non-blocking collective I/O operations is straightforward
• Implementation is challenging, but doable
• Results show a strong dependence on the ability to make progress
  • (Nearly) perfect overlap for the micro-benchmark
  • Mostly good results for the application scenario
• Up for a first vote in the MPI Forum