Pursuing Faster I/O in COSMO
POMPA Workshop, May 3rd 2010
Recap – why I/O is such a problem
• Idealised 2D grid layout: increasing the number of processors by 4 leaves each processor with
  • one quarter the number of grid points to compute
  • one half the number of halo points to communicate
• The same total amount of data still needs to be output at each time step

  P processors, each with:   M x N grid points,         2M + 2N halo points
  4P processors, each with:  (M/2) x (N/2) grid points, M + N halo points

• The I/O problem:
  • I/O is the limiting factor for the scaling of COSMO
  • It is the limiting factor for many data-intensive applications
  • The speed of I/O subsystems for writing data is not keeping up with the increase in speed of compute engines
I/O reaches a scaling limit
• Computation: scales as O(P) for P processors
  • Minor scaling problem – issues of memory bandwidth, vector lengths, efficiency of the software pipeline, etc.
• Communication: scales as O(√P) for P processors
  • Major scaling problem – the halo region shrinks only slowly as the number of processors increases
• I/O (mainly "O"): no scaling
  • The limiting factor in scaling – the same total amount of data is output at each time step (summarised in the formulae below)
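Written out as formulae (a sketch of the same argument for the idealised decomposition above, with constants dropped), the per-timestep costs on P processors behave roughly as

\[
T_{\mathrm{compute}}(P) \propto \frac{1}{P}, \qquad
T_{\mathrm{halo}}(P) \propto \frac{1}{\sqrt{P}}, \qquad
T_{\mathrm{output}}(P) \propto 1,
\]

so as P grows the constant output term eventually dominates the time step.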
Current I/O strategies in COSMO
• Two output formats – GRIB and NetCDF
  • GRIB is dominant in operational weather forecasting
  • NetCDF is the main format used in climate research
• GRIB output can use asynchronous I/O processes to improve parallel performance
• NetCDF output is always ultimately serialised through process zero of the simulation
• In both cases the output follows a multi-level data-collection approach
Multi-level approach
[Diagram: a 3 x 6 grid of compute processes, each holding 3 atmospheric levels; the data is first collected on atmospheric levels (e.g. onto processes 0, 1 and 2), then passed to an I/O process and written to storage.]
• Data is sent to the I/O process level by level (see the MPI sketch below)
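As a rough illustration of the gather-on-levels stage, here is a minimal MPI sketch, assuming equal-sized local subdomains; the function and variable names are hypothetical, and the real COSMO implementation differs in detail.

/* Sketch: gather one 3D field onto an I/O rank, level by level.
 * Assumes every rank holds the same number of local points per level
 * (nlocal); a real code would use MPI_Gatherv for uneven decompositions. */
#include <mpi.h>
#include <stdlib.h>

void gather_on_levels(const double *field,   /* local data: nlevels * nlocal values */
                      int nlevels, int nlocal,
                      int io_rank, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    double *level_buf = NULL;
    if (rank == io_rank)
        level_buf = malloc((size_t)nprocs * nlocal * sizeof(double));

    for (int k = 0; k < nlevels; ++k) {
        /* every rank contributes its slice of level k */
        MPI_Gather(field + (size_t)k * nlocal, nlocal, MPI_DOUBLE,
                   level_buf,                  nlocal, MPI_DOUBLE,
                   io_rank, comm);
        /* on io_rank: reorder level_buf into the global 2D slice and
         * hand it to the collect-and-store stage (not shown) */
    }

    free(level_buf);   /* free(NULL) is a no-op on the non-I/O ranks */
}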
Performance limitations and constraints
• Both GRIB and NetCDF formats carry out the gather-on-levels stage
• For GRIB-based weather simulations the final collect-and-store stage can deploy multiple I/O processes to deal with the data
  • This allows improved performance where real storage bandwidth is the bottleneck
  • It produces multiple files (one per I/O process) that can easily be concatenated together
• With NetCDF, only process 0 can currently act as an I/O process for the collect-and-store stage
  • This serialises the I/O through one compute process
Possible strategies for fast NetCDF I/O
• Use a version of parallel NetCDF so that all compute processes write to disk (sketched below)
  • Eliminates both the gather-on-levels and collect-and-store stages
• Use a version of parallel NetCDF on the subset of compute processes needed for the gather stage
  • Eliminates the collect-and-store stage
• Use a set of asynchronous I/O processes, as is currently done in the GRIB implementation
  • If more than one asynchronous process is employed this requires either parallel NetCDF or post-processing
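A minimal sketch of the first strategy, using the PnetCDF C interface (file name, variable names and decomposition parameters are illustrative only): every compute rank writes its own hyperslab of a 3D field in a single collective call, so neither the gather-on-levels nor the collect-and-store stage is needed.

/* Sketch: every compute rank writes its own subdomain of a 3D field
 * with PnetCDF collective I/O (error handling omitted for brevity). */
#include <mpi.h>
#include <pnetcdf.h>

void write_field_parallel(MPI_Comm comm, const float *local,
                          MPI_Offset nz, MPI_Offset ny_loc, MPI_Offset nx_loc,
                          MPI_Offset ny_glob, MPI_Offset nx_glob,
                          MPI_Offset y0, MPI_Offset x0)   /* this rank's offsets */
{
    int ncid, dimids[3], varid;

    ncmpi_create(comm, "field.nc", NC_CLOBBER | NC_64BIT_DATA,
                 MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "lev", nz,      &dimids[0]);
    ncmpi_def_dim(ncid, "y",   ny_glob, &dimids[1]);
    ncmpi_def_dim(ncid, "x",   nx_glob, &dimids[2]);
    ncmpi_def_var(ncid, "T", NC_FLOAT, 3, dimids, &varid);
    ncmpi_enddef(ncid);

    MPI_Offset start[3] = { 0,  y0,     x0     };
    MPI_Offset count[3] = { nz, ny_loc, nx_loc };

    /* collective write of this rank's hyperslab */
    ncmpi_put_vara_float_all(ncid, varid, start, count, local);
    ncmpi_close(ncid);
}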
Full parallel strategy
• A simple micro-benchmark of 3D data distributed on a 2D process grid showed reasonable results
• The strategy was implemented in the RAPS code and tested with the IPCC benchmark at ~900 cores
  • No smoothing operations in this benchmark or in the code
• The results were poor
  • Much of the I/O in this benchmark is 2D fields, and not much data is written at each timestep
  • The current I/O performance is not bad
  • The parallel strategy became dominated by metadata operations
  • File writes for 3D fields were reasonably fast (~0.025 s for 50 MBytes), but opening the file took a long time (0.4 to 0.5 seconds) – see the estimate below
• The strategy may still be useful for high-resolution simulations writing large 3D blocks of data
  • It was originally expected to target 2000 x 1000 x 60+ grids
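A back-of-the-envelope estimate from the figures quoted above shows why metadata dominated:

\[
\text{write rate} \approx \frac{50\ \text{MB}}{0.025\ \text{s}} = 2\ \text{GB/s},
\qquad
\frac{t_{\mathrm{open}}}{t_{\mathrm{open}} + t_{\mathrm{write}}}
\approx \frac{0.45}{0.45 + 0.025} \approx 95\%.
\]

At that rate a field would have to be on the order of a gigabyte before the write time even matched the open time, which is why the strategy only looks attractive for large 3D blocks of data.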
Slowdown from metadata
• The first strategy has problems related to metadata scalability
• Most modern high-performance file systems use POSIX I/O for open/close/seek, etc.
• This is not scalable: file-access performance becomes bound by the time taken for metadata operations
Non-scalable metadata – file open speeds
[Graph: time in seconds to open a file against the number of MPI processes involved in the file open, measured on two CSCS file systems.]
• Opening a file is not a scalable operation on modern parallel file systems – see the measurements above and the timing sketch below
• There are some mitigation strategies in MPI's ROMIO/ADIO layer
  • "Delayed open" only makes the POSIX open call when it is actually needed
  • For MPI-IO collective operations, only a subset of processes actually write the data
• No mitigation strategies for specific file systems (Lustre and GPFS)
• With current file systems using POSIX I/O calls, fully parallel I/O is not scalable
• We need to pursue the other strategies for COSMO, unless large blocks of data are being written
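The kind of measurement behind such a graph can be reproduced with a very small MPI-IO timing programme (a sketch only; the file name is a placeholder and no data is actually written). Running it at increasing process counts gives the open time as a function of communicator size.

/* Sketch: time a collective file open as a function of communicator size. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File fh;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_File_open(MPI_COMM_WORLD, "open_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    double t_open = MPI_Wtime() - t0;
    MPI_File_close(&fh);

    /* report the slowest rank, which is what the application actually waits for */
    double t_max;
    MPI_Reduce(&t_open, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d processes: open took %.3f s\n", nprocs, t_max);

    MPI_Finalize();
    return 0;
}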
Next steps
• We are looking at all three strategies for improving NetCDF I/O
• We are investigating the current state of metadata access in the MPI-IO layer and in file systems in general
  • Particularly Lustre and GPFS, but also others (e.g. OrangeFS)
• … but for some jobs the individual I/O operations might not be large enough to allow much speedup