Parallel IO in the Community Earth System Model. Jim Edwards, John Dennis (NCAR), Ray Loy (ANL), Pat Worley (ORNL)
Some CESM 1.1 capabilities: • Ensemble configurations with multiple instances of each component • Highly scalable: demonstrated on 100K+ MPI tasks • Regionally refined grids • Data assimilation with DART
Prior to PIO • Each model component was independent, with its own IO interface • Mix of file formats • NetCDF • Binary (POSIX) • Binary (Fortran) • Gather/scatter method to interface with serial IO
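A minimal sketch of that gather-then-serial-write pattern, for contrast with PIO below; the routine and variable names are illustrative and not taken from any CESM component.

! Illustrative pre-PIO pattern: gather the distributed field to the root
! task, which then writes it serially with netCDF. The root must hold the
! entire global array, which is what limits memory scalability.
subroutine gather_and_write(ncid, varid, local_field, nlocal, comm)
  use mpi
  use netcdf
  implicit none
  integer, intent(in) :: ncid, varid, nlocal, comm
  real,    intent(in) :: local_field(nlocal)
  integer :: rank, ntasks, ierr, i
  integer, allocatable :: counts(:), displs(:)
  real,    allocatable :: global_field(:)

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, ntasks, ierr)

  allocate(counts(ntasks), displs(ntasks))
  call MPI_Gather(nlocal, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, 0, comm, ierr)

  if (rank == 0) then
     displs(1) = 0
     do i = 2, ntasks
        displs(i) = displs(i-1) + counts(i-1)
     end do
     allocate(global_field(sum(counts)))
  else
     allocate(global_field(1))   ! dummy; only the root receives data
  end if

  call MPI_Gatherv(local_field, nlocal, MPI_REAL, &
                   global_field, counts, displs, MPI_REAL, 0, comm, ierr)

  if (rank == 0) ierr = nf90_put_var(ncid, varid, global_field)

  deallocate(counts, displs, global_field)
end subroutine gather_and_write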
Steps toward PIO • Converge on a single file format • NetCDF selected • Self-describing • Lossless, with optional lossy compression (netCDF4 only) • Works with the current postprocessing tool chain
Extension to parallel • Reduce the single-task memory profile • Maintain a single, decomposition-independent file format • Performance (a secondary issue)
Parallel IO from all compute tasks is not the best strategy • Data rearrangement is complicated, leading to numerous small and inefficient IO operations • MPI-IO aggregation alone cannot overcome this problem
Parallel I/O library (PIO) • Goals: • Reduce per-MPI-task memory usage • Easy to use • Improve performance • Write/read a single file from a parallel application • Multiple backend libraries: MPI-IO, NetCDF3, NetCDF4, pNetCDF, NetCDF+VDC • Meta-IO library: potential interface to other general libraries
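A hedged sketch of bringing PIO up and selecting a backend at file-creation time, loosely following the PIO1 Fortran API; argument names, orders, and constants should be checked against the PIO documentation, and the task counts are illustrative.

! Sketch: initialize PIO with a subset of tasks doing I/O, then pick the
! backend (pnetcdf here) per file via the iotype argument.
program pio_backend_example
  use pio
  use mpi
  implicit none
  type(iosystem_desc_t) :: iosystem
  type(file_desc_t)     :: file
  integer :: rank, ntasks, ierr

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

  ! 4 I/O tasks spread across the communicator, box rearranger.
  call PIO_init(rank, MPI_COMM_WORLD, 4, 0, max(1, ntasks/4), &
                PIO_rearr_box, iosystem)

  ! The iotype argument selects the backend library for this file.
  ierr = PIO_createfile(iosystem, file, PIO_iotype_pnetcdf, 'history.nc')

  call PIO_closefile(file)
  call PIO_finalize(iosystem, ierr)
  call MPI_Finalize(ierr)
end program pio_backend_example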
[Architecture diagram] CESM components (CAM atmosphere, CISM land ice, CLM land, CPL7 coupler, CICE sea ice, POP2 ocean) all perform I/O through PIO, which writes through the backend libraries: netCDF3, netCDF4/HDF5, pNetCDF, MPI-IO, and VDC.
PIO design principles • Separation of Concerns • Separate computational and I/O decomposition • Flexible user-level rearrangement • Encapsulate expert knowledge
Separation of concerns • What versus How • Concern of the user: • What to write/read to/from disk? • e.g., “I want to write T, V, PS.” • Concern of the library developer: • How to efficiently access the disk? • e.g., “How do I construct I/O operations so that write bandwidth is maximized?” • Improves ease of use • Improves robustness • Enables better reuse
Separate computational and I/O decompositions • [Diagram] Data is rearranged between the computational decomposition used by the model and the I/O decomposition used to access the disk
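A hedged end-to-end sketch in the spirit of the PIO1 Fortran API: compdof maps each locally owned element to its global offset (the computational decomposition), and PIO derives the I/O decomposition and the rearrangement between them. Exact argument lists may differ from the installed PIO version; the file and variable names are made up.

! Sketch: describe the computational decomposition with a global-offset map
! (compdof), then write the distributed array with one collective call.
! Assumes 'iosystem' was produced by PIO_init as in the earlier sketch.
subroutine write_field(iosystem, local_data, compdof, nx_global, ny_global)
  use pio
  implicit none
  type(iosystem_desc_t), intent(inout) :: iosystem
  real,    intent(in) :: local_data(:)
  integer, intent(in) :: compdof(:)          ! global 1-based offsets I own
  integer, intent(in) :: nx_global, ny_global

  type(file_desc_t) :: file
  type(io_desc_t)   :: iodesc
  type(var_desc_t)  :: varid
  integer           :: dimids(2), ierr

  ! Computational decomposition -> I/O decomposition mapping.
  call PIO_initdecomp(iosystem, PIO_real, (/ nx_global, ny_global /), &
                      compdof, iodesc)

  ierr = PIO_createfile(iosystem, file, PIO_iotype_netcdf, 'field.nc')
  ierr = PIO_def_dim(file, 'x', nx_global, dimids(1))
  ierr = PIO_def_dim(file, 'y', ny_global, dimids(2))
  ierr = PIO_def_var(file, 'T', PIO_real, dimids, varid)
  ierr = PIO_enddef(file)

  ! The caller states *what* to write; PIO handles rearrangement and disk access.
  call PIO_write_darray(file, varid, iodesc, local_data, ierr)

  call PIO_freedecomp(iosystem, iodesc)
  call PIO_closefile(file)
end subroutine write_field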
Flexible user-level rearrangement • A single technical solution is not suitable for the entire user community: • User A: Linux cluster, 32-core job, 200 MB files, NFS file system • User B: Cray XE6, 115,000-core job, 100 GB files, Lustre file system • Different compute environments require different technical solutions!
Writing distributed data (I) • [Diagram: computational decomposition rearranged to the I/O decomposition] • + Maximizes the size of individual I/O operations to disk • - Non-scalable user-space buffering • - Very large fan-in -> large MPI buffer allocations • Correct solution for User A
Writing distributed data (II) • [Diagram: computational decomposition rearranged to the I/O decomposition] • + Scalable user-space memory • + Relatively large individual I/O operations to disk • - Very large fan-in -> large MPI buffer allocations
Writing distributed data (III) • [Diagram: computational decomposition rearranged to the I/O decomposition] • + Scalable user-space memory • + Smaller fan-in -> modest MPI buffer allocations • - Smaller individual I/O operations to disk • Correct solution for User B
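A hedged illustration of how the two extremes above might be expressed through the I/O-task count and stride passed to PIO_init (surrounding program as in the earlier sketches); the numbers are illustrative only.

! User A: small cluster, NFS -- funnel through a single I/O task
! (strategy I: largest possible I/O operations, non-scalable buffering).
call PIO_init(rank, MPI_COMM_WORLD, 1, 0, 1, PIO_rearr_box, iosystem)

! User B: 115,000-task Cray XE6 job, Lustre -- many I/O tasks spaced by a
! stride, so fan-in and per-task buffering stay modest (strategy III).
call PIO_init(rank, MPI_COMM_WORLD, 1024, 0, 112, PIO_rearr_box, iosystem)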
Encapsulate expert knowledge • Flow-control algorithm • Match the size of I/O operations to the stripe size • Cray XT5/XE6 + Lustre file system • Minimize message-passing traffic at the MPI-IO layer • Load balance disk traffic over all I/O nodes • IBM Blue Gene/{L,P} + GPFS file system • Utilizes Blue Gene-specific topology information
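The flow-control idea can be sketched generically: the I/O task invites senders one at a time, so the amount of in-flight data it must buffer is bounded. This is an illustration of the technique only, not PIO's internal code.

! Generic flow-control sketch (illustrative, not PIO internals).
subroutine flow_controlled_gather(io_task, nbuf, comm)
  use mpi
  implicit none
  integer, intent(in) :: io_task, nbuf, comm
  integer, parameter  :: TAG_GO = 1, TAG_DATA = 2
  integer :: rank, ntasks, src, token, ierr
  real    :: buffer(nbuf), my_chunk(nbuf)

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_Comm_size(comm, ntasks, ierr)
  token    = 1
  my_chunk = real(rank)            ! stand-in for the local data to write

  if (rank == io_task) then
     do src = 0, ntasks - 1
        if (src == io_task) cycle
        ! Invite exactly one sender, then receive its bounded-size message.
        call MPI_Send(token, 1, MPI_INTEGER, src, TAG_GO, comm, ierr)
        call MPI_Recv(buffer, nbuf, MPI_REAL, src, TAG_DATA, comm, &
                      MPI_STATUS_IGNORE, ierr)
        ! ... the real library would pack 'buffer' into a stripe-sized
        ! output buffer and flush it to disk when full ...
     end do
  else
     ! Compute task: wait for the invitation before sending.
     call MPI_Recv(token, 1, MPI_INTEGER, io_task, TAG_GO, comm, &
                   MPI_STATUS_IGNORE, ierr)
     call MPI_Send(my_chunk, nbuf, MPI_REAL, io_task, TAG_DATA, comm, ierr)
  end if
end subroutine flow_controlled_gather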
Experimental setup • Did we achieve our design goals? • Impact of PIO features • Flow control • Varying the number of I/O tasks • Different general I/O backends • Read/write a 3D POP-sized variable [3600x2400x40] • 10 files, 10 variables per file [max bandwidth] • Using Kraken (Cray XT5) + Lustre file system • Used 16 of 336 OSTs
PIO-VDC: parallel output to a VAPOR Data Collection (VDC) • VDC: • A wavelet-based, gridded data format supporting both progressive access and efficient data subsetting • Data may be progressively accessed (read back) at different levels of detail, permitting the application to trade off speed and accuracy • Think Google Earth: less detail when the viewer is far away, progressively more detail as the viewer zooms in • Enables rapid (interactive) exploration and hypothesis testing that can subsequently be validated with full-fidelity data as needed • Subsetting • Arrays are decomposed into smaller blocks that significantly improve extraction of arbitrarily oriented subarrays • Wavelet transform • Similar to Fourier transforms • Computationally efficient: O(n) • Basis for many multimedia compression technologies (e.g. MPEG-4, JPEG 2000)
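To make the "Fourier-like, O(n)" point concrete, here is one level of the textbook 1-D Haar wavelet transform: a single pass yields coarse averages (what a progressive reader fetches first) and detail coefficients (fetched later, or truncated for compression). This is a generic illustration, not the VDC code, which uses higher-order wavelets in 3-D.

! One level of the 1-D Haar wavelet transform: O(n) work. The 'approx'
! half already reconstructs a coarse version of the signal, which is the
! basis of progressive access; discarding small 'detail' coefficients is
! what yields the lossy compression ratios shown on the following slides.
subroutine haar_level(x, n, approx, detail)
  implicit none
  integer, intent(in)  :: n                 ! assumed even
  real,    intent(in)  :: x(n)
  real,    intent(out) :: approx(n/2), detail(n/2)
  real,    parameter   :: s = 0.7071068     ! 1/sqrt(2)
  integer :: i

  do i = 1, n/2
     approx(i) = s * (x(2*i-1) + x(2*i))    ! coarse approximation
     detail(i) = s * (x(2*i-1) - x(2*i))    ! detail coefficient
  end do
end subroutine haar_level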
Other PIO Users • Earth System Modeling Framework (ESMF) • Model for Prediction Across Scales (MPAS) • Geophysical High Order Suite for Turbulence (GHOST) • Data Assimilation Research Testbed (DART)
Write performance on BG/L (Penn State University) [chart]
Read performance on BG/L (Penn State University) [chart]
100:1 compression with coefficient prioritization: 1024³ Taylor-Green turbulence (enstrophy field) [P. Mininni, 2006] • [Side-by-side renderings: no compression vs. coefficient prioritization (VDC2)]
4096³ homogeneous turbulence simulation • Volume rendering of the original enstrophy field and the 800:1 compressed field • Original: 275 GB/field • 800:1 compressed: 0.34 GB/field • Data provided by P.K. Yeung (Georgia Tech) and Diego Donzis (Texas A&M)
F90 code generation (genf90.pl)
interface PIO_write_darray
! TYPE real,int
! DIMS 1,2,3
   module procedure write_darray_{DIMS}d_{TYPE}
end interface
# 1 "tmp.F90.in" interface PIO_write_darray module procedure dosomething_1d_real module procedure dosomething_2d_real module procedure dosomething_3d_real module procedure dosomething_1d_int module procedure dosomething_2d_int module procedure dosomething_3d_int end interface
PIO is open source • http://code.google.com/p/parallelio/ • Documentation using doxygen • http://web.ncar.teragrid.org/~dennis/pio_doc/html/
Existing I/O libraries • netCDF3 • Serial • Easy to implement • Limited flexibility • HDF5 • Serial and Parallel • Very flexible • Difficult to implement • Difficult to achieve good performance • netCDF4 • Serial and Parallel • Based on HDF5 • Easy to implement • Limited flexibility • Difficult to achieve good performance
Existing I/O libraries (cont'd) • Parallel-netCDF • Parallel • Easy to implement • Limited flexibility • Difficult to achieve good performance • MPI-IO • Parallel • Very difficult to implement • Very flexible • Difficult to achieve good performance • ADIOS • Serial and parallel • Easy to implement • BP file format • Easy to achieve good performance • All other file formats • Difficult to achieve good performance