Parallel IO in the Community Earth System Model



  1. Parallel IO in the Community Earth System Model Jim Edwards, John Dennis (NCAR), Ray Loy (ANL), Pat Worley (ORNL)

  2. Some CESM 1.1 Capabilities: • Ensemble configurations with multiple instances of each component • Highly scalable capability proven to 100K+ tasks • Regionally refined grids • Data assimilation with DART

  3. Prior to PIO • Each model component was independent, with its own I/O interface • Mix of file formats • NetCDF • Binary (POSIX) • Binary (Fortran) • Gather/scatter method to interface with serial I/O

  4. Steps toward PIO • Converge on a single file format • NetCDF selected • Self-describing • Lossless, with lossy capability (NetCDF4 only) • Works with the current postprocessing tool chain

  5. Extension to parallel • Reduce single-task memory profile • Maintain a single, decomposition-independent file format • Performance (a secondary issue)

  6. Parallel I/O from all compute tasks is not the best strategy • Data rearrangement is complicated, leading to numerous small and inefficient I/O operations • MPI-IO aggregation alone cannot overcome this problem

  7. Parallel I/O library (PIO) • Goals: • Reduce per-MPI-task memory usage • Easy to use • Improve performance • Write/read a single file from a parallel application • Multiple backend libraries: MPI-IO, NetCDF3, NetCDF4, pNetCDF, NetCDF+VDC • Meta-I/O library: potential interface to other general libraries
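As a rough illustration of the intended usage, the sketch below shows a component writing one distributed double-precision field through a PIO1-style Fortran API with the pNetCDF backend. The argument lists are abbreviated from memory and may differ between PIO versions, and the file name and trivial block decomposition are only placeholders, so treat this as a sketch rather than a reference.

  program pio_write_sketch
    use mpi
    use pio
    implicit none
    integer, parameter :: nx = 3600, ny = 2400     ! global POP-sized horizontal grid
    type(iosystem_desc_t) :: ios
    type(file_desc_t)     :: file
    type(io_desc_t)       :: iodesc
    type(var_desc_t)      :: vardesc
    integer :: ierr, rank, nprocs, nlocal, i, dimid_x, dimid_y
    integer, allocatable :: compdof(:)
    real(kind=8), allocatable :: local_field(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    ! Use every 4th MPI task as an I/O task: (num_iotasks, num_aggregators, stride).
    call PIO_init(rank, MPI_COMM_WORLD, max(1, nprocs/4), 1, 4, PIO_rearr_box, ios)

    ! Describe which global elements this task owns (a trivial 1-D block here,
    ! assuming nx*ny is divisible by nprocs).
    nlocal = (nx*ny)/nprocs
    allocate(compdof(nlocal), local_field(nlocal))
    do i = 1, nlocal
       compdof(i) = rank*nlocal + i        ! 1-based global offset of local element i
    end do
    local_field = real(rank, kind=8)
    call PIO_initdecomp(ios, PIO_double, (/nx, ny/), compdof, iodesc)

    ierr = PIO_createfile(ios, file, PIO_iotype_pnetcdf, 'field.nc')
    ierr = PIO_def_dim(file, 'x', nx, dimid_x)
    ierr = PIO_def_dim(file, 'y', ny, dimid_y)
    ierr = PIO_def_var(file, 'T', PIO_double, (/dimid_x, dimid_y/), vardesc)
    ierr = PIO_enddef(file)

    ! The caller states *what* to write; PIO handles rearrangement and disk access.
    call PIO_write_darray(file, vardesc, iodesc, local_field, ierr)

    call PIO_closefile(file)
    call PIO_finalize(ios, ierr)
    call MPI_Finalize(ierr)
  end program pio_write_sketch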

  8. [Architecture diagram] The CESM components (CAM atmospheric model, CISM land ice model, CLM land model, CPL7 coupler, CICE sea ice model, POP2 ocean model) all perform I/O through PIO, which in turn sits on the backend libraries: netcdf3, netcdf4, pnetcdf, HDF5, MPI-IO, and VDC.

  9. PIO design principles • Separation of Concerns • Separate computational and I/O decomposition • Flexible user-level rearrangement • Encapsulate expert knowledge

  10. Separation of concerns • What versus How • Concern of the user: • What to write/read to/from disk? • e.g.: “I want to write T, V, PS.” • Concern of the library developer: • How to efficiently access the disk? • e.g.: “How do I construct I/O operations so that write bandwidth is maximized?” • Improves ease of use • Improves robustness • Enables better reuse

  11. Separate computational and I/O decompositions [Diagram: data is rearranged from the computational decomposition to a separate I/O decomposition before being written.]
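To make the mapping concrete: each task hands PIO a flat list of the global element offsets it owns (its degrees of freedom), and the library rearranges between that computational layout and whatever I/O layout it chooses. The helper below is purely illustrative (it is not part of PIO, and the name and block layout are hypothetical); it builds such a list for one task in a px-by-py block decomposition of a global nx-by-ny field.

  ! Illustrative helper (not part of PIO): list the global offsets owned by one task.
  subroutine build_compdof(rank, nx, ny, px, py, compdof)
    implicit none
    integer, intent(in)  :: rank, nx, ny, px, py
    integer, allocatable, intent(out) :: compdof(:)
    integer :: bx, by, i0, j0, i, j, n
    bx = nx/px                       ! block width  (assumes nx divisible by px)
    by = ny/py                       ! block height (assumes ny divisible by py)
    i0 = mod(rank, px)*bx            ! this task's block origin in x
    j0 = (rank/px)*by                ! this task's block origin in y
    allocate(compdof(bx*by))
    n = 0
    do j = 1, by
       do i = 1, bx
          n = n + 1
          ! 1-based global offset of local element (i,j) in the flattened nx*ny field
          compdof(n) = (j0 + j - 1)*nx + (i0 + i)
       end do
    end do
  end subroutine build_compdof

A list like this is what the decomposition descriptor consumes; the I/O decomposition itself is chosen independently of this computational layout.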

  12. Flexible user-level rearrangement • A single technical solution is not suitable for the entire user community: • User A: Linux cluster, 32-core job, 200 MB files, NFS file system • User B: Cray XE6, 115,000-core job, 100 GB files, Lustre file system. Different compute environments require different technical solutions!

  13. Writing distributed data (I) [Diagram: rearrangement from the computational decomposition to the I/O decomposition] • + Maximizes the size of individual I/O operations to disk • - Non-scalable user-space buffering • - Very large fan-in → large MPI buffer allocations • Correct solution for User A

  14. Writing distributed data (II) [Diagram: rearrangement from the computational decomposition to the I/O decomposition] • + Scalable user-space memory • + Relatively large individual I/O operations to disk • - Very large fan-in → large MPI buffer allocations

  15. Writing distributed data (III) [Diagram: rearrangement from the computational decomposition to the I/O decomposition] • + Scalable user-space memory • + Smaller fan-in → modest MPI buffer allocations • - Smaller individual I/O operations to disk • Correct solution for User B
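In PIO this choice is made when the I/O system is initialized: the number of I/O tasks and their stride across the compute tasks determine which of the patterns above you get. The fragment below, reusing the variables from the earlier sketch, shows how the two users from slide 12 might initialize PIO differently; the constants are illustrative and the argument list may differ between PIO versions.

  ! User A: 32-task Linux cluster job writing ~200 MB files to NFS --
  ! funnel everything through a single I/O task (pattern I).
  call PIO_init(rank, MPI_COMM_WORLD, 1, 1, nprocs, PIO_rearr_box, ios)

  ! User B: ~115,000-task Cray XE6 job writing ~100 GB files to Lustre --
  ! many I/O tasks spread with a large stride, so each sees a modest fan-in
  ! and issues large, well-aligned writes (pattern III).
  call PIO_init(rank, MPI_COMM_WORLD, nprocs/256, 1, 256, PIO_rearr_box, ios)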

  16. Encapsulate expert knowledge • Flow-control algorithm • Match size of I/O operations to stripe size • Cray XT5/XE6 + Lustre file system • Minimize message-passing traffic at the MPI-IO layer • Load-balance disk traffic over all I/O nodes • IBM Blue Gene/{L,P} + GPFS file system • Utilizes Blue Gene-specific topology information
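The flow-control idea can be sketched on its own: an I/O task invites only a bounded number of compute tasks to send at a time, so its MPI receive buffers stay modest even at very large fan-in. The toy program below is plain MPI, not PIO's actual code, and all names and sizes are hypothetical; it only shows the handshake pattern.

  program flow_control_sketch
    use mpi
    implicit none
    integer, parameter :: MAXINFLIGHT = 4   ! invitations outstanding at any time
    integer, parameter :: NWORDS = 1024     ! payload per compute task
    integer, parameter :: TAG_GO = 1, TAG_DATA = 2
    integer :: rank, nprocs, ierr, k, dummy
    integer, allocatable :: req(:)
    real(kind=8), allocatable :: buf(:,:), myblock(:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    dummy = 0

    if (rank == 0) then                     ! rank 0 plays the I/O task
       allocate(req(nprocs-1), buf(NWORDS, nprocs-1))
       do k = 1, nprocs-1
          ! Before inviting sender k, ensure at most MAXINFLIGHT receives are pending.
          if (k > MAXINFLIGHT) call MPI_Wait(req(k-MAXINFLIGHT), MPI_STATUS_IGNORE, ierr)
          call MPI_Irecv(buf(:,k), NWORDS, MPI_DOUBLE_PRECISION, k, TAG_DATA, &
                         MPI_COMM_WORLD, req(k), ierr)
          call MPI_Send(dummy, 0, MPI_INTEGER, k, TAG_GO, MPI_COMM_WORLD, ierr)
       end do
       call MPI_Waitall(nprocs-1, req, MPI_STATUSES_IGNORE, ierr)
       ! ...at this point the I/O task would issue a few large, aligned writes...
    else                                    ! every other rank is a compute task
       allocate(myblock(NWORDS))
       myblock = real(rank, kind=8)
       ! Wait for the invitation before sending, which bounds the fan-in at rank 0.
       call MPI_Recv(dummy, 0, MPI_INTEGER, 0, TAG_GO, MPI_COMM_WORLD, &
                     MPI_STATUS_IGNORE, ierr)
       call MPI_Send(myblock, NWORDS, MPI_DOUBLE_PRECISION, 0, TAG_DATA, &
                     MPI_COMM_WORLD, ierr)
    end if
    call MPI_Finalize(ierr)
  end program flow_control_sketch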

  17. Experimental setup • Did we achieve our design goals? • Impact of PIO features • Flow control • Varying the number of I/O tasks • Different general I/O backends • Read/write a 3D POP-sized variable [3600x2400x40] • 10 files, 10 variables per file [max bandwidth] • Using Kraken (Cray XT5) + Lustre file system • Used 16 of 336 OSTs

  18–22. [Performance plots: 3D POP arrays [3600x2400x40]]

  23. PIOVDC: Parallel output to a VAPOR Data Collection (VDC) • VDC: • A wavelet-based, gridded data format supporting both progressive access and efficient data subsetting • Data may be progressively accessed (read back) at different levels of detail, permitting the application to trade off speed and accuracy • Think Google Earth: less detail when the viewer is far away, progressively more detail as the viewer zooms in • Enables rapid (interactive) exploration and hypothesis testing that can subsequently be validated with full-fidelity data as needed • Subsetting • Arrays are decomposed into smaller blocks that significantly improve extraction of arbitrarily oriented subarrays • Wavelet transform • Similar to Fourier transforms • Computationally efficient: O(n) • Basis for many multimedia compression technologies (e.g. MPEG-4, JPEG 2000)
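For intuition on why the transform is cheap, the sketch below applies one level of the Haar wavelet, the simplest member of the family: a single O(n) pass splits a signal into a half-length coarse approximation plus detail coefficients, and it is the small detail coefficients that a compressor discards or coarsely quantizes. VDC uses more sophisticated wavelets and coefficient prioritization; this is only an illustration.

  ! One level of the Haar wavelet transform: an O(n) average/detail split.
  subroutine haar_level(n, x, avg, detail)
    implicit none
    integer, intent(in)       :: n            ! length of x, assumed even
    real(kind=8), intent(in)  :: x(n)
    real(kind=8), intent(out) :: avg(n/2), detail(n/2)
    integer :: i
    do i = 1, n/2
       avg(i)    = (x(2*i-1) + x(2*i)) / sqrt(2.0d0)   ! coarse approximation
       detail(i) = (x(2*i-1) - x(2*i)) / sqrt(2.0d0)   ! detail coefficient
    end do
    ! Reading only "avg" gives a half-resolution view (progressive access);
    ! repeating the split on "avg" yields coarser and coarser levels of detail.
  end subroutine haar_level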

  24. Other PIO Users • Earth System Modeling Framework (ESMF) • Model for Prediction Across Scales (MPAS) • Geophysical High Order Suite for Turbulence (GHOST) • Data Assimilation Research Testbed (DART)

  25. Write performance on BG/L (Penn State University) [performance plot]

  26. Read performance on BG/L (Penn State University) [performance plot]

  27. 100:1 compression with coefficient prioritization: 1024³ Taylor-Green turbulence (enstrophy field) [P. Mininni, 2006]. [Side-by-side renderings: no compression vs. coefficient prioritization (VDC2)]

  28. 4096³ homogeneous turbulence simulation. Volume rendering of the original enstrophy field and the 800:1 compressed field. 800:1 compressed: 0.34 GB/field; original: 275 GB/field. Data provided by P.K. Yeung at Georgia Tech and Diego Donzis at Texas A&M.

  29. F90 code generation with genf90.pl
      interface PIO_write_darray
      ! TYPE real,int
      ! DIMS 1,2,3
         module procedure write_darray_{DIMS}d_{TYPE}
      end interface

  30. Expanded by genf90.pl:
      # 1 "tmp.F90.in"
      interface PIO_write_darray
         module procedure dosomething_1d_real
         module procedure dosomething_2d_real
         module procedure dosomething_3d_real
         module procedure dosomething_1d_int
         module procedure dosomething_2d_int
         module procedure dosomething_3d_int
      end interface

  31. PIO is open source • http://code.google.com/p/parallelio/ • Documentation generated with doxygen • http://web.ncar.teragrid.org/~dennis/pio_doc/html/

  32. Thank you

  33. Existing I/O libraries • netCDF3 • Serial • Easy to implement • Limited flexibility • HDF5 • Serial and Parallel • Very flexible • Difficult to implement • Difficult to achieve good performance • netCDF4 • Serial and Parallel • Based on HDF5 • Easy to implement • Limited flexibility • Difficult to achieve good performance

  34. Existing I/O libraries (cont.) • Parallel-netCDF • Parallel • Easy to implement • Limited flexibility • Difficult to achieve good performance • MPI-IO • Parallel • Very difficult to implement • Very flexible • Difficult to achieve good performance • ADIOS • Serial and parallel • Easy to implement • BP file format • Easy to achieve good performance • All other file formats • Difficult to achieve good performance
