Status of CDI-pio development
Presentation Transcript


  1. Status of CDI-pio development Thomas Jahns <jahns@dkrz.de> IS-ENES workshop on Scalable I/O in climate models, Hamburg 2013-10-28

  2. Outline • History • Recent work • Issues (big time) • Outlook • Summary

  3. What is CDI-pio? • CDI (Climate Data Interface) is the I/O backend of CDO (Climate Data Operators), both by Uwe Schulzweida • CDI abstracts away the differences between several file formats relevant in climate science (GRIB1, GRIB2, netCDF, SRV, EXT, IEG) • CDI-pio is the MPI parallelization and I/O client/server infrastructure initially built by Deike Kleberg (supported GRIB1/GRIB2 and regular decompositions)
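To make the abstraction concrete, here is a minimal serial CDI sketch in C: the same grid/vlist/stream calls write a field regardless of the target format, which is selected only in streamOpenWrite. The function and constant names (gridCreate, vlistDefVar, streamWriteVar, FILETYPE_GRB, ...) follow the public CDI C API as I understand it and are not taken from these slides; treat details such as the exact format constants as assumptions.

#include "cdi.h"

int main(void)
{
  enum { nlon = 12, nlat = 6 };
  double field[nlat * nlon];
  for (int i = 0; i < nlon * nlat; ++i)
    field[i] = (double)i;

  /* describe the horizontal grid and a surface level */
  int gridID = gridCreate(GRID_LONLAT, nlon * nlat);
  gridDefXsize(gridID, nlon);
  gridDefYsize(gridID, nlat);
  int zaxisID = zaxisCreate(ZAXIS_SURFACE, 1);

  /* one time-dependent variable with a time axis */
  int vlistID = vlistCreate();
  int varID = vlistDefVar(vlistID, gridID, zaxisID, TIME_VARIABLE);
  int taxisID = taxisCreate(TAXIS_ABSOLUTE);
  vlistDefTaxis(vlistID, taxisID);

  /* the file format is chosen here only; FILETYPE_NC etc. would work the same way */
  int streamID = streamOpenWrite("example.grb", FILETYPE_GRB);
  streamDefVlist(streamID, vlistID);

  /* write one time step of the variable */
  streamDefTimestep(streamID, 0);
  streamWriteVar(streamID, varID, field, /* nmiss = */ 0);

  streamClose(streamID);
  vlistDestroy(vlistID);
  taxisDestroy(taxisID);
  zaxisDestroy(zaxisID);
  gridDestroy(gridID);
  return 0;
}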

  4. What is CDI-pio? (part 2) • Extensions of CDI-pio to • address netCDF 4.x collective output • improve robustness • make the decomposition flexible were performed by Thomas Jahns for IS-ENES during 2012/early 2013 • Irina Fast ported MPI-ESM1 to the new API • She provided the performance results presented later

  5. The big picture

  6. Focal points of work 2012/2013 • Add netCDF 4.x collective output, with a fallback to a one-file-per-process mapping if the netCDF library doesn't support collective I/O. • Fix edge cases in reliably writing records/files of any size. • Allow flexible decomposition specification (more on that soon) and make use of it in ECHAM. • Track down a deadlock in MPI_File_iwrite_shared in combination with MPI RMA on IBM PE.

  7. On flexible decompositions • The previous version assumed a regular 1D decomposition • Collective activity of the model processes was needed to match it • Costly • Need a description of the data layout on the client side • The YAXT library provides just such a descriptor

  8. What's YAXT? • A library on top of MPI • Inspired by the Fortran prototype Unitrans by Mathias Pütz in the ScalES project, hence Yet Another Exchange Tool • Built and maintained by Moritz Hanke, Jörg Behrens, Thomas Jahns • Implemented in C ⇒ type-agnostic code (Fortran is supposed to get this once Fortran 2015 is implemented) • Fully-featured Fortran interface (requires C interoperability) • Supported by DKRZ

  9. Central motivation for YAXT Recurring problem: data must be rearranged across process boundaries (halo exchange, transposition, gather/scatter, load balancing) Consequence: replace error-prone hand-coded MPI calls with a simpler interface.

  10. What does YAXT provide? • Easily understandable descriptors of data as they relate to a global context. • Programmatic discovery and persistence of all necessary communication partners • Sender side and • Receiver side • Persistence for the above and the corresponding memory access pattern • Thus moving what was previously code into data • As flexible as MPI datatypes • Hiding of the message-passing mechanism • Can be replaced with e.g. 1-sided communication • Can easily aggregate multiple communications

  11. What does YAXT provide? (in other words) • Several classes of index lists • Computation of an exchange map (Xt_xmap), i.e. what to send/receive to/from which other processes, given • a source index list and • a target index list • Computation of a data access pattern (Xt_redist), given • an exchange map and • the MPI datatype of an element or other data access pattern(s) • Performing the data transfer via MPI, given • a data access pattern and • the arrays holding sources/targets

  12. How is YAXT used? • For every data element: • assign an integer “name” and • describe it via an MPI datatype • Declare element “have”- and “want”-lists • Provide the array addresses holding the “have” elements and receiving the “want” elements • YAXT computes all the MPI actions needed to make the redistribution happen
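A minimal C sketch of this “have”/“want” workflow, using the libyaxt calls (xt_idxvec_new, xt_xmap_all2all_new, xt_redist_p2p_new, xt_redist_s_exchange1) as I recall the public interface; treat the exact signatures as assumptions rather than an excerpt from the presentation. Each rank owns a contiguous block of a 1D field and wants the next rank's block, i.e. a global rotation.

#include <mpi.h>
#include <yaxt.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  xt_initialize(MPI_COMM_WORLD);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  enum { nlocal = 4 };
  Xt_int have_idx[nlocal], want_idx[nlocal];
  double have_data[nlocal], want_data[nlocal];

  for (int i = 0; i < nlocal; ++i) {
    /* integer "names" of the elements this rank holds ... */
    have_idx[i] = (Xt_int)(rank * nlocal + i);
    /* ... and of the elements it wants (the next rank's block) */
    want_idx[i] = (Xt_int)(((rank + 1) % size) * nlocal + i);
    have_data[i] = (double)have_idx[i];
  }

  /* index lists describe the "have" and "want" sides */
  Xt_idxlist src_list = xt_idxvec_new(have_idx, nlocal);
  Xt_idxlist dst_list = xt_idxvec_new(want_idx, nlocal);

  /* exchange map: who sends what to whom */
  Xt_xmap xmap = xt_xmap_all2all_new(src_list, dst_list, MPI_COMM_WORLD);

  /* redistribution object: exchange map + element datatype */
  Xt_redist redist = xt_redist_p2p_new(xmap, MPI_DOUBLE);

  /* perform the actual MPI communication */
  xt_redist_s_exchange1(redist, have_data, want_data);

  xt_redist_delete(redist);
  xt_xmap_delete(xmap);
  xt_idxlist_delete(dst_list);
  xt_idxlist_delete(src_list);
  xt_finalize();
  MPI_Finalize();
  return 0;
}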

  13. Decomposition description via YAXT • Where the user typically defines both sides of a YAXT redistribution, CDI defines the variable targets to use indices 0..n-1 consecutively. • The YAXT descriptor is packed together with the data; this allows descriptors to change upon re-balancing.
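The pattern described above can be sketched with plain YAXT calls: model ranks describe whatever irregular set of global indices they hold, while the target side (standing in for a CDI-pio collector) simply wants indices 0..n-1 in consecutive order and lets YAXT derive the gather. This only illustrates the idea and is not CDI-pio code; the YAXT names are the same assumed API as in the previous sketch, and the round-robin ownership is invented for the example.

#include <stdlib.h>
#include <mpi.h>
#include <yaxt.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  xt_initialize(MPI_COMM_WORLD);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  enum { nlocal = 3 };
  int nglobal = nlocal * size;

  /* client side: an irregular ownership, here round-robin over ranks */
  Xt_int have_idx[nlocal];
  double have_data[nlocal];
  for (int i = 0; i < nlocal; ++i) {
    have_idx[i] = (Xt_int)(i * size + rank);
    have_data[i] = (double)have_idx[i];
  }
  Xt_idxlist src_list = xt_idxvec_new(have_idx, nlocal);

  /* "collector" side (rank 0): wants the whole variable as 0..n-1 */
  Xt_int *want_idx = malloc((size_t)nglobal * sizeof *want_idx);
  double *want_data = malloc((size_t)nglobal * sizeof *want_data);
  int nwant = (rank == 0) ? nglobal : 0;
  for (int i = 0; i < nwant; ++i)
    want_idx[i] = (Xt_int)i;
  Xt_idxlist dst_list = xt_idxvec_new(want_idx, nwant);

  Xt_xmap xmap = xt_xmap_all2all_new(src_list, dst_list, MPI_COMM_WORLD);
  Xt_redist redist = xt_redist_p2p_new(xmap, MPI_DOUBLE);
  xt_redist_s_exchange1(redist, have_data, want_data);
  /* want_data on rank 0 now holds the variable in consecutive order */

  xt_redist_delete(redist);
  xt_xmap_delete(xmap);
  xt_idxlist_delete(dst_list);
  xt_idxlist_delete(src_list);
  free(want_data);
  free(want_idx);
  xt_finalize();
  MPI_Finalize();
  return 0;
}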

  14. Issues with actual writing • The GRIB data format assumes no ordering of records (but tools only support this within the same time step) • No MPI call can take advantage of that • The performance of MPI_File_(i)write_shared never surpasses serial writing (various methods tried)
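For reference, a minimal sketch of the shared-file-pointer approach mentioned above: each rank appends one GRIB-like record through MPI_File_iwrite_shared, so records land in the file in arrival order rather than rank order. Only standard MPI-IO calls are used; the 64-byte dummy buffer stands in for a real encoded GRIB message, and this is not the actual CDI-pio writer code.

#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "records.grb",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

  /* each rank contributes one "record"; a real GRIB record would be an
   * encoded message, here it is just a rank-tagged byte buffer */
  char record[64];
  memset(record, 'A' + (rank % 26), sizeof(record));

  /* non-blocking write via the shared file pointer: records from
   * different ranks end up back to back, in whatever order they arrive */
  MPI_Request req;
  MPI_File_iwrite_shared(fh, record, (int)sizeof(record), MPI_BYTE, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}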

  15. Results for RMA • RMA works well with RDMA characteristics on MVAPICH2, but needs MV2_DEFAULT_PUT_GET_LIST_SIZE=ntasks for more than 192 non-I/O-server tasks • RMA is functionally working with IBM PE, but performance is lower than rank-0 I/O with gather. • OpenMPI improves just as well as MVAPICH2, but its baseline performance is so much lower that even with CDI-pio, OpenMPI is slower than MVAPICH2 without it

  16. Portable RMA = hurt • Better to implement a scalable message-passing scheme from the outset and invest in RMA where beneficial

  17. Results with MVAPICH2 - scaling

  18. Results with MVAPICH2 - profile

  19. HD(CP)2 • For scientists: cloud- and precipitation-resolving simulations • For DKRZ: huge (I/O) challenges to the power of 2 • Works with ICON: icosahedral non-hydrostatic GCM http://www.mpimet.mpg.de/en/science/models/icon.html • Goals of ICON (MPI-Met and DWD): • Scalable dynamical core and better physics

  20. ICON grid The ICON grid can be scaled really well:

  21. After scaling the grid, even finer refinements can be nested:

  22. Better to stop the illustrations here, while there are still enough pixels on an HD screen

  23. ICON HD(CP)2 input • Grid size: 100,000,000 horizontal cells • Reading on a single machine no longer works: the memory of 64 GB nodes is exhausted early on • But there is no prior knowledge of which process needs which data, because of the irregular grid

  24. ICON HD(CP)2 output • Sub-second time resolution • 3D data also means: the already prohibitive 2D grid data volume is dwarfed by the volume data • The current metadata approach of CDI-pio is unsustainable.

  25. Future work • Make collectors act collectively • Tighter coupling of the I/O servers, but that is inevitable anyway. • More even distribution of data. • Will accommodate 2-sided as well as 1-sided communication. • Decompose metadata, particularly for HD(CP)2 • Further split the I/O servers into collectors and distinct writers to conserve per-process memory • Introduce OpenMP parallelization. • Input

  26. Summary • Working concept, but many unexpected problems in practice. • The invested work has not been recouped in production. • Very useful tool: YAXT • The problems keep growing, which invalidates previous assumptions.

  27. Thanks to… • Joachim Biercamp (DKRZ) • Panagiotis Adamidis (DKRZ) • Jörg Behrens (DKRZ) • Irina Fast (DKRZ) • Moritz Hanke (DKRZ) • Deike Kleberg (MPI-M) • Luis Kornblueh (MPI-M) • Uwe Schulzweida (MPI-M) • Günther Zängl (DWD)