Status of CDI-pio development
Thomas Jahns <jahns@dkrz.de>
IS-ENES workshop on Scalable I/O in climate models, Hamburg, 2013-10-28
Outline
• History
• Recent work
• Issues (big time)
• Outlook
• Summary
What is CDI-pio?
• CDI (Climate Data Interface) is the I/O backend of CDO (Climate Data Operators), both developed by Uwe Schulzweida
• CDI abstracts away the differences between several file formats relevant in climate science (GRIB1, GRIB2, netCDF, SRV, EXT, IEG)
• CDI-pio is the MPI parallelization and I/O client/server infrastructure initially built by Deike Kleberg (supporting GRIB1/GRIB2 and regular decompositions)
What is CDI-pio? (part 2)
• Extensions of CDI-pio to
  • address netCDF 4.x collective output,
  • improve robustness, and
  • make the decomposition flexible
  were implemented by Thomas Jahns for IS-ENES during 2012/early 2013
• Irina Fast ported MPI-ESM1 to the new API
  • She provided the performance results presented later
The big picture
Focal points of work 2012/2013
• Add netCDF 4.x collective output, with a fallback to a file-per-process mapping if the netCDF library does not support collective I/O (see the sketch below).
• Fix edge cases so that records/files of any size are written reliably.
• Allow a flexible decomposition specification (more on that soon) and make use of it in ECHAM.
• Track down a deadlock in MPI_File_iwrite_shared in combination with MPI RMA on IBM PE.
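The fallback logic can be sketched roughly as follows. This is a minimal illustration of the pattern, not CDI-pio source code; it assumes a netCDF library whose parallel interface (nc_create_par from netcdf_par.h) is available at build time.

```c
/* Sketch: try collective netCDF-4 output, fall back to one file per
 * process if the installed netCDF library cannot do parallel I/O.
 * Illustrative only -- not the actual CDI-pio implementation. */
#include <stdio.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>   /* nc_create_par; only present in parallel builds */

static int open_output(const char *basename, MPI_Comm comm, int *ncid)
{
  /* NC_MPIIO was still required by netCDF versions of that era */
  int status = nc_create_par(basename, NC_NETCDF4 | NC_MPIIO | NC_CLOBBER,
                             comm, MPI_INFO_NULL, ncid);
  if (status == NC_NOERR)
    return 0;                    /* one collective file shared by all writers */

  /* Fallback: every process writes its own file. */
  int rank;
  MPI_Comm_rank(comm, &rank);
  char fname[1024];
  snprintf(fname, sizeof fname, "%s.%06d", basename, rank);
  return nc_create(fname, NC_NETCDF4 | NC_CLOBBER, ncid) == NC_NOERR ? 0 : -1;
}
```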
On flexible decompositions
• The previous version assumed a regular 1D decomposition
• Matching it required collective activity of the model processes
  • Costly
• Instead: describe the data layout on the client side
• The YAXT library provides just such a descriptor
What's YAXT?
• A library on top of MPI
• Inspired by the Fortran prototype Unitrans by Mathias Pütz in the ScalES project, hence "Yet Another Exchange Tool"
• Built and maintained by Moritz Hanke, Jörg Behrens, Thomas Jahns
• Implemented in C ⇒ type-agnostic code (Fortran is supposed to get this once Fortran 2015 is implemented)
• Fully featured Fortran interface (requires C interoperability)
• Supported by DKRZ
Central motivation for YAXT
Recurring problem: data must be rearranged across process boundaries (halo exchange, transposition, gather/scatter, load balancing).
Consequence: replace error-prone hand-coded MPI calls with a simpler interface.
What does YAXT provide?
• Easily understandable descriptors of data in relation to a global context
• Programmatic discovery of all necessary communication partners
  • sender side and
  • receiver side
• Persistence of the above and of the corresponding memory access pattern
  • thus moving what was previously code into data
• As flexible as MPI datatypes
• Hiding of the message-passing mechanism
  • can be replaced with e.g. one-sided communication
  • can easily aggregate multiple communications
What does YAXT provide? (in other words)
• Several classes of index lists
• Computation of an exchange map (Xt_xmap), i.e. what to send to/receive from which other processes, given a
  • source index list and
  • target index list
• Computation of a data access pattern (Xt_redist), given an
  • exchange map and
  • MPI datatype of the element, or further data access pattern(s)
• Data transfer via MPI, given a
  • data access pattern and
  • arrays holding the sources/targets
How is YAXT used?
• For every data element:
  • assign an integer "name" and
  • describe it via an MPI datatype
• Declare element "have" and "want" lists
• Provide the addresses of the arrays holding the "have" elements and receiving the "want" elements
• YAXT computes all MPI actions needed to make the redistribution happen (see the sketch below)
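A minimal sketch of that workflow for a trivial 1D field. The YAXT entry points used here (xt_initialize, xt_idxvec_new, xt_xmap_all2all_new, xt_redist_p2p_new, xt_redist_s_exchange1) are assumed to match the yaxt.h of that time and should be checked against the installed header.

```c
/* Sketch: redistribute a 1D field with YAXT.  Each rank "has" a block
 * of global indices and "wants" the block of its right neighbour. */
#include <mpi.h>
#include <yaxt.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  xt_initialize(MPI_COMM_WORLD);

  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  enum { N = 4 };                    /* elements per rank */
  Xt_int have[N], want[N];
  double src[N], dst[N];
  for (int i = 0; i < N; ++i) {
    have[i] = (Xt_int)(rank * N + i);                    /* owned indices  */
    want[i] = (Xt_int)(((rank + 1) % nranks) * N + i);   /* needed indices */
    src[i] = (double)have[i];
  }

  /* 1. describe both sides by global index lists */
  Xt_idxlist src_list = xt_idxvec_new(have, N);
  Xt_idxlist dst_list = xt_idxvec_new(want, N);

  /* 2. exchange map: who sends what to whom */
  Xt_xmap xmap = xt_xmap_all2all_new(src_list, dst_list, MPI_COMM_WORLD);

  /* 3. redistribution object: exchange map + element datatype */
  Xt_redist redist = xt_redist_p2p_new(xmap, MPI_DOUBLE);

  /* 4. perform the actual data transfer */
  xt_redist_s_exchange1(redist, src, dst);

  xt_redist_delete(redist);
  xt_xmap_delete(xmap);
  xt_idxlist_delete(dst_list);
  xt_idxlist_delete(src_list);
  xt_finalize();
  MPI_Finalize();
  return 0;
}
```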
Decomposition description via YAXT
• Where a user typically defines both sides of a YAXT redistribution, CDI defines the variable targets to use the indices 0..n-1 consecutively.
• The YAXT descriptor is packed together with the data; this allows descriptors to change upon re-balancing.
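To illustrate the asymmetry: the model process only supplies the indices it owns, while the collector declares the consecutive range 0..n-1 for the whole variable. A sketch of the scheme only, not CDI-pio source; the helper name and the YAXT calls, including xt_idxempty_new, are assumptions to be verified against yaxt.h.

```c
/* Sketch: client side owns an irregular set of cells, the collector
 * side expects the variable as consecutive indices 0..n-1. */
#include <stdlib.h>
#include <mpi.h>
#include <yaxt.h>

/* hypothetical helper, shown to illustrate the index lists only */
static Xt_xmap make_gather_xmap(const Xt_int *owned, int nowned,
                                Xt_int nglobal, int is_collector,
                                MPI_Comm comm)
{
  /* model processes contribute the indices they have ... */
  Xt_idxlist src = xt_idxvec_new(owned, nowned);

  /* ... the collector wants 0..n-1 consecutively, everyone else nothing */
  Xt_idxlist dst;
  if (is_collector) {
    Xt_int *all = malloc((size_t)nglobal * sizeof *all);
    for (Xt_int i = 0; i < nglobal; ++i) all[i] = i;
    dst = xt_idxvec_new(all, (int)nglobal);
    free(all);
  } else {
    dst = xt_idxempty_new();
  }

  Xt_xmap xmap = xt_xmap_all2all_new(src, dst, comm);
  xt_idxlist_delete(dst);
  xt_idxlist_delete(src);
  return xmap;
}
```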
Issues with actual writing
• The GRIB data format assumes no ordering of records (but tools only support this within the same time step)
• No call in MPI can take advantage of that
• The performance of MPI_File_(i)write_shared never surpasses serial writing (various methods tried)
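For reference, the shared-file-pointer approach whose performance disappointed looks roughly like this; a sketch of the plain MPI-IO pattern, not CDI-pio code.

```c
/* Sketch: appending GRIB records through MPI's shared file pointer.
 * Records end up in arrival order, which GRIB tolerates within a
 * time step -- but throughput did not beat serial writing in practice. */
#include <mpi.h>

static void append_record(MPI_File fh, void *record, int nbytes)
{
  MPI_Request req;
  MPI_Status status;
  /* non-blocking variant; the shared file pointer serializes the
   * writers inside the MPI library */
  MPI_File_iwrite_shared(fh, record, nbytes, MPI_BYTE, &req);
  MPI_Wait(&req, &status);
}

/* usage: MPI_File_open(comm, "out.grb",
 *                      MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_APPEND,
 *                      MPI_INFO_NULL, &fh); */
```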
Results for RMA
• RMA works well with RDMA characteristics on MVAPICH2, but needs MV2_DEFAULT_PUT_GET_LIST_SIZE=ntasks for more than 192 non-I/O-server tasks
• RMA is functionally working with IBM PE, but performance is lower than rank-0 I/O with gather
• Open MPI improves just as well as MVAPICH2, but its baseline performance is so much lower that even with CDI-pio Open MPI is slower than MVAPICH2 without it
Portable RMA = hurt
• Better to implement a scalable message-passing scheme from the outset and invest in RMA where beneficial
Results with MVAPICH2 - scaling
Results with MVAPICH2 - profile
HD(CP)2
• For scientists: cloud- and precipitation-resolving simulations
• For DKRZ: huge (I/O) challenges, to the power of 2
• Works with ICON, the icosahedral non-hydrostatic GCM: http://www.mpimet.mpg.de/en/science/models/icon.html
• Goals of ICON (MPI-Met and DWD): a scalable dynamical core and better physics
ICON grid
The ICON grid scales really well:
After scaling the grid, even finer refinements can be nested:
Better to stop the illustrations here, while there are still enough pixels on an HD screen
ICON HD(CP)2 input
• Grid size: 100,000,000 cells horizontally
• Reading on a single machine no longer works: the memory of 64 GB nodes is exhausted early on
• But there is no prior knowledge of which process needs which data, because of the irregular grid
ICON HD(CP)2 output
• Sub-second time resolution
• 3D data also means: the already prohibitive 2D grid data volume is dwarfed by the volume data
• The current meta-data approach of CDI-pio is unsustainable
Future work
• Make collectors act collectively
  • Tighter coupling of I/O servers, but that is inevitable anyway
  • More even distribution of data
  • Will accommodate 2-sided as well as 1-sided communication
• Decompose meta-data, particularly for HD(CP)2
• Further split the I/O servers into collectors and distinct writers to conserve per-process memory
• Introduce OpenMP parallelization
• Input
Summary
• Working concept, but many unexpected problems in practice
• Invested work not yet recouped in production
• Very useful tool: YAXT
• Problems keep growing, invalidating previous assumptions
Thanks to…
• Joachim Biercamp (DKRZ)
• Panagiotis Adamidis (DKRZ)
• Jörg Behrens (DKRZ)
• Irina Fast (DKRZ)
• Moritz Hanke (DKRZ)
• Deike Kleberg (MPI-M)
• Luis Kornblueh (MPI-M)
• Uwe Schulzweida (MPI-M)
• Günther Zängl (DWD)