PP POMPA (WG6) Overview Talk
1st Birthday
COSMO GM11, Rome
Who is POMPA? • ARPA-EMR Davide Cesari • C2SM/ETH Xavier Lapillonne, Anne Roches, Carlos Osuna • CASPUR Stefano Zampini, Piero Lanucara, Cristiano Padrin • Cray Jeffrey Pozanovich, Roberto Ansaloni • CSCS Matthew Cordery, Mauro Bianco, Jean-Guillaume Piccinali, William Sawyer, Neil Stringfellow, Thomas Schulthess, Ugo Varetto • DWD Ulrich Schättler, Kristina Fröhlich • KIT Andrew Ferrone, Hartwig Anzt • MeteoSwiss Petra Baumann, Oliver Fuhrer, André Walser • NVIDIA Tim Schröder, Thomas Bradley • Roshydromet Dmitry Mikushin • SCS Tobias Gysi, Men Muheim, David Müller, Katharina Riedinger • USAM David Palella, Alessandro Cheloni, Pier Francesco Coppola • USI Daniel Ruprecht
Kickoff Workshop • May 3-4, 2011, hosted by CSCS in Manno • 15 talks, 18 participants • Goals: get to know each other, report on work already done, plan and coordinate future activities • Revised project plan
Task Overview • Task 1: Performance analysis and documentation • Task 2: Redesign memory layout and data structures • Closely linked to work in Tasks 5 and 6 • Task 3: Improve current parallelization • Task 4: Parallel I/O • Focus on NetCDF (output is still written from a single core) • Technical problems • New person (Carlos Osuna, C2SM) starting work on 15.09.2011 • Task 5: Redesign implementation of dynamical core • Task 6: Explore GPU acceleration • Task 7: Implementation documentation • No progress
Performance Analysis Goal • Understand the code from a performance perspective (workflow, data movement, bottlenecks, problems, …) • Guide and prioritize the work in the other tasks • Try to ensure exchange of information and the performance portability of developments
Performance Analysis (Task 1) Work • COSMO RAPS 5.0 benchmark with DWD, MeteoSwiss and IPCC/ETH runscripts on hpcforge.org (Ulrich Schättler, Oliver Fuhrer, Anne Roches) • Workflow of RK timestep (Ulrich Schättler) http://www.c2sm.ethz.ch/research/COSMO-CCLM/hp2c_one_year_meeting/2a_schaettler • Performance analysis • COSMO RAPS 5.0 on Cray XT4, XT5 and XE6 (Jean-Guillaume Piccinali, Anne Roches) • COSMO-ART (Oliver Fuhrer) • Wiki page
Problem: Overfetching • Computational intensity is the ratio of floating point operations (ops) per memory reference (ref) • When accessing a single array value, a complete cache line (64 Bytes = 8 double precision values) is loaded into L1 cache

  do i = 1+nboundlines, ie-nboundlines
    A(i) = 0.0d0
  end do

…with nboundlines = 3, this loop also loads A(1), A(2), A(3), which are never touched by the computation • If the subdomain on a processor is very small, many values loaded from memory never get used for computation
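To quantify this, here is a minimal self-contained sketch (hypothetical code, not from the COSMO source) that computes which fraction of the fetched cache lines is actually used by the loop above, assuming 64-byte lines holding 8 double precision values:

  program overfetch_demo
    ! Hypothetical sketch: fraction of fetched data actually used when
    ! writing the interior A(1+nboundlines : ie-nboundlines).
    implicit none
    integer, parameter :: nboundlines = 3 ! halo width, as in COSMO
    integer, parameter :: cacheline = 8   ! doubles per 64-byte cache line
    integer :: ie, nused, first_line, last_line, nloaded
    do ie = 16, 128, 16
      nused = ie - 2*nboundlines                       ! interior points written
      first_line = nboundlines / cacheline             ! 0-based line of A(1+nboundlines)
      last_line  = (ie - nboundlines - 1) / cacheline  ! 0-based line of A(ie-nboundlines)
      nloaded = (last_line - first_line + 1) * cacheline
      print '(a,i4,a,f6.1,a)', 'ie = ', ie, ': used fraction = ', &
        100.0 * real(nused) / real(nloaded), ' %'
    end do
  end program overfetch_demo

For ie = 16 only about 62% of the fetched data is used, while ie = 128 reaches about 95%; this is why very small per-processor subdomains waste a large share of the memory bandwidth.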
Performance Analysis: Wiki https://wiki.c2sm.ethz.ch/Wiki/ProjPOMPATask1
Improve Current Parallelization (Task 3) • Loop-level hybrid parallelization (OpenMP/MPI) (Matthew Cordery, Davide Cesari, Stefano Zampini), see the sketch below • No clear benefit of this approach vs. flat MPI parallelization • Is the approach suitable for memory bandwidth bound code? • Restructuring of the code (into blocks) may help! • Overlap communication with computation using non-blocking MPI calls (Stefano Zampini) • Lumped halo-updates for COSMO-ART (Christoph Knote, Andrew Ferrone)
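As a point of reference, this is roughly what loop-level hybrid parallelization looks like (a minimal sketch with illustrative array and bound names, not the actual COSMO code): MPI still decomposes the domain across ranks, while OpenMP threads share the loop nest within each rank's subdomain.

  subroutine tendency_update(t, ttens, ie, je, ke, istart, iend, jstart, jend, zfac)
    ! Illustrative loop-level hybrid parallelization: within one MPI
    ! rank, OpenMP threads share the outer k-loop of a tendency update.
    implicit none
    integer, intent(in)    :: ie, je, ke, istart, iend, jstart, jend
    real(8), intent(in)    :: t(ie,je,ke), zfac
    real(8), intent(inout) :: ttens(ie,je,ke)
    integer :: i, j, k
  !$omp parallel do private(i, j, k)
    do k = 1, ke
      do j = jstart, jend
        do i = istart, iend
          ttens(i,j,k) = ttens(i,j,k) + zfac * t(i,j,k)
        end do
      end do
    end do
  !$omp end parallel do
  end subroutine tendency_update

Since such loops do very little arithmetic per memory reference, they are bandwidth bound, which is one plausible reason why adding threads shows no clear benefit over flat MPI unless the code is restructured into cache-friendly blocks.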
Halo exchange in COSMO • 3 types of point-to-point communications: 2 partially non-blocking and 1 fully blocking (with MPI_SENDRECV) • Halo swapping needs completion of the East-West exchange before starting the South-North communication (implicit corner exchange) • New version which communicates the corners directly (2x more messages) Stefano Zampini
New halo-exchange routine • OLD: a single CALL exch_boundaries(A) spans the whole communication time • NEW: the exchange is split into several calls (CALL exch_boundaries(A,2), CALL exch_boundaries(A,3), …) so that computation can be interleaved with the communication time Stefano Zampini
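The idea behind splitting the exchange can be shown with a minimal self-contained 1-D example (illustrative names and stencil, not the actual exch_boundaries implementation): the halo messages are posted non-blocking, halo-independent work runs while they are in flight, and only the halo-dependent points wait for completion.

  program halo_overlap
    ! Minimal 1-D sketch of overlapping computation with a
    ! non-blocking halo update; names and stencil are illustrative.
    use mpi
    implicit none
    integer, parameter :: n = 1024, nbl = 3
    real(8) :: a(1-nbl:n+nbl), b(n)
    integer :: rank, nprocs, left, right, reqs(4), ierr, i

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    left  = merge(MPI_PROC_NULL, rank-1, rank == 0)
    right = merge(MPI_PROC_NULL, rank+1, rank == nprocs-1)
    a = real(rank, 8)

    ! phase 1: post the halo exchange without blocking
    call MPI_Irecv(a(1-nbl),   nbl, MPI_DOUBLE_PRECISION, left,  0, MPI_COMM_WORLD, reqs(1), ierr)
    call MPI_Irecv(a(n+1),     nbl, MPI_DOUBLE_PRECISION, right, 1, MPI_COMM_WORLD, reqs(2), ierr)
    call MPI_Isend(a(1),       nbl, MPI_DOUBLE_PRECISION, left,  1, MPI_COMM_WORLD, reqs(3), ierr)
    call MPI_Isend(a(n-nbl+1), nbl, MPI_DOUBLE_PRECISION, right, 0, MPI_COMM_WORLD, reqs(4), ierr)

    ! phase 2: compute on points whose stencil needs no halo values
    do i = 2, n-1
      b(i) = 0.5d0 * (a(i-1) + a(i+1))
    end do

    ! phase 3: complete the exchange, then do the halo-dependent points
    call MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE, ierr)
    b(1) = 0.5d0 * (a(0) + a(2))
    b(n) = 0.5d0 * (a(n-1) + a(n+1))

    call MPI_Finalize(ierr)
  end program halo_overlap

Whether MPI_Waitall, or the Testany/Waitany variants questioned on the next slide, is the most efficient way to complete such an exchange is exactly what the COSMO-2 measurements address.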
Early results: COSMO-2 [charts: total time (s) for model runs, and mean total time for RK dynamics] • Is Testany / Waitany the most efficient way to assure completion? • Restructuring of the code to find more work (B) could help!
Explore GPU Acceleration (Task 6) Goal • Investigate whether and how GPUs can be leveraged for numerical weather prediction with COSMO Background • Early investigations by Michalakes et al. using WRF physical parametrizations • Full port of the JMA next-generation model (ASUCA) to GPUs via a rewrite in CUDA • New model developments (e.g. NIM at NOAA) designed with GPUs as a target architecture from the very start
GPU Motivation • Intel Westmere (6 cores @ 3.4 GHz): 81.6 GFlops peak performance, 32 GB/s memory bandwidth, 130 Watt, X $ per node • NVIDIA Fermi M2090 (512 cores @ 1.3 GHz): 665 GFlops peak performance, 155 GB/s memory bandwidth, 225 Watt, X $ per node • GPU vs. CPU ratios: × 8 if compute bound (665/81.6), × 5 if memory bound (155/32), × 1.7 in power consumption ("power bound", 225/130)
Programming GPUs • Programming languages (OpenCL, CUDA C, CUDA Fortran, …) • Two codes to maintain • Highest control, but requires a complete rewrite • Highest performance (if done by an expert) • Directive-based approach (PGI, OpenMP-acc, HMPP, …), see the sketch below • Smaller modifications to the original code • The resulting code is still understandable by Fortran programmers and can be easily modified • Possible performance sacrifice (w.r.t. a rewrite) • No standard for the moment • Source-to-source translation (F2C-acc, Kernelgen, …) • One source code • Can achieve very good performance • Legacy codes often don't map very well onto GPUs • Hard to debug
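For a feel of the directive-based approach, here is a minimal sketch (an illustrative routine, shown with the PGI accelerator syntax of the time; OpenMP-acc and HMPP use different spellings): the loop nest stays ordinary Fortran, and a directive asks the compiler to offload it to the GPU.

  subroutine saxpy_acc(n, alpha, x, y)
    ! Illustrative directive-based offload of a simple loop;
    ! !$acc region / end region follow the PGI accelerator model.
    implicit none
    integer, intent(in)    :: n
    real(8), intent(in)    :: alpha, x(n)
    real(8), intent(inout) :: y(n)
    integer :: i
  !$acc region
    do i = 1, n
      y(i) = alpha * x(i) + y(i)
    end do
  !$acc end region
  end subroutine saxpy_acc

The appeal listed above is visible here: remove the two directive lines and the routine is unchanged standard Fortran, so a single source can serve both CPU and GPU builds.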
Challenges • How to change a wheel on a moving car? • GPU hardware and programming models are rapidly changing • Several approaches are vendor bound and/or not part of a standard • COSMO is also rapidly evolving • How to have a single readable code which also compiles onto GPUs? • Efficiency may require restructuring or even a change of algorithm • A jungle of directives • An efficient GPU implementation requires… • executing all of COSMO on the GPU • enough fine-grained parallelism (i.e. threads)
Explore GPU Acceleration (Task 6) Work • Source-to-source translation of the whole model (Dmitry Mikushin) • Porting of physical parametrizations using PGI directives or f2c-acc (Xavier Lapillonne, Cristiano Padrin), see next talk • Rewrite of dynamical core for GPUs (Oliver Fuhrer), see talk after next
HP2C OPCODE Project • Additional proposal to the Swiss HP2C initiative to build an "OPerational COSMO DEmonstrator (OPCODE)" • Project proposal accepted • Project runs from 1 June 2011 until the end of 2012 • Project lead: André Walser • Project resources: • second contract with the IT company SCS to continue the collaboration until the end of 2012 • 2 new positions at MeteoSwiss for about 1 year • contribution to a position at C2SM • contribution from CSCS
HP2C OPCODE Project Main Goals • Leverage the research results of the ongoing HP2C COSMO project • Prototype implementation of the MeteoSwiss production suite making aggressive use of GPU technology • Similar time-to-solution on hardware with substantially lower power consumption and price (GPU-based hardware in a few rack units instead of a 3-cabinet Cray XT4)