Progress Toward Accelerating CAM-SE
Jeff Larkin <larkin@cray.com>
Along with: Rick Archibald, Ilene Carpenter, Kate Evans, Paulius Micikevicius, Jim Rosinski, Jim Schwarzmeier, Mark Taylor
Background
• In 2009, ORNL asked many of its top users: what sort of science would you do on a 20-petaflops machine in 2012? (The answer comes on the next slide.)
• The Center for Accelerated Application Readiness (CAAR) was established to determine whether a set of codes from various disciplines can be made to use GPU accelerators effectively through the combined efforts of domain scientists and vendors.
• Each team has a science lead, a code lead, and members from ORNL, Cray, NVIDIA, and elsewhere.
CAM-SE Target Problem
• 1/8-degree CAM, using the CAM-SE dynamical core and Mozart tropospheric chemistry.
• Why is acceleration needed to “do” the problem? When all the tracers associated with Mozart atmospheric chemistry are included, the simulation is too expensive to run at high resolution on today’s systems.
• What unrealized parallelism needs to be exposed? In many parts of the dynamics, the parallelism needs to include levels (k) and chemical constituents (q).
Profile of Runtime (chart: % of total runtime by routine)
Next Steps
• Once the dominant routines were identified, standalone kernels were created for each.
• Early efforts tested PGI and HMPP directives, plus CUDA C, CUDA Fortran, and OpenCL.
• The directives-based compilers were too immature at the time: poor support for Fortran modules and derived types, and they did not allow implementation at a high enough level.
• CUDA Fortran provided good performance while allowing us to remain in Fortran (a minimal kernel sketch follows below).
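A minimal CUDA Fortran sketch of the kernel-plus-host-driver shape used for such standalone tests. The kernel, array names, and sizes (scale_tracers, q, nv, nlev, qsize) are hypothetical placeholders, not the actual HOMME kernels.

```fortran
! A minimal CUDA Fortran sketch of a standalone kernel plus host driver.
! All names and sizes are hypothetical placeholders, not the actual HOMME kernels.
module demo_kernels_mod
  use cudafor
  implicit none
contains
  attributes(global) subroutine scale_tracers(q, s, nv, nlev, qsize)
    integer, value :: nv, nlev, qsize
    real(8)        :: q(nv, nv, nlev, qsize)   ! tracer field for one element (device)
    real(8), value :: s
    integer :: i, j, k, m
    i = threadIdx%x    ! quadrature point (x)
    j = threadIdx%y    ! quadrature point (y)
    k = blockIdx%x     ! vertical level
    m = blockIdx%y     ! tracer index
    if (i <= nv .and. j <= nv .and. k <= nlev .and. m <= qsize) &
      q(i,j,k,m) = s * q(i,j,k,m)              ! placeholder update
  end subroutine scale_tracers
end module demo_kernels_mod

program kernel_demo
  use cudafor
  use demo_kernels_mod
  implicit none
  integer, parameter :: nv = 4, nlev = 26, qsize = 25
  real(8), allocatable         :: q(:,:,:,:)
  real(8), device, allocatable :: q_d(:,:,:,:)
  type(dim3) :: grid, block

  allocate(q(nv, nv, nlev, qsize))
  allocate(q_d(nv, nv, nlev, qsize))
  call random_number(q)
  q_d = q                                      ! host -> device copy

  block = dim3(nv, nv, 1)                      ! threads over quadrature points
  grid  = dim3(nlev, qsize, 1)                 ! blocks over levels and tracers
  call scale_tracers<<<grid, block>>>(q_d, 0.5d0, nv, nlev, qsize)

  q = q_d                                      ! device -> host copy
  print *, 'max value after kernel:', maxval(q)
end program kernel_demo
```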
Identifying Parallelism
• HOMME parallelizes over elements with both MPI and OpenMP.
• Most of the tracer advection can also be parallelized over tracers (q) and levels (k); the vertical remap is the exception, due to a dependence across vertical levels.
• Parallelizing over tracers, and sometimes levels, while threading over quadrature points (nv) provides ample parallelism within each element to utilize the GPU effectively (see the loop-nest sketch below).
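A schematic loop nest, with hypothetical variable names, marking where each kind of parallelism lives: elements go to MPI ranks and OpenMP threads on the CPU (and to thread blocks on the GPU), while tracers, levels, and quadrature points supply the fine-grained parallelism within each element.

```fortran
! Schematic loop nest (hypothetical names) marking where the parallelism lives.
subroutine advect_sketch(qtens, rhs, dt, nv, nlev, qsize, nets, nete)
  implicit none
  integer, intent(in)    :: nv, nlev, qsize, nets, nete
  real(8), intent(in)    :: dt
  real(8), intent(inout) :: qtens(nv, nv, nlev, qsize, nets:nete)
  real(8), intent(in)    :: rhs  (nv, nv, nlev, qsize, nets:nete)
  integer :: ie, m, k, j, i

  !$omp parallel do private(m, k, j, i)
  do ie = nets, nete      ! elements: MPI decomposition + OpenMP threads on the CPU
    do m = 1, qsize       ! tracers:  additional block/thread dimension on the GPU
      do k = 1, nlev      ! levels:   likewise, except in the vertical remap
        do j = 1, nv      ! quadrature points: the threads within a GPU block
          do i = 1, nv
            qtens(i,j,k,m,ie) = qtens(i,j,k,m,ie) + dt * rhs(i,j,k,m,ie)  ! placeholder update
          end do
        end do
      end do
    end do
  end do
end subroutine advect_sketch
```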
Status
• euler_step and laplace_sphere_wk were straightforward to rewrite in CUDA Fortran.
• The vertical remap was rewritten to be more amenable to the GPU (made to vectorize); the resulting code is 2X faster on the CPU than the original and has been given back to the community.
• The edge packing/unpacking for the boundary exchange needs to be rewritten (Ilene talked about this already): it was designed for one element per MPI rank, but we plan to run with more. Once it is node-aware, it can also be made device-aware and greatly reduce PCIe transfers (a device-side packing sketch follows below).
• Someone said yesterday: “As with many kernels, the ratio of FLOPS per byte transferred determines successful acceleration.”
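A hypothetical sketch of device-side edge packing in CUDA Fortran: each element's edge values are gathered into one contiguous exchange buffer, so only the packed buffer, rather than whole elements, needs to leave the device. The names, the edge layout, and the (nv, nv, nlev, nelem) field shape are illustrative assumptions, not the actual HOMME edge-exchange code.

```fortran
! Hypothetical device-side edge pack: gather each element's edge values into
! one contiguous exchange buffer (layout S, N, W, E along the first dimension).
! Names and layout are illustrative, not the actual HOMME edge-exchange code.
module edge_pack_sketch_mod
  use cudafor
  implicit none
contains
  attributes(global) subroutine pack_edges(field, buf, nv, nlev, nelem)
    integer, value :: nv, nlev, nelem
    real(8)        :: field(nv, nv, nlev, nelem)   ! device-resident element fields
    real(8)        :: buf(4*nv, nlev, nelem)       ! contiguous edge buffer
    integer :: i, k, ie
    i  = threadIdx%x       ! position along an edge
    k  = blockIdx%x        ! vertical level
    ie = blockIdx%y        ! element
    if (i <= nv .and. k <= nlev .and. ie <= nelem) then
      buf(       i, k, ie) = field(i,  1, k, ie)   ! south edge
      buf(  nv + i, k, ie) = field(i, nv, k, ie)   ! north edge
      buf(2*nv + i, k, ie) = field(1,  i, k, ie)   ! west edge
      buf(3*nv + i, k, ie) = field(nv, i, k, ie)   ! east edge
    end if
  end subroutine pack_edges
end module edge_pack_sketch_mod
```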
Status (cont.)
• The kernels were put back into HOMME, and validation tests were run and passed; this version did nothing to reduce data movement, it only tested kernel accuracy.
• We are in the process of porting forward to the current trunk and doing more intelligent data movement.
• We are currently re-evaluating directives now that the compilers have matured: the directives-based vertical remap now slightly outperforms the hand-tuned CUDA (a schematic directives example follows below), though we are still working around derived-type issues.
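A heavily simplified, OpenACC-style sketch of what a directives version of a level-and-tracer loop can look like. It is illustrative only: the routine name, array names, and the trivial per-point update are assumptions, the real vertical remap is far more involved, and the original work used the Cray and PGI accelerator directives available at the time.

```fortran
! Illustrative OpenACC-style directives over a level/tracer loop nest.
! Names and the per-point update are assumptions; the real remap is far more involved.
subroutine remap_sketch(qdp, factor, nv, nlev, qsize, nelem)
  implicit none
  integer, intent(in)    :: nv, nlev, qsize, nelem
  real(8), intent(inout) :: qdp(nv, nv, nlev, qsize, nelem)
  real(8), intent(in)    :: factor(nlev)
  integer :: i, j, k, m, ie

  !$acc parallel loop collapse(5) copy(qdp) copyin(factor)
  do ie = 1, nelem
    do m = 1, qsize
      do k = 1, nlev
        do j = 1, nv
          do i = 1, nv
            qdp(i,j,k,m,ie) = qdp(i,j,k,m,ie) * factor(k)   ! placeholder per-point update
          end do
        end do
      end do
    end do
  end do
end subroutine remap_sketch
```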
Challenges
• Data structures (object-oriented Fortran): every node has an array of element derived types, each of which contains more arrays.
• We only care about some of these arrays, so data movement isn’t very natural: we must essentially gather many non-contiguous CPU arrays into a single contiguous GPU array (sketched below).
• Parallelism occurs at various levels of the call tree, not just in leaf routines, so the compiler must be able to inline the leaves in order to use directives; the Cray compiler handles this via whole-program analysis, and the PGI compiler may support it via an inline library.
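A sketch of the gather-before-transfer pattern described above, using a hypothetical, cut-down element type (element_t with a single qdp member standing in for HOMME's much larger element structure): the field of interest is copied out of each element into one contiguous host buffer and then moved to the device as a single array. The scatter back to the elements would mirror this.

```fortran
! Gather-before-transfer sketch with a hypothetical, cut-down element type.
! Only the field of interest is flattened and copied to the GPU.
module pack_sketch_mod
  use cudafor
  implicit none

  type :: element_t                       ! stand-in for the element derived type
    real(8), allocatable :: qdp(:,:,:,:)  ! (nv, nv, nlev, qsize); assumed allocated
  end type element_t

contains

  subroutine push_qdp_to_device(elem, qdp_d, nv, nlev, qsize, nelem)
    integer, intent(in)          :: nv, nlev, qsize, nelem
    type(element_t), intent(in)  :: elem(nelem)
    real(8), device, intent(out) :: qdp_d(nv, nv, nlev, qsize, nelem)
    real(8), allocatable :: buf(:,:,:,:,:)
    integer :: ie

    allocate(buf(nv, nv, nlev, qsize, nelem))
    do ie = 1, nelem                      ! gather many non-contiguous CPU arrays ...
      buf(:,:,:,:,ie) = elem(ie)%qdp
    end do
    qdp_d = buf                           ! ... into one contiguous host-to-device copy
    deallocate(buf)
  end subroutine push_qdp_to_device

end module pack_sketch_mod
```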
Challenges (cont.)
• CUDA Fortran requires everything to live in the same module, so we must duplicate some routines and data structures from several modules in our “cuda_mod”.
• We insert ifdefs that hijack the CPU routine calls and forward them to the matching cuda_mod routines (see the sketch below).
• This is simple for the user, but the developer must maintain duplicate routines. Hey Dave, when will this get changed? ;)
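One possible shape of that ifdef forwarding pattern, with illustrative (not actual) module, routine, and macro names: the call site is identical in both builds, and the preprocessor decides whether the name binds to the CPU routine or to its duplicate in the CUDA module.

```fortran
! One possible shape of the ifdef forwarding pattern; module, routine, and
! macro names are illustrative, not the actual HOMME/cuda_mod code.
module cpu_mod
  implicit none
contains
  subroutine advance_tracers_cpu(n)
    integer, intent(in) :: n
    print *, 'CPU path, n =', n
  end subroutine advance_tracers_cpu
end module cpu_mod

module cuda_mod
  implicit none
contains
  subroutine advance_tracers_cuda(n)      ! would wrap the CUDA Fortran kernels
    integer, intent(in) :: n
    print *, 'GPU path, n =', n
  end subroutine advance_tracers_cuda
end module cuda_mod

program caller
#ifdef USE_CUDA_MOD
  use cuda_mod, only: advance_tracers => advance_tracers_cuda
#else
  use cpu_mod,  only: advance_tracers => advance_tracers_cpu
#endif
  implicit none
  call advance_tracers(10)                ! unchanged call site
end program caller
```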
Until the Boundary Exchange is rewritten, euler_step performance is hampered by data movement. Streaming over elements helps, but may not be realistic for the full code.
With data transfer included, laplace_sphere_wk is a wash, but since all the necessary data is already resident from euler_step, the kernel-only time is the realistic measure.
The vertical remap rewrite is 2X faster on the CPU and still faster on the GPU. All of its data is already resident on the device from euler_step, so the kernel-only time is realistic.
Future Work
• Use CUDA 4.0 dynamic pinning of memory to allow overlapping transfers and better PCIe performance (a minimal stream-overlap sketch follows below).
• Move forward to CAM5/CESM1; there is no chance of our work being used otherwise.
• Some additional small kernels are needed to allow the data to remain resident; it is cheaper to run these on the GPU than to copy the data.
• Re-profile the accelerated application to identify the next most important routines; the chemistry implicit solver is expected to be next, and the physics is expected to require a mature directives-based compiler.
• Rinse, repeat.
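A minimal sketch of overlapping a PCIe transfer with other work using a CUDA stream in CUDA Fortran. For brevity it uses a pinned allocation; the CUDA 4.0 feature mentioned above instead pins (registers) memory that has already been allocated, but the overlap pattern is the same. Buffer names and sizes are illustrative.

```fortran
! Minimal stream-overlap sketch in CUDA Fortran.  Uses a pinned allocation for
! brevity; CUDA 4.0 dynamic pinning would register already-allocated memory
! instead.  Names and sizes are illustrative.
program overlap_sketch
  use cudafor
  implicit none
  integer, parameter :: n = 1024*1024
  real(8), pinned, allocatable :: h(:)    ! pinned host buffer enables async copies
  real(8), device, allocatable :: d(:)
  integer(kind=cuda_stream_kind) :: stream
  integer :: istat

  allocate(h(n))
  allocate(d(n))
  h = 1.0d0
  istat = cudaStreamCreate(stream)

  ! Asynchronous host-to-device copy: returns immediately, so kernels on other
  ! streams (or CPU work) can proceed while the transfer is in flight.
  istat = cudaMemcpyAsync(d, h, n, cudaMemcpyHostToDevice, stream)

  ! ... kernels for this chunk of work would be launched on the same stream ...

  istat = cudaStreamSynchronize(stream)
  istat = cudaStreamDestroy(stream)
  print *, 'asynchronous transfer complete'
end program overlap_sketch
```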
Conclusions
• Much has been done; much remains.
• For a fairly new, cleanly written code, CUDA Fortran was tractable; HOMME has very similar loop nests throughout, which was key to making this possible.
• It still results in multiple code paths to maintain, so we would prefer to move to directives in the long run.
• We believe GPU accelerators will be beneficial for the selected problem, and we hope they will also benefit a wider audience (CAM5 should help with this).