
GPU Parallelization Strategy for Irregular Grids

September 8, 2011. Presented by Jim Rosinski, with Mark Govett, Tom Henderson, and Jacques Middlecoff.


Presentation Transcript


  1. GPU Parallelization Strategy for Irregular Grids
     September 8, 2011
     Presented by Jim Rosinski, with Mark Govett, Tom Henderson, and Jacques Middlecoff

  2. Regular vs. Irregular grid
     Lat/Lon Model vs. Icosahedral Model
     • Near-constant resolution over the globe
     • Efficient high-resolution simulations
     (slide courtesy Dr. Jin Lee)

  3. Fortran Loop Structures in NIM
     • Dynamics: (lev, column)
       - Lots of data independence in both dimensions => thread over levels, block over columns
       - Exceptions: MPI messages, vertical summations
     • Physics (TBD): (column, lev, [chunk])
       - True for WRF, CAM
       - Dependence in the "lev" dimension
     Multi-core Workshop
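The dynamics loop shape above can be sketched in plain C (a hypothetical example; the array sizes and the update are illustrative, not NIM code). Because no (k, ipn) element reads another element's result, either loop can become the GPU thread or the block dimension; NIM threads over k (levels) and blocks over ipn (columns):

```c
#include <assert.h>

#define NZ  4   /* vertical levels (illustrative) */
#define NIP 6   /* horizontal columns (illustrative) */

/* Dynamics-style loop nest: every (k, ipn) element is computed
 * independently, so on a GPU the k loop can map to threads and the
 * ipn loop to blocks with no synchronization between iterations. */
void dyn_update(double out[NZ][NIP], const double in[NZ][NIP])
{
    for (int ipn = 0; ipn < NIP; ipn++)       /* -> GPU blocks  */
        for (int k = 0; k < NZ; k++)          /* -> GPU threads */
            out[k][ipn] = 2.0 * in[k][ipn];   /* no cross-iteration dependence */
}
```

The two loops can also be interchanged freely, which is exactly the freedom the GPU mapping exploits.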

  4. GPU-izing Physics
     • Problem 1: Can't thread or block over "k" because most physics has "k" dependence
       - In NIM this leaves only one dimension for parallelism, and efficient CUDA needs two
     • Problem 2: A data transpose is needed
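The "k" dependence can be seen in a toy vertical summation (hypothetical, not NIM physics): each level needs the result from the level below it, so the k loop must run serially and only the column dimension remains available for parallelism:

```c
#include <assert.h>

#define NZ 5   /* vertical levels (illustrative) */

/* Toy vertical summation: p[k] depends on p[k-1], a serial
 * recurrence, so k cannot be a GPU thread or block dimension.
 * In a real model each horizontal column is still independent,
 * which is the one parallel dimension the slide refers to. */
void column_integrate(double p[NZ], const double dz[NZ])
{
    p[0] = dz[0];
    for (int k = 1; k < NZ; k++)   /* must execute in order */
        p[k] = p[k-1] + dz[k];
}
```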

  5. Solution to GPU-izing Physics
     • Transpose and "chunking":

     !ACC$REGION(<chunksize>,<nchunks>,<dynvars,physvars:none>) BEGIN
     do n=1,nvars_dyn2phy          ! Loop over dynvars needed in phy
       do k=1,nz                   ! Vertical
     !ACC$DO PARALLEL (1)
         do c=1,nchunks            ! chunksize*nchunks >= nip
     !ACC$DO VECTOR (1)
           do i=1,chunksize        ! 128 is a good number to choose for chunksize
             ipn = min (ipe, ips + (c-1)*chunksize + (i-1))
             physvars(i,c,k,n) = dynvars(k,ipn,n)
           end do
         end do
       end do
     end do
     !ACC$REGION END

     Multi-core Workshop
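The index arithmetic in the transpose above can be checked with a small C sketch (the ips/ipe/chunk sizes below are hypothetical; only the min-clamped mapping is taken from the slide): every column in [ips, ipe] is visited exactly as the Fortran loop would, and the clamp keeps the ragged final chunk from running past ipe:

```c
#include <assert.h>

/* C translation of the Fortran index mapping
 *   ipn = min(ipe, ips + (c-1)*chunksize + (i-1))
 * with 0-based chunk index c and in-chunk index i. */
int chunk_to_column(int ips, int ipe, int chunksize, int c, int i)
{
    int ipn = ips + c * chunksize + i;
    return ipn < ipe ? ipn : ipe;   /* clamp the ragged final chunk */
}
```

Because chunksize*nchunks >= nip, the last chunk may overshoot; the clamp makes those extra iterations redundantly rewrite column ipe, which is harmless and keeps every thread's control flow uniform.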

  6. Transpose Performance in NIM

  7. CPU code runs fast
     • Used PAPI to count flops (Intel compiler)
       - Requires -O1 (no vectorization) for the counts to be accurate!
       - A 2nd run with -O3 (vectorization) gives the wallclock time
     • 27% of peak on a 2.8 GHz Westmere

  8. NIM scaling

  9. Current Software Status
     • Full dynamics runs on CPU or GPU
       - ~5X speedup socket-to-socket on GPU
       - GPU solution judged reasonable by comparing output-field diffs against a rounding-level perturbation
     • Dummy physics can run on CPU or GPU
     • Single-source
       - GPU directives are ignored in CPU mode
       - NO constructs that look like:
           #ifdef GPU
             <do this>
           #else
             <do that>
           #endif
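The single-source approach works because the GPU directives are comments in Fortran (`!ACC$...`), which a plain CPU compiler skips. The same idea in C uses a pragma a non-accelerator compiler ignores; this is a sketch of the pattern (using an OpenACC-style pragma as a stand-in, not the ACC$ directive syntax from the slides):

```c
#include <assert.h>

/* Directive-based single source: a compiler that does not recognize
 * the pragma simply ignores it (at most emitting a warning), so the
 * identical file builds for CPU and GPU with no #ifdef GPU blocks. */
void scale(int n, double *x)
{
#pragma acc parallel loop   /* GPU compilers parallelize; CPU compilers ignore */
    for (int i = 0; i < n; i++)
        x[i] *= 2.0;
}
```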
