Stencil Framework for Portable High Performance Computing Naoya Maruyama RIKEN Advanced Institute for Computational Science April 2013 @ NCSA, IL, USA
Talk Outline • Physis stencil framework • Map Reduce for K • Mini-app development
Multi-GPU Application Development
• Non-unified programming models
  • MPI for inter-node parallelism
  • CUDA/OpenCL/OpenACC for accelerators
• Optimizations
  • Blocking
  • Overlapped computation and communication
• Result: good performance, but low programmer productivity
Goal and Approach
• Goal: high-performance, highly productive programming for heterogeneous clusters
• Approach: high-level abstractions for structured parallel programming
  • Simplified programming models
  • Portability across platforms
  • Without sacrificing too much performance
Physis (Φύσις) Framework [SC’11]
• Stencil DSL: declarative, portable, global-view, C-based
• DSL compiler: target-specific code generation and optimizations, automatic parallelization
• Generated backends: C, C+MPI, CUDA, CUDA+MPI, OpenMP, OpenCL

  void diffusion(int x, int y, int z,
                 PSGrid3DFloat g1, PSGrid3DFloat g2) {
    float v = PSGridGet(g1,x,y,z)
      + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
      + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
      + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
    PSGridEmit(g2, v/7.0);
  }
DSL Overview
• C + custom data types and intrinsics
• Grid data types: PSGrid3DFloat, PSGrid3DDouble, etc.
• Dense Cartesian domain types: PSDomain1D, PSDomain2D, and PSDomain3D
• Intrinsics
  • Runtime management
  • Grid object management (PSGrid3DFloatNew, etc.)
  • Grid accesses (PSGridCopyin, PSGridGet, etc.)
  • Applying stencils to grids (PSStencilMap, PSStencilRun)
  • Grid reductions (PSGridReduce)
Writing Stencils
• Stencil kernel: a C function describing a single flow of scalar execution on one grid element
• Executed over specified rectangular domains

  void diffusion(const int x, const int y, const int z,
                 PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
    float v = PSGridGet(g1,x,y,z)
      + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
      + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
      + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
    PSGridEmit(g2, v/7.0*t);  // issues a write to grid g2
  }

• Offsets passed to PSGridGet must be constant
• Periodic boundary access is available with PSGridGetPeriodic
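As a rough illustration of the periodic variant, the kernel below replaces the out-of-range neighbor accesses with PSGridGetPeriodic. This is a sketch only: the slides do not show the exact signature of PSGridGetPeriodic, so it is assumed here to take the same arguments as PSGridGet.

  // Sketch: assumes PSGridGetPeriodic has the same argument list as PSGridGet
  // and wraps out-of-range offsets around the grid boundary.
  void average_periodic_x(const int x, const int y, const int z,
                          PSGrid3DFloat g1, PSGrid3DFloat g2) {
    float v = PSGridGet(g1,x,y,z)
      + PSGridGetPeriodic(g1,x-1,y,z) + PSGridGetPeriodic(g1,x+1,y,z);
    PSGridEmit(g2, v/3.0f);
  }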
Applying Stencils to Grids
• Map: creates a stencil closure that encapsulates a stencil kernel and grids
• Run: iteratively executes stencil closures

  PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);
  PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
  PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);
  PSStencilRun(PSStencilMap(diffusion, d, g1, g2, 0.5),
               PSStencilMap(diffusion, d, g2, g1, 0.5),
               10);

• Grouping stencils in one PSStencilRun marks them as a target for the kernel-fusion optimization
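For context, a complete host program would also initialize the runtime and move data between host buffers and grids. The sketch below is not verbatim Physis code: PSGridCopyin is listed among the intrinsics above, but the PSInit, PSGridCopyout, and PSFinalize calls, and the header name, are assumptions made for illustration.

  // Minimal end-to-end sketch; init/copy-out/finalize calls are assumed.
  #include <stdlib.h>
  #include "physis.h"   /* header name assumed */

  #define NX 256
  #define NY 256
  #define NZ 256

  int main(int argc, char *argv[]) {
    PSInit(&argc, &argv, 3, NX, NY, NZ);              /* assumed runtime init */
    PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);
    PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
    PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);

    float *buf = (float *)malloc(sizeof(float) * NX * NY * NZ);
    /* ... fill buf with the initial condition ... */
    PSGridCopyin(g1, buf);                            /* host-to-grid copy */

    /* uses the diffusion kernel from the "Writing Stencils" slide */
    PSStencilRun(PSStencilMap(diffusion, d, g1, g2, 0.5),
                 PSStencilMap(diffusion, d, g2, g1, 0.5),
                 10);

    PSGridCopyout(g1, buf);                           /* assumed grid-to-host copy */
    PSFinalize();                                     /* assumed runtime shutdown */
    free(buf);
    return 0;
  }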
Implementation
• DSL translator
  • Translates intrinsic calls to runtime API calls
  • Generates GPU kernels with boundary exchanges based on static analysis
  • Built on the ROSE compiler framework (LLNL)
• Runtime
  • Provides a shared-memory-like interface for multidimensional grids over distributed CPU/GPU memory
• Workflow: Physis code → generated source code → executable code
CUDA Thread Blocking
• Each thread sweeps the points along the Z dimension
• The X and Y dimensions are blocked with AxB thread blocks, where A and B are user-configurable parameters (64x4 by default)
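A hand-written sketch of this mapping (not the generated code, which is shown on the next slide) for a 7-point stencil on a flat array might look like the following; the array layout and boundary handling here are illustrative assumptions.

  // Simplified CUDA sketch of XY blocking with a Z-marching loop per thread.
  __global__ void diffusion_blocked(const float *f, float *g,
                                    int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || x >= nx - 1 || y <= 0 || y >= ny - 1) return;
    for (int z = 1; z < nz - 1; ++z) {            // each thread sweeps Z
      int idx = x + y * nx + z * nx * ny;
      g[idx] = (f[idx] + f[idx - 1] + f[idx + 1]
                + f[idx - nx] + f[idx + nx]
                + f[idx - nx * ny] + f[idx + nx * ny]) / 7.0f;
    }
  }

  // Launch with the default 64x4 blocking:
  //   dim3 block(64, 4, 1);
  //   dim3 grid((nx + 63) / 64, (ny + 3) / 4, 1);
  //   diffusion_blocked<<<grid, block>>>(f, g, nx, ny, nz);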
Example: 7-Point Stencil GPU Code (generated)

  __device__ void kernel(const int x, const int y, const int z,
                         __PSGrid3DFloatDev *g, __PSGrid3DFloatDev *g2) {
    float v = ((((((*__PSGridGetAddrNoHaloFloat3D(g,x,y,z)
                    + *__PSGridGetAddrFloat3D_0_fw(g,(x + 1),y,z))
                   + *__PSGridGetAddrFloat3D_0_bw(g,(x - 1),y,z))
                  + *__PSGridGetAddrFloat3D_1_fw(g,x,(y + 1),z))
                 + *__PSGridGetAddrFloat3D_1_bw(g,x,(y - 1),z))
                + *__PSGridGetAddrFloat3D_2_bw(g,x,y,(z - 1)))
               + *__PSGridGetAddrFloat3D_2_fw(g,x,y,(z + 1)));
    *__PSGridEmitAddrFloat3D(g2,x,y,z) = v;
  }

  __global__ void __PSStencilRun_kernel(int offset0, int offset1, __PSDomain dom,
                                        __PSGrid3DFloatDev g, __PSGrid3DFloatDev g2) {
    int x = blockIdx.x * blockDim.x + threadIdx.x + offset0,
        y = blockIdx.y * blockDim.y + threadIdx.y + offset1;
    if (x < dom.local_min[0] || x >= dom.local_max[0] ||
        (y < dom.local_min[1] || y >= dom.local_max[1]))
      return;
    int z;
    for (z = dom.local_min[2]; z < dom.local_max[2]; ++z) {
      kernel(x, y, z, &g, &g2);
    }
  }
Example: 7-Point Stencil CPU Code (generated)

  static void __PSStencilRun_0(int iter, void **stencils) {
    struct dim3 block_dim(64,4,1);
    struct __PSStencil_kernel *s0 = (struct __PSStencil_kernel *)stencils[0];
    cudaFuncSetCacheConfig(__PSStencilRun_kernel, cudaFuncCachePreferL1);
    struct dim3 s0_grid_dim((int)(ceil(__PSGetLocalSize(0) / ((double)64))),
                            (int)(ceil(__PSGetLocalSize(1) / ((double)4))), 1);
    __PSDomainSetLocalSize(&s0->dom);
    s0->g = __PSGetGridByID(s0->__g_index);
    s0->g2 = __PSGetGridByID(s0->__g2_index);
    int i;
    for (i = 0; i < iter; ++i) {
      {
        int fw_width[3] = {1L, 1L, 1L};
        int bw_width[3] = {1L, 1L, 1L};
        __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
      }
      __PSStencilRun_kernel<<<s0_grid_dim,block_dim>>>(
          __PSGetLocalOffset(0), __PSGetLocalOffset(1), s0->dom,
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    }
    cudaThreadSynchronize();
  }
Optimization: Overlapped Computation and Communication
1. Copy boundary planes from GPU to CPU (needed for non-unit-stride cases)
2. Compute interior points
3. Exchange boundaries with neighbors
4. Compute boundary points
Interior computation (step 2) runs concurrently with the boundary copies and exchanges (steps 1 and 3), hiding communication time.
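The same pattern can be sketched with plain CUDA streams, independently of the generated code shown next. All names below (the kernels, halo buffers, and the exchange helper) are illustrative placeholders, not Physis API.

  #include <cuda_runtime.h>

  __global__ void interior_kernel(float *g, float *g2) { /* interior stencil omitted in this sketch */ }
  __global__ void boundary_kernel(float *g, float *g2) { /* boundary stencil omitted in this sketch */ }
  static void exchange_halo(float *halo, size_t bytes) { /* MPI_Isend/MPI_Irecv would go here */ }

  void run_overlapped(int iter, float *g, float *g2,
                      float *dev_halo, float *host_halo, size_t halo_bytes,
                      dim3 grid_interior, dim3 grid_boundary, dim3 block) {
    cudaStream_t s_int, s_bnd;
    cudaStreamCreate(&s_int);
    cudaStreamCreate(&s_bnd);
    for (int i = 0; i < iter; ++i) {
      // Step 2: interior points run asynchronously on their own stream
      interior_kernel<<<grid_interior, block, 0, s_int>>>(g, g2);
      // Steps 1 and 3: copy boundary planes out, exchange with neighbors, copy back
      cudaMemcpyAsync(host_halo, dev_halo, halo_bytes, cudaMemcpyDeviceToHost, s_bnd);
      cudaStreamSynchronize(s_bnd);
      exchange_halo(host_halo, halo_bytes);
      cudaMemcpyAsync(dev_halo, host_halo, halo_bytes, cudaMemcpyHostToDevice, s_bnd);
      // Step 4: boundary points once halo data is in place
      boundary_kernel<<<grid_boundary, block, 0, s_bnd>>>(g, g2);
      cudaDeviceSynchronize();   // join both streams before the next iteration
    }
    cudaStreamDestroy(s_int);
    cudaStreamDestroy(s_bnd);
  }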
Optimization Example: 7-Point Stencil CPU Code (generated)

  for (i = 0; i < iter; ++i) {
    // Computing interior points
    __PSStencilRun_kernel_interior<<<s0_grid_dim, block_dim, 0, stream_interior>>>(
        __PSGetLocalOffset(0), __PSGetLocalOffset(1), __PSDomainShrink(&s0->dom, 1),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    // Boundary exchange
    int fw_width[3] = {1L, 1L, 1L};
    int bw_width[3] = {1L, 1L, 1L};
    __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
    // Computing boundary planes concurrently
    __PSStencilRun_kernel_boundary_1_bw<<<1, (dim3(1,128,4)), 0, stream_boundary_kernel[0]>>>(
        __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 0),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    __PSStencilRun_kernel_boundary_1_bw<<<1, (dim3(1,128,4)), 0, stream_boundary_kernel[1]>>>(
        __PSDomainGetBoundary(&s0->dom, 0, 0, 1, 5, 1),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    …
    __PSStencilRun_kernel_boundary_2_fw<<<1, (dim3(128,1,4)), 0, stream_boundary_kernel[11]>>>(
        __PSDomainGetBoundary(&s0->dom, 1, 1, 1, 1, 0),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
        *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
    cudaThreadSynchronize();
  }
  cudaThreadSynchronize();
  }
Local Optimization
• Register blocking: reuse loaded grid elements in registers across iterations of the Z loop

  Original:
    for (int k = 1; k < n-1; ++k) {
      g[i][j][k] = a*(f[i][j][k]+f[i][j][k-1]+f[i][j][k+1]);
    }

  Optimized:
    double kc = f[i][j][0];
    double kn = f[i][j][1];
    for (int k = 1; k < n-1; ++k) {
      double kp = kc;            // value at k-1, already in a register
      kc = kn;                   // value at k
      kn = f[i][j][k+1];         // only one new load per iteration
      g[i][j][k] = a*(kc+kp+kn);
    }
Local Optimization
• Common subexpression elimination in offset computation
• Eliminates intra- and inter-iteration common subexpressions

  Original:
    for (int k = 1; k < n-1; ++k) {
      g[i][j][k] = a*(f[i+j*n+k*n*n]+
                      f[i+j*n+(k-1)*n*n]+f[i+j*n+(k+1)*n*n]);
    }

  Optimized:
    int idx = i+j*n+n*n;         // offset for k = 1
    for (int k = 1; k < n-1; ++k) {
      g[i][j][k] = a*(f[idx]+f[idx-n*n]+f[idx+n*n]);
      idx += n*n;
    }
Evaluation • Performance and productivity • Sample code • 7-point diffusion kernel (#stencil: 1) • Jacobi kernel from Himeno benchmark (#stencil: 1) • Seismic simulation (#stencil: 15) • Platform • Tsubame 2.0 • Node: Westmere-EP 2.9GHz x 2 + M2050 x 3 • Dual Infiniband QDR with full bisection BW fat tree
Productivity • Similar code size to sequential code in C
Optimization Effect 7-point diffusion stencil on 1 GPU (Tesla M2050)
Seismic Weak Scaling Problem size: 256x256x256 per GPU
Diffusion Strong Scaling Problem size: 512x512x4096
Himeno Strong Scaling Problem size XL (1024x1024x512)
Ongoing Work
• Auto-tuning
  • Preliminary auto-tuning for the CUDA backend is available
• Supporting different accelerators
• Supporting more complex problems
  • Stencils with limited data dependencies
  • Hierarchically organized problems
    • Work unit: dense structured grids
    • Overall problem domain: sparsely connected work units
  • Examples
    • NICAM: an icosahedral-grid climate model
    • UPACS: fluid simulation for engineering problems
Further Information • Code is available at http://github.com/naoyam/physis • Maruyama et al., “Physis: Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers,” SC’11, 2011.
Talk Outline • Physis stencil framework • Map Reduce for K • Mini-app development
Map Reduce for K
• In-memory, MPI-based MapReduce for the K computer
  • Implemented as a C library
  • Provides (a subset of) Hadoop-like programming interfaces
• Strong focus on scalable processing of large data sets on K
  • Fast data loading over the Tofu network
  • Standard MPI clusters are supported as well
• Application examples
  • Metagenome sequence analysis
  • Replica-exchange MD: runs hundreds of NAMD instances as Map tasks
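Since the slide describes Hadoop-like map and reduce interfaces without showing the API, the sketch below only illustrates the general shape of such a program in C. The mr_* names and the callback signatures are hypothetical placeholders and do not reflect the library's actual API.

  /* Hypothetical sketch: mr_* names and signatures are placeholders, NOT the
   * actual API of the MapReduce library described on this slide. It only
   * illustrates the Hadoop-like map -> shuffle -> reduce flow in C. */
  typedef struct { const char *key; long value; } kv_t;

  /* Map: emit one (word, 1) pair per input record. */
  static void count_map(const char *record, void (*emit)(kv_t)) {
    kv_t kv = { record, 1 };
    emit(kv);
  }

  /* Reduce: sum the values that share a key. */
  static long count_reduce(const char *key, const long *values, int n) {
    (void)key;
    long sum = 0;
    for (int i = 0; i < n; ++i) sum += values[i];
    return sum;
  }

  /* A driver would then do something like (placeholder calls):
   *   mr_map(ctx, input, count_map);      -- map tasks run on MPI ranks
   *   mr_shuffle(ctx);                    -- key exchange over MPI / Tofu
   *   mr_reduce(ctx, count_reduce, out);
   */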
Talk Outline • Physis stencil framework • Map Reduce for K • Mini-app development
HPC Mini Applications
• A set of mini-apps derived from national key applications
• Source code will be publicly released around Q1-Q2 ’14
• Expected final number of apps: fewer than 20
• Part of the national pre-exascale projects
• Supported by the MEXT HPCI Feasibility Study program (PI: Hirofumi Tomita, RIKEN AICS)
Mini-App Methodology
• Request for application submissions from current users of Japanese HPC systems
• Documentation
  • Mathematical background
  • Target capability and capacity
  • Input data sets
  • Validation methods
• Application code
  • Full-scale applications: capable of running complete end-to-end simulations (17 submissions so far)
  • Stripped-down applications: simplified applications with only the essential parts of the code (6 submissions so far)
Mini-App Methodology
• Deriving “mini” applications from the submitted applications
  • Understanding performance-critical patterns: computation, memory access, file I/O, network I/O
  • Reducing codebase size
    • Removing code not used for the target problems
    • Applying standard software engineering practices (e.g., DRY)
  • Refactoring into reference implementations
• Performance modeling
  • In collaboration with the ASPEN project (Jeff Vetter, ORNL)
• (Optional) Versions optimized for specific architectures
Mini-App Example: Molecular Dynamics
• Kernels: pairwise force calculation + long-range updates
• Two alternative algorithms that solve an equivalent problem
  • FFT-based Particle Mesh Ewald: bottlenecked by all-to-all communication at scale
  • Fast Multipole Method: tree-based formulation with no all-to-all communication
• Simplified problem setting
  • Only simulates water molecules in the NVE setting
  • Significantly reduces the codebase
  • Makes it easier to create input data sets of different scales
• Two reference implementations to study the performance implications of the algorithmic differences
  • MARBLE (20K SLOC)
  • MODYLAS (16K SLOC)