Generating GPU-Accelerated Code From a High-level Domain-specific Language
Graham Markall, Software Performance Optimisation Group, Imperial College London
http://www.doc.ic.ac.uk/~grm08
Joint work with David Ham and Paul Kelly
October 2009
Problem & Proposal
• How do we exploit multicore architectures to improve the performance of the assembly phase?
• Writing code for new architectures is time-consuming and error-prone.
• Proposal: provide a hardware-independent abstraction for the specification of finite element methods.
  • Future-proofs the code
  • Easier development
• Background:
  • A conjugate gradient GPU solver is 10x faster than one CPU core
  • Solvers are generic – someone else will solve that problem
This Study
• We present a pilot study into using the Unified Form Language (UFL) to generate CUDA code.
• Why bother with Part 1?
  • To prove we can speed up assembly using GPUs
  • To provide a guide for the output we expect from the compiler
  • To experiment with different performance optimisations
• Part 1:
  • Nvidia Tesla architecture & CUDA
  • Test problems
  • Translation methodology
  • Performance optimisations
  • Performance results
• Part 2:
  • UFL
  • UFL compiler design
  • Test results
  • Discussion
NVIDIA Tesla Architecture & CUDA
• GT200 architecture: 1-4GiB RAM
• For high performance:
  • Coalescing
  • Use many threads (10000+)
  • Caches: texture cache (read-only), shared memory
[Diagram: coalescing – the data transfers of the 16 threads of a half-warp are combined when their accesses fall within a 64B window]
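As a concrete illustration of the coalescing rule in the diagram, a minimal kernel (not from the slides, purely illustrative): when consecutive threads access consecutive addresses, each half-warp's loads and stores fall into a contiguous window and are combined into a small number of memory transactions.

    __global__ void copy_coalesced(const double* in, double* out, int n)
    {
        // Consecutive threads access consecutive elements, so the half-warp's
        // loads and stores coalesce; a strided pattern (e.g. in[i * stride])
        // would instead cost roughly one transaction per thread.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }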
The Test Problems
• Test_laplacian: solves ∆u = f on the unit square
  • Analytical solution: allows us to examine the accuracy of the computed solution
• Test_advection_diffusion: more representative
  • Time-dependent, nonlinear
  • Multiple assemble/solve iterations
From Fortran to CUDA (Test_laplacian)
• Assembly loop in Fortran (one element at a time):

    do ele = 1, num_ele
      call assemble(ele, A, b)
    end do
    call petsc_solve(x, A, b)

• Assembly loop in CUDA (all elements at once):

    call gpu_assemble()
    call gpucg_solve(x)
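A minimal sketch of how the element loop maps onto the GPU, with one thread per element. The kernel below is illustrative only: names such as assemble_elements and local_matrices are assumptions rather than the generated code, and the local integration is left as a placeholder. The computed local matrices are scattered into the global matrix by a separate addto kernel, discussed later.

    __global__ void assemble_elements(const double* node_coords, const int* ele_nodes,
                                      double* local_matrices, int num_ele)
    {
        int ele = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (ele < num_ele) {
            // Compute the 3x3 local matrix for this linear triangular element.
            // The real kernel evaluates basis-function derivatives and quadrature
            // here; a zero placeholder keeps the sketch short.
            for (int x = 0; x < 3; x++)
                for (int y = 0; y < 3; y++)
                    local_matrices[ele*9 + x*3 + y] = 0.0;
        }
    }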
From Fortran to CUDA (Adv.-Diff.)
• Original:
  • Assemble
  • Solve
  • Output of each solve is the input to the next assemble
• CUDA: avoid transferring the solution at every iteration
  • Upload initial conditions
  • Iterate
  • Transfer the solution back only when required
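A minimal host-side sketch of this flow, using hypothetical names (run_timesteps, d_tracer, num_timesteps, output_interval); the point is simply that the fields stay resident on the GPU and the solution is copied back only when it is needed.

    void gpu_assemble();                 // assembly phase, runs on the GPU
    void gpucg_solve(double* d_x);       // CG solve, runs on the GPU

    void run_timesteps(double* h_tracer, double* d_tracer, int n,
                       int num_timesteps, int output_interval)
    {
        // Upload initial conditions once.
        cudaMemcpy(d_tracer, h_tracer, n * sizeof(double), cudaMemcpyHostToDevice);

        for (int step = 1; step <= num_timesteps; step++) {
            gpu_assemble();                  // assemble using device-resident fields
            gpucg_solve(d_tracer);           // solution stays on the GPU
            if (step % output_interval == 0) // transfer back only when required
                cudaMemcpy(h_tracer, d_tracer, n * sizeof(double), cudaMemcpyDeviceToHost);
        }
    }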
Performance Optimisations
• Coalescing: maximise memory bandwidth
  [Diagram: nodal coordinates stored per element as (x1,y1), (x2,y2), (x3,y3) are rearranged so that the same component (x1, y1, x2, ...) for elements 1..n is contiguous, letting thread-per-element accesses coalesce]
• Specialisation of kernels (reduced register usage), e.g. replacing

    for (int x = 0; x < nodes; x++) {
      for (int y = 0; y < nodes; y++) {
        ...;
      }
    }

  with the specialised

    for (int x = 0; x < 3; x++) {
      for (int y = 0; y < 3; y++) {
        ...;
      }
    }

• Texture memory for the matrix sparsity (row_ptr, col_idx, val)
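A sketch of the texture-memory idea for the sparsity, using the legacy CUDA texture-reference API current at the time; the names (tex_row_ptr, tex_col_idx, find_entry) are illustrative, not the generated code. The CSR arrays are read-only during assembly, so fetching them through the texture cache avoids repeated uncached trips to global memory.

    texture<int, 1, cudaReadModeElementType> tex_row_ptr;   // CSR row pointers
    texture<int, 1, cudaReadModeElementType> tex_col_idx;   // CSR column indices

    // Locate the offset of entry (row, col) in the val array, reading the
    // sparsity through the texture cache.
    __device__ int find_entry(int row, int col)
    {
        int start = tex1Dfetch(tex_row_ptr, row);
        int end   = tex1Dfetch(tex_row_ptr, row + 1);
        for (int k = start; k < end; k++)
            if (tex1Dfetch(tex_col_idx, k) == col)
                return k;
        return -1;   // entry not in the sparsity pattern
    }

    // Host side, before launching the assembly kernels:
    //   cudaBindTexture(0, tex_row_ptr, d_row_ptr, (num_rows + 1) * sizeof(int));
    //   cudaBindTexture(0, tex_col_idx, d_col_idx, nnz * sizeof(int));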
Performance Results
• For the Advection-Diffusion problem. Test setup:
  • Nvidia 280GTX – 1GB RAM (use a Tesla C1060 for 4GB)
  • Intel Core 2 Duo E8400 @ 3.00GHz, 2GB RAM in the host machine
  • Intel C++ and Fortran compilers V10.1 (V11.0 suffers from bugs and cannot compile Fluidity)
  • CPU implementations compiled with -O3; CUDA implementation compiled with NVCC 2.2
• Methodology:
  • Run the problem for 200 timesteps
  • Increasingly fine meshes (increasing element count)
  • Five runs of each problem; averages reported
  • Double precision computations
Proportion of GPU Time in each Kernel
• Which kernels should we focus on optimising?
• Addto kernels: 84% of execution time
The Impact of Atomic Operations
• Colouring optimisation on GPUs: [1]

[1] D. Komatitsch, D. Michea and G. Erlebacher. Porting a high-order finite-element earthquake modelling application to Nvidia graphics cards using CUDA. J. Par. Dist. Comp., 69(5):451-460, 2009.
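To make the cost concrete: several elements can contribute to the same global matrix entry, so the addto scatter needs either atomic updates or element colouring [1]. Below is a minimal sketch of the atomic variant; entry_offsets is a hypothetical precomputed map from contributions to positions in the CSR val array, and because GT200 has no native double-precision atomicAdd, the standard CAS-based workaround is shown.

    // Double-precision atomic add emulated with atomicCAS (the standard CUDA idiom).
    __device__ double atomicAddDouble(double* address, double val)
    {
        unsigned long long int* address_as_ull = (unsigned long long int*)address;
        unsigned long long int old = *address_as_ull, assumed;
        do {
            assumed = old;
            old = atomicCAS(address_as_ull, assumed,
                            __double_as_longlong(val + __longlong_as_double(assumed)));
        } while (assumed != old);
        return __longlong_as_double(old);
    }

    __global__ void matrix_addto(double* val, const double* lmat,
                                 const int* entry_offsets, int num_contributions)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < num_contributions)
            // Without the atomic, two elements writing the same entry would race.
            atomicAddDouble(&val[entry_offsets[i]], lmat[i]);
    }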
Summary of Part 1
• 8x speedup over 2 CPU cores for assembly
• 6x speedup overall
• Further performance gains expected from:
  • Colouring elements & non-atomic ops [1]
  • Alternative matrix storage formats
  • Fusing kernels [2], mesh partitioning [3]

[2] J. Filipovic, I. Peterlik and J. Fousek. GPU Acceleration of Equations Assembly in Finite Elements Method – Preliminary Results. In SAAHPC: Symposium on Application Accelerators in HPC, July 2009.
[3] A. Klockner, T. Warburton, J. Bridge and J. S. Hesthaven. Nodal Discontinuous Galerkin methods on graphics processors. Journal of Computational Physics, in press, 2009.
Part 2: A UFL [4] Example (Laplacian)
• Solving the Laplacian test problem from Part 1: ∆u = f
• Weak form (ignoring boundaries): ∫ ∇v·∇u dx = ∫ v f dx
• Close to mathematical notation; no implementation details
• Allows code generation for multiple backends, and the choice of optimisations to be explored

    Psi = state.scalar_fields("psi")
    v = TestFunction(Psi)
    u = TrialFunction(Psi)
    f = Function(Psi, "sin(x[0])+cos(x[1])")
    A = dot(grad(v), grad(u))*dx
    RHS = v*f*dx
    Solve(Psi, A, RHS)

[4] M. Alnaes and A. Logg. Unified Form Language Specification and User's Manual. http://www.fenics.org/pub/documents/ufl/ufl-user-manual/ufl-user-manual.pdf. Retrieved 15 Sep 2009.
From UFL to CUDA
• We "parse" UFL using the ufl.algorithms package
• This leads to the creation of a DAG representing the assembly (similar for the RHS)

[Diagram: UFL source → frontend → intermediate representation (IR) → backend → CUDA]
Testing
• Frontend input:

    psi = state.scalar_fields("psi")
    v = TestFunction(P)
    u = TrialFunction(P)
    f = Function(P)
    f.name = "shape_rhs"
    A = dot(grad(v), grad(u))*dx
    solve(P, A, f)

• Backend (example):

    stringList *params = new stringList();
    (*params).push_back(string("val"));
    (*params).push_back(string("size_val"));
    (*params).push_back(string("ele_psi"));
    (*params).push_back(string("lmat"));
    (*params).push_back(string("n"));
    launchList.push_back(kernelLaunch("matrix_addto", params));

[Comparison: hand translation vs. generated code]
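The backend snippet above uses stringList, kernelLaunch and launchList without showing their definitions. A hypothetical set of declarations consistent with that usage might look like the following; the compiler's real IR classes are not shown on the slides.

    #include <list>
    #include <string>
    #include <vector>

    typedef std::list<std::string> stringList;

    // One entry per CUDA kernel launch that the generated code must perform.
    struct kernelLaunch {
        std::string kernel;       // kernel name, e.g. "matrix_addto"
        stringList  parameters;   // names of the variables passed as arguments
        kernelLaunch(const std::string& k, stringList* p)
            : kernel(k), parameters(*p) {}
    };

    std::vector<kernelLaunch> launchList;   // ordered launches forming the assembly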
Testing - continued
• Helmholtz equation
• Weak form, as expressed in UFL: ∫ ∇v·∇u + 20 v u dx

    A = (dot(grad(v), grad(u)) + (20)*dot(v, u))*dx

• Adds extra calls to shape_shape and matrix_addto
[Figure: FEniCS Dolfin solution compared with the generated-code solution]
Conclusions
• We obtain speedups of 8x over a 2-core CPU in the assembly phase using CUDA
• An overall speedup of 6x over 2 cores
• Generation of CUDA code from UFL source is feasible
• UFL is "future proof"
• UFL is easier to use than CUDA, Fortran, etc.
• Allows automated exploration of optimisations
• Other backends (Cell, multicore CPU, Larrabee, etc.) should be possible
Further work
• On the UFL compiler:
  • Support for a more complete subset of UFL
  • Development of a more expressive intermediate representation (facilitates the development of other backends)
  • Generation of kernels from the IR
  • Automatic tuning
• On the conjugate gradient solver:
  • Integration with a blocked SpMV implementation [5]; further performance improvements expected

[5] A. Monakov and A. Avetisyan. Implementing Blocked Sparse Matrix-Vector Multiplication on Nvidia GPUs. In SAMOS IX: International Symposium on Systems, Architectures, Modeling and Simulation, July 2009.
Test Advection Diffusion UFL
Advection:

    T=state.scalar_fields(Tracer)
    U=state.vector_fields(Velocity)
    UNew=state.vector_fields(NewVelocity)

    # We are solving for the Tracer, T.
    t=Function(T)
    p=TrialFunction(T)
    q=TestFunction(T)

    # The value of the advecting velocity U is known.
    u=Function(U)
    unew=Function(UNew)

    # Mass matrix.
    M=p*q*dx

    # Solve for T1-T4.
    rhs=dt*dot(grad(q),u)*t*dx
    t1=solve(M,rhs)
    rhs=dt*dot(grad(q),(0.5*u+0.5*unew))*(t+0.5*t1)*dx
    t2=solve(M,rhs)
    rhs=dt*dot(grad(q),(0.5*u+0.5*unew))*(t+0.5*t2)*dx
    t3=solve(M,rhs)

    # Solve for T at the next time step.
    rhs=action(M,t) + 1.0/6.0*t1 + 1.0/3.0*t2 + 1.0/3.0*t3 + 1.0/6.0*t4
    t=solve(M,t)

Diffusion:

    mu=state.tensor_fields(TracerDiffusivity)
    i,j=indices(2)
    M=p*q*dx
    d=-grad(q)[i]*mu[i,j]*grad(p)[j]*dx
    A=m-0.5*d
    rhs=action(M+0.5*d,t)
    T=solve(A,rhs)
Memory Bandwidth Utilisation
[Chart: orange = using atomic operations, blue = using non-atomic operations]
Proportion of GPU Time in each Kernel
[Chart: orange = using atomic operations, blue = using non-atomic operations]
Code Generation
• A list of variables, kernels and parameters is passed to the backend
• Built using the ROSE Compiler Infrastructure [6]
• CUDA keywords (__global__, <<<...>>> notation) are inserted as arbitrary strings
• Structure of the generated gpu_assemble.cu (AST):
  • Declarations: int, double, ...
  • Initialisation: cudaMalloc(), cudaBindTexture(), cudaMemcpy()
  • Assembly: kernel<<<...>>>()
  • Streaming: cudaMemcpy()
  • Finalisation: cudaFree(), cudaUnbindTexture()
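A hedged sketch of what a file with the structure listed above might look like; the kernel body, variable names and launch sizes are illustrative stand-ins rather than the compiler's actual output.

    #include <cuda_runtime.h>

    // Declarations: textures, device pointers, scalars, ...
    texture<int, 1, cudaReadModeElementType> tex_col_idx;

    __global__ void assembly_kernel(double* val, int nnz)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nnz)
            val[i] = 0.0;                    // placeholder for the generated assembly
    }

    void gpu_assemble(double* h_val, const int* h_col_idx, int nnz)
    {
        double* d_val;
        int*    d_col_idx;

        // Initialisation: cudaMalloc(), cudaBindTexture(), cudaMemcpy()
        cudaMalloc((void**)&d_val,     nnz * sizeof(double));
        cudaMalloc((void**)&d_col_idx, nnz * sizeof(int));
        cudaMemcpy(d_col_idx, h_col_idx, nnz * sizeof(int), cudaMemcpyHostToDevice);
        cudaBindTexture(0, tex_col_idx, d_col_idx, nnz * sizeof(int));

        // Assembly: kernel<<<...>>>()
        assembly_kernel<<<(nnz + 255) / 256, 256>>>(d_val, nnz);

        // Streaming: cudaMemcpy() of results back to the host when required
        cudaMemcpy(h_val, d_val, nnz * sizeof(double), cudaMemcpyDeviceToHost);

        // Finalisation: cudaUnbindTexture(), cudaFree()
        cudaUnbindTexture(tex_col_idx);
        cudaFree(d_val);
        cudaFree(d_col_idx);
    }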
NVIDIA Tesla Architecture & CUDA
• GT200 architecture: 10 TPCs, 8 banks of DRAM (1-4GiB)
• y = αx + y in C:

    void daxpy(double a, double* x, double* y, int n)
    {
      for (int i = 0; i < n; i++)
        y[i] = y[i] + a*x[i];
    }

• CUDA kernel:

    __global__ void daxpy(double a, double* x, double* y, int n)
    {
      // T_ID: this thread's global index; T_COUNT: total number of threads.
      // Each thread strides through the array, so any n can be handled.
      for (int i = T_ID; i < n; i += T_COUNT)
        y[i] = y[i] + a*x[i];
    }
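T_ID and T_COUNT are shorthand on the slide; a common way to define them, and to launch the kernel, is shown below (an assumption, not taken from the slide).

    #define T_ID    (blockIdx.x * blockDim.x + threadIdx.x)   // global thread index
    #define T_COUNT (gridDim.x * blockDim.x)                   // total thread count

    // daxpy<<<128, 256>>>(a, d_x, d_y, n);   // the grid-stride loop covers any n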
Variable naming
• How do we ensure the output of one kernel is correctly input to successive kernels? Invent names consistently.
• Example: one kernel's output dshape_psi is picked up by name as the next kernel's input dshape_psi; likewise lmat_psi_psi is the output of one kernel and the input of the next.

    Psi = state.scalar_fields("psi")
    v = TestFunction(Psi)
    u = TrialFunction(Psi)
    f = Function(Psi, "sin(x[0])+cos(x[1])")
    A = dot(grad(v), grad(u))*dx
    RHS = v*f*dx
    Solve(Psi, A, RHS)