Accelerating SYMV kernel on GPUs Ahmad M Ahmad, AMCS Division, KAUST ahmad.ahmad@kaust.edu.sa
Agenda • Motivation • GPU Technology • GPU Optimization issues • MAGMA SYMV kernel • The new SYMV Kernel • Performance Results • What Helped us? • Future Work
Motivation • GPUs are invading the HPC community. • Many cores (~512) on a single GPU card. • Best suited for massively (embarrassingly) parallel problems. • Unlike CPUs, GPUs dedicate more silicon to floating-point operations. • Unlike CPUs, GPUs consume much less power. • Three of the top 5 supercomputers are heterogeneous (CPUs + GPUs). • The world's biggest supercomputer to be built will have 18,000 GPUs. • Getting high performance out of a GPU, however, is quite a challenge.
GPU Technology (Fermi) [Diagram: several SMs sharing an L2 cache and DRAM]
GPU Technology (Fermi) • For each SM: • 32 cores • 64 KB L1 cache/shared memory • 16 LD/ST units • 4 SFUs • 32768 registers (32-bit)
GPU Technology (Fermi) • Fermi is the first GPU in the world with a complete memory hierarchy • (registers, L1 cache/shared memory, L2 cache, DRAM) • Fermi is the first GPU with ECC support. • Fermi theoretical peak performance: • 1 Tflop/s (single precision) • ~500 Gflop/s (double precision)
GPU Technology • Why is it tough? Let's take a look at the programming model… • A user program is designed as a grid of computation blocks. • Each block occupies one SM and has dedicated local memory. • Blocks share the L2 cache and global memory.
GPU Technology • Why is it tough? Let's take a look at the programming model… • A single computation block, commonly known as a thread block, is divided into threads in 1D, 2D, or 3D arrays. • Threads are executed in warps (groups of 32); see the sketch below.
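To make the grid/block/thread hierarchy concrete, here is a minimal CUDA example (not part of the SYMV code; the kernel name `scale` and all parameters are illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element of a vector.
__global__ void scale(float *x, float alpha, int n)
{
    // Global thread index: block index * block size + thread index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the grid may overshoot n
        x[i] *= alpha;
}

int main()
{
    int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    dim3 block(256);                        // 256 threads = 8 warps per block
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n
    scale<<<grid, block>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```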
GPU Optimization Issues • General • Load balancing between computation blocks. • Data caching for reused data. • Data prefetching (to mask memory latency). • Avoid going to SLOW global memory as much as possible. • Coalesced memory access (per warp); see the sketch below. • GPU Specific • Avoid shared memory bank conflicts. • Avoid divergent branches (within the same warp). • Avoid using many registers per thread (max 63 in Fermi). • Wisely use SM resources to increase occupancy (since one SM can host more than one computation block simultaneously).
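To illustrate the coalescing rule above, a minimal sketch (the kernel `read_column` is hypothetical, assuming a column-major matrix with leading dimension `lda`):

```cuda
// Sketch: per-warp coalescing for a column-major matrix A (lda >= m).
// Consecutive threads read consecutive rows of one column, so each warp
// issues one contiguous (coalesced) memory transaction.
__global__ void read_column(const float *A, float *out, int m, int lda, int col)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m)
        out[row] = A[col * lda + row];  // coalesced: stride-1 across threads
    // By contrast, indexing A[row * lda + col] would give each thread a
    // stride of lda, splitting the warp's load into many transactions.
}
```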
The SYMV Kernel • A level-2 BLAS kernel • Computes: Y = α × A × X + β × Y • A is a symmetric matrix (S-D-C-Z) • X and Y are vectors • α and β are scalars • Only the lower/upper triangle of A should be referenced. • The matrix-vector multiplication involves data reuse in the vector X only. • No data reuse can be exploited for the elements of matrix A (except for symmetry); see the reference sketch below.
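To pin down the operation, a minimal host-side reference implementation that touches only the lower triangle — purely illustrative (the function name and single-precision choice are assumptions), not the GPU kernel:

```c
// Reference SYMV (lower triangle referenced only): y = alpha*A*x + beta*y.
// A is n-by-n symmetric, column-major with leading dimension lda.
void symv_lower_ref(int n, float alpha, const float *A, int lda,
                    const float *x, float beta, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] *= beta;
    for (int j = 0; j < n; j++) {
        y[j] += alpha * A[j * lda + j] * x[j];   // diagonal element
        for (int i = j + 1; i < n; i++) {        // strictly lower part
            float a = A[j * lda + i];            // A(i,j), i > j
            y[i] += alpha * a * x[j];            // non-transposed use
            y[j] += alpha * a * x[i];            // transposed use (symmetry)
        }
    }
}
```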
MAGMA SYMV Kernel (SC’11 paper) • Main ideas • The matrix is divided into 64×64 sub-matrices. • Each computation block is responsible for one horizontal row of sub-matrices. • A computation block starts with the diagonal sub-matrix of its assigned row. • Non-diagonal sub-matrices are visited twice: • once as the non-transposed sub-matrix; • a second time as the transposed sub-matrix, to exploit symmetry. • Recursive Blocking • Used to save shared memory. • Each sub-matrix is processed in 32×32 chunks. • Pointer Redirecting • Used to handle matrix dimensions that are not multiples of 64.
MAGMA SYMV Kernel [Diagram: partial results are spilled to global memory for other blocks; reduction happens through global memory (for contributions computed by other blocks) and through shared memory/registers.]
Main Ideas of our Design • Same 64×64 block size as MAGMA. • Diagonal blocks are isolated from non-diagonal ones. • Each computation block is responsible for one vertical column of sub-matrices, offering better locality for the column-major format. • No Recursive Blocking • Fermi has enough shared memory (up to 48 KB). • Allows more efficient data prefetching (in diagonal sub-matrices). • Shared memory usage is restricted to the reduction operation only • In Fermi, shared memory latency is high (compared to previous GPUs). • In MAGMA, shared memory is used for the reduction as well as for storing partial results. • In the new design, partial results are accumulated in registers first and spilled once to shared memory for the reduction (see the sketch below).
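A minimal sketch of the register-accumulate-then-reduce pattern just described (illustrative names; the real kernel's accumulation loop over tiles is elided):

```cuda
// Each thread accumulates its partial dot products in a register, spills
// once to shared memory, and the block reduces. Launch with <<<N, BLK>>>.
#define BLK 64

__global__ void partial_reduce_sketch(const float *partials, float *y)
{
    __shared__ float sdata[BLK];

    float acc = 0.0f;
    // ... in the real kernel, acc would accumulate products of matrix
    //     elements and x over the tiles assigned to this thread ...
    acc += partials[blockIdx.x * BLK + threadIdx.x];

    sdata[threadIdx.x] = acc;   // single spill to shared memory
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = BLK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        y[blockIdx.x] = sdata[0];
}
```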
The new SYMV kernel [Diagram: partial results spilled to global memory for other blocks; reduction through global memory (for contributions computed by other blocks) and through shared memory/registers.]
Experiments • The new kernel • was written in CUDA C ver. 4.0; • was integrated into MAGMA BLAS for testing; • is, so far, designed for matrix dimensions that are multiples of 64 — we plan to use either pointer redirecting (as in MAGMA) or padding (easier to implement for a fast release; see the sketch below); • was tested on a Fermi (Tesla C2070) GPU with 6 GB of memory.
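A minimal sketch of the padding option mentioned above, assuming a square column-major matrix (the names `pad64` and `alloc_padded` are illustrative):

```c
// Round the dimension up to the next multiple of 64 so the kernel never
// sees a partial tile; zeros in the padding leave the SYMV result intact.
#include <cuda_runtime.h>

int pad64(int n) { return (n + 63) & ~63; }   // next multiple of 64

float *alloc_padded(int n, int *lda_out)
{
    int lda = pad64(n);
    float *d_A;
    cudaMalloc(&d_A, (size_t)lda * lda * sizeof(float));
    cudaMemset(d_A, 0, (size_t)lda * lda * sizeof(float)); // zero the padding
    *lda_out = lda;
    return d_A;   // caller copies the n-by-n matrix into the leading part
}
```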
What helped us? • PAPI CUDA Component • Extracts performance counters during kernel execution. • Really easy to use (even for a first-time user)! • Mainly used to identify where improvements were possible: • Shared memory bank conflicts • Global memory misses (loads/stores) • Divergent branches • Local memory usage
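A minimal sketch of measuring one counter with the PAPI CUDA component (the native event name shown is an assumption — actual names depend on the GPU and CUDA version and can be listed with `papi_native_avail`):

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int es = PAPI_NULL;
    long long count = 0;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    // Hypothetical native event for shared-memory bank conflicts:
    PAPI_add_named_event(es, "cuda:::device:0:l1_shared_bank_conflict");

    PAPI_start(es);
    // ... launch the SYMV kernel and synchronize here ...
    PAPI_stop(es, &count);

    printf("bank conflicts: %lld\n", count);
    return 0;
}
```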
What helped us? (cont.) • NVIDIA compute profiler • Extracts information that is unavailable or hard to get through the PAPI CUDA component: • Registers per thread • GPU time • Occupancy analysis • Kernel memory bandwidth
Future Work • The distribution of work among computation blocks is not balanced. • Balancing the load may lead to further improvement, but locality will not be exploited. • A 1D block-cyclic assignment is intended (see the sketch below). [Diagram: 1D block-cyclic assignment of tile-columns to computation blocks 0–4]
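A minimal sketch of the intended 1D block-cyclic mapping (illustrative, not the final kernel):

```cuda
// With a 1D block-cyclic assignment, computation block b processes
// tile-columns b, b + gridDim.x, b + 2*gridDim.x, ... so the long and
// short columns of the lower triangle are interleaved across blocks.
__global__ void symv_columns_cyclic_sketch(int ntiles)
{
    for (int c = blockIdx.x; c < ntiles; c += gridDim.x) {
        // ... process tile-column c: the diagonal tile plus the tiles
        //     below it, as in the current per-column design ...
    }
}
```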
Credits • Rajib Nath (University of California, San Diego) • Fruitful discussions about the design of the MAGMA SYMV kernel. • Guidelines for possible improvements. • Heike Jagode (UTK) • Guidelines for the installation and usage of PAPI.