Accelerating SYMV kernel on GPUs Ahmad M Ahmad, AMCS Division, KAUST ahmad.ahmad@kaust.edu.sa
Agenda • Motivation • GPU Technology • GPU Optimization issues • MAGMA SYMV kernel • The new SYMV Kernel • Performance Results • What Helped us? • Future Work
Motivation • GPUs are invading the HPC community. • Many cores (~512) on a single GPU card. • Best suited for massively (embarrassingly) parallel problems. • Unlike CPUs, GPUs dedicate more silicon to floating-point operations. • Unlike CPUs, GPUs consume much less power. • Three of the top 5 supercomputers are heterogeneous (CPUs + GPUs). • The world's biggest supercomputer to be built will have 18,000 GPUs. • Getting high performance out of a GPU, however, is quite a challenge.
GPU Technology (Fermi) [Diagram: several SMs sharing an L2 cache and DRAM]
GPU Technology (Fermi) • For each SM: • 32 cores • 64 KB L1 cache/shared memory • 16 LD/ST units • 4 SFUs • 32768 registers (32-bit)
GPU Technology (Fermi) • Fermi is the first GPU in the world with a complete memory hierarchy • (registers, L1 cache/shared memory, L2 cache, DRAM) • Fermi is the first GPU with ECC support. • Fermi theoretical peak performance: • 1 Tflop/s (single precision) • ~500 Gflop/s (double precision)
GPU Technology • Why is it tough? Let's take a look at the programming model… • A user program is designed as a grid of computation blocks. • Each block occupies one SM and has dedicated local memory. • Blocks share the L2 cache and global memory.
GPU Technology • Why is it tough? Let's take a look at the programming model… • A single computation block, commonly known as a thread block, is divided into threads in 1D, 2D, or 3D arrays. • Threads are executed in warps (groups of 32); see the sketch below.
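To make the grid/block/thread hierarchy concrete, here is a minimal CUDA example (not part of the SYMV code; the kernel name `scale` and all parameters are illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element of a vector.
__global__ void scale(float *x, float alpha, int n)
{
    // Global thread index: block index * block size + thread index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the grid may overshoot n
        x[i] *= alpha;
}

int main()
{
    int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    dim3 block(256);                        // 256 threads = 8 warps per block
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n
    scale<<<grid, block>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    return 0;
}
```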
GPU Optimization Issues • General • Load balancing between computation blocks. • Data caching for reused data. • Data prefetching (to mask memory latency). • Avoid going to SLOW global memory as much as possible. • Coalesced memory access (per warp); see the sketch below. • GPU Specific • Avoid shared memory bank conflicts. • Avoid divergent branches (within the same warp). • Avoid using many registers per thread (max 63 in Fermi). • Wisely use SM resources to increase occupancy (since one SM can host more than one computation block simultaneously).
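To illustrate the coalescing rule above, a minimal sketch (the kernel `read_column` is hypothetical, assuming a column-major matrix with leading dimension `lda`):

```cuda
// Sketch: per-warp coalescing for a column-major matrix A (lda >= m).
// Consecutive threads read consecutive rows of one column, so each warp
// issues one contiguous (coalesced) memory transaction.
__global__ void read_column(const float *A, float *out, int m, int lda, int col)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m)
        out[row] = A[col * lda + row];  // coalesced: stride-1 across threads
    // By contrast, indexing A[row * lda + col] would give each thread a
    // stride of lda, splitting the warp's load into many transactions.
}
```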
The SYMV Kernel • A level-2 BLAS kernel • Computes: Y = α × A × X + β × Y • A is a symmetric matrix (S-D-C-Z) • X and Y are vectors • α and β are scalars • Only the lower/upper triangle of A should be referenced. • The matrix-vector multiplication involves data reuse in the vector X only. • No data reuse can be exploited for the elements of matrix A (except for symmetry); see the reference sketch below.
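To pin down the operation, a minimal host-side reference implementation that touches only the lower triangle — purely illustrative (the function name and single-precision choice are assumptions), not the GPU kernel:

```c
// Reference SYMV (lower triangle referenced only): y = alpha*A*x + beta*y.
// A is n-by-n symmetric, column-major with leading dimension lda.
void symv_lower_ref(int n, float alpha, const float *A, int lda,
                    const float *x, float beta, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] *= beta;
    for (int j = 0; j < n; j++) {
        y[j] += alpha * A[j * lda + j] * x[j];   // diagonal element
        for (int i = j + 1; i < n; i++) {        // strictly lower part
            float a = A[j * lda + i];            // A(i,j), i > j
            y[i] += alpha * a * x[j];            // non-transposed use
            y[j] += alpha * a * x[i];            // transposed use (symmetry)
        }
    }
}
```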
MAGMA SYMV Kernel (SC’11 paper) • Main ideas • The matrix is divided into 64×64 sub-matrices. • Each computation block is responsible for one horizontal row of sub-matrices. • A computation block starts with the diagonal sub-matrix of its assigned row. • Non-diagonal sub-matrices are visited twice: • once as the non-transposed sub-matrix; • a second time as the transposed sub-matrix, to exploit symmetry. • Recursive Blocking • Used to save shared memory. • Each sub-matrix is processed in 32×32 chunks. • Pointer Redirecting • Used to handle matrix dimensions that are not multiples of 64.
MAGMA SYMV Kernel [Diagram: partial results are spilled to global memory for other blocks; reduction happens through global memory (for contributions computed by other blocks) and through shared memory/registers.]
Main Ideas of our Design • Same 64×64 block size as MAGMA. • Diagonal blocks are isolated from non-diagonal ones. • Each computation block is responsible for one vertical column of sub-matrices, offering better locality for the column-major format. • No Recursive Blocking • Fermi has enough shared memory (up to 48 KB). • Allows more efficient data prefetching (in diagonal sub-matrices). • Shared memory usage is restricted to the reduction operation only • In Fermi, shared memory latency is high (compared to previous GPUs). • In MAGMA, shared memory is used for the reduction as well as for storing partial results. • In the new design, partial results are accumulated in registers first and spilled once to shared memory for the reduction (see the sketch below).
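A minimal sketch of the register-accumulate-then-reduce pattern just described (illustrative names; the real kernel's accumulation loop over tiles is elided):

```cuda
// Each thread accumulates its partial dot products in a register, spills
// once to shared memory, and the block reduces. Launch with <<<N, BLK>>>.
#define BLK 64

__global__ void partial_reduce_sketch(const float *partials, float *y)
{
    __shared__ float sdata[BLK];

    float acc = 0.0f;
    // ... in the real kernel, acc would accumulate products of matrix
    //     elements and x over the tiles assigned to this thread ...
    acc += partials[blockIdx.x * BLK + threadIdx.x];

    sdata[threadIdx.x] = acc;   // single spill to shared memory
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = BLK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        y[blockIdx.x] = sdata[0];
}
```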
The new SYMV kernel [Diagram: partial results spilled to global memory for other blocks; reduction through global memory (for contributions computed by other blocks) and through shared memory/registers.]
Experiments • The new kernel • was written in CUDA C ver. 4.0; • was integrated into MAGMA BLAS for testing; • is, so far, designed for matrix dimensions that are multiples of 64 — we plan to use either pointer redirecting (as in MAGMA) or padding (easier to implement for a fast release; see the sketch below); • was tested on a Fermi (Tesla C2070) GPU with 6 GB of memory.
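A minimal sketch of the padding option mentioned above, assuming a square column-major matrix (the names `pad64` and `alloc_padded` are illustrative):

```c
// Round the dimension up to the next multiple of 64 so the kernel never
// sees a partial tile; zeros in the padding leave the SYMV result intact.
#include <cuda_runtime.h>

int pad64(int n) { return (n + 63) & ~63; }   // next multiple of 64

float *alloc_padded(int n, int *lda_out)
{
    int lda = pad64(n);
    float *d_A;
    cudaMalloc(&d_A, (size_t)lda * lda * sizeof(float));
    cudaMemset(d_A, 0, (size_t)lda * lda * sizeof(float)); // zero the padding
    *lda_out = lda;
    return d_A;   // caller copies the n-by-n matrix into the leading part
}
```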
What helped us? • PAPI CUDA Component • Extracts performance counters during kernel execution. • Really easy to use (even for a first-time user)! • Mainly used to identify where improvements were possible: • Shared memory bank conflicts • Global memory misses (loads/stores) • Divergent branches • Local memory usage
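A minimal sketch of measuring one counter with the PAPI CUDA component (the native event name shown is an assumption — actual names depend on the GPU and CUDA version and can be listed with `papi_native_avail`):

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int es = PAPI_NULL;
    long long count = 0;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    // Hypothetical native event for shared-memory bank conflicts:
    PAPI_add_named_event(es, "cuda:::device:0:l1_shared_bank_conflict");

    PAPI_start(es);
    // ... launch the SYMV kernel and synchronize here ...
    PAPI_stop(es, &count);

    printf("bank conflicts: %lld\n", count);
    return 0;
}
```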
What helped us? (cont.) • NVIDIA compute profiler • Extracts information that is unavailable or hard to get through the PAPI CUDA component: • Registers per thread • GPU time • Occupancy analysis • Kernel memory bandwidth
Future Work • The distribution of work among computation blocks is not balanced. • Balancing the load may lead to further improvement, but locality will not be exploited. • A 1D block-cyclic assignment is intended (see the sketch below). [Diagram: 1D block-cyclic assignment of tile-columns to computation blocks 0–4]
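A minimal sketch of the intended 1D block-cyclic mapping (illustrative, not the final kernel):

```cuda
// With a 1D block-cyclic assignment, computation block b processes
// tile-columns b, b + gridDim.x, b + 2*gridDim.x, ... so the long and
// short columns of the lower triangle are interleaved across blocks.
__global__ void symv_columns_cyclic_sketch(int ntiles)
{
    for (int c = blockIdx.x; c < ntiles; c += gridDim.x) {
        // ... process tile-column c: the diagonal tile plus the tiles
        //     below it, as in the current per-column design ...
    }
}
```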
Credits • Rajib Nath (University of California, San Diego) • Fruitful discussions about the design of the MAGMA SYMV kernel. • Guidelines for possible improvements. • Heike Jagode (UTK) • Guidelines for the installation and usage of PAPI.