This presentation explores the SIMD architecture of GPUs and discusses performance optimization techniques for linear algebra computations. It covers the vector thread programming model, register windows, cache usage, memory latency hiding, and the GPU memory system, and includes performance analyses of matrix multiplication (SGEMM), panel factorization, and tridiagonal eigenvalue solvers on GPUs.
Linear Algebra on GPUs
Vasily Volkov
GPU Architecture Features
• SIMD architecture
  • Don't be confused by the scalar ISA, which is only a programming model
  • We use a vector thread programming model instead (illustrated after this list)
  • Similar to SSE/Cell SPE/vector units, not superscalar units
  • Native vector length is 32; can simulate larger (thread blocks)
• Multithreading using register windows
  • A context switch never flushes registers to memory
  • If more threads than can fit, some don't start until others finish
• Huge register files, small high-latency caches
  • Fundamental difference from CPUs; similar to vector processors
  • Cache is there to save memory bandwidth, not to reduce latency
  • Use register blocking, not cache blocking
• Large startup costs (≈5 μs)
  • Not good for fine-grain computations
  • Use global barriers instead? (≈1.5 μs)
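To make the vector-thread reading concrete, here is a minimal illustration (an assumption-laden sketch, not from the slides): the scalar-looking CUDA code is executed by a 32-thread warp in lockstep, i.e. by one vector thread of native length 32, and the thread block simulates a longer vector.

```cuda
// Illustrative sketch: although the ISA looks scalar, the body below is run by
// 32 threads in lockstep -- one "vector thread" of native length 32. A block of
// 128 threads (4 warps) simulates a vector length of 128.
__global__ void saxpy_vec(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one vector lane per scalar thread
    if (i < n)
        y[i] = a * x[i] + y[i];                      // one 32-wide vector op per warp
}
```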
Relation to Other Multicores
• All offer both multithreading and SIMD capabilities
• Use CUDA to program all of them?
Pointer Chase Benchmark, 8800GTX
• Run k = A[k] in a loop, in a single scalar thread (sketched below)
• Latency bound; larger latency at a cache miss
• Reveals cache sizes:
  • 8-word cache line
  • L1: 8 × 5 kB, each 20-way associative
  • L2: 6 × 32 kB, each 4-way associative
  • 512 kB memory pages
  • TLB: 16 entries, fully associative
• 8800GTX/8800GTS/8600GTS/FX5600: different numbers of similar caches
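A minimal sketch of such a pointer-chase kernel, assuming the host has filled A with a cyclic chain of indices and varies the array size and stride between runs; the kernel name, parameters, and launch configuration are illustrative, not the original benchmark code.

```cuda
// Hedged sketch of a pointer-chase microbenchmark, not the original code.
// A single scalar thread follows k = A[k], so every load depends on the
// previous one and runtime is pure memory latency. Working-set size and
// stride are varied by the host to expose cache and TLB capacities.
__global__ void pointer_chase(const int *A, int *out, int iterations)
{
    int k = 0;
    for (int i = 0; i < iterations; ++i)
        k = A[k];          // dependent load: next address known only after this load completes
    *out = k;              // store the result so the chain is not optimized away
}

// Launched with a single scalar thread, e.g.:
//   pointer_chase<<<1, 1>>>(d_A, d_out, 1 << 20);
// Average time per iteration approximates the load latency at that working-set size.
```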
Matrix-Matrix Multiply, C = C + AB
• 64×16 blocks in C, rank-4 updates (see the sketch after this list)
  • Ensures that our code is compute-bound
• Each vector thread computes one block of C
  • ijk/jik form; other choices produce a race condition in C
• Keep A's and C's blocks in vector registers
  • Similarly done on the IBM 3090 VF and Cray X1
• Keep B's block in local memory
  • Others keep it in scalar registers (n/a on the GPU) or in caches (slow)
• Use compiler options to enforce a tight register budget
  • To hide memory latencies better by multithreading
• Use prefetching to hide latencies even better
• Now performance is not bound by latency or bandwidth in reading blocks of A and B!
  • Bound by instruction issue and local-memory ops (230 Gflop/s)
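A simplified sketch of this blocking scheme, assuming row-major storage, dimensions that are multiples of the tile sizes, and 64 threads per block (one per row of the 64×16 C block); it illustrates the register/local-memory blocking described above, not the exact production kernel.

```cuda
// Hedged sketch: one thread block of 64 threads computes a 64x16 block of C
// with rank-4 updates. Each thread keeps 16 values of C and 4 values of A in
// registers; the 4x16 tile of B is shared through local (shared) memory.
// Launch: dim3 grid(n/16, m/64), 64 threads per block.
__global__ void sgemm_64x16(const float *A, const float *B, float *C,
                            int m, int n, int k)
{
    const int row  = blockIdx.y * 64 + threadIdx.x;  // one thread per row of the C block
    const int col0 = blockIdx.x * 16;                // first column of the C block

    __shared__ float Bs[4][16];                      // current 4x16 tile of B
    float c[16] = {0.0f};                            // C block row, kept in registers

    for (int kb = 0; kb < k; kb += 4) {
        // 64 threads cooperatively load the 4x16 tile of B into local memory.
        Bs[threadIdx.x / 16][threadIdx.x % 16] =
            B[(kb + threadIdx.x / 16) * n + col0 + threadIdx.x % 16];
        __syncthreads();

        // Each thread loads its 1x4 strip of A into registers ...
        float a[4];
        #pragma unroll
        for (int i = 0; i < 4; ++i)
            a[i] = A[row * k + kb + i];

        // ... and performs the rank-4 update on its 16 C values.
        #pragma unroll
        for (int i = 0; i < 4; ++i)
            #pragma unroll
            for (int j = 0; j < 16; ++j)
                c[j] += a[i] * Bs[i][j];
        __syncthreads();
    }

    // C = C + A*B
    for (int j = 0; j < 16; ++j)
        C[row * n + col0 + j] += c[j];
}
```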
Performance of SGEMM
• CUBLAS 1.1:
  • keeps square blocks of A and B in local memory
  • uses long vectors (poor instruction mix)
  • exposing too much data parallelism may cost you
• Our SGEMM is now in CUBLAS 2.0 beta
SGEMM, 8800GTX, k = 1024
• Constant work per vector thread (a function of k)
• The optimized version does better load balancing by computing partial sums
Panel Factorization
• CPU: runtime on a Core2 Duo 2.66 GHz with Intel MKL 10.0 (includes CPU-GPU transfer!)
• GPU: estimated for the 8800GTX
Design of Matrix Factorizations
• Right-looking scheme = most parallelism = best on 16-core GPUs
• Crout scheme = least bandwidth = best on 4-core GPUs and if using CUBLAS 1.1
• Left-looking = half of the work in triangular solve = limited parallelism = inefficient
• 2-level blocking
  • Both levels are right-looking + premature exit from the finer level to keep
  • Up to 6% speedup only, at large matrices (n ≈ 10,000)
  • Block size on the GPU is 64 (same as in matrix multiply)
• Autotuning in QR (up to 7% speedup)
• Row-major layout on the GPU in LU decomposition
  • Since gathers with a large stride are inefficient
  • Requires transposition at every CPU-GPU transfer
  • >2x speedup!
• Panel factorization on the CPU overlapped with BLAS3 on the GPU (use lookahead)
• Multiply by inverse (GEMM) instead of triangular solve (TRSM) (see the sketch after this list)
  • TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s if the matrix is 64×64
  • Parallelism in TRSM is not embarrassing enough
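A hedged host-side sketch of the multiply-by-inverse idea, using the legacy CUBLAS API of that era; the names d_Tinv, d_B, d_X, the leading dimensions, and the assumption that the 64×64 triangular block has already been inverted (once, e.g. on the CPU) are illustrative, not the authors' actual code.

```cuda
#include <cublas.h>

// Replace the small triangular solve  T * X = B  (T is 64x64, lower triangular)
// by a multiply with its precomputed inverse:  X = inv(T) * B.
// One SGEMM exposes far more parallelism than the equivalent TRSM.
void solve_by_gemm(const float *d_Tinv,  // 64x64 inverse of the triangular block (device)
                   const float *d_B,     // 64 x n right-hand side, column-major (device)
                   float       *d_X,     // 64 x n solution (device)
                   int          n)
{
    cublasSgemm('N', 'N', 64, n, 64,
                1.0f, d_Tinv, 64,
                      d_B,    64,
                0.0f, d_X,    64);
}
```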
Test Platform
• GPU:
  • Core2 Duo 2.67 GHz + GeForce 8800 GTX
  • Core2 Duo 2.67 GHz + two GeForce 8800 GTX
• CPU:
  • Core2 Duo 2.67 GHz
  • Core2 Quad 2.4 GHz
Speedup Using 2 GPUs
• Using a column-cyclic layout (sketched below)
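A minimal sketch of what a column-cyclic distribution could look like: column blocks of width nb are dealt round-robin to the GPUs, so both keep roughly equal shares of the trailing matrix as the factorization proceeds. The function, names, and block width are illustrative assumptions, not the actual implementation.

```cuda
#include <cuda_runtime.h>

// Deal column blocks of width nb to the GPUs round-robin.
// A is n x n, column-major with leading dimension lda; d_A[g] is a device buffer on GPU g.
void distribute_column_cyclic(const float *A, int lda, int n, int nb,
                              float *d_A[], int ngpu)
{
    for (int jb = 0; jb * nb < n; ++jb) {
        int gpu   = jb % ngpu;                                   // owner of this column block
        int jloc  = jb / ngpu;                                   // local block index on that GPU
        int width = (n - jb * nb < nb) ? (n - jb * nb) : nb;     // last block may be narrower
        cudaSetDevice(gpu);
        cudaMemcpy(d_A[gpu] + (size_t)jloc * nb * lda,           // local column offset
                   A        + (size_t)jb   * nb * lda,           // global column offset
                   sizeof(float) * (size_t)lda * width,
                   cudaMemcpyHostToDevice);
    }
}
```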
Other Work Done
• Tridiagonal eigenvalue solver (bisection)
  • Most work: factorize A − λᵢI = LDLᵀ and count the signs in D (compute bound; see the sketch after this list)
  • Done for many λᵢ in parallel (traditionally vectorized)
  • If more parallelism is needed, do multisection instead of bisection
    • But it increases the total flop count
  • The rest is difficult to parallelize and is not worth it
  • Our solution:
    • Run the vectorized loops on the GPU, the rest (least work) on the CPU
    • Autotune to decide the optimal redundancy and when to involve the CPU
    • Use features of IEEE arithmetic to save another 15-30% of runtime
  • Up to 156x faster than LAPACK on a Pentium 4
• Tridiagonal eigenvector solver (inverse iteration)
  • Most work: solve (A − λᵢI)x_{k+1} = x_k for fixed λᵢ (bandwidth bound)
  • Factorize A − λᵢI = LDLᵀ once, keep only D; reconstruct L as needed
    • Reconstruction is overlapped with memory access; still bandwidth bound
  • Don't pivot; recompute using safe code if that fails (do it on the CPU)
  • Up to 120x faster than LAPACK on a Core2 Duo so far
  • More complicated when eigenvalues are clustered
• Stencil computation (7-point on a 3D grid)
  • Blocks in registers and local memory
  • Bandwidth bound; runs at up to 66% of pin bandwidth
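A hedged sketch of the compute-bound inner loop of the bisection solver (names are illustrative): one scalar thread per shift λ factors the symmetric tridiagonal A − λI = LDLᵀ and counts negative pivots, which by Sylvester's law of inertia is the number of eigenvalues below λ.

```cuda
// Hedged sketch, not the production code: each thread handles one shift.
// diag[] holds the n diagonal entries of the tridiagonal A, offd[] the n-1
// off-diagonal entries. count[i] receives the number of eigenvalues < shift[i].
__global__ void bisection_count(const float *diag, const float *offd, int n,
                                const float *shift, int *count, int nshifts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nshifts) return;

    float lambda = shift[i];
    float d = diag[0] - lambda;          // first pivot of the LDL^T factorization
    int neg = (d < 0.0f);
    for (int j = 1; j < n; ++j) {
        // Next pivot of A - lambda*I:  d_j = (a_j - lambda) - b_{j-1}^2 / d_{j-1}
        d = (diag[j] - lambda) - offd[j - 1] * offd[j - 1] / d;
        neg += (d < 0.0f);
    }
    count[i] = neg;                      // eigenvalue count below lambda (Sylvester inertia)
}
```

Launching one thread per shift (or per multisection point) provides the data parallelism mentioned above.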
Future Work
• Analysis of architecture
  • Find the best parallels in past architectures to reuse their methods
  • Catching up with newer GPUs
  • More micro-benchmarking to get better performance models
• More scientific kernels
  • CUFFT is ≈50 Gflop/s; we can do better (e.g., by not doing sin/cos in the inner loop)
• More LAPACK
  • Two-sided factorizations used in eigensolvers and the SVD
    • In LAPACK, 50% of the work is in BLAS1/BLAS2
    • A mostly-BLAS3 algorithm is known, but it requires more flops if eigenvectors are needed
    • May use divide-and-conquer instead
  • MRRR (improved inverse iteration algorithm, also rich in parallelism)
  • Non-symmetric eigensolvers such as QR iterations
    • Currently fine-grained; can we do better?
  • Iterative refinement for the eigenvalue problem?
• ScaLAPACK (distributed-memory LAPACK)
  • One-sided factorizations on a GPU cluster