Performance Debugging for Highly Parallel Accelerator Architectures
Saurabh Bagchi, ECE & CS, Purdue University
Joint work with: Tsungtai Yeh, Amit Sabne, Rudolf Eigenmann (Purdue)
Presentation available at: engineering.purdue.edu/dcsl
Emerging Trend
• Heterogeneous computing is gaining ground as a way to accelerate the performance of parallel applications
• The buzzword is "accelerators"
  • Graphics Processing Units (GPUs)
  • Field Programmable Gate Arrays (FPGAs)
• The attraction is a high degree of parallelism close to the main processor
• Example: Dell PowerEdge servers have 2 Kepler GPUs with a total of 2 × 1536 CUDA cores
But … not so fast
• Programming models for these architectures hide many of the architectural details
  • As they should
• But these architectures are ripe for committing horrendous performance errors
  • Even more so than on traditional CPU architectures
• Why?
  • FPGA: constrained on-chip memory; a careless program can wipe out any performance improvement by falling back to the main processor
  • GPU: multiple levels of memory hierarchy with widely different access latencies; identical control flow is needed across the threads of a warp for full efficiency (see the sketch below)
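To make the GPU control-flow point concrete, here is a minimal CUDA sketch (hypothetical kernel and variable names, not from the talk). Threads of a warp that take different sides of a data-dependent branch are serialized; branching on a value that is uniform across the warp avoids the penalty.

// Sketch: branch divergence within a warp (illustrative only)
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)      // even and odd lanes of the same warp diverge,
        out[i] = in[i] * 2.0f;     // so the two paths execute one after the other
    else
        out[i] = in[i] + 1.0f;
}

__global__ void uniform(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (blockIdx.x % 2 == 0)       // warp-uniform condition: no divergence
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 1.0f;
}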
GPU Schematic
• [Figure: GPU memory hierarchy: the CUDA hierarchy of threads, thread blocks, and grids, with per-thread private, per-block shared, and per-application global memory spaces]
Specs Leading to Performance Problems
• Shared memory and L1 cache are limited
  • Configurable split: 16 KB / 48 KB or 48 KB / 16 KB
  • Very fast access: 1+ TB/s
• Global memory is accessible by all threads on the GPU
  • Larger capacity: 8 GB
  • Slower access: 320 GB/s
• If communication with host memory is required (over the PCI Express bus), it is much slower still
  • A PLDI '12 paper shows a 5X speedup from avoiding cyclic host-device communication (see the sketch below)
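The last point can be illustrated with a small host-side sketch (the kernel, sizes, and names are hypothetical, not the code from the PLDI '12 paper): data crosses the PCI Express bus once in each direction, and all iterations work on device-resident memory instead of copying back to the host every step.

#include <cuda_runtime.h>

// Placeholder per-iteration kernel for the sketch
__global__ void step_kernel(float *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] += 1.0f;
}

void run_iterations(float *h_data, int n, int iters) {
    float *d_data;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // one transfer in
    for (int t = 0; t < iters; ++t) {
        // No per-iteration copy back to the host: the slow PCIe hop is paid once
        step_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // one transfer out
    cudaFree(d_data);
}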
Common Patterns of Performance Bugs
• Memory bugs
  • Uncoalesced memory accesses (see the sketch below)
  • Bank conflicts in shared memory
  • Channel skew in global memory
  • Scheduling of host-to-device memory transfers
• Multi-thread bugs
  • Block/thread configuration
  • Branch divergence
• Synchronization bugs
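As a concrete instance of the first memory bug above, the following CUDA sketch (hypothetical names, not from the talk) contrasts a strided, uncoalesced access pattern with a coalesced one: neighboring threads should touch neighboring addresses so that a warp's loads collapse into a few wide memory transactions.

// Uncoalesced: adjacent threads read addresses 'stride' elements apart,
// so one warp load turns into many separate memory transactions
__global__ void strided_copy(float *out, const float *in, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];
}

// Coalesced: adjacent threads read consecutive addresses,
// so a warp load is served by one (or a few) wide transactions
__global__ void coalesced_copy(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}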
Performance Debugger Workflow
Benchmarking (small scales or small data) → Static analysis → Program profiling → Detect performance anomaly → Localize the problem → Automatic program transformation → Re-benchmarking → Acceptable? (Yes: stop; No: iterate)
Example of a Performance Bug
• Matrix transpose on a GPU
• The memory bandwidth of the GTX 280 is 140 GB/s
• For a 2048 × 2048 matrix:
  • Naïve transpose: 2.2 GB/s
  • Coalesced transpose: 17.1 GB/s (see the sketch below)
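The sketch below shows the two versions in their usual CUDA form (a standard transpose pattern, assumed here rather than taken from the talk). The naïve kernel writes with a stride equal to the matrix width, so its stores are uncoalesced; the coalesced kernel stages a tile in shared memory so that both the global read and the global write touch consecutive addresses, with one column of padding to avoid shared-memory bank conflicts. Both assume a square matrix whose width is a multiple of the tile size and a 32 × 32 thread block.

#define TILE_DIM 32

// Naive: coalesced reads, but strided (uncoalesced) writes to 'out'
__global__ void transpose_naive(float *out, const float *in, int width) {
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    out[x * width + y] = in[y * width + x];
}

// Coalesced: stage the tile in shared memory, then write it out transposed
__global__ void transpose_coalesced(float *out, const float *in, int width) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];   // +1 column avoids bank conflicts
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();
    x = blockIdx.y * TILE_DIM + threadIdx.x;         // swap block indices for the write
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}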
Can We Do This Automatically?
• Some lessons from our prior work [HPDC '11] [HotDep '12]
• Training phase (a series of small-scale testing runs)
  • Instrumentation to record observational features
  • Modeling to train a model that can predict observational features from control features
• Deployment phase (large-scale production runs)
  • Instrumentation to record the same features
  • Detection to flag production runs with negative correlation
  • Localization
    • Use the trained model to reconstruct the observational features
    • Rank features by reconstruction error
Can We Do This Automatically?
• Maybe
• Some lessons from our prior work [HPDC '11] [HotDep '12]
• Kernel Canonical Correlation Analysis takes observational features X and control features Y and finds f and g such that f(X) and g(Y) are highly correlated (in symbols below)
• [Figure: behavioral feature (y-axis) vs. scale of execution (x-axis); a run with corr(f(x), g(y)) < 0 is flagged as a bug]
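In symbols (notation assumed from the slide's description, not the papers' exact formulation), the training step searches for transforms that maximize the correlation, and a production run is flagged when that correlation turns negative:

\[
(f, g) \;=\; \arg\max_{f,\, g} \operatorname{corr}\bigl(f(X),\, g(Y)\bigr),
\qquad
\text{flag a run as buggy if } \operatorname{corr}\bigl(f(x),\, g(y)\bigr) < 0 .
\]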
ABHRANTA: A Predictive Model for Program Behavior at Large Scale
• ABHRANTA replaced the non-invertible transform g used by Vrisha with a linear transform g'
• The new model provides an automatic way to reconstruct "bug-free" behavior at large scale (written out below), lifting the burden of manual analysis of program scaling behavior
• [Figure: the features x are mapped through f(x) and then through the inverse g'^{-1} to the predicted behavior g'^{-1}(f(x))]
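Because g' is linear and invertible, the expected behavior at large scale and the localization ranking can be written as (again, notation assumed; here x is the input to f and y is the vector of observed behavioral features being predicted):

\[
\hat{y} \;=\; g'^{-1}\bigl(f(x)\bigr),
\qquad
e_i \;=\; \bigl|\, y_i - \hat{y}_i \,\bigr| \quad \text{(rank observational features by } e_i\text{)} .
\]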
Results from an HPC Benchmark
• AMG2006 is a parallel algebraic multigrid solver for linear systems, written in 104K lines of C code
• The application is configured to solve the default 3D Laplace-type problem
• Train on 8-128 node runs, test at larger scales (up to 4096 nodes)
• Fault injection study: integer overflows, buffer overflows
• Control features: X, Y, Z dimensions of the 3D grid
• Observational features: all conditionals, indexed by calling context
Can We Do This for GPU Programs? (Wild?) Speculation!
• We think we can
• Features that make this approach more feasible:
  • Kernels are more regular than general-purpose programs
  • There are good places to insert monitors to observe behavioral features
  • There is often spare computational capacity close by
  • The types of performance bugs are limited
  • The types of program transformations are limited
Presentation available at: Dependable Computing Systems Lab (DCSL) web site, engineering.purdue.edu/dcsl