
SHOC: Overview and Kernel Walkthrough


Presentation Transcript


  1. SHOC: Overview and Kernel Walkthrough Kyle Spafford Keeneland Tutorial April 14, 2011

  2. The Scalable Heterogeneous Computing Benchmark Suite (SHOC) • Focus on scientific computing workloads, including common kernels like SGEMM, FFT, stencils • Parallelized with MPI, with support for multi-GPU and cluster-scale comparisons • Implemented in both CUDA and OpenCL for a 1:1 comparison • Includes system and stability tests • SHOC Results Browser (beta): http://ft.ornl.gov/~kspafford/shoctown/Shoctown.html

  3. Download SHOC • Source code • http://ft.ornl.gov/doku/shoc/downloads • Build and Run • http://ft.ornl.gov/doku/shoc/gettingstarted • sh ./conf/config-keeneland.sh • make • cd tools • perl driver.pl -cuda -s 4 • Includes example output for Keeneland • FAQ • http://ft.ornl.gov/doku/shoc/faq

  4. SHOC Categories • Performance: • Level 0 • Speeds and feeds: raw FLOPS rates, bandwidths, latencies • Level 1 • Algorithms: FFT, matrix multiply, stencil, sort, etc. • Level 2: • Application kernels: S3D (chemistry), molecular dynamics • System: • PCIe Contention, MPI latency vs. host-device bandwidth, NUMA • Stability: • FFT-based, error detection

  5. (Level 0 Example): DeviceMemory • Motivation • Determine sustainable device memory bandwidth • Benchmark local, global, and image memory • Basic design • Test different memory access patterns, i.e. coalesced vs. uncoalesced (sketched below) • Measure both read and write bandwidth • Vary number of threads in a block • [Figure: coalesced access vs. thread-sequential (uncoalesced) access across threads 1–4]
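
A minimal sketch of the two read patterns, for illustration only; this is not the actual SHOC DeviceMemory code, and the kernel names, the chunk parameter, and the implied buffer sizes are assumptions.

    // Coalesced: consecutive threads read consecutive words.
    __global__ void readCoalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int k = i; k < n; k += gridDim.x * blockDim.x)
            sum += in[k];                     // unit stride across the warp
        out[i] = sum;
    }

    // Thread-sequential / uncoalesced: each thread reads its own contiguous
    // chunk, so neighboring threads touch words that are far apart in memory.
    __global__ void readUncoalesced(const float* in, float* out, int chunk)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int k = 0; k < chunk; ++k)
            sum += in[i * chunk + k];         // stride of 'chunk' words between threads
        out[i] = sum;
    }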

  6. SHOC: Level 0 Tests • BusSpeedDownload/Readback • Measures bandwidth/latency of the PCIe bus • DeviceMemory • Measures global/constant/shared memory • KernelCompilation • Measures OpenCL JIT kernel compilation speeds • MaxFlops • Measures achievable FLOPS (synthetic, not bandwidth-bound) • QueueDelay • Measures OpenCL queueing system overhead

  7. (Level 1 Example): Stencil2D • Motivation • Supports investigation of accelerator usage within a parallel application context • Serial and True Parallel versions • Basic design • 9-point stencil operation applied to 2D data set • MPI uses 2D Cartesian data distribution, with periodic halo exchanges • Applies stencil to data in local memory • OpenCL/CUDA observations • Runtime dominated by data movement • Between host and card • Between MPI processes

  8. SHOC: Level 1 Tests • FFT • Reduction • Scan • SGEMM • Sort • SpMV • Stencil2D • Triad

  9. (Level 2 Example): S3D • Motivation • Measure performance of important DOE application • S3D solves Navier-Stokes equations for a regular 3D domain, used to simulate combustion • Basic design • Assign each grid point to a device thread • Highly parallel, as grid points are independent • OpenCL/CUDA observations • CUDA outperforms OpenCL • Big factor: native transcendentals (sin, cos, tan, etc.) • [Figure: 3D regular domain decomposition – each thread handles a grid point, blocks handle regions]

  10. SHOC: Other Tests • Stability • FFTs are sensitive to small errors • Repeated simultaneous FFT/iFFT • Parallel for testing large systems • System • MPI contention • Impact of GPU usage on MPI latency • Chipset contention • Impact of MPI communication on GPU performance • NUMA • Multi-socket, multiple PCIe slot, multiple RAM banks

  11. Compare OpenCL and CUDA • OpenCL improving, but still trailing CUDA • Tesla C2050, CUDA/OpenCL 3.2 RC2

  12. Example Results

  13. Reduction Walkthrough

  14. Reduction Walkthrough • Fundamental kernel in almost all programs • Easy to implement, but hard to get right • We’ll walk through the optimization process. • Code for these kernels is at http://ft.ornl.gov/~kspafford/tutorial.tgz • Graphics from a similar presentation by Mark Harris, NVIDIA

  15. Reduction Walkthrough • Start with the well-known, tree-based approach:

  16. Algorithm Sketch • Launch 1 thread per element • Each thread loads an element from global memory into shared memory • Each block reduces its shared memory into 1 value • This value is written back out to global memory

  17. Algorithm Sketch with Code • Main steps • Each thread loads a value from global memory into shared memory: extern __shared__ float sdata[]; unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; • Synchronize threads: __syncthreads(); • Reduce shared memory into a single value • Write value out to global memory

  18. Reduction of Shared Memory for(int s = 1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) { sdata[tid] += sdata[tid + s]; } __syncthreads(); }
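
The pieces from slides 17 and 18 assembled into one naive kernel (v0). This is a minimal sketch assuming the input length is a multiple of the block size; the kernel and buffer names follow the slides, the launch details are assumptions.

    __global__ void reduce0(const float* g_idata, float* g_odata)
    {
        extern __shared__ float sdata[];

        unsigned int tid = threadIdx.x;
        unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];              // one element per thread
        __syncthreads();

        // Tree-based reduction in shared memory (interleaved addressing).
        for (int s = 1; s < blockDim.x; s *= 2)
        {
            if (tid % (2 * s) == 0)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // Thread 0 writes this block's partial sum back to global memory.
        if (tid == 0)
            g_odata[blockIdx.x] = sdata[0];
    }

A launch would look like reduce0<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_partials); the third launch argument sizes the extern shared array.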

  19. Reduction v0 • Problem 1: Divergent Warps • Problem 2: Modulo operator is expensive on GPUs

  20. Recursive Invocation • Problem: We want a single value, but blocks can’t communicate • Solution: Recursive kernel invocation
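
A host-side sketch of that recursion, assuming the reduce0 kernel sketched above and that both n and blockSize are powers of two (the real driver handles ragged sizes); the function and buffer names are illustrative.

    #include <cuda_runtime.h>
    #include <utility>

    float reduceOnDevice(float* d_in, float* d_tmp, int n, int blockSize)
    {
        while (n > 1)
        {
            int threads = (n < blockSize) ? n : blockSize;   // final pass may use a smaller block
            int blocks  = n / threads;
            reduce0<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_tmp);
            std::swap(d_in, d_tmp);    // this pass's partial sums feed the next pass
            n = blocks;
        }
        float result;
        cudaMemcpy(&result, d_in, sizeof(float), cudaMemcpyDeviceToHost);
        return result;
    }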

  21. Performance

  22. Reduction v1 • Get rid of divergent branch and modulo operator

  23. Reduction v1
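
The code on this slide is not in the transcript; a sketch of the usual fix (after the NVIDIA reduction whitepaper linked on slide 32) replaces the divergent tid % (2*s) test with a computed index, so the active threads are packed into the low-numbered warps and no modulo is needed:

    // Same interleaved addressing, but the branch no longer diverges within
    // a warp and the expensive % is gone.
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        unsigned int index = 2 * s * tid;     // only low-numbered threads do work
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

Strided indexing like this is exactly what produces the shared-memory bank conflicts described on slide 24.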

  24. Problem – Bank Conflicts • Shared memory is composed of 32 banks. • When multiple threads access *different* words in the *same* bank, access is serialized

  25. Performance

  26. Reduction v2 – Sequential Addressing
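
This slide's code is also an image; the standard sequential-addressing loop it refers to (again following the whitepaper) runs the strides in reverse so each step halves the active range and neighboring threads touch neighboring banks:

    // Sequential addressing: conflict-free, and the branch stays non-divergent.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }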

  27. Performance

  28. Reduction v3 – Unrolling the Last Warp • We know threads execute in a warp-synchronous fashion • For the last few steps, we can get rid of extra __syncthreads() calls
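
A sketch of the unrolled tail, assuming a block size of at least 64; the volatile qualifier keeps the compiler from caching sdata values in registers between the unsynchronized steps (this warp-synchronous trick was safe on the GPUs of this era):

    __device__ void warpReduce(volatile float* sdata, unsigned int tid)
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid +  8];
        sdata[tid] += sdata[tid +  4];
        sdata[tid] += sdata[tid +  2];
        sdata[tid] += sdata[tid +  1];
    }

    // In the kernel, stop the __syncthreads() loop once a single warp remains:
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warpReduce(sdata, tid);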

  29. Performance

  30. Reduction v4 – Multiple Elements Per Thread • Still have some instruction overhead • Can use templates to totally unroll the loop • Can have threads handle multiple elements from global memory • Bonus: reduces any array size to 2 kernel invocations • This is a useful optimization for most kernels
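
A sketch of the combined version (templated block size for full unrolling, plus a grid-stride load so each thread sums several elements), modeled on the whitepaper rather than copied from SHOC; it reuses the warpReduce helper above and assumes blockSize is 64, 128, 256, or 512.

    template <unsigned int blockSize>
    __global__ void reduce4(const float* g_idata, float* g_odata, unsigned int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid      = threadIdx.x;
        unsigned int i        = blockIdx.x * (blockSize * 2) + tid;
        unsigned int gridSize = blockSize * 2 * gridDim.x;

        float sum = 0.0f;
        while (i < n)                         // each thread sums many elements on load
        {
            sum += g_idata[i];
            if (i + blockSize < n) sum += g_idata[i + blockSize];
            i += gridSize;
        }
        sdata[tid] = sum;
        __syncthreads();

        // Fully unrolled at compile time thanks to the template parameter.
        if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
        if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
        if (blockSize >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
        if (tid < 32) warpReduce(sdata, tid);

        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

Because the grid-stride load lets a fixed number of blocks cover any input, an array of any size can be reduced with two launches: one over the data, e.g. reduce4<256><<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_partials, n), and one over the partial sums.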

  31. Performance

  32. More about Reduction • http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf • Demo wget http://ft.ornl.gov/~kspafford/tutorial.tgz

  33. Programming Problem - Scan • Now let’s think about how this extends to Scan (aka prefix sum) • Scan takes a binary associative operator ⊕ and an array of n elements [a0, a1, …, an-1], and returns the array [a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-1)] • Example: if ⊕ is addition, [3 1 7 0 4] → [3 4 11 11 15]
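
For reference when testing a GPU scan, the sequential version is a few lines on the host (a sketch; the function name is arbitrary):

    // Sequential inclusive scan with +, useful as a correctness check.
    void scanReference(const float* in, float* out, int n)
    {
        float running = 0.0f;
        for (int i = 0; i < n; ++i)
        {
            running += in[i];     // a0 + a1 + ... + ai
            out[i] = running;
        }
    }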

  34. Reduce-then-scan Strategy • Input: 7 3 8 5 5 1 2 6 • Kernel 1 (reduce, one partial sum per block of 2): 10 13 6 8

  35. Reduce-then-scan Strategy • Input: 7 3 8 5 5 1 2 6 • Kernel 1 (reduce): 10 13 6 8 • Kernel 2 (exclusive top-level scan of the block sums): 0 10 23 29

  36. Reduce-then-scan Strategy • Input: 7 3 8 5 5 1 2 6 • Kernel 1 (reduce): 10 13 6 8 • Kernel 2 (exclusive top-level scan): 0 10 23 29 • Kernel 3 (bottom-level scan, seeded with each block's offset): 7 10 18 23 28 29 31 37
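
To make the data flow concrete, here is the same three-phase strategy run on the slide's example on the CPU (block size 2, input 7 3 8 5 5 1 2 6). This only illustrates the structure; it is not the GPU code.

    #include <cstdio>

    int main()
    {
        const int n = 8, B = 2, nb = n / B;
        float in[n] = {7, 3, 8, 5, 5, 1, 2, 6}, out[n], sums[nb], offs[nb];

        for (int b = 0; b < nb; ++b) {                 // Kernel 1: per-block reduce
            sums[b] = 0;
            for (int j = 0; j < B; ++j) sums[b] += in[b * B + j];
        }
        float run = 0;                                 // Kernel 2: exclusive top-level scan
        for (int b = 0; b < nb; ++b) { offs[b] = run; run += sums[b]; }

        for (int b = 0; b < nb; ++b) {                 // Kernel 3: bottom-level scan,
            float acc = offs[b];                       // seeded with this block's offset
            for (int j = 0; j < B; ++j) { acc += in[b * B + j]; out[b * B + j] = acc; }
        }
        for (int i = 0; i < n; ++i) printf("%g ", out[i]);   // prints 7 10 18 23 28 29 31 37
        return 0;
    }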

  37. Fast Scan Kernel • Use twice as much shared memory as there are elements; set the first half to 0 and the second half to the input • Initial: 0 0 0 0 | 10 13 6 8 • After stride 1: 0 0 0 0 | 10 23 19 14 • After stride 2: 0 0 0 0 | 10 23 29 37 • for (int i = 0; i < log2(blockSize); i++) smem[idx] += smem[idx - (1 << i)];
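
A sketch of that kernel with the synchronization the one-line pseudocode above elides; the zero-padded first half is what lets smem[idx - stride] be read without a bounds check. The real version is SHOC's scanLocalMem, referenced on the next slide; names and launch details here are illustrative.

    __global__ void scanBlock(const float* g_in, float* g_out)
    {
        extern __shared__ float smem[];          // 2 * blockDim.x floats
        unsigned int tid = threadIdx.x;
        unsigned int idx = blockDim.x + tid;     // this thread's slot in the second half

        smem[tid] = 0.0f;                        // first half: identity padding
        smem[idx] = g_in[blockIdx.x * blockDim.x + tid];
        __syncthreads();

        for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
        {
            float t = smem[idx - stride];        // read before any thread overwrites it
            __syncthreads();
            smem[idx] += t;
            __syncthreads();
        }
        g_out[blockIdx.x * blockDim.x + tid] = smem[idx];   // inclusive, per-block result
    }

Launch with 2 * blockSize * sizeof(float) bytes of dynamic shared memory.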

  38. Example Code • Kernel Found in SHOC (src/level1/scan/scan_kernel.h) in the scanLocalMem function • You can adapt this function for the top-level exclusive scan and the bottom-level inclusive scans. • Problems: • Determine how reduction should stride across global memory • Figure out how to make it exclusive/inclusive (hint: remember the first half of smem is 0) • Figure out how to use the scan kernel for the bottom level scan

  39. Good Luck! Further reading on Scan: • Examples in the CUDA SDK • http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf • http://back40computing.googlecode.com/svn/wiki/documents/ParallelScanForStreamArchitecturesTR.pdf

  40. MD Walkthrough • Motivation • Classic n-body pairwise computation, important to all MD codes such as GPU-LAMMPS, AMBER, NAMD, Gromacs, CHARMM • Basic design • Computation of the Lennard-Jones potential force • 3D domain, random distribution • Neighbor list algorithm

  41. Algorithm Sketch for each atom, i { force = 0; for each neighbor, j { dist = distance(pos[i],pos[j]); if (dist < cutoff) force += interaction(i,j); } forces[i] = force; }
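
A minimal CUDA rendering of this sketch, one thread per atom. The array layout, the lj1/lj2 constants, and the names are assumptions for illustration, not the actual SHOC kernel (that lives in cuda/level1/md/MD.cu, shown on slide 43).

    __global__ void ljForces(const float4* pos, float4* forces,
                             const int* neighbors, const int* numNeighbors,
                             int nAtoms, int maxNeighbors,
                             float cutoffSq, float lj1, float lj2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nAtoms) return;

        float4 ipos  = pos[i];
        float3 force = make_float3(0.0f, 0.0f, 0.0f);

        for (int k = 0; k < numNeighbors[i]; ++k)
        {
            int    j    = neighbors[i * maxNeighbors + k];
            float4 jpos = pos[j];                 // data-dependent, uncoalesced read
            float dx = ipos.x - jpos.x, dy = ipos.y - jpos.y, dz = ipos.z - jpos.z;
            float r2 = dx * dx + dy * dy + dz * dz;

            if (r2 < cutoffSq)                    // Lennard-Jones force contribution
            {
                float inv2 = 1.0f / r2;
                float inv6 = inv2 * inv2 * inv2;
                float f    = inv2 * inv6 * (lj1 * inv6 - lj2);
                force.x += f * dx;  force.y += f * dy;  force.z += f * dz;
            }
        }
        forces[i] = make_float4(force.x, force.y, force.z, 0.0f);
    }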

  42. Performance Observations • Neighbors are data-dependent • Results in an uncoalesced read on pos[j]. • Uncoalesced reads kill performance • But sometimes the texture cache can help

  43. (in cuda/level1/md/MD.cu)

  44. Performance on Keeneland

  45. For the Hands-On Session • Scan Competition • Goal: Best performance on 16 MiB input of floats. • Use this slide deck and knowledge from the other presentations • Test harness with timing and correctness check provided in scan.cu • Download it from ft.ornl.gov/~kspafford/tutorial.tgz • Email submissions (scan.cu) to kys@ornl.gov • I will announce the winner at the end of the hands-on session.

  46. Thanks! kys@ornl.gov
