
Analyzing CUDA Workloads Using a Detailed GPU Simulator

Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt, University of British Columbia

Presentation Transcript


  1. Analyzing CUDA Workloads Using a Detailed GPU Simulator Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt University of British Columbia

  2. GPUs and CPUs on a collision course • First GPUs with programmable shaders appeared in 2001 • Today: a teraflop on a single card. Turing complete. Highly accessible: senior undergraduate students can learn to program CUDA in a few weeks (though not to write high-performance code) • Rapidly growing set of CUDA applications (209 listed on NVIDIA’s CUDA website in February) • With OpenCL, we can safely expect the number of non-graphics applications written for GPUs to explode • GPUs are massively parallel systems: multicore + SIMT + fine-grained multithreading

  3. No academic detailed simulator for studying this?!?

  4. GPGPU-Sim • An academic, detailed (“cycle-level”) timing simulator developed from the ground up at the University of British Columbia (UBC) for modeling a modern GPU running non-graphics workloads • Relatively accurate, although no effort has been expended on making it more accurate relative to real hardware

  5. GPGPU-Sim • Currently supports CUDA version 1.1 applications “out of the box” • Microarchitecture model • Based on the notion of “shader cores”, which approximate the “Streaming Multiprocessor” of the NVIDIA GeForce 8 series and later • Shader cores connect to memory controllers through a detailed network-on-chip simulator (Dally & Towles’ booksim) • Detailed DRAM timing model (everything except refresh) • GPGPU-Sim v2.0b available: www.gpgpu-sim.org

  6. Rest of this talk • Obligatory brief introduction to CUDA • GPGPU-Sim internals (100,000-foot view) • Simulator software overview • Modeled microarchitecture • Some results from the paper

  7. CUDA Example

     Runs on CPU:

       main() {
         …
         cudaMalloc((void**) &d_idata, bytes);
         cudaMalloc((void**) &d_odata, maxNumBlocks*sizeof(int));
         // copy the input from host to device
         cudaMemcpy(d_idata, h_idata, bytesin, cudaMemcpyHostToDevice);
         // launch configuration: <<<blocks per grid, threads per block, shared memory bytes>>>
         reduce<<<nblocks, nthreads, smemSize>>>(d_idata, d_odata);
         cudaThreadSynchronize();
         // copy the per-block results back: destination (host) first, source (device) second
         cudaMemcpy(h_odata, d_odata, bytesout, cudaMemcpyDeviceToHost);
         …
       }

     nthreads x nblocks copies run in parallel on GPU:

       __global__ void reduce(int *g_idata, int *g_odata) {
         extern __shared__ int sdata[];
         unsigned int tid = threadIdx.x;
         unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
         // each thread loads one element into shared memory
         sdata[tid] = g_idata[i];
         __syncthreads();
         // tree reduction in shared memory (interleaved addressing)
         for (unsigned int s = 1; s < blockDim.x; s *= 2) {
           if ((tid % (2*s)) == 0)
             sdata[tid] += sdata[tid + s];
           __syncthreads();
         }
         // thread 0 writes this block's partial sum
         if (tid == 0) g_odata[blockIdx.x] = sdata[0];
       }
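To see what the `reduce` kernel computes without a GPU, the following host-side C++ sketch replays its interleaved-addressing loop for a single thread block. The name `reduce_block_on_host` is hypothetical (it is not part of CUDA or GPGPU-Sim), and the sketch assumes the block size is a power of two, which the kernel relies on implicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the reduce kernel for ONE thread block.
// Hypothetical helper for illustration; sdata stands in for the
// kernel's shared-memory array after the initial load.
int reduce_block_on_host(std::vector<int> sdata) {
    const std::size_t block_dim = sdata.size();
    // Assumption: non-empty block whose size is a power of two.
    assert(block_dim != 0 && (block_dim & (block_dim - 1)) == 0);

    for (std::size_t s = 1; s < block_dim; s *= 2) {
        // Each pass over tid models all threads running between two barriers.
        // A sequential pass is equivalent here because within one step a
        // thread only reads an element that no thread writes in that step.
        for (std::size_t tid = 0; tid < block_dim; ++tid) {
            if (tid % (2 * s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
        }
    }
    return sdata[0];  // what thread 0 would store to g_odata[blockIdx.x]
}
```

Calling `reduce_block_on_host({1, 2, 3, 4, 5, 6, 7, 8})` walks through three steps (s = 1, 2, 4) and leaves the block sum in element 0.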

  8. Normal CUDA Flow • Applications are written in a mixture of C/C++ and CUDA • “nvcc” takes CUDA (.cu) files and generates host C code and “Parallel Thread eXecution” (PTX) assembly language • PTX is passed to the assembler/optimizer “ptxas” to generate machine code that is packed into a C array (not human readable) • The whole thing is combined and linked against the CUDA runtime API using a regular C/C++ compiler and linker • Run your app on the GPU

  9. GPGPU-Sim Flow • Uses CUDA nvcc to generate CPU C code and PTX • A flex/bison parser reads in the PTX • Host (CPU) code and simulator are linked together into one binary • CUDA API calls are intercepted using a custom libcuda that implements the functions declared in the header files that come with CUDA

  10. GPGPU-Sim Microarchitecture • Set of “shader cores” connected to a set of memory controllers via a detailed interconnection network model (booksim) • Memory controllers reorder requests to reduce activate/precharge overheads • Topology and bandwidth of the interconnect can be varied • Cache for global memory operations
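Why reordering reduces activate/precharge overhead can be illustrated with a toy model of one DRAM bank. The sketch below is an assumption-laden illustration, not GPGPU-Sim's scheduler: it assumes a single bank, a queue that is fully populated up front, non-negative row numbers, and a simple "oldest row hit first" policy. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Row activations needed when requests are serviced strictly in arrival order.
std::size_t activations_in_order(const std::vector<int>& rows) {
    std::size_t activations = 0;
    int open_row = -1;  // -1 means no row is open yet
    for (int row : rows) {
        if (row != open_row) {  // row miss: precharge + activate
            ++activations;
            open_row = row;
        }
    }
    return activations;
}

// Row activations when the controller services the oldest request that hits
// the open row, and falls back to the oldest request overall on a miss.
std::size_t activations_row_hit_first(const std::vector<int>& rows) {
    std::deque<int> pending(rows.begin(), rows.end());
    std::size_t activations = 0;
    int open_row = -1;
    while (!pending.empty()) {
        auto it = pending.begin();
        while (it != pending.end() && *it != open_row) {
            ++it;  // search for the oldest row hit
        }
        if (it == pending.end()) {  // no hit: open the oldest request's row
            it = pending.begin();
            open_row = *it;
            ++activations;
        }
        pending.erase(it);
    }
    return activations;
}
```

For a request stream that alternates between two rows, in-order service activates a row on every request, while the reordered service activates each row once.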

  11. Shader Core Details • A shader core is roughly like a “Streaming Multiprocessor” in NVIDIA terminology • A set of scalar threads is grouped together into a SIMD unit called a “warp” (NVIDIA uses 32 threads per warp on current hardware). Warps are grouped into CTAs. CTAs are grouped into “grids” • The set of warps on a core is fine-grain interleaved on the pipeline to hide off-chip memory access latency • Threads in one CTA can communicate via an on-chip 16KB “shared memory”
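The thread, warp and CTA grouping can be made concrete with a small index calculation. This is a minimal sketch assuming a one-dimensional grid, threads numbered consecutively across CTAs, and the warp size of 32 mentioned on the slide; `ThreadLocation` and `locate_thread` are hypothetical names.

```cpp
#include <cassert>

// Position of a scalar thread in the hierarchy (hypothetical type).
struct ThreadLocation {
    unsigned cta;   // which CTA (thread block) within the grid
    unsigned warp;  // which warp within that CTA
    unsigned lane;  // which SIMD lane within that warp
};

// Map a global 1D thread index to its CTA, warp and lane.
ThreadLocation locate_thread(unsigned global_tid,
                             unsigned threads_per_cta,
                             unsigned warp_size = 32) {
    assert(threads_per_cta > 0 && warp_size > 0);
    const unsigned tid_in_cta = global_tid % threads_per_cta;
    return ThreadLocation{global_tid / threads_per_cta,
                          tid_in_cta / warp_size,
                          tid_in_cta % warp_size};
}
```

With 256 threads per CTA, global thread 300 is thread 44 of CTA 1, which places it in warp 1 at lane 12.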

  12. Interconnection Network • Baseline: Mesh • Variations: Crossbar, Ring, Torus • [Figure: baseline mesh memory controller placement]

  13. Are more threads better? • More CTAs on a core • Helps hide latency when some threads wait at barriers • Can increase memory latency tolerance • Needs more resources • Fewer CTAs on a core • Less contention in the interconnection network and memory system
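The "needs more resources" point can be sketched as a simple limit calculation: the number of CTAs that fit on a core is bounded by whichever resource runs out first. The struct and function names below are hypothetical, and apart from the 16KB shared memory mentioned earlier in the talk, any limits you plug in are illustrative assumptions rather than GPGPU-Sim's configuration.

```cpp
#include <algorithm>
#include <cassert>

// Per-core resource limits (hypothetical type; values are supplied by the caller).
struct CoreLimits {
    unsigned max_threads;
    unsigned shared_mem_bytes;
    unsigned registers;
    unsigned max_ctas;
};

// Resources one CTA of a given kernel needs (hypothetical type).
struct CtaDemand {
    unsigned threads;
    unsigned shared_mem_bytes;
    unsigned registers_per_thread;
};

// Number of CTAs that can run concurrently on one core: the minimum over
// the thread, shared-memory, register and CTA-slot limits.
unsigned max_concurrent_ctas(const CoreLimits& core, const CtaDemand& cta) {
    assert(cta.threads > 0);
    unsigned limit = core.max_ctas;
    limit = std::min(limit, core.max_threads / cta.threads);
    if (cta.shared_mem_bytes > 0) {
        limit = std::min(limit, core.shared_mem_bytes / cta.shared_mem_bytes);
    }
    const unsigned regs_per_cta = cta.threads * cta.registers_per_thread;
    if (regs_per_cta > 0) {
        limit = std::min(limit, core.registers / regs_per_cta);
    }
    return limit;
}
```

Doubling a CTA's shared memory use from 4KB to 8KB on a core with 16KB of shared memory halves the shared-memory bound from four CTAs to two, regardless of how many threads the core could otherwise hold.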

  14. Memory Access Coalescing • Grouping accesses from multiple, concurrently issued scalar threads into a single access to a contiguous memory region • Always done within a single warp • Coalescing among multiple warps • We explore its performance benefits • It is more expensive to implement
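The effect of coalescing within a warp can be approximated by counting how many distinct contiguous memory segments the warp's addresses fall into. This is a simplified sketch, not the coalescing rule of any specific GPU or of GPGPU-Sim: the function name `count_memory_accesses` and the segment size passed to it are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Number of memory accesses needed if every distinct segment touched by the
// given per-thread byte addresses is fetched exactly once.
std::size_t count_memory_accesses(const std::vector<std::uint64_t>& addresses,
                                  std::uint64_t segment_bytes) {
    assert(segment_bytes > 0);
    std::set<std::uint64_t> segments;
    for (std::uint64_t addr : addresses) {
        segments.insert(addr / segment_bytes);  // segment index of this address
    }
    return segments.size();
}
```

Under these assumptions, 32 threads reading consecutive 4-byte words span 128 bytes and need two 64-byte accesses, whereas 32 threads striding by 64 bytes each land in a different segment and need 32 accesses.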

  15. Simulation setup

  16. Benchmark Selection • Applications developed by third-party researchers • Reported speedups of less than 50x • Plus some applications from the CUDA SDK

  17. Benchmarks (more info in paper)

  18. Interconnection Network Latency Sensitivity • A slight increase in interconnection latency has no severe effect on overall performance • No need to overdesign the interconnect to decrease latency

  19. Interconnection Network Bandwidth Sensitivity • Low bandwidth (8B) decreases performance substantially • Very high bandwidth moves the bottleneck elsewhere

  20. Effects of varying the number of CTAs • Most benchmarks do not benefit substantially • Some benchmarks even perform better with fewer concurrent threads (e.g. AES) because of less contention in DRAM

  21. More insights and data in the paper…

  22. Summary • GPGPU-Sim: a novel GPU simulator • Capable of simulating CUDA applications • www.gpgpu-sim.org • Performance of simulated applications • More sensitive to bisection bandwidth • Less sensitive to (zero-load) latency • Sometimes running fewer CTAs can improve performance (less DRAM contention)

  23. Interconnect Topology (Fig 9)

  24. ICNT Latency and BW Sensitivity (Fig 10-11)

  25. Mem Controller Optimization Effects (Fig 12)

  26. DRAM Utilization and Efficiency (Fig 13-14)

  27. L1 / L2 Cache (Fig 15)

  28. Varying CTAs (Fig 16)

  29. Inter-Warp Coalescing (Fig 17)
