
Automated Dynamic Analysis of CUDA Programs


Presentation Transcript


  1. Automated Dynamic Analysis of CUDA Programs. Michael Boyer, Kevin Skadron*, and Westley Weimer, University of Virginia, {boyer,skadron,weimer}@cs.virginia.edu (* currently on sabbatical with NVIDIA Research)

  2. Outline • GPGPU • CUDA • Automated analyses • Correctness: race conditions • Performance: bank conflicts • Preliminary results • Future work • Conclusion

  3. Why GPGPU? From: NVIDIA CUDA Programming Guide, Version 1.1

  4. CPU vs. GPU Design. The CPU is designed for single-thread latency; the GPU is designed for aggregate throughput. From: NVIDIA CUDA Programming Guide, Version 1.1

  5. GPGPU Programming • Traditional approach: graphics APIs • ATI/AMD: Close-to-the-Metal (CTM) • NVIDIA: Compute Unified Device Architecture (CUDA)

  6. CUDA: Abstractions • Kernel functions • Scratchpad memory • Barrier synchronization

  7. CUDA: Example Program

  __host__ void example(int *cpu_mem) {
      int *gpu_mem;
      // Allocate device memory and copy the input to it
      // (mem_size, grid, and threads are assumed defined elsewhere)
      cudaMalloc((void **)&gpu_mem, mem_size);
      cudaMemcpy(gpu_mem, cpu_mem, mem_size, cudaMemcpyHostToDevice);
      // Launch the kernel, then copy the result back to the host
      kernel<<<grid, threads, mem_size>>>(gpu_mem);
      cudaMemcpy(cpu_mem, gpu_mem, mem_size, cudaMemcpyDeviceToHost);
  }

  __global__ void kernel(int *mem) {
      int thread_id = threadIdx.x;
      mem[thread_id] = thread_id;
  }

  8. CUDA: Hardware. (Diagram: the GPU contains multiprocessors 1 through N; each multiprocessor contains processing elements 1 through M, an instruction unit, per-element registers, and a Per-Block Shared Memory (PBSM); all multiprocessors share the global device memory.)

  9. Outline • GPGPU • CUDA • Automated analyses • Correctness: race conditions • Performance: bank conflicts • Preliminary results • Future work • Conclusion

  10. Race Conditions • Ordering of instructions among multiple threads is arbitrary • Relaxed memory consistency model • Synchronization: __syncthreads() • Barrier / memory fence

  11. Race Conditions: Example

  1 extern __shared__ int s[];
  2
  3 __global__ void kernel(int *out) {
  4     int id = threadIdx.x;
  5     int nt = blockDim.x;
  6
  7     s[id] = id;
  8     out[id] = s[(id + 1) % nt];
  9 }

  (Diagram: each of the nt threads writes s[id] on line 7 and then reads s[(id + 1) % nt], the slot written by its neighbor, on line 8, with no intervening barrier.)

  12. Automatic Instrumentation. Original CUDA Source Code → Intermediate Representation → Instrumentation → Instrumented CUDA Source Code → Compile → Execute → Output: Race Conditions Detected?

  13. Race Condition Instrumentation • Two global bookkeeping arrays: • Reads & writes of all threads • Two per-thread bookkeeping arrays: • Reads & writes of a single thread • After each shared memory access: • Update bookkeeping arrays • Detect & report race conditions

  14. Race Condition Detection. Original code: RAW hazard at expression: #line 8 out[id] = s[(id + 1) % nt]; Add synchronization between lines 7 and 8: no race conditions detected.
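  For reference, a minimal sketch of the repaired kernel from slide 11, with the barrier inserted between the write on line 7 and the read on line 8 as the slide describes:

      extern __shared__ int s[];

      __global__ void kernel(int *out) {
          int id = threadIdx.x;
          int nt = blockDim.x;

          s[id] = id;
          __syncthreads();  // barrier: every write to s[] completes before any read
          out[id] = s[(id + 1) % nt];
      }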

  15. Outline • GPGPU • CUDA • Automated analyses • Correctness: race conditions • Performance: bank conflicts • Preliminary results • Future work • Conclusion

  16. Bank Conflicts • PBSM is fast • Much faster than global memory • Potentially as fast as register access • …assuming no bank conflicts • Bank conflicts cause serialized access

  17. Non-Conflicting Access Patterns. (Diagrams: threads 0-7 mapping onto banks 0-7 with stride = 1 and with stride = 3; in both cases each thread lands on a distinct bank.)

  18. Conflicting Access Patterns. (Diagrams: threads 0-7 mapping onto banks 0-7 with stride = 4 and with stride = 16; with stride = 4 several threads share each bank, and with stride = 16 every thread maps to the same bank.)
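  On the hardware this deck targets, the PBSM has 16 banks and successive 32-bit words map to successive banks, so the bank touched by a word index is simply index % 16. The following host-side sketch (names and layout are illustrative, not from the tool) tallies how many threads of a 16-thread half-warp land on each bank for the strides shown on the two slides above:

      #include <stdio.h>

      #define NUM_BANKS 16  // PBSM banks on G80-era hardware
      #define HALF_WARP 16  // threads that access PBSM together

      // Count accesses per bank when thread i touches word index i * stride.
      static void report(int stride) {
          int per_bank[NUM_BANKS] = {0};
          for (int tid = 0; tid < HALF_WARP; tid++)
              per_bank[(tid * stride) % NUM_BANKS]++;
          printf("stride %2d:", stride);
          for (int b = 0; b < NUM_BANKS; b++)
              printf(" %2d", per_bank[b]);
          printf("\n");
      }

      int main(void) {
          report(1);   // conflict-free: one thread per bank
          report(3);   // conflict-free: odd strides permute the banks
          report(4);   // four threads pile onto each of four banks
          report(16);  // worst case: all 16 threads hit bank 0
          return 0;
      }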

  19. Impact of Bank Conflicts

  20. Automatic Instrumentation. Original CUDA Source Code → Intermediate Representation → Instrumentation → Instrumented CUDA Source Code → Compile → Execute → Output: Race Conditions Detected? Output: Bank Conflicts Detected?

  21. Bank Conflict Instrumentation • Global bookkeeping array: • Tracks address accessed by each thread • After each PBSM access: • Each thread updates its entry • One thread computes and reports bank conflicts
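  A minimal sketch of that bookkeeping as device code. The array, function, and message names are illustrative, not the tool's; printf works here because, per slide 26, the instrumented code runs in emulation mode:

      // accessed[] is the global bookkeeping array: one recorded word index per thread.
      __device__ void record_and_check(int *accessed, int thread_id, int addr,
                                       int num_threads) {
          accessed[thread_id] = addr;   // each thread updates its entry
          __syncthreads();
          if (thread_id == 0) {         // one thread tallies and reports
              int per_bank[16] = {0};   // 16 PBSM banks
              for (int i = 0; i < num_threads; i++)
                  per_bank[accessed[i] % 16]++;
              for (int b = 0; b < 16; b++)
                  if (per_bank[b] > 1)  // two or more accesses serialize
                      printf("Bank %d: %d accesses\n", b, per_bank[b]);
          }
          __syncthreads();              // keep entries consistent for the next access
      }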

  22. Bank Conflict Detection
  CAUSE_BANK_CONFLICTS = true:
  Bank conflicts at: #line 14 mem[j]++
  Bank:     0  1  2  3  4  5  6  7  8  9 …
  Accesses: 16 0  0  0  0  0  0  0  0  0 …
  CAUSE_BANK_CONFLICTS = false:
  No bank conflicts at: #line 14 mem[j]++

  23. Preliminary Results • Scan • Included in CUDA SDK • All-prefix sums operation • 400 lines of code • Explicitly prevents race conditions and bank conflicts
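  For context, the all-prefix-sums (scan) operation computes, for each position, the sum of the elements that precede it. The SDK version is a parallel, work-efficient implementation; this sequential sketch of the exclusive form only shows what the operation computes:

      // Exclusive scan: out[i] = in[0] + in[1] + ... + in[i-1], with out[0] = 0.
      void scan(const int *in, int *out, int n) {
          int sum = 0;
          for (int i = 0; i < n; i++) {
              out[i] = sum;   // sum of all elements before position i
              sum += in[i];
          }
      }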

  24. Preliminary Results:Race Condition Detection • Original code: • No race conditions detected • Remove any synchronization calls: • Race conditions detected

  25. Preliminary Results:Bank Conflict Detection • Original code: • Small number of minor bank conflicts • Enable bank conflict avoidance macro: • Bank conflicts increased! • Confirmed by manual analysis • Culprit: incorrect emulation mode

  26. Instrumentation Overhead • Two sources: • Emulation • Instrumentation • Assumption: for debugging, programmers will already use emulation mode

  27. Instrumentation Overhead

  28. Future Work • Find more types of bugs • Correctness: array bounds checking • Performance: memory coalescing • Reduce instrumentation overhead • Execute instrumented code natively

  29. Conclusion • GPGPU: enormous performance potential • But parallel programming is challenging • Automated instrumentation can help • Find synchronization bugs • Identify inefficient memory accesses • And more…

  30. Questions? Instrumentation tool will be available at: http://www.cs.virginia.edu/~mwb7w/cuda

  31. Domain Mapping From: NVIDIA CUDA Programming Guide, Version 1.1

  32. Coalesced Accesses From: NVIDIA CUDA Programming Guide, Version 1.1

  33. Non-Coalesced Accesses From: NVIDIA CUDA Programming Guide, Version 1.1
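  The two figures above come from the programming guide. As a rough sketch (simplified to the G80-era coalescing rules), the difference is whether a half-warp's accesses fall in one contiguous, aligned segment:

      // Coalesced: thread i reads word i, so a half-warp touches one contiguous,
      // aligned segment and the hardware combines the reads into one transaction.
      __global__ void coalesced(const int *in, int *out) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          out[i] = in[i];
      }

      // Non-coalesced: a stride scatters the half-warp across segments, so the
      // accesses issue as separate, serialized transactions.
      __global__ void strided(const int *in, int *out, int stride) {
          int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
          out[i] = in[i];
      }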

  34. Race Condition Detection Algorithm • A thread t knows a race condition exists at shared memory location m if: • Location m has been read from and written to • One of the accesses to m came from t • One of the accesses to m came from a thread other than t • Note that we are only checking for RAW and WAR hazards
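  As a sketch of that predicate, reads[m] and writes[m] below are illustrative bitmasks of the threads that have accessed shared location m since the last barrier; the tool's actual bookkeeping, shown on slide 36, uses simpler per-thread byte arrays:

      // Returns nonzero if thread t can report a race at shared location m.
      int race_at(unsigned reads[], unsigned writes[], int m, int t) {
          unsigned all = reads[m] | writes[m];
          int read_and_written = (reads[m] != 0) && (writes[m] != 0);
          int involves_t       = (all & (1u << t)) != 0;
          int involves_other   = (all & ~(1u << t)) != 0;
          return read_and_written && involves_t && involves_other;
      }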

  35. Bank Conflicts: Example

  extern __shared__ int mem[];

  __global__ void kernel(int iters) {
      int min, stride, max, id = threadIdx.x;
      if (CAUSE_BANK_CONFLICTS) {
          // Set stride to cause bank conflicts
          // (values elided on the slide; illustratively, a stride of 16 words
          //  sends every one of this thread's accesses to bank 0)
          min = 16 * id;
          stride = 16;
      } else {
          // Set stride to avoid bank conflicts
          // (illustratively, a unit offset keeps each thread in its own bank)
          min = id;
          stride = 16;
      }
      max = min + 16 * stride;  // illustrative bound, also elided on the slide
      for (int i = 0; i < iters; i++)
          for (int j = min; j < max; j += stride)
              mem[j]++;
  }

  36. Instrumented Code Example

  Original Code:

      extern __shared__ int s[];

      __global__ void kernel() {
          int id = threadIdx.x;
          int nt = blockDim.x * blockDim.y * blockDim.z;
          s[id] = id;
          int temp = s[(nt + id - 1) % nt];
      }

  Instrumented Code (after the instrumentation pass):

      extern __shared__ int s[];

      __global__ void kernel(void) {
          // Instrumentation code
          int block_size = blockDim.x * blockDim.y * blockDim.z;
          int thread_id = threadIdx.x + (threadIdx.y * blockDim.x) +
                          (threadIdx.z * blockDim.x * blockDim.y);
          __shared__ char mem_reads[PUT_ARRAY_SIZE_HERE];
          __shared__ char mem_writes[PUT_ARRAY_SIZE_HERE];
          if (thread_id == 0) {
              for (int i = 0; i < block_size; i++) {
                  mem_reads[i] = 0;
                  mem_writes[i] = 0;
              }
          }
          __syncthreads();
          char hazard = 0;

          int id;
          int nt;
          int temp;
          id = (int)threadIdx.x;
          nt = (int)((blockDim.x * blockDim.y) * blockDim.z);

          //#line 9
          s[id] = id;
          // Instrumentation code
          mem_writes[id] = 1;
          __syncthreads();
          if (thread_id == 0) {
              for (int i = 0; i < block_size; i++) {
                  if (mem_reads[i] && mem_writes[i]) {
                      hazard = 1;
                      break;
                  }
              }
              if (hazard)
                  printf("WAR hazard at expression: #line 9 s[id] = id;\n");
              hazard = 0;
          }

          //#line 10
          temp = s[((nt + id) - 1) % nt];
          // Instrumentation code
          mem_reads[((nt + id) - 1) % nt] = 1;
          __syncthreads();
          if (thread_id == 0) {
              for (int i = 0; i < block_size; i++) {
                  if (mem_reads[i] && mem_writes[i]) {
                      hazard = 1;
                      break;
                  }
              }
              if (hazard)
                  printf("RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) %% nt];\n");
              hazard = 0;
          }

          //#line 11
          return;
      }

  Output:

      RAW hazard at expression: #line 10 temp = s[((nt + id) - 1) % nt];
