CUDA Profiling

CUDA Profiling. Bachelor Presentation Session - September 2012. Marian-Cristian Rotariu, marian.c.rotariu@gmail.com. Contents: Profiling, Architecture, Instrumentation, Performance, Future development, Questions. 16.09.12. Bachelor Presentation Session - July 2010.


Presentation Transcript


  1. CUDA Profiling • Bachelor Presentation Session - September 2012 • Marian-Cristian Rotariu, marian.c.rotariu@gmail.com

  2. Contents • Profiling • Architecture • Instrumentation • Performance • Future development • Questions

  3. Profiling
     • Program analysis
     • Types: static, dynamic
     • Other types:
       • Event-based profilers: JVM Tools Interface, .NET Profiling API
       • Statistical profilers: Gprof, OProfile, AMD CodeAnalyst, Intel VTune
       • Instrumenting profilers: Gprof, ATOM
       • Simulator profilers: OLIVER, SIMON, GPU Ocelot

  4. CUDA • General programming architecture • Code representation: CUDA C, PTX, cubin • nvcc, CUDA-gdb, Visual Profiler

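  5. CUDA C Essentials
     // Kernel: each thread adds its global index to one character of str.
     __global__ void helloWorld(char* str) {
         int idx = blockIdx.x * blockDim.x + threadIdx.x;
         str[idx] += idx;
     }
     __host__ int main(int argc, char** argv) {
         ...
         // Allocate device memory and copy the input string to the GPU.
         cudaMalloc((void**)&d_str, size);
         cudaMemcpy(d_str, str, size, cudaMemcpyHostToDevice);
         // Launch the kernel with dimGrid blocks of dimBlock threads each.
         helloWorld<<< dimGrid, dimBlock >>>(d_str);
         ...
     }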

  6. Architecture

  7. Instrumentations 1
     void randomInit(float *data, int size) {
         gettimeofday(&start_randomInit, NULL);       // start of the whole-function timer
         int i;
         gettimeofday(&start_randomInit_11, NULL);    // timer around the CUDA call only
         cudaMalloc((void **)&d_B, mem_size_B);
         gettimeofday(&stop_randomInit_11, NULL);
         gettimeofday(&stop_randomInit, NULL);        // end of the whole-function timer
     }
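
The host-side pattern on slide 7 (a pair of gettimeofday() calls around a CUDA runtime call) can be sketched in plain C as follows. This is a minimal illustration, not the profiler's actual code: elapsed_us, time_call_us, and busy_work are hypothetical names, and busy_work only stands in for a call such as cudaMalloc so that the sketch runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Elapsed time between two gettimeofday() samples, in microseconds. */
long long elapsed_us(const struct timeval *start, const struct timeval *stop)
{
    return (long long)(stop->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(stop->tv_usec - start->tv_usec);
}

/* Hypothetical helper: bracket one call to fn(arg) with two timestamps,
 * the same way the instrumented code brackets cudaMalloc. */
long long time_call_us(void (*fn)(void *), void *arg)
{
    struct timeval start, stop;

    assert(fn != NULL);
    gettimeofday(&start, NULL);
    fn(arg);
    gettimeofday(&stop, NULL);
    return elapsed_us(&start, &stop);
}

/* Stand-in workload (assumption): replaces the CUDA call in this sketch.
 * arg points to an int holding the number of loop iterations. */
void busy_work(void *arg)
{
    volatile long sink = 0;
    int i, n = *(int *)arg;

    for (i = 0; i < n; ++i)
        sink += i;
}
```

Calling time_call_us(busy_work, &n) returns the wall-clock duration of the call in microseconds. Two caveats apply to this approach in general: gettimeofday() measures wall-clock time, and kernel launches are asynchronous with respect to the host, so a host timer around a launch needs a synchronization point before the stop timestamp.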

  8. Instrumentations 2
     int main() {
         ...
         // Host side: bracket the kernel launch with CUDA events on stream 0.
         cudaEventRecord(start_MyKernel_10, 0);
         matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
         cudaEventRecord(stop_MyKernel_10, 0);
         cudaEventSynchronize(stop_MyKernel_10);   // block until the stop event has completed
         ...
     }
     __global__ void matrixMul(float* C, float* A, float* B, int wA, int wB) {
         // Device side: each thread records its own clock() delta.
         clock_t start_matrixMul = clock();
         ...
         clock_t stop_matrixMul = clock();
         result_matrixMul[blockIdx.x * blockDim.x + threadIdx.x] = stop_matrixMul - start_matrixMul;
     }
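
Slide 8 stores one clock() delta per thread in result_matrixMul. The slides do not show how these values are consumed, so the following is only a sketch of one plausible host-side step: reducing the per-thread values to minimum, maximum, and mean once they have been copied back to the host. The struct cycle_stats type, the summarize_cycles function, and the long long element type are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of per-thread cycle counts. */
struct cycle_stats {
    long long min;
    long long max;
    double mean;
};

/* Reduce n per-thread cycle counts (already copied to host memory)
 * to their minimum, maximum, and arithmetic mean. */
struct cycle_stats summarize_cycles(const long long *cycles, size_t n)
{
    struct cycle_stats s;
    double sum = 0.0;
    size_t i;

    assert(cycles != NULL && n > 0);
    s.min = cycles[0];
    s.max = cycles[0];
    for (i = 0; i < n; ++i) {
        if (cycles[i] < s.min)
            s.min = cycles[i];
        if (cycles[i] > s.max)
            s.max = cycles[i];
        sum += (double)cycles[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

A large gap between the minimum and the maximum would suggest that some threads stall or diverge, which a single whole-kernel event time cannot show.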

  9. Instrumentations 3
     int main() {
         ...
         // Each CUDA call is preceded by a call that registers it with the profiler.
         registerAlloc("main", mem_size_A);
         cudaMalloc((void**) &d_A, mem_size_A);
         ...
         registerCopy("main", mem_size_A, 0);
         cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
         ...
         registerKernel("main", "matrixMul", d_C.x, d_C.y, ...);
         matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
     }
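
The slides show only the call sites of registerAlloc and registerCopy on slide 9, not their implementation. As a rough sketch of what such registration calls could do, the code below appends each call to a fixed-size in-memory log and sums the bytes per event kind. Everything here is an assumption: the log layout, the helper names record_event and total_bytes, and the reading of registerCopy's third argument as a copy direction. registerKernel is left out because its full argument list is elided in the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 64
#define MAX_NAME   32

enum event_kind { EVENT_ALLOC, EVENT_COPY };

/* One logged call (hypothetical layout). */
struct mem_event {
    enum event_kind kind;
    char caller[MAX_NAME];  /* name of the calling function, e.g. "main" */
    size_t bytes;           /* size passed to the CUDA call */
    int direction;          /* assumed meaning of registerCopy's third argument */
};

static struct mem_event event_log[MAX_EVENTS];
static size_t event_count = 0;

/* Append one event to the log; events beyond MAX_EVENTS are dropped. */
static void record_event(enum event_kind kind, const char *caller,
                         size_t bytes, int direction)
{
    struct mem_event *e;

    assert(caller != NULL);
    if (event_count == MAX_EVENTS)
        return;
    e = &event_log[event_count++];
    e->kind = kind;
    strncpy(e->caller, caller, MAX_NAME - 1);
    e->caller[MAX_NAME - 1] = '\0';
    e->bytes = bytes;
    e->direction = direction;
}

void registerAlloc(const char *caller, size_t bytes)
{
    record_event(EVENT_ALLOC, caller, bytes, -1);  /* -1: no direction */
}

void registerCopy(const char *caller, size_t bytes, int direction)
{
    record_event(EVENT_COPY, caller, bytes, direction);
}

/* Total bytes logged for one kind of event. */
size_t total_bytes(enum event_kind kind)
{
    size_t i, total = 0;

    for (i = 0; i < event_count; ++i)
        if (event_log[i].kind == kind)
            total += event_log[i].bytes;
    return total;
}
```

With a log like this, a report step could print per-function totals of allocated and copied bytes after the program finishes.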

  10. Performance

  11. MD5 Crack • Brute force • Each word on one kernel • Results on 8800 GTS

  12. Conclusion & Future development Reliable and simple solution NVIDIA competitor Open source PTX parser Optimization of instrumented code New interface

  13. Questions
