CUDA Profiling Bachelor Presentation Session - September 2012 (16.09.12) Marian-Cristian Rotariu marian.c.rotariu@gmail.com
Contents
• Profiling
• Architecture
• Instrumentation
• Performance
• Future development
• Questions
Profiling
• Program analysis
• Types: static, dynamic
• Other types:
    • Event-based profilers: JVM Tools Interface, .NET Profiling API
    • Statistical profilers: Gprof, OProfile, AMD CodeAnalyst, Intel VTune
    • Instrumenting profilers: Gprof, ATOM
    • Simulator profilers: OLIVER, SIMON, GPU Ocelot
CUDA
• General programming architecture
• Code representation: CUDA C, PTX, cubin
• nvcc, CUDA-gdb, Visual Profiler
CUDA C Essentials
    __global__ void helloWorld(char* str)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        str[idx] += idx;
    }

    __host__ int main(int argc, char** argv)
    {
        ...
        cudaMalloc((void**)&d_str, size);
        cudaMemcpy(d_str, str, size, cudaMemcpyHostToDevice);
        helloWorld<<< dimGrid, dimBlock >>>(d_str);
        ...
    }
Instrumentations 1
    void randomInit(float *data, int size)
    {
        /* Outer timer: covers the whole function body. */
        gettimeofday(&start_randomInit, NULL);
        int i;
        /* Inner timer: covers only the cudaMalloc call. */
        gettimeofday(&start_randomInit_11, NULL);
        cudaMalloc((void **)&d_B, mem_size_B);
        gettimeofday(&stop_randomInit_11, NULL);
        gettimeofday(&stop_randomInit, NULL);
    }
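The start/stop pairs recorded above only become a measurement once they are subtracted. A minimal sketch of that step in plain C follows; the helper names elapsed_us and time_call_us are assumptions made for illustration and do not appear in the presentation.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: microseconds elapsed between two gettimeofday() samples. */
long long elapsed_us(const struct timeval *start, const struct timeval *stop)
{
    return (long long)(stop->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(stop->tv_usec - start->tv_usec);
}

/* Hypothetical helper: wall-clock time of one call to fn, in microseconds.
 * It brackets the call the same way the instrumented code brackets cudaMalloc. */
long long time_call_us(void (*fn)(void))
{
    struct timeval start, stop;
    gettimeofday(&start, NULL);
    fn();
    gettimeofday(&stop, NULL);
    return elapsed_us(&start, &stop);
}
```

Note that tv_usec of the stop sample can be smaller than that of the start sample; combining seconds and microseconds in one signed expression handles that case.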
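Instrumentations 2
    int main()
    {
        ...
        /* Host side: bracket the kernel launch with CUDA events on stream 0. */
        cudaEventRecord(start_MyKernel_10, 0);
        matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
        cudaEventRecord(stop_MyKernel_10, 0);
        /* Block until the stop event has been reached. */
        cudaEventSynchronize(stop_MyKernel_10);
        ...
    }

    __global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
    {
        /* Device side: each thread samples the clock counter at entry and exit. */
        clock_t start_matrixMul = clock();
        ...
        clock_t stop_matrixMul = clock();
        /* Store this thread's cycle count at its global index. */
        result_matrixMul[blockIdx.x * blockDim.x + threadIdx.x] = stop_matrixMul - start_matrixMul;
    }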
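The kernel above leaves one cycle count per thread in result_matrixMul. One plausible way to turn that array into a report, once it has been copied back to the host, is to reduce it to minimum, maximum and mean. The sketch below is plain C and assumes the counts have already been copied into a long long array; the cycle_stats type and summarize_cycles function are hypothetical and not part of the presentation.

```c
#include <stddef.h>

/* Hypothetical summary of per-thread cycle counts. */
typedef struct {
    long long min_cycles;
    long long max_cycles;
    double mean_cycles;
} cycle_stats;

/* Reduce n per-thread cycle counts to min, max and mean.
 * Returns all zeros when n is 0. */
cycle_stats summarize_cycles(const long long *cycles, size_t n)
{
    cycle_stats s = {0, 0, 0.0};
    double sum = 0.0;
    size_t i;

    if (n == 0)
        return s;

    s.min_cycles = cycles[0];
    s.max_cycles = cycles[0];
    for (i = 0; i < n; ++i) {
        if (cycles[i] < s.min_cycles) s.min_cycles = cycles[i];
        if (cycles[i] > s.max_cycles) s.max_cycles = cycles[i];
        sum += (double)cycles[i];
    }
    s.mean_cycles = sum / (double)n;
    return s;
}
```

For the host-side event pair, the CUDA runtime call cudaEventElapsedTime reports the time between two recorded events in milliseconds, so no manual subtraction is needed there.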
Instrumentations 3
    int main()
    {
        ...
        registerAlloc("main", mem_size_A);
        cudaMalloc((void**) &d_A, mem_size_A);
        ...
        registerCopy("main", mem_size_A, 0);
        cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
        ...
        registerKernel("main", "matrixMul", d_C.x, d_C.y, ...);
        matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
    }
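The presentation does not show what the register* calls do internally. As a rough illustration of the idea, the sketch below records each call as an entry in an in-memory event log that can be queried later. Everything in it is an assumption: the names log_alloc, log_copy and log_kernel are hypothetical stand-ins for registerAlloc, registerCopy and registerKernel, the meaning of the third copy argument is guessed to be a direction code, and the kernel arguments that the slide elides are left out.

```c
#include <stddef.h>

#define MAX_EVENTS 256

typedef enum { EV_ALLOC, EV_COPY, EV_KERNEL } event_kind;

/* One recorded runtime call (hypothetical layout). */
typedef struct {
    event_kind kind;
    const char *caller;   /* name of the enclosing function */
    const char *kernel;   /* kernel name, NULL unless kind == EV_KERNEL */
    size_t bytes;         /* bytes allocated or copied, 0 for kernels */
    int direction;        /* assumed copy direction code, 0 otherwise */
} profile_event;

profile_event g_events[MAX_EVENTS];
size_t g_event_count = 0;

/* Append an event; silently drops it when the log is full. */
static void push_event(profile_event ev)
{
    if (g_event_count < MAX_EVENTS)
        g_events[g_event_count++] = ev;
}

void log_alloc(const char *caller, size_t bytes)
{
    profile_event ev = { EV_ALLOC, caller, NULL, bytes, 0 };
    push_event(ev);
}

void log_copy(const char *caller, size_t bytes, int direction)
{
    profile_event ev = { EV_COPY, caller, NULL, bytes, direction };
    push_event(ev);
}

void log_kernel(const char *caller, const char *kernel)
{
    profile_event ev = { EV_KERNEL, caller, kernel, 0, 0 };
    push_event(ev);
}

/* Total bytes recorded for events of one kind. */
size_t total_bytes(event_kind kind)
{
    size_t total = 0, i;
    for (i = 0; i < g_event_count; ++i)
        if (g_events[i].kind == kind)
            total += g_events[i].bytes;
    return total;
}
```

Keeping the log in memory and reporting after the run avoids doing I/O next to the calls being measured, which is one reason a design like this might be preferred over printing at each call site.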
MD5 Crack
• Brute force
• Each word on one kernel
• Results on 8800 GTS
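A brute-force search needs a way to assign a distinct candidate word to each unit of work. The sketch below shows one common approach, written in plain C: treat a linear index as a number in base "alphabet size" and read its digits as characters. It assumes fixed-length words and a caller-supplied alphabet; the function name index_to_word is hypothetical, and the MD5 hashing and comparison steps are not shown.

```c
#include <stddef.h>

/* Write the candidate word with the given linear index into out.
 * The word has exactly len characters drawn from alphabet, and out
 * must have room for len + 1 bytes. Index 0 maps to the word made of
 * the first alphabet character repeated len times. */
void index_to_word(unsigned long long index, const char *alphabet,
                   size_t alphabet_len, size_t len, char *out)
{
    size_t i;
    for (i = 0; i < len; ++i) {
        /* Least significant digit goes in the last position. */
        out[len - 1 - i] = alphabet[index % alphabet_len];
        index /= alphabet_len;
    }
    out[len] = '\0';
}
```

On a GPU, the index would typically be derived from blockIdx.x * blockDim.x + threadIdx.x, as in the helloWorld kernel, so that each thread tests a different word.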
Conclusion & Future development
• Reliable and simple solution
• NVIDIA competitor
• Open source
• PTX parser
• Optimization of instrumented code
• New interface