CUDA Profiling

Bachelor Presentation Session - September 2012
Marian-Cristian Rotariu, marian.c.rotariu@gmail.com

Contents
• Profiling
• Architecture
• Instrumentation
• Performance
• Future development
• Questions

16.09.12 Bachelor Presentation Session - July 2010

Profiling
• Program analysis
• Types: static, dynamic
• Other types:
    • Event-based profilers: JVM Tools Interface, .NET Profiling API
    • Statistical profilers: gprof, OProfile, AMD CodeAnalyst, Intel VTune
    • Instrumenting profilers: gprof, ATOM
    • Simulator profilers: OLIVER, SIMON, GPU Ocelot

CUDA
• General Programming Architecture
• Code representation: CUDA C, PTX, cubin
• nvcc, CUDA-gdb, Visual Profiler

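CUDA C Essentials

    /* Kernel: runs on the device, one thread per character. */
    __global__ void helloWorld(char* str)
    {
        /* Global index of this thread across all blocks. */
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        str[idx] += idx;
    }

    /* Host code: allocate device memory, copy the input, launch the kernel. */
    __host__ int main(int argc, char** argv)
    {
        ...
        cudaMalloc((void**)&d_str, size);
        cudaMemcpy(d_str, str, size, cudaMemcpyHostToDevice);
        helloWorld<<< dimGrid, dimBlock >>>(d_str);
        ...
    }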

Architecture

Instrumentations 1

    void randomInit(float *data, int size)
    {
        gettimeofday(&start_randomInit, NULL);       /* timer around the whole function body */
        int i;
        gettimeofday(&start_randomInit_11, NULL);    /* timer around a single statement */
        cudaMalloc((void **)&d_B, mem_size_B);
        gettimeofday(&stop_randomInit_11, NULL);
        gettimeofday(&stop_randomInit, NULL);
    }
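The slide shows only where the `gettimeofday` samples are taken, not how they are turned into a duration. A minimal sketch of that last step, assuming each start/stop pair is a `struct timeval`; the helper names `elapsed_us` and `elapsed_ms` are hypothetical and do not come from the presentation:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: microseconds between two gettimeofday() samples.
   tv_usec may be smaller in `stop` than in `start`; the signed
   arithmetic below handles that borrow correctly. */
long long elapsed_us(const struct timeval *start, const struct timeval *stop)
{
    assert(start != NULL && stop != NULL);
    long long sec  = (long long)stop->tv_sec  - (long long)start->tv_sec;
    long long usec = (long long)stop->tv_usec - (long long)start->tv_usec;
    return sec * 1000000LL + usec;
}

/* Hypothetical helper: the same interval expressed in milliseconds. */
double elapsed_ms(const struct timeval *start, const struct timeval *stop)
{
    return (double)elapsed_us(start, stop) / 1000.0;
}
```

With the variables from the slide, a call such as `elapsed_ms(&start_randomInit_11, &stop_randomInit_11)` would give the wall-clock time of the bracketed `cudaMalloc` call. Note that `gettimeofday` measures wall-clock time and can jump if the system clock is adjusted.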

Instrumentations 2

    int main()
    {
        ...
        cudaEventRecord(start_MyKernel_10, 0);
        matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
        cudaEventRecord(stop_MyKernel_10, 0);
        cudaEventSynchronize(stop_MyKernel_10);
        ...
    }

    __global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
    {
        clock_t start_matrixMul = clock();
        ...
        clock_t stop_matrixMul = clock();
        result_matrixMul[blockIdx.x * blockDim.x + threadIdx.x] = stop_matrixMul - start_matrixMul;
    }
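The kernel above stores one cycle count per thread in `result_matrixMul`. How the profiler aggregates these values is not shown, so the following is only a sketch of one plausible host-side reduction. It assumes the counts have already been copied back into a `long` array; `CycleStats`, `summarize_cycles` and `cycles_to_ms` are hypothetical names, and the conversion assumes the clock rate is given in kHz (the unit `cudaDeviceProp.clockRate` has traditionally used):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of per-thread cycle counts copied back from the device. */
typedef struct {
    long   min_cycles;
    long   max_cycles;
    double mean_cycles;
} CycleStats;

/* Scan the array once, tracking the minimum, the maximum and the running sum. */
CycleStats summarize_cycles(const long *cycles, size_t n)
{
    assert(cycles != NULL && n > 0);
    CycleStats s = { cycles[0], cycles[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (cycles[i] < s.min_cycles) s.min_cycles = cycles[i];
        if (cycles[i] > s.max_cycles) s.max_cycles = cycles[i];
        sum += (double)cycles[i];
    }
    s.mean_cycles = sum / (double)n;
    return s;
}

/* Assumption: clock_khz is the device clock in kHz, i.e. cycles per millisecond. */
double cycles_to_ms(double cycles, double clock_khz)
{
    assert(clock_khz > 0.0);
    return cycles / clock_khz;
}
```

The maximum is usually the figure of interest for a kernel, since the launch finishes only when its slowest thread does. The device-side counter can wrap around on long-running kernels, which this sketch does not handle.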

Instrumentations 3

    int main()
    {
        ...
        registerAlloc("main", mem_size_A);
        cudaMalloc((void**) &d_A, mem_size_A);
        ...
        registerCopy("main", mem_size_A, 0);
        cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
        ...
        registerKernel("main", "matrixMul", d_C.x, d_C.y, ...);
        matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
    }
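Only the call sites of `registerAlloc` and `registerCopy` appear on the slide; their implementation is not shown. The sketch below is one hypothetical way such bookkeeping could work: an in-memory event table. The table layout, the `totalBytes` query and the meaning of the third `registerCopy` argument (treated here as an opaque direction code) are all assumptions. `registerKernel` is left out because its argument list is elided in the source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 256

typedef enum { EVENT_ALLOC, EVENT_COPY } EventKind;

/* One recorded runtime call (hypothetical layout). */
typedef struct {
    EventKind kind;
    char      scope[32];   /* enclosing function, e.g. "main" */
    size_t    bytes;       /* size passed to cudaMalloc / cudaMemcpy */
    int       direction;   /* copies only; meaning assumed, -1 otherwise */
} ProfileEvent;

static ProfileEvent g_events[MAX_EVENTS];
static size_t g_event_count = 0;

/* Append one event; silently drops events once the table is full. */
static void record(EventKind kind, const char *scope, size_t bytes, int direction)
{
    assert(scope != NULL);
    if (g_event_count >= MAX_EVENTS) return;
    ProfileEvent *e = &g_events[g_event_count++];
    e->kind = kind;
    strncpy(e->scope, scope, sizeof e->scope - 1);
    e->scope[sizeof e->scope - 1] = '\0';
    e->bytes = bytes;
    e->direction = direction;
}

void registerAlloc(const char *scope, size_t bytes)
{
    record(EVENT_ALLOC, scope, bytes, -1);
}

void registerCopy(const char *scope, size_t bytes, int direction)
{
    record(EVENT_COPY, scope, bytes, direction);
}

/* Total bytes recorded for one kind of event. */
size_t totalBytes(EventKind kind)
{
    size_t total = 0;
    for (size_t i = 0; i < g_event_count; ++i)
        if (g_events[i].kind == kind) total += g_events[i].bytes;
    return total;
}
```

Recording the call just before the real CUDA runtime call, as the slide does, keeps the instrumentation independent of whether that call succeeds.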

Performance

MD5 Crack
• Brute force
• Each word on one kernel
• Results on 8800 GTS
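The slide does not say how candidate words are generated. A common approach for brute-force search, sketched here as an assumption rather than as the presenter's method, is to treat a linear index (for example a GPU thread's global index) as a number written in base `strlen(alphabet)` and map each digit to a character. The function name `index_to_word` is hypothetical, and the MD5 hashing step itself is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write the idx-th candidate of exactly `len` characters over `alphabet`
   into `out`, which must hold at least len + 1 bytes. The last character
   varies fastest, like the least significant digit of a number. */
void index_to_word(unsigned long long idx, const char *alphabet, size_t len, char *out)
{
    size_t base = strlen(alphabet);
    assert(base > 0);
    for (size_t i = 0; i < len; ++i) {
        out[len - 1 - i] = alphabet[idx % base];
        idx /= base;
    }
    out[len] = '\0';
}
```

Because every index maps to exactly one word, the candidates can be checked independently and in any order, which is what makes the search easy to spread across many threads.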

Conclusion & Future development
• Reliable and simple solution
• NVIDIA competitor
• Open source PTX parser
• Optimization of instrumented code
• New interface

Questions
