This Bachelor Presentation Session talk from September 2012 covers CUDA profiling. It surveys profiler types (event-based, statistical, instrumenting, and simulator profilers such as gprof, OProfile, AMD CodeAnalyst, and GPU Ocelot), the CUDA programming architecture and its code representations (CUDA C, PTX, cubin), and the nvcc, CUDA-GDB, and Visual Profiler tools. Instrumentation examples built on the randomInit and matrixMul code show host-side timing with gettimeofday, kernel timing with CUDA events and clock(), and registration calls for allocations, copies, and kernel launches. The talk closes with an MD5 brute-force cracking test on an 8800 GTS and with future work: an open-source PTX parser, optimization of the instrumented code, and a new interface.
CUDA Profiling
Bachelor Presentation Session - September 2012
Marian-Cristian Rotariu
marian.c.rotariu@gmail.com
Contents
• Profiling
• Architecture
• Instrumentation
• Performance
• Future development
• Questions
16.09.12 Bachelor Presentation Session - July 2010
Profiling
• Program analysis
• Types: static, dynamic
• Other types:
    • Event-based profilers: JVM Tool Interface, .NET Profiling API
    • Statistical profilers: gprof, OProfile, AMD CodeAnalyst, Intel VTune
    • Instrumenting profilers: gprof, ATOM
    • Simulator profilers: OLIVER, SIMON, GPU Ocelot
CUDA
• General programming architecture
• Code representation: CUDA C, PTX, cubin
• nvcc, CUDA-GDB, Visual Profiler
CUDA C Essentials
    // Kernel: each thread adds its global index to one character of str.
    __global__ void helloWorld(char* str)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        str[idx] += idx;
    }

    __host__ int main(int argc, char** argv)
    {
        ...
        cudaMalloc((void**)&d_str, size);                       // allocate the device buffer
        cudaMemcpy(d_str, str, size, cudaMemcpyHostToDevice);   // copy the input to the device
        helloWorld<<< dimGrid, dimBlock >>>(d_str);             // launch the kernel
        ...
    }
Instrumentations 1
    void randomInit(float *data, int size)
    {
        gettimeofday(&start_randomInit, NULL);       // start: whole function
        int i;
        gettimeofday(&start_randomInit_11, NULL);    // start: the cudaMalloc call
        cudaMalloc((void **)&d_B, mem_size_B);
        gettimeofday(&stop_randomInit_11, NULL);     // stop: the cudaMalloc call
        gettimeofday(&stop_randomInit, NULL);        // stop: whole function
    }
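The slide records start and stop timestamps but does not show how they become a duration. A minimal sketch of that step in plain C follows, assuming the timestamps are `struct timeval` values filled by `gettimeofday`. The names `elapsed_us` and `report_region` are hypothetical helpers introduced for illustration and are not taken from the presentation.

```c
#include <stdio.h>
#include <sys/time.h>

/* Elapsed microseconds between two gettimeofday() samples.
 * Hypothetical helper, not part of the original slides. */
long long elapsed_us(const struct timeval *start, const struct timeval *stop)
{
    long long sec  = (long long)stop->tv_sec  - (long long)start->tv_sec;
    long long usec = (long long)stop->tv_usec - (long long)start->tv_usec;
    /* usec may be negative when the stop sample crossed a second boundary;
     * the sum is still correct because sec is one larger in that case. */
    return sec * 1000000LL + usec;
}

/* Print one timed region, e.g. the pair start_randomInit / stop_randomInit. */
void report_region(const char *name,
                   const struct timeval *start,
                   const struct timeval *stop)
{
    printf("%s: %lld us\n", name, elapsed_us(start, stop));
}
```

With the variables from the slide, a call such as `report_region("randomInit", &start_randomInit, &stop_randomInit)` would print the time spent in the whole function. Note that `gettimeofday` measures wall-clock time, so it is only meaningful here for calls that block until they finish.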
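Instrumentations 2
    int main()
    {
        ...
        cudaEventRecord(start_MyKernel_10, 0);        // mark the start in stream 0
        matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
        cudaEventRecord(stop_MyKernel_10, 0);         // mark the end in stream 0
        cudaEventSynchronize(stop_MyKernel_10);       // wait until the stop event has completed
        ...
    }

    __global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
    {
        clock_t start_matrixMul = clock();            // per-thread start, in clock cycles
        ...
        clock_t stop_matrixMul = clock();             // per-thread stop, in clock cycles
        // Store this thread's cycle count at its global index.
        result_matrixMul[blockIdx.x * blockDim.x + threadIdx.x] =
            stop_matrixMul - start_matrixMul;
    }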
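The kernel above leaves one cycle count per thread in `result_matrixMul`. The slide does not say how those values are reported, so the following is only a sketch of one plausible host-side reduction, assuming the array has already been copied back to the host and converted to `long long`. The type `cycle_stats` and the function `summarize_cycles` are hypothetical names.

```c
#include <stddef.h>

/* Summary of per-thread cycle counts (hypothetical type). */
typedef struct {
    long long min;
    long long max;
    double    mean;
} cycle_stats;

/* Reduce n per-thread cycle counts to min, max, and mean.
 * Assumes n > 0 and that cycles[] was copied back from the device. */
cycle_stats summarize_cycles(const long long *cycles, size_t n)
{
    cycle_stats s = { cycles[0], cycles[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (cycles[i] < s.min) s.min = cycles[i];
        if (cycles[i] > s.max) s.max = cycles[i];
        sum += (double)cycles[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

The host-side event pair is read differently: the CUDA runtime call `cudaEventElapsedTime` returns the time between two recorded events in milliseconds, so the two measurements are in different units (milliseconds on the host, clock cycles in the kernel).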
Instrumentations 3
    int main()
    {
        ...
        registerAlloc("main", mem_size_A);            // log the allocation before it happens
        cudaMalloc((void**) &d_A, mem_size_A);
        ...
        registerCopy("main", mem_size_A, 0);          // log the host-to-device copy
        cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
        ...
        registerKernel("main", "matrixMul", d_C.x, d_C.y, ...);   // log the kernel launch
        matrixMul<<< grid, threads >>>(d_C, d_A, d_B, WA, WB);
    }
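The slide shows only the call sites of `registerAlloc` and `registerCopy`, not their implementation. The sketch below is one way such calls could be backed by a fixed-size event log. Everything except the two function names and their argument order is an assumption: the parameter types, the meaning of the third `registerCopy` argument (treated here as an opaque direction code), the `profile_event` layout, and the `total_bytes` helper. `registerKernel` is left out because its argument list is elided in the source.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_EVENTS 256

typedef enum { EV_ALLOC, EV_COPY } event_kind;

/* One logged operation (hypothetical layout). */
typedef struct {
    event_kind kind;
    char       caller[32];  /* name of the calling function, e.g. "main" */
    size_t     bytes;       /* size of the allocation or copy */
    int        direction;   /* copies only; meaning assumed, -1 for allocations */
} profile_event;

static profile_event event_log[MAX_EVENTS];
static size_t event_count = 0;

static void push_event(event_kind kind, const char *caller,
                       size_t bytes, int direction)
{
    if (event_count >= MAX_EVENTS)
        return;  /* log full: drop the event rather than overflow */
    profile_event *e = &event_log[event_count++];
    e->kind = kind;
    snprintf(e->caller, sizeof e->caller, "%s", caller);
    e->bytes = bytes;
    e->direction = direction;
}

void registerAlloc(const char *caller, size_t bytes)
{
    push_event(EV_ALLOC, caller, bytes, -1);
}

void registerCopy(const char *caller, size_t bytes, int direction)
{
    push_event(EV_COPY, caller, bytes, direction);
}

/* Total bytes logged for one kind of event. */
size_t total_bytes(event_kind kind)
{
    size_t total = 0;
    for (size_t i = 0; i < event_count; ++i)
        if (event_log[i].kind == kind)
            total += event_log[i].bytes;
    return total;
}
```

Logging into a preallocated array keeps the registration calls cheap, which matters because they run inside the code being measured.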
MD5 Crack
• Brute force
• Each word on one kernel
• Results on 8800 GTS
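The slide gives no code for the brute-force search. A common way to give every worker its own candidate word is to treat a linear index as a number written in base `strlen(charset)`. The sketch below shows that mapping in plain C, on the assumption that the presentation's kernel did something similar. The function `index_to_word` and its parameters are hypothetical, and the MD5 hashing step itself is not shown.

```c
#include <stddef.h>
#include <string.h>

/* Write the candidate word number `index` of the given length into out.
 * The index is interpreted as a number in base strlen(charset), most
 * significant character first. out must hold length + 1 bytes.
 * Hypothetical helper, not taken from the slides. */
void index_to_word(unsigned long long index, const char *charset,
                   size_t length, char *out)
{
    size_t base = strlen(charset);
    for (size_t pos = 0; pos < length; ++pos) {
        out[length - 1 - pos] = charset[index % base];
        index /= base;
    }
    out[length] = '\0';
}
```

With the charset "abc" and length 2, indices 0 through 8 enumerate "aa", "ab", "ac", "ba", and so on up to "cc". On the GPU, the index would typically be the global thread index, computed as in the helloWorld kernel.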
Conclusion & Future development
• Reliable and simple solution
• NVIDIA competitor
• Open source PTX parser
• Optimization of instrumented code
• New interface