Çankaya University Computer Engineering Department
Parallel GPU Programming with NVIDIA CUDA
Ahmet Artu YILDIRIM
January 2010
Overview
• Introduction and Comparison between CPU & GPU
• The Execution Model
• The Memory Model
• CUDA API Basics and Sample Kernel Functions
• Case Study
• Other GPU Programming Models
Introduction
• The Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational power and very high memory bandwidth.
• CUDA (Compute Unified Device Architecture) is the parallel computing engine in NVIDIA GPUs, accessible to software developers through standard programming languages such as C and Fortran.
Comparison between CPU and GPU
[Charts: floating-point operations per second and memory bandwidth, CPU vs. GPU]
Comparison between CPU and GPU
• The GPU devotes more transistors to data processing
• The GPU is especially well suited to computations that are data-parallel in nature
Execution Model
• Thread: the smallest execution unit
• Block: a collection of threads
• Grid: a collection of blocks; the highest level of the hierarchy
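As a rough sketch of how this hierarchy appears in code (the kernel name and launch sizes below are illustrative, not from the slides), each thread combines its block index and thread index to obtain a unique global index across the grid:

    // Each thread derives one global index from its block and thread coordinates
    __global__ void fillIndices(int *out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = idx;
    }

    // Launch a grid of 4 blocks, each with 256 threads (1024 threads in total):
    //   fillIndices<<<4, 256>>>(d_out, 1024);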
Memory Model
[Diagram: the CUDA memory hierarchy]
CUDA API Basics
• An extension to the C programming language
• CUDA source files are compiled with the nvcc compiler
• Function and variable type qualifiers specify execution on the host or the device
• Built-in variables give the grid and block dimensions inside a kernel function
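For example, a single-source program is built from the command line like this (the file and output names are illustrative):

    nvcc square.cu -o square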
CUDA API Basics
• Function type qualifiers
  • __device__
    • Executed on the device
    • Callable from the device only
  • __global__
    • Executed on the device
    • Callable from the host only
  • __host__
    • Executed on the host
    • Callable from the host only
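A minimal sketch showing the three function qualifiers together (the function names are illustrative):

    // __device__: callable from device code only
    __device__ float square(float x) { return x * x; }

    // __global__: a kernel, launched from the host, runs on the device
    __global__ void squareAll(float *a, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            a[idx] = square(a[idx]);
    }

    // __host__: callable from host code only (also the default with no qualifier)
    __host__ void launch(float *d_a, int n)
    {
        squareAll<<<(n + 255) / 256, 256>>>(d_a, n);
    }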
CUDA API Basics
• Variable type qualifiers
  • __device__
    • Resides in global memory space
    • Has the lifetime of an application
    • Is accessible from all the threads within the grid and from the host through the runtime library
  • __constant__ (optionally used together with __device__)
    • Resides in constant memory space
    • Has the lifetime of an application
    • Is accessible from all the threads within the grid and from the host through the runtime library
  • __shared__ (optionally used together with __device__)
    • Resides in the shared memory space of a thread block
    • Has the lifetime of the block
    • Is accessible only from the threads within the block
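A minimal sketch of the three variable qualifiers in one kernel (the names are illustrative, and the kernel assumes blocks of 256 threads):

    __constant__ float coeff[16];  // constant memory; written from the host with cudaMemcpyToSymbol
    __device__ float deviceResult; // global memory; lifetime of the application

    __global__ void scaleAndPublish(const float *in)
    {
        __shared__ float partial[256];  // shared memory; one copy per block
        int tid = threadIdx.x;
        partial[tid] = in[blockIdx.x * blockDim.x + tid] * coeff[0];
        __syncthreads();
        // The first thread of block 0 publishes one value to global memory
        if (tid == 0 && blockIdx.x == 0)
            deviceResult = partial[0];
    }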
Execution Flow
[Diagram: host–device execution flow]
The typical flow: allocate device memory, copy input data from host to device, launch the kernel, copy results back from device to host, and free device memory (see Sample 2 below).
CUDA API Basics (Sample 1)

    #define N 16

    // Kernel definition: each thread computes one element of C
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Allocate the matrices in device memory
        float (*A)[N], (*B)[N], (*C)[N];
        cudaMalloc((void **)&A, N * N * sizeof(float));
        cudaMalloc((void **)&B, N * N * sizeof(float));
        cudaMalloc((void **)&C, N * N * sizeof(float));
        // ... initialize A and B from the host (e.g. with cudaMemcpy) ...
        // Kernel invocation with one block of N x N threads
        dim3 dimBlock(N, N);
        MatAdd<<<1, dimBlock>>>(A, B, C);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }
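Note that Sample 1 launches a single block, so N × N cannot exceed the per-block thread limit (512 threads on GPUs of this generation); larger problems need a grid of many blocks, as in Sample 2.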
CUDA API Basics (Sample 2)

    #include <stdlib.h>

    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] * a[idx];
    }

    int main()
    {
        const int N = 1000;
        size_t size = N * sizeof(float);
        // Allocate and initialize the array on the host
        float *a_h = (float *)malloc(size);
        for (int i = 0; i < N; i++)
            a_h[i] = (float)i;
        // Allocate array on device and copy the host data to it
        float *a_d;
        cudaMalloc((void **)&a_d, size);
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
        // Launch enough blocks of block_size threads to cover all N elements
        int block_size = 100;
        int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
        square_array<<<n_blocks, block_size>>>(a_d, N);
        // Copy the results back to the host
        cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
        free(a_h);
        cudaFree(a_d);
        return 0;
    }
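Kernel launches return no error code directly; one common pattern (a sketch using the standard runtime calls cudaGetLastError and cudaGetErrorString; the helper name is illustrative) is to query the runtime right after the launch:

    #include <stdio.h>

    // Report any error left behind by the most recent kernel launch
    void checkLastError(const char *label)
    {
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("%s failed: %s\n", label, cudaGetErrorString(err));
    }

    // Usage, immediately after a launch:
    //   square_array<<<n_blocks, block_size>>>(a_d, N);
    //   checkLastError("square_array");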
Case Study: Comparison of Data-Intensive and Compute-Intensive GPU Programs
Device Configurations
• Graphics adapter:
  • GeForce 8600M GT
  • Bus: PCI Express x16
  • Stream processors: 32
  • Core clock: 475 MHz
  • Video memory: 256 MB
  • Memory interface: 128-bit
  • Memory clock: 702 MHz (1404 MHz data rate)
• CPU:
  • Processor: Intel(R) Core(TM)2 Duo Mobile Processor T9300
  • CPU speed: 2.50 GHz
  • Bus speed: 800 MHz
  • L2 cache size: 6 MB
  • Memory: 3.00 GB
CPU & GPU Benchmark
[Chart: scaled running-time comparison]
Other General-Purpose GPU Models
• Programming models:
  • OpenCL: open industry standard by the Khronos Group
  • Microsoft DirectCompute
• GPU processing adapters:
  • AMD FireStream
Questions? Thank You