Çankaya University Computer Engineering Department
Parallel GPU Programming with NVIDIA CUDA
Ahmet Artu YILDIRIM
January 2010
Overview
• Introduction and Comparison between CPU & GPU
• The Execution Model
• The Memory Model
• CUDA API Basics and Sample Kernel Functions
• Case Study
• Other GPU Programming Models
Introduction
• The Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational power and very high memory bandwidth.
• CUDA (Compute Unified Device Architecture) is the parallel computing engine in NVIDIA GPUs, accessible to software developers through standard programming languages such as C and Fortran.
Comparison between CPU and GPU
[Charts: floating-point operations per second and memory bandwidth, CPU vs. GPU]
Comparison between CPU and GPU
• The GPU devotes more transistors to data processing
• The GPU is especially well suited to computations that are data-parallel in nature
Execution Model
• Thread: the smallest execution unit
• Block: a collection of threads
• Grid: a collection of blocks; the highest level of the hierarchy
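As a rough sketch of how this hierarchy appears in code (the kernel name and launch sizes below are illustrative, not from the slides), each thread combines its block index and thread index to obtain a unique global index across the grid:

    // Each thread derives one global index from its block and thread coordinates
    __global__ void fillIndices(int *out, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = idx;
    }

    // Launch a grid of 4 blocks, each with 256 threads (1024 threads in total):
    //   fillIndices<<<4, 256>>>(d_out, 1024);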
Memory Model
[Diagram: the CUDA memory hierarchy]
CUDA API Basics
• An extension to the C programming language
• CUDA source files are compiled with the nvcc compiler
• Function and variable type qualifiers specify execution on the host or the device
• Built-in variables give the grid and block dimensions inside a kernel function
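For example, a single-source program is built from the command line like this (the file and output names are illustrative):

    nvcc square.cu -o square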
CUDA API Basics
• Function type qualifiers
  • __device__
    • Executed on the device
    • Callable from the device only
  • __global__
    • Executed on the device
    • Callable from the host only
  • __host__
    • Executed on the host
    • Callable from the host only
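A minimal sketch showing the three function qualifiers together (the function names are illustrative):

    // __device__: callable from device code only
    __device__ float square(float x) { return x * x; }

    // __global__: a kernel, launched from the host, runs on the device
    __global__ void squareAll(float *a, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            a[idx] = square(a[idx]);
    }

    // __host__: callable from host code only (also the default with no qualifier)
    __host__ void launch(float *d_a, int n)
    {
        squareAll<<<(n + 255) / 256, 256>>>(d_a, n);
    }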
CUDA API Basics
• Variable type qualifiers
  • __device__
    • Resides in global memory space
    • Has the lifetime of an application
    • Is accessible from all the threads within the grid and from the host through the runtime library
  • __constant__ (optionally used together with __device__)
    • Resides in constant memory space
    • Has the lifetime of an application
    • Is accessible from all the threads within the grid and from the host through the runtime library
  • __shared__ (optionally used together with __device__)
    • Resides in the shared memory space of a thread block
    • Has the lifetime of the block
    • Is accessible only from the threads within the block
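A minimal sketch of the three variable qualifiers in one kernel (the names are illustrative, and the kernel assumes blocks of 256 threads):

    __constant__ float coeff[16];  // constant memory; written from the host with cudaMemcpyToSymbol
    __device__ float deviceResult; // global memory; lifetime of the application

    __global__ void scaleAndPublish(const float *in)
    {
        __shared__ float partial[256];  // shared memory; one copy per block
        int tid = threadIdx.x;
        partial[tid] = in[blockIdx.x * blockDim.x + tid] * coeff[0];
        __syncthreads();
        // The first thread of block 0 publishes one value to global memory
        if (tid == 0 && blockIdx.x == 0)
            deviceResult = partial[0];
    }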
Execution Flow
[Diagram: host–device execution flow]
The typical flow: allocate device memory, copy input data from host to device, launch the kernel, copy results back from device to host, and free device memory (see Sample 2 below).
CUDA API Basics (Sample 1)

    #define N 16

    // Kernel definition: each thread computes one element of C
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Allocate the matrices in device memory
        float (*A)[N], (*B)[N], (*C)[N];
        cudaMalloc((void **)&A, N * N * sizeof(float));
        cudaMalloc((void **)&B, N * N * sizeof(float));
        cudaMalloc((void **)&C, N * N * sizeof(float));
        // ... initialize A and B from the host (e.g. with cudaMemcpy) ...
        // Kernel invocation with one block of N x N threads
        dim3 dimBlock(N, N);
        MatAdd<<<1, dimBlock>>>(A, B, C);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }
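Note that Sample 1 launches a single block, so N × N cannot exceed the per-block thread limit (512 threads on GPUs of this generation); larger problems need a grid of many blocks, as in Sample 2.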
CUDA API Basics (Sample 2)

    #include <stdlib.h>

    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] * a[idx];
    }

    int main()
    {
        const int N = 1000;
        size_t size = N * sizeof(float);
        // Allocate and initialize the array on the host
        float *a_h = (float *)malloc(size);
        for (int i = 0; i < N; i++)
            a_h[i] = (float)i;
        // Allocate array on device and copy the host data to it
        float *a_d;
        cudaMalloc((void **)&a_d, size);
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
        // Launch enough blocks of block_size threads to cover all N elements
        int block_size = 100;
        int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
        square_array<<<n_blocks, block_size>>>(a_d, N);
        // Copy the results back to the host
        cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
        free(a_h);
        cudaFree(a_d);
        return 0;
    }
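Kernel launches return no error code directly; one common pattern (a sketch using the standard runtime calls cudaGetLastError and cudaGetErrorString; the helper name is illustrative) is to query the runtime right after the launch:

    #include <stdio.h>

    // Report any error left behind by the most recent kernel launch
    void checkLastError(const char *label)
    {
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("%s failed: %s\n", label, cudaGetErrorString(err));
    }

    // Usage, immediately after a launch:
    //   square_array<<<n_blocks, block_size>>>(a_d, N);
    //   checkLastError("square_array");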
Case Study: Comparison of Data-Intensive and Compute-Intensive GPU Programs
Device Configurations
• Graphics adapter:
  • GeForce 8600M GT
  • Bus: PCI Express x16
  • Stream processors: 32
  • Core clock: 475 MHz
  • Video memory: 256 MB
  • Memory interface: 128-bit
  • Memory clock: 702 MHz (1404 MHz data rate)
• CPU:
  • Processor: Intel(R) Core(TM)2 Duo Mobile Processor T9300
  • CPU speed: 2.50 GHz
  • Bus speed: 800 MHz
  • L2 cache size: 6 MB
  • Memory: 3.00 GB
CPU & GPU Benchmark
[Chart: scaled running-time comparison]
Other General-Purpose GPU Models
• Programming models:
  • OpenCL: open industry standard by the Khronos Group
  • Microsoft DirectCompute
• GPU processing adapters:
  • AMD FireStream
Questions? Thank You