Algorithm Engineering „GPGPU“ Stefan Edelkamp
Graphics Processing Units • GPGPU = (GP)²U General Purpose Programming on the GPU • „Parallelism for the masses“ • Applications: Fourier transformation, Model Checking, Bio-Informatics, see CUDA-ZONE
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
Cluster / Multicore / GPU • Cluster system • many individual machines • each one has • one (or more) processors • internal memory • often an HDD • communication over the network • slow compared to internal memory • no shared memory [Figure: several nodes, each with CPU, RAM and HDD, connected by a switch]
Cluster / Multicore / GPU • Multicore systems • multiple CPUs • shared RAM • external memory on HDD • communication over RAM [Figure: one machine with CPU1–CPU4, RAM and an HDD]
Cluster / Multicore / GPU • System with a Graphics Processing Unit • Many (240) parallel processing units • Hierarchical memory structure • RAM • Video RAM • Shared RAM • Communication • PCI bus [Figure: CPU with RAM and hard disk drive, connected via the PCI bus to the graphics card with GPU, VRAM and SRAM]
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
Computing on the GPU • Hierarchical execution • Groups • executed sequentially • Threads • executed in parallel • lightweight (creation / switching nearly free) • one kernel function • executed by each thread
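A minimal sketch of this model (kernel name and sizes are made up for illustration): one kernel function, executed once by every thread, with the number of groups and threads fixed at launch time:

__global__ void fillOnes(int *a)                      // illustrative kernel, not from the talk
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;   // global thread id
    a[id] = 1;                                        // each thread writes exactly one element
}

// launched from the host, e.g. with 4 groups of 256 threads each (1024 threads in total):
// fillOnes<<<4, 256>>>(a_d);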
Computing on the GPU • Hierarchical memory • Video RAM • 1 GB • comparable to RAM • Shared RAM in the GPU • 16 KB • comparable to registers • parallel access by threads
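A sketch (illustrative only) of using this hierarchy: each group stages its slice from the large but slower Video RAM into the small, fast Shared RAM before working on it:

__global__ void reverseBlocks(int *in, int *out)       // assumes exactly 256 threads per group
{
    __shared__ int buf[256];                           // Shared RAM, one copy per group
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    buf[threadIdx.x] = in[id];                         // VRAM -> SRAM
    __syncthreads();                                   // wait for the whole group
    out[id] = buf[blockDim.x - 1 - threadIdx.x];       // work on the fast copy (here: reverse the slice)
}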
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
GPGPU Languages • RapidMind • Supports MultiCore, ATI, NVIDIA and Cell • C++ code analysed and compiled for the target hardware • Accelerator (Microsoft) • Library for .NET languages • BrookGPU (Stanford University) • Supports ATI, NVIDIA • Own language, a variant of ANSI C
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
CUDA • Programming language • Similar to C • File suffix .cu • Own compiler called nvcc • Can be linked to C
CUDA • Build flow: C++ code is compiled with GCC, CUDA code is compiled with nvcc, and both are linked with ld into one executable
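A possible command sequence for this flow (file names and the CUDA library path are assumptions):

g++  -c main.cpp  -o main.o                                         # C++ part, compiled with GCC
nvcc -c kernel.cu -o kernel.o                                       # CUDA part, compiled with nvcc
g++  main.o kernel.o -L/usr/local/cuda/lib64 -lcudart -o program    # link against the CUDA runtime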
CUDA • Additional variable types • dim3 • int3 • char3
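These are small structs with x, y, z components; dim3 is also used for launch dimensions. A short illustrative sketch:

dim3 grid(64, 1, 1);                   // 64 groups in the x dimension
dim3 block(256, 1, 1);                 // 256 threads per group
int3  p = make_int3(1, 2, 3);          // three ints packed into one value
char3 c = make_char3('a', 'b', 'c');   // three chars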
CUDA • Different types of functions • __global__ invoked from host • __device__ called from device • Different types of variables • __device__ located in VRAM • __shared__ located in SRAM
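An illustrative sketch with all four qualifiers together (names are made up; assumes at most 256 threads per group):

__device__ int bias;                           // variable in VRAM, set from the host, e.g. via cudaMemcpyToSymbol

__device__ int addBias(int x)                  // callable only from code running on the device
{
    return x + bias;
}

__global__ void apply(int *a, int N)           // launched from the host
{
    __shared__ int tmp[256];                   // variable in SRAM, one copy per group
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N)
    {
        tmp[threadIdx.x] = addBias(a[id]);
        a[id] = tmp[threadIdx.x];
    }
}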
CUDA • Calling the kernel function • name<<<dim3 grid, dim3 block>>>(...) • Grid dimensions (groups) • Block dimensions (threads)
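A sketch of choosing the two dimensions so that all N elements are covered (N and blocksize are assumptions):

int N = 100000;                                    // assumed problem size
int blocksize = 256;                               // threads per group
dim3 block(blocksize, 1, 1);
dim3 grid((N + blocksize - 1) / blocksize, 1, 1);  // round up: enough groups for all N elements
// name<<<grid, block>>>(...);                     // grid = number of groups, block = threads per group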
CUDA • Memory handling • cudaMalloc(...) - allocating VRAM • cudaMemcpy(...) - copying memory • cudaFree(...) - freeing VRAM
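The runtime calls take sizes in bytes; a minimal sketch (a and N stand for a host array and its length):

void copyAndFree(const int *a, int N)                             // a: N ints in RAM
{
    int *a_d;                                                     // pointer into VRAM
    cudaMalloc((void**)&a_d, N * sizeof(int));                    // allocate N ints in VRAM
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);  // copy RAM -> VRAM
    // ... launch a kernel on a_d ...
    cudaFree(a_d);                                                // release the VRAM
}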
CUDA • Distinguishing threads • gridDim – number of groups • blockDim – number of threads per group • blockIdx – id of the group (starting with 0) • threadIdx – id of the thread within its group (starting with 0) • Id = blockDim.x*blockIdx.x+threadIdx.x
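A small worked instance of the formula: with 256 threads per group, the thread with threadIdx.x = 10 in group blockIdx.x = 2 gets Id = 256*2 + 10 = 522; a bounds check guards the last, only partly used group:

int id = blockDim.x * blockIdx.x + threadIdx.x;    // e.g. 256*2 + 10 = 522
if (id < N)                                        // N need not be a multiple of the group size
    a[id] = a[id] + b;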
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
CUDA

CPU version:

void inc(int *a, int b, int N)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}

int main()
{
    ...
    inc(a, b, N);
}

GPU version:

__global__ void inc(int *a, int b, int N)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N)
        a[id] = a[id] + b;
}

int main()
{
    ...
    int *a_d;
    cudaMalloc((void**)&a_d, N * sizeof(int));                    // allocate VRAM
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);  // copy a to the GPU
    dim3 dimBlock(blocksize, 1, 1);
    dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);                        // launch the kernel
}
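The slide omits fetching the result and cleaning up; the GPU version would end with two more lines such as:

    cudaMemcpy(a, a_d, N * sizeof(int), cudaMemcpyDeviceToHost);  // copy the result back: VRAM -> RAM
    cudaFree(a_d);                                                // release the VRAM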
Real-world Example • LTL model checking • Traversing an implicit graph G=(V,E) • Vertices are called states • Edges are represented by transitions • Duplicate removal needed
Real-world Example • External model checking • Generate the graph with external BFS • Each BFS layer needs to be sorted • The GPU has proven to be fast at sorting
Real-world Example • Challenges • Millions of states in one layer • Huge state size • Fast access only in SRAM • Elements need to be moved
Real-world Example • Existing solutions: • Gpuqsort • Quicksort optimized for GPUs • Intensive swapping in VRAM • Bitonic-based sorting • Fast within subgroups • Concatenating the groups is slow
Real-world Example • Our solution • States S presorted by hash H(S) • Each bucket sorted in SRAM by one group [Figure: hash buckets in VRAM, one bucket at a time sorted in SRAM]
Real-world Example • Our solution • Order given by (H(S), S)
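A rough sketch of the idea (sizes, names and the inner sort are assumptions, not the original code): each group loads one hash bucket into SRAM, sorts it there with a simple odd-even transposition sort, and writes it back; since the buckets are already ordered by H(S), this yields the overall order (H(S), S):

#define BUCKET 256                                    // assumed bucket size = threads per group

__global__ void sortBuckets(unsigned int *keys)       // keys already grouped by hash H(S) in VRAM
{
    __shared__ unsigned int buf[BUCKET];              // this group's bucket, in SRAM
    int tid = threadIdx.x;
    buf[tid] = keys[blockIdx.x * BUCKET + tid];       // VRAM -> SRAM
    __syncthreads();

    for (int pass = 0; pass < BUCKET; pass++)         // odd-even transposition sort in SRAM
    {
        int idx = 2 * tid + (pass & 1);               // even pass: (0,1),(2,3),...  odd pass: (1,2),(3,4),...
        if (idx + 1 < BUCKET && buf[idx] > buf[idx + 1])
        {
            unsigned int t = buf[idx];
            buf[idx] = buf[idx + 1];
            buf[idx + 1] = t;
        }
        __syncthreads();
    }

    keys[blockIdx.x * BUCKET + tid] = buf[tid];       // SRAM -> VRAM
}

// launched with one group per bucket: sortBuckets<<<numBuckets, BUCKET>>>(keys_d);
// real states are much larger than one key, so this only illustrates the access pattern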
Real-world Example • Results
Programming the GPU • Questions???