Algorithm Engineering „GPGPU“ Stefan Edelkamp
Graphics Processing Units • GPGPU = (GP)²U General Purpose Programming on the GPU • „Parallelism for the masses“ • Applications: Fourier transformation, Model Checking, Bio-Informatics, see CUDA-ZONE
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
Cluster / Multicore / GPU • Cluster system • many individual machines • each one has • one (or more) processors • internal memory • often an HDD • communication over the network • slow compared to internal memory • no shared memory [Figure: several nodes, each with CPU, RAM and HDD, connected by a switch]
Cluster / Multicore / GPU • Multicore systems • multiple CPUs • shared RAM • external memory on HDD • communication over RAM [Figure: one machine with CPU1–CPU4, RAM and an HDD]
Cluster / Multicore / GPU • System with a Graphics Processing Unit • Many (240) parallel processing units • Hierarchical memory structure • RAM • Video RAM • Shared RAM • Communication • PCI bus [Figure: CPU with RAM and hard disk drive, connected via the PCI bus to the graphics card with GPU, VRAM and SRAM]
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
Computing on the GPU • Hierarchical execution • Groups • executed sequentially • Threads • executed in parallel • lightweight (creation / switching nearly free) • one kernel function • executed by each thread
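A minimal sketch of this model (kernel name and sizes are made up for illustration): one kernel function, executed once by every thread, with the number of groups and threads fixed at launch time:

__global__ void fillOnes(int *a)                      // illustrative kernel, not from the talk
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;   // global thread id
    a[id] = 1;                                        // each thread writes exactly one element
}

// launched from the host, e.g. with 4 groups of 256 threads each (1024 threads in total):
// fillOnes<<<4, 256>>>(a_d);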
Computing on the GPU • Hierarchical memory • Video RAM • 1 GB • comparable to RAM • Shared RAM in the GPU • 16 KB • comparable to registers • parallel access by threads
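A sketch (illustrative only) of using this hierarchy: each group stages its slice from the large but slower Video RAM into the small, fast Shared RAM before working on it:

__global__ void reverseBlocks(int *in, int *out)       // assumes exactly 256 threads per group
{
    __shared__ int buf[256];                           // Shared RAM, one copy per group
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    buf[threadIdx.x] = in[id];                         // VRAM -> SRAM
    __syncthreads();                                   // wait for the whole group
    out[id] = buf[blockDim.x - 1 - threadIdx.x];       // work on the fast copy (here: reverse the slice)
}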
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
GPGPU Languages • RapidMind • Supports MultiCore, ATI, NVIDIA and Cell • C++ code analysed and compiled for the target hardware • Accelerator (Microsoft) • Library for .NET languages • BrookGPU (Stanford University) • Supports ATI, NVIDIA • Own language, a variant of ANSI C
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
CUDA • Programming language • Similar to C • File suffix .cu • Own compiler called nvcc • Can be linked to C
CUDA • Build flow: C++ code is compiled with GCC, CUDA code is compiled with nvcc, and both are linked with ld into one executable
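A possible command sequence for this flow (file names and the CUDA library path are assumptions):

g++  -c main.cpp  -o main.o                                         # C++ part, compiled with GCC
nvcc -c kernel.cu -o kernel.o                                       # CUDA part, compiled with nvcc
g++  main.o kernel.o -L/usr/local/cuda/lib64 -lcudart -o program    # link against the CUDA runtime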
CUDA • Additional variable types • dim3 • int3 • char3
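These are small structs with x, y, z components; dim3 is also used for launch dimensions. A short illustrative sketch:

dim3 grid(64, 1, 1);                   // 64 groups in the x dimension
dim3 block(256, 1, 1);                 // 256 threads per group
int3  p = make_int3(1, 2, 3);          // three ints packed into one value
char3 c = make_char3('a', 'b', 'c');   // three chars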
CUDA • Different types of functions • __global__ invoked from host • __device__ called from device • Different types of variables • __device__ located in VRAM • __shared__ located in SRAM
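An illustrative sketch with all four qualifiers together (names are made up; assumes at most 256 threads per group):

__device__ int bias;                           // variable in VRAM, set from the host, e.g. via cudaMemcpyToSymbol

__device__ int addBias(int x)                  // callable only from code running on the device
{
    return x + bias;
}

__global__ void apply(int *a, int N)           // launched from the host
{
    __shared__ int tmp[256];                   // variable in SRAM, one copy per group
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N)
    {
        tmp[threadIdx.x] = addBias(a[id]);
        a[id] = tmp[threadIdx.x];
    }
}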
CUDA • Calling the kernel function • name<<<dim3 grid, dim3 block>>>(...) • Grid dimensions (groups) • Block dimensions (threads)
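A sketch of choosing the two dimensions so that all N elements are covered (N and blocksize are assumptions):

int N = 100000;                                    // assumed problem size
int blocksize = 256;                               // threads per group
dim3 block(blocksize, 1, 1);
dim3 grid((N + blocksize - 1) / blocksize, 1, 1);  // round up: enough groups for all N elements
// name<<<grid, block>>>(...);                     // grid = number of groups, block = threads per group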
CUDA • Memory handling • cudaMalloc(...) - allocating VRAM • cudaMemcpy(...) - copying memory • cudaFree(...) - freeing VRAM
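The runtime calls take sizes in bytes; a minimal sketch (a and N stand for a host array and its length):

void copyAndFree(const int *a, int N)                             // a: N ints in RAM
{
    int *a_d;                                                     // pointer into VRAM
    cudaMalloc((void**)&a_d, N * sizeof(int));                    // allocate N ints in VRAM
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);  // copy RAM -> VRAM
    // ... launch a kernel on a_d ...
    cudaFree(a_d);                                                // release the VRAM
}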
CUDA • Distinguishing threads • gridDim – number of groups • blockDim – number of threads per group • blockIdx – id of the group (starting with 0) • threadIdx – id of the thread within its group (starting with 0) • Id = blockDim.x*blockIdx.x+threadIdx.x
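A small worked instance of the formula: with 256 threads per group, the thread with threadIdx.x = 10 in group blockIdx.x = 2 gets Id = 256*2 + 10 = 522; a bounds check guards the last, only partly used group:

int id = blockDim.x * blockIdx.x + threadIdx.x;    // e.g. 256*2 + 10 = 522
if (id < N)                                        // N need not be a multiple of the group size
    a[id] = a[id] + b;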
Overview • Cluster / Multicore / GPU comparison • Computing on the GPU • GPGPU languages • CUDA • Small Example
CUDA

CPU version:

void inc(int *a, int b, int N)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}

int main()
{
    ...
    inc(a, b, N);
}

GPU version:

__global__ void inc(int *a, int b, int N)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N)
        a[id] = a[id] + b;
}

int main()
{
    ...
    int *a_d;
    cudaMalloc((void**)&a_d, N * sizeof(int));                    // allocate VRAM
    cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);  // copy a to the GPU
    dim3 dimBlock(blocksize, 1, 1);
    dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);                        // launch the kernel
}
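The slide omits fetching the result and cleaning up; the GPU version would end with two more lines such as:

    cudaMemcpy(a, a_d, N * sizeof(int), cudaMemcpyDeviceToHost);  // copy the result back: VRAM -> RAM
    cudaFree(a_d);                                                // release the VRAM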
Real-world Example • LTL model checking • Traversing an implicit graph G=(V,E) • Vertices are called states • Edges are represented by transitions • Duplicate removal needed
Real-world Example • External model checking • Generate the graph with external BFS • Each BFS layer needs to be sorted • The GPU has proven to be fast at sorting
Real-world Example • Challenges • Millions of states in one layer • Huge state size • Fast access only in SRAM • Elements need to be moved
Real-world Example • Existing solutions: • Gpuqsort • Quicksort optimized for GPUs • Intensive swapping in VRAM • Bitonic-based sorting • Fast within subgroups • Concatenating the groups is slow
Real-world Example • Our solution • States S presorted by hash H(S) • Each bucket sorted in SRAM by one group [Figure: hash buckets in VRAM, one bucket at a time sorted in SRAM]
Real-world Example • Our solution • Order given by (H(S), S)
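A rough sketch of the idea (sizes, names and the inner sort are assumptions, not the original code): each group loads one hash bucket into SRAM, sorts it there with a simple odd-even transposition sort, and writes it back; since the buckets are already ordered by H(S), this yields the overall order (H(S), S):

#define BUCKET 256                                    // assumed bucket size = threads per group

__global__ void sortBuckets(unsigned int *keys)       // keys already grouped by hash H(S) in VRAM
{
    __shared__ unsigned int buf[BUCKET];              // this group's bucket, in SRAM
    int tid = threadIdx.x;
    buf[tid] = keys[blockIdx.x * BUCKET + tid];       // VRAM -> SRAM
    __syncthreads();

    for (int pass = 0; pass < BUCKET; pass++)         // odd-even transposition sort in SRAM
    {
        int idx = 2 * tid + (pass & 1);               // even pass: (0,1),(2,3),...  odd pass: (1,2),(3,4),...
        if (idx + 1 < BUCKET && buf[idx] > buf[idx + 1])
        {
            unsigned int t = buf[idx];
            buf[idx] = buf[idx + 1];
            buf[idx + 1] = t;
        }
        __syncthreads();
    }

    keys[blockIdx.x * BUCKET + tid] = buf[tid];       // SRAM -> VRAM
}

// launched with one group per bucket: sortBuckets<<<numBuckets, BUCKET>>>(keys_d);
// real states are much larger than one key, so this only illustrates the access pattern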
Real-world Example • Results
Programming the GPU • Questions???