Programming Massively Parallel Graphics Processors
Andreas Moshovos, Winter 2009
Goals:
• Graphics processors
  • Learn how to program GPUs
  • Learn how to get performance out of GPUs
  • Understand GPU architecture and limitations
• CUDA: Compute Unified Device Architecture / NVIDIA
How:
• Weekly assignments for the first few weeks
• A large team project
Ideal scenario:
• Non-ECE/non-CS people team up with ECE/CS people and attack an interesting problem
What is a GPU?
• A specialized processor for graphics
• Embarrassingly parallel: lots of "read data, calculate, write"
• Used to be fixed-function; becoming more programmable
What is CUDA?
• A C extension for programming NVIDIA GPUs
• Straightforward to learn
• The challenge is in getting performance
Sequential Execution Model

    int a[N];                 // N is large
    for (i = 0; i < N; i++)
        a[i] = a[i] * fade;

• Flow of control / thread: one instruction at a time
• Optimizations are possible at the machine level
Data Parallel Execution Model / SIMD

    int a[N];                 // N is large
    for all elements do in parallel
        a[index] = a[index] * fade;
Single Program Multiple Data / SPMD

    int a[N];                 // N is large
    for all elements do in parallel
        if (a[i] > threshold) a[i] *= fade;
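In CUDA terms, this SPMD loop becomes a kernel like the sketch below; the kernel name and parameters are illustrative assumptions, not from the slide:

    // Every thread runs the same program, but the branch outcome
    // can differ per element (SPMD).
    __global__ void fade_spmd (float *a, float threshold, float fade, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N && a[i] > threshold)   // divergent branch: taken by only some threads
            a[i] *= fade;
    }

Threads of the same warp that disagree on the branch are serialized by the hardware, which is why divergence matters for performance.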
Programmer’s view – Typical System
• If you care about performance a lot
[Figure: CPU with registers and caches, connected to main memory; 12.8 GB/sec – 31.92 GB/sec, 8 B per transfer]
Programmer’s view with GPU
[Figure: CPU connected to its memory at 12.8 GB/sec – 31.92 GB/sec (8 B per transfer); CPU connected to the GPU at 3 GB/sec; GPU connected to GPU memory at 141 GB/sec; GPU memory is 1 GB on our systems]
Programmer’s view with GPU
• Over time, a CUDA program alternates between CPU and GPU:
  1. Copy to GPU memory
  2. Launch GPU threads
  3. Synchronize with GPU
  4. Copy from GPU memory
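A minimal host-side sketch of these four steps, using the runtime calls the later slides introduce; the kernel name and sizes here are placeholders, not from the slides:

    // Step 1: copy input to GPU memory
    cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
    // Step 2: launch GPU threads (kernel launches are asynchronous)
    kernel <<< n_blocks, block_size >>> (d_a, N);
    // Step 3: synchronize with the GPU, i.e., wait for the threads to finish
    cudaThreadSynchronize ();
    // Step 4: copy results back from GPU memory
    cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);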
But what about performance?
• Focus on PEAK performance first: what the manufacturer guarantees you’ll never exceed
• Two aspects:
  • Data access rate capability: bandwidth
  • Data processing capability: how many ops per second
Data Processing Capability
• Focus on floating-point data: GFLOPS (billion floating-point operations per second)
• Caveat: FP operations can differ in cost, but today things are not as bad as before
• High-end CPU today: 3.4 GHz x 8 FP ops/cycle ≈ 27 GFLOPS (assumes SSE)
• High-end GPU today (GTX 280): 933.1 GFLOPS, about 34x the CPU
Data Access Capability
• High-end CPU today: 31.92 GB/sec (Nehalem) – 12.8 GB/sec (Harpertown); bus width 64 bits
• GPU (GTX 280): 141.7 GB/sec; bus width 512 bits
• Roughly a 4.4x – 11x advantage
What does the programmer need to know?
• Many details about the architecture
• Fortunately, most of it is simple
My first CUDA Program

    // GPU code
    __global__ void arradd (float *a, float f, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) a[i] = a[i] + f;
    }

    // CPU code (N, SIZE, n_blocks, block_size assumed defined elsewhere)
    int main()
    {
        float h_a[N];
        float *d_a;

        cudaMalloc ((void **) &d_a, SIZE);
        cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);

        arradd <<< n_blocks, block_size >>> (d_a, 10.0, N);

        cudaThreadSynchronize ();
        cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
        CUDA_SAFE_CALL (cudaFree (d_a));
    }
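The slide leaves n_blocks and block_size unspecified; a common choice (an assumption for illustration, not from the slide) is a fixed block size with enough blocks to cover all N elements:

    int block_size = 256;                               // threads per block
    int n_blocks   = (N + block_size - 1) / block_size; // ceiling division: covers all N elements

The if (i < N) guard in the kernel then discards the few surplus threads in the last block.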
Threads / Blocks / Grid
• Block size = 12, #blocks = 5
• Block 0: a[0] … a[11]
• …
• Block 4: a[48] … a[59]
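To make the mapping concrete, here is the index arithmetic every thread performs, annotated with these assumed numbers:

    // With blockDim.x = 12, thread 3 of block 4 computes
    // i = 4 * 12 + 3 = 51, so it operates on a[51];
    // block 4 as a whole covers a[48] .. a[59].
    int i = blockIdx.x * blockDim.x + threadIdx.x;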
Memory Hierarchy
• Registers / local memory: anything declared inside the kernel
• __shared__ int …: per-block shared memory
• Global (device) memory: __device__ int … declared at file scope
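A short sketch of where each kind of declaration lands; the names are illustrative assumptions:

    __device__ int g_flag;                  // file scope: global (device) memory

    __global__ void hierarchy_sketch (float *data)   // assumes blockDim.x == 128
    {
        __shared__ float tile[128];         // per-block shared memory
        float tmp = data[threadIdx.x];      // kernel-local variable: register (or local memory)
        tile[threadIdx.x] = tmp;
        __syncthreads ();                   // make the shared-memory writes visible block-wide
        data[threadIdx.x] = tile[127 - threadIdx.x];
    }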
Performance: Programmer’s view
[Figure omitted; credit: Mark Silberstein, Technion]
CUDA keywords, etc.
• Declspecs: global, device, shared, local, constant
• Keywords: threadIdx, blockIdx
• Intrinsics: __syncthreads
• Runtime API: memory, symbol, execution management; function launch

    __device__ float filter[N];

    __global__ void convolve (float *image)
    {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads ();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    float *myimage;
    cudaMalloc ((void **) &myimage, bytes);

    cudaThreadSynchronize ();

    // 100 blocks, 10 threads per block
    convolve <<< 100, 10 >>> (myimage);
Floating-Point Caveats
• Single-precision floating-point support is not 100% IEEE 754
  • No denormals, fixed rounding modes
  • Must check that the SNR remains acceptable
• But there are lots of SP FP units
• The GTX 280 supports double precision, but there are very few DP units
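One way to act on the "check your SNR" advice, sketched as plain host-side C (an illustrative assumption, not course code): compare the GPU's single-precision output against a double-precision CPU reference.

    #include <math.h>

    // Worst-case relative error of the GPU result (h_a, float)
    // against a double-precision CPU reference (ref).
    double max_rel_error (const float *h_a, const double *ref, int n)
    {
        double worst = 0.0;
        for (int i = 0; i < n; i++) {
            double err = fabs ((double) h_a[i] - ref[i]) / fmax (fabs (ref[i]), 1e-30);
            if (err > worst) worst = err;
        }
        return worst;
    }

Errors near 1e-7 are ordinary single-precision rounding; anything much larger suggests the kernel's arithmetic, or the caveats above, need a closer look.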
Development Process
• Course specific:
  • Get an account on the eecg network
  • Fill in your name / ID / current e-mail on the list
  • Wait until confirmation is received
• Machines: ug51.eecg through ug75.eecg.utoronto.ca
  • SF2204
  • Keycode: _______
Development Process
• Once you are on a ugxx machine:
  • source /cad1/CUDA/cuda.csh
  • That will create an NVIDIA_CUDA_SDK directory
  • Go in and type "make dbg=1"
  • This builds several examples under bin/linux/debug
  • The source code is in the projects subdirectory
• We’ll post a handout soon on the course website
Development Process
• Create an xxxx.cu file
• Compile it with nvcc (a Makefile is provided by the SDK)
• nvcc is a compiler driver: it splits out the device code, compiles it, and passes the host code on to a regular C/C++ compiler
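Outside the SDK Makefile, the minimal compile-and-run cycle looks like this (the file name arradd.cu is a placeholder):

    nvcc -o arradd arradd.cu    # compiles both device and host code into one binary
    ./arradd                    # run on a machine with a CUDA-capable GPU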
So, why would Parallel Processing work?
• Parallel processing and programming have been around for a while; the golden age was the ’80s
• It didn’t work then:
  • Programming is hard
  • Hardware was expensive
  • Single-processor performance was doubling every 18 months
• Why would it work now?
  • The cost and single-processor trends have reversed
  • Not a done deal at all: programming is still hard
Course Staff
• Andreas Moshovos
  • EA310, 416-946-7373
  • moshovos@eecg.toronto.edu
  • www.eecg.toronto.edu/~moshovos
• TA: Hassan Shojania
  • hassan@eecg.toronto
Course Structure
• Until the end of February: weekly assignments
  • CUDA programming
  • GTX 280 architecture
  • CUDA performance
  • Floating point
• March: project proposal and work
  • Case studies
  • General parallel-programming guidelines
• April:
  • Project presentations
  • Make-up lectures?