The Problem • Add two arrays • A[] + B[] -> C[]
GPU Computing: Step by Step • Setup inputs on the host (CPU-accessible memory) • Allocate memory for inputs on the GPU • Copy inputs from host to GPU • Allocate memory for outputs on the host • Allocate memory for outputs on the GPU • Start GPU kernel • Copy output from GPU to host • (Copying can be asynchronous)
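The steps above can be sketched in host code roughly as follows (illustrative only: error checking is omitted, and the kernel name `addKernel` is a placeholder, not from the slides):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

__global__ void addKernel(const float *a, const float *b, float *c, int n);

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // 1. Set up inputs on the host (CPU-accessible memory)
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);   // host output buffer
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // 2./3. Allocate device memory, copy inputs host -> GPU
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);               // device output buffer
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // 4. Start the GPU kernel: 512 threads per block, enough blocks to cover n
    addKernel<<<(n + 511) / 512, 512>>>(dA, dB, dC, n);

    // 5. Copy the output from GPU back to host
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```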
The Kernel • Determine a global thread index from the block ID and the thread ID within a block:
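The standard index computation is block index times block size plus thread index. A first (naive) version of the addition kernel:

```cuda
__global__ void addKernel(const float *a, const float *b, float *c) {
    // Which block we are in, times threads per block,
    // plus our position within the block.
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx] = a[idx] + b[idx];   // note: no bounds check yet!
}
```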
Fixing the Kernel • For large arrays, our kernel doesn’t work! • Bounds-checking – be on the lookout! • Also need a way for each thread to handle more than one element when the array is larger than the grid…
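One common fix (a sketch, not necessarily the slides' exact solution) is a grid-stride loop: each thread starts at its global index, checks the bound, and then jumps ahead by the total number of threads in the grid until the array is covered:

```cuda
__global__ void addKernel(const float *a, const float *b, float *c, int n) {
    // Start at this thread's global index...
    for (unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n;                              // bounds check
         idx += blockDim.x * gridDim.x) {      // ...then stride by the whole grid
        c[idx] = a[idx] + b[idx];
    }
}
```

This works for any array size, even when n is not a multiple of the block size or exceeds the number of launched threads.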
Lab 1! • Sum of polynomials – a fun, parallelizable example! • Suppose we have a polynomial P(r) with coefficients c0, …, cn-1, given by: P(r) = c0 + c1*r + c2*r^2 + … + cn-1*r^(n-1) • We want, for inputs r0, …, rN-1, the sum: S = P(r0) + P(r1) + … + P(rN-1) • Output condenses to one number!
Calculating P(r) once • Pseudocode (one possible method):

Given r, coefficients[]
result <- 0.0
power <- 1.0
for all coefficient indices i from 0 to n-1:
    result += coefficients[i] * power
    power *= r
Accumulation • atomicAdd() function • Important for safe concurrent updates – without it, threads adding to the same accumulator can race and lose contributions!
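Putting the pieces together, a hedged sketch (the kernel name and argument names are illustrative, not the official Lab 1 skeleton): each thread evaluates P at one or more input points using the pseudocode above, then atomically adds its result to a single accumulator:

```cuda
__global__ void polyKernel(const float *r, const float *coeffs,
                           int n, int N, float *result) {
    for (int j = blockIdx.x * blockDim.x + threadIdx.x;
         j < N; j += blockDim.x * gridDim.x) {
        float acc = 0.0f, power = 1.0f;
        for (int i = 0; i < n; ++i) {       // evaluate P(r[j])
            acc += coeffs[i] * power;
            power *= r[j];
        }
        // atomicAdd serializes conflicting updates, so no contribution is lost
        atomicAdd(result, acc);
    }
}
```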
Shared Memory • Faster than global memory • Per-block – visible only to the threads of one block
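A minimal sketch of declaring and using shared memory (names and sizes are illustrative): each block gets its own on-chip buffer, and threads must synchronize before reading each other's entries:

```cuda
__global__ void stageKernel(const float *in, float *out) {
    __shared__ float buffer[512];          // one copy per block, fast on-chip
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + tid;

    buffer[tid] = in[idx];                 // stage data into shared memory
    __syncthreads();                       // wait for the whole block

    // Threads in this block can now read each other's staged entries.
    out[idx] = buffer[tid];
}
```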
Linear Accumulation • atomicAdd() has a choke point – conflicting updates are serialized, one at a time! • What if we reduced our results in parallel?
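A hedged sketch of a tree reduction in shared memory: each block sums its 512 partial results in log2(512) = 9 steps, then issues ONE atomicAdd, instead of 512 serialized ones. This assumes blockDim.x is a power of 2; the kernel name is illustrative:

```cuda
__global__ void reduceKernel(const float *partials, int N, float *result) {
    __shared__ float buf[512];
    unsigned int tid = threadIdx.x;
    unsigned int idx = blockIdx.x * blockDim.x + tid;

    buf[tid] = (idx < N) ? partials[idx] : 0.0f;
    __syncthreads();

    // Halve the number of active threads each step; each survivor adds in
    // the value held by its partner in the upper half.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)                          // one atomic per block, not per thread
        atomicAdd(result, buf[0]);
}
```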
Last notes • minuteman.cms.caltech.edu – the easiest option • CMS accounts! • Office hours • Kevin: Monday, 8-10 PM • Connor: Tuesday, 8-10 PM