1 / 21

CS179: GPU Programming

CS179: GPU Programming. Lecture 11: Lab 5 Recitation. Today. Monte-Carlo Integration Recap on CUBLAS/CURAND Reductions Optimizing a reduction. Monte-Carlo Integration. Integration is a common tool is computational math Oftentimes used for finding areas Integration is hard on a computer

dara
Download Presentation

CS179: GPU Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS179: GPU Programming Lecture 11: Lab 5 Recitation

  2. Today • Monte-Carlo Integration • Recap on CUBLAS/CURAND • Reductions • Optimizing a reduction

  3. Monte-Carlo Integration • Integration is a common tool is computational math • Oftentimes used for finding areas • Integration is hard on a computer • Difficult to do analytically • Integration is sometimes analytically impossible • Can’t integrate exp(x2) analytically..

  4. Monte-Carlo Integration • Could use discrete Riemann sum

  5. Monte-Carlo Integration • What if there’s no predefined function? • Ex.: Area of union of shapes

  6. Monte-Carlo Integration • Solution: Monte-Carlo Integration • Saturate bounded space with sample points • Check if each point is in any shape • Area = # of points in a shape / # of points total * area of space

  7. Monte-Carlo Integration • Lab 5: Given N spheres in a bounded space, find the volume of their union • Possible to do analytically… • But very difficult! • Spheres have random positions, area of intersections, etc. • Makes good use of Monte-Carlo integration • Easy to check if a point is in any of the spheres • Easy to use CURAND to generate lots of points!

  8. Lab 5 • Remember: CURAND has host API and device API • You will use both! • volumeCUBLAS: uses host API with CUBLAS • volumeCUDA: uses device API with reduction kernel

  9. Lab 5volumeCUBLAS • Allocate necessary memory • Need memory for points • Need memory for 1 bool per point • Is point in any sphere? • Use CURAND host API to generate lots of points • Create, seed, generate, destroy • Use CheckPointsK kernel to see if each point is in a sphere • You must write this kernel! • Get total # of points in a sphere using cublasDasum • cublasDasum(int n, double *src, int stride) • Free initialized memory

  10. Lab 5volumeCUDA • Allocate memory for data • Now, we also need memory for curandStates! • Generate lots of points using CURAND device API • Call GenerateRandom3K kernel -- but you must fill in the kernel! • Check if points are in sphere • Same as volumeCUBLAS • Use reduction to sum vector • More on this later… • Free memory

  11. Lab 5Kernels • PointInSphere: Checks if a point is in a given sphere • Do this first! • Should be easy geometry • CheckPointsK: Checks if a point is in any sphere • Copy spheres to shared memory, then iterate through spheres • Remember to make sure array entry is non-NULL • GenerateRandom3K: Generates lots of float3 points • Use CURAND device API

  12. Reduction • Iteratively reduces array via reduce function (ex. addition)

  13. Reduction • Start with size = nPts / 2 • Repeatedly call reduction on block size, halving it each time • With main loop in host, device code is very simple… • Just need to add element i and element i + size for each thread • Alternatively, could build loop into device code, and call kernel only once • Once size == 1, we should have summed up all elements

  14. Reduction • Lots of optimizations to make! • Avoiding thread divergence • Contiguous memory accesses • Avoiding shared memory bank conflicts • More we haven’t discussed yet… • Unrolling loops • Templates • And more!

  15. Optimizations • Avoiding thread divergence • Avoid calls that make different calls to threads in same warp • if(threadIdx.x % 2 == 0) • Instead, group by warps • if(threadIdx.x / WARP_SIZE == 0) 1 2 3 4 1 2 3 4 0 2 0 1 3 2 7 4 3 2 7 4 0 1 10 2 7 4 10 2 7 4

  16. Optimizations • Contiguous memory accesses • Memory is linear, can’t swap dimensions • Need to address non-sequential accesses… • Shared memory banks • Also solved by sequential addressing! 1 2 3 4 1 2 3 4 3 2 7 4 3 2 7 4 10 2 7 4 10 2 7 4

  17. Optimizations • Example in reduction kernel: • Reversed loop indexing • for (inti = 1; i < max_size; i *= 2) { … } • for (inti = max_size / 2; i > 0; i /= 2) { … } 1 2 3 4 1 2 3 4 3 2 7 4 3 2 7 4 10 2 7 4 10 2 7 4

  18. Optimizations • Unrolling loops • Basic idea: when reduction size < 32, threads are wasting space due to warps • Unrolling last iteration of loop saves useless work

  19. Optimizations • Unrolling loops example: for (inti = max_size / 2; i > 0; i /= 2) { sdata[tid] += sdata[tid + i]; } for (inti = max_size / 2; i > 0; i /= 2) { sdata[tid] += sdata[tid + i]; if (tid < 32) { sdata[tid] += sdata[tid + 32]; sdata[tid] += sdata[tid + 16]; sdata[tid] += sdata[tid + 8]; // etc… } }

  20. Optimizations • Advanced unrolling: templates • Exploit compiler to handle some conditions at compile-time • Use templated functions (like in C++) • Ex.: template<unsigned intblockSize> __global__ void kernel(…) { if (blockSize >= 512) // some reduction code; else if (blockSize >= 256) // some reduction code; // etc… } • Then, call templated function on host: • kernel<blockSize><<<gridSize, blockSize>>>(…);

  21. Optimizations • Works well with a switch statement: switch (numThreads) { case 512: kernel<512><<<gridSize, blockSize>>>(…); case 256: kernel<256><<<gridSize, blockSize>>>(…); case 128: kernel<128><<<gridSize, blockSize>>>(…); // etc… }

More Related