1 / 23

CS179: GPU Programming

CS179: GPU Programming. Lecture 7: Lab 3 Recitation. Today. Miscellaneous CUDA syntax Recap on CUDA and buffers Shared memory for an N-body simulation Flocking simulations Integrators. CUDA Kernels. Launching the kernel: kernel<<< gridDim , blockDim , sMemSize >>>( args );

pierce
Download Presentation

CS179: GPU Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS179: GPU Programming Lecture 7: Lab 3 Recitation

  2. Today • Miscellaneous CUDA syntax • Recap on CUDA and buffers • Shared memory for an N-body simulation • Flocking simulations • Integrators

  3. CUDA Kernels • Launching the kernel: • kernel<<<gridDim, blockDim, sMemSize>>>(args); • Need to know gridDim, blockDim, sMemSize (and args) • If no sMemSize set, it will default to 0

  4. CUDA Kernels • Grid and block architecture: • Grids can be 1D, 2D, or on CUDA 2.x+, 3D • Blocks can be 1D, 2D, or 3D • 1024 threads per block maximum (512 on older systems) • Dimension is only for convenience, choose what’s best for you • Most applications are fine in 1D • Image processing may lend more intuitively to a 2D block/grid • Shared memory size: • Requirement is application-dependent • Limited by CUDA version (probably 48kB for you)

  5. CUDA Functions • Three different kinds of CUDA functions: • __host__: runs on CPU (__host__ keyword is superfluous) • __device__: runs on GPU, only called from GPU • Think of these as helper functions • __global__: runs on GPU, only called from CPU • These are our kernel functions

  6. CUDA Functions • Things to be aware of: • On older CUDA, __device__ and __global__ don’t have recursion • Cannot have function pointers to __device__ functions • Restrictions on __global__ functions: • Must return void • 64kB maximum size for parameters

  7. CUDA Functions • Error checking for memory calls: • Can check status of function using cudaGetErrorString() • For lab 3, we make you a macro: #define gpuErrchk(ans) { gpuAssert((ans), (char*)__FILE__, __LINE__); } inline void gpuAssert(cudaError_t code, char* file, int line, bool abort=true) { if (code != cudaSuccess) { fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line); if (abort) exit(code); } } • Call as gpuErrchk(cudaMemcpy(…));

  8. CUDA Variables • Like functions, have a few different types: • __device__/__constant__ • Stored in global/constant memory, respectively • Accessible by all threads and blocks • Set using cudaMalloc, cudaMemset, cudaMemcpy, etc. • We can also write to __device__ memory on GPU • __shared__ • Lives in shared memory • Accessible only by threads within associated block • Requires syncthreads call to guarantee “correctness”

  9. CUDA Variables • Some CUDA vector variable types: • char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4, double2, … • Vector components available via .x, .y, .z, .w • var.x • Make vectors with make_<type>(args) • var = make_float3(1.0, 2.0, 3.0); • dim3: used for assigning block/grid size • Essentially just a uint3 • Each component of a dim3 must be at least 1!

  10. CUDA and Buffers • Need to know how to link buffers into CUDA • Nothing conceptually new, just some functions: • cudaGLRegisterBufferObject(bufferObj) • Used to first register the buffer into CUDA -- done once • cudaGLUnregisterBufferObject(bufferObj) • Once we’re done with it, we unregister -- done once • cudaGLMapBufferObject((void**)&devPtr, bufferObj); • Associates CUDA memory with the buffer -- done once per kernel call • cudaGLUnmapBufferObject(bufferObj); • Disassociates the buffer so OpenGL can read -- done once per kernel call after kernel finishes • Remember to include <cuda_gl_interop.h>

  11. N-Body Simulation • 1 thread = 1 particle • Kernel call handles one step in simulation • Calculate acceleration, then velocity, then position* *not quite, as we’ll see in a few slides • 1 block wont be enough for all of the particles • How do we share all positions? • Load as much global memory into shared memory • Calculate acceleration based on those positions • Update velocity, then load new global memory and repeat

  12. N-Body Simulation

  13. Flocking • First, a video: • https://www.youtube.com/watch?v=ctMty7av0jc

  14. Flocking • 2 main ideas (3 for bird flocking) • Separation: bugs will try and stay away from other bugs • Cohesion: bugs will try to stay near the center of the swarm • Alignment: birds will try and head towards the average heading • Not present in bug flocking algorithms

  15. Flocking • Separation: think repelling magnets • Inverse squared law works pretty well: • accel ~= 1/d2, where d is the distance between two particles

  16. Flocking • Cohesion: move towards average position • Cohesion fights separation, try and find factors that balance the two out well

  17. Flocking • Alignment: steer towards average heading of neighbors • Dependent on both positions AND velocities! • Requires you make more buffers to store velocities • A fair amount more work.. good candidate for extra credit!

  18. Integrators • After acceleration is calculated, update • Simple Euler is easiest… • But is a bad integrator! • Symplectic Euler works better: • Basic idea: update velocity based on old position, then update new position based on new velocity • new_vel = old_vel + dt * accel(old_pos) • new_pos = old_pos + dt * new_vel • More complex integrators can be even more accurate, but even harder to implement • If you have time, try implementing a different one (Runge-Kutta, maybe?) for EC

  19. Pingponging • Two sets of buffers, one for new, one for old. • Why? • Suppose one block finishes while another block is still reading • New positions will be used for old calculations! • Solution: pingponging with 2 buffers • Both buffers already made for you • Use one set for old state, one for new state, then flip when done

  20. Final Notes • When loading from shared memory, be sure not to try and access out of bounds memory • Can use %, or mod by shared memory size • Problem: % is slow! • Solution: We’re provided a WRAP macro for you: #define WRAP(x,m) ((x)<(m)?(x):((x)-(m)))

  21. Final Notes • You will also need to set initial positions and velocities • This can be done however you’d like! • Idea: have a few initial clusters with semi-random velocities • Don’t feel restricted to this!

  22. Final Notes • gluPerspective: controls the camera • Based on your simulation, current setup might not fit • Feel free to adjust! • gluPerspective(float fov, float aspect_ratio, float near, float far)

  23. Final Notes • Due Wednesday, 5PM • OH at regular posted times • Important note: this lab will NOT work remotely! • Trying to ssh and compile will be fine, running will throw crazy errors! • 2 new CUDA-capable computers coming to 104ANB soon… • For now, get work done early if you need to use minuteman

More Related