
CS179: GPU Programming



  1. CS179: GPU Programming Lecture 8: More CUDA Runtime

  2. Today • CUDA arrays for textures • CUDA runtime • Helpful CUDA functions

  3. CUDA Arrays • Recall texture memory • Used to store large data • Stored on GPU • Accessible to all blocks, threads

  4. CUDA Arrays • We used texture memory for buffers (lab 3) • Allows vertex data to remain on the GPU • How else can we access texture memory? • CUDA arrays

  5. CUDA Arrays • Why CUDA arrays over normal arrays? • Better caching, 2D caching • Spatial locality • Supports wrapping/clamping • Supports filtering

  6. CUDA Linear Textures • “Textures” that live in global memory • Usage: • Step 1: Create texture reference • texture<TYPE> tex • TYPE = float, float3, int, etc. • Step 2: Bind memory to texture reference • cudaBindTexture(offset, tex, devPtr, size); • Step 3: Read data on the device via tex1Dfetch • tex1Dfetch(tex, x); • x is the element index we want to read (not a byte offset!) • Step 4: Clean up when finished • cudaUnbindTexture(tex)
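A minimal sketch of these four steps (the kernel, variable names, and sizes here are illustrative, not from the lecture):

    texture<float> tex;  // Step 1: file-scope texture reference

    __global__ void scaleKernel(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * tex1Dfetch(tex, i);  // i is an element index
    }

    void runScale(float *devIn, float *devOut, int n) {
        // Step 2: bind n floats of linear memory to the reference
        cudaBindTexture(NULL, tex, devIn, n * sizeof(float));
        // Step 3: the kernel reads through the texture cache
        scaleKernel<<<(n + 255) / 256, 256>>>(devOut, n);
        // Step 4: unbind when finished
        cudaUnbindTexture(tex);
    }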

  7. CUDA Linear Textures • Texture reference properties: • texture<type, dim, mode> texRef • type = float, int, float3, etc. • dim = # of dimensions (1, 2, or 3) • mode = • cudaReadModeElementType: standard read • cudaReadModeNormalizedFloat: maps integers to floats, e.g. 0 -> 0.0, 255 -> 1.0
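For instance, a hypothetical reference that reads 8-bit texels back as normalized floats:

    // stored unsigned chars come back as floats in [0.0, 1.0]
    texture<unsigned char, 1, cudaReadModeNormalizedFloat> byteTex;

    __global__ void readTexel(float *out, int i) {
        out[0] = tex1Dfetch(byteTex, i);  // a stored 255 is returned as 1.0f
    }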

  8. CUDA Linear Textures • Important warning: • Textures live in a global memory space • Threads can read from and write to the texture's memory at the same time • This can cause synchronization problems! • Do not rely on thread execution order, ever

  9. CUDA Linear Textures • Other limitations: • Only 1D, which can make indexing and caching a bit less convenient • Pitch may not be ideal for a 2D array • No safe read-write access (texture reads are not coherent with writes) • Solution: CUDA arrays

  10. CUDA Arrays • Live in texture memory space • Access via texture fetches

  11. CUDA Arrays • Step 1: Create channel description • Tells us texture attributes • cudaCreateChannelDesc(int x, int y, int z, int w, enum cudaChannelFormatKind) • x, y, z, w are the number of bits per component • the kind is cudaChannelFormatKindFloat, etc.

  12. CUDA Arrays • Step 2: Allocate memory • Must be done dynamically • Use cudaMallocArray(cudaArray **array, const cudaChannelFormatDesc *desc, size_t width, size_t height) • Most global memory functions work with CUDA arrays too • cudaMemcpyToArray, etc.

  13. CUDA Arrays • Step 3: Create texture reference • texture<TYPE, dim, mode> texRef -- just as before • Parameters must match channel description where applicable • Step 4: Edit texture settings • Settings are encoded as texRef struct members

  14. CUDA Arrays • Step 5: Bind the texture reference to the array • cudaBindTextureToArray(texRef, array) • Step 6: Access the texture • Similar to before, but now we have more options: • tex1D(texRef, x) • tex2D(texRef, x, y)

  15. CUDA Arrays • Final Notes: • Coordinates can be normalized to [0, 1] in float mode • Filter modes: nearest point or linear • Tells CUDA how to blend texels • Wrap vs. clamp: • Wrap: out-of-bounds accesses wrap around to the other side (requires normalized coordinates) • Ex.: (1.5, 0.5) -> (0.5, 0.5) • Clamp: out-of-bounds accesses are set to the border value • Ex.: (1.5, 0.5) -> (1.0, 0.5)
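Pulling slides 11-15 together, a minimal 2D sketch (the array contents, sizes, and kernel are illustrative, not from the lecture):

    // Step 3: texture reference; type must match the channel description
    texture<float, 2, cudaReadModeElementType> texRef;

    __global__ void copyKernel(float *out, int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h)  // +0.5f samples the texel center
            out[y * w + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
    }

    void setupArray(const float *hostData, int w, int h) {
        // Step 1: channel description -- one 32-bit float component
        cudaChannelFormatDesc desc =
            cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
        // Step 2: allocate the CUDA array and fill it
        cudaArray *arr;
        cudaMallocArray(&arr, &desc, w, h);
        cudaMemcpyToArray(arr, 0, 0, hostData,
                          w * h * sizeof(float), cudaMemcpyHostToDevice);
        // Step 4: edit settings on the texRef struct
        texRef.normalized = false;                     // unnormalized coordinates
        texRef.filterMode = cudaFilterModeLinear;      // blend neighboring texels
        texRef.addressMode[0] = cudaAddressModeClamp;  // clamp out-of-bounds x
        texRef.addressMode[1] = cudaAddressModeClamp;  // clamp out-of-bounds y
        // Step 5: bind the reference to the array
        cudaBindTextureToArray(texRef, arr);
        // Step 6: the kernel fetches via tex2D (launch not shown)
    }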

  16. CUDA Arrays • [Figure: point sampling vs. linear sampling]

  17. CUDA Arrays • [Figure: wrap vs. clamp addressing]

  18. CUDA Runtime • Nothing new here: every cuda____ function is part of the runtime • Lots of other helpful functions • Many runtime functions are aimed at making your program robust • Check properties of the card, set up multiple GPUs, etc. • Necessary for multi-platform development!

  19. CUDA Runtime • Starting the runtime: • Simply call any cuda_____ function! • CUDA can tie up a lot of resources • Stop CUDA with cudaThreadExit() • Called automatically when the host thread exits, but you may want to call it earlier

  20. CUDA Runtime • Getting devices and properties: • cudaGetDeviceCount(int *n); • Returns # of CUDA-capable devices • Can use to check if the machine is CUDA-capable! • cudaSetDevice(int n) • Makes device n the current device • cudaGetDeviceProperties(struct cudaDeviceProp *prop, int n); • Loads data about device n into prop
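A minimal sketch of using these calls to check for a CUDA-capable device (printing device 0's name is just for illustration):

    #include <cstdio>

    int main() {
        int n;
        cudaGetDeviceCount(&n);             // # of CUDA-capable devices
        if (n == 0) {
            printf("No CUDA-capable device found\n");
            return 1;
        }
        cudaSetDevice(0);                   // make device 0 current
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // load device 0's data
        printf("Device 0: %s\n", prop.name);
        return 0;
    }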

  21. Device Properties • char name[256]: ASCII identifier of the GPU • size_t totalGlobalMem: total global memory available • size_t sharedMemPerBlock: shared memory available per block • int regsPerBlock: how many registers we have per block • int warpSize: size of our warps • size_t memPitch: maximum pitch allowed for array allocation • int maxThreadsPerBlock: maximum number of threads per block • int maxThreadsDim[3]: maximum dimensions of a block

  22. Device Properties • int maxGridSize[3]: maximum grid dimensions • size_t totalConstMem: total available constant memory • int major, int minor: major and minor compute capability versions • int clockRate: clock rate of the device in kHz • size_t textureAlignment: memory alignment required for textures • int deviceOverlap: does this device allow memory copies to overlap with kernel execution? (0 = no, 1 = yes) • int multiProcessorCount: # of multiprocessors on the device

  23. Device Properties • Uses? • Actually get values for memory, instead of guessing • Write programs that are portable across multiple systems • Can pick the best device

  24. Device Properties • Getting the best device: • Pick a metric (Ex.: most multiprocessors could be good)

    int num_devices, device;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 1) {
        int max_mp = 0, best_device = 0;
        for (device = 0; device < num_devices; device++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, device);
            int mp_count = prop.multiProcessorCount;
            if (mp_count > max_mp) {
                max_mp = mp_count;
                best_device = device;
            }
        }
        cudaSetDevice(best_device);
    }

  25. Device Properties • We can also use this to launch multiple GPUs • Each GPU must have its own host thread • Multithread on the CPU, with each thread driving a different device • Set the device on a thread using cudaSetDevice(n); -- see the sketch below
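A sketch of the one-thread-per-GPU pattern using pthreads (assumes two devices are present; the worker body is a placeholder):

    #include <pthread.h>

    void *worker(void *arg) {
        int dev = *(int *)arg;
        cudaSetDevice(dev);  // bind this host thread to device `dev`
        // ... cudaMalloc, memcpys, and kernel launches for this device ...
        return NULL;
    }

    int main() {
        pthread_t threads[2];
        int ids[2] = {0, 1};
        for (int i = 0; i < 2; i++)
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }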

  26. CUDA Runtime • Synchronization Note: • Most calls to the GPU/CUDA are asynchronous • Some are synchronous (usually things dealing with memory) • Can force synchronization: • cudaThreadSynchronize() • Blocks until the device has finished all preceding work • Good for error checking, timing, etc.
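One common pattern, sketched here with a stand-in kernel: synchronize right after a launch so any error surfaces at a known point:

    #include <cstdio>

    __global__ void dummyKernel() { }  // stand-in for real work

    void launchAndCheck() {
        dummyKernel<<<1, 1>>>();
        cudaThreadSynchronize();               // block until the GPU is done
        cudaError_t err = cudaGetLastError();  // any launch error is now visible
        if (err != cudaSuccess)
            printf("CUDA error: %s\n", cudaGetErrorString(err));
    }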

  27. CUDA Events • Great for timing! • Can place event markers in CUDA to measure time • Example code:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    // DO SOME GPU CODE HERE
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event completes
    float elapsed_time;
    cudaEventElapsedTime(&elapsed_time, start, stop);  // milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

  28. CUDA Streams • Streams manage concurrency and ordering • Ex.: call a memcpy, then kernel 1, then kernel 2, etc. -- calls within one stream execute in order • Calls in different streams are asynchronous with respect to each other! • You don't know how far each stream has progressed in the code

  29. Using Streams • Create stream • cudaStreamCreate(cudaStream_t *stream) • Copy memory using async calls: • cudaMemcpyAsync(…, cudaStream_t stream) • Pass to a kernel launch as the fourth launch configuration parameter: • kernel<<<gridDim, blockDim, sMem, stream>>> • Query whether a stream is done: • cudaStreamQuery(cudaStream_t stream) • returns cudaSuccess if the stream is done, cudaErrorNotReady otherwise • Block the process until a stream is done: • cudaStreamSynchronize(cudaStream_t stream) • Destroy stream & clean up: • cudaStreamDestroy(cudaStream_t stream)

  30. Using Streams • Example (hostPtr should be page-locked memory for the async copies to overlap):

    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                        size, cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < 2; ++i)
        myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                             inputDevPtr + i * size, size);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                        size, cudaMemcpyDeviceToHost, stream[i]);
    cudaThreadSynchronize();

  31. Next Time • Lab 4 Recitation: • 3D Textures • Pixel Buffer Objects (PBOs) • Fractals!
