CS179: GPU Programming Lecture 8: More CUDA Runtime
Today • CUDA arrays for textures • CUDA runtime • Helpful CUDA functions
CUDA Arrays • Recall texture memory • Used to store large data • Stored on GPU • Accessible to all blocks, threads
CUDA Arrays • We used texture memory for buffers (lab 3) • Allows vertex data to remain on the GPU • How else can we access texture memory? • CUDA arrays
CUDA Arrays • Why CUDA arrays over normal arrays? • Better caching, 2D caching • Spatial locality • Supports wrapping/clamping • Supports filtering
CUDA Linear Textures • “Textures” bound to global memory • Usage: • Step 1: Create texture reference • texture&lt;TYPE&gt; tex • TYPE = float, float4, int, etc. • Step 2: Bind memory to texture reference • cudaBindTexture(offset, tex, devPtr, size); • Step 3: Read data on the device via tex1Dfetch • tex1Dfetch(tex, x); • x is the element (texel) index we want to read, not a byte offset • Step 4: Clean up when finished • cudaUnbindTexture(tex)
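The four steps above can be sketched end to end (a minimal sketch using the texture reference API from the slides; texRef, dev_data, and the kernel are illustrative names, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

// Step 1: the texture reference lives at file scope (required by this API)
texture<float> texRef;

__global__ void read_kernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);   // Step 3: fetch by element index
}

int main() {
    const int n = 1024;
    float *dev_data, *dev_out;
    cudaMalloc(&dev_data, n * sizeof(float));
    cudaMalloc(&dev_out, n * sizeof(float));

    // Step 2: bind the linear buffer to the texture reference
    cudaBindTexture(NULL, texRef, dev_data, n * sizeof(float));

    read_kernel<<<(n + 255) / 256, 256>>>(dev_out, n);

    // Step 4: unbind and free
    cudaUnbindTexture(texRef);
    cudaFree(dev_data);
    cudaFree(dev_out);
    return 0;
}
```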
CUDA Linear Textures • Texture reference template parameters: • texture&lt;Type, Dim, ReadMode&gt; texRef • Type = float, int, float4, etc. • Dim = # of dimensions (1, 2, or 3) • ReadMode = • cudaReadModeElementType: standard read • cudaReadModeNormalizedFloat: maps integers to floats (e.g., 0 -&gt; 0.0, 255 -&gt; 1.0 for unsigned char)
CUDA Linear Textures • Important warning: • Textures are bound to global memory • Threads can read from and write to the underlying buffer at the same time • This can cause synchronization problems! • Never rely on thread execution order
CUDA Linear Textures • Other limitations: • Only 1D, which can make indexing and caching less convenient • Pitch may not be ideal for a 2D array • Not read-write • Solution: CUDA arrays
CUDA Arrays • Live in texture memory space • Access via texture fetches
CUDA Arrays • Step 1: Create channel description • Describes the texture's element format • cudaCreateChannelDesc(int x, int y, int z, int w, enum cudaChannelFormatKind f) • x, y, z, w are the number of bits per component • f is cudaChannelFormatKindFloat, cudaChannelFormatKindSigned, etc.
CUDA Arrays • Step 2: Allocate memory • Must be done dynamically • Use cudaMallocArray(cudaArray **array, const cudaChannelFormatDesc *desc, size_t width, size_t height) • Most global memory functions have CUDA array counterparts • cudaMemcpyToArray, etc.
CUDA Arrays • Step 3: Create texture reference • texture&lt;Type, Dim, ReadMode&gt; texRef -- just as before • Parameters must match the channel description where applicable • Step 4: Edit texture settings • Settings are members of the texRef struct (addressMode, filterMode, normalized)
CUDA Arrays • Step 5: Bind the texture reference to the array • cudaBindTextureToArray(texRef, array) • Step 6: Access the texture • Similar to before, but now with more options: • tex1D(texRef, x) • tex2D(texRef, x, y)
CUDA Arrays • Final Notes: • Coordinates can be normalized to [0, 1] if in float mode • Filter modes: nearest point or linear • Tells CUDA how to blend texture • Wrap vs. clamp: • Wrap: out of bounds accesses wrap around to other side • Ex.: (1.5, 0.5) -> (0.5, 0.5) • Clamp: out of bounds accesses set to border value • Ex.: (1.5, 0.5) -> (1.0, 0.5)
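The six steps can be sketched together (a minimal sketch using the texture reference API shown in the slides; texRef2D, h_data, and the kernel are illustrative names, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

// Step 3: texture reference at file scope; its settings are edited in host code
texture<float, 2, cudaReadModeElementType> texRef2D;

__global__ void sample_kernel(float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D(texRef2D, x, y);  // Step 6: texture fetch
}

int main() {
    const int w = 64, h = 64;
    static float h_data[w * h];  // assume this holds the texture data

    // Step 1: channel description -- one 32-bit float component
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

    // Step 2: allocate the CUDA array and copy data into it
    cudaArray *array;
    cudaMallocArray(&array, &desc, w, h);
    cudaMemcpyToArray(array, 0, 0, h_data, sizeof(h_data),
                      cudaMemcpyHostToDevice);

    // Step 4: texture settings -- clamp out-of-bounds, nearest-point sampling
    texRef2D.addressMode[0] = cudaAddressModeClamp;
    texRef2D.addressMode[1] = cudaAddressModeClamp;
    texRef2D.filterMode = cudaFilterModePoint;
    texRef2D.normalized = false;

    // Step 5: bind the texture reference to the array
    cudaBindTextureToArray(texRef2D, array);

    float *d_out;
    cudaMalloc(&d_out, w * h * sizeof(float));
    dim3 block(16, 16), grid(w / 16, h / 16);
    sample_kernel<<<grid, block>>>(d_out, w, h);

    cudaUnbindTexture(texRef2D);
    cudaFreeArray(array);
    cudaFree(d_out);
    return 0;
}
```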
CUDA Arrays • (figure: point sampling vs. linear sampling)
CUDA Arrays • (figure: wrap vs. clamp addressing)
CUDA Runtime • Nothing new: every cuda____ function is part of the runtime • Lots of other helpful functions • Many runtime functions help make your program robust • Check properties of the card, set up multiple GPUs, etc. • Necessary for multi-platform development!
CUDA Runtime • Starting the runtime: • Simply call any cuda_____ function! • The runtime can hold a lot of resources • Release them with cudaThreadExit() • Called automatically on process exit, but you may want to call it earlier
CUDA Runtime • Getting devices and properties: • cudaGetDeviceCount(int *n); • Returns # of CUDA-capable devices • Can use to check if the machine is CUDA-capable! • cudaSetDevice(int n) • Sets device n as the currently used device • cudaGetDeviceProperties(struct cudaDeviceProp *prop, int n); • Loads data from device n into prop
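A minimal sketch of these three calls used together (error handling omitted):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    int n;
    cudaGetDeviceCount(&n);
    if (n == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }
    cudaSetDevice(0);  // use the first device

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d multiprocessors, %zu bytes global memory\n",
           prop.name, prop.multiProcessorCount, prop.totalGlobalMem);
    return 0;
}
```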
Device Properties • char name[256]: ASCII identifier of the GPU • size_t totalGlobalMem: total global memory available • size_t sharedMemPerBlock: shared memory available per block • int regsPerBlock: number of registers available per block • int warpSize: size of a warp • size_t memPitch: maximum pitch allowed for array allocation • int maxThreadsPerBlock: maximum number of threads per block • int maxThreadsDim[3]: maximum dimensions of a block
Device Properties • int maxGridSize[3]: maximum grid dimensions • size_t totalConstMem: total available constant memory • int major, int minor: major and minor compute capability versions • int clockRate: clock rate of the device in kHz • size_t textureAlignment: memory alignment required for textures • int deviceOverlap: can this device copy memory while a kernel runs? (0 = no, 1 = yes) • int multiProcessorCount: # of multiprocessors on the device
Device Properties • Uses? • Query actual memory sizes instead of guessing • Write programs portable across multiple systems • Pick the best available device
Device Properties • Getting the best device: • Pick a metric (ex.: most multiprocessors could be good)

int num_devices, device;
cudaGetDeviceCount(&num_devices);
if (num_devices > 1) {
    int max_mp = 0, best_device = 0;
    for (device = 0; device < num_devices; device++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        int mp_count = prop.multiProcessorCount;
        if (mp_count > max_mp) {
            max_mp = mp_count;
            best_device = device;
        }
    }
    cudaSetDevice(best_device);
}
Device Properties • We can also use this to drive multiple GPUs • Each GPU must have its own host thread • Multithread on the CPU; each thread calls a different device • Set the device for a thread using cudaSetDevice(n);
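The one-host-thread-per-GPU pattern can be sketched like this (a sketch using C++11 std::thread; work_kernel and the buffer size are illustrative):

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void work_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // placeholder work
}

// Each host thread binds itself to one device, then does all work there
void run_on_device(int dev, int n) {
    cudaSetDevice(dev);        // CUDA calls in this thread now target dev
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    work_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaThreadSynchronize();   // wait for this device to finish
    cudaFree(d_data);
}

int main() {
    int num_devices;
    cudaGetDeviceCount(&num_devices);

    std::vector<std::thread> threads;
    for (int dev = 0; dev < num_devices; dev++)
        threads.emplace_back(run_on_device, dev, 1 << 20);
    for (auto &t : threads)
        t.join();
    return 0;
}
```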
CUDA Runtime • Synchronization note: • Most calls to the GPU/CUDA are asynchronous • Some are synchronous (usually calls dealing with memory) • Can force synchronization: • cudaThreadSynchronize() • Blocks until the device has finished all preceding work • Good for error checking, timing, etc.
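One common use is error checking: a kernel launch returns before the kernel runs, so launch errors only surface after synchronizing (a sketch; my_kernel is a placeholder name):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void my_kernel(float *out) {
    out[threadIdx.x] = threadIdx.x;   // placeholder kernel
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));

    my_kernel<<<1, 256>>>(d_out);     // returns immediately (asynchronous)
    cudaThreadSynchronize();          // block until the kernel finishes

    // Any error from the kernel is visible only after synchronization
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```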
CUDA Events • Great for timing! • Can place event markers in the stream to measure elapsed time • Example code:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// DO SOME GPU CODE HERE
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);
// elapsed_time is in milliseconds
CUDA Streams • Streams manage concurrency and ordering • Calls within one stream execute in order (ex.: memcpy, then kernel 1, then kernel 2, etc.) • Calls in different streams are asynchronous with respect to each other! • Don't assume anything about where another stream is in its work
Using Streams • Create a stream: • cudaStreamCreate(cudaStream_t *stream) • Copy memory using async calls: • cudaMemcpyAsync(…, cudaStream_t stream) • Pass to a kernel launch as the fourth launch parameter: • kernel&lt;&lt;&lt;gridDim, blockDim, sMem, stream&gt;&gt;&gt; • Query if a stream is done: • cudaStreamQuery(cudaStream_t stream) • returns cudaSuccess if the stream is done, cudaErrorNotReady otherwise • Block the process until a stream is done: • cudaStreamSynchronize(cudaStream_t stream) • Destroy the stream &amp; clean up: • cudaStreamDestroy(cudaStream_t stream)
Using Streams • Example:

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    myKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
cudaThreadSynchronize();
Next Time • Lab 4 Recitation: • 3D Textures • Pixel Buffer Objects (PBOs) • Fractals!