

    1. Programming with CUDA WS 08/09 Lecture 8 Thu, 18 Nov, 2008

    2. Previously
       CUDA Runtime Component
       - Common Component: data types, math functions, timing, textures
       - Device Component: math functions, warp voting, atomic functions, synchronization functions, texturing
       - Host Component: high-level runtime API, low-level driver API

    3. Previously
       CUDA Runtime Component – Host Component APIs
       - Mutually exclusive
       - Runtime API is easier to program, hides some details from the programmer
       - Driver API gives low-level control, harder to program
       - Both provide: device initialization, management of devices, streams and events

    4. Today
       CUDA Runtime Component – Host Component APIs
       - Provide: management of memory & textures, OpenGL/Direct3D interoperability (NOT covered)
       - Runtime API provides: emulation mode for debugging
       - Driver API provides: management of contexts & modules, execution control
       Final Projects

    5. Memory Management: Linear Memory
       CUDA Runtime API
       - Declare: TYPE*
       - Allocate: cudaMalloc, cudaMallocPitch
       - Copy: cudaMemcpy, cudaMemcpy2D
       - Free: cudaFree
       CUDA Driver API
       - Declare: CUdeviceptr
       - Allocate: cuMemAlloc, cuMemAllocPitch
       - Copy: cuMemcpy, cuMemcpy2D
       - Free: cuMemFree
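
       A minimal sketch of the runtime-API path (buffer size and names are illustrative, error checking omitted):

       // Allocate, fill, and release a 1D device buffer with the runtime API
       int n = 1024;
       float hostData[1024];
       float *devPtr;
       cudaMalloc ((void**) &devPtr, n * sizeof (float));
       cudaMemcpy (devPtr, hostData, n * sizeof (float), cudaMemcpyHostToDevice);
       // ... launch kernels that read/write devPtr ...
       cudaMemcpy (hostData, devPtr, n * sizeof (float), cudaMemcpyDeviceToHost);
       cudaFree (devPtr);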

    6. Memory Management: Linear Memory
       Pitch (stride) – expected:
       // host code
       float *array2D;
       cudaMallocPitch ((void**) &array2D, width * sizeof (float), height);
       // device code
       int size = width * sizeof (float);
       for (int r = 0; r < height; ++r) {
           float *row = (float*) ((char*) array2D + r * size);
           for (int c = 0; c < width; ++c) {
               float element = row[c];
           }
       }

    7. Memory Management: Linear Memory
       Pitch (stride) – expected, WRONG:
       // host code
       float *array2D;
       cudaMallocPitch ((void**) &array2D, width * sizeof (float), height);
       // device code
       int size = width * sizeof (float);
       for (int r = 0; r < height; ++r) {
           float *row = (float*) ((char*) array2D + r * size);
           for (int c = 0; c < width; ++c) {
               float element = row[c];
           }
       }

    8. Memory Management: Linear Memory
       Pitch (stride) – CORRECT:
       // host code
       float *array2D;
       size_t pitch;
       cudaMallocPitch ((void**) &array2D, &pitch, width * sizeof (float), height);
       // device code
       for (int r = 0; r < height; ++r) {
           float *row = (float*) ((char*) array2D + r * pitch);
           for (int c = 0; c < width; ++c) {
               float element = row[c];
           }
       }

    9. Memory Management: Linear Memory
       Pitch (stride) – why?
       - Allocation with the pitch functions pads memory appropriately for efficient transfers and copies
       - The width of the allocated rows may exceed width * sizeof (float)
       - The true row width in bytes is given by the returned pitch
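
       As a sketch of how a pitched buffer is traversed inside a kernel (kernel name and launch configuration are illustrative, not from the slides):

       // Hypothetical kernel: rows are stepped by 'pitch' bytes, not by width * sizeof (float)
       __global__ void scaleRows (float *devPtr, size_t pitch, int width, int height)
       {
           for (int r = blockIdx.y; r < height; r += gridDim.y) {
               float *row = (float*) ((char*) devPtr + r * pitch);
               for (int c = threadIdx.x; c < width; c += blockDim.x)
                   row[c] *= 2.0f;   // example operation
           }
       }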

    10. Memory Management: CUDA Arrays
        CUDA Runtime API
        - Declare: cudaArray*
        - Channel: cudaChannelFormatDesc, cudaCreateChannelDesc<TYPE>
        - Allocate: cudaMallocArray
        - Copy (from linear): cudaMemcpy2DToArray
        - Free: cudaFreeArray
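
        A hedged host-side sketch of this sequence, assuming a width x height float image already sits in pitched linear device memory (devPtr, pitch); names are illustrative:

        // Describe one 32-bit float component per texel
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

        // Allocate the CUDA array and fill it from linear device memory
        cudaArray *cuArray;
        cudaMallocArray (&cuArray, &desc, width, height);
        cudaMemcpy2DToArray (cuArray, 0, 0,                  // dst array, x/y offset
                             devPtr, pitch,                  // src pointer and pitch in bytes
                             width * sizeof (float), height,
                             cudaMemcpyDeviceToDevice);

        // ... use the array, e.g. bound to a texture reference ...

        cudaFreeArray (cuArray);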

    11. Memory Management: CUDA Arrays
        CUDA Driver API
        - Declare: CUarray
        - Channel: CUDA_ARRAY_DESCRIPTOR object
        - Allocate: cuArrayCreate
        - Copy (from linear): CUDA_MEMCPY2D object
        - Free: cuArrayDestroy
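
        The equivalent driver-API sketch (again assuming a width x height float image in pitched linear memory behind the CUdeviceptr devPtr; error checking omitted):

        // Describe a width x height array of single-component floats
        CUDA_ARRAY_DESCRIPTOR desc;
        desc.Width       = width;
        desc.Height      = height;
        desc.Format      = CU_AD_FORMAT_FLOAT;
        desc.NumChannels = 1;

        CUarray cuArray;
        cuArrayCreate (&cuArray, &desc);

        // Copy from linear device memory into the array
        CUDA_MEMCPY2D copy;
        memset (&copy, 0, sizeof (copy));
        copy.srcMemoryType = CU_MEMORYTYPE_DEVICE;
        copy.srcDevice     = devPtr;
        copy.srcPitch      = pitch;
        copy.dstMemoryType = CU_MEMORYTYPE_ARRAY;
        copy.dstArray      = cuArray;
        copy.WidthInBytes  = width * sizeof (float);
        copy.Height        = height;
        cuMemcpy2D (&copy);

        cuArrayDestroy (cuArray);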

    12. Memory Management
        Various other functions exist to copy
        - from linear memory to CUDA arrays
        - from host memory to constant memory
        See the Reference Manual

    13. Texture Management
        Run-time API: the texture type is derived from
        struct textureReference {
            int                          normalized;
            enum cudaTextureFilterMode   filterMode;
            enum cudaTextureAddressMode  addressMode[3];
            struct cudaChannelFormatDesc channelDesc;
        };
        normalized: 0 means false, any other value means true

    14. Texture Management
        filterMode:
        - cudaFilterModePoint: no filtering, the returned value is that of the nearest texel
        - cudaFilterModeLinear: filters 2/4/8 neighbors for 1D/2D/3D textures, floats only
        addressMode (per x, y, z):
        - cudaAddressModeClamp
        - cudaAddressModeWrap: normalized coordinates only

    15. Texture Management
        channelDesc: texel type
        struct cudaChannelFormatDesc {
            int x, y, z, w;
            enum cudaChannelFormatKind f;
        };
        x, y, z, w: number of bits per component
        f: cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, cudaChannelFormatKindFloat

    16. Texture Management
        Run-time API: the texture type is derived from
        struct textureReference {
            int                          normalized;
            enum cudaTextureFilterMode   filterMode;
            enum cudaTextureAddressMode  addressMode[3];
            struct cudaChannelFormatDesc channelDesc;
        };
        These attributes apply only to texture references bound to CUDA arrays
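
        As an illustrative sketch (the reference name is made up), a 2D float texture reference with these attributes could be declared and configured as follows:

        // File-scope texture reference (high-level runtime API)
        texture<float, 2, cudaReadModeElementType> texRef;

        // Host code: set the underlying textureReference fields before binding
        texRef.normalized     = 1;                     // use normalized coordinates in [0,1)
        texRef.filterMode     = cudaFilterModeLinear;  // bilinear filtering (float textures only)
        texRef.addressMode[0] = cudaAddressModeWrap;   // wrap in x
        texRef.addressMode[1] = cudaAddressModeWrap;   // wrap in y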

    17. Texture Management
        Binding a texture reference to a texture
        Runtime API:
        - Linear memory: cudaBindTexture
        - CUDA array: cudaBindTextureToArray
        Driver API:
        - Linear memory: cuTexRefSetAddress
        - CUDA array: cuTexRefSetArray
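
        Continuing the runtime-API sketch (texRef and cuArray are the illustrative names from the previous snippets):

        // Host code: bind the CUDA array to the texture reference
        cudaBindTextureToArray (texRef, cuArray);

        // Device code: fetch through the texture reference
        __global__ void sample (float *out, int width, int height)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x < width && y < height) {
                float u = (x + 0.5f) / width;    // normalized coordinates,
                float v = (y + 0.5f) / height;   // since texRef.normalized was set
                out[y * width + x] = tex2D (texRef, u, v);
            }
        }

        // Host code: when done
        cudaUnbindTexture (texRef);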

    18. Runtime API: debugging using the emulation mode
        - No native debug support for device code
        - Code is compiled either for device emulation OR for execution: mixing is not allowed
        - In emulation mode, device code is compiled for the host

    19. Runtime API: debugging using the emulation mode
        Features
        - Each CUDA thread is mapped to a host thread, plus one master thread
        - Each thread gets 256 KB of stack

    20. Runtime API: debugging using the emulation mode
        Advantages
        - Can use host debuggers
        - Can use otherwise disallowed functions in device code, e.g. printf
        - Device and host memory are both readable from either device or host code
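
        For instance (a sketch; the flag follows the CUDA 2.x toolchain, where emulation builds use nvcc -deviceemu), device code can call printf directly:

        #include <cstdio>

        // Under device emulation each CUDA thread runs as a host thread,
        // so host-only functions such as printf work inside a kernel.
        __global__ void debugKernel (const float *data, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                printf ("thread %d sees data[%d] = %f\n", i, i, data[i]);
        }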

    21. Runtime API: debugging using the emulation mode
        Advantages
        - Any device- or host-specific function can be called from either device or host code
        - The runtime detects incorrect use of synchronization functions

    22. Runtime API: debugging using the emulation mode
        Some errors may still remain hidden:
        - Memory access errors
        - Out-of-context pointer operations
        - Incorrect outcomes of warp vote functions, since the warp size is 1 in emulation mode
        - Results of floating-point operations often differ between host and device

    23. Driver API: Context management
        - A context encapsulates all resources and actions performed within the driver API
        - Almost all CUDA functions operate in a context, except those dealing with
          device enumeration and context management

    24. Driver API: Context management
        - Each host thread can have only one current device context at a time
        - Each host thread maintains a stack of current contexts
        cuCtxCreate():
        - creates a context
        - pushes it to the top of the stack
        - makes it the current context

    25. Driver API: Context management
        cuCtxPopCurrent():
        - detaches the current context from the host thread, making it "uncurrent"
        - the context is now floating
        - it can be pushed onto any host thread's stack

    26. Driver API: Context management
        Each context has a usage count
        - cuCtxCreate creates a context with a usage count of 1
        - cuCtxAttach increments the usage count
        - cuCtxDetach decrements the usage count

    27. Driver API: Context management
        A context is destroyed when its usage count reaches 0
        - cuCtxDetach, cuCtxDestroy
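
        A minimal sketch of this lifecycle on one host thread (flags and device index are illustrative, error checking omitted):

        #include <cuda.h>

        cuInit (0);                    // initialize the driver API

        CUdevice dev;
        cuDeviceGet (&dev, 0);         // first CUDA device

        CUcontext ctx;
        cuCtxCreate (&ctx, 0, dev);    // usage count 1, current for this thread

        // ... allocate memory, load modules, launch kernels in this context ...

        cuCtxDetach (ctx);             // usage count drops to 0, context is destroyed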

    28. Driver API: Module management
        - Modules are dynamically loadable packages of device code and data, output by nvcc
        - Similar to DLLs

    29. Driver API: Module management
        Dynamically loading a module and accessing its contents:
        CUmodule cuModule;
        cuModuleLoad (&cuModule, "myModule.cubin");
        CUfunction cuFunction;
        cuModuleGetFunction (&cuFunction, cuModule, "myKernel");

    30. Driver API: Execution control
        Set kernel parameters
        - cuFuncSetBlockShape(): #threads per block for the function, and how thread IDs are assigned
        - cuFuncSetSharedSize(): size of shared memory
        - cuParam*(): specify other parameters for the next kernel launch

    31. Driver API: Execution control
        Launch kernel
        - cuLaunch(), cuLaunchGrid()
        Example: Section 4.5.3.5 in the Programming Guide
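
        Putting slides 29–31 together, a hedged sketch of a launch through the driver API (cuFunction as loaded above; the kernel is assumed to take a device pointer and an element count n):

        // 256 threads per block in x, no extra shared memory
        cuFuncSetBlockShape (cuFunction, 256, 1, 1);
        cuFuncSetSharedSize (cuFunction, 0);

        // Pack kernel parameters at byte offsets
        // (each offset must be aligned to the parameter's type)
        int offset = 0;
        cuParamSetv (cuFunction, offset, &devPtr, sizeof (devPtr));   // device pointer
        offset += sizeof (devPtr);
        cuParamSeti (cuFunction, offset, n);                          // element count
        offset += sizeof (n);
        cuParamSetSize (cuFunction, offset);

        // Launch (n + 255) / 256 blocks in x
        cuLaunchGrid (cuFunction, (n + 255) / 256, 1);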

    32. Final Projects
        Ideas?
        - DES cracker
        - Image editor: resize and smooth an image; gamut mapping?
        - 3D shape matching

    33. All for today
        Next time: Memory and instruction optimizations

    34. On to exercises!
