1. Programming with CUDA, WS 08/09, Lecture 8
Thu, 18 Nov, 2008
2. Previously: CUDA Runtime Component
Common Component
Data types, math functions, timing, textures
Device Component
Math functions, warp voting, atomic functions, synchronization function, texturing
Host Component
High-level runtime API
Low-level driver API
3. Previously: CUDA Runtime Component
Host Component APIs
Mutually exclusive
Runtime API is easier to program and hides some details from the programmer
Driver API gives low-level control but is harder to program
Provide: device initialization; management of devices, streams, and events
4. Today: CUDA Runtime Component
Host Component APIs
Provide: management of memory and textures, OpenGL/Direct3D interoperability (NOT covered)
Runtime API provides: emulation mode for debugging
Driver API provides: management of contexts & modules, execution control
Final Projects
5. Memory Management: Linear Memory
CUDA Runtime API
Declare: TYPE*
Allocate: cudaMalloc, cudaMallocPitch
Copy: cudaMemcpy, cudaMemcpy2D
Free: cudaFree
CUDA Driver API
Declare: CUdeviceptr
Allocate: cuMemAlloc, cuMemAllocPitch
Copy: cuMemcpy, cuMemcpy2D
Free: cuMemFree
Host Runtime Component
6. Memory Management: Linear Memory
Pitch (stride) - expected:
// host code
float *array2D;
cudaMallocPitch((void**)&array2D, width * sizeof(float), height);
// device code
int size = width * sizeof(float);
for (int r = 0; r < height; ++r) {
    float *row = (float*)((char*)array2D + r * size);
    for (int c = 0; c < width; ++c)
        float element = row[c];
}
7. Memory Management: Linear Memory
Pitch (stride) - expected, WRONG:
// host code
float *array2D;
cudaMallocPitch((void**)&array2D, width * sizeof(float), height);
// device code
int size = width * sizeof(float);
for (int r = 0; r < height; ++r) {
    float *row = (float*)((char*)array2D + r * size);
    for (int c = 0; c < width; ++c)
        float element = row[c];
}
8. Memory Management: Linear Memory
Pitch (stride) - CORRECT:
// host code
float *array2D;
size_t pitch;  // row width in bytes, filled in by cudaMallocPitch
cudaMallocPitch((void**)&array2D, &pitch, width * sizeof(float), height);
// device code
for (int r = 0; r < height; ++r) {
    // step from row to row by pitch bytes, not by width * sizeof(float)
    float *row = (float*)((char*)array2D + r * pitch);
    for (int c = 0; c < width; ++c) {
        float element = row[c];
    }
}
9. Memory Management: Linear Memory
Pitch (stride) – why?
Allocating with the pitch functions pads each row appropriately so that transfers and copies are efficient
The width of an allocated row may therefore exceed width*sizeof(float)
The true row width, in bytes, is given by pitch
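To make the padding concrete, here is a minimal host-only sketch in plain C++ (no GPU required). It is only a model: the helper names `paddedPitch`, `writeElement`, and `readElement` are hypothetical, and the 256-byte alignment used in the usage note is an assumed value, since the real pitch is chosen by the CUDA runtime and returned by cudaMallocPitch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: round a row width in bytes up to a multiple of
// `alignment`. The real pitch is chosen by the CUDA runtime.
std::size_t paddedPitch(std::size_t rowBytes, std::size_t alignment) {
    assert(alignment > 0);
    return (rowBytes + alignment - 1) / alignment * alignment;
}

// Store element (r, c) of a pitched 2D float array. Rows start every
// `pitch` bytes, not every width * sizeof(float) bytes.
void writeElement(char* base, std::size_t pitch,
                  std::size_t r, std::size_t c, float value) {
    std::memcpy(base + r * pitch + c * sizeof(float), &value, sizeof(float));
}

// Load element (r, c) using the same pitched addressing.
float readElement(const char* base, std::size_t pitch,
                  std::size_t r, std::size_t c) {
    float value;
    std::memcpy(&value, base + r * pitch + c * sizeof(float), sizeof(float));
    return value;
}
```

For example, under the assumed 256-byte alignment a row of 100 floats (400 bytes) would get a pitch of 512 bytes, so indexing rows with width*sizeof(float) would land in the wrong place from row 1 onward.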
10. Memory Management: CUDA Arrays
CUDA Runtime API
Declare: cudaArray*
Channel: cudaChannelFormatDesc, cudaCreateChannelDesc<TYPE>
Allocate: cudaMallocArray
Copy (from linear): cudaMemcpy2DToArray
Free: cudaFreeArray
11. Memory Management: CUDA Arrays
CUDA Driver API
Declare: CUarray
Channel: CUDA_ARRAY_DESCRIPTOR object
Allocate: cuArrayCreate
Copy (from linear): CUDA_MEMCPY2D object
Free: cuArrayDestroy
12. Memory Management
Various other functions copy from:
Linear memory to CUDA arrays
Host to constant memory
See the Reference Manual
13. Texture Management
Runtime API: texture type derived from
struct textureReference {
    int normalized;
    enum cudaTextureFilterMode filterMode;
    enum cudaTextureAddressMode addressMode[3];
    struct cudaChannelFormatDesc channelDesc;
};
normalized: whether texture coordinates are normalized (0: false, otherwise true)
14. Texture Management
filterMode:
cudaFilterModePoint: no filtering; the returned value is that of the nearest texel
cudaFilterModeLinear: filters 2/4/8 neighbors for a 1D/2D/3D texture; floats only
addressMode: one entry per coordinate (x, y, z)
cudaAddressModeClamp
cudaAddressModeWrap: normalized coordinates only
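The two addressing modes and point filtering can be modelled on the host for a 1D texture. This is a rough sketch in plain C++, not the hardware's exact arithmetic; the function names `clampIndex`, `wrapIndex`, and `fetchPoint` are hypothetical, and the sketch assumes clamping is applied to an unnormalized coordinate and wrapping to a normalized one.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Clamp addressing on an unnormalized coordinate: out-of-range
// coordinates are clamped to the valid texel range [0, n - 1].
int clampIndex(float x, int n) {
    assert(n > 0);
    int i = static_cast<int>(std::floor(x));
    if (i < 0) return 0;
    if (i > n - 1) return n - 1;
    return i;
}

// Wrap addressing on a normalized coordinate: only the fractional part
// of x is used, so the texture repeats with period 1.
int wrapIndex(float x, int n) {
    assert(n > 0);
    float frac = x - std::floor(x);                  // nominally in [0, 1)
    int i = static_cast<int>(std::floor(frac * n));
    return i < n ? i : n - 1;                        // guard against rounding up to n
}

// Point filtering: return the selected texel unfiltered.
float fetchPoint(const std::vector<float>& texels, int index) {
    return texels[index];
}
```

With a 4-texel texture, a normalized coordinate of 1.25 wraps to the same texel as 0.25, while an unnormalized coordinate of 9.0 clamps to the last texel.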
15. Texture Management
channelDesc: texel type
struct cudaChannelFormatDesc {
    int x, y, z, w;
    enum cudaChannelFormatKind f;
};
x, y, z, w: number of bits per component
f: cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, cudaChannelFormatKindFloat
16. Texture Management
Runtime API: texture type derived from
struct textureReference {
    int normalized;
    enum cudaTextureFilterMode filterMode;
    enum cudaTextureAddressMode addressMode[3];
    struct cudaChannelFormatDesc channelDesc;
};
normalized, filterMode, and addressMode apply only to texture references bound to CUDA arrays
17. Texture Management
Binding a texture reference to a texture
Runtime API:
Linear memory: cudaBindTexture
CUDA Array: cudaBindTextureToArray
Driver API:
Linear memory: cuTexRefSetAddress
CUDA Array: cuTexRefSetArray
18. Runtime API: debugging using the emulation mode
No native debug support for device code
Code should be compiled either for device emulation OR execution: mixing not allowed
Device code is compiled for the host
19. Runtime API: debugging using the emulation mode
Features
Each CUDA thread is mapped to a host thread, plus one master thread
Each thread gets 256 KB of stack
20. Runtime API: debugging using the emulation mode
Advantages
Can use host debuggers
Can use otherwise disallowed functions in device code, e.g. printf
Device and host memory are both readable from either device or host
21. Runtime API: debugging using the emulation mode
Advantages
Any device-specific or host-specific function can be called from either device or host code
The runtime detects incorrect use of synchronization functions
22. Runtime API: debugging using the emulation mode
Some errors may still remain hidden
Memory access errors
Out-of-context pointer operations
Incorrect outcome of warp vote functions, as warp size is 1 in emulation mode
Results of floating-point operations often differ between host and device
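The warp-vote pitfall can be illustrated with a toy host-side model in plain C++. The function `warpVoteAny` is hypothetical and only mimics the idea of an "any" vote across a warp; it is not the device intrinsic. It assumes threads are grouped into warps in order of thread ID.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of an "any" warp vote: each thread receives true if any
// thread in its own warp has a true predicate.
std::vector<bool> warpVoteAny(const std::vector<bool>& predicate,
                              std::size_t warpSize) {
    assert(warpSize > 0);
    std::vector<bool> result(predicate.size(), false);
    for (std::size_t start = 0; start < predicate.size(); start += warpSize) {
        std::size_t end = start + warpSize;
        if (end > predicate.size()) end = predicate.size();
        bool any = false;
        for (std::size_t t = start; t < end; ++t) any = any || predicate[t];
        for (std::size_t t = start; t < end; ++t) result[t] = any;
    }
    return result;
}
```

If only thread 5 of 32 sets its predicate, a warp size of 32 reports true to every thread, whereas a warp size of 1 (as in emulation mode) reports true to thread 5 alone, so code that depends on the vote can behave differently in the two modes.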
23. Driver API: Context management
A context encapsulates all resources and actions performed within the driver API
Almost all CUDA functions operate in a context, except those dealing with
Device enumeration
Context management
24. Driver API: Context management
Each host thread can have only one current device context at a time
Each host thread maintains a stack of current contexts
cuCtxCreate()
Creates a context
Pushes it to the top of the stack
Makes it the current context
25. Driver API: Context management
cuCtxPopCurrent()
Detaches the current context from the host thread, making it "uncurrent"
The context is now floating
It can be pushed to any host thread's stack
26. Driver API: Context management
Each context has a usage count
cuCtxCreate creates a context with a usage count of 1
cuCtxAttach increments the usage count
cuCtxDetach decrements the usage count
27. Driver API: Context management
A context is destroyed when its usage count reaches 0.
cuCtxDetach, cuCtxDestroy
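The stack and usage-count rules from the context management slides can be sketched as a toy model in plain C++. Nothing here calls the driver API: `ToyContext`, `ToyThread`, `attach`, and `detach` are hypothetical stand-ins that only mirror the behaviour described on the slides.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a driver context; not the real CUcontext type.
struct ToyContext {
    int usageCount = 1;      // a newly created context starts at 1
    bool destroyed = false;  // set once the usage count reaches 0
};

// Toy stand-in for one host thread's stack of current contexts.
struct ToyThread {
    std::vector<std::shared_ptr<ToyContext>> stack;

    // Models cuCtxCreate: create, push, and make current.
    std::shared_ptr<ToyContext> create() {
        auto ctx = std::make_shared<ToyContext>();
        stack.push_back(ctx);
        return ctx;
    }
    // The current context is the top of the stack (null if empty).
    std::shared_ptr<ToyContext> current() const {
        if (stack.empty()) return nullptr;
        return stack.back();
    }
    // Models cuCtxPopCurrent: the popped context becomes floating.
    std::shared_ptr<ToyContext> popCurrent() {
        assert(!stack.empty());
        auto ctx = stack.back();
        stack.pop_back();
        return ctx;
    }
    // Models pushing a floating context onto this thread's stack.
    void pushCurrent(std::shared_ptr<ToyContext> ctx) {
        stack.push_back(std::move(ctx));
    }
};

// Model cuCtxAttach / cuCtxDetach acting on the usage count.
void attach(ToyContext& ctx) { ++ctx.usageCount; }
void detach(ToyContext& ctx) {
    assert(ctx.usageCount > 0);
    if (--ctx.usageCount == 0) ctx.destroyed = true;
}
```

In this model a context created on one thread can be popped, pushed onto another thread's stack, and is marked destroyed only after as many detaches as there were creates plus attaches.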
28. Driver API: Module management
Modules are dynamically loadable packages of device code and data output by nvcc
Similar to DLLs
29. Driver API: Module management
Dynamically loading a module and accessing its contents
CUmodule cuModule;
cuModuleLoad(&cuModule, "myModule.cubin");
CUfunction cuFunction;
cuModuleGetFunction(&cuFunction, cuModule, "myKernel");
30. Driver API: Execution control
Set kernel parameters
cuFuncSetBlockShape()
Number of threads per block for the function
How thread IDs are assigned
cuFuncSetSharedSize()
Size of shared memory
cuParam*()
Specify other parameters for the next kernel launch
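As a rough illustration of how a block shape determines thread IDs, the following host-only C++ sketch computes the linear thread ID for a thread index within a block, using the x + y*Dx + z*Dx*Dy ordering described in the CUDA Programming Guide. `BlockShape`, `threadsPerBlock`, and `linearThreadId` are hypothetical names; the struct merely mirrors the three dimensions that cuFuncSetBlockShape takes.

```cpp
#include <cassert>

// Hypothetical struct mirroring the three block dimensions passed to
// cuFuncSetBlockShape.
struct BlockShape {
    int x, y, z;
};

// Number of threads in one block of the given shape.
int threadsPerBlock(const BlockShape& s) {
    return s.x * s.y * s.z;
}

// Linear thread ID of the thread at index (tx, ty, tz) within the block:
// x varies fastest, then y, then z.
int linearThreadId(const BlockShape& s, int tx, int ty, int tz) {
    assert(tx >= 0 && tx < s.x);
    assert(ty >= 0 && ty < s.y);
    assert(tz >= 0 && tz < s.z);
    return tx + ty * s.x + tz * s.x * s.y;
}
```

For a 16 x 8 x 2 block this gives 256 threads, with thread (3, 2, 1) receiving ID 3 + 2*16 + 1*128 = 163.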
31. Driver API: Execution control
Launch kernel
cuLaunch(), cuLaunchGrid()
Example 4.5.3.5 in the Programming Guide
32. Final Projects Ideas?
DES cracker
Image editor
Resize and smooth an image
Gamut mapping?
3D Shape matching
33. All for today
Next time
Memory and Instruction optimizations
34. On to exercises!