
CS 264, Lab 4: Texture Memory




    1. CS 264, Lab 4: Texture Memory Mike Roberts mroberts@seas.harvard.edu

    2. Outline Introduction to texture memory Introduction to HW2: image processing with CUDA Example code walkthrough: image dilation

    3. Introduction to Texture Memory

    4. Introduction to Texture Memory Device memory with a hardware-managed cache Fetching data from nearby locations can be serviced directly from the cache without going to global memory Strongly typed at allocation time as 1D/2D/3D, 8/16/32 bit, signed/unsigned/float, with specific width/height/depth

    5. Texture Memory Cache Optimized for 2D spatial locality via a z-order (Morton) curve layout NVIDIA does not publish the details of the cache, so the implementation is free to change between hardware generations
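NVIDIA doesn't document the actual layout, but the textbook illustration of a z-order curve is Morton indexing: interleave the bits of x and y. A minimal host-side sketch in plain C++ (the function name is ours, not NVIDIA's):

```cpp
#include <cstdint>

// Interleave the low 16 bits of x and y into a Morton (z-order) index.
// Texels that are close in 2D land close together in this 1D ordering,
// which is the property a 2D-locality-optimized cache wants.
uint32_t morton2d(uint32_t x, uint32_t y)
{
    uint32_t index = 0;
    for (int bit = 0; bit < 16; ++bit)
    {
        index |= ((x >> bit) & 1u) << (2 * bit);      // even bits from x
        index |= ((y >> bit) & 1u) << (2 * bit + 1);  // odd bits from y
    }
    return index;
}
```

The four texels of any aligned 2x2 block get four consecutive indices, so a cache line covers a small 2D tile rather than a long 1D row.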

    6. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline Get some things for free: Linear interpolation of adjacent data values

    7. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline Get some things for free: Automatic normalization of data when you fetch it [0,1] for unsigned values [-1,1] for signed values Automatic normalization of array indices to [0,1] e.g. same code for all problem sizes
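A sketch of what these normalization features look like in code (the reference and kernel names are ours; cudaReadModeNormalizedFloat and the normalized flag are the real mechanisms):

```cuda
// Declared with cudaReadModeNormalizedFloat: 8-bit texels come back as
// floats in [0,1]. Setting normalized = true on the host additionally
// makes tex2D take coordinates in [0,1] instead of texel units.
texture<unsigned char, 2, cudaReadModeNormalizedFloat> norm_tex_ref;

__global__ void sample_normalized(float* d_out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Coordinates as fractions of the image size: the same indexing code
    // works for any problem size.
    float u = (x + 0.5f) / width;
    float v = (y + 0.5f) / height;
    d_out[y * width + x] = tex2D(norm_tex_ref, u, v);  // value in [0,1]
}
```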

    8. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline Get some things for free: Automatic boundary handling

    9. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline Mix and match the features you want The previously mentioned features are optional

    10. Not-So-Nice Texture Memory Features Read-only from within kernels You can only update 3D textures by performing a memcpy to some rectangular region of the texture However, if your texture is 2D, you can write directly into it via a surface object Not covered here; consult the most recent programming guide for details
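A hedged sketch of the 2D surface path mentioned above (names are ours; the underlying array must be allocated with the cudaArraySurfaceLoadStore flag):

```cuda
// Kernel side: write through a surface object. Quirk inherited from the
// graphics pipeline: the x coordinate of surf2Dwrite is in bytes.
__global__ void write_through_surface(cudaSurfaceObject_t surf,
                                      int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uchar1 value = make_uchar1(255);
    surf2Dwrite(value, surf, x * sizeof(uchar1), y);
}

// Host side (sketch):
//   cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);
//   cudaResourceDesc res = {};
//   res.resType = cudaResourceTypeArray;
//   res.res.array.array = arr;
//   cudaSurfaceObject_t surf;
//   cudaCreateSurfaceObject(&surf, &res);
```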

    11. When to Use Texture Memory If you update your data rarely but read it often... …especially if there tends to be some kind of spatial locality to the read access pattern… i.e. nearby threads access nearby locations in the texture …and especially if the precise read access pattern is difficult to predict Also, you need to use texture memory in order to visualize your data using the graphics pipeline Not covered here, consult the SDK samples

    12. When Not to Use Texture Memory If you read your data exactly once after you update it e.g. it would not be appropriate to store the vector y from HW1 (y = ax + y) in texture memory, since there is a predictable one-to-one relationship between reads and writes

    13. HW2 – Image Processing with CUDA Many image processing algorithms are of this form: Iout(x,y) = f(neighborhood around Iin(x,y)) This is also a common pattern in physics simulations HW2 will ask you to implement some kind of image processing algorithm like this efficiently in CUDA.

    14. HW2 – Image Processing with CUDA

    15. HW2 – Image Processing with CUDA Iout(x,y) = f(neighborhood around Iin(x,y)) Note the memory access pattern:


    18. HW2 – Image Processing with CUDA We don’t want to read k elements from global memory for every element we write to global memory

    19. HW2 – Image Processing with CUDA We can use the hardware managed texture cache to help us

    20. Example Code Walkthrough: Image Dilation

    21. Example Code Walkthrough: Image Dilation This code walkthrough performs the “image dilation” operation: Iout(x,y) = max(neighborhood around Iin(x,y)) Useful in low-level computer vision Very easy to implement
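Before the CUDA version, here is a plain C++ reference of the operation, in the spirit of the CPU path the walkthrough verifies against (the function name and the clamped border handling are our assumptions):

```cpp
// Reference 3x3 dilation: each output pixel is the max over its (border-
// clamped) 3x3 neighborhood, matching cudaAddressModeClamp on the GPU side.
void dilate_cpu(const unsigned char* in, unsigned char* out,
                int width, int height)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
        {
            unsigned char max_value = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                {
                    // Clamp neighbor coordinates to the image border.
                    int nx = x + dx;
                    int ny = y + dy;
                    if (nx < 0) nx = 0;
                    if (nx > width - 1) nx = width - 1;
                    if (ny < 0) ny = 0;
                    if (ny > height - 1) ny = height - 1;
                    if (in[ny * width + nx] > max_value)
                        max_value = in[ny * width + nx];
                }
            out[y * width + x] = max_value;
        }
}
```

On a binary image of 0s and 1s, this grows every region of 1s outward by one pixel.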

    22. Example Code Walkthrough: Image Dilation

    23. Example Code Walkthrough: Image Dilation image_dilate.cu Lobotomized version of HW0/HW1 code, but implementing image dilation on text files of 0s and 1s

    24. Example Code Walkthrough: Image Dilation To compile: nvcc image_dilate.cu -o image_dilate -I$CUDASDK_HOME/C/common/inc -L$CUDASDK_HOME/C/lib -lcutil_x86_64 Usage: ./image_dilate --out="random_numbers.txt" --n=1024 (generates a random test file to be interpreted as a sqrt(n) by sqrt(n) image of 1s and 0s) Then: ./image_dilate --in="random_numbers.txt" (performs dilation on CPU and GPU, saves the output for both, and verifies that the CPU and GPU results match)

    25. Example Code Walkthrough: Image Dilation Texture memory is allocated by allocating what are known as CUDA Arrays. We need to provide extra information about our data at allocation time to enable all the nice features we’re getting: cudaChannelFormatDesc channel_desc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned); cudaArray* a_before_dilated; cudaMallocArray(&a_before_dilated, &channel_desc, width, height); … cudaFreeArray(a_before_dilated);

    26. Example Code Walkthrough: Image Dilation Memcpy API to move data into the Array we allocated: cudaMemcpyToArray( a_before_dilated, // array pointer 0, // array offset width 0, // array offset height h_before_dilated, // source width*height*sizeof(uchar1), // size in bytes cudaMemcpyHostToDevice); // type of memcpy

    27. Example Code Walkthrough: Image Dilation Accessing array data is done through texture references. Texture references are views into the underlying Array data. Declaring a texture reference: texture<uchar1, 2, cudaReadModeElementType> my_tex_ref; Declares the type of the underlying data, the number of dimensions, and whether or not the data is automatically converted into a normalized float Texture references must be global variables in your .cu file

    28. Example Code Walkthrough: Image Dilation Specifying what type of addressing and interpolation the texture reference should use: my_tex_ref.addressMode[0] = cudaAddressModeClamp; my_tex_ref.addressMode[1] = cudaAddressModeClamp; my_tex_ref.filterMode = cudaFilterModePoint; my_tex_ref.normalized = false;

    29. Example Code Walkthrough: Image Dilation Pointing a texture reference to look at a specific Array: cudaBindTextureToArray(my_tex_ref, a_before_dilated, channel_desc); Now we’re ready to access our Array data in a kernel

    30. Example Code Walkthrough: Image Dilation Accessing array data in a kernel: If your indices are not normalized to [0,1], then you need to add 0.5f to each coordinate Quirk inherited from the graphics pipeline __global__ void dilate_gpu_texture() { int x = blockIdx.x * blockDim.x + threadIdx.x; int y = blockIdx.y * blockDim.y + threadIdx.y; uchar1 data_value = tex2D( my_tex_ref, x+0.5f, y+0.5f ); }
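A hedged completion of the kernel above for 3x3 dilation: textures are read-only, so the result goes to a plain global-memory buffer. The output pointer, bounds check, and neighborhood loop are our additions, not part of the slide:

```cuda
__global__ void dilate_gpu_texture(uchar1* d_out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned char max_value = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
        {
            // cudaAddressModeClamp handles out-of-range fetches at the
            // image borders, so no explicit boundary checks are needed.
            uchar1 v = tex2D(my_tex_ref, x + dx + 0.5f, y + dy + 0.5f);
            if (v.x > max_value) max_value = v.x;
        }
    d_out[y * width + x] = make_uchar1(max_value);
}
```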

    31. Example Code Walkthrough: Image Dilation Launching our kernel: We want as many threads as we have independent tasks Since each output pixel is independent of each other output pixel, it makes sense to have one thread per output pixel So, we need enough threads to “cover” our output image

    32. Example Code Walkthrough: Image Dilation Launching our kernel: We also want as many threads as possible in each block If one thread is stalled, we want to be able to switch to another one

    33. Example Code Walkthrough: Image Dilation Launching our kernel: Since the texture cache is optimized for 2D locality, it makes sense to arrange our thread blocks into 2D tiles Each thread is responsible for writing one output pixel dim3 blockDims( BLOCK_WIDTH, BLOCK_HEIGHT, 1 ); dim3 gridDims( (width+blockDims.x-1) / blockDims.x, (height+blockDims.y-1) / blockDims.y, 1 ); dilate_gpu_texture<<<gridDims,blockDims>>>();
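Putting the host-side pieces together (a sketch, assuming the kernel is extended to take an output buffer; d_after_dilated and h_after_dilated are illustrative names):

```cuda
cudaBindTextureToArray(my_tex_ref, a_before_dilated, channel_desc);

dim3 blockDims(BLOCK_WIDTH, BLOCK_HEIGHT, 1);
dim3 gridDims((width  + blockDims.x - 1) / blockDims.x,
              (height + blockDims.y - 1) / blockDims.y, 1);
dilate_gpu_texture<<<gridDims, blockDims>>>(d_after_dilated, width, height);

// Copy the result back and release the texture binding.
cudaMemcpy(h_after_dilated, d_after_dilated,
           width * height * sizeof(uchar1), cudaMemcpyDeviceToHost);
cudaUnbindTexture(my_tex_ref);
```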
