1. CS 264, Lab 4: Texture Memory Mike Roberts
mroberts@seas.harvard.edu
2. Outline Introduction to texture memory
Introduction to HW2: image processing with CUDA
Example code walkthrough: image dilation
3. Introduction to Texture Memory
4. Introduction to Texture Memory Device memory with a hardware-managed cache
Fetching data from nearby locations can be serviced directly from the cache without going to global memory
Strongly typed at allocation time as 1D/2D/3D, 8/16/32 bit, signed/unsigned/float, with specific width/height/depth
5. Texture Memory Cache Optimized for 2D spatial locality via z-order curve
NVIDIA does not publish the cache implementation details, which leaves them free to change the design whenever they think of a better idea
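The exact cache layout is undocumented, but the z-order (Morton) curve idea can be sketched on the CPU: interleaving the bits of x and y produces a 1D index where 2D-nearby texels land close together in memory, which is what makes 2D-local fetches cache-friendly. The function name `morton2d` is mine, not a CUDA API.

```cpp
#include <cstdint>

// Interleave the bits of x and y to form a Morton (z-order) index.
// Nearby (x, y) pairs map to nearby indices, which is why a z-order
// style layout improves cache hit rates for 2D-local access patterns.
uint32_t morton2d(uint16_t x, uint16_t y) {
    uint32_t result = 0;
    for (int i = 0; i < 16; ++i) {
        result |= ((uint32_t)((x >> i) & 1u)) << (2 * i);
        result |= ((uint32_t)((y >> i) & 1u)) << (2 * i + 1);
    }
    return result;
}
```

Note how the four texels of any 2x2 block get four consecutive indices, so they typically share a cache line.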
6. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline
Get some things for free:
Linear interpolation of adjacent data values
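A rough CPU sketch of what the hardware's linear filtering mode (`cudaFilterModeLinear`) does for a 1D fetch: the fractional part of the coordinate blends the two nearest texels. The real filtering unit uses low-precision fixed-point weights, so this float version is only an approximation; `fetch_linear` is a name I made up for illustration.

```cpp
#include <cmath>

// Approximate CPU model of a 1D linearly-filtered texture fetch.
// Texel centers sit at integer + 0.5, so shift before splitting the
// coordinate into an integer index and a fractional blend weight.
float fetch_linear(const float* data, int n, float x) {
    float xb   = x - 0.5f;
    int   i    = (int)floorf(xb);
    float frac = xb - (float)i;
    // Clamp indices to the valid range (like cudaAddressModeClamp).
    int i0 = i     < 0 ? 0 : (i     >= n ? n - 1 : i);
    int i1 = i + 1 < 0 ? 0 : (i + 1 >= n ? n - 1 : i + 1);
    return (1.0f - frac) * data[i0] + frac * data[i1];
}
```

Fetching exactly between two texel centers returns their average, for free, in hardware.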
7. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline
Get some things for free:
Automatic normalization of data when you fetch it
[0,1] for unsigned values
[-1,1] for signed values
Automatic normalization of array indices to [0,1]
e.g. same code for all problem sizes
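The data-value normalization (`cudaReadModeNormalizedFloat`) can be sketched on the CPU for the 8-bit case; the helper names here are mine. Unsigned texels map to [0,1] and signed texels to [-1,1], with the signed case clamped so the most negative value still maps to exactly -1.

```cpp
// CPU sketch of cudaReadModeNormalizedFloat for 8-bit texels.
// Unsigned 8-bit maps to [0, 1]; signed 8-bit maps to [-1, 1].
float normalize_u8(unsigned char v) {
    return (float)v / 255.0f;
}
float normalize_s8(signed char v) {
    // Dividing by 127 would push -128 slightly below -1, so clamp,
    // as the hardware conversion does.
    float f = (float)v / 127.0f;
    return f < -1.0f ? -1.0f : f;
}
```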
8. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline
Get some things for free:
Automatic boundary handling
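With clamp addressing (`cudaAddressModeClamp`), out-of-range coordinates snap to the nearest edge texel, so kernels can read neighborhoods without any explicit bounds checks. A minimal CPU model of that behavior (the function name is mine):

```cpp
// CPU model of clamp-to-edge boundary handling: out-of-range
// indices are snapped to the first or last valid texel.
int clamp_index(int i, int n) {
    if (i < 0)  return 0;
    if (i >= n) return n - 1;
    return i;
}
```

The hardware also offers wrap (and, in later versions, mirror and border) modes; clamp is what the dilation walkthrough below uses.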
9. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline
Mix and match the features you want
The previously mentioned features are optional
10. Not-So-Nice Texture Memory Features Read-only
Can only update 3D textures by performing a memcpy to some rectangular region of the texture
However if your texture is 2D, you can write directly into it via a surface object
Not covered here, consult the most recent programming guide for details
11. When to Use Texture Memory If you update your data rarely but read it often...
…especially if there tends to be some kind of spatial locality to the read access pattern…
i.e. nearby threads access nearby locations in the texture
…and especially if the precise read access pattern is difficult to predict
Also, you need to use texture memory in order to visualize your data using the graphics pipeline
Not covered here, consult the SDK samples
12. When Not to Use Texture Memory If you read your data exactly once after you update it.
e.g. it would not be appropriate to store the vector y from HW1 (y = ax + y) in texture memory, since there is a predictable one-to-one relationship between reads and writes.
13. HW2 – Image Processing with CUDA Many image processing algorithms are of this form:
Iout(x,y) = f(neighborhood around Iin(x,y))
This is also a common pattern in physics simulations
HW2 will ask you to implement some kind of image processing algorithm like this efficiently in CUDA.
14. HW2 – Image Processing with CUDA
15. HW2 – Image Processing with CUDA Iout(x,y) = f(neighborhood around Iin(x,y))
Note the memory access pattern:
18. HW2 – Image Processing with CUDA We don’t want to have to read k elements from global memory for each element we write to in global memory
19. HW2 – Image Processing with CUDA We can use the hardware managed texture cache to help us
20. Example Code Walkthrough: Image Dilation
21. Example Code Walkthrough: Image Dilation This code walkthrough performs the “image dilation” operation:
Iout(x,y) = max(neighborhood around Iin(x,y))
Useful in low-level computer vision
Very easy to implement
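A CPU reference for the operation, assuming a 3x3 neighborhood with clamped borders (the walkthrough code doesn't state the neighborhood size, so 3x3 is my assumption; `dilate_cpu` here is an illustrative sketch, not the actual function from image_dilate.cu):

```cpp
// CPU reference for 3x3 image dilation: each output pixel is the
// maximum over the 3x3 neighborhood of the input pixel, with
// out-of-range neighbors clamped to the image edge.
void dilate_cpu(const unsigned char* in, unsigned char* out,
                int width, int height) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            unsigned char m = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int xx = x + dx;
                    int yy = y + dy;
                    if (xx < 0) xx = 0;
                    if (xx >= width)  xx = width - 1;
                    if (yy < 0) yy = 0;
                    if (yy >= height) yy = height - 1;
                    unsigned char v = in[yy * width + xx];
                    if (v > m) m = v;
                }
            }
            out[y * width + x] = m;
        }
    }
}
```

On a binary image of 0s and 1s, dilation grows every 1-region by one pixel in all directions.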
22. Example Code Walkthrough: Image Dilation
23. Example Code Walkthrough: Image Dilation image_dilate.cu
Lobotomized version of HW0/HW1 code, but implementing image dilation on text files of 0s and 1s
24. Example Code Walkthrough: Image Dilation To compile:
nvcc image_dilate.cu -o image_dilate -I$CUDASDK_HOME/C/common/inc -L$CUDASDK_HOME/C/lib -lcutil_x86_64
Usage:
./image_dilate --out="random_numbers.txt" --n=1024
(generates a random test file to be interpreted as a sqrt(n) by sqrt(n) grid of 1s and 0s)
Then:
./image_dilate --in="random_numbers.txt"
(performs dilation on CPU and GPU, saves output for CPU and GPU, verifies that the CPU and GPU results are the same)
25. Example Code Walkthrough: Image Dilation Texture memory is allocated by allocating what are known as CUDA Arrays.
We need to provide extra information about our data at allocation time to enable all the nice features we’re getting:
cudaChannelFormatDesc channel_desc =
cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cudaArray* a_before_dilated;
cudaMallocArray(&a_before_dilated, &channel_desc, width, height);
…
cudaFreeArray(a_before_dilated);
26. Example Code Walkthrough: Image Dilation Memcpy API to move data into the Array we allocated:
cudaMemcpyToArray(
a_before_dilated, // array pointer
0, // array offset width
0, // array offset height
h_before_dilated, // source
width*height*sizeof(uchar1), // size in bytes
cudaMemcpyHostToDevice); // type of memcpy
27. Example Code Walkthrough: Image Dilation Accessing array data is done through texture references.
Texture references are views into the underlying Array data
Declaring a texture reference:
texture<uchar1, 2, cudaReadModeElementType> my_tex_ref;
Declares the type of the underlying data, the number of dimensions, and whether or not the data is automatically converted into a normalized float
Texture references must be global variables in your .cu file
28. Example Code Walkthrough: Image Dilation Specifying what type of addressing and interpolation the texture reference should use:
my_tex_ref.addressMode[0] = cudaAddressModeClamp;
my_tex_ref.addressMode[1] = cudaAddressModeClamp;
my_tex_ref.filterMode = cudaFilterModePoint;
my_tex_ref.normalized = false;
29. Example Code Walkthrough: Image Dilation Pointing a texture reference to look at a specific Array:
cudaBindTextureToArray(my_tex_ref, a_before_dilated, channel_desc);
Now we’re ready to access our Array data in a kernel
30. Example Code Walkthrough: Image Dilation Accessing array data in a kernel:
If your indices are not normalized to [0,1], then you need to add 0.5f to each coordinate
Quirk inherited from the graphics pipeline
__global__ void dilate_gpu_texture()
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
uchar1 data_value = tex2D( my_tex_ref, x+0.5f, y+0.5f );
}
31. Example Code Walkthrough: Image Dilation Launching our kernel:
We want as many threads as we have independent tasks
Since each output pixel is independent of every other output pixel, it makes sense to have one thread per output pixel
So, we need enough threads to “cover” our output image
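The "cover the output image" step is the standard ceiling division used in the launch code below: round up so that gridDim times blockDim is at least the image size in each axis. Factored out as a helper (the name `div_up` is mine):

```cpp
// Ceiling division: smallest number of blocks of size `block`
// needed to cover n elements, i.e. div_up(n, block) * block >= n.
int div_up(int n, int block) {
    return (n + block - 1) / block;
}
```

Threads in the last partial block along each edge fall outside the image, which is one more reason clamp addressing on the texture is convenient.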
32. Example Code Walkthrough: Image Dilation Launching our kernel:
We also want as many threads as possible in each block
If one thread is stalled, we want to be able to switch to another one
33. Example Code Walkthrough: Image Dilation Launching our kernel:
Since the texture cache is optimized for 2D locality, it makes sense to arrange our thread blocks into 2D tiles
Each thread is responsible for writing one output pixel
dim3 blockDims( BLOCK_WIDTH, BLOCK_HEIGHT, 1 );
dim3 gridDims( (width+blockDims.x-1) / blockDims.x,
(height+blockDims.y-1) / blockDims.y, 1 );
dilate_gpu_texture<<<gridDims,blockDims>>>();