1. CS 264, Lab 4: Texture Memory Mike Roberts
mroberts@seas.harvard.edu
2. Outline Introduction to texture memory
Introduction to HW2: image processing with CUDA
Example code walkthrough: image dilation
3. Introduction to Texture Memory
4. Introduction to Texture Memory Device memory with a hardware-managed cache
Fetching data from nearby locations can be serviced directly from the cache without going to global memory
Strongly typed at allocation time as 1D/2D/3D, 8/16/32 bit, signed/unsigned/float, with specific width/height/depth
5. Texture Memory Cache Optimized for 2D spatial locality via z-order curve
NVIDIA does not publish the cache implementation details, which leaves them free to change the design whenever they think of a better idea
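The exact cache layout is undocumented, but the z-order (Morton) curve idea can be sketched on the CPU: interleaving the bits of x and y produces a 1D index where 2D-nearby texels land close together in memory, which is what makes 2D-local fetches cache-friendly. The function name `morton2d` is mine, not a CUDA API.

```cpp
#include <cstdint>

// Interleave the bits of x and y to form a Morton (z-order) index.
// Nearby (x, y) pairs map to nearby indices, which is why a z-order
// style layout improves cache hit rates for 2D-local access patterns.
uint32_t morton2d(uint16_t x, uint16_t y) {
    uint32_t result = 0;
    for (int i = 0; i < 16; ++i) {
        result |= ((uint32_t)((x >> i) & 1u)) << (2 * i);
        result |= ((uint32_t)((y >> i) & 1u)) << (2 * i + 1);
    }
    return result;
}
```

Note how the four texels of any 2x2 block get four consecutive indices, so they typically share a cache line.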
6. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline
Get some things for free:
Linear interpolation of adjacent data values
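A rough CPU sketch of what the hardware's linear filtering mode (`cudaFilterModeLinear`) does for a 1D fetch: the fractional part of the coordinate blends the two nearest texels. The real filtering unit uses low-precision fixed-point weights, so this float version is only an approximation; `fetch_linear` is a name I made up for illustration.

```cpp
#include <cmath>

// Approximate CPU model of a 1D linearly-filtered texture fetch.
// Texel centers sit at integer + 0.5, so shift before splitting the
// coordinate into an integer index and a fractional blend weight.
float fetch_linear(const float* data, int n, float x) {
    float xb   = x - 0.5f;
    int   i    = (int)floorf(xb);
    float frac = xb - (float)i;
    // Clamp indices to the valid range (like cudaAddressModeClamp).
    int i0 = i     < 0 ? 0 : (i     >= n ? n - 1 : i);
    int i1 = i + 1 < 0 ? 0 : (i + 1 >= n ? n - 1 : i + 1);
    return (1.0f - frac) * data[i0] + frac * data[i1];
}
```

Fetching exactly between two texel centers returns their average, for free, in hardware.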
7. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline
Get some things for free:
Automatic normalization of data when you fetch it
[0,1] for unsigned values
[-1,1] for signed values
Automatic normalization of array indices to [0,1]
e.g. same code for all problem sizes
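The data-value normalization (`cudaReadModeNormalizedFloat`) can be sketched on the CPU for the 8-bit case; the helper names here are mine. Unsigned texels map to [0,1] and signed texels to [-1,1], with the signed case clamped so the most negative value still maps to exactly -1.

```cpp
// CPU sketch of cudaReadModeNormalizedFloat for 8-bit texels.
// Unsigned 8-bit maps to [0, 1]; signed 8-bit maps to [-1, 1].
float normalize_u8(unsigned char v) {
    return (float)v / 255.0f;
}
float normalize_s8(signed char v) {
    // Dividing by 127 would push -128 slightly below -1, so clamp,
    // as the hardware conversion does.
    float f = (float)v / 127.0f;
    return f < -1.0f ? -1.0f : f;
}
```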
8. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline
Get some things for free:
Automatic boundary handling
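With clamp addressing (`cudaAddressModeClamp`), out-of-range coordinates snap to the nearest edge texel, so kernels can read neighborhoods without any explicit bounds checks. A minimal CPU model of that behavior (the function name is mine):

```cpp
// CPU model of clamp-to-edge boundary handling: out-of-range
// indices are snapped to the first or last valid texel.
int clamp_index(int i, int n) {
    if (i < 0)  return 0;
    if (i >= n) return n - 1;
    return i;
}
```

The hardware also offers wrap (and, in later versions, mirror and border) modes; clamp is what the dilation walkthrough below uses.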
9. More Nice Texture Memory Features Inherits some nice features from the graphics pipeline
Mix and match the features you want
The previously mentioned features are optional
10. Not-So-Nice Texture Memory Features Read-only
Can only update 3D textures by performing a memcpy to some rectangular region of the texture
However if your texture is 2D, you can write directly into it via a surface object
Not covered here, consult the most recent programming guide for details
11. When to Use Texture Memory If you update your data rarely but read it often...
…especially if there tends to be some kind of spatial locality to the read access pattern…
i.e. nearby threads access nearby locations in the texture
…and especially if the precise read access pattern is difficult to predict
Also, you need to use texture memory in order to visualize your data using the graphics pipeline
Not covered here, consult the SDK samples
12. When Not to Use Texture Memory If you read your data exactly once after you update it.
e.g. it would not be appropriate to store the vector y from HW1 (y = ax + y) in texture memory, since there is a predictable one-to-one relationship between reads and writes.
13. HW2 – Image Processing with CUDA Many image processing algorithms are of this form:
Iout(x,y) = f(neighborhood around Iin(x,y))
This is also a common pattern in physics simulations
HW2 will ask you to implement some kind of image processing algorithm like this efficiently in CUDA.
14. HW2 – Image Processing with CUDA
15. HW2 – Image Processing with CUDA Iout(x,y) = f(neighborhood around Iin(x,y))
Note the memory access pattern:
18. HW2 – Image Processing with CUDA We don’t want to have to read k elements from global memory for each element we write to in global memory
19. HW2 – Image Processing with CUDA We can use the hardware managed texture cache to help us
20. Example Code Walkthrough: Image Dilation
21. Example Code Walkthrough: Image Dilation This code walkthrough performs the “image dilation” operation:
Iout(x,y) = max(neighborhood around Iin(x,y))
Useful in low-level computer vision
Very easy to implement
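A CPU reference for the operation, assuming a 3x3 neighborhood with clamped borders (the walkthrough code doesn't state the neighborhood size, so 3x3 is my assumption; `dilate_cpu` here is an illustrative sketch, not the actual function from image_dilate.cu):

```cpp
// CPU reference for 3x3 image dilation: each output pixel is the
// maximum over the 3x3 neighborhood of the input pixel, with
// out-of-range neighbors clamped to the image edge.
void dilate_cpu(const unsigned char* in, unsigned char* out,
                int width, int height) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            unsigned char m = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int xx = x + dx;
                    int yy = y + dy;
                    if (xx < 0) xx = 0;
                    if (xx >= width)  xx = width - 1;
                    if (yy < 0) yy = 0;
                    if (yy >= height) yy = height - 1;
                    unsigned char v = in[yy * width + xx];
                    if (v > m) m = v;
                }
            }
            out[y * width + x] = m;
        }
    }
}
```

On a binary image of 0s and 1s, dilation grows every 1-region by one pixel in all directions.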
22. Example Code Walkthrough: Image Dilation
23. Example Code Walkthrough: Image Dilation image_dilate.cu
Lobotomized version of HW0/HW1 code, but implementing image dilation on text files of 0s and 1s
24. Example Code Walkthrough: Image Dilation To compile:
nvcc image_dilate.cu -o image_dilate -I$CUDASDK_HOME/C/common/inc -L$CUDASDK_HOME/C/lib -lcutil_x86_64
Usage:
./image_dilate --out="random_numbers.txt" --n=1024
(generates a random test file to be interpreted as a sqrt(n) by sqrt(n) grid of 1s and 0s)
Then:
./image_dilate --in="random_numbers.txt"
(performs dilation on CPU and GPU, saves output for CPU and GPU, verifies that the CPU and GPU results are the same)
25. Example Code Walkthrough: Image Dilation Texture memory is allocated by allocating what are known as CUDA Arrays.
We need to provide extra information about our data at allocation time to enable all the nice features we’re getting:
cudaChannelFormatDesc channel_desc =
cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cudaArray* a_before_dilated;
cudaMallocArray(&a_before_dilated, &channel_desc, width, height);
…
cudaFreeArray(a_before_dilated);
26. Example Code Walkthrough: Image Dilation Memcpy API to move data into the Array we allocated:
cudaMemcpyToArray(
a_before_dilated, // array pointer
0, // array offset width
0, // array offset height
h_before_dilated, // source
width*height*sizeof(uchar1), // size in bytes
cudaMemcpyHostToDevice); // type of memcpy
27. Example Code Walkthrough: Image Dilation Accessing array data is done through texture references.
Texture references are views into the underlying Array data
Declaring a texture reference:
texture<uchar1, 2, cudaReadModeElementType> my_tex_ref;
Declares the type of the underlying data, the number of dimensions, and whether or not the data is automatically converted into a normalized float
Texture references must be global variables in your .cu file
28. Example Code Walkthrough: Image Dilation Specifying what type of addressing and interpolation the texture reference should use:
my_tex_ref.addressMode[0] = cudaAddressModeClamp;
my_tex_ref.addressMode[1] = cudaAddressModeClamp;
my_tex_ref.filterMode = cudaFilterModePoint;
my_tex_ref.normalized = false;
29. Example Code Walkthrough: Image Dilation Pointing a texture reference to look at a specific Array:
cudaBindTextureToArray(my_tex_ref, a_before_dilated, channel_desc);
Now we’re ready to access our Array data in a kernel
30. Example Code Walkthrough: Image Dilation Accessing array data in a kernel:
If your indices are not normalized to [0,1], then you need to add 0.5f to each coordinate
Quirk inherited from the graphics pipeline
__global__ void dilate_gpu_texture()
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
uchar1 data_value = tex2D( my_tex_ref, x+0.5f, y+0.5f );
}
31. Example Code Walkthrough: Image Dilation Launching our kernel:
We want as many threads as we have independent tasks
Since each output pixel is independent of every other output pixel, it makes sense to have one thread per output pixel
So, we need enough threads to “cover” our output image
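The "cover the output image" step is the standard ceiling division used in the launch code below: round up so that gridDim times blockDim is at least the image size in each axis. Factored out as a helper (the name `div_up` is mine):

```cpp
// Ceiling division: smallest number of blocks of size `block`
// needed to cover n elements, i.e. div_up(n, block) * block >= n.
int div_up(int n, int block) {
    return (n + block - 1) / block;
}
```

Threads in the last partial block along each edge fall outside the image, which is one more reason clamp addressing on the texture is convenient.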
32. Example Code Walkthrough: Image Dilation Launching our kernel:
We also want as many threads as possible in each block
If one thread is stalled, we want to be able to switch to another one
33. Example Code Walkthrough: Image Dilation Launching our kernel:
Since the texture cache is optimized for 2D locality, it makes sense to arrange our thread blocks into 2D tiles
Each thread is responsible for writing one output pixel
dim3 blockDims( BLOCK_WIDTH, BLOCK_HEIGHT, 1 );
dim3 gridDims( (width+blockDims.x-1) / blockDims.x,
(height+blockDims.y-1) / blockDims.y, 1 );
dilate_gpu_texture<<<gridDims,blockDims>>>();