Using Shared Memory

These notes demonstrate the improvements achieved by using shared memory, with code and results from runs on coit-grid06.uncc.edu.

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 23, 2013. SharedMem.ppt
Approach

Objective: as with the memory-coalescing demo, load numbers into a two-dimensional array. The flattened global thread ID of each thread is stored in its array element, so one can tell which thread accessed which location; the array is then printed out.

For comparison, the accesses are done three ways:
1. Using global memory only
2. Using shared memory, with a local 2-D array in each block copied back to global memory
3. As 2, but with separate pointer arithmetic for speed

GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 2-D 32 x 32 threads (the maximum for compute capability 2.x). A sketch of the grid/block setup follows.
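The kernel launches on the later slides use Grid and Block, but their declarations are not shown there. A minimal sketch of that setup, assuming N is a multiple of BlockSize:

#define BlockSize 32                       // 32 x 32 threads per block (max. for compute cap. 2.x)

dim3 Block(BlockSize, BlockSize);          // 2-D block
dim3 Grid(N / BlockSize, N / BlockSize);   // 2-D grid covering the N x N array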
1. Using global memory only

__global__ void gpu_WithoutSharedMem (int *h, int N, int T) {
   // Array loaded with the global thread ID that accesses that location
   // Coalescing should be possible
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;
   int index = col + row * N;      // same value as threadID; kept separate for clarity
   for (int t = 0; t < T; t++)     // repeated T times to reduce other time effects
      h[index] = threadID;         // load array with global thread ID
}
2. Using shared memory

__global__ void gpu_SharedMem (int *h, int N, int T) {
   __shared__ int h_local[BlockSize][BlockSize];      // shared memory, one array per block
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;
   int index = col + row * N;
   // h_local[threadIdx.y][threadIdx.x] = h[index];   // not necessary here, but might be
   //                                                 // in other calculations (see sketch below)
   for (int t = 0; t < T; t++)
      h_local[threadIdx.y][threadIdx.x] = threadID;   // load shared array
   h[index] = h_local[threadIdx.y][threadIdx.x];      // copy back to global memory
}
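The commented-out copy above becomes necessary when threads read elements written by other threads. A hypothetical variant, not on the original slides, where each thread reads its left neighbour's element, so the tile must first be staged in shared memory and the block synchronized:

__global__ void gpu_SharedMem_neighbour (int *h, int N) {
   __shared__ int h_local[BlockSize][BlockSize];
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int index = col + row * N;
   h_local[threadIdx.y][threadIdx.x] = h[index];   // stage tile in shared memory
   __syncthreads();                                // wait until the whole tile is loaded
   if (threadIdx.x > 0)                            // read an element written by another thread
      h[index] = h_local[threadIdx.y][threadIdx.x - 1];
}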
3. Using shared memory with index calculation outside loop

__global__ void gpu_SharedMem_ptr (int *h, int N, int T) {
   __shared__ int h_local[BlockSize][BlockSize];
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;
   int index = col + row * N;
   int *ptr = &h_local[0][0];      // index calculated once, outside the loop
   int index_local = threadIdx.x + threadIdx.y * BlockSize;   // row stride of the local
                                                              // array is BlockSize, not N
   for (int t = 0; t < T; t++)
      ptr[index_local] = threadID;
   h[index] = h_local[threadIdx.y][threadIdx.x];              // copy back to global memory
}
Main program

…
/*------------------------- Allocate memory -------------------------*/
int size = N * N * sizeof(int);      // number of bytes in the array in total
int *h, *dev_h;                      // pointers to arrays holding numbers on host and device
h = (int*) malloc(size);             // array on host
cudaMalloc((void**)&dev_h, size);    // allocate device memory

/*------------- GPU computation without shared memory ---------------*/
gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);   // warm-up launch, outside the timing
cudaEventRecord( start, 0 );
gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &elapsed_time_ms1, start, stop );

cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);      // get results to check
printf("\nComputation without shared memory\n");
printArray(h, N);
printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

Computations 2 and 3 are timed similarly. (A sketch of the printArray helper follows.)
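printArray is called above but its definition is not on the slides; a minimal sketch (requires <stdio.h>), assuming the row-major layout used by the kernels (index = col + row * N):

void printArray (int *h, int N) {              // hypothetical helper; definition not shown on slides
   for (int row = 0; row < N; row++) {
      for (int col = 0; col < N; col++)
         printf("%6d ", h[col + row * N]);     // print the global thread ID stored there
      printf("\n");
   }
}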
Some results

A grid of one block, one iteration, array 32 x 32: shared-memory speedup = 1.18
A grid of one block, 1,000,000 iterations, array 32 x 32: shared-memory speedup = 1.24
A grid of 16 x 16 blocks, 10,000 iterations, array 512 x 512: speedup = 1.74
Different numbers of iterations produce similar results.
Different array sizes

1000 iterations. Block size 32 x 32; the number of blocks is chosen to suit the array size.