Memory Coalescing
These notes demonstrate the effects of memory coalescing.
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 31, 2011. MemCoalescing.ppt
Memory coalescing is the combining of separate memory accesses into one combined access; the GPU does this when the locations accessed are sequential locations in global memory. Consider setting the elements of a two-dimensional array to given data values. This could be done across rows or down columns. In the following code, we demonstrate the effects of each approach.
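As a rough illustration (not part of the original slides), the following host-only C sketch prints the element indices that the 32 threads of one warp would touch under the two indexing schemes used later; the width N = 1024 and the assumption that the warp's threads share one row are illustrative only.

#include <stdio.h>

int main(void) {
    const int N = 1024;                 /* assumed array width */
    int row = 0;                        /* assume all 32 threads of the warp share row 0 */
    for (int lane = 0; lane < 32; lane++) {
        int col = lane;
        printf("lane %2d: row-wise index %7d   column-wise index %7d\n",
               lane,
               col + row * N,           /* consecutive elements -> one coalesced transaction */
               row + col * N);          /* elements N apart -> many separate transactions */
    }
    return 0;
}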
Approach
- Load numbers into a two-dimensional array.
- The flattened global thread ID of each thread is loaded into the array element it accesses, so one can tell which thread accessed which location when the array is printed out.
- Access is done across rows and also down columns, and the execution times are compared. In practice, a problem may dictate the access order.
- GPU structure: one or more 2-D blocks in a 2-D grid. Each block is a fixed 32 x 32 threads (the maximum for compute capability 2.x). A sketch of this configuration follows.
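A minimal sketch of this launch configuration, assuming a 512 x 512 array; the identifiers N, Grid, Block, dev_h and T are illustrative and not taken verbatim from the course code.

#include <cuda_runtime.h>

int main(void) {
    const int N = 512;                  // array is N x N
    dim3 Block(32, 32);                 // fixed 32 x 32 threads per block (max for compute capability 2.x)
    dim3 Grid(N / 32, N / 32);          // enough 2-D blocks to cover the array (16 x 16 here)
    // Kernels such as gpu_Comput1<<< Grid, Block >>>(dev_h, N, T) on the
    // following slides are launched with this Grid/Block configuration.
    return 0;
}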
One way
__global__ void gpu_Comput1 (int *h, int N, int T) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;     // thread ID
   int index = col + row * N;        // array index -- threads of a warp access consecutive locations (coalescing possible)
   for (int t = 0; t < T; t++)       // loop to reduce other time effects
      h[index] = threadID;           // load array with global thread ID
}
Another way
__global__ void gpu_Comput2 (int *h, int N, int T) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;     // thread ID
   int index = row + col * N;        // array index -- threads of a warp access locations N apart (not coalesced)
   for (int t = 0; t < T; t++)       // loop to reduce other time effects
      h[index] = threadID;           // load array with global thread ID
}
/* ------------------------- GPU Computation 1 ----------------------------------- */
gpu_Comput1<<< Grid, Block >>>(dev_h, N, T);           // launch kernel once outside timing

cudaEventRecord( start, 0 );
gpu_Comput1<<< Grid, Block >>>(dev_h, N, T);
cudaEventRecord( stop, 0 );                            // measure end time
cudaEventSynchronize( stop );                          // wait for event recording
cudaEventElapsedTime( &elapsed_time_ms1, start, stop );

cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);    // results to check

printf("\nComputation with memory coalescing possible\n");
printArray(h, N);
printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

Computation 2 is timed in the same way.
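For completeness, here is a hedged sketch of how the surrounding host code might time both kernels, assuming the variable names on this slide (h, dev_h, size, elapsed_time_ms1) plus an assumed elapsed_time_ms2; printArray and error checking are omitted.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void gpu_Comput1(int *h, int N, int T);    // kernels as defined on the earlier slides
__global__ void gpu_Comput2(int *h, int N, int T);

void timeBothKernels(int *h, int *dev_h, int N, int T, size_t size, dim3 Grid, dim3 Block)
{
    cudaEvent_t start, stop;
    float elapsed_time_ms1, elapsed_time_ms2;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    gpu_Comput1<<< Grid, Block >>>(dev_h, N, T);      // warm-up launch, not timed

    cudaEventRecord(start, 0);
    gpu_Comput1<<< Grid, Block >>>(dev_h, N, T);      // coalesced accesses
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms1, start, stop);

    cudaEventRecord(start, 0);
    gpu_Comput2<<< Grid, Block >>>(dev_h, N, T);      // non-coalesced accesses
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms2, start, stop);

    cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);   // results to check

    printf("Coalesced:     %f ms\n", elapsed_time_ms1);
    printf("Non-coalesced: %f ms\n", elapsed_time_ms2);
    printf("Speedup:       %f\n", elapsed_time_ms2 / elapsed_time_ms1);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}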
Some results
A grid of one block and one iteration, array 32 x 32: no speedup is recorded because the time of other operations dominates the execution time.
A grid of one block and 1,000,000 iterations, array 32 x 32: speedup = 17.16.
A grid of 16 x 16 blocks and 10,000 iterations, array 512 x 512: speedup = 12.08. Different numbers of iterations produce similar results.
Different array sizes
1000 iterations, block size 32 x 32, with the number of blocks chosen to suit the array size.