B. Wilkinson, Nov 10, 2014, MemCoalescing

Memory Coalescing These notes will demonstrate the effects of memory coalescing Use of matrix transpose to improve matrix multiplication performance B. Wilkinson, Nov 10, 2014, MemCoalescing.ppt

Memory coalescing is combining separate memory accesses into one combined access – it is done by the GPU when the locations are sequential locations in global memory banks. Consider setting the elements of two-dimensional array to given data values. This could be done across rows or down columns In the following code, we will demonstrate the effects of each approach

Experiment • Load thread ID (flattened global threadID) into array element so one can tell which thread accesses which location when array printed out. Do it a large number of time times. This simulates a calculation. • For comparison purposes, access done: • Access done across rows • Access done across column • Time of execution compared. In practice, a problem may dictate the access order • GPU structure -- one or more 2-D blocks in a 2-D grid. • Each block, 2-D 32x32 threads fixed (1024, max. compute cap. 2/3)

One way __global__ void gpu_Comput1 (int *h, int N, int T) { int col = threadIdx.x + blockDim.x * blockIdx.x; int row = threadIdx.y + blockDim.y * blockIdx.y; int threadID = col + row * N; // thread ID int index = col + row * N; // array index for (int t = 0; t < T; t++) // loop to reduce other time effects h[index] = threadID; // load array with global thread ID }

Another way __global__ void gpu_Comput2 (int *h, int N, int T) { int col = threadIdx.x + blockDim.x * blockIdx.x; int row = threadIdx.y + blockDim.y * blockIdx.y; int threadID = col + row * N; // thread ID int index = row + col * N; // array index for (int t = 0; t < T; t++) // loop to reduce other time effects h[index] = threadID; // load array with global thread ID }

/* ------------------------- GPU Computation 1 -----------------------------------*/ gpu_Comput1<<< Grid, Block >>>(dev_h, N, T); // launch once kernel outside timing cudaEventRecord( start, 0 ); gpu_Comput1<<< Grid, Block >>>(dev_h, N, T); cudaEventRecord( stop, 0 ); // measure end time cudaEventSynchronize( stop ); // wait for event recording cudaEventElapsedTime( &elapsed_time_ms1, start, stop ); cudaMemcpy(h,dev_h, size ,cudaMemcpyDeviceToHost); //Results to check printf("\nComputation with memory coalescing possible\n"); printArray(h,N); printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1); Computation 2 similar

Some results A grid of one block and one iteration Array 32x32 No speedup recorded because time of other operations dominate execution time

A grid of one block and 1000000 iterations Array 32 x 32 Speedup = 17.16

Repeat just to check results are consistent

A grid of 16 x 16 blocks and 10000 iterations Array 512x512 Speedup = 12.08 Different numbers of iterations produce similar results

Different Array Sizes 1000 iterations. Block size 32 x 32. Number of blocks to suit array size

Effects of memory access in matrix multiplication One thread is responsible for computing one result Cij and needs access a row of A and a column of B: Thread Each thread access one row of A and one column of B N2 row/column combinations, N2 threads

Seen another way, in first time period, each thread accesses the first element in a row of A: Thread 0, … Thread I, … Thread N-1, … Consider those threads that access different rows Given the row-major order of how A is stored, those threads will locations are not in consecutive locations – Bad cannot do memory coalescing. Question: how many threads access the same location?

Next, each thread accesses the first element in a column of B: Thread 0, … Thread I, … Thread N-1, … Consider those threads that access different columns Given the row-major order of how A is stored, those threads will locations are in consecutive locations. – Good! Can do memory coalescing. Question: how many threads access the same location?

How can we get better memory accesses and memory coalcesing? • Transpose one array • Copy all rows of A to columns and all columns of A to rows before access A and modify program according. Personally I have not found this to help because of the overhead of doing the transpose sequentially

Sequential code for a transpose using same array: for (i=0; i < N; i++) for (j=0; j < i; j++) { temp = B[i][j]; B[i][j] = b[j][i]; B[j][i] = temp; } (In my code, I use separate arrays) Could be done on host prior to copying to device. How would the code look like if on device?

/* ------ COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY-----*/ transposeArray(a, a_T, N); // transpose array cudaEventRecord(start, 0); // here time measured before // host-device copy, but not transpose // cudaEventSynchronize(start); // Needed? cudaMemcpy(dev_a, a_T , size ,cudaMemcpyHostToDevice); // cpy transp. A cudaMemcpy(dev_b, b , size ,cudaMemcpyHostToDevice); // copy B gpu_matrixmult_T<<<Grid,Block>>>(dev_a,dev_b,dev_c,N); cudaMemcpy(c_T,dev_c, size ,cudaMemcpyDeviceToHost); cudaEventRecord(stop, 0); // measure end time cudaEventSynchronize(stop); cudaEventElapsedTime(&elapsed_time_ms2, start, stop ); printf("Time to calculate results on GPU with transposed array: %f ms.\n", elapsed_time_ms2); // print out execution time

Some results 8 x 8 array 1 block of 8 x 8 threads Speedup = 1.62 over not transposing array

Some results 32 x 32 array 1 block of 32 x 32 threads Speedup = 1.17 over not transposing array

Some results 256 x 256 array 8 blocks of 32 x 32 threads Speedup = 0.89!! over not transposing array

Some results 1024 x 1024 array 32 blocks of 32 x 32 threads Speedup = 0.93!! over not transposing array

Questions

B. Wilkinson, Nov 10, 2014, MemCoalescing

B. Wilkinson, Nov 10, 2014, MemCoalescing

Presentation Transcript

Dr. Shelley Wilkinson 18th June 2014

Thursday , Nov. 10

10 Nov

Grid Computing, B. Wilkinson, 2005

Nov. 10, 2004

Grid Computing, B. Wilkinson, 2005

Grid Computing, B. Wilkinson, 2005

CEL, Nov 10

ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 31, 2011 MemCoalescing

Nov 10, 2011

Nov 12, 2014

Nov. 10 , 2014 8 th Grade Science

Nov. 10, 2014 7 th Grade Science

Grid Computing, B. Wilkinson, 2005

Nov 19, 2014

ITCS 4/5145GPU Programming, UNC-Charlotte, B. Wilkinson, Nov 4, 2013 CUDAProgModel

DO NOW – Nov. 10, 2014

ITCS 5/4145 Parallel computing, B. Wilkinson, April 17, 2014. CUDAMultiDimBlocks

Grid Computing, B. Wilkinson, 2005

B. Wilkinson, Nov 19, 2013, GPUMemories