AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS
Wenjing Ma, Gagan Agrawal
The Ohio State University
GPGPU
• General Purpose Programming on GPUs (accelerators)
• High performance/price ratio
• High-level language support
• CUDA
• Performance vs Productivity
• Hard to program
• Memory hierarchy to manage
• ...
Get High Performance from GPU, and Make the Programming Simple!
• Automatic code generation
• Device memory access is expensive
• Using shared memory
• Texture and constant memory
• Coalescing device memory access
• ...
FEATURES OF SHARED MEMORY
• Small, fast, like a cache
• 16 KB on each multiprocessor (no more than 48 KB even on the latest GPU)
• Read-write
• Software controlled
• __shared__ float data[n][n];
• Allocating shared memory:
• Similar to register allocation
Problem Formulation for Shared Memory Arrangement
• Consider the variables and basic blocks in a function
• A variable can be an array element, a whole array, or a section of an array
• Each variable can have several live ranges in the function
• Access type of each live range: read, write, read-write, temp
• Determine in which basic block a variable is allocated to shared memory
• assign_point[i][k]: variable i, basic block k (1 if variable i is assigned to shared memory at basic block k, 0 otherwise)
Integer Programming Problem
• Integer Linear Programming
• Objective function: maximize $z = c^{T} x$
• Constraints
• Solution: values of x
• Special case of linear programming
• All the unknown variables are integers (0-1 in our case)
• Solvable for problems of reasonable size
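To make the 0-1 formulation concrete, here is a minimal sketch in C that solves a tiny binary program by brute-force enumeration. It is not the solver used in the talk (the slides do not name one); the function name `solve_binary_ilp` and the "A x <= b" constraint layout are assumptions chosen for illustration, and enumeration is only practical for a handful of variables.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Illustrative sketch only (hypothetical helper, not the authors' solver):
 * maximize c^T x subject to A x <= b with every x[j] in {0,1}.
 * A is stored row-major with m rows and n columns.
 * Returns the best objective value (INT_MIN if no assignment is feasible)
 * and writes the best assignment into best_x. */
int solve_binary_ilp(size_t n, size_t m, const int *c, const int *A,
                     const int *b, int *best_x)
{
    assert(n <= 20); /* 2^n enumeration: tiny problems only */
    int best = INT_MIN;
    for (unsigned long mask = 0; mask < (1UL << n); mask++) {
        int feasible = 1;
        for (size_t row = 0; row < m && feasible; row++) {
            int lhs = 0;
            for (size_t j = 0; j < n; j++)
                if ((mask >> j) & 1UL)
                    lhs += A[row * n + j];
            if (lhs > b[row])
                feasible = 0; /* this constraint is violated */
        }
        if (!feasible)
            continue;
        int z = 0;
        for (size_t j = 0; j < n; j++)
            if ((mask >> j) & 1UL)
                z += c[j];
        if (z > best) { /* keep the first assignment reaching a new maximum */
            best = z;
            for (size_t j = 0; j < n; j++)
                best_x[j] = (int)((mask >> j) & 1UL);
        }
    }
    return best;
}
```

For example, with benefits c = {3, 4, 5}, a single capacity row A = {2, 3, 4} and b = {5}, the sketch selects the first two variables. Real instances would go to a dedicated integer programming solver instead.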
Integer Programming for Shared Memory Arrangement
• Objective Function
• Maximize shared memory usage
• Minimize data transfer between levels of the memory hierarchy
Integer Programming for Shared Memory Arrangement (cont'd)
• Objective Function
An Example to Show size_alloc
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......
Integer Programming for Shared Memory Arrangement (cont'd)
• Constraints
• Total allocation does not exceed the shared memory limit at any time
• At most one assign_point is 1 in each live range
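The two constraints above can be checked mechanically for a candidate arrangement. The sketch below is one possible reading, under assumptions that are not stated on the slides: each variable has a single live range (the slides allow several), and a variable assigned at basic block k stays in shared memory until its live range ends. The struct `var_info`, the function `arrangement_is_feasible`, and the fixed sizes `NUM_VARS` and `NUM_BLOCKS` are hypothetical names.

```c
#include <assert.h>

#define NUM_VARS 3
#define NUM_BLOCKS 4

/* Hypothetical description of one variable with a single live range. */
struct var_info {
    int size_bytes; /* shared memory needed if the variable is allocated */
    int live_start; /* first basic block of the live range */
    int live_end;   /* last basic block of the live range (inclusive) */
};

/* Returns 1 if the arrangement satisfies both constraints, 0 otherwise.
 * Assumption: a variable assigned at block k occupies shared memory
 * from block k to the end of its live range. */
int arrangement_is_feasible(const struct var_info *vars,
                            int assign_point[NUM_VARS][NUM_BLOCKS],
                            int limit_bytes)
{
    int used[NUM_BLOCKS] = {0};
    for (int i = 0; i < NUM_VARS; i++) {
        assert(vars[i].live_start >= 0 &&
               vars[i].live_start <= vars[i].live_end &&
               vars[i].live_end < NUM_BLOCKS);
        int points = 0;
        for (int k = 0; k < NUM_BLOCKS; k++) {
            if (!assign_point[i][k])
                continue;
            if (k < vars[i].live_start || k > vars[i].live_end)
                return 0; /* assigned outside the live range */
            points++;
            for (int blk = k; blk <= vars[i].live_end; blk++)
                used[blk] += vars[i].size_bytes;
        }
        if (points > 1)
            return 0; /* more than one assign_point in the live range */
    }
    for (int k = 0; k < NUM_BLOCKS; k++)
        if (used[k] > limit_bytes)
            return 0; /* shared memory limit exceeded at block k */
    return 1;
}
```

In the integer program itself these checks become linear inequalities over the 0-1 assign_point variables; the function only validates a given solution.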
Integer Programming for Shared Memory Arrangement (cont'd)
• Obtaining parameters
• Using LLVM compiler framework
• Pass 1: get access features
• Read, write, read-write, temp
• Pass 2: get live ranges, loop information, indices, and all other parameters
Code Generation
• Follows the shared memory arrangement obtained from the integer programming model
• Built on the framework from previous work
• Moves data to cover the gaps caused by data evicted from shared memory
An Example
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......
Integer Programming Solver
• Array sizes: A: n*r, B: m*r, C: r
• Parameters: n: 2048, m: 3, r: 3, NUM_THREADS: 256
    assign_point[0][1]=1;
    assign_point[1][0]=1;
    assign_point[2][0]=1;
    /* all other elements of assign_point are 0 */
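An Example (cont'd)
Generated Code:
    __shared__ float s_B[m][r];
    __shared__ float s_C[r*NUM_THREADS];
    __shared__ float s_A[r*NUM_THREADS];
    /* copy B (an m x r array) into shared memory */
    for (int i = 0; i < m*r; i++)
        s_B[i/r][i%r] = B[i/r][i%r];
    for (int i = 0; i < n; i += NUM_THREADS) {
        /* each thread stages its own row of A */
        for (int j = 0; j < r; j++)
            s_A[tid*r+j] = A[tid+i][j];
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                /* per-thread partial copy of C starts at offset tid*r */
                s_C[tid*r+k] += s_A[tid*r+k] - s_B[j][k];
        ......
    }
    /* Synchronize and combine the per-thread copies of C */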
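As a quick check on the example, the shared memory footprint of the three declared arrays can be computed directly from the parameters on the previous slide. The helper name `shared_bytes` is hypothetical, and the figures quoted below assume a 4-byte float.

```c
#include <assert.h>
#include <stddef.h>

/* Shared memory (in bytes) declared by the generated code:
 * s_B holds m*r floats, while s_A and s_C each hold r floats per thread. */
size_t shared_bytes(size_t m, size_t r, size_t num_threads)
{
    assert(num_threads > 0);
    size_t floats = m * r + 2 * r * num_threads;
    return floats * sizeof(float);
}
```

With m = 3, r = 3 and NUM_THREADS = 256 this gives 1545 floats, or 6180 bytes with 4-byte floats, which fits within the 16 KB limit. By comparison, the whole of A (n*r = 6144 floats, 24576 bytes) would not fit, which is consistent with the generated code staging only one row of A per thread at a time.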
Suggesting Loop Transformation
    /* original loop nest */
    for (int rc = 0; rc < nRowCl; rc++) {
        tempDis = 0;
        for (int c = 0; c < numCol; c++)
            tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
    }

    /* suggested transformation: loops interchanged, tempDis expanded to an array */
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] = 0;
    for (int c = 0; c < numCol; c++) {
        /* load into shared memory */
        for (int rc = 0; rc < nRowCl; rc++) {
            tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
        }
    }
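The transformation is a loop interchange combined with expanding the scalar accumulator into an array. The host-side C sketch below checks that both orders compute the same values; the fixed sizes (`N_ROW_CL`, `NUM_COL`, `N_COL_CL`), the function names, and the `out` array that records each result are assumptions added for illustration, and `data_row` stands for `data[r]` at a fixed r.

```c
#include <assert.h>

#define N_ROW_CL 3
#define NUM_COL 4
#define N_COL_CL 2

/* Original order: rc outer, c inner, one scalar accumulator per rc. */
void dist_original(const float *data_row, float Acomp[N_ROW_CL][N_COL_CL],
                   const int *colCL, float *out)
{
    for (int rc = 0; rc < N_ROW_CL; rc++) {
        float tempDis = 0;
        for (int c = 0; c < NUM_COL; c++) {
            assert(colCL[c] >= 0 && colCL[c] < N_COL_CL);
            tempDis += data_row[c] * Acomp[rc][colCL[c]];
        }
        out[rc] = tempDis;
    }
}

/* Interchanged order: c outer, rc inner, accumulator expanded to an array. */
void dist_interchanged(const float *data_row, float Acomp[N_ROW_CL][N_COL_CL],
                       const int *colCL, float *tempDis)
{
    for (int rc = 0; rc < N_ROW_CL; rc++)
        tempDis[rc] = 0;
    for (int c = 0; c < NUM_COL; c++) {
        assert(colCL[c] >= 0 && colCL[c] < N_COL_CL);
        /* On the GPU, this is where the data needed for column c
         * would be loaded into shared memory. */
        float d = data_row[c];
        for (int rc = 0; rc < N_ROW_CL; rc++)
            tempDis[rc] += d * Acomp[rc][colCL[c]];
    }
}
```

For each rc, both versions add the terms in the same order of c, so the interchange does not reorder the floating-point sum. What changes is that everything indexed by c is touched once per outer iteration, which is what makes staging it in shared memory worthwhile.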
Experiments
• Effectiveness of using shared memory
• Compare with the intuitive approach in previous work
• Greedy sorting: sort all the variables in increasing order of size, and allocate them to shared memory until the shared memory limit is reached
• Effectiveness of loop transformation suggested by the integer programming model
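The greedy baseline described above can be sketched in a few lines of C. This is an illustrative reading of the slide rather than the code from the previous work: the function name `greedy_allocate` is hypothetical, and stopping at the first variable that no longer fits is an assumption about what "until the limit is reached" means.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator: ascending order of size. */
static int cmp_size(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts sizes (in bytes) in increasing order in place, then allocates
 * variables to shared memory until the next one would exceed the limit.
 * Returns the number of variables allocated and stores the bytes used. */
int greedy_allocate(int *sizes, int count, int limit_bytes, int *used_bytes)
{
    assert(count >= 0 && limit_bytes >= 0);
    qsort(sizes, (size_t)count, sizeof sizes[0], cmp_size);
    int used = 0, allocated = 0;
    while (allocated < count && used + sizes[allocated] <= limit_bytes) {
        used += sizes[allocated];
        allocated++;
    }
    *used_bytes = used;
    return allocated;
}
```

The baseline looks only at variable sizes. The integer programming model additionally takes live ranges and access features into account, which is the difference the experiments measure.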
Experiment Results
• K-means
• EM
Experiment Results (cont'd)
• PCA
• Co-clustering
Effect of Loop Transformation
• PCA
• Co-clustering
Conclusion and Future Work
• Proposed an integer programming model for shared memory arrangement on GPUs
• Considers numeric variables, arrays, and sections of arrays
• Suggested loop transformations for optimization
• Achieved better results than the intuitive method
• Future work: automate the code generation and the selection of loop transformations
THANK YOU! Questions?