
AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS


Presentation Transcript


  1. AN INTEGER PROGRAMMING FRAMEWORK FOR OPTIMIZING SHARED MEMORY USE ON GPUS. Wenjing Ma, Gagan Agrawal. The Ohio State University

  2. GPGPU • General-purpose programming on GPUs (accelerators) • High performance/price ratio • High-level language support: CUDA • Performance vs. productivity • Hard to program • Memory hierarchy to manage • ...

  3. Get High Performance from the GPU, and Make the Programming Simple! • Automatic code generation • Device memory access is expensive • Using shared memory • Texture and constant memory • Coalescing device memory access • ...

  4. FEATURES OF SHARED MEMORY • Small and fast, like a cache • 16 KB per multiprocessor (no more than 48 KB even on the latest GPUs) • Read-write • Software controlled, e.g. __shared__ float data[N][N]; • Allocating shared memory is similar to register allocation
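    A minimal CUDA sketch makes the "software controlled" point concrete. The kernel below is illustrative only (BLOCK, the kernel name, and its arguments are assumptions, not from the paper):

        #define BLOCK 256

        /* Illustrative kernel: each block stages a tile of the input in
         * shared memory before operating on it. */
        __global__ void scale(const float *in, float *out, int n, float factor)
        {
            __shared__ float tile[BLOCK];      /* on-chip, programmer-managed */
            int idx = blockIdx.x * BLOCK + threadIdx.x;
            if (idx < n)
                tile[threadIdx.x] = in[idx];   /* coalesced load from device memory */
            __syncthreads();                   /* all loads finish before any use */
            if (idx < n)
                out[idx] = tile[threadIdx.x] * factor;
        }

    Unlike a hardware cache, nothing is fetched or evicted automatically: the programmer decides what lives in tile and when, which is exactly the allocation decision the rest of the talk automates.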

  5. Problem Formulation for Shared Memory Arrangement • Consider the variables and basic blocks of a function • A variable can be an element of an array, a whole array, or a section of an array • Each variable can have several live ranges in the function • Each live range has an access feature: read, write, read-write, or temp • Decide in which basic block each variable is allocated to shared memory • assign_point[i][k]: a 0-1 decision variable for variable i and basic block k

  6. Integer Programming Problem • Integer Linear Programming • Objective function: maximize z = c^T x • Constraints • Solution: values of x • A special case of linear programming in which all the unknowns are integers (0-1 variables in our case) • Solvable for problems of reasonable size
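    For reference, the standard 0-1 ILP form alluded to here is (textbook notation, not specific to this paper):

        \begin{aligned}
        \text{maximize}\quad   & z = c^{\mathsf{T}} x \\
        \text{subject to}\quad & Ax \le b \\
                               & x_i \in \{0, 1\} \quad \text{for all } i
        \end{aligned}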

  7. Integer Programming for Shared Memory Arrangement • Objective function • Maximize shared memory usage • Minimize data transfer between levels of the memory hierarchy

  8. Integer Programming for Shared Memory Arrangement (cont'd) • Objective function (the formula appeared as an image on the slide)
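    The formula itself was not captured in the transcript. Purely as an illustration of the shape such an objective can take (the weight terms below are guesses, not the paper's formula), one might write:

        \text{maximize}\quad z \;=\; \sum_{i}\sum_{k}
            \big(\,\mathit{benefit}(i,k) - \mathit{transfer\_cost}(i,k)\,\big)
            \cdot \mathit{assign\_point}[i][k]

    where benefit grows with how often variable i is accessed in basic block k, matching the two goals stated on the previous slide.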

  9. An Example to Show size_alloc

    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......
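    Plugging in the numbers that appear later on slide 13 (n = 2048, m = 3, r = 3, NUM_THREADS = 256) gives a feel for size_alloc; the per-variable shapes below are read off the generated code on slide 14:

        \mathit{size\_alloc}(B) = m \cdot r = 9 \qquad
        \mathit{size\_alloc}(C) = r \cdot \mathit{NUM\_THREADS} = 768

        \mathit{size\_alloc}(A) = r \cdot \mathit{NUM\_THREADS} = 768
        \ \text{(one tile per iteration of the } i \text{ loop, not all } n \cdot r \text{ elements)}

    In floats that is 36 bytes for B and 3 KB each for A and C, comfortably inside the 16 KB limit.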

  10. Integer Programming for Shared Memory Arrangement (cont'd) • Constraints • The total allocation must not exceed the shared memory limit at any time • At most one assign_point is 1 in each live range
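    Written out, the two constraint families might look like this (a sketch matching the slide's wording; the paper's exact index sets may differ):

        \sum_{i\,:\,\text{live in block } k} \mathit{size\_alloc}(i,k)\cdot \mathit{assign\_point}[i][k] \;\le\; 16\,\mathrm{KB}
        \qquad \text{for every basic block } k

        \sum_{k \,\in\, \mathit{LR}} \mathit{assign\_point}[i][k] \;\le\; 1
        \qquad \text{for every live range } \mathit{LR} \text{ of every variable } i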

  11. Integer Programming for Shared Memory Arrangement (cont'd) • Obtaining the parameters, using the LLVM compiler framework • Pass 1: get access features (read, write, read-write, temp) • Pass 2: get live ranges, loop information, indices, and all other parameters
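    For readers unfamiliar with LLVM, this is the general shape of such an analysis pass (legacy pass-manager API; the pass name and the trivial read/write classification are placeholders, not the paper's actual passes):

        #include "llvm/Pass.h"
        #include "llvm/IR/Function.h"
        #include "llvm/IR/Instructions.h"
        #include "llvm/Support/raw_ostream.h"

        using namespace llvm;

        namespace {
        /* Pass 1 sketch: walk every instruction and record whether each
         * memory access reads or writes, per basic block. */
        struct AccessFeatures : public FunctionPass {
            static char ID;
            AccessFeatures() : FunctionPass(ID) {}

            bool runOnFunction(Function &F) override {
                for (BasicBlock &BB : F)
                    for (Instruction &I : BB) {
                        if (isa<LoadInst>(&I))
                            errs() << BB.getName() << ": read\n";
                        else if (isa<StoreInst>(&I))
                            errs() << BB.getName() << ": write\n";
                    }
                return false;   /* analysis only; the IR is unchanged */
            }
        };
        }

        char AccessFeatures::ID = 0;
        static RegisterPass<AccessFeatures>
            X("access-features", "Classify memory accesses per basic block");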

  12. Code Generation • Follows the shared memory arrangement obtained from the integer programming model • Built on the framework from our previous work • Moves data to cover the gap caused by data evicted from shared memory
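    As a hedged illustration of that last bullet: suppose a read-write tile like s_A from the upcoming example reaches the end of its live range and its space is about to be reused. The generator must first copy the live data back out (buffer names and indexing are borrowed from slide 14, not from actual generator output):

        /* End of s_A's live range: flush it so the buffer can be reused. */
        __syncthreads();                      /* every thread is done with s_A */
        for (int j = 0; j < r; j++)
            A[tid + i][j] = s_A[tid*r + j];   /* write the tile back to device */
        __syncthreads();                      /* now safe to overwrite s_A     */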

  13. An Example

    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                C[k] += A[i][k] - B[j][k];
    ......

    Input to the integer programming solver:
        A: n*r    B: m*r    C: r
        n: 2048   m: 3      r: 3    NUM_THREADS: 256

    Solver output:
        assign_point[0][1] = 1;   /* A: shared memory, basic block 1 (inside the i loop) */
        assign_point[1][0] = 1;   /* B: shared memory, basic block 0 */
        assign_point[2][0] = 1;   /* C: shared memory, basic block 0 */
        /* all other elements of assign_point are 0 */

  14. An Example (cont'd)

    Generated code:

    __shared__ float s_B[m][r];
    __shared__ float s_C[r*NUM_THREADS];
    __shared__ float s_A[r*NUM_THREADS];
    /* stage all of B, treating it as a flat m*r array */
    for (int i = 0; i < m*r; i++)
        ((float *)s_B)[i] = ((float *)B)[i];
    for (int i = 0; i < n; i += NUM_THREADS) {
        /* each thread loads one r-element row of A into its slice of s_A */
        for (int j = 0; j < r; j++)
            s_A[tid*r + j] = A[tid + i][j];
        for (int j = 0; j < m; j++)
            for (int k = 0; k < r; k++)
                s_C[tid*r + k] += s_A[tid*r + k] - s_B[j][k];
        ......
    }
    /* synchronize and combine the per-thread copies of C */

  15. Suggesting Loop Transformation

    Original loop:

    for (int rc = 0; rc < nRowCl; rc++) {
        tempDis = 0;
        for (int c = 0; c < numCol; c++)
            tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
    }

    Transformed loop:

    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] = 0;
    for (int c = 0; c < numCol; c++) {
        /* load into shared memory */
        for (int rc = 0; rc < nRowCl; rc++)
            tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
    }

    Interchanging the loops makes c outermost, so data[r][c] and colCL[c] are fetched once per column and can be staged in shared memory for reuse across all nRowCl rows; the cost is promoting the scalar tempDis to an array of partial sums.

  16. Experiments • Effectiveness of using shared memory • Compared with the intuitive approach from previous work • Greedy sorting: sort all variables in increasing order of size and allocate them to shared memory until the limit is reached (sketched below) • Effectiveness of the loop transformation suggested by the integer programming model
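    For concreteness, the greedy baseline can be sketched in a few lines of host C (the struct and function names are mine, not the paper's):

        #include <stdlib.h>

        #define SMEM_LIMIT (16 * 1024)   /* 16 KB per multiprocessor */

        typedef struct { size_t size; int in_shared; } Var;

        static int by_size(const void *a, const void *b) {
            size_t sa = ((const Var *)a)->size, sb = ((const Var *)b)->size;
            return (sa > sb) - (sa < sb);        /* ascending by size */
        }

        /* Greedy baseline: smallest variables first, until shared memory fills. */
        void greedy_allocate(Var *vars, int n) {
            qsort(vars, n, sizeof(Var), by_size);
            size_t used = 0;
            for (int i = 0; i < n; i++)
                if (used + vars[i].size <= SMEM_LIMIT) {
                    vars[i].in_shared = 1;
                    used += vars[i].size;
                }
        }

    Unlike the ILP model, this baseline never weighs access features or live ranges, which is what the comparison in the following slides probes.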

  17. Experiment Results

  18. Experiment Results • K-means • EM

  19. Experiment Results (cont'd) • PCA • Co-clustering

  20. Effect of Loop Transformation • PCA • Co-clustering

  21. Conclusion and Future Work • Proposed an integer programming model for arranging shared memory on GPUs • Considers numeric variables, whole arrays, and sections of arrays • Suggests loop transformations for further optimization • Achieves better results than the intuitive greedy method • Future work: automate code generation and the selection of loop transformations

  22. THANK YOU! Questions?
