A Library of Data-Copy and Data-Layout Optimizations
Malik Muhammad Zaki Murtaza Khan, Chun Chen, Jacqueline Chame, Mary Hall
{murtaza,jchame}@isi.edu, {mhall,chunchen}@cs.utah.edu
Funded by NSF award CSR-0615412, DOE grant DE-FC02-06ER25765, and a gift from Intel Corporation.

Motivation & Overview

Data copy
• Data copy dynamically rearranges data layouts by copying data between different memory locations, possibly in different memory structures.
• Conventional architectures:
  ♦ Improves locality and avoids cache and memory-bank conflicts.
• Accelerators:
  ♦ Copying data into accelerator memory (e.g., on FPGAs and GPUs) is a precursor to execution.
  ♦ Rearranging data can improve the parallelism of memory accesses and increase memory bandwidth.
  ♦ Automating this process eases the programmer's job of using the accelerator.

Motivation
• Modern architectures incorporate complex memory hierarchies:
  ♦ Heterogeneous memory (DRAM/SRAM, with different latency and bandwidth properties and different access policies).
  ♦ Software-managed storage.
• Data copy is an important transformation for orchestrating data movement and reorganization.
• A compiler framework can incorporate data copy to optimize memory-hierarchy utilization.

Data copy library
• What's the need?
  ♦ Automatic code generation for a number of different architectures.
  ♦ Many similar reorganizations are required across platforms, but the specific code is not portable.
• What's the solution?
  ♦ A library of different copy implementations, with a shared front end and architecture-specific code generation.
  ♦ Compiler technology to rewrite access expressions automatically and correctly.

Representation & Implementation

How does it work?
• A powerful polyhedral model supports rewriting of the affine access expressions.
• Multiple data-copy implementations form part of the library as callable functions.
• A script-level interface provides the mechanism through which compilers and programmers use the implementation functions.
• Multiple data layouts can be generated for different computations at different stages of optimization.

Polyhedral Model
• The polyhedral model provides simple representations of different program structures.
• It allows iteration domains and statement instances to be viewed as objects in a polyhedron.
• Omega Library Plus* provides the framework that implements the model.
• The advanced Omega code generation tool produces efficient loop code.
*Omega Library Plus is a new version of the original Omega Library, with unique features. [http://www.cs.utah.edu/~chunchen/omega/]

Polyhedral Framework
The polyhedral framework produces a simple representation of iteration domains, and simple relations based on linear equations represent the indices in the array accesses. For the Jacobi kernel

  DO I=1,N-1
    DO J=1,N-1
      A(I,J) = ( B(I-1,J) + B(I+1,J) + B(I,J-1) + B(I,J+1) ) / 4

the iteration domain and access relations include:

  { Sym=[n] [_t1,_t2] : 1 <= _t2 < n && 1 <= _t1 < n }
  { [_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = 1+Out_2 }
  { [_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && 1+_t2 = Out_2 }
  { [_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = Out_2 }
  ...

Without framework support, code must be generated directly from the original iteration space:

  for (t1 = 1; t1 <= n; t1++) {
    s1(t1,1);
    s2(t1,1);
    if (t1 <= 1) { s3(1,1); }
    for (t2 = 2; t2 <= t1-1; t2++) { s2(t1,t2); }
    if (t1 >= 2) { s2(t1,t1); s3(t1,t1); }
  }

Script interface:

  source: jacobi.sp2
  procedure: 0
  ...
  unroll(...)
  datacopy(...)
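The effect of a datacopy transformation on the Jacobi kernel can be made concrete with a small sketch. The C code below is not output of the library itself: the function name jacobi_tiled_copy, the buffer b_local, and the tile size TS are illustrative assumptions. It shows the two pieces the transformation introduces, a copy loop that gathers a tile of B (plus a one-element halo) into a local buffer, and a compute loop whose affine accesses to B have been rewritten as accesses to that buffer.

  /* Sketch only: conceptual result of applying datacopy to the Jacobi kernel.
     The buffer b_local and the tile size TS are hypothetical names. */
  #define TS 32

  void jacobi_tiled_copy(int n, double *A, double *B)   /* row-major n x n arrays */
  {
      double b_local[TS + 2][TS + 2];                   /* copy target: tile plus halo */

      for (int ii = 1; ii < n - 1; ii += TS)
          for (int jj = 1; jj < n - 1; jj += TS) {
              /* Data-copy loop: gather the tile of B together with its halo. */
              for (int i = ii - 1; i <= ii + TS && i < n; i++)
                  for (int j = jj - 1; j <= jj + TS && j < n; j++)
                      b_local[i - ii + 1][j - jj + 1] = B[i * n + j];

              /* Compute loop: the accesses to B are rewritten as accesses to b_local. */
              for (int i = ii; i < ii + TS && i < n - 1; i++)
                  for (int j = jj; j < jj + TS && j < n - 1; j++)
                      A[i * n + j] =
                          (b_local[i - ii][j - jj + 1] + b_local[i - ii + 2][j - jj + 1] +
                           b_local[i - ii + 1][j - jj] + b_local[i - ii + 1][j - jj + 2]) / 4.0;
          }
  }

The library's role is to generate such copy loops and rewritten access expressions automatically, from relations like those shown above, for whichever target layout is requested.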
Array access expressions can be modified to implement the required optimizations. Modified array references, or newly created arrays, set up the storage for the data-copy optimization:

  DO I=1,N-1
    · · · = D0[i];
    · · · = D1[i];
    DO J=1,N-1
      · · · = · · · + B0[i+j]*· · ·;    // u(0,0)
      · · · = · · · + B1[i+j]*· · ·;    // u(0,1)
      · · · = · · · + · · · * · · ·;    // u(1,0)
      · · · = · · · + B0[i+j+1]*· · ·;  // u(1,1)
    }
    D1[i] = · · ·;
    D0[i] = · · ·;
  }

Code Example: Jacobi

Data copy library with auto-tuning compiler
• Data copy forms part of an auto-tuning compiler framework.
• While using the library, the auto-tuning compiler can empirically evaluate the different implementations.
• Future extensions of the library are easily incorporated by the auto-tuning framework.
[Framework flow: Source Code → Data Copy Function Calls (data copy library) → Data Copy Implementations → Optimized Code.]

Memory Layout for FPGA
Multiple memory banks in an FPGA offer memory parallelism that the library exploits:
• Unroll the loops.
• Apply scalar replacement.
• Perform reuse analysis.
• Partition array references across the memory banks according to their access patterns [1].

  int A[34][18];
  int B[34][18];
  for (i = 1; i < 33; i++)
    for (j = 1; j < 17; j++) {
      A[i][j] = (B[i+1][j] + B[i-1][j] + B[i][j+1] + B[i][j-1]) / 4;
    }
  (a) Jacobi unannotated kernel

  for (i = 1; i < 33; i += 2)        /* unroll by 1 */
    for (j = 1; j < 17; j += 2) {    /* unroll by 1 */
      A[i][j]     = (B[i+1][j]   + B[i-1][j]   + B[i][j+1]   + B[i][j-1])   / 4;
      A[i][j+1]   = (B[i+1][j+1] + B[i-1][j+1] + B[i][j+2]   + B[i][j])     / 4;
      A[i+1][j]   = (B[i+2][j]   + B[i][j]     + B[i+1][j+1] + B[i+1][j-1]) / 4;
      A[i+1][j+1] = (B[i+2][j+1] + B[i][j+1]   + B[i+1][j+2] + B[i+1][j])   / 4;
    }
  (b) unroll-and-jam

After unroll-and-jam, the array references are partitioned into different memory banks.

[1] Byoungro So, Mary Hall, and Heidi Ziegler. Custom Data Layout for Memory Parallelism. CGO'04, Palo Alto, CA, March 20-24, 2004.

Memory Layout for GPU
Exploiting parallelism in the memory hierarchy of a GPU:
• Split a task into subtasks.
• Divide the input data into blocks that fit in shared memory.
• Copy from global memory into shared memory (a complete sketch follows the code excerpt below).
• Copy results from shared memory back to global memory.
[Figures: memory hierarchy in a GPU; shared-memory space allocation; copying data to shared memory.]

  #define N 16
  __global__ void jacobi_GPU(int* a, int* c) {
    __shared__ float b[4][4];
    int thidx = ....;
    int thidy = ....;
    if (blockIdx.x == 0) {
      if (threadIdx.x == 0)
        b[...] = a[...];
      if (threadIdx.x == 0 && (blockIdx.y == threadIdx.y))
        b[...] = a[...];
      if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.y == 0)
        b[...] = a[...];
    }
    ....
    if (thidx > 0 && thidx < N-1 && thidy > 0 && thidy < N-1)
      c[...] = 0.8 * (b[...] + b[...] + b[...] + b[...]);
  }

  int main() {
    .....
    dim3 dimBlock(N/2, N/2);
    dim3 dimGrid(N/4, N/4);
    cudaMalloc((void **)&a_gpu, N*N*sizeof(int));
    cudaMalloc((void **)&c_gpu, N*N*sizeof(int));
    cudaMemcpy(a_gpu, a, N*N*sizeof(int), cudaMemcpyHostToDevice);
    jacobi_GPU<<<dimGrid, dimBlock, ((N/2)+2)*((N/2)+2)*sizeof(int)>>>(a_gpu, c_gpu);
    cudaMemcpy(cr_gpu, c_gpu, N*sizeof(int), cudaMemcpyDeviceToHost);
    ....
    return 0;
  }
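The shared-memory copy that the excerpt above elides can be spelled out as follows. This is a minimal, hypothetical CUDA sketch rather than the code generated by the framework: the kernel name jacobi_shared, the tile size TILE, and the halo-loading scheme are assumptions. Each thread block copies its tile of the input (plus a one-element halo) from global memory into shared memory, synchronizes, and computes the Jacobi update entirely from the shared copy, matching the copy-in and compute steps listed above.

  // Sketch only: a complete shared-memory data-copy version of the Jacobi kernel.
  // jacobi_shared and TILE are hypothetical names, not part of the library.
  #include <cstdio>

  #define N 16
  #define TILE 8                                       // assumed tile edge; must divide N

  __global__ void jacobi_shared(const int *a, int *c)
  {
      __shared__ int b[TILE + 2][TILE + 2];            // tile plus one-element halo

      int gx = blockIdx.x * TILE + threadIdx.x;        // global column
      int gy = blockIdx.y * TILE + threadIdx.y;        // global row

      // Data-copy step: each thread loads its own element; border threads also
      // load the halo elements their tile needs.
      b[threadIdx.y + 1][threadIdx.x + 1] = a[gy * N + gx];
      if (threadIdx.x == 0        && gx > 0)     b[threadIdx.y + 1][0]        = a[gy * N + gx - 1];
      if (threadIdx.x == TILE - 1 && gx < N - 1) b[threadIdx.y + 1][TILE + 1] = a[gy * N + gx + 1];
      if (threadIdx.y == 0        && gy > 0)     b[0][threadIdx.x + 1]        = a[(gy - 1) * N + gx];
      if (threadIdx.y == TILE - 1 && gy < N - 1) b[TILE + 1][threadIdx.x + 1] = a[(gy + 1) * N + gx];
      __syncthreads();

      // Compute step: interior points read only the shared-memory copy.
      if (gx > 0 && gx < N - 1 && gy > 0 && gy < N - 1)
          c[gy * N + gx] = (b[threadIdx.y][threadIdx.x + 1] + b[threadIdx.y + 2][threadIdx.x + 1] +
                            b[threadIdx.y + 1][threadIdx.x] + b[threadIdx.y + 1][threadIdx.x + 2]) / 4;
  }

  int main()
  {
      int a[N * N], c[N * N], *a_gpu, *c_gpu;
      for (int i = 0; i < N * N; i++) { a[i] = i; c[i] = 0; }

      cudaMalloc((void **)&a_gpu, N * N * sizeof(int));
      cudaMalloc((void **)&c_gpu, N * N * sizeof(int));
      cudaMemcpy(a_gpu, a, N * N * sizeof(int), cudaMemcpyHostToDevice);

      dim3 dimBlock(TILE, TILE);
      dim3 dimGrid(N / TILE, N / TILE);
      jacobi_shared<<<dimGrid, dimBlock>>>(a_gpu, c_gpu);

      cudaMemcpy(c, c_gpu, N * N * sizeof(int), cudaMemcpyDeviceToHost);
      printf("c[N+1] = %d\n", c[N + 1]);
      cudaFree(a_gpu);
      cudaFree(c_gpu);
      return 0;
  }

Keeping the copy step separate from the compute step is what the library automates for each target: the same Jacobi computation is paired with the copy code and rewritten access expressions appropriate to the architecture.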