
A Library of Data-Copy and Data-Layout Optimizations

Malik Muhammad Zaki Murtaza Khan, Chun Chen, Jacqueline Chame, Mary Hall
{murtaza,jchame}@isi.edu, {mhall,chunchen}@cs.utah.edu






Funded by NSF award CSR-0615412, DOE grant DE-FC02-06ER25765, and a gift from Intel Corporation.

Motivation & Overview

Data copy: motivation
• Data copy dynamically rearranges data layouts by copying data between different memory locations, possibly in different memory structures.
• Conventional architectures:
  ♦ Improves locality and avoids cache and memory-bank conflicts.
• Accelerators:
  ♦ Copying data into the memory of an accelerator (e.g., an FPGA or GPU) is a precursor to execution.
  ♦ Rearranging data can improve the parallelism of memory accesses and increase memory bandwidth.
  ♦ Automating this process eases the programmer's job of using the accelerator.
• Modern architectures incorporate complex memory hierarchies:
  ♦ Heterogeneous memory (DRAM/SRAM, with different latency and bandwidth properties and different access policies).
  ♦ Software-managed storage.
• Data copy is an important transformation for orchestrating data movement and reorganization; a compiler framework can incorporate data copy to optimize memory-hierarchy utilization.

Data copy library
• What is the need?
  ♦ Automatic code generation for a number of different architectures.
  ♦ Many similar reorganizations are required across platforms, but the specific code is not portable.
• What is the solution?
  ♦ A library of different copy implementations, with a shared front end and architecture-specific code generation.
  ♦ Compiler technology to correctly rewrite access expressions automatically.

Representation & Implementation

How does it work?
• A powerful polyhedral model supports rewriting of the affine access expressions.
• Multiple data-copy implementations form part of the library as callable functions.
• A script-level interface provides the mechanism for compilers and programmers to use the implementation functions.
• Multiple data layouts can be generated for different computations at different stages of optimization.

Polyhedral model
• The polyhedral model provides simple representations of different program structures.
• It allows iteration domains and statement instances to be viewed as objects in a polyhedron.
• Omega Library Plus* provides the framework that implements the model.
• The advanced Omega code generation tool generates efficient loop code.

*Omega Library Plus is a new version of the old Omega Library, with unique features. [http://www.cs.utah.edu/~chunchen/omega/]

Example (Jacobi):

DO I=1,N-1
  DO J=1,N-1
    A(I,J) = ( B(I-1,J) + B(I+1,J) + B(I,J-1) + B(I,J+1) )/4

The polyhedral framework produces a simple representation of the iteration domain, and simple relations, based on linear equations, represent the indices in the array accesses:

{ Sym=[n] [_t1,_t2] : 1 <= _t2 < n && 1 <= _t1 < n }
{[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = 1+Out_2 }
{[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && 1+_t2 = Out_2 } ...

(a) Script interface:

source: jacobi.sp2
procedure: 0
...
unroll(...)
datacopy(...)

Without framework support (from the original iteration space):

for(t1 = 1; t1 <= n; t1++) {
  s1(t1,1);
  s2(t1,1);
  if (t1 <= 1) {
    s3(1,1);
  }
  for(t2 = 2; t2 <= t1-1; t2++) {
    s2(t1,t2);
  }
  if (t1 >= 2) {
    s2(t1,t1);
    s3(t1,t1);
  }
}

Array access expressions can be modified to implement the required optimizations:

{[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = Out_2 }
{[_t1,_t2] -> [Out_1,Out_2] : _t1 = Out_1 && _t2 = Out_2 } ...

DO I=1,N-1
  ... = D0[i];
  ... = D1[i];
  DO J=1,N-1
    ... = ... + B0[i+j]*...;   // u(0,0)
    ... = ... + B1[i+j]*...;   // u(0,1)
    ... = ... + ... * ...;     // u(1,0)
    ... = ... + B0[i+j+1]*...; // u(1,1)
  }
  D1[i] = ...;
  D0[i] = ...;
}

Modified array references, or newly created arrays, set up the storage for the data-copy optimization.
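To make the transformation concrete, here is a plain-C sketch (hand-written for illustration, not output of the library) of the kind of rewriting the data-copy optimization performs for the Jacobi stencil: the rows of B touched by an outer iteration are first copied into contiguous local buffers, and the access expressions in the inner loop are rewritten to index the copies.

```c
#include <assert.h>
#include <string.h>

#define N 8

/* Baseline Jacobi-style stencil: A(i,j) averages B's four neighbors. */
static void jacobi_ref(double A[N][N], double B[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            A[i][j] = (B[i-1][j] + B[i+1][j] + B[i][j-1] + B[i][j+1]) / 4.0;
}

/* Same computation after a manual data-copy transformation: the three rows
 * of B used by iteration i are copied into contiguous buffers, and the
 * affine access expressions are rewritten to index the copies. */
static void jacobi_copy(double A[N][N], double B[N][N]) {
    double up[N], mid[N], down[N];
    for (int i = 1; i < N - 1; i++) {
        memcpy(up,   B[i-1], sizeof up);   /* copy row i-1 */
        memcpy(mid,  B[i],   sizeof mid);  /* copy row i   */
        memcpy(down, B[i+1], sizeof down); /* copy row i+1 */
        for (int j = 1; j < N - 1; j++)
            A[i][j] = (up[j] + down[j] + mid[j-1] + mid[j+1]) / 4.0;
    }
}
```

The buffer names (up/mid/down) are illustrative; on an accelerator they would correspond to allocations in local or banked memory.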
Code Example: Jacobi — Memory Layout for an FPGA

Data-copy library with an auto-tuning compiler
• Data copy forms part of an auto-tuning compiler framework.
• Using the library, the auto-tuning compiler can empirically evaluate the different implementations.
• Future extensions of the library are easily incorporated by the auto-tuning framework.

Multiple memory banks in an FPGA
• Exploit the memory parallelism offered by multiple memory banks:
  ♦ Unroll the loops.
  ♦ Apply scalar replacement.
  ♦ Perform reuse analysis.
  ♦ Partition array references into different memory banks according to their access patterns.

int A[34][18];
int B[34][18];
for (i = 1; i < 33; i++)
  for (j = 1; j < 17; j++) {
    A[i][j] = (B[i+1][j] + B[i-1][j] + B[i][j+1] + B[i][j-1]) / 4;
  }
(a) Jacobi unannotated kernel

for (i = 1; i < 33; i += 2)       /* unroll by 1 */
  for (j = 1; j < 17; j += 2) {   /* unroll by 1 */
    A[i][j]     = (B[i+1][j]   + B[i-1][j]   + B[i][j+1]   + B[i][j-1])   / 4;
    A[i][j+1]   = (B[i+1][j+1] + B[i-1][j+1] + B[i][j+2]   + B[i][j])     / 4;
    A[i+1][j]   = (B[i+2][j]   + B[i][j]     + B[i+1][j+1] + B[i+1][j-1]) / 4;
    A[i+1][j+1] = (B[i+2][j+1] + B[i][j+1]   + B[i+1][j+2] + B[i+1][j])   / 4;
  }
(b) unroll-and-jam, with array references partitioned into different memory banks

Source code with data-copy function calls (GPU):

#define N 16
__global__ void jacobi_GPU(int* a, int* c) {
  __shared__ float b[4][4];
  int thidx = ....;
  int thidy = ....;
  if (blockIdx.x == 0) {
    if (threadIdx.x == 0)
      b[...] = a[...];
    if (threadIdx.x == 0 && (blockIdx.y == threadIdx.y))
      b[...] = a[...];
    if (threadIdx.x == 0 && threadIdx.y == 0 && blockIdx.y == 0)
      b[...] = a[...];
  }
  ....
  if (thidx > 0 && thidx < N-1 && thidy > 0 && thidy < N-1)
    c[...] = 0.8 * (b[...] + b[...] + b[...] + b[...]);
}

int main() {
  .....
  dim3 dimBlock(N/2, N/2);
  dim3 dimGrid(N/4, N/4);
  cudaMalloc((void **)&a_gpu, N*N*sizeof(int));
  cudaMalloc((void **)&c_gpu, N*N*sizeof(int));
  cudaMemcpy(a_gpu, a, N*N*sizeof(int), cudaMemcpyHostToDevice);
  jacobi_GPU<<<dimGrid, dimBlock, ((N/2)+2)*((N/2)+2)*sizeof(int)>>>(a_gpu, c_gpu);
  cudaMemcpy(cr_gpu, c_gpu, N*sizeof(int), cudaMemcpyDeviceToHost);
  ....
  return 0;
}

Memory layout for a GPU
• Exploit parallelism in the memory hierarchy of a GPU:
  ♦ Split a task into subtasks.
  ♦ Divide the input data into blocks that fit in shared memory.
  ♦ Copy data from global memory into shared memory.
  ♦ Copy results from shared memory back to global memory.

(The optimized code above illustrates the shared-memory space allocation and the copying of data into shared memory produced by the data-copy implementations within the auto-tuning compiler framework.)

Reference
1. Custom Data Layout for Memory Parallelism. Byoungro So, Mary Hall, and Heidi Ziegler. In Proceedings of the International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, CA, March 20-24, 2004.
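The GPU copy scheme above — divide the data into blocks that fit in shared memory, copy each block (plus its halo) in, compute from the copy, and write results back — can be sketched in plain C, with a local scratch buffer standing in for shared memory. This is an illustrative analogue, not code generated by the library.

```c
#include <assert.h>

#define N 16    /* global grid is N x N */
#define TILE 4  /* block size that "fits in shared memory" */

/* For each TILE x TILE block: copy the block plus a one-element halo from
 * the "global" array g into a local scratchpad s, compute the Jacobi
 * stencil out of the copy, and write results back to the global output. */
static void jacobi_blocked(const double g[N][N], double out[N][N]) {
    for (int bi = 0; bi < N; bi += TILE)
        for (int bj = 0; bj < N; bj += TILE) {
            double s[TILE + 2][TILE + 2];
            /* Copy block + halo into the scratchpad (clamp at borders). */
            for (int i = -1; i <= TILE; i++)
                for (int j = -1; j <= TILE; j++) {
                    int gi = bi + i, gj = bj + j;
                    if (gi < 0) gi = 0;
                    if (gi > N - 1) gi = N - 1;
                    if (gj < 0) gj = 0;
                    if (gj > N - 1) gj = N - 1;
                    s[i + 1][j + 1] = g[gi][gj];
                }
            /* Compute from the local copy; skip the global boundary. */
            for (int i = 0; i < TILE; i++)
                for (int j = 0; j < TILE; j++) {
                    int gi = bi + i, gj = bj + j;
                    if (gi > 0 && gi < N - 1 && gj > 0 && gj < N - 1)
                        out[gi][gj] = (s[i][j + 1] + s[i + 2][j + 1]
                                     + s[i + 1][j] + s[i + 1][j + 2]) / 4.0;
                }
        }
}
```

In the CUDA version, the copy loop is distributed across the threads of a block and followed by a barrier before the compute phase.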
