Programming with Tiles Jia Guo, Ganesh Bikshandi*, Basilio B. Fraguela+, María J. Garzarán, David Padua University of Illinois at Urbana-Champaign *IBM, India +Universidade da Coruña, Spain
Motivation • The importance of tiles • A natural way to express many algorithms • Partitioning data is an effective way to enhance locality • Pervasive in parallel computing • Limitations of today's programming languages • No language construct to express tiles directly • Complicated indexing and loop structures to traverse an array by tiles (see the sketch below) • No support for storage by tile • The design and the low-level implementation details are mixed together
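To make the indexing complaint concrete, here is a small standard-C++ sketch (ours, not from the talk) of traversing a matrix by tiles in a language with no tile construct: the tile structure must be encoded by hand with extra loops and index arithmetic.

// Manual tiled traversal of an N x N row-major matrix: four loops and
// explicit index arithmetic encode what a tile construct expresses directly.
// For simplicity, assumes the tile size B divides N.
void scale_by_tiles(double* a, int N, int B) {
  for (int ti = 0; ti < N; ti += B)          // tile row
    for (int tj = 0; tj < N; tj += B)        // tile column
      for (int i = ti; i < ti + B; ++i)      // row within the tile
        for (int j = tj; j < tj + B; ++j)    // column within the tile
          a[i * N + j] *= 2.0;               // element (i,j) of tile (ti/B, tj/B)
}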
Contributions • Our group developed Hierarchically Tiled Arrays (HTAs), which support tiles to enhance locality and parallelism (PPoPP 06) • Designed new language constructs based on our experience: • Dynamic partitioning • Overlapped tiling • Evaluated both productivity and performance
Outline • Introduction of HTA • Dynamic partitioning • Overlapped tiling • Impact on programming productivity • Related work • Conclusions
HTA overview [Figure: tile and element indexing on a 2x2-tiled HTA — A(0,0) selects a tile; A(0,1)[0,1] selects element (0,1) inside tile (0,1); A(0,0:1) = B(1,0:1) assigns a whole row of tiles at once; operations such as A + B apply tile by tile.]
HTA library implementation • Implemented in Matlab and in C++ • Supports sequential and parallel execution on top of MPI and TBB • Supports linear and non-linear data layouts • Execution model: SPMD • Programming model: single-threaded (a sketch of the notation follows)
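To make the overview concrete, a minimal sketch of HTA creation and indexing in the notation used on these slides (the alloc and tiling-tuple syntax is borrowed from the overlapped-tiling example later in the talk; this is illustrative slide notation, with array and array2 assumed names, not necessarily the library's exact interface):

// Create two 1D HTAs, each with two tiles of 4 and 3 doubles, over the
// data pointed to by array and array2.
Tuple::seq tiling = ((4), (3));
HTA<double,1> A = HTA<double,1>::alloc(1, tiling, array, ROW);
HTA<double,1> B = HTA<double,1>::alloc(1, tiling, array2, ROW);

A(1);            // tile indexing: the second tile as a whole
A(0)[2];         // hybrid indexing: element 2 inside tile 0
A(0:1) = B(0:1); // assign a range of tiles at once
A = A + B;       // whole-HTA operations apply tile by tile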
Outline • Introduction of HTA • Dynamic partitioning • Overlapped tiling • Impact on programming productivity • Related work • Conclusions
Why dynamic partitioning? • Some linear algebra algorithms • Add and remove partitions to sweep through matrices • Cache-oblivious algorithms • Create partitions following a divide-and-conquer strategy
Syntax and semantics • Partitions are delimited by partition lines • Two methods: • void part( Tuple sourcePartition, Tuple offset ); — adds partition lines at the given offset from an existing partition (NONE in a dimension leaves that dimension unpartitioned) • void rmPart( Tuple partition ); — removes the given partition lines [Figure: A.part((1,1),(2,2)) adds a partition line at offset (2,2) from partition (1,1); A.rmPart((1,1)) removes it.]
LU algorithm with dynamic partitioning [Figure: one iteration of the loop — at the beginning of the iteration, A.part((1,1),(nb,nb)) repartitions the matrix into an nb x nb block A11 and the blocks A12, A21, A22; the update leaves A12, A21, A22 partially updated; at the end of the iteration, A.rmPart((1,1)) removes the partition and the 'done' region grows by nb.]
LU algorithm represented in FLAME [Geijn, et al] [Figure: the FLAME algorithm; the blocks marked UPD are the ones updated in the loop.]
From algorithm to HTA program [Left: the FLAME algorithm. Right: the HTA implementation.]

void lu(HTA<double,2> A, HTA<int,1> p, int nb) {
  A.part((0,0), (0,0));
  p.part((0), (0));
  while (A(0,0).lsize(1) < A.lsize(1)) {
    int b = min(A(1,1).lsize(0), nb);
    A.part((1,1), (b,b));
    p.part((1), (b));
    dgetf2(A(1:2,1), p(1));
    dlaswp(A(1:2,0), p(1));
    dlaswp(A(1:2,2), p(1));
    trsm(HtaRight, HtaUpper, HtaNoTrans, HtaUnit, One, A(1,1), A(1,2));
    gemm(HtaNoTrans, HtaNoTrans, MinusOne, A(2,1), A(1,2), One, A(2,2));
    A.rmPart((1,1));
    p.rmPart((1));
  }
}
FLAME API vs. HTA — the HTA side of the comparison:
A.part((0,0), (0,0));
A.part((1,1), (b,b));
A.rmPart((1,1));
A(1:2, 1);
HTA advantages: (1) general, (2) fewer variables, (3) simple indexing, (4) flexible range selection
A cache-oblivious algorithm: parallel merge
1. Take the middle element of in1
2. Find its lower bound in in2
3. Calculate the corresponding partition of out
4. Partition all three arrays (logically)
5. Merge the pieces in parallel
[Figure: example with in1 = 3 4 11 14 22 22 30 32 41 and in2 = 2 4 10 15 18 23 23 28 40.]
HTA vs. Threading Building Blocks (TBB)
HTA code: map(PMerge(), out, in1, in2);
TBB code: parallel_for(PMergeRange(begin1, end1, begin2, end2, out), PMergeBody());
• HTA: partition tiles and operate on the tiles in parallel
• TBB: prepare the merge range and operate on smaller ranges in parallel (a standalone sketch of the divide step follows)
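The divide step of this merge is plain C++; a minimal standalone sketch (ours, not the talk's code) of steps 1-3 using the standard library:

#include <algorithm>
#include <vector>

// One divide step of the cache-oblivious parallel merge: split in1 at its
// middle element, split in2 at that element's lower bound, and derive the
// split point of out. Steps 4-5 partition at these points and recurse,
// merging the two halves in parallel. Assumes in1 is non-empty and both
// inputs are sorted.
void split(const std::vector<int>& in1, const std::vector<int>& in2,
           std::size_t& cut1, std::size_t& cut2, std::size_t& cutOut) {
  cut1 = in1.size() / 2;                              // step 1: middle of in1
  cut2 = std::lower_bound(in2.begin(), in2.end(),     // step 2: lower bound
                          in1[cut1]) - in2.begin();   //         in in2
  cutOut = cut1 + cut2;                               // step 3: split of out
}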
[Figure: performance of dynamic partitioning — recursive MMM, recursive LU, dynamic LU, and Sylvester benchmarks.]
Outline • Introduction of HTA • Dynamic partitioning • Overlapped tiling • Impact on programming productivity • Related work • Conclusions
Motivation • Programs whose computations are based on neighboring points • E.g. iterative PDE solvers • They benefit from tiling • Shadow regions keep copies of neighboring data • Problems with the current approach • Explicit allocation and update of the shadow regions • Example: NAS MG [Figure: lines of code devoted to shadow-region handling in NAS MG.]
Overlapped tiling • Objective • Automate the allocation and update of shadow regions • Allow access to neighbors across adjacent tiles • The programmer specifies the overlapped region at creation time: Overlap<DIM> ol = Overlap<DIM>(Tuple negativeDir, Tuple positiveDir, BoundaryMode mode, bool autoUpdate=true);
Example of overlapped tiling
Tuple::seq tiling = ((4), (3));
Overlap<DIM> ol((1), (1), zero);
A = HTA<double,1>::alloc(1, tiling, array, ROW, ol);
[Figure: each tile is extended with a one-element shadow region in each direction; at the array ends, the zero boundary mode supplies the boundary values.]
Indexing and operations • Indexing: T[0:3] = T[ALL] addresses the owned region of a tile; T[-1] and T[4] address the shadow elements beyond it • Operations (+, -, map, =, etc.) • The conformability rules apply only to owned regions • This enables operations with non-overlapped HTAs
Shadow region consistency • Ensures that shadow regions are properly updated and consistent with the corresponding data in the owned tiles • Uses an update-on-read policy, bookkeeping the status of each tile (a sketch follows) • In SPMD mode, no communication is needed • Allows both manual and automatic updates
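A hedged sketch, in plain C++ and independent of the actual HTA internals, of what update-on-read bookkeeping amounts to: each tile records whether its shadow cells are stale, writes mark the neighbors stale, and the copy-in happens lazily on the first read that follows.

#include <vector>

// Toy model of update-on-read consistency for 1D tiles whose layout is
// [left shadow | owned elements | right shadow]; assumes each tile holds
// at least three elements.
struct Tile {
  std::vector<double> data;
  bool shadowStale = true;   // set when a neighbor's owned data changes
};

// Called before any read that may touch shadow cells.
void refreshShadow(std::vector<Tile>& tiles, std::size_t i) {
  Tile& t = tiles[i];
  if (!t.shadowStale) return;                        // lazy: skip if fresh
  if (i > 0) {                                       // copy from left neighbor
    const Tile& l = tiles[i - 1];
    t.data.front() = l.data[l.data.size() - 2];      // its last owned element
  }
  if (i + 1 < tiles.size())                          // copy from right neighbor
    t.data.back() = tiles[i + 1].data[1];            // its first owned element
  t.shadowStale = false;
}

// Called after a write to tile i's owned region.
void markNeighborsStale(std::vector<Tile>& tiles, std::size_t i) {
  if (i > 0) tiles[i - 1].shadowStale = true;
  if (i + 1 < tiles.size()) tiles[i + 1].shadowStale = true;
}

Under this policy, a code that writes a tile and only later reads its neighbor pays for a single shadow update per read, which is where the minimal-communication claim on the last slide comes from.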
Evaluation • Platform: a cluster of 128 nodes, each with two 2 GHz G5 processors and 4 GB of RAM, connected by Myrinet; one processor per node was used in the experiments • The NAS code (Fortran + MPI) was compiled with g77 and the HTA code with g++ (3.3); the -O3 flag was used in both cases
MG comm3 in HTA with overlapped tiling — the shadow-region exchange is declared once:

Overlap<3> * ol = new Overlap<3>(T<3>(1,1,1), T<3>(1,1,1), PERIODIC);

MG comm3 in HTA without overlapped tiling:

void comm3 (Grid& u) {
  int NX = u.shape().size()[0] - 1;
  int NY = u.shape().size()[1] - 1;
  int NZ = u.shape().size()[2] - 1;
  int nx = u(T(0,0,0)).shape().size()[0] - 1;
  int ny = u(T(0,0,0)).shape().size()[1] - 1;
  int nz = u(T(0,0,0)).shape().size()[2] - 1;

  // north-south
  Traits::Default::async();
  if (NX > 0)
    u((R(0, NX-1), R(0, NY), R(0, NZ)))((R(nx, nx), R(1, ny-1), R(1, nz-1))) =
        u((R(1, NX), R(0, NY), R(0, NZ)))((R(1,1), R(1, ny-1), R(1, nz-1)));
  u((R(NX, NX), R(0, NY), R(0, NZ)))((R(nx, nx), R(1, ny-1), R(1, nz-1))) =
      u((R(0,0), R(0, NY), R(0, NZ)))((R(1,1), R(1, ny-1), R(1, nz-1)));
  if (NX > 0)
    u((R(1, NX), R(0, NY), R(0, NZ)))((R(0,0), R(1, ny-1), R(1, nz-1))) =
        u((R(0, NX-1), R(0, NY), R(0, NZ)))((R(nx-1,nx-1), R(1, ny-1), R(1, nz-1)));
  u((R(0, 0), R(0, NY), R(0, NZ)))((R(0,0), R(1, ny-1), R(1, nz-1))) =
      u((R(NX, NX), R(0, NY), R(0, NZ)))((R(nx-1,nx-1), R(1, ny-1), R(1, nz-1)));
  Traits::Default::sync();

  // east-west
  Traits::Default::async();
  if (NY > 0)
    u((R(0, NX), R(0, NY-1), R(0, NZ)))((R(0, nx), R(ny, ny), R(1, nz-1))) =
        u((R(0, NX), R(1, NY), R(0, NZ)))((R(0, nx), R(1, 1), R(1, nz-1)));
  u((R(0, NX), R(NY, NY), R(0, NZ)))((R(0, nx), R(ny, ny), R(1, nz-1))) =
      u((R(0, NX), R(0,0), R(0, NZ)))((R(0, nx), R(1, 1), R(1, nz-1)));
  if (NY > 0)
    u((R(0, NX), R(1, NY), R(0, NZ)))((R(0, nx), R(0, 0), R(1, nz-1))) =
        u((R(0, NX), R(0, NY-1), R(0, NZ)))((R(0, nx), R(ny-1, ny-1), R(1, nz-1)));
  u((R(0, NX), R(0, 0), R(0, NZ)))((R(0, nx), R(0, 0), R(1, nz-1))) =
      u((R(0, NX), R(NY, NY), R(0, NZ)))((R(0, nx), R(ny-1, ny-1), R(1, nz-1)));
  Traits::Default::sync();

  // front-back
  Traits::Default::async();
  if (NZ > 0)
    u((R(0, NX), R(0, NY), R(0, NZ-1)))((R(0, nx), R(0, ny), R(nz, nz))) =
        u((R(0, NX), R(0, NY), R(1, NZ)))((R(0, nx), R(0, ny), R(1, 1)));
  u((R(0, NX), R(0, NY), R(NZ, NZ)))((R(0, nx), R(0, ny), R(nz, nz))) =
      u((R(0, NX), R(0, NY), R(0, 0)))((R(0, nx), R(0, ny), R(1, 1)));
  if (NZ > 0)
    u((R(0, NX), R(0, NY), R(1, NZ)))((R(0, nx), R(0, ny), R(0, 0))) =
        u((R(0, NX), R(0, NY), R(0, NZ-1)))((R(0, nx), R(0, ny), R(nz-1, nz-1)));
  u((R(0, NX), R(0, NY), R(0, 0)))((R(0, nx), R(0, ny), R(0, 0))) =
      u((R(0, NX), R(0, NY), R(NZ, NZ)))((R(0, nx), R(0, ny), R(nz-1, nz-1)));
  Traits::Default::sync();
}
MG comm3 in NAS (Fortran + MPI) — the take3, comm3, give3, and ready routines:

      subroutine take3( axis, dir, u, n1, n2, n3 )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer buff_id, indx
      integer status(mpi_status_size), ierr
      integer i3, i2, i1

      call mpi_wait( msg_id( axis, dir, 1 ), status, ierr )
      buff_id = 3 + dir
      indx = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(n1,i2,i3) = buff(indx, buff_id )
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(1,i2,i3) = buff(indx, buff_id )
               enddo
            enddo
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,n2,i3) = buff(indx, buff_id )
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,1,i3) = buff(indx, buff_id )
               enddo
            enddo
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,n3) = buff(indx, buff_id )
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,1) = buff(indx, buff_id )
               enddo
            enddo
         endif
      endif
      return
      end

      subroutine comm3(u,n1,n2,n3,kk)
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer n1, n2, n3, kk
      double precision u(n1,n2,n3)
      integer axis

      if( .not. dead(kk) )then
         do axis = 1, 3
            if( nprocs .ne. 1) then
               call ready( axis, -1, kk )
               call ready( axis, +1, kk )
               call give3( axis, +1, u, n1, n2, n3, kk )
               call give3( axis, -1, u, n1, n2, n3, kk )
               call take3( axis, -1, u, n1, n2, n3 )
               call take3( axis, +1, u, n1, n2, n3 )
            else
               call comm1p( axis, u, n1, n2, n3, kk )
            endif
         enddo
      else
         call zero3(u,n1,n2,n3)
      endif
      return
      end

      subroutine give3( axis, dir, u, n1, n2, n3, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3, k, ierr
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id

      buff_id = 2 + dir
      buff_len = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( 2, i2, i3)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id ), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( n1-1, i2, i3)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id ), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( i1, 2, i3)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id ), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( i1, n2-1, i3)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id ), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( i1, i2, 2)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id ), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( i1, i2, n3-1)
               enddo
            enddo
            call mpi_send(
     >         buff(1, buff_id ), buff_len, dp_type,
     >         nbr( axis, dir, k ), msg_type(axis,dir),
     >         mpi_comm_world, ierr)
         endif
      endif
      return
      end

      subroutine ready( axis, dir, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, k
      integer buff_id, buff_len, i, ierr

      buff_id = 3 + dir
      buff_len = nm2
      do i=1,nm2
         buff(i,buff_id) = 0.0D0
      enddo
      msg_id(axis,dir,1) = msg_type(axis,dir) + 1000*me
      call mpi_irecv( buff(1,buff_id), buff_len,
     >   dp_type, nbr(axis,-dir,k), msg_type(axis,dir),
     >   mpi_comm_world, msg_id(axis,dir,1), ierr)
      return
      end
[Figure: performance results for NAS MG class C, 3D Jacobi, NAS LU class C, and NAS LU class B.]
Outline • Introduction of HTA • Dynamic partitioning • Overlapped tiling • Impact on programming productivity • Related work • Conclusions
Three metrics • Programming effort [Halstead, 1977] • Program volume V: a function of the number of operators and operands and their total number of occurrences • Potential volume V*: the volume of the most succinct form in a language that has defined or implemented the required operations • Program complexity [McCabe, 1976] • C = P + 1, where P is the number of decision points in the program • Source lines of code L (the standard formulas are given below)
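For reference, the standard formulations behind these metrics, which the slide does not spell out: with $\eta$ the number of distinct operators and operands and $N$ their total number of occurrences,

$$V = N \log_2 \eta, \qquad L = \frac{V^*}{V}, \qquad E = \frac{V}{L} = \frac{V^2}{V^*}$$

so a program volume $V$ close to the potential volume $V^*$ means less programming effort $E$.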
Related work • FLAME API [Bientinesi et al., 2005] • Ad-hoc notations • Sequoia [Fatahalian et al., 2006] • Principal construct: task • HPF [Hiranandani et al., 1992] and Co-Array Fortran [Numrich and Reid, 1998] • Tiles are used for distribution • Do not address different levels of the memory hierarchy • POOMA [Reynders, 1996] • Tiles and shadow regions are accessed as a whole • Global Arrays [Nieplocha et al., 2006] • SPMD programming model
Conclusion • HTA makes tiles part of a language • It provides a generalized framework to express tiles • It increases productivity • Less index calculation; fewer variables and loops; simpler function interfaces • Dynamic partitioning • Overlapped tiling • Little performance degradation
Code example: 2D Jacobi • Without overlapped tiling vs. with overlapped tiling (an illustrative sketch follows) • HTA takes care of the allocation of shadow regions and of data consistency • Clean indexing syntax
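The slide's two code versions appear as images in the original and are not reproduced here. As an illustration only, a sketch of the overlapped-tiling flavor in the notation of the earlier slides (Overlap, alloc, tile indexing T[...]); this is our reconstruction, simplified to one dimension, not the talk's actual code:

// Jacobi sweep over an HTA with a one-element shadow region per side.
// With overlapped tiling there is no explicit shadow allocation or copy:
// the accesses T[-1] and T[n] are kept consistent by the library.
Tuple::seq tiling = ((4), (4));
Overlap<1> ol((1), (1), zero);
A = HTA<double,1>::alloc(1, tiling, array, ROW, ol);
B = HTA<double,1>::alloc(1, tiling, array2, ROW, ol);

// For every tile T of A (e.g. via map), with owned region T[0:n-1]:
//   B's tile[i] = 0.5 * (T[i-1] + T[i+1])   for i = 0 .. n-1
// T[-1] and T[n] read the neighboring tiles' data through the shadow
// region; the without-overlap version would instead allocate the shadow
// arrays by hand and copy them explicitly before every sweep.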
Two types of stencil computation • Concurrent computation • Each tile can be executed independently • Examples: Jacobi, MG • Shadow regions are updated in every iteration • Wavefront computation • The execution sequence of the tiles follows a certain order • Examples: LU, SSOR • Shadow regions are updated only when first read (in the second iteration) • With the update-on-read policy, the minimal number of communications is achieved