
Programming with Tiles



  1. Programming with Tiles Jia Guo, Ganesh Bikshandi*, Basilio B. Fraguela+, María J. Garzarán, David Padua University of Illinois at Urbana-Champaign *IBM, India +Universidade da Coruña, Spain

  2. Motivation • The importance of tiles • A natural way to express many algorithms • Partitioning data is an effective way to enhance locality • Pervasive in parallel computing • Limitations of today's programming languages • Lack of constructs to express tiles directly • Complicated indexing and loop structures to traverse an array by tiles • No support for storage by tile • Design and detailed implementation are mixed together

  3. Contributions • Our group developed Hierarchically Tiled Arrays (HTAs) to support tiles that enhance locality and parallelism (PPoPP 06) • Designed new language constructs based on our experience • Dynamic partitioning • Overlapped tiling • Evaluated both productivity and performance

  4. Outline • Introduction of HTA • Dynamic partitioning • Overlapped tiling • Impact on programming productivity • Related work • Conclusions

  5. HTA overview [Figure: tiled matrices A and B. Tile-level indexing A(0,0) selects a tile; hybrid indexing A(0,1)[0,1] selects an element within a tile; tile-level operations such as A + B and A(0, 0:1) = B(1, 0:1) operate on whole ranges of tiles.]
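  The two-level indexing in the figure can be mimicked with plain nested arrays. The sketch below is our illustration in standard C++, not the HTA library itself; it only shows what an access like A(0,1)[0,1] (tile (0,1), element (0,1)) means for a 2x2 arrangement of 2x2 tiles.

      // Plain C++ illustration (not the HTA library): a 2x2 array of 2x2 tiles.
      #include <array>
      #include <cstdio>

      using Tile  = std::array<std::array<double, 2>, 2>;
      using Tiled = std::array<std::array<Tile, 2>, 2>;

      int main() {
        Tiled A{};
        A[0][1][0][1] = 3.0;        // HTA-style A(0,1)[0,1]: tile (0,1), element (0,1)
        Tile& t = A[0][0];          // HTA-style A(0,0): select a whole tile
        t[1][1] = 7.0;
        std::printf("%.1f %.1f\n", A[0][1][0][1], A[0][0][1][1]);
        return 0;
      }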

  6. HTA library implementation • In MATLAB • In C++ • Supports sequential and parallel execution on top of MPI and TBB • Supports linear and non-linear data layouts • Execution model: SPMD • Programming model: single threaded

  7. Outline • Introduction of HTA • Dynamic partitioning • Overlapped tiling • Impact on programming productivity • Related work • Conclusions

  8. Why dynamic partitioning? • Some linear algebra algorithms • Add and remove partitions to sweep through matrices. • Cache oblivious algorithms • Create the partition following a divide and conquer strategy

  9. Syntax and semantics • Partition lines and partitions • Two methods:

      void part( Tuple sourcePartition, Tuple offset );
      void rmPart( Tuple partition );

  NONE denotes a dimension that is not partitioned. [Figure: the partition lines of A, with "fixed" lines numbered along each dimension, before and after A.part((1,1),(2,2)), which adds a partition line at offset (2,2) from partition (1,1), and A.rmPart((1,1)), which removes it.]
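  To make the semantics concrete, the sketch below models a single dimension's tiling as a set of partition-line offsets, which is all that part and rmPart manipulate. This is our plain C++ illustration of the model, not the library's implementation; the offsets are made up.

      // Plain C++ sketch: a dimension's tiling as a sorted set of partition lines.
      #include <cstdio>
      #include <set>

      int main() {
        std::set<int> lines = {0, 4, 8};    // two tiles: [0,4) and [4,8)
        // part: add a new line at offset 2 from the line starting partition 1
        lines.insert(4 + 2);                // tiles become [0,4), [4,6), [6,8)
        // rmPart: remove that partition line again
        lines.erase(6);
        for (int l : lines) std::printf("%d ", l);   // prints: 0 4 8
        std::printf("\n");
        return 0;
      }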

  10. LU algorithm with dynamic partitioning [Figure: at the beginning of each iteration, A.part((1,1),(nb,nb)) repartitions the matrix into blocks A11, A12, A21, A22, with the finished ("done") blocks and the partially updated A22 region marked; after the update, A.rmPart((1,1)) removes the partition at the end of the iteration.]

  11. LU algorithm represented in FLAME [van de Geijn et al.] [Figure: the FLAME algorithm, with UPD marking the blocks updated in each loop iteration.]

  12. From algorithm to HTA program The FLAME algorithm and the HTA implementation:

      void lu(HTA<double,2> A, HTA<int,1> p, int nb) {
        A.part((0,0), (0,0));
        p.part((0), (0));
        while (A(0,0).lsize(1) < A.lsize(1)) {
          int b = min(A(1,1).lsize(0), nb);
          A.part((1,1), (b,b));
          p.part((1), (b));
          dgetf2(A(1:2,1), p(1));
          dlaswp(A(1:2,0), p(1));
          dlaswp(A(1:2,2), p(1));
          trsm(HtaRight, HtaUpper, HtaNoTrans, HtaUnit, One, A(1,1), A(1,2));
          gemm(HtaNoTrans, HtaNoTrans, MinusOne, A(2,1), A(1,2), One, A(2,2));
          A.rmPart((1,1));
          p.rmPart((1));
        }
      }

  13. FLAME API vs. HTA A.part((0,0), (0,0)); A.part((1,1), (b,b)); A.rmPart((1,1)); A(1:2, 1); HTA advantages: 1) general, 2) fewer variables, 3) simple indexing, 4) flexible range selection

  14. A cache oblivious algorithm: parallel merge. With in1 = [3 4 11 14 22 22 30 32 41] and in2 = [2 4 10 15 18 23 23 28 40]: 1. Take the middle element of in1. 2. Find its lower bound in in2. 3. Calculate the corresponding partition point in out. 4. Partition all three arrays (logically). 5. Merge the resulting pairs of tiles in parallel.
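  A minimal, self-contained version of these five steps in standard C++ (our sketch, not the deck's PMerge operator) is shown below; the recursion is written sequentially, but the two recursive calls are independent, which is what HTA or TBB would run in parallel.

      // Divide-and-conquer merge: split on the middle of the larger input (step 1),
      // locate the split in the other input with a lower bound (step 2), derive the
      // output partition (steps 3-4), and merge the halves independently (step 5).
      #include <algorithm>
      #include <cstdio>

      void pmerge(const int* a, int na, const int* b, int nb, int* out) {
        if (na < nb) { std::swap(a, b); std::swap(na, nb); }  // keep a the larger input
        if (na == 0) return;                                  // both inputs empty
        int ma = na / 2;                                      // middle element of a
        int mb = std::lower_bound(b, b + nb, a[ma]) - b;      // lower bound in b
        out[ma + mb] = a[ma];                                 // its final position
        pmerge(a, ma, b, mb, out);                            // merge left halves
        pmerge(a + ma + 1, na - ma - 1, b + mb, nb - mb, out + ma + mb + 1);
      }

      int main() {
        int in1[] = {3, 4, 11, 14, 22, 22, 30, 32, 41};       // the slide's inputs
        int in2[] = {2, 4, 10, 15, 18, 23, 23, 28, 40};
        int out[18];
        pmerge(in1, 9, in2, 9, out);
        for (int v : out) std::printf("%d ", v);
        std::printf("\n");
        return 0;
      }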

  15. HTA vs. Threading Building Blocks (TBB) HTA code: map(PMerge(), out, in1, in2); TBB code: parallel_for(PMergeRange(begin1, end1, begin2, end2, out), PMergeBody()); • HTA: partition tiles and operate on tiles in parallel • TBB: prepare the merge range and operate on smaller ranges in parallel

  16. Evaluation

  17. [Performance charts: Recursive MMM, Recursive LU, Dynamic LU, Sylvester]

  18. Parallel merge

  19. Outline • Introduction of HTA • Dynamic partitioning • Overlapped tiling • Impact on programming productivity • Related work • Conclusions

  20. Motivation • Programs with computations based on neighboring points • E.g. iterative PDE solvers • Benefit from tiling • Shadow regions • Problems with the current approach • Explicit allocation and update of shadow regions • Example: NAS MG (lines-of-code comparison)

  21. Overlapped tiling • Objective • Automate the allocation and update of shadow regions • Allow access to neighbors across neighboring tiles • Allow the programmer to specify the overlapped region at creation time:

      Overlap<DIM> ol = Overlap<DIM>(Tuple negativeDir, Tuple positiveDir,
                                     BoundaryMode mode, bool autoUpdate=true);

  22. Example of overlapped tiling [Figure: a tiled array with the shadow region of each tile; boundary and overlap regions marked.]

      Tuple::seq tiling = ((4), (3));
      Overlap<DIM> ol((1), (1), zero);
      A = HTA<double,1>::alloc(1, tiling, array, ROW, ol);

  23. Indexing and operations • Indexing: T[0:3] = T[ALL] selects the owned region; T[-1] and T[4] address shadow elements beyond it • Operations (+, -, map, =, etc.) • The conformability rules only apply to owned regions • Enable operations with non-overlapped HTAs
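  The sketch below, plain C++ rather than HTA, shows what the indexing above provides: each tile is padded with shadow cells, so T[-1] and T[4] are valid reads while the owned region stays T[0:3]. The sizes and values are made up for illustration.

      // Plain C++ model of a tile with one shadow cell per side.
      #include <cstdio>
      #include <vector>

      struct Tile {
        std::vector<double> buf;                  // 1 shadow + n owned + 1 shadow
        explicit Tile(int n) : buf(n + 2, 0.0) {}
        double& at(int i) { return buf[i + 1]; }  // at(-1)/at(n) reach the shadows
      };

      int main() {
        Tile t(4);                                // owned region: t.at(0) .. t.at(3)
        for (int i = 0; i < 4; ++i) t.at(i) = i + 1.0;
        t.at(-1) = 10.0;                          // filled from the left neighbor
        t.at(4)  = 20.0;                          // filled from the right neighbor
        for (int i = 0; i < 4; ++i)               // stencil reads cross tile borders
          std::printf("%.1f\n", 0.5 * (t.at(i - 1) + t.at(i + 1)));
        return 0;
      }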

  24. Shadow region consistency • Ensure that shadow regions are properly updated and consistent with the corresponding data in the owned tiles • Use an update-on-read policy • Keep track of the status of each tile • SPMD mode: no communication is needed • Allow manual and automatic updates
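  As an illustration of the bookkeeping (our sketch; the library's actual mechanism is not shown on the slide), update-on-read amounts to a stale bit per tile: writes to a tile mark its neighbors' shadow copies stale, and a later read of a shadow region refreshes it only if that bit is set.

      // Plain C++ sketch of update-on-read bookkeeping for a 1D row of tiles.
      #include <vector>

      struct Tile {
        std::vector<double> owned;
        double shadow_left = 0.0, shadow_right = 0.0;
        bool stale = true;                       // shadow copies need refreshing
      };

      // Called after a tile's owned region is written.
      void mark_neighbors_stale(std::vector<Tile>& ts, int i) {
        if (i > 0) ts[i - 1].stale = true;
        if (i + 1 < (int)ts.size()) ts[i + 1].stale = true;
      }

      // Called before a tile's shadow region is read; copies only if needed.
      void ensure_fresh(std::vector<Tile>& ts, int i) {
        if (!ts[i].stale) return;
        if (i > 0) ts[i].shadow_left = ts[i - 1].owned.back();
        if (i + 1 < (int)ts.size()) ts[i].shadow_right = ts[i + 1].owned.front();
        ts[i].stale = false;
      }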

  25. Evaluation A cluster of 128 nodes, each with two 2 GHz G5 processors and 4 GB of RAM, connected by Myrinet; we used one processor per node in our experiments. The NAS code (Fortran + MPI) was compiled with g77 and the HTA code with g++ 3.3; the -O3 flag was used in both cases.

  26. MG comm3 in HTA. With overlapped tiling, the whole routine reduces to declaring the overlap:

      Overlap<3> * ol = new Overlap<3>(T<3>(1,1,1), T<3>(1,1,1), PERIODIC);

  Without overlapped tiling:

      void comm3 (Grid& u) {
        int NX = u.shape().size()[0] - 1;
        int NY = u.shape().size()[1] - 1;
        int NZ = u.shape().size()[2] - 1;
        int nx = u(T(0,0,0)).shape().size()[0] - 1;
        int ny = u(T(0,0,0)).shape().size()[1] - 1;
        int nz = u(T(0,0,0)).shape().size()[2] - 1;
        //north-south
        Traits::Default::async();
        if (NX > 0)
          u((R(0, NX-1), R(0, NY), R(0, NZ)))((R(nx, nx), R(1, ny-1), R(1, nz-1))) =
            u((R(1, NX), R(0, NY), R(0, NZ)))((R(1,1), R(1, ny-1), R(1,nz-1)));
        u((R(NX, NX), R(0, NY), R(0, NZ)))((R(nx, nx), R(1, ny-1), R(1, nz-1))) =
          u((R(0,0), R(0, NY), R(0, NZ)))((R(1,1), R(1, ny-1), R(1,nz-1)));
        if (NX > 0)
          u((R(1, NX), R(0, NY), R(0, NZ)))((R(0,0), R(1, ny-1), R(1, nz-1))) =
            u((R(0, NX-1), R(0, NY), R(0, NZ)))((R(nx-1,nx-1), R(1, ny-1), R(1,nz-1)));
        u((R(0, 0), R(0, NY), R(0, NZ)))((R(0,0), R(1, ny-1), R(1, nz-1))) =
          u((R(NX, NX), R(0, NY), R(0, NZ)))((R(nx-1,nx-1), R(1, ny-1), R(1,nz-1)));
        Traits::Default::sync();
        Traits::Default::async();
        //east-west
        if (NY > 0)
          u((R(0, NX), R(0, NY-1), R(0, NZ)))((R(0, nx), R(ny, ny), R(1, nz-1))) =
            u((R(0, NX), R(1, NY), R(0, NZ)))((R(0, nx), R(1, 1), R(1, nz-1)));
        u((R(0, NX), R(NY, NY), R(0, NZ)))((R(0, nx), R(ny, ny), R(1, nz-1))) =
          u((R(0, NX), R(0,0), R(0, NZ)))((R(0,nx), R(1, 1), R(1, nz-1)));
        if (NY > 0)
          u((R(0, NX), R(1, NY), R(0, NZ)))((R(0, nx), R(0, 0), R(1, nz-1))) =
            u((R(0, NX), R(0, NY-1), R(0, NZ)))((R(0, nx), R(ny-1, ny-1), R(1, nz-1)));
        u((R(0, NX), R(0, 0), R(0, NZ)))((R(0, nx), R(0, 0), R(1, nz-1))) =
          u((R(0, NX), R(NY, NY), R(0, NZ)))((R(0,nx), R(ny-1, ny-1), R(1, nz-1)));
        Traits::Default::sync();
        Traits::Default::async();
        //front-back
        if (NZ > 0)
          u((R(0, NX), R(0, NY), R(0, NZ-1)))((R(0, nx), R(0, ny), R(nz, nz))) =
            u((R(0, NX), R(0, NY), R(1, NZ)))((R(0, nx), R(0, ny), R(1, 1)));
        u((R(0, NX), R(0, NY), R(NZ, NZ)))((R(0, nx), R(0, ny), R(nz, nz))) =
          u((R(0, NX), R(0, NY), R(0, 0)))((R(0, nx), R(0, ny), R(1, 1)));
        if (NZ > 0)
          u((R(0, NX), R(0, NY), R(1, NZ)))((R(0, nx), R(0, ny), R(0, 0))) =
            u((R(0, NX), R(0, NY), R(0, NZ-1)))((R(0, nx), R(0, ny), R(nz-1, nz-1)));
        u((R(0, NX), R(0, NY), R(0, 0)))((R(0, nx), R(0, ny), R(0, 0))) =
          u((R(0, NX), R(0, NY), R(NZ, NZ)))((R(0, nx), R(0, ny), R(nz-1, nz-1)));
        Traits::Default::sync();
      }

  27. MG comm3 in NAS (Fortran + MPI)

      subroutine comm3(u,n1,n2,n3,kk)
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer n1, n2, n3, kk
      double precision u(n1,n2,n3)
      integer axis
      if( .not. dead(kk) )then
         do axis = 1, 3
            if( nprocs .ne. 1) then
               call ready( axis, -1, kk )
               call ready( axis, +1, kk )
               call give3( axis, +1, u, n1, n2, n3, kk )
               call give3( axis, -1, u, n1, n2, n3, kk )
               call take3( axis, -1, u, n1, n2, n3 )
               call take3( axis, +1, u, n1, n2, n3 )
            else
               call comm1p( axis, u, n1, n2, n3, kk )
            endif
         enddo
      else
         call zero3(u,n1,n2,n3)
      endif
      return
      end

      subroutine give3( axis, dir, u, n1, n2, n3, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3, k, ierr
      double precision u( n1, n2, n3 )
      integer i3, i2, i1, buff_len, buff_id
      buff_id = 2 + dir
      buff_len = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len,buff_id ) = u( 2, i2,i3)
               enddo
            enddo
            call mpi_send(
     >           buff(1, buff_id ), buff_len,dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( n1-1, i2,i3)
               enddo
            enddo
            call mpi_send(
     >           buff(1, buff_id ), buff_len,dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( i1, 2,i3)
               enddo
            enddo
            call mpi_send(
     >           buff(1, buff_id ), buff_len,dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id )= u( i1,n2-1,i3)
               enddo
            enddo
            call mpi_send(
     >           buff(1, buff_id ), buff_len,dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( i1,i2,2)
               enddo
            enddo
            call mpi_send(
     >           buff(1, buff_id ), buff_len,dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  buff_len = buff_len + 1
                  buff(buff_len, buff_id ) = u( i1,i2,n3-1)
               enddo
            enddo
            call mpi_send(
     >           buff(1, buff_id ), buff_len,dp_type,
     >           nbr( axis, dir, k ), msg_type(axis,dir),
     >           mpi_comm_world, ierr)
         endif
      endif
      return
      end

      subroutine take3( axis, dir, u, n1, n2, n3 )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, n1, n2, n3
      double precision u( n1, n2, n3 )
      integer buff_id, indx
      integer status(mpi_status_size), ierr
      integer i3, i2, i1
      call mpi_wait( msg_id( axis, dir, 1 ), status, ierr )
      buff_id = 3 + dir
      indx = 0
      if( axis .eq. 1 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(n1,i2,i3) = buff(indx, buff_id )
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i2=2,n2-1
                  indx = indx + 1
                  u(1,i2,i3) = buff(indx, buff_id )
               enddo
            enddo
         endif
      endif
      if( axis .eq. 2 )then
         if( dir .eq. -1 )then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,n2,i3) = buff(indx, buff_id )
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i3=2,n3-1
               do i1=1,n1
                  indx = indx + 1
                  u(i1,1,i3) = buff(indx, buff_id )
               enddo
            enddo
         endif
      endif
      if( axis .eq. 3 )then
         if( dir .eq. -1 )then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,n3) = buff(indx, buff_id )
               enddo
            enddo
         else if( dir .eq. +1 ) then
            do i2=1,n2
               do i1=1,n1
                  indx = indx + 1
                  u(i1,i2,1) = buff(indx, buff_id )
               enddo
            enddo
         endif
      endif
      return
      end

      subroutine ready( axis, dir, k )
      implicit none
      include 'mpinpb.h'
      include 'globals.h'
      integer axis, dir, k
      integer buff_id, buff_len, i, ierr
      buff_id = 3 + dir
      buff_len = nm2
      do i=1,nm2
         buff(i,buff_id) = 0.0D0
      enddo
      msg_id(axis,dir,1) = msg_type(axis,dir) + 1000*me
      call mpi_irecv( buff(1,buff_id), buff_len,
     >     dp_type, nbr(axis,-dir,k), msg_type(axis,dir),
     >     mpi_comm_world, msg_id(axis,dir,1), ierr)
      return
      end

  28. [Performance charts: NAS MG class C, 3D Jacobi, NAS LU class C, NAS LU class B]

  29. Outline • Introduction of HTA • Dynamic partitioning • Overlapped tiling • Impact on programming productivity • Related work • Conclusions

  30. Three metrics • Programming effort [Halstead, 1977] • Program volume V • A function of the number of distinct operators and operands and their total number of occurrences • Potential volume V* • The volume of the most succinct form in a language that has defined or implemented the required operations • Program complexity [McCabe, 1976] • C = P + 1, where P is the number of decision points in the program • Source lines of code L
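  For reference, the standard definitions behind the Halstead metrics (from Halstead's 1977 book; the slide only paraphrases them): with η1 distinct operators, η2 distinct operands, and N1, N2 their total occurrences,

      vocabulary: η = η1 + η2
      length:     N = N1 + N2
      volume:     V = N log2(η)
      effort:     E = V^2 / V*   (i.e., V / L with program level L = V*/V)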

  31. Evaluation

  32. Related work • FLAME API [Bientinesi et al., 2005] • Ad-hoc notations • Sequoia [Fatahalian et al., 2006] • Principal construct: task • HPF [Hiranandani et al., 1992] and Co-Array Fortran [Numrich and Reid, 1998] • Tiles are used for distribution • Do not address different levels of the memory hierarchy • POOMA [Reynders, 1996] • Tiles and shadow regions are accessed as a whole • Global Arrays [Nieplocha et al., 2006] • SPMD programming model

  33. Conclusion • HTA makes tiles part of a language • It provides a generalized framework to express tiles • It increases productivity • Less index calculation; fewer variables and loops; simpler function interfaces • Dynamic partitioning • Overlapped tiling • Little performance degradation

  34. Thank you!

  35. Code example: 2D Jacobi [Code panels: without overlapped tiling vs. with overlapped tiling] • HTA takes care of the allocation of shadow regions and of data consistency • Clean indexing syntax
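  The slide's code panels are images and are not recoverable from this transcript. As a stand-in, the plain C++ sketch below (not the HTA code from the slide) runs 2D Jacobi over two column tiles, each padded with one shadow column; the two marked assignments are exactly the shadow-region maintenance that overlapped tiling performs automatically.

      // Plain C++: 2D Jacobi on an N x N grid split into two N x (N/2) tiles.
      #include <cstdio>

      const int N = 6, W = N / 2;
      double tile[2][N][W + 2];                  // columns 0 and W+1 are shadows

      void sweep(double t[N][W + 2]) {           // update the owned interior cells
        double nu[N][W + 2] = {};
        for (int i = 1; i < N - 1; ++i)
          for (int j = 1; j <= W; ++j)
            nu[i][j] = 0.25 * (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]);
        for (int i = 1; i < N - 1; ++i)
          for (int j = 1; j <= W; ++j) t[i][j] = nu[i][j];
      }

      int main() {
        for (int i = 0; i < N; ++i) tile[0][i][0] = 100.0;   // fixed hot west edge
        for (int step = 0; step < 10; ++step) {
          for (int i = 0; i < N; ++i) {
            tile[0][i][W + 1] = tile[1][i][1];   // shadow <- neighbor's first column
            tile[1][i][0]     = tile[0][i][W];   // shadow <- neighbor's last column
          }
          sweep(tile[0]);                        // tiles then update independently
          sweep(tile[1]);
        }
        std::printf("%.3f\n", tile[1][N/2][1]);
        return 0;
      }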

  36. Two types of stencil computation • Concurrent computation • Each tile can be executed independently • Examples: Jacobi, MG • Wavefront computation • The execution sequence of tiles follows a certain order • Examples: LU, SSOR • With the update-on-read policy, a minimal number of communications is achieved [Figure: computation and shadow-region updates; one pattern updates shadow regions in every iteration, the other only in the second iteration]
