
A Multi-platform Co-Array Fortran Compiler

Presentation Transcript


  1. A Multi-platform Co-Array Fortran Compiler Yuri Dotsenko Cristian Coarfa John Mellor-Crummey Department of Computer Science Rice University Houston, TX USA

  2. Motivation • Parallel programming models • MPI: de facto standard, but difficult to program • OpenMP: inefficient to map onto distributed-memory platforms; lacks locality control • HPF: hard to obtain high performance; heroic compilers needed! • Global address space languages (CAF, Titanium, UPC) are an appealing middle ground

  3. Co-Array Fortran • Global address space programming model • one-sided communication (GET/PUT) • Programmer has control over performance-critical factors • data distribution • computation partitioning • communication placement • Data movement and synchronization as language primitives • amenable to compiler-based communication optimization

  4. CAF Programming Model Features • SPMD process images • fixed number of images during execution • images operate asynchronously • Both private and shared data • real x(20, 20) a private 20x20 array in each image • real y(20, 20)[*] a shared 20x20 array in each image • Simple one-sided shared-memory communication • x(:,j:j+2) = y(:,p:p+2)[r] copy columns from image r into local columns • Synchronization intrinsic functions • sync_all – a barrier and a memory fence • sync_mem – a memory fence • sync_team([team members to notify], [team members to wait for]) • Pointers and (perhaps asymmetric) dynamic allocation
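
To make these features concrete, here is a minimal CAF sketch (not from the talk) combining a private array, a co-array, a one-sided GET from a neighboring image, and barrier synchronization; the program and variable names are illustrative:

     program caf_sketch
       real :: x(20,20)            ! private: each image has its own copy
       real :: y(20,20)[*]         ! co-array: remotely accessible from any image
       integer :: me, left
       me = this_image()
       left = me - 1
       if (left < 1) left = num_images()
       y = real(me)                ! initialize the local portion of the co-array
       call sync_all()             ! barrier + memory fence: all images have written y
       x(:,1:3) = y(:,1:3)[left]   ! one-sided GET of three columns from image "left"
       call sync_all()
     end program caf_sketch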

  5. One-sided Communication with Co-Arrays

     integer a(10,20)[*]
     if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

     [Figure: images 1, 2, …, N each hold their own a(10,20); every image except image 1 copies the last two columns of the previous image's co-array into its own first two columns]

  6. Rice Co-Array Fortran Compiler (cafc) • First CAF multi-platform compiler • previous compiler only for Cray shared memory systems • Implements core of the language • currently lacks support for derived type and dynamic co-arrays • Core sufficient for non-trivial codes • Performance comparable to that of hand-tuned MPI codes • Open source

  7. Outline • CAF programming model • cafc • Core language implementation • Optimizations • Experimental evaluation • Conclusions

  8. Implementation Strategy • Source-to-source compilation of CAF codes • uses Open64/SL Fortran 90 infrastructure • CAF → Fortran 90 + communication operations • Communication • ARMCI library for one-sided communication on clusters • load/store communication on shared-memory platforms • Goals • portability • high performance on a wide range of platforms

  9. Co-Array Descriptors • Initialize and manipulate Fortran 90 dope vectors

     real :: a(10,10,10)[*]

     type CAFDesc_real_3
       integer(ptrkind) :: handle     ! opaque handle to CAF runtime representation
       real, pointer    :: ptr(:,:,:) ! Fortran 90 pointer to local co-array data
     end type CAFDesc_real_3

     type(CAFDesc_real_3) :: a

  10. Allocating COMMON and SAVE Co-Arrays • Compiler • generates static initializer for each common/save variable • Linker • collects calls to all initializers • generates global initializer that calls all others • compiles global initializer and links into program • Launch • invokes global initializer before main program begins • allocates co-array storage outside Fortran 90 runtime system • associates co-array descriptors with allocated memory Similar to handling for C++ static constructors
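
A rough, hypothetical sketch of the per-variable initializer this scheme implies is shown below; the runtime entry points (caf_allocate, caf_associate) and the way the descriptor is shared are assumptions for illustration, not cafc's actual generated code:

     ! hypothetical initializer for "real, save :: c(100)[*]"; names are illustrative
     subroutine caf_init_c()
       ! rank-1 analogue of the CAFDesc_real_3 descriptor on slide 9 (assumed);
       ! in generated code it would be shared with every use of c
       type(CAFDesc_real_1), save :: c_desc
       ! obtain co-array storage from the CAF runtime, outside the Fortran 90
       ! runtime system, then associate the descriptor's pointer with it
       call caf_allocate(c_desc%handle, 100)          ! 100 reals (assumed API)
       call caf_associate(c_desc%ptr, c_desc%handle)  ! bind F90 pointer to storage (assumed API)
     end subroutine caf_init_c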

  11. Parameter Passing

     call f((a(I)[p]))

     • Call-by-value convention (copy-in, copy-out) • pass remote co-array data to procedures only as values
     • Call-by-co-array convention* • argument declared as a co-array by the callee • enables access to local and remote co-array data
     • Call-by-reference convention* (cafc) • argument declared as an explicit-shape array • enables access to local co-array data only • enables reuse of existing Fortran code

     Caller:
       real :: x(10)[*]
       call f(x)

     Call-by-co-array callee:
       subroutine f(a)
         real :: a(10)[*]

     Call-by-reference callee:
       subroutine f(a)
         real :: a(10)

     * requires an explicit interface

  12. Multiple Co-dimensions Managing processors as a logical multi-dimensional grid integer a(10,10)[5,4,*] 3D processor grid 5 x 4 x … • Support co-space reshaping at procedure calls • change number of co-dimensions • co-space bounds as procedure arguments
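
A hedged sketch of co-space reshaping at a call (not from the talk): the caller views the co-space as a 3-D grid, while the callee receives the co-space bounds as ordinary arguments and could equally view the same co-space with fewer co-dimensions; the names work, np, and nq are illustrative:

     integer :: a(10,10)[5,4,*]      ! caller: 5 x 4 x ... processor grid
     call work(a, 5, 4)              ! explicit interface required for the co-array dummy
     ...
     subroutine work(a, np, nq)
       integer :: np, nq
       integer :: a(10,10)[np,nq,*]  ! co-space bounds arrive as procedure arguments
       ! the callee could instead declare a(10,10)[*], changing the number of co-dimensions
     end subroutine work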

  13. Implementing Communication x(1:n) = a(1:n)[p] + … • Use a temporary buffer to hold off processor data • allocate buffer • perform GET to fill buffer • perform computation: x(1:n) = buffer(1:n) + … • deallocate buffer • Optimizations • no temporary storage for co-array to co-array copies • load/store communication on shared-memory systems
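
The generated pattern might look roughly like the sketch below; the temporary name tmp and the runtime call caf_get are assumptions used for illustration, not cafc's actual output:

     ! for the source statement  x(1:n) = a(1:n)[p] + ...
     real, allocatable :: tmp(:)
     allocate(tmp(n))                          ! temporary buffer for the off-image data
     call caf_get(tmp, a_desc%handle, p, 1, n) ! GET a(1:n) from image p (assumed API)
     x(1:n) = tmp(1:n) + ...                   ! compute with the local copy
     deallocate(tmp)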

  14. Synchronization • Original CAF specification: team synchronization only • sync_all, sync_team • limits performance on loosely-coupled architectures • Point-to-point extensions • sync_notify(q) • sync_wait(p) • Point-to-point synchronization semantics: when a notify from p is delivered to q, all communication from p to q issued before the notify has also been delivered to q (see the sketch below)
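
For example, a producer/consumer exchange written with the point-to-point extensions might look like this sketch (assuming image indices p and q, a co-array buf(n)[*], and local arrays data(n) and recvd(n)):

     ! on image p (producer):
     buf(1:n)[q] = data(1:n)   ! one-sided PUT into q's buffer
     call sync_notify(q)       ! notify q; delivery implies the PUT has been delivered to q

     ! on image q (consumer):
     call sync_wait(p)         ! wait for p's notify
     recvd(1:n) = buf(1:n)     ! safe to read the data p deposited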

  15. Outline • CAF programming model • cafc • Core language implementation • Optimizations • procedure splitting • supporting hints for non-blocking communication • packing strided communications • Experimental evaluation • Conclusions

  16. An Impediment to Code Efficiency • Original reference rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - … • Transformed reference rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - … • Fortran 90 pointer-based co-array representation does not convey • the lack of co-array aliasing • co-array contiguity • co-array bounds • Lack of knowledge inhibits important code optimizations

  17. Procedure Splitting (CAF-to-CAF preprocessing)

     Before splitting:
       subroutine f(…)
         real, save :: c(100)[*]
         ... = c(50) ...
       end subroutine f

     After splitting:
       subroutine f(…)
         real, save :: c(100)[*]
         interface
           subroutine f_inner(…, c_arg)
             real :: c_arg[*]
           end subroutine f_inner
         end interface
         call f_inner(…, c)
       end subroutine f

       subroutine f_inner(…, c_arg)
         real :: c_arg(100)[*]
         ... = c_arg(50) ...
       end subroutine f_inner

  18. Benefits of Procedure Splitting • Generated code conveys • lack of co-array aliasing • co-array contiguity • co-array bounds • Enables back-end compiler to generate better code

  19. Hiding Communication Latency Goal: enable communication/computation overlap • Impediments to generating non-blocking communication • use of indexed subscripts in co-dimensions • lack of whole program analysis • Approach: support hints for non-blocking communication • overcome conservative compiler analysis • enable sophisticated programmers to achieve good performance today

  20. Hints for Non-blocking PUTs • Hints for the CAF run-time system to issue non-blocking PUTs

     region_id = open_nb_put_region()
     ...
     Put_Stmt_1
     ...
     Put_Stmt_N
     ...
     call close_nb_put_region(region_id)

     • Complete non-blocking PUTs: call complete_nb_put_region(region_id)
     • Open problem: exploiting non-blocking GETs?

  21. Strided vs. Contiguous Transfers • Problem CAF remote reference might induce many small data transfers a(i,1:n)[p] = b(j,1:n) • Solution pack strided data on source and unpack it on destination
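
A sketch of what source-level packing for the PUT above could look like (illustrative only; the contiguous buffers buf_src(n) and buf_dst(n)[*] and the image index src are assumptions):

     ! on the source image: pack the strided row, then issue one contiguous PUT
     buf_src(1:n)    = b(j,1:n)
     buf_dst(1:n)[p] = buf_src(1:n)
     call sync_notify(p)

     ! on image p: wait for the data, then unpack into the strided destination
     call sync_wait(src)
     a(i,1:n) = buf_dst(1:n)

The need for matching unpack code and synchronization on the destination image is what makes packing awkward for the CAF programmer, as the next slide notes.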

  22. Pragmatics of Packing Who should implement packing? • The CAF programmer • difficult to program • The CAF compiler • unpacking requires conversion of PUTs into two-sided communication (a difficult whole-program transformation) • The communication library • most natural place • ARMCI currently performs packing on Myrinet

  23. CAF Compiler Targets (Sept 2004) • Processors • Pentium, Alpha, Itanium2, MIPS • Interconnects • Quadrics, Myrinet, Gigabit Ethernet, shared memory • Operating systems • Linux, Tru64, IRIX

  24. Outline • CAF programming model • cafc • Core language implementation • Optimizations • Experimental evaluation • Conclusions

  25. Experimental Evaluation • Platforms • Alpha+Quadrics QSNet (Elan3) • Itanium2+Quadrics QSNet II (Elan4) • Itanium2+Myrinet 2000 • Codes • NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

  26. NAS BT Efficiency (Class C)

  27. NAS SP Efficiency (Class C) • Lack of a non-blocking notify implementation blocks CAF communication/computation overlap

  28. NAS MG Efficiency (Class C) • ARMCI communication is efficient • point-to-point synchronization boosts CAF performance by 30%

  29. NAS CG Efficiency (Class C)

  30. NAS LU Efficiency (Class C)

  31. Impact of Optimizations Assorted Results • Procedure splitting • 42-60% improvement for BT on Itanium2+Myrinet cluster • 15-33% improvement for LU on Alpha+Quadrics • Non-blocking communication generation • 5% improvement for BT on Itanium2+Quadrics cluster • 3% improvement for MG on all platforms • Packing of strided data • 31% improvement for BT on Alpha+Quadrics cluster • 37% improvement for LU on Itanium2+Quadrics cluster See paper for more details

  32. Conclusions • CAF boosts programming productivity • simplifies the development of SPMD parallel programs • shifts details of managing communication to compiler • cafc delivers performance comparable to hand-tuned MPI • cafc implements effective optimizations • procedure splitting • non-blocking communication • packing of strided communication (in ARMCI) • Vectorization needed to achieve true performance portability with machines like Cray X1 http://www.hipersoft.rice.edu/caf
