A Multi-platform Co-Array Fortran Compiler
Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey
Department of Computer Science, Rice University, Houston, TX, USA
Motivation
Parallel Programming Models
• MPI: de facto standard
  • difficult to program
• OpenMP: inefficient to map onto distributed-memory platforms
  • lack of locality control
• HPF: hard to obtain high performance
  • heroic compilers needed!
Global address space languages (CAF, Titanium, UPC): an appealing middle ground
Co-Array Fortran
• Global address space programming model
  • one-sided communication (GET/PUT)
• Programmer has control over performance-critical factors
  • data distribution
  • computation partitioning
  • communication placement
• Data movement and synchronization as language primitives
  • amenable to compiler-based communication optimization
CAF Programming Model Features
• SPMD process images
  • fixed number of images during execution
  • images operate asynchronously
• Both private and shared data
  • real x(20, 20)       a private 20x20 array in each image
  • real y(20, 20)[*]    a shared 20x20 array in each image
• Simple one-sided shared-memory communication
  • x(:,j:j+2) = y(:,p:p+2)[r]    copy columns from image r into local columns
• Synchronization intrinsic functions
  • sync_all – a barrier and a memory fence
  • sync_mem – a memory fence
  • sync_team([team members to notify], [team members to wait for])
• Pointers and (perhaps asymmetric) dynamic allocation
One-sided Communication with Co-Arrays
    integer a(10,20)[*]
    if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
[Figure: the array a(10,20) on image 1, image 2, ..., image N]
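The shift on this slide can be mimicked in ordinary Python to see what each image ends up holding. This is only a sequential sketch, not CAF: the `images` list stands in for the co-dimension, Python indices are 0-based, and `make_image` and `shift_from_left` are hypothetical helper names.

```python
# Sequential stand-in for:
#   integer a(10,20)[*]
#   if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

ROWS, COLS = 10, 20


def make_image(image_id):
    """Local 10x20 array whose entries encode (image, column) as image*100 + column."""
    return [[image_id * 100 + col for col in range(1, COLS + 1)]
            for _ in range(ROWS)]


def shift_from_left(images):
    """Every image except the first GETs columns 19:20 of its left
    neighbour into its own columns 1:2."""
    for me in range(1, len(images)):      # index 0 is image 1, which does nothing
        left = images[me - 1]
        for row in range(ROWS):
            # Columns read (19:20) and written (1:2) are disjoint, so the
            # order in which images perform the copy does not matter.
            images[me][row][0:2] = left[row][18:20]
    return images


images = shift_from_left([make_image(i) for i in range(1, 4)])
print(images[0][0][:2], images[1][0][:2], images[2][0][:2])
```

Image 1 keeps its own first two columns, while images 2 and 3 now hold the last two columns of their left neighbours.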
Rice Co-Array Fortran Compiler (cafc)
• First multi-platform CAF compiler
  • previous compiler only for Cray shared-memory systems
• Implements core of the language
  • currently lacks support for derived-type and dynamic co-arrays
• Core sufficient for non-trivial codes
• Performance comparable to that of hand-tuned MPI codes
• Open source
Outline
• CAF programming model
• cafc
  • Core language implementation
  • Optimizations
• Experimental evaluation
• Conclusions
Implementation Strategy
• Source-to-source compilation of CAF codes
  • uses Open64/SL Fortran 90 infrastructure
  • CAF → Fortran 90 + communication operations
• Communication
  • ARMCI library for one-sided communication on clusters
  • load/store communication on shared-memory platforms
Goals
• portability
• high performance on a wide range of platforms
Co-Array Descriptors
• Initialize and manipulate Fortran 90 dope vectors
    real :: a(10,10,10)[*]
  is translated to
    type CAFDesc_real_3
      integer(ptrkind) :: handle    ! Opaque handle to CAF runtime representation
      real, pointer :: ptr(:,:,:)   ! Fortran 90 pointer to local co-array data
    end type CAFDesc_real_3
    type(CAFDesc_real_3) :: a
Allocating COMMON and SAVE Co-Arrays
• Compiler
  • generates static initializer for each common/save variable
• Linker
  • collects calls to all initializers
  • generates global initializer that calls all others
  • compiles global initializer and links into program
• Launch
  • invokes global initializer before main program begins
  • allocates co-array storage outside Fortran 90 runtime system
  • associates co-array descriptors with allocated memory
Similar to handling for C++ static constructors
Parameter Passing
• Call-by-value convention (copy-in, copy-out)
    call f((a(I)[p]))
  • pass remote co-array data to procedures only as values
• Call-by-co-array convention*
  • argument declared as a co-array by callee
  • enables access to local and remote co-array data
    real :: x(10)[*]
    call f(x)
    subroutine f(a)
      real :: a(10)[*]
• Call-by-reference convention* (cafc)
  • argument declared as an explicit-shape array
  • enables access to local co-array data only
  • enables reuse of existing Fortran code
    subroutine f(a)
      real :: a(10)
* requires an explicit interface
Multiple Co-dimensions
Managing processors as a logical multi-dimensional grid
    integer a(10,10)[5,4,*]      3D processor grid 5 x 4 x …
• Support co-space reshaping at procedure calls
  • change number of co-dimensions
  • co-space bounds as procedure arguments
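How an image index lands on the 5 x 4 x … grid can be worked out with a little arithmetic. The sketch below assumes the usual Fortran column-major ordering with lower co-bounds of 1, so that image = 1 + (i1-1) + 5*(i2-1) + 20*(i3-1); the function names are hypothetical.

```python
def image_to_cosubscripts(image, coextents=(5, 4)):
    """Map a 1-based image index to 1-based co-subscripts.
    `coextents` lists every co-dimension except the final '*'."""
    rest = image - 1
    subs = []
    for extent in coextents:
        subs.append(rest % extent + 1)
        rest //= extent
    subs.append(rest + 1)         # the '*' co-dimension absorbs what is left
    return tuple(subs)


def cosubscripts_to_image(subs, coextents=(5, 4)):
    """Inverse mapping: co-subscripts back to the 1-based image index."""
    image, stride = 1, 1
    for sub, extent in zip(subs, coextents):
        image += (sub - 1) * stride
        stride *= extent
    image += (subs[-1] - 1) * stride
    return image


for image in (1, 6, 20, 21):
    print(image, image_to_cosubscripts(image))
```

With 40 images, for example, the third co-dimension has extent 2 and image 21 is the first image of the second 5 x 4 plane.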
Implementing Communication
    x(1:n) = a(1:n)[p] + …
• Use a temporary buffer to hold off-processor data
  • allocate buffer
  • perform GET to fill buffer
  • perform computation: x(1:n) = buffer(1:n) + …
  • deallocate buffer
• Optimizations
  • no temporary storage for co-array to co-array copies
  • load/store communication on shared-memory systems
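The four buffer steps above can be sketched as follows. This is a sequential Python stand-in, not the code cafc generates: `remote_get` is a hypothetical placeholder for a one-sided GET, and `local_term` is an assumed stand-in for the elided "+ …" part of the statement.

```python
def remote_get(images, p, lo, hi):
    """Stand-in for a one-sided GET of a(lo:hi)[p] (1-based, inclusive bounds)."""
    return images[p - 1][lo - 1:hi]


def translated_assignment(images, x, p, n, local_term):
    """x(1:n) = a(1:n)[p] + local_term(1:n), following the four steps."""
    buffer = [0.0] * n                        # 1. allocate buffer
    buffer[:] = remote_get(images, p, 1, n)   # 2. perform GET to fill buffer
    for i in range(n):                        # 3. compute from the buffer
        x[i] = buffer[i] + local_term[i]
    del buffer                                # 4. deallocate buffer
    return x


# Two images, each holding a 3-element co-array a.
images = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
x = translated_assignment(images, [0.0] * 3, p=2, n=3, local_term=[0.5, 0.5, 0.5])
print(x)
```

For a plain co-array to co-array copy, steps 1 and 4 would be unnecessary because the GET can write straight into the destination, which is the first optimization listed on the slide.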
Synchronization
• Original CAF specification: team synchronization only
  • sync_all, sync_team
  • limits performance on loosely-coupled architectures
• Point-to-point extensions
  • sync_notify(q)
  • sync_wait(p)
Point-to-point synchronization semantics
  Delivery of a notify to q from p ⇒ all communication from p to q issued before the notify has been delivered to q
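The notify/wait semantics can be illustrated with two threads playing images p and q. This is a sketch under loud assumptions: images are numbered from 0, a `threading.Semaphore` per sender stands in for the notify counter, and the "PUT" is an ordinary list assignment that completes before the notify is issued. None of this reflects how the cafc runtime implements the primitives.

```python
import threading


class Image:
    def __init__(self, n_images):
        self.data = [0] * 4
        # One counting semaphore per potential sender p.
        self.notifies = [threading.Semaphore(0) for _ in range(n_images)]


def sync_notify(images, p, q):
    """Executed by image p: tell q that p's earlier PUTs to q are complete."""
    images[q].notifies[p].release()


def sync_wait(images, q, p):
    """Executed by image q: block until a matching notify from p arrives."""
    images[q].notifies[p].acquire()


def run():
    images = [Image(2), Image(2)]
    seen = []

    def producer():                          # image p = 0
        images[1].data[:] = [1, 2, 3, 4]     # PUT issued before the notify
        sync_notify(images, 0, 1)

    def consumer():                          # image q = 1
        sync_wait(images, 1, 0)              # returns only after the notify
        seen.extend(images[1].data)          # so the PUT is visible here

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


result = run()
print(result)
```

Only the pair (p, q) synchronizes; any other image would proceed without waiting, which is the advantage over sync_all on loosely-coupled machines.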
Outline
• CAF programming model
• cafc
  • Core language implementation
  • Optimizations
    • procedure splitting
    • supporting hints for non-blocking communication
    • packing strided communications
• Experimental evaluation
• Conclusions
An Impediment to Code Efficiency
• Original reference
    rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …
• Transformed reference
    rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …
• Fortran 90 pointer-based co-array representation does not convey
  • the lack of co-array aliasing
  • co-array contiguity
  • co-array bounds
• Lack of knowledge inhibits important code optimizations
Procedure Splitting
CAF-to-CAF preprocessing
Before:
    subroutine f(…)
      real, save :: c(100)[*]
      ... = c(50) ...
    end subroutine f
After:
    subroutine f(…)
      real, save :: c(100)[*]
      interface
        subroutine f_inner(…, c_arg)
          real :: c_arg[*]
        end subroutine f_inner
      end interface
      call f_inner(…, c)
    end subroutine f
    subroutine f_inner(…, c_arg)
      real :: c_arg(100)[*]
      ... = c_arg(50) ...
    end subroutine f_inner
Benefits of Procedure Splitting
• Generated code conveys
  • lack of co-array aliasing
  • co-array contiguity
  • co-array bounds
• Enables back-end compiler to generate better code
Hiding Communication Latency
Goal: enable communication/computation overlap
• Impediments to generating non-blocking communication
  • use of indexed subscripts in co-dimensions
  • lack of whole-program analysis
• Approach: support hints for non-blocking communication
  • overcome conservative compiler analysis
  • enable sophisticated programmers to achieve good performance today
Hints for Non-blocking PUTs
• Hints for CAF run-time system to issue non-blocking PUTs
    region_id = open_nb_put_region()
    ...
    Put_Stmt_1
    ...
    Put_Stmt_N
    ...
    call close_nb_put_region(region_id)
• Complete non-blocking PUTs:
    call complete_nb_put_region(region_id)
• Open problem: exploiting non-blocking GETs?
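The life cycle of a region (open, issue PUTs, close, complete) can be pictured with a toy model. The three region calls keep the names from the slide, but the class `NbPutRuntime`, its `put` method, and the "defer until completion" behaviour are assumptions made purely for illustration; a real runtime would start the transfers immediately and only wait for them at completion.

```python
class NbPutRuntime:
    """Toy model: PUTs inside an open region are deferred until completion."""

    def __init__(self):
        self._next_id = 0
        self._open = None        # id of the currently open region, if any
        self._pending = {}       # region id -> list of deferred PUTs

    def open_nb_put_region(self):
        self._next_id += 1
        self._open = self._next_id
        self._pending[self._open] = []
        return self._open

    def close_nb_put_region(self, region_id):
        # PUTs after this point are blocking again.
        if self._open == region_id:
            self._open = None

    def put(self, dest, index, value):
        if self._open is None:
            dest[index] = value                                   # blocking PUT
        else:
            self._pending[self._open].append((dest, index, value))

    def complete_nb_put_region(self, region_id):
        # After this call every PUT of the region is visible at its destination.
        for dest, index, value in self._pending.pop(region_id, []):
            dest[index] = value


rt = NbPutRuntime()
remote = [0, 0, 0]

region_id = rt.open_nb_put_region()
rt.put(remote, 0, 7)              # Put_Stmt_1
rt.put(remote, 2, 9)              # Put_Stmt_N
rt.close_nb_put_region(region_id)

before = list(remote)             # computation could overlap with the PUTs here
rt.complete_nb_put_region(region_id)
after = list(remote)
print(before, after)
```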
Strided vs. Contiguous Transfers
• Problem
  • CAF remote reference might induce many small data transfers
      a(i,1:n)[p] = b(j,1:n)
• Solution
  • pack strided data on source and unpack it on destination
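Why a row section is strided, and what packing buys, can be seen on a flat column-major array. In the sketch below an m x n array is stored as a Python list of length m*n, so row i occupies indices i, i+m, i+2m, and so on. The function names and the message counters are hypothetical; indices are 0-based.

```python
def put_row_naive(dst, m, n, i, src, j):
    """a(i,1:n)[p] = b(j,1:n) element by element: one small transfer per column."""
    messages = 0
    for col in range(n):
        dst[i + col * m] = src[j + col * m]
        messages += 1
    return messages


def put_row_packed(dst, m, n, i, src, j):
    """Same assignment with packing: gather, send once, scatter."""
    packed = src[j::m]        # pack the strided row on the source (n elements)
    # ... a single contiguous transfer of `packed` would happen here ...
    dst[i::m] = packed        # unpack into the strided row on the destination
    return 1


m, n = 3, 4
b = list(range(m * n))        # column-major 3x4 array on the source image
a_naive = [0] * (m * n)       # destination co-array, naive version
a_packed = [0] * (m * n)      # destination co-array, packed version

naive_msgs = put_row_naive(a_naive, m, n, 2, b, 1)
packed_msgs = put_row_packed(a_packed, m, n, 2, b, 1)
print(naive_msgs, packed_msgs, a_packed[2::m])
```

Both versions leave the destination in the same state; the difference is n small messages versus one message of n elements.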
Pragmatics of Packing
Who should implement packing?
• The CAF programmer
  • difficult to program
• The CAF compiler
  • unpacking requires conversion of PUTs into two-sided communication (a difficult whole-program transformation)
• The communication library
  • most natural place
  • ARMCI currently performs packing on Myrinet
CAF Compiler Targets (Sept 2004)
• Processors
  • Pentium, Alpha, Itanium2, MIPS
• Interconnects
  • Quadrics, Myrinet, Gigabit Ethernet, shared memory
• Operating systems
  • Linux, Tru64, IRIX
Outline
• CAF programming model
• cafc
  • Core language implementation
  • Optimizations
• Experimental evaluation
• Conclusions
Experimental Evaluation
• Platforms
  • Alpha+Quadrics QSNet (Elan3)
  • Itanium2+Quadrics QSNet II (Elan4)
  • Itanium2+Myrinet 2000
• Codes
  • NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
NAS SP Efficiency (Class C)
• lack of non-blocking notify implementation blocks CAF comm/comp overlap
NAS MG Efficiency (Class C)
• ARMCI comm is efficient
• pt-2-pt synch boosts CAF performance by 30%
Impact of Optimizations: Assorted Results
• Procedure splitting
  • 42-60% improvement for BT on Itanium2+Myrinet cluster
  • 15-33% improvement for LU on Alpha+Quadrics
• Non-blocking communication generation
  • 5% improvement for BT on Itanium2+Quadrics cluster
  • 3% improvement for MG on all platforms
• Packing of strided data
  • 31% improvement for BT on Alpha+Quadrics cluster
  • 37% improvement for LU on Itanium2+Quadrics cluster
See paper for more details
Conclusions
• CAF boosts programming productivity
  • simplifies the development of SPMD parallel programs
  • shifts details of managing communication to compiler
• cafc delivers performance comparable to hand-tuned MPI
• cafc implements effective optimizations
  • procedure splitting
  • non-blocking communication
  • packing of strided communication (in ARMCI)
• Vectorization needed to achieve true performance portability with machines like Cray X1
http://www.hipersoft.rice.edu/caf