Experiences Building a Multi-platform Compiler for Co-array Fortran
John Mellor-Crummey, Cristian Coarfa, Yuri Dotsenko
Department of Computer Science, Rice University
AHPCRC PGAS Workshop, September 2005

Goals for HPC Languages
• Expressiveness
• Ease of programming
• Portable performance
• Ubiquitous availability

PGAS Languages
• Global address space programming model
  • one-sided communication (GET/PUT) (simpler than msg passing)
• Programmer has control over performance-critical factors
  • data distribution and locality control (lacking in OpenMP)
  • computation partitioning (HPF & OpenMP compilers must get this right)
  • communication placement (HPF & OpenMP compilers must get this right)
• Data movement and synchronization as language primitives
  • amenable to compiler-based communication optimization

Co-array Fortran Programming Model
• SPMD process images
  • fixed number of images during execution
  • images operate asynchronously
• Both private and shared data
  • real x(20, 20)        ! a private 20x20 array in each image
  • real y(20, 20)[*]     ! a shared 20x20 array in each image
• Simple one-sided shared-memory communication
  • x(:,j:j+2) = y(:,p:p+2)[r]     ! copy columns from image r into local columns
• Synchronization intrinsic functions
  • sync_all: a barrier and a memory fence
  • sync_mem: a memory fence
  • sync_team([team members to notify], [team members to wait for])
• Pointers and (perhaps asymmetric) dynamic allocation
• Parallel I/O

One-sided Communication with Co-Arrays
    integer a(10,20)[*]
    if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
[Figure: image 1, image 2, …, image N, each holding its own a(10,20)]
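
The copy on this slide can be mimicked in plain Python to make the data movement concrete. This is only a sketch under stated assumptions: images are modeled as entries of a list, the co-array `a(10,20)[*]` as one 10x20 nested list per image, and the helper names `make_images` and `shift_from_left` are hypothetical. In real CAF every image executes the statement concurrently. A sequential loop gives the same result here because each image reads columns 19:20 of its left neighbor and writes only its own columns 1:2.

```python
ROWS, COLS = 10, 20

def make_images(num_images):
    """One copy of a(10,20) per image; each cell records (image, row, col)."""
    return [
        [[(img, r, c) for c in range(1, COLS + 1)] for r in range(1, ROWS + 1)]
        for img in range(1, num_images + 1)
    ]

def shift_from_left(images):
    """Mimic: if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]."""
    for this_image in range(2, len(images) + 1):   # image 1 has no left neighbor
        local = images[this_image - 1]             # list is 0-based, image ids are 1-based
        remote = images[this_image - 2]            # co-array data on image this_image-1
        for r in range(ROWS):
            # GET remote columns 19:20 into local columns 1:2
            local[r][0:2] = remote[r][18:20]
    return images

images = shift_from_left(make_images(3))
```

After the call, the first two columns on image 2 hold the values that originated in columns 19 and 20 of image 1, while image 1 is unchanged.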

CAF Compilers
• Cray compilers for X1 & T3E architectures
• Rice Co-Array Fortran Compiler (cafc)

Rice cafc Compiler
• Source-to-source compiler
  • source-to-source yields multi-platform portability
• Implements core language features
  • core sufficient for non-trivial codes
  • preliminary support for derived types
  • support for allocatable components coming soon
• Open source
Performance comparable to that of hand-tuned MPI codes

Implementation Strategy
• Goals
  • portability
  • high performance on a wide range of platforms
• Approach
  • source-to-source compilation of CAF codes
    • use Open64/SL Fortran 90 infrastructure
    • CAF → Fortran 90 + communication operations
  • communication
    • ARMCI and GASNet one-sided comm libraries for portability
    • load/store communication on shared-memory platforms

Key Implementation Concerns
• Fast access to local co-array data
• Fast communication
• Overlap of communication and computation

Accessing Co-Array Data
Two Representations
• SAVE and COMMON co-arrays as Fortran 90 pointers
  • F90 pointers to memory allocated outside Fortran run-time system
  • original references accessing local co-array data
      rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …
  • transformed references
      rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …
• Procedure co-array arguments as F90 explicit-shape arrays
  • CAF language requires explicit shape for co-array arguments
Co-array declaration:
    real :: a(10,10,10)[*]
Corresponding descriptor:
    type CAFDesc_real_3
      real, pointer :: ptr(:,:,:)   ! F90 pointer to local co-array data
    end type CAFDesc_real_3
    type(CAFDesc_real_3) :: a

Performance Challenges
• Problem
  • Fortran 90 pointer-based representation does not convey
    • the lack of co-array aliasing
    • contiguity of co-array data
    • co-array bounds information
  • lack of knowledge inhibits important code optimizations
• Approach: procedure splitting

Procedure Splitting
CAF to CAF optimization
Before:
    subroutine f(…)
      real, save :: c(100)[*]
      ... = c(50) ...
    end subroutine f
After:
    subroutine f(…)
      real, save :: c(100)[*]
      interface
        subroutine f_inner(…, c_arg)
          real :: c_arg[*]
        end subroutine f_inner
      end interface
      call f_inner(…,c(1))
    end subroutine f

    subroutine f_inner(…, c_arg)
      real :: c_arg(100)[*]
      ... = c_arg(50) ...
    end subroutine f_inner
• Benefits
  • better alias analysis
  • contiguity of co-array data
  • co-array bounds information
  • better dependence analysis
Result: back-end compiler can generate better code

Implementing Communication
• x(1:n) = a(1:n)[p] + …
• General approach: use a buffer to hold off-processor data
  • allocate buffer
  • perform GET to fill buffer
  • perform computation: x(1:n) = buffer(1:n) + …
  • deallocate buffer
• Optimizations
  • no buffer for co-array to co-array copies
  • unbuffered load/store on shared-memory systems
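
The four buffer steps listed on this slide can be sketched in Python as follows. This is an illustration under assumptions, not cafc's generated code: `get`, `buffered_remote_add`, and `direct_copy` are hypothetical names, Python lists stand in for co-array storage, and the `get` helper only copies a slice where a real run-time would issue an ARMCI or GASNet transfer.

```python
def get(remote_array, lo, hi):
    """Stand-in for a one-sided GET of a(lo:hi)[p]; returns a fresh temporary buffer."""
    return list(remote_array[lo - 1:hi])   # 1-based inclusive section -> Python slice

def buffered_remote_add(x, a_on_p, local_term, n):
    """Mimic x(1:n) = a(1:n)[p] + local_term(1:n) using a temporary buffer."""
    buffer = get(a_on_p, 1, n)             # allocate buffer and fill it with a GET
    for i in range(n):                     # compute from the buffer, not the remote array
        x[i] = buffer[i] + local_term[i]
    del buffer                             # deallocate buffer
    return x

def direct_copy(dest, a_on_p, n):
    """Co-array to co-array copy: data lands in place, no temporary buffer."""
    for i in range(n):
        dest[i] = a_on_p[i]
    return dest

x = buffered_remote_add([0] * 4, [10, 20, 30, 40, 50], [1, 2, 3, 4, 5], 4)
```

The `direct_copy` variant corresponds to the first optimization on the slide: when the destination is itself a co-array, the transfer can write straight into it.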

Strided vs. Contiguous Transfers
• Problem
  • CAF remote reference might induce many small data transfers
  • a(i,1:n)[p] = b(j,1:n)
• Solution
  • pack strided data on source and unpack it on destination
• Constraints
  • can't express both source-level packing and unpacking for a one-sided transfer
  • two-sided packing/unpacking is awkward for users
• Preferred approach
  • have communication layer perform packing/unpacking
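
A minimal Python sketch of the pack/unpack idea, assuming column-major (Fortran-order) storage in flat lists. The function names and the `put_row` wrapper are hypothetical. In the preferred approach from the slide this logic would live inside the communication layer rather than in user code.

```python
def pack_row(flat, lead_dim, row, n):
    """Gather b(row,1:n) from a column-major array into one contiguous buffer."""
    return [flat[(row - 1) + lead_dim * col] for col in range(n)]

def unpack_row(flat, lead_dim, row, packed):
    """Scatter a contiguous buffer into a(row,1:n) of a column-major array."""
    for col, value in enumerate(packed):
        flat[(row - 1) + lead_dim * col] = value

def put_row(a_remote, a_lead, i, b_local, b_lead, j, n):
    """Mimic a(i,1:n)[p] = b(j,1:n) as pack -> one transfer -> unpack."""
    packed = pack_row(b_local, b_lead, j, n)   # source side: gather strided elements
    message = list(packed)                     # one contiguous transfer instead of n small ones
    unpack_row(a_remote, a_lead, i, message)   # destination side: scatter into place
    return 1                                   # number of transfers issued

# b is 3x4 with b(r,c) = 10*r + c, a is 2x4 of zeros, both stored column-major
b = [10 * r + c for c in range(1, 5) for r in range(1, 4)]
a = [0] * 8
transfers = put_row(a, 2, 2, b, 3, 3, 4)       # a(2,1:4)[p] = b(3,1:4)
```

Elements of a row are `lead_dim` apart in memory, which is why a naive implementation would move them one at a time.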

Pragmatics of Packing
Who should implement packing?
• CAF programmer
  • difficult to program
• CAF compiler
  • must convert PUTs into two-sided communication to unpack
  • difficult whole-program transformation
• Communication library
  • most natural place
  • ARMCI currently performs packing on Myrinet (at least)

Synchronization
• Original CAF specification: team synchronization only
  • sync_all, sync_team
• Limits performance on loosely-coupled architectures
• Point-to-point extensions
  • sync_notify(q)
  • sync_wait(p)
Point-to-point synchronization semantics
  Delivery of a notify to q from p ⇒ all communication from p to q issued before the notify has been delivered to q
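
The notify/wait pairing can be illustrated with Python threads. This is a toy model under assumptions: the `Images` class, its one-semaphore-per-ordered-pair design, and `run_pair` are hypothetical and are not how cafc implements the primitives. The sketch only shows the ordering guarantee: the consumer sees data written before the matching notify.

```python
import threading

class Images:
    """Toy model: one counting semaphore per ordered (sender, receiver) pair."""
    def __init__(self, num_images):
        self.sems = {(p, q): threading.Semaphore(0)
                     for p in range(1, num_images + 1)
                     for q in range(1, num_images + 1)}

    def sync_notify(self, me, q):
        """Image `me` signals image q."""
        self.sems[(me, q)].release()

    def sync_wait(self, me, p):
        """Image `me` blocks until a notify from image p has arrived."""
        self.sems[(p, me)].acquire()

def run_pair():
    imgs = Images(2)
    inbox = [None]                  # stands in for co-array data on image 2
    seen = []

    def image1():
        inbox[0] = 42               # PUT to image 2
        imgs.sync_notify(1, 2)      # notify issued after the PUT

    def image2():
        imgs.sync_wait(2, 1)        # blocks until image 1's notify arrives
        seen.append(inbox[0])       # safe to read: the PUT preceded the notify

    t2 = threading.Thread(target=image2)
    t1 = threading.Thread(target=image1)
    t2.start(); t1.start()
    t1.join(); t2.join()
    return seen
```

Only the two images involved synchronize, unlike `sync_all`, where every image must reach the barrier.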

Hiding Communication Latency
Goal: enable communication/computation overlap
• Impediments to generating non-blocking communication
  • use of indexed subscripts in co-dimensions
  • lack of whole program analysis
• Approach: support hints for non-blocking communication
  • overcome conservative compiler analysis
  • enable sophisticated programmers to achieve good performance today
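
The slides do not show the hint syntax, so the following Python sketch illustrates only the general pattern that non-blocking communication enables: start the transfer, do independent local work, and wait just before the remote data is needed. The names `remote_get` and `overlapped_sum` are hypothetical, and a thread pool stands in for the communication layer.

```python
from concurrent.futures import ThreadPoolExecutor

def remote_get(section):
    """Stand-in for a one-sided GET; a real transfer would cross the network."""
    return list(section)

def overlapped_sum(remote_section, local_data):
    """Overlap a (simulated) remote GET with independent local computation."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = pool.submit(remote_get, remote_section)  # start non-blocking GET
        local_part = sum(local_data)                      # independent local work
        remote_part = sum(handle.result())                # wait for completion before use
    return local_part + remote_part

total = overlapped_sum([1, 2, 3], [10, 20])
```

The overlap is legal only because the local work does not touch the data being fetched, which is the property a conservative compiler often cannot prove without a hint.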

Questions about PGAS Languages
• Performance
  • can performance match hand-tuned msg passing programs?
  • what are the obstacles to top performance?
  • what should be done to overcome them?
    • language modifications or extensions?
    • program implementation strategies?
    • compiler technology?
    • run-time system enhancements?
• Programmability
  • how easy is it to develop high performance programs?

Investigating these Issues
Evaluate CAF, UPC, and MPI versions of NAS benchmarks
• Performance
  • compare CAF and UPC performance to that of MPI versions
  • use hardware performance counters to pinpoint differences
  • determine optimization techniques common to both languages as well as language-specific optimizations
    • language features
    • program implementation strategies
    • compiler optimizations
    • runtime optimizations
• Programmability
  • assess programmability of the CAF and UPC variants

Platforms and Benchmarks
• Platforms
  • Itanium2+Myrinet 2000 (900 MHz Itanium2)
  • Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB)
  • SGI Altix 3000 (1.5 GHz Itanium2)
  • SGI Origin 2000 (R10000)
• Codes
  • NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
    • MG, CG, SP, BT
  • CAF and UPC versions were derived from Fortran77+MPI versions

MG class A (256³) on Itanium2+Myrinet2000
[Chart: higher is better]
• Intel compiler: restrict yields factor of 2.3 performance improvement
• CAF point to point 35% faster than barriers
• UPC strided comm 28% faster than multiple transfers
• UPC point to point 49% faster than barriers

MG class C (512³) on SGI Altix 3000
[Chart: higher is better]
• Intel C compiler: scalar performance
• Fortran compiler: linearized array subscripts 30% slowdown compared to multidimensional subscripts

MG class B (256³) on SGI Origin 2000
[Chart: higher is better]

CG class C (150000) on SGI Altix 3000
[Chart: higher is better]
• Intel compiler: sum reductions in C 2.6 times slower than Fortran!
• point to point 19% faster than barriers

CG class B (75000) on SGI Origin 2000
[Chart: higher is better]
• Intrepid compiler (gcc): sum reductions in C are up to 54% slower than SGI C/Fortran!

SP class C (162³) on Itanium2+Myrinet2000
[Chart: higher is better]
• restrict yields 18% performance improvement

SP class C (162³) on Alpha+Quadrics
[Chart: higher is better]

BT class C (162³) on Itanium2+Myrinet2000
[Chart: higher is better]
• CAF: comm. packing 7% faster
• CAF: procedure splitting improves performance 42-60%
• UPC: comm. packing 32% faster
• UPC: use of restrict boosts performance 43%

BT class B (102³) on SGI Altix 3000
[Chart: higher is better]
• use of restrict improves performance 30%

Performance Observations
• Achieving highest performance can be difficult
  • need effective optimizing compilers for PGAS languages
• Communication layer is not the problem
  • CAF with ARMCI or GASNet yields equivalent performance
• Scalar code optimization of scientific code is the key!
  • SP+BT: SGI Fortran: unroll+jam, software pipelining (SWP)
  • MG: SGI Fortran: loop alignment, fusion
  • CG: Intel Fortran: optimized sum reduction
• Linearized subscripts for multidimensional arrays hurt!
  • measured 30% performance gap with Intel Fortran
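
For readers unfamiliar with the term, the Python sketch below shows what "linearized subscripts" means: a multidimensional reference such as `u(i,j,k)` is rewritten as a single offset into a one-dimensional array. The helper name `linear_index` and the array extents are illustrative assumptions. The sketch demonstrates only the index arithmetic, not the 30% performance gap the slide reports, which concerns how well a Fortran back end optimizes the two forms.

```python
def linear_index(i, j, k, n1, n2):
    """0-based offset of Fortran element u(i,j,k) in a column-major u(n1,n2,n3)."""
    return (i - 1) + n1 * ((j - 1) + n2 * (k - 1))

def build_flat(n1, n2, n3):
    """Flat column-major storage where each cell records its own (i,j,k)."""
    flat = [None] * (n1 * n2 * n3)
    for k in range(1, n3 + 1):
        for j in range(1, n2 + 1):
            for i in range(1, n1 + 1):       # leftmost subscript varies fastest
                flat[linear_index(i, j, k, n1, n2)] = (i, j, k)
    return flat

flat = build_flat(4, 3, 2)
```

Once subscripts are folded into one expression like this, the compiler has to recover the per-dimension structure before it can reason about dependences between loop iterations.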

Performance Prescriptions
For portable high performance, we need …
• Better language support for CAF synchronization
  • point-to-point synchronization is an important common case!
  • currently only a Rice extension outside the CAF standard
• Better CAF & UPC compiler support
  • communication vectorization
  • synchronization strength reduction: important for programmability
• Compiler optimization of loops with complex dependences
• Better run-time library support
  • efficient communication support for strided array sections

Programmability Observations
• Matching MPI performance required using bulk communication
  • communicating multi-dimensional array sections is natural in CAF
  • library-based primitives are cumbersome in UPC
• Strided communication is problematic for performance
  • tedious programming of packing/unpacking at source level
• Wavefront computations
  • MPI buffered communication easily decouples sender/receiver
  • PGAS models: buffering explicitly managed by programmer