An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey (Rice University)
Francois Cantonnet, Tarek El-Ghazawi, Ashrujit Mohanti, Yiyi Yao (George Washington University)
Daniel Chavarria-Miranda (Pacific Northwest National Laboratory)
GAS Languages • Global address space programming model • one-sided communication (GET/PUT), simpler than message passing • Programmer has control over performance-critical factors (lacking in OpenMP; HPF & OpenMP compilers must get this right) • data distribution and locality control • computation partitioning • communication placement • Data movement and synchronization as language primitives • amenable to compiler-based communication optimization
Questions • Can GAS languages match the performance of hand-tuned message passing programs? • What are the obstacles to obtaining performance with GAS languages? • What should be done to ameliorate them? • by language modifications or extensions • by compilers • by run-time systems • How easy is it to develop high performance programs in GAS languages?
Approach Evaluate CAF and UPC using NAS Parallel Benchmarks • Compare performance to that of MPI versions • use hardware performance counters to pinpoint differences • Determine optimization techniques common for both languages as well as language specific optimizations • language features • program implementation strategies • compiler optimizations • runtime optimizations • Assess programmability of the CAF and UPC variants
Outline • Questions and approach • CAF & UPC • Features • Compilers • Performance considerations • Experimental evaluation • Conclusions
CAF & UPC Common Features • SPMD programming model • Both private and shared data • Language-level one-sided shared-memory communication • Synchronization intrinsic functions (barrier, fence) • Pointers and dynamic allocation
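A minimal UPC sketch of these common features (the array name, size, and neighbor choice are illustrative; in CAF the shared data would instead be declared as a co-array, e.g. integer a(N)[*]):

    /* common GAS features in UPC: private and shared data,
       one-sided GET, and barrier synchronization */
    #include <upc.h>
    #define N 128

    shared [N] int a[N*THREADS];   /* shared: one block of N per thread */
    int local_buf[N];              /* private: one copy per thread */

    int main(void) {
        int peer = (MYTHREAD + 1) % THREADS;
        /* one-sided GET: copy the neighbor's block into private memory */
        upc_memget(local_buf, &a[peer*N], N * sizeof(int));
        upc_barrier;               /* intrinsic barrier synchronization */
        return 0;
    }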
CAF & UPC Differences I • Multidimensional arrays • CAF: multidimensional arrays, procedure argument reshaping • UPC: linearization, typically using macros • Local accesses to shared data • CAF: Fortran 90 array syntax without brackets, e.g. a(1:M,N) • UPC: shared array reference using MYTHREAD or a C pointer
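For instance, the linearization macros mentioned above typically look like the following UPC sketch (names and sizes are hypothetical); the cast to a plain C pointer is legal because the tile has affinity to the calling thread:

    /* illustrative UPC linearization of a conceptually 2-D shared array */
    #include <upc.h>
    #define N 64
    #define M 64
    #define IDX(t,i,j) ((t)*N*M + (i)*M + (j))  /* row-major tile of thread t */

    shared [N*M] double a[THREADS*N*M];

    void touch_local(void) {
        /* local access through a C pointer, where CAF would write a(i,j) */
        double *mine = (double *)&a[IDX(MYTHREAD, 0, 0)];
        mine[0*M + 1] = 1.0;       /* this thread's element (0,1) */
    }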
CAF and UPC Differences II • Scalar/element-wise remote accesses • CAF: multidimensional subscripts + bracket syntax a(1,1) = a(1,M)[this_image()-1] • UPC: shared (“flat”) array access with linearized subscripts a[N*M*MYTHREAD] = a[N*M*MYTHREAD-N] • Bulk and strided remote accesses • CAF: use natural syntax of Fortran 90 array sections and operations on remote co-array sections (fewer temporaries on SMPs) • UPC: use library functions (and temporary storage to hold a copy)
Bulk Communication
(figure: N×M arrays on processors P1 … PN; each image fetches the last two columns of its left neighbor)
CAF:
integer a(N,M)[*]
a(1:N,1:2) = a(1:N,M-1:M)[this_image()-1]
UPC:
shared int *a;
upc_memget((int *)&a[N*M*MYTHREAD], &a[N*M*MYTHREAD-2*N], 2*N*sizeof(int));
CAF & UPC Differences III • Synchronization • CAF: team synchronization • UPC: split-phase barrier, locks • UPC: worksharing construct upc_forall • UPC: richer set of pointer types
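A minimal sketch of these UPC-specific constructs (the array, its size, and the function are illustrative):

    /* upc_forall worksharing plus a split-phase barrier */
    #include <upc.h>
    #define N 1024

    shared double x[N];

    void scale(double s) {
        int i;
        /* each iteration runs on the thread with affinity to x[i] */
        upc_forall (i = 0; i < N; i++; &x[i])
            x[i] *= s;
        upc_notify;    /* split phase: signal arrival ... */
        /* ... independent local work can overlap here ... */
        upc_wait;      /* ... then wait for the other threads */
    }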
Outline • Questions and approach • CAF & UPC • Features • Compilers • Performance considerations • Experimental evaluation • Conclusions
CAF Compilers • Rice Co-Array Fortran Compiler (cafc) • Multi-platform compiler • Implements core of the language • core sufficient for non-trivial codes • currently lacks support for derived type and dynamic co-arrays • Source-to-source translator • translates CAF into Fortran 90 and communication code • uses ARMCI or GASNet as communication substrate • can generate load/store for remote data accesses on SMPs • Performance comparable to that of hand-tuned MPI codes • Open source • Vendor compilers: Cray
UPC Compilers • Berkeley UPC Compiler • Multi-platform compiler • Implements full UPC 1.1 specification • Source-to-source translator • converts UPC into ANSI C and calls to UPC runtime library & GASNet • tailors code to a specific architecture: cluster or SMP • Open source • Intrepid UPC compiler • Based on GCC compiler • Works on SGI Origin, Cray T3E and Linux SMP • Other vendor compilers: Cray, HP
Outline • Questions and approach • CAF & UPC • Features • Compilers • Performance considerations • Experimental evaluation • Conclusions
Scalar Performance • Generate code amenable to backend compiler optimizations • Quality of backend compilers • poor reduction recognition in the Intel C compiler • Local access to shared data • CAF: use F90 pointers and procedure arguments • UPC: use C pointers instead of UPC shared pointers • Alias and dependence analysis • Fortran vs. C language semantics • multidimensional arrays in Fortran • procedure argument reshaping • Convey lack of aliasing for (non-aliased) shared variables • CAF: use procedure splitting so co-arrays are referenced as arguments • UPC: use the C99 restrict keyword for C pointers used to access shared data
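As an illustrative combination of the C-pointer and restrict advice above (array names and sizes are hypothetical), a local loop over shared data might look like:

    /* local computation on shared data through restrict-qualified C pointers */
    #include <upc.h>
    #define N 4096

    shared [N] double u[THREADS*N], v[THREADS*N];

    double local_dot(void) {
        /* restrict asserts no aliasing, so the backend C compiler
           can software-pipeline and vectorize the loop */
        double * restrict pu = (double *)&u[MYTHREAD*N];
        double * restrict pv = (double *)&v[MYTHREAD*N];
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += pu[i] * pv[i];
        return s;
    }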
Communication • Communication vectorization is essential for high performance on cluster architectures for both languages • CAF • use F90 array sections (compiler translates to appropriate library calls) • UPC • use library functions for contiguous transfers • use UPC extensions for strided transfers in the Berkeley UPC compiler • Increase efficiency of strided transfers by packing/unpacking data at the language level
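One way to express such language-level packing in UPC (layout, names, and the staging buffer are all illustrative):

    /* pack a strided column into a contiguous buffer, then one bulk PUT */
    #include <upc.h>
    #define N 64
    #define M 64

    shared [N*M] double a[THREADS*N*M];    /* one N x M tile per thread */
    shared [N]   double stage[THREADS*N];  /* per-thread staging buffer */
    double pack[N];                        /* private pack buffer */

    void send_column(int j, int peer) {
        double *mine = (double *)&a[MYTHREAD*N*M];
        for (int i = 0; i < N; i++)        /* gather the strided column j */
            pack[i] = mine[i*M + j];
        /* one contiguous transfer instead of N element-wise remote stores */
        upc_memput(&stage[peer*N], pack, N * sizeof(double));
    }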
Synchronization • Barrier-based synchronization • Can lead to over-synchronized code • Use point-to-point synchronization • CAF: proposed language extension (sync_notify, sync_wait) • UPC: language-level implementation
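In UPC, a language-level point-to-point scheme can be built from strict shared flags; a minimal sketch (flag array and helper names are hypothetical), playing the same role as the proposed CAF sync_notify/sync_wait:

    /* point-to-point synchronization with strict shared flags */
    #include <upc.h>

    strict shared int ready[THREADS];      /* strict: accesses stay ordered */

    void notify(int peer) {
        ready[peer] = 1;                   /* signal exactly one thread ... */
    }

    void wait_for_notify(void) {
        while (!ready[MYTHREAD]) ;         /* ... instead of a full barrier */
        ready[MYTHREAD] = 0;               /* reset for the next phase */
    }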
Outline • Questions and approach • CAF & UPC • Experimental evaluation • Conclusions
Platforms and Benchmarks • Platforms • Itanium2+Myrinet 2000 (900 MHz Itanium2) • Alpha+Quadrics QSNetI (1 GHz Alpha EV6.8CB) • SGI Altix 3000 (1.5 GHz Itanium2) • SGI Origin 2000 (R10000) • Codes • NAS Parallel Benchmarks (NPB 2.3) from NASA Ames • MG, CG, SP, BT • CAF and UPC versions were derived from Fortran77+MPI versions
MG class A (256³) on Itanium2+Myrinet2000 (chart; higher is better)
• Intel compiler: restrict yields a 2.3× performance improvement
• CAF: point-to-point 35% faster than barriers
• UPC: strided communication 28% faster than multiple transfers
• UPC: point-to-point 49% faster than barriers
MG class C (512³) on SGI Altix 3000 (chart; higher is better)
• Intel C compiler: scalar performance
• Fortran compiler: linearized array subscripts cause a 30% slowdown compared to multidimensional subscripts
MG class B (256³) on SGI Origin 2000 (chart; higher is better)
CG class C (150000) on SGI Altix 3000 (chart; higher is better)
• Intel compiler: sum reductions in C are 2.6 times slower than in Fortran!
• point-to-point 19% faster than barriers
CG class B (75000) on SGI Origin 2000 (chart; higher is better)
• Intrepid compiler (gcc): sum reductions in C are up to 54% slower than with SGI C/Fortran!
SP class C (162³) on Itanium2+Myrinet2000 (chart; higher is better)
• restrict yields an 18% performance improvement
SP class C (162³) on Alpha+Quadrics (chart; higher is better)
BT class C (162³) on Itanium2+Myrinet2000 (chart; higher is better)
• CAF: communication packing 7% faster
• CAF: procedure splitting improves performance 42-60%
• UPC: communication packing 32% faster
• UPC: use of restrict boosts performance by 43%
BT class B (102³) on SGI Altix 3000 (chart; higher is better)
• use of restrict improves performance by 30%
Conclusions • Matching MPI performance required using bulk communication • library-based primitives are cumbersome in UPC • communicating multi-dimensional array sections is natural in CAF • lack of efficient run-time support for strided communication is a problem • With CAF, can achieve performance comparable to MPI • With UPC, matching MPI performance can be difficult • CG: able to match MPI on all platforms • SP, BT, MG: substantial gap remains
Why the Gap? • Communication layer is not the problem • CAF with ARMCI or GASNet yields equivalent performance • Scalar optimization of scientific code is the key! • SP+BT: SGI Fortran: unroll-and-jam, software pipelining (SWP) • MG: SGI Fortran: loop alignment, fusion • CG: Intel Fortran: optimized sum reduction • Linearized subscripts for multidimensional arrays hurt! • measured 30% performance gap with Intel Fortran
Programming for Performance • In the absence of effective optimizing compilers for CAF and UPC, achieving high performance is difficult • To make codes efficient across the full range of architectures, we need • better language support for synchronization • point-to-point synchronization is an important common case! • better CAF & UPC compiler support • communication vectorization • synchronization strength reduction • better compiler optimization of loops with complex dependence patterns • better run-time library support • efficient communication of strided array sections