A Multi-platform Co-Array Fortran Compiler for High-Performance Computing
Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
{dotsenko, ccristi, johnmc}@cs.rice.edu
NSF
(Figure: integer :: a(10,20)[*] declares a 10x20 co-array with one local copy on each of images 1, 2, ..., N.)
(Figure: copies from the left neighbor –
if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
Figure: 2D wave-front parallelism, with cells labeled 1, 2, 3 by the wave that computes them.)

Programming Ultra-scale Parallel Systems

CHALLENGES
• High performance and good scalability
• Programmer productivity

MPI: a library-based parallel programming model
• Portable and widely used
• The developer has explicit control over data and communication placement
• Difficult and error-prone to program
• Most of the burden for communication optimization falls on application developers; compiler support is underutilized

CAF: a promising near-term alternative
• As expressive as MPI
• Simpler to program than MPI
• More amenable to compiler optimizations
• The user retains control over performance-critical factors

Co-Array Fortran Language
• Parallel extension of Fortran 90
• SPMD programming model
  • fixed number of images during execution
  • images operate asynchronously
• Both private and shared data
  • real a(20,20) – private: a 20x20 array in each image
  • real a(20,20)[*] – shared: a 20x20 array in each image
• Simple one-sided communication (PUT and GET)
  • x(:,j:j+2) = a(r,:)[p:p+2] copies row r from images p:p+2 into local columns
• Flexible explicit synchronization
  • sync_team(team [, wait])
    • team = a vector of process ids to synchronize with
    • wait = a vector of processes to wait for
• Pointers and dynamic allocation

CAF Model Refinements
• Point-to-point synchronization: sync_notify(p) and sync_wait(p)
• Less restrictive memory fences at call sites
• Collective operations

Rice CAF Compiler
• Source-to-source code generation
• Open-source compiler, built on the Open64/SL infrastructure
• Support for the core language features
• Code generation:
  • library-based communication: the portable ARMCI and GASNet communication libraries and the CHASM array-descriptor library
  • load/store communication on shared-memory platforms
• Operating systems: Linux IA64/IA32, Alpha Tru64, SGI IRIX64
• Interconnects and platforms: Quadrics QSNet (Elan 3) and QSNet II (Elan 4), Myrinet 2000, Ethernet, SGI Altix 3000, SGI Origin 2000

Current Optimizations
• Procedure splitting
• Hints for non-blocking communication
• Library-based and load/store communication
• Packing of strided communication

Planned Optimizations
• Communication vectorization and aggregation
• Synchronization strength reduction
• Automatic split-phase communication
• Platform-driven communication optimizations:
  • transform communication from one-sided into two-sided and collective, where useful
  • multi-model code for hierarchical architectures
  • convert GETs into PUTs
• Multi-buffer co-arrays for asynchrony tolerance
• Virtualization for latency tolerance
• Interoperability with other programming models

CAF Applications and Benchmarks
• Sweep3D (neutron transport problem) – wave-front parallelism
• Spark98 (San Fernando Valley earthquake simulation) – sparse matrix-vector multiply
• NAS Parallel Benchmarks 2.3: MG, CG, SP, BT, LU
• RandomAccess, STREAM

Results
(Figures – Computational Fluid Dynamics on cluster platforms: NAS BT class C and NAS MG class C on Itanium2 + Myrinet 2000. Computational Fluid Dynamics on SGI Altix 3000: NAS BT class B and NAS MG class C. Spark98 on SGI Altix 3000, with its partitioned mesh. Sweep3D 150^3 on Itanium2 + Quadrics mesh, Itanium2 + Myrinet, and SGI Altix 3000.)
• Spark98: sparse matrix-vector multiply (sf2 traces)
• The performance of all CAF versions is comparable to that of MPI, and better on large numbers of CPUs
• The CAF GETs version is simpler and more "natural" to code, but up to 13% slower
• Without attention to locality, applications do not scale on NUMA architectures (Hybrid)
• The ARMCI library is more efficient than MPI
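The left-neighbor copy shown in the figure can be illustrated with a minimal Python sketch that models each image's co-array as a local 2-D list. This is a simulation of the CAF semantics, not CAF itself; the image count N = 4 is an arbitrary choice, while the 10x20 shape and the column indices come from the poster's example.

```python
# Simulate N images, each holding a 10x20 co-array a(10,20)[*].
# Image k (1-based) fills its copy with the value k.
N = 4
images = [[[k + 1] * 20 for _ in range(10)] for k in range(N)]

# CAF: if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
# Every image except the first copies its left neighbor's last two
# columns into its own first two columns (a one-sided GET).
# Indices are 0-based here; CAF is 1-based.
for me in range(1, N):
    left = images[me - 1]
    for i in range(10):
        images[me][i][0:2] = left[i][18:20]
```

Each image now holds its neighbor's boundary data in its first two columns while the rest of its array is unchanged, which is exactly the halo-exchange pattern the one-liner expresses.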
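The one-sided GET from the language section, x(:,j:j+2) = a(r,:)[p:p+2], fetches row r of the co-array from three consecutive images and stores each remote row as one local column. A small Python sketch of that data movement (again a simulation, with illustrative sizes n = 5 and four images; CAF indices are 1-based, this sketch is 0-based):

```python
# Simulate x(:,j:j+2) = a(r,:)[p:p+2]:
# fetch row r of co-array a from images p, p+1, p+2 and store each
# remote row as one local column of x.
n = 5                                   # a is n x n on each image
num_images = 4
# image q holds a filled with the value q + 1
a = [[[q + 1] * n for _ in range(n)] for q in range(num_images)]

r, p, j = 2, 0, 1
x = [[0] * n for _ in range(n)]         # local array x(n, n)
for k in range(3):                      # images p, p+1, p+2
    remote_row = a[p + k][r]            # one-sided GET from image p+k
    for i in range(n):
        x[i][j + k] = remote_row[i]
```

No cooperation from the remote images is needed: the GET names the remote image in the co-dimension and the runtime performs the transfer, which is what makes the notation "simple one-sided communication".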
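The point-to-point synchronization refinement, sync_notify(p) and sync_wait(p), pairs a non-blocking notify with a blocking wait. One plausible way to model it is a counting semaphore per (source, destination) image pair; the sketch below uses two threads to stand in for two images, and the helper names sync_notify/sync_wait mirror the CAF primitives but are our own Python functions.

```python
# Sketch of CAF point-to-point synchronization: one counting
# semaphore per (source, destination) image pair.  sync_notify(p)
# does not block; sync_wait(p) blocks until a matching notify.
import threading

NUM_IMAGES = 2
sems = {(s, d): threading.Semaphore(0)
        for s in range(NUM_IMAGES) for d in range(NUM_IMAGES)}

def sync_notify(me, p):
    sems[(me, p)].release()     # signal image p; continue immediately

def sync_wait(me, p):
    sems[(p, me)].acquire()     # block until image p has notified me

result = []

def image0():
    result.append("produced")   # e.g. a PUT into image 1's co-array
    sync_notify(0, 1)           # tell image 1 the data is ready

def image1():
    sync_wait(1, 0)             # wait for image 0's notify
    result.append("consumed")   # now safe to read the data

t0 = threading.Thread(target=image0)
t1 = threading.Thread(target=image1)
t1.start(); t0.start()
t0.join(); t1.join()
```

Because image 1 blocks until image 0's notify, the consume step is guaranteed to observe the produced data, without the full-barrier cost of a team-wide synchronization.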
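Sweep3D's 2D wave-front parallelism, pictured with cells labeled 1, 2, 3 in the figure, arises because cell (i, j) depends on (i-1, j) and (i, j-1): all cells on the same anti-diagonal d = i + j are independent and can run concurrently once earlier diagonals finish. A minimal Python sketch of that schedule (the function name and grid size are illustrative):

```python
# 2-D wave-front schedule: cell (i, j) depends on (i-1, j) and
# (i, j-1), so every cell on anti-diagonal d = i + j can be
# processed concurrently after diagonals 0 .. d-1 complete.
def wavefronts(rows, cols):
    fronts = []
    for d in range(rows + cols - 1):
        fronts.append([(i, d - i)
                       for i in range(rows)
                       if 0 <= d - i < cols])
    return fronts

# A 3x3 grid yields diagonals of sizes 1, 2, 3, 2, 1 -- the waves
# labeled 1, 2, 3 in the poster figure.
fronts = wavefronts(3, 3)
```

The available parallelism grows and then shrinks with the diagonal length, which is why wave-front codes such as Sweep3D are sensitive to communication latency and benefit from the point-to-point synchronization above.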
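"Packing of strided communication", listed under Current Optimizations, gathers a strided array section (for example, one row of a column-major Fortran array) into a contiguous buffer so it can be shipped as a single transfer instead of many small ones. A hedged Python sketch of the pack/unpack idea (pack and unpack are illustrative helper names, not compiler API):

```python
# Pack a strided section into a contiguous buffer, then scatter it
# back out at the target with the destination's own stride.
def pack(flat, start, stride, count):
    return [flat[start + k * stride] for k in range(count)]

def unpack(flat, buf, start, stride):
    for k, v in enumerate(buf):
        flat[start + k * stride] = v

# Column-major 4x3 array: element (i, j) lives at index i + 4*j,
# so row i = 1 is the strided section {1, 5, 9} with stride 4.
src = list(range(12))
buf = pack(src, 1, 4, 3)        # contiguous buffer for one transfer
dst = [0] * 12
unpack(dst, buf, 1, 4)          # scatter into the target array
```

One contiguous message amortizes per-transfer overhead, which matters most on interconnects like Myrinet and Quadrics where small strided transfers would otherwise dominate communication cost.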
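Spark98's kernel is a sparse matrix-vector multiply. As context for that benchmark, here is a minimal y = A*x in compressed sparse row (CSR) form, in Python; this is a generic sketch of the kernel class, not Spark98's actual code or data layout.

```python
# CSR sparse matrix-vector multiply: rowptr[i] .. rowptr[i+1] index
# the nonzeros of row i, stored in vals with column indices colind.
def csr_matvec(rowptr, colind, vals, x):
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(y)):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colind[k]]
    return y

# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
rowptr = [0, 2, 3, 5]
colind = [0, 2, 1, 0, 2]
vals = [2.0, 1.0, 3.0, 4.0, 5.0]
y = csr_matvec(rowptr, colind, vals, [1.0, 1.0, 1.0])
```

The irregular, index-driven accesses to x are what make data placement decisive for this kernel, consistent with the poster's observation that the application does not scale on NUMA architectures unless locality is taken into account.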