
Co-array Fortran: Compilation, Performance, Languages Issues


Presentation Transcript


    1. Co-array Fortran: Compilation, Performance, Languages Issues John Mellor-Crummey Cristian Coarfa Yuri Dotsenko Department of Computer Science Rice University

    2. Outline Co-array Fortran language recap Compilation approach Co-array storage management Communication A preliminary performance study Platforms Benchmarks and results and lessons Language refinement issues Conclusions

    3. CAF Language Assessment Strengths offloads communication management to the compiler choreographing data transfer managing mechanics of synchronization gives user full control of parallelization data movement and synchronization as language primitives amenable to compiler optimization array syntax supports natural user-level vectorization modest compiler technology can yield good performance more abstract than MPI => better performance portability Weaknesses user manages partitioning of work user specifies data movement user codes necessary synchronization

    4. Compiler Goals Portable compiler Multi-platform code generation High performance generated code

    5. Compilation Approach Source-to-source Translation Translate CAF into Fortran 90 + communication calls One-sided communication layer strided communication gather/scatter synchronization: barriers, notify/wait split phase non-blocking primitives Today: ARMCI: remote memory copy interface (Nieplocha @ PNL) Benefits wide portability leverage vendor F90 compilers for good node performance

    6. Co-array Data Co-array representation F90 pointer to data + opaque handle for communication layer Co-array access read/write local co-array data using F90 pointer dereference remote accesses translate into ARMCI GET/PUT calls Co-array allocation storage allocation by communication layer, as appropriate on shared memory hardware: in shared memory segment on Myrinet: in pinned memory for direct DMA access dope vector initialization using CHASM (Rasmussen @ LANL) set F90 pointer to point to externally managed memory Develop compiler analysis and code generation technology to choreograph communication and computation in parallel programs to deliver maximum performance
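
    As a concrete illustration, here is a minimal sketch (hypothetical names, not the compiler's actual declarations) of the per-co-array descriptor the translator might generate for REAL :: A(100,100)[*]; local accesses dereference the F90 pointer, while a remote reference such as A(i,j)[p] becomes a GET/PUT against the opaque handle for image p:

      type :: coarray_desc
         real, pointer :: local(:,:)    ! F90 pointer to this image's data
         integer(8)    :: comm_handle   ! opaque handle used by the communication layer (e.g., ARMCI)
      end type coarray_desc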

    7. Allocating Static Co-arrays (COMMON/SAVE) Compiler: generate a static initializer for each COMMON/SAVE variable. Linker: collect calls to all initializers, generate a global initializer that calls all the others, compile the global initializer and link it into the program. Launch: call the global initializer before the main program begins.
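
    A hedged sketch of this scheme, with hypothetical procedure names and a plain ALLOCATE standing in for allocation by the communication layer:

      module caf_static_init_sketch            ! translated form of  REAL, SAVE :: X(100)[*]
         implicit none
         real, pointer :: x_view(:) => null()  ! F90 pointer view of the co-array
      contains
         subroutine init_x()                   ! per-variable initializer emitted by the compiler
            allocate(x_view(100))              ! really: storage obtained from the communication layer
         end subroutine init_x

         subroutine caf_global_init()          ! synthesized at link time from the collected initializers
            call init_x()                      ! ... one call per static co-array in the program
         end subroutine caf_global_init
      end module caf_static_init_sketch

    At launch, caf_global_init is invoked once before the main program begins.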

    8. COMMON Block Sequence Association Problem: each procedure may have a different view of a common. Solution: allocate a contiguous pool of co-array storage per common; each procedure has a private set of view variables (F90 pointers); initialize all per-procedure view variables only once, at launch, after common allocation.
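
    A minimal sketch of this solution, assuming a single common block viewed as a(100) in one procedure and b(10,10) in another; the names are hypothetical, a plain ALLOCATE stands in for the runtime's pool allocation, and a Fortran 2003 rank-remapping pointer assignment stands in for the dope-vector manipulation the real translator performs:

      module common_view_sketch
         implicit none
         real, pointer :: pool(:)     => null()   ! contiguous co-array pool for the common block
         real, pointer :: a_view(:)   => null()   ! view used by the procedure declaring a(100)
         real, pointer :: b_view(:,:) => null()   ! view used by the procedure declaring b(10,10)
      contains
         subroutine init_views()                  ! called once at launch, after pool allocation
            allocate(pool(100))
            a_view => pool
            b_view(1:10,1:10) => pool             ! rank-remapping pointer assignment
         end subroutine init_views
      end module common_view_sketch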

    9. Porting to a New Compiler / Architecture Synthesize dope vectors for co-array storage compiler/architecture specific details: CHASM library Tailor communication to architecture design supports alternate communication libraries status today: ARMCI (PNL) ongoing work: compiler-tailored communication direct load/store on shared-memory architectures future: other portable libraries (e.g. GASNet) custom communication library for an architecture

    10. Supporting Multiple Co-dimensions A(:,:)[N,M,*] Add precomputed coefficients to co-array meta-data: lower and upper bounds for each co-dimension; this_image_cache for each co-dimension (e.g., this_image(a,1) yields my co-row index); cum_hyperplane_size for each co-dimension
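
    A hedged sketch of how the co-index for one co-dimension could be computed from this metadata (hypothetical names; the compiler caches the result per image):

      module codim_sketch
         implicit none
      contains
         ! Co-index of the calling image in co-dimension d, given per-co-dimension
         ! lower bounds, extents, and cumulative hyperplane sizes (the product of
         ! the extents of all lower co-dimensions).
         integer function my_coindex(me, d, lower, extent, cum_size)
            integer, intent(in) :: me                    ! linear image index, 1..num_images()
            integer, intent(in) :: d                     ! co-dimension of interest
            integer, intent(in) :: lower(:), extent(:), cum_size(:)
            my_coindex = lower(d) + mod((me - 1) / cum_size(d), extent(d))
         end function my_coindex
      end module codim_sketch

    For A(:,:)[N,M,*] and image me, this yields 1 + mod(me-1, N) for the first co-dimension and 1 + mod((me-1)/N, M) for the second.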

    11. Implementing Communication Given a statement X(1:n) = A(1:n)[p] + …, a temporary buffer is used for off-processor data: invoke the communication library to allocate tmp in suitable temporary storage; fill in a dope vector so tmp can be accessed as an F90 pointer; call the communication library to fill tmp (ARMCI GET); compute X(1:n) = tmp(1:n) + …; deallocate tmp
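
    A hedged sketch of the generated code for this statement, assuming the elided "…" is a local array Y(1:n); the ARMCI call is indicated by a comment, and a plain ALLOCATE stands in for allocation in communication-layer storage:

      subroutine translated_stmt_sketch(x, y, n, p)
         implicit none
         integer, intent(in) :: n, p
         real, intent(inout) :: x(:)
         real, intent(in)    :: y(:)
         real, pointer :: tmp(:)

         allocate(tmp(n))              ! really: buffer from the communication layer,
                                       ! with its dope vector filled in for F90 access
         ! <one-sided GET of A(1:n) on image p into tmp goes here>
         x(1:n) = tmp(1:n) + y(1:n)    ! local computation on the fetched copy
         deallocate(tmp)
      end subroutine translated_stmt_sketch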

    13. Supported Features Declarations co-objects: scalars and arrays COMMON and SAVE co-objects of primitive types INTEGER(4), REAL(4) and REAL(8) COMMON blocks: variables and co-objects intermixed co-objects with multiple co-dimensions procedure interface blocks with co-array arguments Executable code array section notation for co-array data indices local and remote co-arrays co-array argument passing co-array dummy arguments require explicit interface co-array pointer + communication handle co-array reshaping supported CAF intrinsics Image inquiry: this_image(…), num_images() Synchronization: sync_all, sync_team, sync_notify, sync_wait

    14. Coming Attractions Allocatable co-arrays REAL(8), ALLOCATABLE :: X(:)[*] ALLOCATE(X(MYX_NUM)[*]) Co-arrays of user-defined types Allocatable co-array components user defined type with pointer components Triplets in co-dimensions A(j,k)[p+1:p+4]

    16. A Preliminary Performance Study Platforms Alpha+Quadrics QSNet (Elan3) Itanium2+Quadrics QSNet II (Elan4) Itanium2+Myrinet 2000 Codes NAS Parallel Benchmarks (NPB) from NASA Ames

    17. Alpha+Quadrics Platform (Lemieux) Nodes: 750 Compaq AlphaServer ES45 nodes, 4-way 1-GHz Alpha EV6.8 (21264C), 64KB/8MB L1/L2 cache 4 GB RAM/node Interconnect: Quadrics QSNet (Elan3) 340 MB/s peak and 210 MB/s sustained x 2 rails Operating System: Tru64 UNIX 5.1A, SC 2.5 Compiler: HP Fortran Compiler V5.5A Communication Middleware: ARMCI 1.1-beta

    18. Itanium2+Quadrics Platform (PNNL) Nodes: 944 HP Longs Peak dual-CPU workstations 1.5GHz Itanium2 32KB/256KB/6MB L1/L2/L3 cache 6GB RAM/node Interconnect: Quadrics QSNet II 905 MB/s Operating System: Red Hat Linux, 2.4.20 Compiler: Intel Fortran Compiler v7.1 Communication Middleware: ARMCI 1.1-beta

    19. Itanium2+Myrinet Platform (Rice) Nodes: 96 HP zx6000 dual-CPU workstations 900MHz Itanium2 32KB/256KB/1.5MB L1/L2/L3 cache 4GB RAM/node Interconnect: Myrinet 2000 240 MB/s GM version 1.6.5 MPICH-GM version 1.2.5 Operating System: Red Hat Linux, 2.4.18 + patches Compiler: Intel Fortran Compiler v7.1 Communication Middleware: ARMCI 1.1-beta

    20. NAS Parallel Benchmarks (NPB) 2.3 Benchmarks by NASA Ames 2-3K lines each (Fortran 77) Widely used to test parallel compiler performance NAS versions: NPB2.3b2: hand-coded MPI NPB2.3-serial: serial code extracted from MPI version Our version NPB2.3-CAF: CAF implementation, based on the MPI version, preserving its parallelization

    21. NAS BT Block tridiagonal solve of 3D Navier-Stokes Dense matrix Parallelization: alternating line sweeps along 3 dimensions multipartitioning data distribution for full parallelism MPI implementation asynchronous send/receive communication/computation overlap CAF communication strided blocks transferred using vector PUTs (triplet notation) no user-declared communication buffers Large messages, relatively infrequent communication

    22. NAS BT Efficiency (Class C)

    23. NAS SP Scalar pentadiagonal solve of 3D Navier-Stokes Dense matrix Parallelization: alternating line sweeps along 3 dimensions multipartitioning data distribution for full parallelism MPI implementation asynchronous send/receive communication/computation overlap CAF communication pack into buffer; separate buffer for each plane of sweep transfer using PUTs smaller, more frequent messages; 1.5x the communication of BT

    24. NAS SP Efficiency (Class C)

    25. NAS MG 3D Multigrid solver with periodic boundary conditions Dense matrix Grid size and levels are compile-time constants Communication nearest neighbor with possibly 6 neighbors MPI asynchronous send/receive CAF pairwise sync_notify/sync_wait to coordinate with neighbors four communication buffers (co-arrays) used: 2 sender, 2 receiver pack and transfer contiguous data using PUTs for each dimension: notify my neighbors that my buffers are free; wait for my neighbors to notify me their buffers are free; PUT data into right buffer, notify neighbor; PUT data into left buffer, notify neighbor; wait for both to complete
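
    A hedged sketch of this per-dimension exchange, with hypothetical buffer names and neighbor image indices; the packing of faces into the send buffers is omitted:

      subroutine mg_exchange_sketch(m, nbr_left, nbr_right)
         implicit none
         integer, parameter  :: nmax = 1024
         integer, intent(in) :: m, nbr_left, nbr_right
         real, save :: send_left(nmax), send_right(nmax)                  ! packed outgoing faces
         real, save :: recv_from_left(nmax)[*], recv_from_right(nmax)[*]  ! co-array receive buffers

         call sync_notify(nbr_left)     ! tell both neighbors my receive buffers are free
         call sync_notify(nbr_right)
         call sync_wait(nbr_left)       ! wait until their receive buffers are free
         call sync_wait(nbr_right)
         recv_from_left(1:m)[nbr_right] = send_right(1:m)    ! PUT into right neighbor's buffer
         call sync_notify(nbr_right)
         recv_from_right(1:m)[nbr_left] = send_left(1:m)     ! PUT into left neighbor's buffer
         call sync_notify(nbr_left)
         call sync_wait(nbr_left)       ! wait until both neighbors' data has arrived
         call sync_wait(nbr_right)
      end subroutine mg_exchange_sketch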

    26. NAS MG Efficiency (Class C)

    27. NAS LU Solve 3D Navier-Stokes using SSOR Dense matrix Parallelization on a power-of-two number of processors repeated decompositions on x and y until all processors assigned wavefront parallelism; small messages, 5 words each MPI implementation asynchronous send/receive communication/computation overlap CAF two-dimensional co-arrays morphed code to pack data for higher communication efficiency uses PUTs

    28. NAS LU Efficiency (Class C)

    29. NAS CG Conjugate gradient solve to compute an eigenvector of a large, sparse, symmetric, positive definite matrix MPI Irregular point-to-point messaging CAF: structure follows MPI Irregular notify/wait vector assignments for data transfer No communication/computation overlap for either

    30. NAS CG Efficiency (Class C) Conjugate gradient solver based on sparse-matrix vector multiply CAF implementation Converted two-sided MPI communication into vectorized one-sided communication and calls to notify/wait

    31. CAF GET vs. PUT Communication Definitions GET: q_caf(n1:n2) = w(m1:m2)[reduce_exch_proc_noncaf(i)] PUT: q_caf(n1:n2)[reduce_exch_proc_noncaf(i)] = w(m1:m2) Study 64 procs, NAS CG class C Alpha+Quadrics Elan3 (Lemieux) Performance GET: 12.9% slower than MPI PUT: 4.0% slower than MPI

    32. Experiments Summary On cluster-based architectures, to achieve best performance with CAF, a user or compiler must vectorize (and perhaps aggregate) communication reduce synchronization strength replace all-to-all with point-to-point where sensible overlap communication with computation convert GETs into PUTs where GET is not a hardware primitive consider memory layout conflicts: co-array vs. regular data generate code amenable to back-end compiler optimizations CAF language: many optimizations possible at the source level (see the sketch below) Compiler optimizations NECESSARY for portable coding style might need user hints where synchronization analysis falls short Runtime issues on Myrinet pin co-array memory for direct transfers
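
    A small sketch of the first two source-level transformations above (vectorizing fine-grained remote reads and recasting a GET as a PUT), using hypothetical co-arrays a and x:

      subroutine vectorize_sketch(n, p)
         implicit none
         integer, parameter  :: nmax = 4096
         integer, intent(in) :: n, p
         real, save :: a(nmax)[*], x(nmax)[*]
         integer :: i

         ! Fine-grained: one small GET per iteration, slow on cluster interconnects
         do i = 1, n
            x(i) = a(i)[p]
         end do

         ! Vectorized: a single bulk GET using triplet notation
         x(1:n) = a(1:n)[p]

         ! GET converted to a PUT: the image that owns the source data executes
         ! the transfer instead, pushing into its partner's x (shown schematically):
         ! x(1:n)[partner] = a(1:n)
      end subroutine vectorize_sketch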

    33. CAF Language Refinement Issues Initial implementations on Cray T3E and X1 led to features not suited for distributed memory platforms Key problems and solution suggestions Restrictive memory fence semantics for procedure calls pragmas to enable programmer to overlap one-sided communication with procedure calls Overly restrictive synchronization primitives add unidirectional, point-to-point synchronization rework team model (next slide) No collective operations Leads to home-brew non-portable implementations add CAF intrinsics for reductions, broadcast, etc.

    34. CAF Language Refinement Issues CAF dynamic teams don't scale pre-arranged “communicator-like” teams would help collectives: O(log P) rather than O(P^2) reordering logical numbering of images for topology add shape information to image teams? Blocking communication reduces scalability user mechanisms to delay completion to enable overlap? Synchronization is not paired with data movement synchronization hint tags to help analysis synchronization tags at run-time to track completion? How relaxed should the memory model be for performance?

    35. Conclusions Tuned CAF performance is comparable to tuned MPI even without compiler-based communication optimizations! CAF programming model enables source-level optimization communication vectorization synchronization strength reduction achieve performance today rather than waiting for tomorrow’s compilers CAF is amenable to compiler analysis and optimization significant communication optimization is feasible, unlike for MPI optimizing compilers will help a wider range of programs achieve high performance applications can be tailored to fully exploit architectural characteristics e.g., shared memory vs. distributed memory vs. hybrid However, more abstract programming models would simplify code development (e.g. HPF)

    36. Project URL http://www.hipersoft.rice.edu/caf


    41. Parallel Programming Models Goals: Expressiveness Ease of use Performance Portability Current models: OpenMP: difficult to map onto distributed memory platforms HPF: difficult to obtain high performance on a broad range of programs MPI: de facto standard; hard to program, assumptions about communication granularity are hard-coded UPC: global address space language; similar to CAF but with location transparency

    42. Finite Element Example (Numrich)

      subroutine assemble(start, prin, ghost, neib, x)
        integer :: start(:), prin(:), ghost(:), neib(:), k1, k2, p
        real :: x(:)[*]
        call sync_all(neib)
        do p = 1, size(neib)                 ! Add contributions from neighbors
          k1 = start(p); k2 = start(p+1)-1
          x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
        enddo
        call sync_all(neib)
        do p = 1, size(neib)                 ! Update the neighbors
          k1 = start(p); k2 = start(p+1)-1
          x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
        enddo
        call sync_all
      end subroutine assemble

    This slide presents a gather-scatter operation; this code is part of an irregular application. What I want to show is that it can be expressed compactly in Co-array Fortran (as you can see, it all fits on one slide); that's an example of the language's expressiveness.

    43. Communicating Private Data Example REAL :: A(100,100)[*], B(100) A(:,j)[p] = B(:) Issue: B is a local array that is sent to a partner; this requires a copy into shared space before the transfer; for higher efficiency we want B in shared storage. Alternatives: declare communicated arrays as co-arrays, or add a communicated attribute to B's declaration to mark it for allocation in shared storage.
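
    A sketch of the first alternative, staging B through a co-array buffer so that the source of the transfer already lives in shared (registered) storage; b_buf is a hypothetical name:

      subroutine send_column_sketch(j, p)
         implicit none
         integer, intent(in) :: j, p
         real, save :: a(100,100)[*]
         real, save :: b(100)          ! private data to communicate
         real, save :: b_buf(100)[*]   ! co-array staging buffer

         b_buf(:) = b(:)               ! explicit local copy into shared storage
         a(:,j)[p] = b_buf(:)          ! one-sided transfer sourced from registered memory
      end subroutine send_column_sketch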

    44. Passing Co-arrays as Arguments Language restriction: pass co-arrays by whole array REAL :: A(100,100)[*] CALL FOO(A) Callee must declare an explicit subroutine interface Proposed option: F90 assumed shape co-array arguments Allow passing of Fortran 90 style array sections of local co-array REAL :: A(100,100)[*] CALL FOO(A(1:10:2,3:25)) Callee must declare an explicit subroutine interface If matching dummy argument is declared as a co-array, then Must declare assumed size data dimensions Must declare assumed size co-dimensions Avoids copy-in, copy-out for co-array data
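
    A minimal sketch of the whole-array case above, showing the explicit interface the caller must see; the co-array dummy has an explicit data shape and an assumed co-size:

      program pass_coarray_sketch
         implicit none
         interface
            subroutine foo(a)
               real :: a(100,100)[*]   ! explicit data shape, assumed co-size
            end subroutine foo
         end interface
         real :: a(100,100)[*]
         call foo(a)
      end program pass_coarray_sketch

      subroutine foo(a)
         implicit none
         real :: a(100,100)[*]
         a(1,1) = 0.0
      end subroutine foo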

    45. Co-array Fortran (CAF) Explicitly-parallel extension of Fortran 90/95 defined by Numrich & Reid Global address space SPMD parallel programming model one-sided communication Simple, two-level memory model for locality management local vs. remote memory Programmer control over performance critical decisions data partitioning communication Suitable for mapping to a range of parallel architectures shared memory, message passing, hybrid, PIM The Co-Array Fortran (abbreviated CAF) language is an explicitly-parallel extension of Fortran 90/95, developed by Numrich & Reid; it proposes a global address space SPMD parallel programming model with one-sided communication. CAF uses a simple, two-level memory model that supports locality management; namely, it distinguishes between local and remote memory. In CAF the programmer has control over decisions such as data partitioning and communication. One of the goals of Co-array Fortran is portable performance; the language is suitable for a wide range of parallel architectures, such as shared memory, message passing, clusters of SMPs and PIM. It belongs to the same language family as Unified Parallel C (UPC) and Titanium.

    46. CAF Programming Model Features SPMD process images fixed number of images during execution images operate asynchronously Both private and shared data real x(20, 20) a private 20x20 array in each image real y(20, 20) [*] a shared 20x20 array in each image Simple one-sided shared-memory communication x(:,j:j+2) = y(:,p:p+2) [r] copy columns p:p+2 from image r into local columns j:j+2 Synchronization intrinsic functions sync_all – a barrier and a memory fence sync_mem – a memory fence sync_team([notify], [wait]) notify = a vector of process ids to signal wait = a vector of process ids to wait for, a subset of notify Pointers and (perhaps asymmetric) dynamic allocation Parallel I/O A running CAF program consists of a fixed number of process images that operate asynchronously. The locality of the data is made explicit by the language: x is declared as a private 20x20 array in each image, while y, by the use of the [*], is defined as a shared 20x20 array in each image. Communication is realized by means of the co-arrays, using the bracket notation to specify remote image numbers.
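
    Putting these features together, a minimal illustrative sketch of a complete CAF program in this model:

      program caf_model_sketch
         implicit none
         real    :: x(20,20)       ! private: an independent copy on each image
         real    :: y(20,20)[*]    ! shared: a 20x20 co-array instance on every image
         integer :: me

         me = this_image()
         x  = real(me)             ! purely local computation
         y  = x                    ! local write to my own co-array instance
         call sync_all()           ! barrier and memory fence before touching remote data
         if (me == 1 .and. num_images() >= 2) then
            x(:,1:3) = y(:,5:7)[2] ! one-sided read of columns 5:7 from image 2
         end if
         call sync_all()
      end program caf_model_sketch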
