
Co-array Fortran: Compilation, Performance, Languages Issues


Presentation Transcript


    1. Co-array Fortran: Compilation, Performance, Languages Issues John Mellor-Crummey Cristian Coarfa Yuri Dotsenko Department of Computer Science Rice University

    2. Outline Co-array Fortran language recap Compilation approach Co-array storage management Communication A preliminary performance study Platforms Benchmarks and results and lessons Language refinement issues Conclusions

    3. CAF Language Assessment Strengths offloads communication management to the compiler choreographing data transfer managing mechanics of synchronization gives user full control of parallelization data movement and synchronization as language primitives amenable to compiler optimization array syntax supports natural user-level vectorization modest compiler technology can yield good performance more abstract than MPI => better performance portability Weaknesses user manages partitioning of work user specifies data movement user codes necessary synchronization

    4. Compiler Goals Portable compiler Multi-platform code generation High performance generated code

    5. Compilation Approach Source-to-source Translation Translate CAF into Fortran 90 + communication calls One-sided communication layer strided communication gather/scatter synchronization: barriers, notify/wait split phase non-blocking primitives Today: ARMCI: remote memory copy interface (Nieplocha @ PNL) Benefits wide portability leverage vendor F90 compilers for good node performance

    6. Co-array Data Co-array representation F90 pointer to data + opaque handle for communication layer Co-array access read/write local co-array data using F90 pointer dereference remote accesses translate into ARMCI GET/PUT calls Co-array allocation storage allocation by communication layer, as appropriate on shared memory hardware: in shared memory segment on Myrinet: in pinned memory for direct DMA access dope vector initialization using CHASM (Rasmussen @ LANL) set F90 pointer to point to externally managed memory Develop compiler analysis and code generation technology to choreograph communication and computation in parallel programs to deliver maximum performance
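
    As a concrete illustration, here is a minimal sketch (hypothetical names, not the compiler's actual declarations) of the per-co-array descriptor the translator might generate for REAL :: A(100,100)[*]; local accesses dereference the F90 pointer, while a remote reference such as A(i,j)[p] becomes a GET/PUT against the opaque handle for image p:

      type :: coarray_desc
         real, pointer :: local(:,:)    ! F90 pointer to this image's data
         integer(8)    :: comm_handle   ! opaque handle used by the communication layer (e.g., ARMCI)
      end type coarray_desc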

    7. Allocating Static Co-arrays (COMMON/SAVE) Compiler: generate a static initializer for each COMMON/SAVE variable. Linker: collect calls to all initializers, generate a global initializer that calls all the others, compile the global initializer and link it into the program. Launch: call the global initializer before the main program begins.
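
    A hedged sketch of this scheme, with hypothetical procedure names and a plain ALLOCATE standing in for allocation by the communication layer:

      module caf_static_init_sketch            ! translated form of  REAL, SAVE :: X(100)[*]
         implicit none
         real, pointer :: x_view(:) => null()  ! F90 pointer view of the co-array
      contains
         subroutine init_x()                   ! per-variable initializer emitted by the compiler
            allocate(x_view(100))              ! really: storage obtained from the communication layer
         end subroutine init_x

         subroutine caf_global_init()          ! synthesized at link time from the collected initializers
            call init_x()                      ! ... one call per static co-array in the program
         end subroutine caf_global_init
      end module caf_static_init_sketch

    At launch, caf_global_init is invoked once before the main program begins.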

    8. COMMON Block Sequence Association Problem: each procedure may have a different view of a common. Solution: allocate a contiguous pool of co-array storage per common; each procedure has a private set of view variables (F90 pointers); initialize all per-procedure view variables only once, at launch, after common allocation.
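
    A minimal sketch of this solution, assuming a single common block viewed as a(100) in one procedure and b(10,10) in another; the names are hypothetical, a plain ALLOCATE stands in for the runtime's pool allocation, and a Fortran 2003 rank-remapping pointer assignment stands in for the dope-vector manipulation the real translator performs:

      module common_view_sketch
         implicit none
         real, pointer :: pool(:)     => null()   ! contiguous co-array pool for the common block
         real, pointer :: a_view(:)   => null()   ! view used by the procedure declaring a(100)
         real, pointer :: b_view(:,:) => null()   ! view used by the procedure declaring b(10,10)
      contains
         subroutine init_views()                  ! called once at launch, after pool allocation
            allocate(pool(100))
            a_view => pool
            b_view(1:10,1:10) => pool             ! rank-remapping pointer assignment
         end subroutine init_views
      end module common_view_sketch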

    9. Porting to a New Compiler / Architecture Synthesize dope vectors for co-array storage compiler/architecture specific details: CHASM library Tailor communication to architecture design supports alternate communication libraries status today: ARMCI (PNL) ongoing work: compiler-tailored communication direct load/store on shared-memory architectures future: other portable libraries (e.g. GASNet) custom communication library for an architecture

    10. Supporting Multiple Co-dimensions A(:,:)[N,M,*] Add precomputed coefficients to co-array meta-data: lower and upper bounds for each co-dimension; this_image_cache for each co-dimension (e.g., this_image(a,1) yields my co-row index); cum_hyperplane_size for each co-dimension
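
    A hedged sketch of how the co-index for one co-dimension could be computed from this metadata (hypothetical names; the compiler caches the result per image):

      module codim_sketch
         implicit none
      contains
         ! Co-index of the calling image in co-dimension d, given per-co-dimension
         ! lower bounds, extents, and cumulative hyperplane sizes (the product of
         ! the extents of all lower co-dimensions).
         integer function my_coindex(me, d, lower, extent, cum_size)
            integer, intent(in) :: me                    ! linear image index, 1..num_images()
            integer, intent(in) :: d                     ! co-dimension of interest
            integer, intent(in) :: lower(:), extent(:), cum_size(:)
            my_coindex = lower(d) + mod((me - 1) / cum_size(d), extent(d))
         end function my_coindex
      end module codim_sketch

    For A(:,:)[N,M,*] and image me, this yields 1 + mod(me-1, N) for the first co-dimension and 1 + mod((me-1)/N, M) for the second.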

    11. Implementing Communication Given a statement X(1:n) = A(1:n)[p] + …, a temporary buffer is used for off-processor data: invoke the communication library to allocate tmp in suitable temporary storage; fill in a dope vector so tmp can be accessed as an F90 pointer; call the communication library to fill tmp (ARMCI GET); compute X(1:n) = tmp(1:n) + …; deallocate tmp
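
    A hedged sketch of the generated code for this statement, assuming the elided "…" is a local array Y(1:n); the ARMCI call is indicated by a comment, and a plain ALLOCATE stands in for allocation in communication-layer storage:

      subroutine translated_stmt_sketch(x, y, n, p)
         implicit none
         integer, intent(in) :: n, p
         real, intent(inout) :: x(:)
         real, intent(in)    :: y(:)
         real, pointer :: tmp(:)

         allocate(tmp(n))              ! really: buffer from the communication layer,
                                       ! with its dope vector filled in for F90 access
         ! <one-sided GET of A(1:n) on image p into tmp goes here>
         x(1:n) = tmp(1:n) + y(1:n)    ! local computation on the fetched copy
         deallocate(tmp)
      end subroutine translated_stmt_sketch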

    13. Supported Features Declarations co-objects: scalars and arrays COMMON and SAVE co-objects of primitive types INTEGER(4), REAL(4) and REAL(8) COMMON blocks: variables and co-objects intermixed co-objects with multiple co-dimensions procedure interface blocks with co-array arguments Executable code array section notation for co-array data indices local and remote co-arrays co-array argument passing co-array dummy arguments require explicit interface co-array pointer + communication handle co-array reshaping supported CAF intrinsics Image inquiry: this_image(…), num_images() Synchronization: sync_all, sync_team, sync_notify, sync_wait

    14. Coming Attractions Allocatable co-arrays REAL(8), ALLOCATABLE :: X(:)[*] ALLOCATE(X(MYX_NUM)[*]) Co-arrays of user-defined types Allocatable co-array components user defined type with pointer components Triplets in co-dimensions A(j,k)[p+1:p+4]

    16. A Preliminary Performance Study Platforms Alpha+Quadrics QSNet (Elan3) Itanium2+Quadrics QSNet II (Elan4) Itanium2+Myrinet 2000 Codes NAS Parallel Benchmarks (NPB) from NASA Ames

    17. Alpha+Quadrics Platform (Lemieux) Nodes: 750 Compaq AlphaServer ES45 nodes, 4-way 1-GHz Alpha EV6.8 (21264C), 64KB/8MB L1/L2 cache 4 GB RAM/node Interconnect: Quadrics QSNet (Elan3) 340 MB/s peak and 210 MB/s sustained x 2 rails Operating System: Tru64 UNIX 5.1A, SC 2.5 Compiler: HP Fortran Compiler V5.5A Communication Middleware: ARMCI 1.1-beta

    18. Itanium2+Quadrics Platform (PNNL) Nodes: 944 HP Longs Peak dual-CPU workstations 1.5GHz Itanium2 32KB/256KB/6MB L1/L2/L3 cache 6GB RAM/node Interconnect: Quadrics QSNet II 905 MB/s Operating System: Red Hat Linux, 2.4.20 Compiler: Intel Fortran Compiler v7.1 Communication Middleware: ARMCI 1.1-beta

    19. Itanium2+Myrinet Platform (Rice) Nodes: 96 HP zx6000 dual-CPU workstations 900MHz Itanium2 32KB/256KB/1.5MB L1/L2/L3 cache 4GB RAM/node Interconnect: Myrinet 2000 240 MB/s GM version 1.6.5 MPICH-GM version 1.2.5 Operating System: Red Hat Linux, 2.4.18 + patches Compiler: Intel Fortran Compiler v7.1 Communication Middleware: ARMCI 1.1-beta

    20. NAS Parallel Benchmarks (NPB) 2.3 Benchmarks by NASA Ames 2-3K lines each (Fortran 77) Widely used to test parallel compiler performance NAS versions: NPB2.3b2: hand-coded MPI NPB2.3-serial: serial code extracted from MPI version Our version NPB2.3-CAF: CAF implementation, based on the MPI version, preserving its parallelization

    21. NAS BT Block tridiagonal solve of 3D Navier-Stokes Dense matrix Parallelization: alternating line sweeps along 3 dimensions multipartitioning data distribution for full parallelism MPI implementation asynchronous send/receive communication/computation overlap CAF communication strided blocks transferred using vector PUTs (triplet notation) no user-declared communication buffers Large messages, relatively infrequent communication

    22. NAS BT Efficiency (Class C)

    23. NAS SP Scalar pentadiagonal solve of 3D Navier-Stokes Dense matrix Parallelization: alternating line sweeps along 3 dimensions multipartitioning data distribution for full parallelism MPI implementation asynchronous send/receive communication/computation overlap CAF communication pack into buffer; separate buffer for each plane of sweep transfer using PUTs smaller, more frequent messages; 1.5x the communication of BT

    24. NAS SP Efficiency (Class C)

    25. NAS MG 3D Multigrid solver with periodic boundary conditions Dense matrix Grid size and levels are compile-time constants Communication nearest neighbor with possibly 6 neighbors MPI asynchronous send/receive CAF pairwise sync_notify/sync_wait to coordinate with neighbors four communication buffers (co-arrays) used: 2 sender, 2 receiver pack and transfer contiguous data using PUTs for each dimension: notify my neighbors that my buffers are free; wait for my neighbors to notify me their buffers are free; PUT data into right buffer, notify neighbor; PUT data into left buffer, notify neighbor; wait for both to complete
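
    A hedged sketch of this per-dimension exchange, with hypothetical buffer names and neighbor image indices; the packing of faces into the send buffers is omitted:

      subroutine mg_exchange_sketch(m, nbr_left, nbr_right)
         implicit none
         integer, parameter  :: nmax = 1024
         integer, intent(in) :: m, nbr_left, nbr_right
         real, save :: send_left(nmax), send_right(nmax)                  ! packed outgoing faces
         real, save :: recv_from_left(nmax)[*], recv_from_right(nmax)[*]  ! co-array receive buffers

         call sync_notify(nbr_left)     ! tell both neighbors my receive buffers are free
         call sync_notify(nbr_right)
         call sync_wait(nbr_left)       ! wait until their receive buffers are free
         call sync_wait(nbr_right)
         recv_from_left(1:m)[nbr_right] = send_right(1:m)    ! PUT into right neighbor's buffer
         call sync_notify(nbr_right)
         recv_from_right(1:m)[nbr_left] = send_left(1:m)     ! PUT into left neighbor's buffer
         call sync_notify(nbr_left)
         call sync_wait(nbr_left)       ! wait until both neighbors' data has arrived
         call sync_wait(nbr_right)
      end subroutine mg_exchange_sketch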

    26. NAS MG Efficiency (Class C)

    27. NAS LU Solve 3D Navier-Stokes using SSOR Dense matrix Parallelization on a power-of-two number of processors repeated decompositions on x and y until all processors assigned wavefront parallelism; small messages, 5 words each MPI implementation asynchronous send/receive communication/computation overlap CAF two-dimensional co-arrays morphed code to pack data for higher communication efficiency uses PUTs

    28. NAS LU Efficiency (Class C)

    29. NAS CG Conjugate gradient solve to compute an eigenvector of a large, sparse, symmetric, positive definite matrix MPI Irregular point-to-point messaging CAF: structure follows MPI Irregular notify/wait vector assignments for data transfer No communication/computation overlap for either

    30. NAS CG Efficiency (Class C) Conjugate gradient solver based on sparse-matrix vector multiply CAF implementation Converted two-sided MPI communication into vectorized one-sided communication and calls to notify/wait

    31. CAF GET vs. PUT Communication Definitions GET: q_caf(n1:n2) = w(m1:m2)[reduce_exch_proc_noncaf(i)] PUT: q_caf(n1:n2)[reduce_exch_proc_noncaf(i)] = w(m1:m2) Study 64 procs, NAS CG class C Alpha+Quadrics Elan3 (Lemieux) Performance GET: 12.9% slower than MPI PUT: 4.0% slower than MPI

    32. Experiments Summary On cluster-based architectures, to achieve best performance with CAF, a user or compiler must vectorize (and perhaps aggregate) communication reduce synchronization strength replace all-to-all with point-to-point where sensible overlap communication with computation convert GETs into PUTs where GET is not a hardware primitive consider memory layout conflicts: co-array vs. regular data generate code amenable to back-end compiler optimizations CAF language: many optimizations possible at the source level (see the sketch below) Compiler optimizations NECESSARY for portable coding style might need user hints where synchronization analysis falls short Runtime issues on Myrinet pin co-array memory for direct transfers
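
    A small sketch of the first two source-level transformations above (vectorizing fine-grained remote reads and recasting a GET as a PUT), using hypothetical co-arrays a and x:

      subroutine vectorize_sketch(n, p)
         implicit none
         integer, parameter  :: nmax = 4096
         integer, intent(in) :: n, p
         real, save :: a(nmax)[*], x(nmax)[*]
         integer :: i

         ! Fine-grained: one small GET per iteration, slow on cluster interconnects
         do i = 1, n
            x(i) = a(i)[p]
         end do

         ! Vectorized: a single bulk GET using triplet notation
         x(1:n) = a(1:n)[p]

         ! GET converted to a PUT: the image that owns the source data executes
         ! the transfer instead, pushing into its partner's x (shown schematically):
         ! x(1:n)[partner] = a(1:n)
      end subroutine vectorize_sketch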

    33. CAF Language Refinement Issues Initial implementations on Cray T3E and X1 led to features not suited for distributed memory platforms Key problems and solution suggestions Restrictive memory fence semantics for procedure calls pragmas to enable programmer to overlap one-sided communication with procedure calls Overly restrictive synchronization primitives add unidirectional, point-to-point synchronization rework team model (next slide) No collective operations Leads to home-brew non-portable implementations add CAF intrinsics for reductions, broadcast, etc.

    34. CAF Language Refinement Issues CAF dynamic teams don't scale pre-arranged “communicator-like” teams would help collectives: O(log P) rather than O(P^2) reordering logical numbering of images for topology add shape information to image teams? Blocking communication reduces scalability user mechanisms to delay completion to enable overlap? Synchronization is not paired with data movement synchronization hint tags to help analysis synchronization tags at run-time to track completion? How relaxed should the memory model be for performance?

    35. Conclusions Tuned CAF performance is comparable to tuned MPI even without compiler-based communication optimizations! CAF programming model enables source-level optimization communication vectorization synchronization strength reduction achieve performance today rather than waiting for tomorrow’s compilers CAF is amenable to compiler analysis and optimization significant communication optimization is feasible, unlike for MPI optimizing compilers will help a wider range of programs achieve high performance applications can be tailored to fully exploit architectural characteristics e.g., shared memory vs. distributed memory vs. hybrid However, more abstract programming models would simplify code development (e.g. HPF)

    36. Project URL http://www.hipersoft.rice.edu/caf


    41. Parallel Programming Models Goals: Expressiveness Ease of use Performance Portability Current models: OpenMP: difficult to map onto distributed memory platforms HPF: difficult to obtain high performance on a broad range of programs MPI: de facto standard; hard to program, assumptions about communication granularity are hard-coded UPC: global address space language; similar to CAF but with location transparency

    42. Finite Element Example (Numrich)

      subroutine assemble(start, prin, ghost, neib, x)
        integer :: start(:), prin(:), ghost(:), neib(:), k1, k2, p
        real :: x(:)[*]
        call sync_all(neib)
        do p = 1, size(neib)                 ! Add contributions from neighbors
          k1 = start(p); k2 = start(p+1)-1
          x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2))[neib(p)]
        enddo
        call sync_all(neib)
        do p = 1, size(neib)                 ! Update the neighbors
          k1 = start(p); k2 = start(p+1)-1
          x(ghost(k1:k2))[neib(p)] = x(prin(k1:k2))
        enddo
        call sync_all
      end subroutine assemble

    This slide presents a gather-scatter operation; this code is part of an irregular application. What I want to show is that it can be expressed compactly in Co-array Fortran (as you can see, it all fits on one slide); that's an example of the language's expressiveness.

    43. Communicating Private Data Example REAL :: A(100,100)[*], B(100) A(:,j)[p] = B(:) Issue: B is a local array that is sent to a partner; this requires a copy into shared space before the transfer; for higher efficiency we want B in shared storage. Alternatives: declare communicated arrays as co-arrays, or add a communicated attribute to B's declaration to mark it for allocation in shared storage.
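
    A sketch of the first alternative, staging B through a co-array buffer so that the source of the transfer already lives in shared (registered) storage; b_buf is a hypothetical name:

      subroutine send_column_sketch(j, p)
         implicit none
         integer, intent(in) :: j, p
         real, save :: a(100,100)[*]
         real, save :: b(100)          ! private data to communicate
         real, save :: b_buf(100)[*]   ! co-array staging buffer

         b_buf(:) = b(:)               ! explicit local copy into shared storage
         a(:,j)[p] = b_buf(:)          ! one-sided transfer sourced from registered memory
      end subroutine send_column_sketch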

    44. Passing Co-arrays as Arguments Language restriction: pass co-arrays by whole array REAL :: A(100,100)[*] CALL FOO(A) Callee must declare an explicit subroutine interface Proposed option: F90 assumed shape co-array arguments Allow passing of Fortran 90 style array sections of local co-array REAL :: A(100,100)[*] CALL FOO(A(1:10:2,3:25)) Callee must declare an explicit subroutine interface If matching dummy argument is declared as a co-array, then Must declare assumed size data dimensions Must declare assumed size co-dimensions Avoids copy-in, copy-out for co-array data
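
    A minimal sketch of the whole-array case above, showing the explicit interface the caller must see; the co-array dummy has an explicit data shape and an assumed co-size:

      program pass_coarray_sketch
         implicit none
         interface
            subroutine foo(a)
               real :: a(100,100)[*]   ! explicit data shape, assumed co-size
            end subroutine foo
         end interface
         real :: a(100,100)[*]
         call foo(a)
      end program pass_coarray_sketch

      subroutine foo(a)
         implicit none
         real :: a(100,100)[*]
         a(1,1) = 0.0
      end subroutine foo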

    45. Co-array Fortran (CAF) Explicitly-parallel extension of Fortran 90/95 defined by Numrich & Reid Global address space SPMD parallel programming model one-sided communication Simple, two-level memory model for locality management local vs. remote memory Programmer control over performance critical decisions data partitioning communication Suitable for mapping to a range of parallel architectures shared memory, message passing, hybrid, PIM The Co-Array Fortran (abbreviated CAF) language is an explicitly-parallel extension of Fortran 90/95, developed by Numrich & Reid; it proposes a global address space SPMD parallel programming model with one-sided communication. CAF uses a simple, two-level memory model that supports locality management; namely, it distinguishes between local and remote memory. In CAF the programmer has control over decisions such as data partitioning and communication. One of the goals of Co-array Fortran is portable performance; the language is suitable for a wide range of parallel architectures, such as shared memory, message passing, clusters of SMPs and PIM. It belongs to the same language family as Unified Parallel C (UPC) and Titanium.

    46. CAF Programming Model Features SPMD process images fixed number of images during execution images operate asynchronously Both private and shared data real x(20, 20) a private 20x20 array in each image real y(20, 20) [*] a shared 20x20 array in each image Simple one-sided shared-memory communication x(:,j:j+2) = y(:,p:p+2) [r] copy columns p:p+2 from image r into local columns j:j+2 Synchronization intrinsic functions sync_all – a barrier and a memory fence sync_mem – a memory fence sync_team([notify], [wait]) notify = a vector of process ids to signal wait = a vector of process ids to wait for, a subset of notify Pointers and (perhaps asymmetric) dynamic allocation Parallel I/O A running CAF program consists of a fixed number of process images that operate asynchronously. The locality of the data is made explicit by the language: x is declared as a private 20x20 array in each image, while y, by the use of the [*], is defined as a shared 20x20 array in each image. Communication is realized by means of the co-arrays, using the bracket notation to specify remote image numbers.
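
    Putting these features together, a minimal illustrative sketch of a complete CAF program in this model:

      program caf_model_sketch
         implicit none
         real    :: x(20,20)       ! private: an independent copy on each image
         real    :: y(20,20)[*]    ! shared: a 20x20 co-array instance on every image
         integer :: me

         me = this_image()
         x  = real(me)             ! purely local computation
         y  = x                    ! local write to my own co-array instance
         call sync_all()           ! barrier and memory fence before touching remote data
         if (me == 1 .and. num_images() >= 2) then
            x(:,1:3) = y(:,5:7)[2] ! one-sided read of columns 5:7 from image 2
         end if
         call sync_all()
      end program caf_model_sketch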
