A Generalized Portable SHMEM Library

Krzysztof Parzyszek (Ames Laboratory), Jarek Nieplocha (Pacific Northwest National Laboratory), Ricky Kendall (Ames Laboratory)


Presentation Transcript


  1. A Generalized Portable SHMEM Library. Krzysztof Parzyszek (Ames Laboratory), Jarek Nieplocha (Pacific Northwest National Laboratory), Ricky Kendall (Ames Laboratory)

  2. Overview
     • Introduction
       • global address space programming model
       • one-sided communication
       • Cray SHMEM
     • GPSHMEM - Generalized Portable SHMEM
     • Implementation Approach
     • Experimental Results
     • Conclusions

  3. Global Address Space and 1-Sided Communication
     • global address space: collection of address spaces of processes in a parallel job
     • global address: (address, pid), e.g. (0xf5670,P0), (0xf32674,P5)
     • hardware examples: Cray T3E, Fujitsu VPP5000
     • language support: Co-Array Fortran, UPC
     • Communication model
       • one-sided communication: put from P0 to P1
       • message passing: send on P0, receive on P1
     [Figure: one-sided put between P0 and P1 contrasted with a send/receive pair in message passing]
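The difference between the two communication models on this slide can be sketched with a toy Python model. This is not SHMEM or MPI code: the `Job` class and its `put`, `get`, `send`, and `recv` methods are hypothetical names invented for illustration, and each "process" is just a byte buffer inside one Python program. A global address is the pair (address, pid), as on the slide.

```python
class Job:
    """Toy parallel job: one private memory and one message queue per process.

    Hypothetical model for illustration only; not a real SHMEM or MPI API.
    """

    def __init__(self, nprocs, mem_size=64):
        self.memory = [bytearray(mem_size) for _ in range(nprocs)]
        self.inbox = [[] for _ in range(nprocs)]

    def _check(self, address, nbytes, pid):
        # Reject accesses outside the target process's address space.
        if address < 0 or address + nbytes > len(self.memory[pid]):
            raise IndexError("global address out of range")

    def put(self, data, address, pid):
        """One-sided: write directly to the global address (address, pid)."""
        self._check(address, len(data), pid)
        self.memory[pid][address:address + len(data)] = data

    def get(self, address, nbytes, pid):
        """One-sided: read directly from the global address (address, pid)."""
        self._check(address, nbytes, pid)
        return bytes(self.memory[pid][address:address + nbytes])

    def send(self, data, dest):
        """Message passing, sender half: the data only reaches a queue."""
        self.inbox[dest].append(bytes(data))

    def recv(self, pid, address):
        """Message passing, receiver half: the target must act to store data."""
        data = self.inbox[pid].pop(0)
        self.put(data, address, pid)
        return len(data)


job = Job(nprocs=2)
job.put(b"abc", 8, 1)        # P1 does nothing; its memory changes anyway
job.send(b"xyz", dest=1)     # P1's memory is unchanged until it calls recv
job.recv(pid=1, address=0)
```

In the model, `put` completes without any action by the target, while `send` has no visible effect on the target's memory until the matching `recv` runs.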

  4. Motivation: global address space versus other programming models

  5. One-sided communication interfaces
     • First commercial implementation - SHMEM on the Cray T3D
       • put, get, scatter, gather, atomic swap
       • memory consistency issues (solved on the T3E)
       • maps well to the Cray T3E hardware - excellent application performance
     • Vendor-specific interfaces
       • IBM LAPI, Fujitsu MPlib, NEC Parlib/CJ, Hitachi RDMA, Quadrics Elan
     • Portable interfaces
       • MPI-2 1-sided (related but rather restrictive model)
       • ARMCI one-sided communication library
       • SHMEM (some platforms)
       • GPSHMEM - first fully portable implementation of SHMEM

  6. History of SHMEM
     • Introduced on the Cray T3D in 1993
       • one-sided operations: put, get, scatter, gather, atomic swap
       • collective operations: synchronization, reduction
       • cache not coherent w.r.t. SHMEM operations (problem solved on the T3E)
       • highest level of performance on any MPP at that time
     • Increased availability
       • SGI, after purchasing Cray, ported it to IRIX systems and Cray vector systems
         • but not always with full functionality (no atomic ops on vector systems like the Cray J90)
         • extensions to match more datatypes - SHMEM API is datatype oriented
       • HPVM project led by Andrew Chien (UIUC/UCSD)
         • ported and extended a subset of SHMEM
         • on top of Fast Messages for Linux (later dropped) and Windows clusters
       • Quadrics/Compaq port to Elan
         • available on Linux and Tru64 clusters with QSW switch
       • subset on top of LAPI for the IBM SP
         • internal porting tool by the IBM ACTS group at Watson

  7. Characteristics of SHMEM
     • Memory addressability
       • symmetric objects
       • stack, heap allocation on the T3D
       • Cray memory allocation routine shmalloc
     • Ordering of operations
       • ordered in the original version on the T3D
       • out-of-order on the T3E
         • adaptive routing, added shmem_quiet
     • Progress rules
       • fully one-sided, no explicit or implicit polling by remote node
       • much simpler model than MPI-2 1-sided
       • no redundant locking or remote process cooperation
     [Figure: symmetric object a present on both P0 and P1, buffer b; call shmem_put(a,b,n,0)]
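Two of the properties listed on this slide, symmetric objects and the need for shmem_quiet once puts may complete out of order, can be illustrated with a small Python model. It is only a sketch: `SymmetricHeap` and its methods are hypothetical names that loosely mirror shmalloc, shmem_put, and shmem_quiet, the heap holds Python integers instead of raw memory, and delaying every put until `quiet` is an assumption made to make the ordering issue visible, not a description of the T3E hardware.

```python
class SymmetricHeap:
    """Toy model of symmetric allocation on npes processing elements (PEs).

    Hypothetical illustration; names only loosely mirror the SHMEM routines.
    """

    def __init__(self, npes, nwords):
        self.heaps = [[0] * nwords for _ in range(npes)]
        self.next_free = 0
        self.pending = []  # puts issued but not yet delivered

    def shmalloc(self, nwords):
        """Collective allocation: the object gets the same offset on every PE."""
        if self.next_free + nwords > len(self.heaps[0]):
            raise MemoryError("symmetric heap exhausted")
        offset = self.next_free
        self.next_free += nwords
        return offset

    def put(self, target, source, pe):
        """Queue a copy of the local values `source` into `target` on `pe`."""
        values = list(source)
        if target < 0 or target + len(values) > len(self.heaps[pe]):
            raise IndexError("put outside the symmetric heap")
        self.pending.append((pe, target, values))

    def quiet(self):
        """Complete all outstanding puts; the delivery order is not promised.

        Reversed order is used here only to stress that issue order
        need not equal delivery order.
        """
        for pe, target, values in reversed(self.pending):
            self.heaps[pe][target:target + len(values)] = values
        self.pending.clear()


heap = SymmetricHeap(npes=2, nwords=16)
a = heap.shmalloc(4)            # same offset for `a` on PE 0 and PE 1
heap.put(a, [1, 2, 3, 4], pe=0)
heap.quiet()                    # only now is the data guaranteed to be on PE 0
```

Because the offset returned by `shmalloc` is valid on every PE, the caller can name remote data with a local handle plus a PE number, which is the role the symmetric object `a` plays in shmem_put(a,b,n,0).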

  8. GPSHMEM
     • Full interface of the Cray T3D SHMEM version
     • Ordering of operations
     • Portability restriction: must use shmalloc for memory allocation
     • Extensions for block strided data transfers
       • the original Cray strided interface involved single elements
       • GPSHMEM: shmem_strided_get(prem, ploc, rstride, lstride, nbytes, nblock, proc)
     [Figure: Cray SHMEM shmem_iget (single elements) compared with GPSHMEM shmem_strided_get (blocks); labels prem, ploc, lstride, nbytes, nblock]
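The block strided extension can be sketched in Python on plain byte buffers. The parameter names follow the signature on the slide, but their exact meaning is an assumption here: `nblock` blocks of `nbytes` bytes each are copied, with consecutive blocks `rstride` bytes apart in the remote buffer and `lstride` bytes apart in the local one. The remote process is modeled as a `bytearray` passed in directly, so the `proc` argument is replaced by the `remote` buffer itself.

```python
def _check(buf, start, stride, nbytes, nblock):
    # Validate the whole access pattern before copying anything.
    if start < 0 or stride < 0 or nbytes < 0:
        raise IndexError("negative offset, stride, or block size")
    if nblock > 0 and start + (nblock - 1) * stride + nbytes > len(buf):
        raise IndexError("strided access runs past the end of the buffer")


def strided_get(remote, prem, local, ploc, rstride, lstride, nbytes, nblock):
    """Copy nblock blocks of nbytes bytes from `remote` into `local`.

    Assumed semantics (not confirmed by the slide): block i is read at
    prem + i*rstride and written at ploc + i*lstride.
    """
    _check(remote, prem, rstride, nbytes, nblock)
    _check(local, ploc, lstride, nbytes, nblock)
    for i in range(nblock):
        r = prem + i * rstride
        l = ploc + i * lstride
        local[l:l + nbytes] = remote[r:r + nbytes]


def element_strided_get(remote, prem, local, ploc, rstride, lstride, nelems):
    """Single-element strided copy: the special case with one-byte blocks."""
    strided_get(remote, prem, local, ploc, rstride, lstride, 1, nelems)


# Remote data: a 4 x 8 row-major matrix of bytes with values 0..31.
remote = bytearray(range(32))

# Fetch the 4 x 3 sub-block starting at column 2 into a dense local buffer.
block = bytearray(12)
strided_get(remote, prem=2, local=block, ploc=0,
            rstride=8, lstride=3, nbytes=3, nblock=4)
print(list(block))  # [2, 3, 4, 10, 11, 12, 18, 19, 20, 26, 27, 28]
```

With a single-element interface, the same 4 x 3 sub-block needs one call per column (three calls of four elements), whereas the block interface expresses it in one call of four three-byte blocks.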

  9. GPSHMEM implementation approach
     [Figure: layered architecture, top to bottom]
     • SHMEM interfaces: one-sided operations, collective operations
     • Run-time support: ARMCI, message-passing library (MPI, PVM)
     • Platform-specific communication interfaces (active messages, RMC, threads, shared memory)

  10. ARMCI: portable 1-sided communication library
      • Functionality
        • put, get, accumulate (also with noncontiguous interfaces)
        • atomic read-modify-write, mutexes and locks
        • memory allocation operations
      • Characteristics
        • simple progress rules - truly one-sided
        • operations ordered w.r.t. target (ease of use)
        • compatible with message-passing libraries (MPI, PVM)
        • low-level system, no Fortran API
      • Portability
        • MPPs: Cray T3E, Fujitsu VPP, IBM SP (uses vendors' 1-sided ops)
        • clusters of Unix and Windows systems (Myrinet, VIA, TCP/IP)
        • large servers with shared memory: SGI, Sun, Cray SV1, HP

  11. Multiprotocols in ARMCI (IBM SP example)
      • Protocols: remote memory copy, shared memory, Active Messages, threads, process/thread synchronization (used between nodes and within an SMP node)
      • AMs used for noncontiguous transfers and atomic operations
      • ARMCI_Malloc() places all user's data in shared memory!
      [Figure: bandwidth [MB/s] versus message size in bytes for shared memory, LAPI remote, and LAPI SMP]

  12. Experience
      • Performance studies
        • GPSHMEM overhead over SHMEM on the Cray T3E
        • Comparison to MPI-2 1-sided on the Fujitsu VX-4
      • Applications - see paper
        • matrix multiplication on a Linux cluster
        • porting Cray T3E codes

  13. GPSHMEM Overhead on the T3E
      • Approach
        • renamed GPSHMEM calls to avoid conflict with Cray SHMEM
        • collected latency and bandwidth numbers
      • Overhead
        • shmem_put 3.5 µs
        • shmem_get 3 µs
        • bandwidth is the same since GPSHMEM and ARMCI do not add extra memory copies
      • Discussion
        • the overhead includes GPSHMEM and ARMCI
        • reflects address conversion
          • searching table of addresses for allocated objects
          • can be avoided when addresses are identical
      [Figure: software stack of GPSHMEM over ARMCI over Cray SHMEM]
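The address conversion named as the source of the overhead can be pictured as a table lookup. The Python sketch below is a guess at the general idea, not the GPSHMEM implementation: `AddressTable`, `register`, and `translate` are hypothetical names, and the table layout (one entry per allocated object, holding its base address on every process) is an assumption.

```python
import bisect


class AddressTable:
    """Hypothetical per-process table of allocated symmetric objects."""

    def __init__(self, my_pe):
        self.my_pe = my_pe
        self._local_bases = []  # local base addresses, kept sorted
        self._entries = []      # parallel list of (size, bases on all PEs)

    def register(self, size, bases):
        """Record an object of `size` bytes; bases[pe] is its base on `pe`."""
        local = bases[self.my_pe]
        i = bisect.bisect_left(self._local_bases, local)
        self._local_bases.insert(i, local)
        self._entries.insert(i, (size, tuple(bases)))

    def translate(self, local_addr, pe):
        """Map an address inside a local object to the matching address on `pe`."""
        # Find the last object whose local base is <= local_addr.
        i = bisect.bisect_right(self._local_bases, local_addr) - 1
        if i < 0:
            raise KeyError("address is not inside any allocated object")
        size, bases = self._entries[i]
        offset = local_addr - self._local_bases[i]
        if offset >= size:
            raise KeyError("address is not inside any allocated object")
        return bases[pe] + offset


table = AddressTable(my_pe=0)
table.register(100, [0x1000, 0x5000])  # different base on each process
table.register(50, [0x2000, 0x2000])   # identical base on both processes
print(hex(table.translate(0x1010, pe=1)))  # 0x5010
```

For the second object the translated address equals the local one, which is the case where, as the slide notes, the search could be skipped.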

  14. Performance of GPSHMEM and MPI-2 on the Fujitsu VX-4

  15. Conclusions
      • Described a fully portable implementation of a SHMEM-like library
        • SHMEM becomes a viable alternative to MPI-2 1-sided
        • Good performance, closely tied to ARMCI
      • Offers potentially wide portability to other tools based on SHMEM
        • e.g. Co-Array Fortran
      • Cray SHMEM API incomplete for strided data structures
        • extensions for block strided transfers improve performance
      • More work with applications needed to drive future extensions and development
      • Code availability: rickyk@ameslab.gov
