A Generalized Portable SHMEM Library
Krzysztof Parzyszek, Ames Laboratory
Jarek Nieplocha, Pacific Northwest National Laboratory
Ricky Kendall, Ames Laboratory
Overview
• Introduction
  • global address space programming model
  • one-sided communication
  • Cray SHMEM
• GPSHMEM - Generalized Portable SHMEM
• Implementation Approach
• Experimental Results
• Conclusions
Global Address Space and 1-Sided Communication
• collection of address spaces of processes in a parallel job
• global address: (address, pid)
• hardware examples: Cray T3E, Fujitsu VPP5000
• language support: Co-Array Fortran, UPC
• communication model: one-sided communication (put), but not message passing (send/receive)
[Figure: a put between P0 and P1 labeled "one-sided communication", contrasted with a send/receive pair between P0 and P1 labeled "message passing"; example global addresses (0xf5670,P0) and (0xf32674,P5)]
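The (address, pid) pair and the put-versus-send/receive contrast above can be sketched in plain C. This is a single-process simulation rather than SHMEM itself: the names space, global_addr, sim_put and sim_get, and the sizes NPROCS and SPACE_BYTES, are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NPROCS 4        /* hypothetical number of simulated processes */
#define SPACE_BYTES 64  /* hypothetical size of each address space */

/* Each simulated process owns one private address space. */
static unsigned char space[NPROCS][SPACE_BYTES];

/* A global address is the pair (address, pid). In this sketch the
   "address" is an offset into the owning process's space. */
typedef struct {
    size_t addr;
    int pid;
} global_addr;

/* One-sided put: the initiator names the destination completely,
   so the target process executes no matching receive. */
static void sim_put(global_addr dst, const void *src, size_t nbytes)
{
    assert(dst.pid >= 0 && dst.pid < NPROCS);
    assert(dst.addr <= SPACE_BYTES && nbytes <= SPACE_BYTES - dst.addr);
    memcpy(&space[dst.pid][dst.addr], src, nbytes);
}

/* One-sided get: the initiator reads remote memory the same way. */
static void sim_get(void *dst, global_addr src, size_t nbytes)
{
    assert(src.pid >= 0 && src.pid < NPROCS);
    assert(src.addr <= SPACE_BYTES && nbytes <= SPACE_BYTES - src.addr);
    memcpy(dst, &space[src.pid][src.addr], nbytes);
}
```

In a message-passing version of the same transfer, the target would have to post a receive that matches the sender's send; here only the initiator makes a call.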
Motivation: global address space versus other programming models
One-sided communication interfaces
• First commercial implementation - SHMEM on the Cray T3D
  • put, get, scatter, gather, atomic swap
  • memory consistency issues (solved on the T3E)
  • maps well to the Cray T3E hardware - excellent application performance
• Vendor-specific interfaces
  • IBM LAPI, Fujitsu MPlib, NEC Parlib/CJ, Hitachi RDMA, Quadrics Elan
• Portable interfaces
  • MPI-2 1-sided (related but rather restrictive model)
  • ARMCI one-sided communication library
  • SHMEM (some platforms)
  • GPSHMEM -- first fully portable implementation of SHMEM
History of SHMEM
• Introduced on the Cray T3D in 1993
  • one-sided operations: put, get, scatter, gather, atomic swap
  • collective operations: synchronization, reduction
  • cache not coherent w.r.t. SHMEM operations (problem solved on the T3E)
  • highest level of performance on any MPP at that time
• Increased availability
  • SGI, after purchasing Cray, ported it to IRIX systems and Cray vector systems
    • but not always with full functionality (no atomic ops on vector systems like the Cray J90)
    • extensions to match more datatypes - SHMEM API is datatype oriented
  • HPVM project led by Andrew Chien (UIUC/UCSD)
    • ported and extended a subset of SHMEM
    • on top of Fast Messages for Linux (later dropped) and Windows clusters
  • Quadrics/Compaq port to Elan
    • available on Linux and Tru64 clusters with QSW switch
  • subset on top of LAPI for the IBM SP
    • internal porting tool by the IBM ACTS group at Watson
Characteristics of SHMEM
• Memory addressability
  • symmetric objects
  • stack, heap allocation on the T3D
  • Cray memory allocation routine shmalloc
• Ordering of operations
  • ordered in the original version on the T3D
  • out-of-order on the T3E
    • adaptive routing, added shmem_quiet
• Progress rules
  • fully one-sided, no explicit or implicit polling by remote node
  • much simpler model than MPI-2 1-sided
  • no redundant locking or remote process cooperation
[Figure: symmetric object a present on both P0 and P1, plus a buffer b; the call shown is shmem_put(a,b,n,0)]
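The idea of a symmetric object can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not Cray's or GPSHMEM's implementation: sim_pe, sim_shmalloc, sim_addr and sim_shmem_put are hypothetical names, and the bump allocator merely stands in for shmalloc. The point it shows is that when every PE performs the same allocations, an object sits at the same offset everywhere, so a call shaped like shmem_put(a,b,n,0) can name the remote copy of a without asking PE 0 for its address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NPES 2          /* hypothetical number of simulated PEs */
#define HEAP_BYTES 256  /* hypothetical symmetric heap size per PE */

typedef struct {
    unsigned char heap[HEAP_BYTES];
    size_t top;  /* bump-allocator mark */
} sim_pe;

static sim_pe pes[NPES];

/* Collective allocation: every PE allocates the same size in the same
   order, so the object lands at the same offset on all of them.
   Returns the offset, or (size_t)-1 if the heap is exhausted. */
static size_t sim_shmalloc(size_t nbytes)
{
    size_t off = pes[0].top;
    size_t rounded = (nbytes + 7u) & ~(size_t)7u;  /* keep 8-byte alignment */
    if (rounded < nbytes || rounded > HEAP_BYTES - off)
        return (size_t)-1;
    for (int pe = 0; pe < NPES; pe++) {
        assert(pes[pe].top == off);  /* heaps must stay in lockstep */
        pes[pe].top = off + rounded;
    }
    return off;
}

/* Resolve a symmetric offset to the address of that object on one PE. */
static void *sim_addr(int pe, size_t off)
{
    assert(pe >= 0 && pe < NPES);
    assert(off < HEAP_BYTES);
    return &pes[pe].heap[off];
}

/* Put shaped like shmem_put(target, source, n, pe): target names the
   symmetric object by offset, source is a buffer local to the caller. */
static void sim_shmem_put(size_t target, const void *source,
                          size_t nbytes, int pe)
{
    assert(pe >= 0 && pe < NPES);
    assert(target <= HEAP_BYTES && nbytes <= HEAP_BYTES - target);
    memcpy(&pes[pe].heap[target], source, nbytes);
}
```

A caller acting as PE 1 would fill its local b and then issue sim_shmem_put(a, sim_addr(1, b), n, 0); PE 0 takes no action for the data to arrive in its copy of a.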
GPSHMEM
• Full interface of the Cray T3D SHMEM version
• Ordering of operations
• Portability restriction: must use shmalloc for memory allocation
• Extensions for block strided data transfers
  • the original Cray strided interface involved single elements
  • GPSHMEM: shmem_strided_get(prem, ploc, rstride, lstride, nbytes, nblock, proc)
[Figure: Cray SHMEM shmem_iget compared with GPSHMEM shmem_strided_get; labels prem, ploc, lstride, nbytes, nblock]
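A block strided get can be sketched as a loop of block copies. The argument order below follows the shmem_strided_get call on the slide, but the semantics are an assumption made for illustration: strides are taken to be byte distances between the starts of consecutive blocks, prem is modeled as a byte offset into simulated remote memory, and sim_strided_get and remote_mem are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NPROCS 2       /* hypothetical number of simulated processes */
#define MEM_BYTES 128  /* hypothetical size of each simulated memory */

/* Simulated memory of each process; prem is an offset into it. */
static unsigned char remote_mem[NPROCS][MEM_BYTES];

/* Fetch nblock blocks of nbytes each from process proc.
   Block i is read from prem + i*rstride on the remote side and
   written to ploc + i*lstride on the local side. Strides are assumed
   to be in bytes; the caller must size ploc accordingly. */
static void sim_strided_get(size_t prem, void *ploc,
                            size_t rstride, size_t lstride,
                            size_t nbytes, size_t nblock, int proc)
{
    unsigned char *dst = ploc;
    assert(proc >= 0 && proc < NPROCS);
    for (size_t i = 0; i < nblock; i++) {
        size_t roff = prem + i * rstride;
        assert(roff <= MEM_BYTES && nbytes <= MEM_BYTES - roff);
        memcpy(dst + i * lstride, &remote_mem[proc][roff], nbytes);
    }
}
```

With a remote row-major matrix whose rows are 8 bytes wide, sim_strided_get(10, local, 8, 3, 3, 2, 1) packs a 2 x 3 sub-block into six contiguous local bytes in one call. A single-element strided interface would need one call per row, or per column, to move the same sub-block.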
GPSHMEM implementation approach
[Figure: layered diagram; labels as they appear]
• SHMEM interfaces: one-sided operations, collective operations
• ARMCI; message-passing library (MPI, PVM)
• Run-time support
• Platform-specific communication interfaces (active messages, RMC, threads, shared memory)
ARMCI portable 1-sided communication library
• Functionality
  • put, get, accumulate (also with noncontiguous interfaces)
  • atomic read-modify-write, mutexes and locks
  • memory allocation operations
• Characteristics
  • simple progress rules - truly one-sided
  • operations ordered w.r.t. target (ease of use)
  • compatible with message-passing libraries (MPI, PVM)
  • low-level system, no Fortran API
• Portability
  • MPPs: Cray T3E, Fujitsu VPP, IBM SP (uses vendor 1-sided ops)
  • clusters of Unix and Windows systems (Myrinet, VIA, TCP/IP)
  • large servers with shared memory: SGI, Sun, Cray SV1, HP
Multiprotocols in ARMCI (IBM SP example)
• AMs used for noncontiguous transfers and atomic operations
• ARMCI_Malloc() places all user's data in shared memory!
[Figure: protocol diagram with labels "between nodes", "SMP", "remote memory copy", "shared memory", "Active Messages", "threads", "process/thread synchronization"]
[Figure: bandwidth [MB/s] (0 to 140) versus message size in bytes (1 to 1000000); series: LAPI remote, LAPI SMP, shared memory]
Experience
• Performance studies
  • GPSHMEM overhead over SHMEM on the Cray T3E
  • Comparison to MPI-2 1-sided on the Fujitsu VX-4
• Applications - see paper
  • matrix multiplication on a Linux cluster
  • porting Cray T3E codes
GPSHMEM Overhead on the T3E
• Approach
  • renamed GPSHMEM calls to avoid conflict with Cray SHMEM
  • collected latency and bandwidth numbers
• Overhead
  • shmem_put 3.5 µs
  • shmem_get 3 µs
  • bandwidth is the same since GPSHMEM and ARMCI do not add extra memory copies
• Discussion
  • the overhead includes GPSHMEM and ARMCI
  • reflects address conversion
    • searching table of addresses for allocated objects
    • can be avoided when addresses are identical
[Figure: software layers GPSHMEM, ARMCI, Cray SHMEM]
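The address conversion mentioned in the discussion can be sketched as a table lookup. This is a hypothetical illustration of the general technique, not the GPSHMEM or ARMCI code: sym_object, register_object and translate are invented names, and a linear search stands in for whatever lookup the library actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPROCS 2       /* hypothetical number of processes */
#define MAX_OBJECTS 8  /* hypothetical table capacity */

/* One entry per allocated object: its base address on every process
   and its size in bytes. */
typedef struct {
    uintptr_t base[NPROCS];
    size_t size;
} sym_object;

static sym_object table[MAX_OBJECTS];
static int nobjects;

static void register_object(const uintptr_t base[NPROCS], size_t size)
{
    assert(nobjects < MAX_OBJECTS);
    for (int p = 0; p < NPROCS; p++)
        table[nobjects].base[p] = base[p];
    table[nobjects].size = size;
    nobjects++;
}

/* Convert an address valid on process me into the matching address on
   process proc. Returns 0 when no registered object contains it. */
static uintptr_t translate(uintptr_t local, int me, int proc)
{
    assert(me >= 0 && me < NPROCS && proc >= 0 && proc < NPROCS);
    for (int i = 0; i < nobjects; i++) {
        uintptr_t lo = table[i].base[me];
        if (local >= lo && local - lo < table[i].size)
            return table[i].base[proc] + (local - lo);
    }
    return 0;
}
```

When an object has the same base address on every process, the translated address equals the input, which is the case in which the slide notes the search can be skipped.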
Conclusions
• Described a fully portable implementation of a SHMEM-like library
  • SHMEM becomes a viable alternative to MPI-2 1-sided
  • Good performance, closely tied to ARMCI
• Offers potential wide portability to other tools based on SHMEM
  • e.g. Co-Array Fortran
• Cray SHMEM API incomplete for strided data structures
  • extensions for block strided transfers improve performance
• More work with applications needed to drive future extensions and development
• Code availability: rickyk@ameslab.gov