A Generalized Portable SHMEM Library • Krzysztof Parzyszek, Ames Laboratory • Jarek Nieplocha, Pacific Northwest National Laboratory • Ricky Kendall, Ames Laboratory
Overview • Introduction • global address space programming model • one-sided communication • Cray SHMEM • GPSHMEM - Generalized Portable SHMEM • Implementation Approach • Experimental Results • Conclusions
Global Address Space and 1-Sided Communication • collection of the address spaces of processes in a parallel job • global address: (address, pid), e.g. (0xf32674, P5) • hardware examples: Cray T3E, Fujitsu VPP5000 • language support: Co-Array Fortran, UPC • [diagram: one-sided communication — P0 issues a put directly into P1's memory; contrasted with message passing — P0 send, P1 receive]
Motivation: global address space versus other programming models
One-sided communication interfaces • First commercial implementation – SHMEM on the Cray T3D • put, get, scatter, gather, atomic swap • memory consistency issues (solved on the T3E) • maps well to the Cray T3E hardware – excellent application performance • Vendor-specific interfaces • IBM LAPI, Fujitsu MPlib, NEC Parlib/CJ, Hitachi RDMA, Quadrics Elan • Portable interfaces • MPI-2 1-sided (a related but rather restrictive model) • ARMCI one-sided communication library • SHMEM (some platforms) • GPSHMEM – the first fully portable implementation of SHMEM
History of SHMEM • Introduced on the Cray T3D in 1993 • one-sided operations: put, get, scatter, gather, atomic swap • collective operations: synchronization, reduction • cache not coherent w.r.t. SHMEM operations (problem solved on the T3E) • highest level of performance on any MPP at that time • Increased availability • after purchasing Cray, SGI ported SHMEM to IRIX systems and Cray vector systems • but not always with full functionality (w/o atomic ops on vector systems like the Cray J90) • extensions to match more datatypes – the SHMEM API is datatype oriented • HPVM project led by Andrew Chien (UIUC/UCSD) • ported and extended a subset of SHMEM • on top of Fast Messages for Linux (later dropped) and Windows clusters • Quadrics/Compaq port to Elan • available on Linux and Tru64 clusters with the QSW switch • subset on top of LAPI for the IBM SP • internal porting tool by the IBM ACTS group at Watson
Characteristics of SHMEM • Memory addressability • symmetric objects • stack and heap allocation on the T3D • Cray memory allocation routine shmalloc • Ordering of operations • ordered in the original version on the T3D • out-of-order on the T3E • adaptive routing; shmem_quiet added • Progress rules • fully one-sided, no explicit or implicit polling by the remote node • much simpler model than MPI-2 1-sided • no redundant locking or remote-process cooperation • [diagram: symmetric object a exists on both P0 and P1; shmem_put(a,b,n,0) copies local b into a on PE 0]
GPSHMEM • Full interface of the Cray T3D SHMEM version • Ordering of operations • Portability restriction: must use shmalloc for memory allocation • Extensions for block-strided data transfers • the original Cray strided interface (shmem_iget) involved single elements • GPSHMEM: shmem_strided_get(prem, ploc, rstride, lstride, nbytes, nblock, proc) • [diagram: Cray SHMEM shmem_iget moves single elements; GPSHMEM shmem_strided_get moves nblock blocks of nbytes each, with local stride lstride]
GPSHMEM implementation approach • [layered diagram] SHMEM interfaces: one-sided operations and collective operations • one-sided operations built on the ARMCI library; collective operations on a message-passing library (MPI, PVM) • run-time support underneath • platform-specific communication interfaces at the bottom (active messages, remote memory copy, threads, shared memory)
ARMCI portable 1-sided communication library • Functionality • put, get, accumulate (also with noncontiguous interfaces) • atomic read-modify-write, mutexes and locks • memory allocation operations • Characteristics • simple progress rules – truly one-sided • operations ordered w.r.t. the target (ease of use) • compatible with message-passing libraries (MPI, PVM) • low-level system, no Fortran API • Portability • MPPs: Cray T3E, Fujitsu VPP, IBM SP (uses the vendors' 1-sided ops) • clusters of Unix and Windows systems (Myrinet, VIA, TCP/IP) • large servers with shared memory: SGI, Sun, Cray SV1, HP
Multiprotocols in ARMCI (IBM SP example) • between nodes: remote memory copy (LAPI) • within an SMP node: shared memory • Active Messages used for noncontiguous transfers and atomic operations • process/thread synchronization via threads • ARMCI_Malloc() places all user data in shared memory • [chart: bandwidth (MB/s) vs. message size (bytes) for the shared memory, LAPI SMP, and LAPI remote protocols, peaking near 140 MB/s]
Experience • Performance studies • GPSHMEM overhead over SHMEM on the Cray T3E • comparison to MPI-2 1-sided on the Fujitsu VX-4 • Applications – see the paper • matrix multiplication on a Linux cluster • porting Cray T3E codes
GPSHMEM Overhead on the T3E • Approach • renamed GPSHMEM calls to avoid conflict with Cray SHMEM • collected latency and bandwidth numbers • Overhead • shmem_put: 3.5 µs • shmem_get: 3 µs • bandwidth is the same, since GPSHMEM and ARMCI add no extra memory copies • Discussion • the overhead includes both GPSHMEM and ARMCI • reflects address conversion • searching the table of addresses of allocated objects • can be avoided when addresses are identical • [diagram: GPSHMEM layered over ARMCI over Cray SHMEM]
Conclusions • Described a fully portable implementation of a SHMEM-like library • SHMEM becomes a viable alternative to MPI-2 1-sided • Good performance, closely tied to ARMCI • Offers potentially wide portability to other tools based on SHMEM • e.g. Co-Array Fortran • the Cray SHMEM API is incomplete for strided data structures • extensions for block-strided transfers improve performance • More work with applications is needed to drive future extensions and development • Code availability: rickyk@ameslab.gov