150 likes | 239 Views
OFA Openframework WG. SHMEM/PGAS Feedback Worksheet 1/27/14. Purpose of meeting. Gather API requirements for SHMEM/PGAS, and more generally one-sided parallel program models for things like Memory registration Remote memory references Collectives
E N D
OFA Openframework WG SHMEM/PGAS Feedback Worksheet 1/27/14
Purpose of meeting Gather API requirements for SHMEM/PGAS, and more generally one-sided parallel program models for things like • Memory registration • Remote memory references • Collectives • Active message implementation considerations • Others (e.g. Topology)?
What is the Openframework WG Charter? • Develop a more extensible and more generalized framework that allows applications to make best use of current/future RDMA network technologies • Avoid fragmentation of the OFED software ecosystem into multiple vendor specific components – this is sort of happening now with PSM, MXM, FCA, etc.
Memory Registration API Requirements (1) • Necessary evil? • Some say yes if needed for good performance • Scalability is a very important property • API should be designed to allow for cases where only local memory registration info is required to access remote memory • On-demand paging nice, but not essential (see next slide)
Memory Registration API Requirements (2) • Fortran 2008 (CoArray) in particular could benefit from ability to register large amounts of virtual memory that may be only sparsely populated • Portals4 non-matching interface has an option which supports this if implementation supports PTL_TARGET_BIND_INACCESSIBLE • Compromise functionality where a large VM region can be “pre-registered” to prep I/O MMU and device driver, but does not pin pages. A second function would be called to pin segments of the region when remote memory access is needed. • Growable (both down and up) registrations would also be useful
Small remote memory reference API requirements – perf and ops • PGAS compilers implemented directly on top of native RDMA functionality need an API that can deliver high performance for small non-blocking PUTs (stores) and GETs (loads). Typical remote memory accesses much smaller than for MPI programs and many more of them • Atomic memory ops are important (see later slide) • Put with remote completion event
Small remote memory reference API requirements - ordering PGAS compilers in particular have special ordering requirements: • Most users (writers of UPC and Fortran 2008 CoArray apps) expect strict WAW and WAR to a given target from a given initiator unless explicitly requesting relaxed order via macros, etc. • Point to point ordering is required between remote memory transactions from a given initiator to the same address at a given target • Portals4 spec section 2.6.1 describes in a nutshell the kind of ordering needed by typical PGAS compilers for small puts/gets/amos • Ideally this type of ordering could be controlled on a per remote memory transaction basis (since language extensions do allow for users to specify relaxed ordering)
Small remote memory reference API requirements – completion notification • Lightweight completion notification is very desirable, especially for PUTs(stores), not as important for GETs (loads). • Ideally allow for batching of groups of PUT/GET requests with a single local completion notification • Existing examples of this type functionality are • Concept of non-blocking implicit operations that complete with a single gsynccall that only completes when all outstanding PUTs are globally visible – Quadrics SHMEM, Cray DMAPP • Portals4 counter like concept where a simple counter can be used to track completion of RDMA operations at the source. This API also has the option to deliver full event in the case of error to better help in recovery.
Small remote memory reference API requirements – Atomic memory ops • Rich set of AMOs also useful – need more than FADD and CSWAP. • Multi-element AMOs for active message support, etc. • Nice to have AMOs that can support MCS lock algorithm at scale – implies possibly needing 128 bit AMOs • Two kinds of AMO performance characteristics: • Low latency but reliable (either fail or succeed, result being reported back to initiator). This allows use of locks, queues, etc. without giving up on resilience to transient network errors. • “In memory” computation, but only need a good enough answer. Throughput more important than reliability. Example is Table Toy.
Large RDMA transfers • Not so different from MPI ? • Need RDMA read and write • Option for fences between transaction • Per transfer network ordering options would be great • Notification both for local completion (can reuse buffer in the case of a write) and global completion
Collectives • Would be nice to not have to reinvent the wheel for multiple, frequently concurrently used program models, e.g. app using SHMEM and MPI • A lower level common interface for frequently used collective operations – barrier, reductions, coalesce (allgather), alltoall • Flat would be better, not have to do special on-node (within a cache coherent) within the SHMEM/PGAS implementation
Active Message Support • Two general types of uses • Lower performance need for RPC-like capabilities implied by some types of operations, e.g. upc_global_alloc • Higher performance needed for PGAS languages implemented using an Active Message paradigm, as well as other active message based program models like Charm++ • API should support sending of requests to pre-initialized queues, for which the target has registered callback functions to process the data sent from initiator. Payload can be restricted to small size ~256 bytes or less. • Initiator of message should get response back that message has arrived, optionally that message has been consumed by callback function • Implementation needs to be able to handle transient network errors, message is delivered once and only once to target • Some applications may require ordering of messages from a given initiator to given target, would be good to be able to specify this at queue initialization. • Flow control is important. • PAMI is a good example of an interface that meets many of these requirements
References • http://www.cs.sandia.gov/Portals/portals4-spec.html • http://openshmem.org/ • http://www.openfabrics.org/downloads/OFWG/ • BM Parallel Environment Runtime Edition Version 1 Release 2: PAMI Programming Guide (SA23-2273-03) • Using the GNI and DMAPP APIs
Portals 4 Spec 2.6.1 The default ordering semantics for Portals messages differ for short and long messages. The threshold between “short” and “long” is defined by two parameters, the maximum write-after-write size (max_waw_ordered_size) and the maximum write-after-read size (max_war_ordered_size). Both parameters are controlled by the desired and actual arguments of PtlNIInit(). Note that replies and acknowledgments do not require ordering. When one message that stores data (put, atomic) is followed by a message that stores data or retrieves data (put, atomic, get) from the same initiator to the same target and both messages are less than the max_waw_ordered_size in length, a byte from the second message that targets the same offset within the same LE (or ME) as a byte from the first message will perform its access after the byte from the first message. Similarly, when one message that retrieves data (get) is followed by a second message that stores data (put,atomic) from the same initiator to the same target and both messages are less than max_war_ordered_size in length, a byte from the second message that targets the same offset within the same LE (or ME) as a byte from the first message will perform its access after the byte from the first message. The order in which individual bytes of a single message are delivered is always unspecified. The order in which non-overlapping bytes of two different messages is not specified unless the implementation sets the PTL_TOTAL_DATA_ORDERING option in the actual features limits field. When total data ordering is provided and the short message constraints are met, the first message must be entirely delivered before any part of the second message is delivered. Support for the ordering of bytes between messages is an optional feature, since some implementations may be unable to provide such strict ordering semantics.