von Eicken et al, "Active Messages: a Mechanism for Integrated Communication and Computation" CS258 Lecture by: Dan Bonachea
Motivation for AM (review)
How do we make parallel programs fast?
• Minimize communication overhead
• Overlap communication & computation (shoot for 100% utilization of all resources)
• Consider the entire program
  • Communication
  • Computation
  • Interactions between the two
Message-Driven Architectures
• Research systems – J-Machine/MDP, Monsoon, etc.
• Defining characteristic: all significant computation happens within the context of a handler
  • Computational model is basically dataflow programming
  • Supports languages with dynamic parallelism, e.g. MultiLISP
  • Interesting note: about 1/3 of all handlers in the J-Machine end up blocking and get swapped out by software
• Pros:
  • Low-overhead communication – a reaction to the lousy performance of the send/recv model traditionally used in message-passing systems
  • Tight integration with the network – directly "execute" messages
• Cons:
  • Typically need hardware support in the NIC to achieve good performance – need more sophisticated buffering & scheduling
  • Poor locality of computation => small register sets and degraded raw computational performance (bad cache locality)
  • Poor cost/performance ratio, hard to program(?)
  • Number of handlers waiting to run at a given time is determined by the excess parallelism in the application, not the arrival rate of messages
Message-Passing Architectures
• Commercial systems – nCUBE, CM-5
• Defining feature: all significant computation happens in a dedicated computational thread => good locality, performance
• Traditional programming model is blocking, matched send/recv (implemented as a 3-phase rendezvous)
• Inherently a poor programming model for the lowest level:
  • Doesn't match the semantics of the NIC, and performance gets lost in the translation
  • Doesn't allow for overlap without expensive buffering
• There's no compelling reason to keep this model as our lowest-level network interface, even for this architecture
  • Sometimes easier to program, but we want the lowest-overhead interface possible at the NIC level
  • Can easily provide a send/recv abstraction on top of a more efficient interface
  • No way to recapture lost performance if the lowest-level interface is slow
Active Messages – a new "mechanism"
• Main idea: take the best features of the message-driven model and unify them with the capabilities of message-passing hardware
  • Get the same or better performance as message-driven systems with little or no special-purpose hardware
• Fix the mismatch between the low-level software interface and the hardware capabilities that cripples performance
  • Eliminate all buffering not required by the transport
  • Expose out-of-order, asynchronous delivery
  • Need to restrict the allowable behavior of handlers somewhat to make this possible
Active Messages – Handlers
• User-provided handlers that "execute" messages
  • Handlers run immediately upon message arrival
  • Handlers run quickly and to completion (no blocking)
  • Handlers run atomically with respect to each other
  • These restrictions make it possible to implement handlers with no buffering on simple message-passing hardware
• The purpose of AM handlers:
  • Quickly extract a message from the network and "integrate" the data into the running computation in an application-specific way, with a small amount of work
  • Handlers do NOT perform significant computation themselves
    • Only the minimum functionality required to communicate
    • This is the crucial difference between AM and the message-driven model
Active Messages – Handlers (cont.)
• Miscellaneous restriction:
  • Communication is strictly request-reply (ensures acyclic protocol dependencies)
    • Prevents deadlock with strictly bounded buffer space (assuming 2 virtual networks are available)
• Still powerful enough to implement most if not all communication paradigms
  • Shared memory, message-passing, message-driven, etc.
• AM is especially useful as a compilation target for higher-level languages (Split-C, Titanium, etc.)
  • Acceptable to trade off programmability and possibly some protection to maximize performance
  • Code is often generated by a compiler anyhow, so guarding against naïve users is less critical
Proof of Concept: Split-C
• Split-C: an explicitly parallel, SPMD version of C
  • Global address space abstraction, with a visible local/remote distinction
• Split-phase, one-sided (asynchronous) remote memory operations
  • Sender executes put or get, then a sync on a local counter for completion of 1 or more ops
  • User/compiler explicitly specifies prefetching to get overlap
• Write in a shared-memory style, but remote operations are explicit
  • The local/global distinction is important for high performance, so expose it to the user
  • Can also implement arbitrarily generalized data transfers (scatter-gather, strided)
• Important points:
  • AM can efficiently provide a global memory space on existing message-passing systems in software, using the right model
  • Evolutionary change rather than revolutionary (keep the architecture)
  • Works very well for coarse-grained SPMD apps
Results
• Dramatic reduction in latency on commercial message-passing machines with NO additional hardware
• nCUBE/2:
  • AM send/handle: 11µs/15µs overhead
  • Blocking message send/recv: 160µs overhead
• CM-5:
  • AM: <2µs overhead
  • Blocking message send/recv: 86µs overhead
• About an order of magnitude improvement with no hardware investment
Optional Hardware/Kernel Support for AM
• DMA transfer support => large messages
• Registers on the NIC for composing messages
  • General registers, not FIFOs – allow message reuse
  • Ability to compose a request & reply simultaneously
• Fast user-level interrupts
  • Allow fully user-level interrupts (trap directly to the handler)
    • PC injection is one way to do this
  • Any protection mechanisms required for the kernel to allow user-level NIC interrupts
• Support for efficient polling
Problems with the AM-1 Paper
• Handler atomicity w.r.t. the main computation
  • Addressed in von Eicken's thesis
  • Solutions:
    • Atomic instructions
    • A mechanism to temporarily disable NIC interrupts using a memory flag or reserved register
• Described as an abstract mechanism, not a solid portable spec
• Little support for recv-side protection, multi-threading, CLUMPs, abstract naming, etc.
• AM-2 fixes the above problems
GAM & Active Messages-2
• Done at Berkeley by Mainwaring, Culler, et al.
• Standardized the API & generalized it somewhat
• Adds support missing in AM-1 for:
  • Multiple logical endpoints per application (modularity, multi-threading, multi-NIC)
  • Non-SPMD configurations
  • Recv-side protection mechanisms to catch non-malicious bugs (tags)
  • Multi-threaded applications
  • A level of indirection on handlers for non-aligned memory spaces (heterogeneous systems)
  • Fault-tolerance support for congestion, node failure, etc. (return to sender)
  • Opaque endpoint naming (client code portability, transparent multi-protocol implementations)
• Polling may happen implicitly on all calls, so explicit polls are rarely required
• Enforces strict request/reply – eases implementation on some systems (HPAM)
Influence of Active Messages
• Many implementations of AM in some form
  • Natively on NICs: Myrinet (NOW project), VIA (Buonadonna & Begel), HP Medusa (Richard Martin), Intel Paragon (Liu), Meiko CS-2 (Schauser)
  • On other transports: TCP (Liu and Mainwaring), UDP (me), MPI (me), LAPI (Yau & Welcome)
  • Other interesting work: multi-protocol AM (shared memory & network for CLUMPs) (Lumetta)
• Used as a compilation target for many parallel languages/systems:
  • Split-C, Id90/TAM, Titanium, PVM, UPC, MPI…
• Influenced the design of important systems
  • E.g. the IBM SP supercomputer: LAPI – a low-level messaging layer that is basically AM