
von Eicken et al, "Active Messages: a Mechanism for Integrated Communication and Computation"

von Eicken et al, "Active Messages: a Mechanism for Integrated Communication and Computation". CS258 Lecture by: Dan Bonachea. Motivation for AM (review). How do we make parallel programs fast? Minimize communication overhead


Presentation Transcript


  1. von Eicken et al, "Active Messages: a Mechanism for Integrated Communication and Computation" CS258 Lecture by: Dan Bonachea

  2. Motivation for AM (review)
     How do we make parallel programs fast?
     • Minimize communication overhead
     • Overlap communication & computation (shoot for 100% utilization of all resources)
     • Consider the entire program
       • Communication
       • Computation
       • Interactions between the two

  3. Message-Driven Architectures
     • Research systems – J-Machine/MDP, Monsoon, etc.
     • Defining quality: all significant computation happens within the context of a handler
     • Computational model is basically dataflow programming
     • Support languages with dynamic parallelism, e.g. MultiLISP
     • Interesting note: about 1/3 of all handlers in J-Machine end up blocking and get swapped out by software
     • Pros:
       • Low-overhead communication – a reaction to the lousy performance of the send/recv model traditionally used in message-passing systems
       • Tight integration with network – directly "execute" messages
     • Cons:
       • Typically need hardware support in the NIC to achieve good performance - need more sophisticated buffering & scheduling
       • Poor locality of computation => small register sets and degraded raw computational performance (bad cache locality)
       • Poor cost/performance ratio, hard to program(?)
       • Number of handlers waiting to run at a given time is determined by excess parallelism in application, not arrival rate of messages

  4. Message-Passing Architectures
     • Commercial systems – nCUBE, CM-5
     • Defining feature: all significant computation happens in a devoted computational thread => good locality, performance
     • Traditional programming model is blocking, matched send/recv (implemented as 3-phase rendezvous)
     • Inherently a poor programming model for the lowest level:
       • Doesn't match the semantics of the NIC, and performance gets lost in the translation
       • Doesn't allow for overlap without expensive buffering
     • There's no compelling reason to keep this model as our lowest-level network interface, even for this architecture
       • Sometimes easier to program, but we want the lowest overhead interface possible as the NIC-level interface
       • Can easily provide a send/recv abstraction upon a more efficient interface
       • No way to recapture lost performance if the lowest-level interface is slow

  5. Active Messages - a new "mechanism"
     • Main idea: Take the best features of the message-driven model and unify them with the capabilities of message-passing hardware
       • Get the same or better performance as message-driven systems with little or no special-purpose hardware
       • Fix the mismatch between low-level software interface and hardware capabilities that cripples performance
         • Eliminate all buffering not required by transport
         • Expose out-of-order, asynchronous delivery
       • Need to restrict the allowable behavior of handlers somewhat to make this possible

  6. Active Messages - Handlers
     • User-provided handlers that "execute" messages
       • Handlers run immediately upon message arrival
       • Handlers run quickly and to completion (no blocking)
       • Handlers run atomically with respect to each other
     • These restrictions make it possible to implement handlers with no buffering on simple message-passing hardware
     • The purpose of AM handlers:
       • Quickly extract a message from the network and "integrate" the data into the running computation in an application-specific way, with a small amount of work
       • Handlers do NOT perform significant computation themselves
         • only the minimum functionality required to communicate
         • this is the crucial difference between AM and the message-driven model
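
The handler rules on this slide can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: `network`, `HANDLERS`, `am_send`, and `am_poll` are hypothetical names invented for illustration, an in-memory queue stands in for the NIC, and messages carry a handler-table index rather than a real code address.

```python
from collections import deque

# Hypothetical in-memory stand-in for the network receive queue.
network = deque()

# Application state that handlers "integrate" incoming data into.
partial_sums = {"total": 0, "received": 0}

def add_handler(value):
    """Short, non-blocking handler: fold the payload into running state."""
    partial_sums["total"] += value
    partial_sums["received"] += 1

# Handler table (an assumption of this sketch): the message header names
# the handler to run by its index in this list.
HANDLERS = [add_handler]

def am_send(handler_index, *args):
    """The sender names the handler in the message itself."""
    network.append((handler_index, args))

def am_poll():
    """Run each arrived message's handler to completion, one at a time."""
    handled = 0
    while network:
        handler_index, args = network.popleft()
        HANDLERS[handler_index](*args)  # no blocking, no extra buffering
        handled += 1
    return handled

for v in (3, 4, 5):
    am_send(0, v)
count = am_poll()
```

Because handlers run one after another inside `am_poll`, they are trivially atomic with respect to each other here; the main computation only sees `partial_sums` after the data has been folded in.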

  7. Active Messages - Handlers (cont.)
     • Miscellaneous restriction:
       • Communication is strictly request-reply (ensures acyclic protocol dependencies)
         • prevents deadlock with strictly bounded buffer space (assuming 2 virtual networks are available)
     • Still powerful enough to implement most if not all communication paradigms
       • Shared memory, message-passing, message-driven, etc
     • AM is especially useful as a compilation target for higher-level languages (Split-C, Titanium, etc)
       • Acceptable to trade off programmability and possibly some protection to maximize performance
       • Code often generated by a compiler anyhow, so guarding against naïve users is less critical
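
A rough sketch of the request-reply restriction follows, assuming two separate queues per node play the role of the two virtual networks. All names (`Node`, `send_request`, `send_reply`, `poll`, and the two handlers) are hypothetical. The request handler may send only a reply back to the requester, and the reply handler sends nothing, so protocol dependencies stay acyclic.

```python
from collections import deque

class Node:
    """Hypothetical node with one queue per virtual network."""
    def __init__(self, memory):
        self.memory = memory
        self.requests = deque()  # virtual network 1: requests only
        self.replies = deque()   # virtual network 2: replies only
        self.results = []

nodes = {}

def send_request(dst, handler, src, *args):
    nodes[dst].requests.append((handler, src, args))

def send_reply(dst, handler, *args):
    nodes[dst].replies.append((handler, args))

def read_request_handler(node, src, addr):
    # A request handler may send at most a reply to the requester.
    send_reply(src, read_reply_handler, addr, node.memory[addr])

def read_reply_handler(node, addr, value):
    # A reply handler must not send any message at all.
    node.results.append((addr, value))

def poll(node):
    # Replies never generate traffic, so they can always be drained first.
    while node.replies:
        handler, args = node.replies.popleft()
        handler(node, *args)
    while node.requests:
        handler, src, args = node.requests.popleft()
        handler(node, src, *args)

nodes[0] = Node({})
nodes[1] = Node({7: "hello"})
send_request(1, read_request_handler, 0, 7)  # node 0 reads address 7 on node 1
poll(nodes[1])  # runs the request handler, which replies
poll(nodes[0])  # runs the reply handler
```

Since a reply can always be consumed without producing new messages, the reply queue drains unconditionally, which is the intuition behind the bounded-buffer deadlock argument on the slide.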

  8. Proof of Concept: Split-C
     • Split-C: an explicitly parallel, SPMD version of C
       • Global address space abstraction, with a visible local/remote distinction
       • Split-phase, one-sided (asynchronous) remote memory operations
         • Sender executes put or get, then a sync on local counter for completion of 1 or more ops
         • User/compiler explicitly specifies prefetching to get overlap
       • Write in shared memory style, but remote operations explicit
         • local/global distinction important for high performance, so expose it to user
         • can also implement arbitrarily generalized data transfers (scatter-gather, strided)
     • Important points:
       • AM can efficiently provide global memory space on existing message-passing systems in software, using the right model
       • evolutionary change rather than revolutionary (keep the architecture)
       • works very well for coarse-grained SPMD apps
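
The split-phase get with a local completion counter might be sketched as below. This is a single-threaded simulation, not Split-C itself: `Proc`, `get`, `poll`, and `sync` are hypothetical names, each processor's memory is a dictionary, and `sync` services every processor's inbox because no real remote node is running concurrently.

```python
from collections import deque

class Proc:
    """Hypothetical processor: local memory, an inbox, a completion counter."""
    def __init__(self, memory):
        self.memory = memory
        self.inbox = deque()
        self.pending = 0  # outstanding split-phase operations

procs = []

def get(me, remote, remote_addr, local_addr):
    """Phase 1: issue the request and return immediately."""
    procs[me].pending += 1
    procs[remote].inbox.append(("get_request", me, remote_addr, local_addr))

def poll(pid):
    p = procs[pid]
    while p.inbox:
        kind, *args = p.inbox.popleft()
        if kind == "get_request":
            requester, remote_addr, local_addr = args
            value = p.memory[remote_addr]
            procs[requester].inbox.append(("get_reply", local_addr, value))
        elif kind == "get_reply":
            local_addr, value = args
            p.memory[local_addr] = value  # integrate data into local memory
            p.pending -= 1                # signal completion via the counter

def sync(me):
    """Phase 2: wait until every outstanding get has completed."""
    while procs[me].pending > 0:
        for pid in range(len(procs)):  # simulation only: service all nodes
            poll(pid)

procs.extend([Proc({}), Proc({0: 10, 1: 20})])
get(0, 1, 0, "a")
get(0, 1, 1, "b")
local_work = sum(range(5))  # computation overlapped with the two gets
sync(0)
total = procs[0].memory["a"] + procs[0].memory["b"]
```

The point of the split is that `local_work` is computed between issuing the gets and calling `sync`, which is where the overlap of communication and computation comes from.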

  9. Results
     • Dramatic reduction in latency on commercial message-passing machines with NO additional hardware
       • nCUBE/2:
         • AM send/handle: 11 µs / 15 µs overhead
         • Blocking message send/recv: 160 µs overhead
       • CM-5:
         • AM: <2 µs overhead
         • Blocking message send/recv: 86 µs overhead
     • About an order of magnitude improvement with no hardware investment
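
As a quick check of the "order of magnitude" claim, the ratios can be worked out from the figures on this slide. Two assumptions are made here that the slide does not state: the nCUBE/2 send and handle overheads are summed to compare against the blocking figure, and the CM-5 "<2 µs" is treated as an upper bound of 2 µs, so its ratio is only a lower bound.

```python
# Overheads in microseconds, taken from the slide.
ncube_am = 11 + 15      # assumption: send overhead + handle overhead
ncube_blocking = 160
cm5_am_upper = 2        # slide says "<2", so this is an upper bound
cm5_blocking = 86

ncube_speedup = ncube_blocking / ncube_am       # about 6.2x
cm5_speedup_min = cm5_blocking / cm5_am_upper   # at least 43x
```

Under these assumptions the improvement ranges from roughly 6x on the nCUBE/2 to more than 43x on the CM-5.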

  10. Optional Hardware/Kernel Support for AM
     • DMA transfer support => large messages
     • Registers on NIC for composing messages
       • General registers, not FIFOs - allow message reuse
       • Ability to compose a request & reply simultaneously
     • Fast user-level interrupts
       • Allow fully user-level interrupts (trap directly to handler)
       • PC injection is one way to do this
       • Any protection mechanisms required for kernel to allow user-level NIC interrupts
     • Support for efficient polling

  11. Problems with AM-1 paper
     • Handler atomicity with respect to the main computation
       • Addressed in von Eicken's thesis
       • Solutions:
         • Atomic instructions
         • Mechanism to temporarily disable NIC interrupts using a memory flag or reserved register
     • Described as an abstract mechanism, not a solid portable spec
     • Little support for recv protection, multi-threading, CLUMPs, abstract naming, etc
     • AM-2 fixes the above problems
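
The memory-flag solution to handler atomicity can be sketched as follows. This is a simulation under assumptions, not the mechanism from the thesis: `Endpoint`, `deliver`, `enter_critical`, and `exit_critical` are hypothetical names, and a method call stands in for the message-arrival interrupt. While the flag is set, arriving handlers are deferred and then run when the critical section ends.

```python
from collections import deque

class Endpoint:
    def __init__(self):
        self.counter = 0
        self.interrupts_disabled = False  # the "memory flag"
        self.deferred = deque()

    def deliver(self, handler, *args):
        """Simulates a message-arrival interrupt."""
        if self.interrupts_disabled:
            self.deferred.append((handler, args))  # postpone the handler
        else:
            handler(self, *args)

    def enter_critical(self):
        self.interrupts_disabled = True

    def exit_critical(self):
        self.interrupts_disabled = False
        while self.deferred:  # run whatever arrived in the meantime
            handler, args = self.deferred.popleft()
            handler(self, *args)

def increment_handler(ep, amount):
    ep.counter += amount

ep = Endpoint()
ep.enter_critical()
snapshot = ep.counter             # main computation reads shared state
ep.deliver(increment_handler, 5)  # a message arrives mid-update
ep.counter = snapshot + 1         # main computation writes back
ep.exit_critical()                # deferred handler runs now
```

Without the flag, the handler would run between the read and the write, and its update of 5 would be overwritten by the write-back; with the flag, both updates survive.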

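
Three of the AM-2 additions (handler indirection, receive-side tags, and return to sender) can be illustrated together. The sketch below does not use the real AM-2 API; `Endpoint`, `set_handler`, and `request` are hypothetical names chosen for illustration. Messages name a handler by table index instead of a code address, and a message whose tag does not match the destination endpoint is handed back to the sender.

```python
class Endpoint:
    """Hypothetical endpoint with a tag and a per-endpoint handler table."""
    def __init__(self, tag):
        self.tag = tag
        self.handlers = {}  # handler index -> function (the indirection)
        self.log = []
        self.returned = []  # undeliverable messages handed back to sender

    def set_handler(self, index, fn):
        self.handlers[index] = fn

def request(src, dst, dst_tag, index, *args):
    """Deliver if the tag and handler index check out, else return to sender."""
    if dst_tag != dst.tag or index not in dst.handlers:
        src.returned.append((index, args))
        return False
    dst.handlers[index](dst, *args)
    return True

def store_handler(ep, value):
    ep.log.append(value)

a = Endpoint(tag=0xA)
b = Endpoint(tag=0xB)
b.set_handler(1, store_handler)

ok = request(a, b, 0xB, 1, "payload")  # matching tag: handler runs on b
bad = request(a, b, 0xC, 1, "stray")   # wrong tag: returned to a
```

The table index is what lets processes with different memory layouts name the same handler, and the tag check catches stray messages from buggy (not malicious) senders.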

  13. Influence of Active Messages
     • Many implementations of AM in some form
       • natively on NICs: Myrinet (NOW project), VIA (Buonadonna & Begel), HP Medusa (Richard Martin), Intel Paragon (Liu), Meiko CS-2 (Schauser)
       • on other transports: TCP (Liu and Mainwaring), UDP (me), MPI (me), LAPI (Yau & Welcome)
       • other interesting: Multi-protocol AM (shared memory & network for CLUMPs) (Lumetta)
     • Used as compilation target for many parallel languages/systems:
       • Split-C, Id90/TAM, Titanium, PVM, UPC, MPI…
     • Influenced the design of important systems
       • E.g. IBM SP supercomputer: LAPI - low-level messaging layer that is basically AM
