
Marcel Catalin Rosu, Karsten Schwan, and Richard Fujimoto, Georgia Institute of Technology

Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach. Marcel Catalin Rosu, Karsten Schwan, and Richard Fujimoto, Georgia Institute of Technology. Appears in HPDC 1997. Presented by: Lei Yang.

Presentation Transcript


  1. Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach. Marcel Catalin Rosu, Karsten Schwan, and Richard Fujimoto, Georgia Institute of Technology. Appears in HPDC 1997. Presented by: Lei Yang

  2. Background
     • Multiprocessor-based system models
       • Parallel vector processor (PVP)
       • Symmetric multiprocessor (SMP)
       • Massively parallel processor (MPP)
       • Distributed shared memory machine
       • Cluster of workstations (COW)
     • COW features
       • Each node is a complete workstation minus peripherals (monitor, keyboard, mouse, …)
       • Nodes are connected through a commodity network, e.g., Ethernet, FDDI, or an ATM switch
       • A complete OS resides in each node

  3. Motivation
     • Problem with COW
       • The inherent inability to scale the performance of communication software along with host CPU performance
       • High communication overhead: software overhead (time required for the preparation and authentication of the message) is significantly higher than hardware overhead (network setup and message propagation time)
     • Coprocessors on the network interface
       • Myrinet and ATM
       • But what should coprocessors do to minimize communication overheads?

  4. Motivation
     • The critical step is the reduction of host communication overheads, rather than network latency.
     • Why?
       • Many existing parallel applications are designed to hide network latencies.
       • Multithreaded applications typically cannot benefit significantly from reducing network latencies below the cost of several user-level thread context switches.
       • In a cluster, in contrast to a parallel machine, the schedulers of distinct nodes are only loosely synchronized. This implies highly dynamic offsets, on the order of tens of microseconds, among schedulers and therefore among cooperating application threads.

  5. The VCM approach
     • VCM
       • Virtual Communication Machine
       • Enables applications to set up a customized and lightweight communication path between their address spaces and the “wire”
     • Goal
       • Reduction of software communication overheads
     • How
       • Transfer selected communication-related processing activities from the host CPU(s) to the network coprocessor
       • A low-level abstraction between applications and coprocessor
       • Applications interact directly with the VCM
       • Hide complexity via a user-level library
       • Usual protection via a kernel extension
       • VCM and applications operate asynchronously
       • VCM and applications use shared memory to communicate

  6. VCM features
     • The intelligent network interface → VCM
       • They changed the name in a later journal version.
     • VCM has an active role
       • Access to application address space
       • Extensions to shared-memory applications
     • Zero-copy messaging available at both ends
       • Sending
       • Receiving
     • Communication-related processing can be transferred to the network coprocessor
     • Buffer pages are managed by the application
       • The application itself knows its behavior better
     • Multiple VCMs supported for each host

  7. VCM Architecture
     • Coprocessor is responsible for
       • Ensuring data integrity
       • Assembling/disassembling messages directly from/into an application’s data structures
       • Multiplexing/demultiplexing network messages
       • Enforcing protection
     • Three components
       • Virtual Communication Machine, implemented on the network coprocessor
       • A kernel extension module
         • For address space management and protection
       • A user-level library
         • Shields applications from the complexity of interacting with the VCM and the kernel extension

  8. Application–VCM interaction
     • An application accesses a VCM by registering
       • Registering extends a shared memory space with the VCM: the command area
     • Application and VCM interact via the command area
       • Program or instruction completion is signaled using status words placed in the command area.
     • Asynchronous operations
       • The coprocessor polls for new programs to execute
       • Host CPU(s) check for program and instruction completion by polling the status words.
     • Data transfers are performed only by the coprocessor
     • Improving performance
       • Loop interaction
       • Bursty invocations with many identical parameters
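The polling interaction on this slide can be pictured with a toy model. The sketch below is only an illustration in Python, not the VCM implementation: the names `CommandArea`, `host_post`, `coprocessor_poll`, and `host_is_done`, and the status values `PENDING`/`DONE`, are all hypothetical, and an ordinary in-process object stands in for the shared memory region.

```python
from collections import deque
from dataclasses import dataclass, field

# Status-word values; hypothetical names, not the paper's encoding.
PENDING, DONE = 0, 1


@dataclass
class CommandArea:
    """Toy stand-in for the memory region shared by application and VCM."""
    programs: deque = field(default_factory=deque)  # programs posted by the host
    status: dict = field(default_factory=dict)      # program id -> status word


def host_post(area, prog_id, payload):
    """Host side: write a program into the command area and return at once."""
    area.status[prog_id] = PENDING
    area.programs.append((prog_id, payload))


def coprocessor_poll(area, wire):
    """Coprocessor side: one polling pass that runs at most one program."""
    if not area.programs:
        return False
    prog_id, payload = area.programs.popleft()
    wire.append(payload)          # only the coprocessor moves the data
    area.status[prog_id] = DONE   # completion is signaled via the status word
    return True


def host_is_done(area, prog_id):
    """Host side: poll the status word instead of blocking on the transfer."""
    return area.status.get(prog_id) == DONE


area, wire = CommandArea(), []
host_post(area, 1, b"hello")
print(host_is_done(area, 1))   # host is free to do other work meanwhile
coprocessor_poll(area, wire)
print(host_is_done(area, 1))
```

The point of the sketch is the decoupling: the host never waits inside `host_post`, and the only synchronization between the two sides is the status word.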

  9. Command Area

  10. Implementation
     • Platform
       • Cluster of Sun UltraSPARC I Model 170 workstations
       • Solaris 2.5
       • FORE SBA-200E network cards
         • 25 MHz i960 microprocessor

  11. Implementation
     • VCM interpreter
       • Running on the coprocessor
       • Order of requests
         • Protection-related instructions
         • VCM programs
         • Loop instructions
         • Incoming data
     • Protection and buffer page management
       • VCM accepts protection management instructions only from the kernel or from the connection server
       • VCM checks the correctness of all parameters received from an application
       • Messages longer than expected are truncated to the size of the receiving buffer
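Two of the points on this slide lend themselves to a small sketch: serving requests in a fixed order, and truncating oversized messages. The Python below is a hedged illustration only. It assumes the "order of requests" list is a priority order (the slide does not say so explicitly), and the names `PRIORITY`, `next_request`, and `deliver` are hypothetical.

```python
# Request classes, assumed to be listed from highest to lowest priority.
PRIORITY = ["protection", "program", "loop", "incoming"]


def next_request(queues):
    """Return (kind, request) from the first non-empty queue, or None."""
    for kind in PRIORITY:
        if queues[kind]:
            return kind, queues[kind].pop(0)
    return None


def deliver(message: bytes, recv_buffer: bytearray) -> int:
    """Copy a message into the receive buffer, truncating it if too long.

    Returns the number of bytes actually stored.
    """
    n = min(len(message), len(recv_buffer))
    recv_buffer[:n] = message[:n]
    return n


queues = {"protection": [], "program": ["p1"], "loop": [], "incoming": ["m1"]}
print(next_request(queues))

buf = bytearray(4)
stored = deliver(b"abcdefgh", buf)
print(stored, bytes(buf))
```

Truncating at the buffer boundary means a misbehaving sender cannot write past the memory the receiver has set aside, which fits the slide's emphasis on checking every parameter that comes from an application.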

  12. Implementation
     • VCM instruction set

  13. Evaluation
     • Microbenchmarks
     • Synthetic client/server application
       • Ten client workstations issue back-to-back data requests to the server workstation
     • Traveling Salesman Problem (TSP)
     • Georgia Tech Time Warp (GTW)
       • A parallel kernel for discrete-event simulation
       • PHold, a synthetic application
       • PCS, a wireless network simulation

  14. Performance - Microbenchmarks
     • Latency is linear in the message size
     • The maximum send rate approaches the maximum data capacity of the wire
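A simple cost model shows how these two observations fit together. The Python sketch below assumes that latency is a fixed per-message cost plus a term proportional to message size, and that each message pays the fixed cost in series. The function names and the constants `FIXED_US` and `WIRE` are illustrative placeholders, not measurements from the paper.

```python
def latency_us(size_bytes, fixed_us, wire_bytes_per_us):
    """Linear latency model: fixed per-message cost plus time on the wire."""
    return fixed_us + size_bytes / wire_bytes_per_us


def send_rate(size_bytes, fixed_us, wire_bytes_per_us):
    """Achieved rate in bytes per microsecond under the same model."""
    return size_bytes / latency_us(size_bytes, fixed_us, wire_bytes_per_us)


# Hypothetical parameters, chosen only to show the shape of the curve.
FIXED_US = 10.0   # assumed fixed cost per message
WIRE = 15.0       # assumed wire capacity in bytes per microsecond

for size in (64, 1024, 16384, 262144):
    fraction = send_rate(size, FIXED_US, WIRE) / WIRE
    print(size, round(fraction, 3))
```

Under this model the achieved rate is `size / (fixed + size / wire)`, which grows with message size and tends toward the wire capacity as the fixed cost is amortized over more bytes.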

  15. Performance - Client/server application
     • Outgoing bandwidth of the server as a function of the request size, when the server uses one or two interfaces.

  16. Performance – TSP

  17. Performance – PHold

  18. Performance – PCS

  19. Limitations
     • Requires special hardware
       • A network adapter card equipped with
         • A network coprocessor
         • A few megabytes of fast memory
         • One or more DMA engines under the control of the coprocessor
         • Network-specific hardware to help with performance-critical processing (e.g., CRC)
     • How hard is it to port shared-memory applications to a VCM-based COW?

  20. Conclusion
     • Host communication overhead is crucial
     • VCM
       • Flexibility of integration between network and application
       • Low overhead on the host processor
       • Latency and bandwidth close to the hardware limits
       • Enables zero-copy messaging
       • Porting of certain shared-memory parallel applications to a VCM-based COW
     • Performance is desirable, contribution is valuable
