Supporting Parallel Applications on Clusters of Workstations: The Intelligent Network Interface Approach. Marcel Catalin Rosu, Karsten Schwan, and Richard Fujimoto, Georgia Institute of Technology. Appears in HPDC 1997. Presented by: Lei Yang
Background • Multiprocessor-based system models • Parallel vector processor (PVP) • Symmetric multiprocessor (SMP) • Massively parallel processor (MPP) • Distributed shared memory (DSM) machine • Cluster of workstations (COW) • COW features • Each node is a complete workstation minus peripherals (monitor, keyboard, mouse, …) • Nodes are connected through a commodity network, e.g., Ethernet, FDDI, or an ATM switch • A complete OS resides on each node
Motivation • Problem with COWs • Communication software performance is inherently unable to scale along with host CPU performance • High communication overhead: software overhead (the time required for message preparation and authentication) is significantly higher than hardware overhead (network setup and message propagation time) • Coprocessors exist on network interfaces • Myrinet and ATM • But what should coprocessors do to minimize communication overheads?
Motivation • The critical step is the reduction of host communication overheads, rather than network latency. • Why? • Many existing parallel applications are designed to hide network latencies; • Multithreaded applications typically cannot benefit significantly from reducing network latency below the cost of several user-level thread context switches; • In a cluster, in contrast to a parallel machine, the schedulers of distinct nodes are only loosely synchronized. This implies highly dynamic offsets, on the order of tens of microseconds, among schedulers and therefore among cooperating application threads.
The VCM approach • VCM: Virtual Communication Machine • Enables applications to set up a customized, lightweight communication path between their address spaces and the “wire” • Goal • Reduce software communication overheads • How • Transfer selected communication-related processing from the host CPU(s) to the network coprocessor • A low-level abstraction between applications and the coprocessor • Applications interact directly with the VCM • Complexity is hidden by a user-level library • The usual protection is enforced via a kernel extension • VCM and applications operate asynchronously • VCM and applications communicate through shared memory
VCM features • Called the “intelligent network interface” VCM here; the name was changed in a later journal version • The VCM has an active role • Access to the application's address space • Extensions to shared-memory applications • Zero-copy messaging available at both ends • sending • receiving • Communication-related processing can be transferred to the network coprocessor • Buffer pages are managed by the application • The application knows its own behavior best • Multiple VCMs supported per host
VCM Architecture • The coprocessor is responsible for • Ensuring data integrity • Assembling/disassembling messages directly from/into an application's data structures • Multiplexing/demultiplexing network messages • Enforcing protection • Three components • The Virtual Communication Machine, implemented on the network coprocessor • A kernel extension module • For address-space management and protection • A user-level library • Hides from applications the complexity of interacting with the VCM and the kernel extension
Application–VCM interaction • An application accesses a VCM by registering with it • Registering extends the application's shared memory space with the VCM's command area • Application and VCM interact via the command area • Program and instruction completion is signaled through status words placed in the command area • Asynchronous operation • The coprocessor polls for new programs to execute • The host CPU(s) check for program and instruction completion by polling the status words • Data transfers are performed only by the coprocessor • Loop instructions improve performance • Useful for bursty invocations with many identical parameters
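The command-area handshake above can be sketched in C. This is a hedged, single-threaded simulation: the struct layout, the names (vcm_cmd, host_post, vcm_poll_once, STATUS_*), and the opcode encoding are all illustrative assumptions, not the paper's actual interface.

```c
#include <stdint.h>

/* Illustrative status-word values; both sides poll this field. */
enum { STATUS_FREE = 0, STATUS_POSTED = 1, STATUS_DONE = 2 };

/* Hypothetical command-area slot shared between host and coprocessor. */
typedef struct {
    volatile uint32_t status;   /* status word polled by both sides   */
    uint32_t opcode;            /* VCM instruction to execute         */
    uint64_t src, dst, len;     /* operands: buffer addresses, length */
} vcm_cmd;

/* Host side: fill in a command, then flip the status word to hand it
   to the coprocessor. The host does not copy any data itself. */
void host_post(vcm_cmd *c, uint32_t op,
               uint64_t src, uint64_t dst, uint64_t len) {
    c->opcode = op;
    c->src = src;
    c->dst = dst;
    c->len = len;
    c->status = STATUS_POSTED;  /* publish the command */
}

/* Coprocessor side (simulated): poll for a posted command, execute it
   (the real VCM would drive a DMA transfer here), signal completion. */
void vcm_poll_once(vcm_cmd *c) {
    if (c->status == STATUS_POSTED) {
        /* ... perform the data transfer on behalf of the host ... */
        c->status = STATUS_DONE;
    }
}
```

In the real system the two sides run concurrently on different processors, so the status word is the only synchronization point; the host checks it later rather than blocking.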
Implementation • Platform • Cluster of Sun UltraSPARC I Model 170 workstations • Solaris 2.5 • FORE SBA-200E network cards • 25 MHz i960 microprocessor
Implementation • VCM interpreter • Runs on the coprocessor • Order in which requests are served: • Protection-related instructions • VCM programs • Loop instructions • Incoming data • Protection and buffer-page management • The VCM accepts protection-management instructions only from the kernel or from the connection server • The VCM checks the correctness of all parameters received from an application • Messages longer than expected are truncated to the size of the receiving buffer
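The truncation rule above can be sketched as a small C helper: an incoming payload longer than the posted receive buffer is clipped to the buffer size rather than rejected or allowed to overflow. The function name and signature are invented for illustration; only the clipping behavior comes from the slide.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch (names invented): deliver an incoming message
   into an application-posted receive buffer. A message longer than
   the buffer is truncated to the buffer size, as the VCM does for
   oversized messages. Returns the number of bytes delivered. */
size_t vcm_deliver(char *recv_buf, size_t buf_len,
                   const char *msg, size_t msg_len) {
    size_t n = msg_len < buf_len ? msg_len : buf_len; /* clip */
    memcpy(recv_buf, msg, n);
    return n;
}
```

Clipping instead of faulting keeps the coprocessor's receive path simple and protects the application's address space: the coprocessor can never write past the pages the application registered.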
Implementation • VCM instruction set
Evaluation • Microbenchmarks • Synthetic client/server application • Ten client workstations issue back-to-back data requests to the server workstation • Traveling Salesman Problem (TSP) • Georgia Tech Time Warp (GTW) • A parallel kernel for discrete-event simulation • PHold, a synthetic application • PCS, a wireless network simulation
Performance - Microbenchmarks • Latency is linear in the message size • The maximum send rate approaches the maximum data capacity of the wire
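A linear latency curve is what the standard first-order cost model predicts: total latency is a fixed per-message overhead plus the size divided by bandwidth. The sketch below uses that generic model; the constants are illustrative, not measurements from the paper.

```c
/* Generic first-order communication cost model: t(n) = t0 + n / B,
   where t0 is the per-message overhead (in microseconds) and B is
   the bandwidth (in bytes per microsecond). Illustrative only. */
double model_latency_us(double msg_bytes, double t0_us,
                        double bw_bytes_per_us) {
    return t0_us + msg_bytes / bw_bytes_per_us;
}
```

Under this model, shrinking t0 (the VCM's target: host software overhead) shifts the whole line down, while the slope is fixed by the wire's bandwidth.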
Performance - Client/server application Outgoing bandwidth of the server as a function of the request size, when the server uses one or two interfaces.
Limitations • Requires special hardware • A network adapter card equipped with: • A network coprocessor • A few megabytes of fast memory • One or more DMA engines under the control of the coprocessor • Network-specific hardware to assist with performance-critical processing (e.g., CRC computation) • How hard is it to port shared-memory applications to a VCM-based COW?
Conclusion • Host communication overhead is the crucial cost • VCM • Flexible integration of network and application • Low overhead on the host processor • Latency and bandwidth close to the hardware limits • Enables zero-copy messaging • Certain shared-memory parallel applications were ported to a VCM-based COW • Performance is compelling and the contribution is valuable