Realizing the Performance Potential of the Virtual Interface Architecture. Evan Speight, Hazim Abdel-Shafi, and John K. Bennett, Rice University, Department of Electrical and Computer Engineering. Presented by Constantin Serban, R.U.
VIA Goals • Communication infrastructure for System Area Networks (SANs) • Targets mainly high-speed cluster applications • Efficiently harnesses the communication performance of the underlying networks
Trends • Peak bandwidth has increased by two orders of magnitude over the past decade, while the latency seen by user applications has decreased only modestly. • The latency introduced by the protocol is typically several times the latency of the transport layer. • The problem is especially acute for small messages.
Targets The VI architecture addresses the following issues: • Decrease latency, especially for small messages (used in synchronization) • Increase aggregate bandwidth (only a fraction of the peak bandwidth is utilized) • Reduce the CPU processing caused by message overhead
Overhead Overhead mainly comes from two sources: • Every network access requires one or two traps into the kernel • The user/kernel mode switch is time-consuming • Usually two data copies occur: • From the user buffer to the message-passing API • From the message layer to the kernel buffer
VIA approach • Remove the kernel from the critical path • Move communication code out of the kernel into user space • Provide a zero-copy protocol • Data is sent from and received directly into the user buffer; no message copy is performed
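The contrast between the traditional path and the VIA path can be made concrete with a toy cost model. The sketch below counts only the traps and copies named on the slides above; the per-trap cost and copy bandwidth are hypothetical placeholders, not measurements from the paper, and the function names are made up for illustration.

```python
# Toy per-message software overhead model. All constants are hypothetical
# placeholders chosen for illustration, not figures from the paper.
TRAP_COST_US = 2.0          # assumed cost of one user/kernel mode switch
COPY_BW_BYTES_PER_US = 100  # assumed memory copy bandwidth


def software_overhead_us(msg_bytes, traps, copies):
    """Overhead added by kernel traps and data copies for one message."""
    copy_cost = copies * msg_bytes / COPY_BW_BYTES_PER_US
    return traps * TRAP_COST_US + copy_cost


def kernel_path_us(msg_bytes):
    # Traditional path: up to two traps and two data copies per message.
    return software_overhead_us(msg_bytes, traps=2, copies=2)


def via_path_us(msg_bytes):
    # VIA path: kernel off the critical path and no data copies.
    return software_overhead_us(msg_bytes, traps=0, copies=0)


if __name__ == "__main__":
    for size in (64, 1024, 65536):
        print(size, kernel_path_us(size), via_path_us(size))
```

In this model the trap cost is fixed per message, so it dominates for small messages, which is where the slides say the problem is most acute.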
VIA emerged as a standardization effort from Compaq, Intel, and Microsoft. It was built on several academic ideas: • The main architecture is most similar to U-Net • Essential features are derived from VMMC Among current implementations: • GigaNet cLan - VIA implemented in hardware • Tandem ServerNet - VIA emulated in a software driver • Myricom Myrinet - VIA emulated in firmware
VIA operations • Set-Up/Tear-Down: VIA is a point-to-point, connection-oriented protocol; the VI endpoint is the core concept in VIA • Register/De-Register Memory • Connect/Disconnect • Transmit • Receive • RDMA
VIA operations Set-Up/Tear-Down: VIA is a point-to-point, connection-oriented protocol • VI endpoint: the core concept in VIA • The VipCreateVi function creates a VI endpoint in user space. • The user-level library passes the call to the kernel agent, which passes the creation information to the NIC. • The OS thus controls the application's access to the NIC
VIA operations - cont’d Register/De-Register Memory: • All data buffers and descriptors reside in registered memory • The NIC performs DMA I/O operations on this registered memory • Registration pins the pages in physical memory, provides a handle for referring to them, and transfers their addresses to the NIC • It is performed once, usually at the beginning of the communication session
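The registration step can be pictured as a table that maps handles to pinned regions and that the NIC consults before every DMA. The following is a minimal simulation under that assumption; `RegistrationTable` and its methods are hypothetical names, not the VIPL API, and pinning is modeled only as bookkeeping.

```python
# Sketch of a registration table. Names are hypothetical; this is a
# simulation, not the VIPL API, and "pinning" is modeled as bookkeeping only.
import itertools


class RegistrationTable:
    def __init__(self):
        self._next_handle = itertools.count(1)
        self._regions = {}  # handle -> (base address, length)

    def register(self, base, length):
        """Record a region as pinned and return a handle for it."""
        if length <= 0:
            raise ValueError("length must be positive")
        handle = next(self._next_handle)
        self._regions[handle] = (base, length)
        return handle

    def deregister(self, handle):
        """Release the region; later DMA through this handle is rejected."""
        del self._regions[handle]

    def check_dma(self, handle, addr, nbytes):
        """Return True only if [addr, addr + nbytes) lies inside the region."""
        region = self._regions.get(handle)
        if region is None:
            return False
        base, length = region
        return base <= addr and addr + nbytes <= base + length
```

Because every descriptor and data buffer must pass a check like `check_dma`, registering once at session start keeps this cost off the per-message path.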
VIA operations - cont’d Connect/Disconnect: • Before communication, each endpoint is connected to a remote endpoint • The connection request is passed to the kernel agent and down to the NIC • VIA does not define any addressing scheme; implementations can use existing schemes
VIA operations - cont’d Transmit/receive: • The sender builds a descriptor for the message to be sent. The descriptor points to the actual data buffer. Both the descriptor and the data buffer reside in a registered memory area. • The application then posts a doorbell to signal the availability of the descriptor. The doorbell contains the address of the descriptor. • The doorbells are maintained in an internal queue inside the NIC
VIA operations - cont’d Transmit/receive (cont’d): • Meanwhile, the receiver creates a descriptor that points to an empty data buffer and posts a doorbell in the receiver NIC queue • When the sender's doorbell reaches the head of its queue, the NIC follows a double indirection (doorbell to descriptor, descriptor to data buffer) and sends the data into the network. • On arrival, the first doorbell/descriptor is taken from the receiver queue and its buffer is filled with the incoming data
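The transmit/receive steps above can be sketched as a small simulation. This is an illustrative model only: `Descriptor`, `Nic`, and their methods are hypothetical names, the queues hold descriptor references in place of doorbell addresses, and the byte copy in `transmit_next` stands in for the wire.

```python
# Simulation of the descriptor/doorbell flow. All names are hypothetical.
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    buffer: bytearray   # data buffer in (simulated) registered memory
    length: int = 0     # number of valid bytes in the buffer
    done: bool = False  # set by the NIC when the transfer completes


class Nic:
    def __init__(self):
        self.send_queue = deque()  # doorbells for posted send descriptors
        self.recv_queue = deque()  # doorbells for posted receive descriptors

    def post_send(self, desc):
        self.send_queue.append(desc)  # the doorbell refers to the descriptor

    def post_recv(self, desc):
        self.recv_queue.append(desc)

    def transmit_next(self, peer):
        """Send the message at the head of the send queue to the peer NIC."""
        desc = self.send_queue.popleft()            # doorbell -> descriptor
        payload = bytes(desc.buffer[:desc.length])  # descriptor -> data
        peer.deliver(payload)                       # stands in for the wire
        desc.done = True

    def deliver(self, payload):
        """Fill the first posted receive buffer with the incoming data."""
        if not self.recv_queue:
            raise RuntimeError("no receive descriptor posted")
        if len(payload) > len(self.recv_queue[0].buffer):
            raise RuntimeError("receive buffer too small")
        desc = self.recv_queue.popleft()
        desc.buffer[:len(payload)] = payload
        desc.length = len(payload)
        desc.done = True


if __name__ == "__main__":
    sender, receiver = Nic(), Nic()
    rx = Descriptor(bytearray(16))
    receiver.post_recv(rx)  # the receiver posts its buffer first
    tx = Descriptor(bytearray(b"hello"), length=5)
    sender.post_send(tx)
    sender.transmit_next(receiver)
    print(bytes(rx.buffer[:rx.length]))
```

Note that the sketch raises an error when no receive descriptor has been posted; it makes no claim about how a particular VIA implementation handles that case.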
VIA operations - cont’d RDMA: • As a mechanism derived from VMMC, VIA allows Remote DMA operations: RDMA Read and RDMA Write • Each node allocates a receive buffer and registers it with the NIC. Additional structures that contain read and write pointers to the receive buffers are exchanged during connection setup • Each node can then read from and write to the remote node's memory directly. • These operations pose potential implementation problems.
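A minimal way to picture RDMA is two nodes that exchange references to each other's registered receive buffers at connection setup and then access them without involving the remote application. The sketch below simulates this in one process; `Node`, `connect`, `rdma_write`, and `rdma_read` are hypothetical names, and a `memoryview` stands in for the remote address that real hardware would use.

```python
# Simulated RDMA read/write. Names are hypothetical; real NICs do this in
# hardware against registered memory on another machine.
class Node:
    def __init__(self, size):
        self.recv_buffer = bytearray(size)  # registered receive buffer
        self.remote = None                  # view of the peer's buffer


def connect(a, b):
    """Exchange references to the receive buffers during connection setup."""
    a.remote = memoryview(b.recv_buffer)
    b.remote = memoryview(a.recv_buffer)


def rdma_write(node, offset, data):
    """Write data into the peer's buffer with no action by the peer."""
    if offset < 0 or offset + len(data) > len(node.remote):
        raise IndexError("write outside the remote registered buffer")
    node.remote[offset:offset + len(data)] = data


def rdma_read(node, offset, nbytes):
    """Read bytes from the peer's buffer with no action by the peer."""
    if offset < 0 or offset + nbytes > len(node.remote):
        raise IndexError("read outside the remote registered buffer")
    return bytes(node.remote[offset:offset + nbytes])
```

The bounds checks hint at one class of implementation problem: the NIC must validate every remote access against the registered region on its own, since the remote CPU is not involved.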
Evaluation Benchmarks • Two VI implementations: • GigaNet cLan: bandwidth 125 MB/s, latency 480 ns • Tandem ServerNet: bandwidth 50 MB/s, latency 300 ns • Performance measured: • Bandwidth and latency • Polling vs. blocking • CPU utilization
MPI performance using VIA • The challenge is to deliver VIA's performance to distributed applications • Software layers such as MPI usually sit between VIA and the application: they improve usability but add overhead • How can this layer be optimized so that it uses VIA efficiently?
MPI observations • The difference between MPI-UDP and MPI-VIA-baseline is remarkable • MPI-VIA-baseline still falls far short of VIA-Native • Several improvements are proposed to bring MPI-VIA closer to VIA-Native by reducing MPI overhead
MPI Improvements • Eliminating unnecessary copies: MPI over UDP and over VIA uses a single set of receive buffers, so data must be copied to the application; the fix is to allow the user to register any buffer • Choosing a synchronization primitive: all synchronization formerly used OS constructs (events); a better implementation uses the processor's swap instructions • No acknowledgment: remove message acknowledgments by switching to a reliable VIA mode
VIA - Disadvantages • Polling vs. blocking synchronization: a tradeoff between CPU consumption (polling) and wake-up overhead (blocking) • Memory registration: locking large amounts of memory makes virtual memory mechanisms inefficient, while registering/deregistering on the fly is slow • Point-to-point vs. multicast: VIA lacks multicast primitives. Implementing multicast on top of the existing point-to-point mechanism makes communication inefficient
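The polling vs. blocking tradeoff can be illustrated with a completion flag that supports both waiting styles. This is a sketch using Python's `threading.Event` as a stand-in for a NIC completion; the `Completion` class and its method names are hypothetical and unrelated to any VIA library.

```python
# Polling vs. blocking on a completion flag. Names are hypothetical;
# threading.Event stands in for a NIC completion notification.
import threading


class Completion:
    """Completion flag that a consumer can either poll or block on."""

    def __init__(self):
        self._event = threading.Event()
        self.polls = 0  # how many times the polling loop checked the flag

    def signal(self):
        self._event.set()

    def is_done(self):
        return self._event.is_set()

    def wait_polling(self):
        # Spin until done: reacts quickly, but occupies a CPU while waiting.
        while not self._event.is_set():
            self.polls += 1

    def wait_blocking(self, timeout=None):
        # Sleep in the OS until signaled: frees the CPU, adds wake-up cost.
        return self._event.wait(timeout)


if __name__ == "__main__":
    polled = Completion()
    threading.Timer(0.05, polled.signal).start()
    polled.wait_polling()
    print("polling checks while waiting:", polled.polls)

    blocked = Completion()
    threading.Timer(0.05, blocked.signal).start()
    print("blocking wait returned:", blocked.wait_blocking(timeout=5))
```

The `polls` counter makes the CPU cost of spinning visible, while the blocking variant does no work until the event is set.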
Conclusion • Low latency for small messages, which have a strong impact on application behavior • Significant improvement over UDP communication (does this still hold after recent TCP/UDP hardware implementations?) • At the expense of an awkward API