Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand
Design and Implementation of MPICH-2 over InfiniBand with RDMA Support (Liu, Jiang, Wyckoff, Panda, Ashton, Buntinas, Gropp, Toonen)
Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand (Tipparaju, Santhanaraman, Nieplocha, Panda)
Presented by Nikola Vouk. Advisor: Dr. Frank Mueller
Background: General Buffer Manipulation in Communication Protocols
InfiniBand • 7.6 microsecond latency • 857 MB/s peak bandwidth • Send/receive queue + work completion interface • Asynchronous calls • Remote Direct Memory Access (RDMA) • Sits between a shared-memory architecture and MPI • Not exactly NUMA, but close • Provides a channel interface (read/write) for communication • Each side registers memory that the other host can then access freely; registration provides memory protection
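The send/receive queue and work-completion interface on this slide maps onto the libibverbs API. A minimal sketch of posting a one-sided RDMA write and polling the completion queue, assuming the queue pair, completion queue, and memory region were created during setup and that the peer has shared its registered address and rkey; error handling is trimmed:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_async(struct ibv_qp *qp, struct ibv_cq *cq,
                     struct ibv_mr *mr, void *local_buf, size_t len,
                     uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;         /* peer's registered buffer */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))          /* returns immediately */
        return -1;

    /* The call is asynchronous: the completion shows up later on the CQ. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                         /* or overlap computation here */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```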
Common Problems • Link-layer/network protocol inefficiencies (unnecessary messages sent) • User-space to system-buffer copy overhead (copy time) • Synchronous sending/receiving and computing (the application has to stop in order to handle requests)
Problem 1: Message Passing Protocol The basic InfiniBand protocol requires three matching writes RDMA CHANNEL INTERFACE Put operation: • Copy user buffer to pre-registered buffer • RDMA write buffer to receiver • Adjust local head pointer • RDMA write new head pointer to receiver • Return bytes written Get operation: • Copy data from shared memory to user buffer • Adjust tail pointer • RDMA write new tail pointer to sender • Return bytes read
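A sketch of the Put side of this channel interface. The names (rdma_channel_t, channel_put, rdma_write) are illustrative, not the actual MPICH-2 symbols; rdma_write stands in for a one-sided write like the one sketched above:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: a one-sided write such as the ibv_post_send sketch above. */
extern int rdma_write(const void *local, size_t len, uint64_t remote_addr);

typedef struct {
    char    *ring;              /* pre-registered ring buffer */
    size_t   size;
    size_t   head;              /* advanced by the sender */
    size_t   tail;              /* written back by the receiver's Get */
    uint64_t remote_ring_addr;  /* receiver's mirror of the ring */
    uint64_t remote_head_addr;  /* receiver's copy of the head pointer */
} rdma_channel_t;

size_t channel_put(rdma_channel_t *ch, const void *user_buf, size_t nbytes)
{
    size_t free_space = (ch->tail + ch->size - ch->head - 1) % ch->size;
    size_t n = nbytes < free_space ? nbytes : free_space;
    if (n > ch->size - ch->head)            /* ignore wrap-around in this sketch */
        n = ch->size - ch->head;

    /* 1. Copy the user buffer into the pre-registered buffer. */
    memcpy(ch->ring + ch->head, user_buf, n);
    /* 2. RDMA-write the data into the receiver's ring. */
    rdma_write(ch->ring + ch->head, n, ch->remote_ring_addr + ch->head);
    /* 3. Adjust the local head pointer. */
    ch->head = (ch->head + n) % ch->size;
    /* 4. RDMA-write the new head pointer so the receiver notices the data. */
    rdma_write(&ch->head, sizeof ch->head, ch->remote_head_addr);

    return n;                               /* bytes written */
}
```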
Solutions: Piggybacking and Pipelining • Piggybacking: send the pointer update with the data packets • Pipelining: chop buffers into packet-sized chunks and send them out as the message comes in • An improvement, but still less than 870 MB/s
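Roughly how the two optimizations change the Put side, reusing the illustrative rdma_channel_t and rdma_write from the previous sketch; the header layout, chunk size, wrap-around, and flow control are simplified assumptions:

```c
/* Reuses rdma_channel_t and rdma_write() from the previous sketch. */
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 8192          /* illustrative packet size */

typedef struct {
    size_t new_head;             /* piggybacked pointer update */
    size_t payload_len;
    char   payload[CHUNK_SIZE];
} chunk_t;

void channel_put_pipelined(rdma_channel_t *ch, const char *user_buf, size_t nbytes)
{
    static chunk_t chunk;        /* would live in registered memory */
    for (size_t off = 0; off < nbytes; off += CHUNK_SIZE) {
        size_t n = nbytes - off < CHUNK_SIZE ? nbytes - off : CHUNK_SIZE;
        memcpy(chunk.payload, user_buf + off, n);       /* copy chunk i */
        chunk.payload_len = n;
        chunk.new_head    = (ch->head + n) % ch->size;  /* piggyback the update
                                                           instead of a separate
                                                           pointer write */
        rdma_write(&chunk, 2 * sizeof(size_t) + n,
                   ch->remote_ring_addr + ch->head);
        ch->head = chunk.new_head;                      /* in practice chunks
                                                           alternate buffers so
                                                           the copy of i+1 overlaps
                                                           the write of i */
    }
}
```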
Problem 2: Internal Buffer Copying Overhead. Solution: Zero-Copy Buffers • Internal overhead: the user must copy data into a system buffer (and into a registered memory slot) • Zero-copy lets the system read directly from the user buffer
Zero-Copy Protocol at Different Levels of the MPICH Hierarchy • If the packet is large enough… • Register the user buffer • Notify the end-host of the request • The end-host issues an RDMA read • Data is read directly from user buffer space
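A sketch of the sender's side of this rendezvous-style zero-copy path, using libibverbs registration; send_ctrl_msg and wait_for_done are placeholder control-channel calls, not real MPICH-2 functions:

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

struct rendezvous_req { uint64_t addr; uint32_t rkey; uint32_t len; };

/* Illustrative control channel (e.g. a small eager message). */
extern void send_ctrl_msg(const void *buf, size_t len);
extern void wait_for_done(void);

int send_large_zero_copy(struct ibv_pd *pd, void *user_buf, size_t len)
{
    /* 1. Register the user buffer itself -- no copy into a system buffer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, user_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        return -1;

    /* 2. Notify the receiving end-host where the data lives. */
    struct rendezvous_req req = {
        .addr = (uintptr_t)user_buf,
        .rkey = mr->rkey,
        .len  = (uint32_t)len,
    };
    send_ctrl_msg(&req, sizeof req);

    /* 3. The receiver issues an IBV_WR_RDMA_READ straight from user_buf,
     *    then sends a small "done" message so we can deregister. */
    wait_for_done();
    ibv_dereg_mr(mr);
    return 0;
}
```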
Comparing Interfaces: CH3 Interface vs. RDMA Interface • Implemented directly off of the CH3 interface • More flexible due to access to the complete ADI-3 interface • Always uses RDMA write
CH3 Implementation Performance A function of raw underlying performance
• Pipelining always performed the worst • RDMA channel within 1% of CH3
Problem 3: Too Much Overhead, Not Enough Execution • Unanswered problems: Registration overhead is still there, even in the cached version • Data transfer still requires significant cooperation from both sides (taking away from computation) • Non-contiguous data not addressed • Solutions: Provide a custom API that allocates out of large pre-registered memory chunks • Overlap communication with computation as much as possible • Apply zero-copy techniques using scatter/gather RDMA calls
Host-Assisted Zero-Copy Protocol • Host sends the receiver a request for a gather • Receiver posts a descriptor and continues working • Can be implemented as a "helper" thread on the receiving host • Same as the previous zero-copy idea, but extended to non-contiguous data
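A sketch of the receiving side under stated assumptions: the descriptor queue, dequeue_desc, and the helper thread are illustrative, but the scatter/gather RDMA read (one work request with several local SGEs) is the mechanism the slide describes:

```c
#include <infiniband/verbs.h>
#include <pthread.h>

#define MAX_SEG 16

struct noncontig_desc {
    struct ibv_sge sge[MAX_SEG];   /* local scatter list (e.g. strided array) */
    int            num_sge;
    uint64_t       remote_addr;    /* sender's registered, contiguous buffer */
    uint32_t       rkey;
};

/* Hypothetical queue filled by the main thread when it posts a descriptor. */
extern struct noncontig_desc *dequeue_desc(void);

/* Helper thread: drains descriptors so the main thread keeps computing. */
void *helper_thread(void *arg)
{
    struct ibv_qp *qp = arg;
    for (;;) {
        struct noncontig_desc *d = dequeue_desc();
        struct ibv_send_wr wr = {0}, *bad = NULL;
        wr.opcode              = IBV_WR_RDMA_READ;   /* pull from the sender */
        wr.sg_list             = d->sge;             /* scatter into the pieces */
        wr.num_sge             = d->num_sge;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = d->remote_addr;
        wr.wr.rdma.rkey        = d->rkey;
        ibv_post_send(qp, &wr, &bad);
        /* completion handling and notifying the sender omitted */
    }
    return NULL;
}
```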
NAS MG • Again, the pipelined method performs similarly to the zero-copy method
SUMMA Matrix Multiplication • Significant benefit from host-assisted zero-copy
Conclusions • Minimizing internal memory copying removes the primary memory performance obstacle • InfiniBand allows DMA that offloads work from the CPU; coordinating registered memory minimizes CPU involvement • With proper coding, existing MPI programs can achieve almost wire speed over InfiniBand • Could be implemented on other architectures (Gig-E, Myrinet)
Thesis Implications • Buddy MPICH is also a latency-hiding implementation of MPICH • Separation at the ADI layer: a buddy thread listens for connections and accepts work from the worker thread via send/receive queues
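Not Buddy MPICH's actual code, just a sketch of the structure described above: the worker enqueues communication requests and a buddy thread drains the queue, so communication overlaps computation (do_network_send is a placeholder for the real transport call):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder for the actual network send (e.g. an RDMA write as above). */
extern void do_network_send(void *buf, size_t len, int dest);

struct work_item { void *buf; size_t len; int dest; struct work_item *next; };

static struct work_item *queue_head;
static pthread_mutex_t   queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t    queue_cond = PTHREAD_COND_INITIALIZER;

/* Worker thread: enqueue a send request and return to computation. */
void worker_post_send(void *buf, size_t len, int dest)
{
    struct work_item *w = malloc(sizeof *w);
    *w = (struct work_item){ buf, len, dest, NULL };
    pthread_mutex_lock(&queue_lock);
    w->next = queue_head;            /* LIFO for brevity */
    queue_head = w;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

/* Buddy thread: drain the queue and drive the actual network operations. */
void *buddy_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (!queue_head)
            pthread_cond_wait(&queue_cond, &queue_lock);
        struct work_item *w = queue_head;
        queue_head = w->next;
        pthread_mutex_unlock(&queue_lock);

        do_network_send(w->buf, w->len, w->dest);
        free(w);
    }
    return NULL;
}
```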