Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand
Design and Implementation of MPICH-2 over InfiniBand with RDMA Support (Liu, Jiang, Wyckoff, Panda, Ashton, Buntinas, Gropp, Toonen)
Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand (Tipparaju, Santhanaraman, Nieplocha, Panda)
Presented by Nikola Vouk. Advisor: Dr. Frank Mueller
Background: General Buffer Manipulation in Communication Protocols
InfiniBand • 7.6 microsecond latency • 857 MB/s peak bandwidth • Send/receive queue + work completion interface • Asynchronous calls • Remote Direct Memory Access (RDMA) • Sits between a shared-memory architecture and MPI • Not exactly NUMA, but close • Provides a channel interface (read/write) for communication • Each side registers memory that the other host can then access freely; registration provides memory protection
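The send/receive queue and work-completion interface on this slide maps onto the libibverbs API. A minimal sketch of posting a one-sided RDMA write and polling the completion queue, assuming the queue pair, completion queue, and memory region were created during setup and that the peer has shared its registered address and rkey; error handling is trimmed:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rdma_write_async(struct ibv_qp *qp, struct ibv_cq *cq,
                     struct ibv_mr *mr, void *local_buf, size_t len,
                     uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;         /* peer's registered buffer */
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))          /* returns immediately */
        return -1;

    /* The call is asynchronous: the completion shows up later on the CQ. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                         /* or overlap computation here */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```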
Common Problems • Link-layer/network protocol inefficiencies (unnecessary messages sent) • User-space to system-buffer copy overhead (copy time) • Synchronous sending/receiving and computing (the application has to stop in order to handle requests)
Problem 1: Message Passing Protocol The basic InfiniBand protocol requires three matching writes RDMA CHANNEL INTERFACE Put operation: • Copy user buffer to pre-registered buffer • RDMA write buffer to receiver • Adjust local head pointer • RDMA write new head pointer to receiver • Return bytes written Get operation: • Copy data from shared memory to user buffer • Adjust tail pointer • RDMA write new tail pointer to sender • Return bytes read
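A sketch of the Put side of this channel interface. The names (rdma_channel_t, channel_put, rdma_write) are illustrative, not the actual MPICH-2 symbols; rdma_write stands in for a one-sided write like the one sketched above:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: a one-sided write such as the ibv_post_send sketch above. */
extern int rdma_write(const void *local, size_t len, uint64_t remote_addr);

typedef struct {
    char    *ring;              /* pre-registered ring buffer */
    size_t   size;
    size_t   head;              /* advanced by the sender */
    size_t   tail;              /* written back by the receiver's Get */
    uint64_t remote_ring_addr;  /* receiver's mirror of the ring */
    uint64_t remote_head_addr;  /* receiver's copy of the head pointer */
} rdma_channel_t;

size_t channel_put(rdma_channel_t *ch, const void *user_buf, size_t nbytes)
{
    size_t free_space = (ch->tail + ch->size - ch->head - 1) % ch->size;
    size_t n = nbytes < free_space ? nbytes : free_space;
    if (n > ch->size - ch->head)            /* ignore wrap-around in this sketch */
        n = ch->size - ch->head;

    /* 1. Copy the user buffer into the pre-registered buffer. */
    memcpy(ch->ring + ch->head, user_buf, n);
    /* 2. RDMA-write the data into the receiver's ring. */
    rdma_write(ch->ring + ch->head, n, ch->remote_ring_addr + ch->head);
    /* 3. Adjust the local head pointer. */
    ch->head = (ch->head + n) % ch->size;
    /* 4. RDMA-write the new head pointer so the receiver notices the data. */
    rdma_write(&ch->head, sizeof ch->head, ch->remote_head_addr);

    return n;                               /* bytes written */
}
```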
Solutions: Piggybacking and Pipelining • Piggybacking: send the pointer update with the data packets • Pipelining: chop buffers into packet-sized chunks and send them out as the message comes in • An improvement, but still less than 870 MB/s
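Roughly how the two optimizations change the Put side, reusing the illustrative rdma_channel_t and rdma_write from the previous sketch; the header layout, chunk size, wrap-around, and flow control are simplified assumptions:

```c
/* Reuses rdma_channel_t and rdma_write() from the previous sketch. */
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 8192          /* illustrative packet size */

typedef struct {
    size_t new_head;             /* piggybacked pointer update */
    size_t payload_len;
    char   payload[CHUNK_SIZE];
} chunk_t;

void channel_put_pipelined(rdma_channel_t *ch, const char *user_buf, size_t nbytes)
{
    static chunk_t chunk;        /* would live in registered memory */
    for (size_t off = 0; off < nbytes; off += CHUNK_SIZE) {
        size_t n = nbytes - off < CHUNK_SIZE ? nbytes - off : CHUNK_SIZE;
        memcpy(chunk.payload, user_buf + off, n);       /* copy chunk i */
        chunk.payload_len = n;
        chunk.new_head    = (ch->head + n) % ch->size;  /* piggyback the update
                                                           instead of a separate
                                                           pointer write */
        rdma_write(&chunk, 2 * sizeof(size_t) + n,
                   ch->remote_ring_addr + ch->head);
        ch->head = chunk.new_head;                      /* in practice chunks
                                                           alternate buffers so
                                                           the copy of i+1 overlaps
                                                           the write of i */
    }
}
```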
Problem 2: Internal Buffer Copying Overhead. Solution: Zero-Copy Buffers • Internal overhead: the user must copy data into a system buffer (and into a registered memory slot) • Zero-copy lets the system read directly from the user buffer
Zero-Copy Protocol at Different Levels of the MPICH Hierarchy • If the packet is large enough… • Register the user buffer • Notify the end-host of the request • The end-host issues an RDMA read • Data is read directly from user buffer space
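A sketch of the sender's side of this rendezvous-style zero-copy path, using libibverbs registration; send_ctrl_msg and wait_for_done are placeholder control-channel calls, not real MPICH-2 functions:

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

struct rendezvous_req { uint64_t addr; uint32_t rkey; uint32_t len; };

/* Illustrative control channel (e.g. a small eager message). */
extern void send_ctrl_msg(const void *buf, size_t len);
extern void wait_for_done(void);

int send_large_zero_copy(struct ibv_pd *pd, void *user_buf, size_t len)
{
    /* 1. Register the user buffer itself -- no copy into a system buffer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, user_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        return -1;

    /* 2. Notify the receiving end-host where the data lives. */
    struct rendezvous_req req = {
        .addr = (uintptr_t)user_buf,
        .rkey = mr->rkey,
        .len  = (uint32_t)len,
    };
    send_ctrl_msg(&req, sizeof req);

    /* 3. The receiver issues an IBV_WR_RDMA_READ straight from user_buf,
     *    then sends a small "done" message so we can deregister. */
    wait_for_done();
    ibv_dereg_mr(mr);
    return 0;
}
```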
Comparing Interfaces: CH3 Interface vs. RDMA Interface • Implemented directly off of the CH3 interface • More flexible due to access to the complete ADI-3 interface • Always uses RDMA write
CH3 Implementation Performance A function of raw underlying performance
• Pipelining always performed the worst • RDMA channel within 1% of CH3
Problem 3: Too Much Overhead, Not Enough Execution • Unanswered problems: Registration overhead is still there, even in the cached version • Data transfer still requires significant cooperation from both sides (taking away from computation) • Non-contiguous data not addressed • Solutions: Provide a custom API that allocates out of large pre-registered memory chunks • Overlap communication with computation as much as possible • Apply zero-copy techniques using scatter/gather RDMA calls
Host-Assisted Zero-Copy Protocol • Host sends the receiver a request for a gather • Receiver posts a descriptor and continues working • Can be implemented as a "helper" thread on the receiving host • Same as the previous zero-copy idea, but extended to non-contiguous data
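A sketch of the receiving side under stated assumptions: the descriptor queue, dequeue_desc, and the helper thread are illustrative, but the scatter/gather RDMA read (one work request with several local SGEs) is the mechanism the slide describes:

```c
#include <infiniband/verbs.h>
#include <pthread.h>

#define MAX_SEG 16

struct noncontig_desc {
    struct ibv_sge sge[MAX_SEG];   /* local scatter list (e.g. strided array) */
    int            num_sge;
    uint64_t       remote_addr;    /* sender's registered, contiguous buffer */
    uint32_t       rkey;
};

/* Hypothetical queue filled by the main thread when it posts a descriptor. */
extern struct noncontig_desc *dequeue_desc(void);

/* Helper thread: drains descriptors so the main thread keeps computing. */
void *helper_thread(void *arg)
{
    struct ibv_qp *qp = arg;
    for (;;) {
        struct noncontig_desc *d = dequeue_desc();
        struct ibv_send_wr wr = {0}, *bad = NULL;
        wr.opcode              = IBV_WR_RDMA_READ;   /* pull from the sender */
        wr.sg_list             = d->sge;             /* scatter into the pieces */
        wr.num_sge             = d->num_sge;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = d->remote_addr;
        wr.wr.rdma.rkey        = d->rkey;
        ibv_post_send(qp, &wr, &bad);
        /* completion handling and notifying the sender omitted */
    }
    return NULL;
}
```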
NAS MG • Again, the pipelined method performs similarly to the zero-copy method
SUMMA Matrix Multiplication • Significant benefit from host-assisted zero-copy
Conclusions • Minimizing internal memory copying removes the primary memory performance obstacle • InfiniBand allows DMA that offloads work from the CPU; coordinating registered memory minimizes CPU involvement • With proper coding, existing MPI programs can achieve almost wire speed over InfiniBand • Could be implemented on other architectures (Gig-E, Myrinet)
Thesis Implications • Buddy MPICH is also a latency-hiding implementation of MPICH • Separation at the ADI layer: a buddy thread listens for connections and accepts work from the worker thread via send/receive queues
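Not Buddy MPICH's actual code, just a sketch of the structure described above: the worker enqueues communication requests and a buddy thread drains the queue, so communication overlaps computation (do_network_send is a placeholder for the real transport call):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder for the actual network send (e.g. an RDMA write as above). */
extern void do_network_send(void *buf, size_t len, int dest);

struct work_item { void *buf; size_t len; int dest; struct work_item *next; };

static struct work_item *queue_head;
static pthread_mutex_t   queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t    queue_cond = PTHREAD_COND_INITIALIZER;

/* Worker thread: enqueue a send request and return to computation. */
void worker_post_send(void *buf, size_t len, int dest)
{
    struct work_item *w = malloc(sizeof *w);
    *w = (struct work_item){ buf, len, dest, NULL };
    pthread_mutex_lock(&queue_lock);
    w->next = queue_head;            /* LIFO for brevity */
    queue_head = w;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

/* Buddy thread: drain the queue and drive the actual network operations. */
void *buddy_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (!queue_head)
            pthread_cond_wait(&queue_cond, &queue_lock);
        struct work_item *w = queue_head;
        queue_head = w->next;
        pthread_mutex_unlock(&queue_lock);

        do_network_send(w->buf, w->len, w->dest);
        free(w);
    }
    return NULL;
}
```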