Low Latency Messaging Over Gigabit Ethernet
Keith Fenech
CSAW, 24 September 2004
Why Cluster Computing?
• Ideal for computationally intensive applications.
• Multi-threaded processes allow jobs to be processed in parallel over multiple CPUs.
• High bandwidth allows interconnected nodes to achieve supercomputer performance.
• Networks of Workstations (NOWs) [1]
  • Easily available (commodity platforms)
  • Relatively cheap
  • Nodes may be used independently or as a cluster
  • Better utilization of idle computing resources
CSAW '04
High Performance Networking
• Commodity networks are dominated by IP over Ethernet.
• Performance is directly affected by:
  • Hardware – bus & network bandwidths
  • Latency – delay incurred in communicating a message from source to destination
  • Overhead – length of time the processor is engaged in transmitting/receiving each message
• Fine-grained threads communicate frequently using small messages.
• A high-performance communication architecture should feature:
  • transparency to the application layer
  • high throughput for bandwidth-intensive applications
  • low latencies for frequently communicating threads
  • minimal protocol-processing overhead on the host machine
• Gigabit performance is not achievable at the application layer. Why?
Conventional NICs & Protocols
Receiver node:
• Ethernet controller receives the frame
• Check the frame CRC
• Filter on the destination MAC address
• NIC generates a hardware interrupt to notify the host
• PCI transfer to host memory
• CPU suspends its current task & launches the interrupt handler to service the high-priority interrupt
• Check the network-layer (IP) header & verify its checksum
• Parse routing tables & store valid IP datagrams in the IP buffer
• Reassemble fragmented datagrams in host memory
• Call transport-layer (TCP/UDP) functions
• Deliver the packet to the application layer
Problems with Conventional Protocols & Architectures
• The NIC generates a CPU interrupt for each frame.
• Servicing interrupts involves an expensive vertical switch into kernel space.
• Software interrupts are used to pass IP datagrams to the upper layers.
• Servicing incoming packets results in a high host-CPU load.
• Risk of receiver-livelock scenarios (as in denial-of-service attacks).
• PCI bus startup overheads are incurred for each message.
• Layered protocols imply expensive memory-to-memory buffer copies.
Available Techniques
• Bypass the kernel for critical data paths
  • Buffer & protocol processing moved to user space
  • User-level hardware access
• Zero-copy techniques
• Scatter/gather techniques
• Larger MTUs (jumbo frames)
• Larger DMA transfers to avoid PCI startup overheads
• Interrupt coalescing
• Message descriptors & polling replace interrupts
Current Solutions
Enabled by programmable NICs:
• Virtual Interface Architecture (VIA) [2]
• U-Net [3] (ATM)
• Myrinet GM [4] and Illinois FM [5] (Myrinet)
• QsNet [6] (Quadrics)
• EMP [7] (Ethernet)
Our Proposal
• NOWs running over Gigabit Ethernet
• Use the programmable features of the Tigon2 NIC (onboard CPU, memory, DMA)
• Design a reliable, lightweight communication protocol for Gigabit Ethernet
  • Reliable network (ordered & lossless packet delivery)
  • Low overhead
  • Low latency
• Offload protocol processing from the host CPU onto the NIC CPU
• Interrupt-free architecture (message descriptor queues + polling)
• OS bypass: user applications & NIC hardware communicate through pinned-down shared memory
• Zero copy
• Dynamic MTUs & DMA sizes to reduce PCI startup overheads
• Tackle two application scenarios:
  • Small messages – latency is critical
  • Large transfers – throughput is critical
Conclusion
• Provide a high-performance communication API
  • Replace PVM [9] & MPI [8] protocols
  • Fine-grained thread communication
  • High-bandwidth applications
• Remove the network-communication bottleneck in user-level thread messaging
• Interface with SMASH [10], a user-level thread scheduler
  • Multi-threaded applications can run seamlessly over a cluster of SMPs
• Achieve higher throughput with minimal usage of host-CPU resources
References
1. D. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong. Parallel Computing on the Berkeley NOW. In Ninth Joint Symposium on Parallel Processing, 1997.
2. Compaq, Intel, and Microsoft. Virtual Interface Architecture Specification, draft revision 1.0, December 1997.
3. T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 40–53. ACM Press, 1995.
4. Myricom Inc. Myrinet GM – The Low-Level Message-Passing System for Myrinet Networks.
5. Scott Pakin, Mario Lauria, and Andrew Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. 1995.
6. Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, and Eitan Frachtenberg. The Quadrics Network (QsNet): High-Performance Clustering Technology. In Hot Interconnects 9, Stanford University, Palo Alto, CA, August 2001.
7. Piyush Shivam, Pete Wyckoff, and Dhabaleswar Panda. EMP: Zero-Copy OS-Bypass NIC-Driven Gigabit Ethernet Message Passing. 2001.
8. Message Passing Interface Forum. MPI-2: A Message Passing Interface Standard. International Journal of High Performance Computing Applications, 12(1–2):1–299, 1998.
9. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine – A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge, Mass., 1994.
10. Kurt Debattista. High Performance Thread Scheduling on Shared Memory Multiprocessors. Master's thesis, University of Malta, 2001.
Thank you!