Low Latency Messaging Over Gigabit Ethernet
Keith Fenech
CSAW, 24 September 2004
Why Cluster Computing?
• Ideal for computationally intensive applications.
• Multi-threaded processes allow jobs to be processed in parallel over multiple CPUs.
• High bandwidth allows interconnected nodes to achieve supercomputer performance.
• Networks of Workstations (NOWs) [1]
  • Easily available (commodity platforms)
  • Relatively cheap
  • Nodes may be used independently or as a cluster
  • Better utilization of idle computing resources
CSAW '04
High Performance Networking
• Commodity networks are dominated by IP over Ethernet.
• Performance is directly affected by:
  • Hardware – bus & network bandwidths
  • Latency – delay incurred in communicating a message from source to destination
  • Overhead – length of time the processor is engaged in transmitting/receiving each message
• Fine-grained threads communicate frequently using small messages.
• A high-performance communication architecture should feature:
  • transparency to the application layer
  • high throughput for bandwidth-intensive applications
  • low latencies for frequently communicating threads
  • minimal protocol-processing overhead on the host machine
• Gigabit performance is not achievable at the application layer. Why?
Conventional NICs & Protocols
Receiver node:
• Ethernet controller receives the frame
• Check the frame CRC
• Filter on the destination MAC address
• NIC generates a hardware interrupt to notify the host
• PCI transfer to host memory
• CPU suspends its current task & launches the interrupt handler to service the high-priority interrupt
• Check the network-layer (IP) header & verify its checksum
• Parse routing tables & store valid IP datagrams in the IP buffer
• Reassemble fragmented datagrams in host memory
• Call transport-layer (TCP/UDP) functions
• Deliver the packet to the application layer
Problems with Conventional Protocols & Architectures
• The NIC generates a CPU interrupt for each frame.
• Servicing interrupts involves an expensive vertical switch into kernel space.
• Software interrupts are used to pass IP datagrams to the upper layers.
• Servicing incoming packets results in a high host-CPU load.
• Risk of receiver-livelock scenarios (as in denial-of-service attacks).
• PCI bus startup overheads are incurred for each message.
• Layered protocols imply expensive memory-to-memory buffer copies.
Available Techniques
• Bypass the kernel for critical data paths
  • Buffer & protocol processing moved to user space
  • User-level hardware access
• Zero-copy techniques
• Scatter/gather techniques
• Larger MTUs (jumbo frames)
• Larger DMA transfers to avoid PCI startup overheads
• Interrupt coalescing
• Message descriptors & polling replace interrupts
Current Solutions
Enabled by programmable NICs:
• Virtual Interface Architecture (VIA) [2]
• U-Net [3] (ATM)
• Myrinet GM [4] and Illinois FM [5] (Myrinet)
• QsNet [6] (Quadrics)
• EMP [7] (Ethernet)
Our Proposal
• NOWs running over Gigabit Ethernet
• Use the programmable features of the Tigon2 NIC (onboard CPU, memory, DMA)
• Design a reliable, lightweight communication protocol for Gigabit Ethernet
  • Reliable network (ordered & lossless packet delivery)
  • Low overhead
  • Low latency
• Offload protocol processing from the host CPU onto the NIC CPU
• Interrupt-free architecture (message descriptor queues + polling)
• OS bypass: user applications & NIC hardware communicate through pinned-down shared memory
• Zero copy
• Dynamic MTUs & DMA sizes to reduce PCI startup overheads
• Tackle two application scenarios:
  • Small messages – latency is critical
  • Large transfers – throughput is critical
Conclusion
• Provide a high-performance communication API
  • Replace PVM [9] & MPI [8] protocols
  • Fine-grained thread communication
  • High-bandwidth applications
• Remove the network-communication bottleneck in user-level thread messaging
• Interface with SMASH [10], a user-level thread scheduler
  • Multi-threaded applications can run seamlessly over a cluster of SMPs
• Achieve higher throughput with minimal usage of host-CPU resources
References
1. D. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong. Parallel Computing on the Berkeley NOW. In Ninth Joint Symposium on Parallel Processing, 1997.
2. Compaq, Intel, and Microsoft. Virtual Interface Architecture Specification, draft revision 1.0, December 1997.
3. T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 40–53. ACM Press, 1995.
4. Myricom Inc. Myrinet GM – The Low-Level Message-Passing System for Myrinet Networks.
5. Scott Pakin, Mario Lauria, and Andrew Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. 1995.
6. Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, and Eitan Frachtenberg. The Quadrics Network (QsNet): High-Performance Clustering Technology. In Hot Interconnects 9, Stanford University, Palo Alto, CA, August 2001.
7. Piyush Shivam, Pete Wyckoff, and Dhabaleswar Panda. EMP: Zero-Copy OS-Bypass NIC-Driven Gigabit Ethernet Message Passing. 2001.
8. Message Passing Interface Forum. MPI-2: A Message Passing Interface Standard. International Journal of High Performance Computing Applications, 12(1–2):1–299, 1998.
9. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine – A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge, Mass., 1994.
10. Kurt Debattista. High Performance Thread Scheduling on Shared Memory Multiprocessors. Master's thesis, University of Malta, 2001.
Thank you!