Comparative Performance Analysis of RDMA-Enhanced Ethernet Casey B. Reardon and Alan D. George HCS Research Laboratory University of Florida Gainesville, FL Clement T. Cole Ammasso Inc. Boston, MA
Outline • Paper Overview • Background • RDMA and iWARP • AMSO 1100 • Benchmarks and Configuration • Experimental Testbed Overview • MPI Benchmarks Overview • Experiments and Results • Pallas Results • Gromacs Results • NPB Results • Analysis • Conclusions
Paper Overview • RDMA-enhanced Ethernet provides a new alternative for HPC networking • This work provides a performance comparison of a first-generation RDMA Ethernet implementation against two existing, widely used network technologies: • Ethernet, the most common interconnect in distributed computing • InfiniBand, an established high-performance RDMA network technology • Performance is analyzed using a range of parallel benchmarks, from low-level tests to full applications • All benchmarks are based on MPI [1]
Background – RDMA and iWARP • RDMA permits data to be moved from the memory subsystem of one system on the network to another with limited involvement from either host CPU • New standards (e.g., iWARP) enable RDMA to be implemented using Ethernet as the physical/data-link layers and TCP/IP as the transport • Combines the performance and latency advantages of RDMA while maintaining the cost, maturity, and standardization benefits of Ethernet and TCP/IP (a minimal one-sided MPI sketch of this model follows below) • [Figures: RDMA Concept; iWARP Layers]
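To make the one-sided model concrete, the sketch below uses standard MPI-2 RMA calls (MPI_Win_create / MPI_Put) as an analogue of RDMA semantics: data lands directly in the target's exposed memory without a matching receive call. This is an illustrative sketch only, not the Ammasso verbs API or the iWARP wire protocol, and the two-rank layout is assumed for brevity.

```c
/*
 * Illustrative sketch: MPI-2 one-sided communication as an analogue of
 * the RDMA model described above. Not the iWARP or Ammasso verbs API.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, peer;
    double local = 0.0, payload = 42.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                 /* assumes exactly 2 ranks */

    /* Expose one double of local memory for remote access. */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        /* One-sided write: the target CPU does not post a receive. */
        MPI_Put(&payload, 1, MPI_DOUBLE, peer, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 received %.1f via one-sided put\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```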
Background – AMSO 1100 • RDMA-capable Ethernet NIC from Ammasso, Inc. [6] • Implements the iWARP, user- and kernel-level DAPL, and MPI specifications [5, 2, 1] • Hardware interface extensions developed by UC Berkeley and Ohio State to call an IB-based RDMA implementation were modified to call the Ammasso iWARP RDMA verbs • [Figure: Block diagram of the AMSO 1100 RNIC – Ethernet MAC, TCP/IP, RDMA protocol, direct data placement, and payload virtual-address translation handled on the RNIC protocol chip, delivering data into application memory]
Experimental Testbed Overview • Cluster of 16 Server Nodes, each with: • Dual AMD Opteron 240 processors (only one used) • Tyan Thunder K8S motherboard w/ PCI-X bus slots • 1 GB memory • Suse Linux 9 • Conventional Gigabit Ethernet Network • On-board Broadcom PCI-X BCM5704C controller • Nortel Baystack-5510 GigE switch • LAM 7.1.1 MPI implementation [10] • RDMA-Based Gigabit Ethernet Network • AMSO 1100 PCI-X RNIC • Nortel Baystack-5510 GigE switch • Ammasso-distributed implementation of MPICH-1.2.5 • 4X InfiniBand Network • Voltaire 400LP Host Channel Adapter • Voltaire 9024 InfiniBand Switch Router • MVAPICH-0.9.5 MPI from Ohio State University [11]
MPI Benchmarks Overview • All experiments conducted at the University of Florida's HCS Laboratory • Selected tests from three MPI benchmark suites • PMB, Gromacs, NPB • Pallas MPI Benchmarks (PMB) [7] • Includes a variety of raw, low-level MPI communication tests • Two tests included from this suite: PingPong and SendRecv • PingPong • Reports one-way latency as half of the measured round-trip time • Throughput derived by dividing message size by latency • SendRecv • Forms a periodic chain (ring) of participating nodes • Nodes simultaneously send messages to the next node in the chain while receiving from the previous node • Reported latency is the time for all nodes to complete a send • Throughput derived by dividing message size by latency • A minimal sketch of the PingPong methodology appears below
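The sketch below mirrors the PingPong methodology described above (it is not the Pallas source): one-way latency is taken as half of the averaged round-trip time, and throughput as message size divided by that latency. MSG_SIZE and REPS are illustrative parameters; PMB sweeps message sizes automatically.

```c
/* Minimal ping-pong timing sketch in the spirit of the PMB PingPong test. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE 1024   /* bytes; fixed here for brevity */
#define REPS     1000

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MSG_SIZE);
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* assumes exactly 2 ranks */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, MSG_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / REPS;   /* average round-trip time */

    if (rank == 0) {
        double latency = rtt / 2.0;                    /* one-way, seconds */
        double mbps    = MSG_SIZE / latency / 1.0e6;   /* MB/s */
        printf("latency = %.2f us, throughput = %.1f MB/s\n",
               latency * 1e6, mbps);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```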
MPI Benchmarks Overview • Gromacs [8] • Developed at Groningen University's Department of Biophysical Chemistry • Simulates molecular-dynamics systems • Very latency-sensitive application • Simulations consist of numerous incremental time-steps • Synchronization before each time-step stresses network latency • Three systems considered: Villin, DPPC, and LZM • These benchmarks are distributed by the developers of Gromacs • Results are reported as the amount of simulated time the system can complete in one day of wall-clock time (see the sketch below) • NAS Parallel Benchmarks (NPB) [9] • Wide array of scientific applications • Two tests selected here: integer sort (IS) and conjugate gradient (CG) • Class B data sizes used in all tests • [Image courtesy of [8]]
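Since the Gromacs figures are reported as simulated time per day of wall clock, the trivial sketch below shows how that metric is derived from a run's wall-clock time. The values are placeholders for illustration, not measurements from this study.

```c
/* Sketch of the reported Gromacs metric: simulated time per wall-clock day. */
#include <stdio.h>

int main(void)
{
    double simulated_ns = 1.0;      /* simulated dynamics completed (placeholder) */
    double wall_seconds = 3600.0;   /* wall-clock time of the run (placeholder)   */

    /* Scale to one day: ns/day = ns completed * (86400 s / wall-clock s). */
    double ns_per_day = simulated_ns * (86400.0 / wall_seconds);
    printf("performance: %.1f ns/day\n", ns_per_day);
    return 0;
}
```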
Experimental Results: PMB - PingPong • [Graphs: Throughput and Latency] • Latencies as low as 6 μs for IB using small messages • RDMA Ethernet showed latencies over 15 μs lower than conventional Ethernet for small messages
Experimental Results: PMB - SendRecv • [Graphs: Throughput and Latency] • Results similar to those seen in the PingPong tests • IB offers an even larger advantage in latency and throughput over the two Ethernet networks • RDMA Ethernet latencies are 20 μs lower than conventional Ethernet's
Experimental Results: PMB • IB also offers the highest throughput, as expected • 4X IB delivers approximately ten times the throughput of both Ethernet networks • Graphs become more skewed as message size grows, so only messages up to 1 KB are shown here • The 20 μs latencies seen with RDMA are extremely low for Ethernet • This leads directly to the higher throughputs observed with RDMA • Larger performance disparities are seen in the SendRecv results • SendRecv stresses network performance more than PingPong • Note: all numbers measured here are MPI latencies • These can be highly dependent on the MPI implementation
Experimental Results: Gromacs – Villin • IB scales better than either Ethernet network in the Villin system • Villin is the least computationally intensive of the three Gromacs systems, leading to more frequent synchronizations • As system size grows, time-steps (and thus messaging) become more frequent • IB is best equipped to handle the bandwidth required by this traffic, but RDMA Ethernet still surpasses conventional Ethernet • For system sizes up to 8 nodes, RDMA performs midway between IB and conventional Ethernet • The performance gap between conventional and RDMA Ethernet increases for system sizes less than 16 nodes
Experimental Results: Gromacs - LZM • RDMA Ethernet scales fairly closely with IB for all system sizes • Performance is almost identical until the system size reaches or exceeds 8 nodes • With a 12-node system, RDMA offers a 25% improvement over conventional Ethernet while trailing IB by only 10% • Note: the LZM benchmark cannot scale beyond 12 nodes
Experimental Results: Gromacs - DPPC • DPPC exhibits the least performance variance between interconnects of the three Gromacs systems • Overall performance trends are similar to those from the LZM system • In all cases here, RDMA Ethernet performs closer to IB than to conventional Ethernet • Performance of IB and RDMA Ethernet is almost identical for systems with fewer than 8 nodes • Less than 10% performance difference at 16 nodes
Experimental Results: NPB - IS • Integer sort (IS) appears to be more network-sensitive than the other tests in the NPB suite • Uses a high proportion of large messages • RDMA Ethernet offers a significant improvement over conventional Ethernet in all cases in this study • RDMA Ethernet scales more closely with IB than conventional Ethernet does
Experimental Results: NPB - CG • Memory accesses are highly prevalent in CG, causing super-linear speedups to be observed • Conventional and RDMA Ethernet provide almost identical performance for all system sizes • IB provides a ~30% increase in performance over both Ethernet networks at intermediate system sizes, but is nearly identical at the largest size considered here
Analysis • As expected, IB provided the highest performance in all tests • IB offers higher bandwidth and lower latency than both Gigabit Ethernet networks considered here • RDMA Ethernet showed varying performance gains over conventional Ethernet in every scenario • Gains were seen in both low-level and application-level experiments • For latency-sensitive applications, this step-up in performance is significant • Offloading of network processing provides additional advantages for RDMA networks • Results from tests such as LZM and DPPC in Gromacs, where IB and RDMA Ethernet scale comparably and significantly better than conventional Ethernet, suggest that the RNIC provides processor offloading on par with IB • Detailed effects of RDMA processor offloading are left for future research • Despite lower latencies and processor offloading, RDMA cannot offer significant speedup for all applications, as expected • e.g., the results from CG in NPB
Analysis • RDMA Ethernet can be a cost-effective alternative to expensive high-performance networks such as InfiniBand for latency-sensitive applications • An RNIC such as the AMSO 1100 featured here is expected to sell for far less than typical IB HCAs • In addition, per-port costs for Ethernet switches are a fraction of those for IB switches • RDMA Ethernet can use existing Ethernet switching infrastructure, further cutting implementation costs • Our study focused only on performance with a modest-sized cluster • The superior bandwidth offered by IB may provide better scalability to larger systems • As system size grows, so too will the cost disparity between IB and RDMA Ethernet implementations • The high number of additional switch ports needed to accommodate large systems may make an IB implementation infeasible for systems with limited budgets • Performance analyses with larger clusters are left for future research
Conclusions • As the results in this study show, network interconnects can have a major impact on the performance of distributed and parallel systems • IB still provides the best overall performance in all applications among the networks considered, as expected • RDMA Ethernet showed significant improvement over conventional Ethernet in many applications • In some cases, the performance of RDMA Ethernet approaches that of IB • RDMA and iWARP offer an attractive technology with significant potential for achieving increasingly high performance at low cost • Savings become even greater when a user can leverage existing Ethernet infrastructure • One can foresee RDMA-capable Ethernet provided by default on all servers and operating with iWARP at data rates of 10 Gb/s and beyond
References [1] MPI: A Message Passing Interface Standard v1.1, The MPI Forum, http://www.mpi-forum.org/docs/mpi-11.ps. [2] User-Level Direct Access Transport APIs (uDAPL v1.2), DAT Collaborative, http://www.datcollaborative.org/udapl.html. [3] T. Talpey, NFS/RDMA, IETF NFSv4 Interim WG meeting, June 4, 2003, http://ietf.cnri.reston.va.us/proceedings/03jul/slides/nfsv4-1.pdf. [4] S. Bailey and T. Talpey, The Architecture of Direct Data Placement (DDP) and Remote Direct Memory Access (RDMA) on Internet Protocols, IETF Internet-Draft, Feb. 2, 2005, http://ietfreport.isoc.org/ids/draft-ietf-rddp-arch-07.txt. [6] Ammasso AMSO 1100, Ammasso Inc., http://www.ammasso.com/Ammasso_1100_Datasheet.pdf. [7] Pallas MPI Benchmarks – PMB, Part 1, Pallas GmbH, http://www.pallas.com.
References [8] Gromacs: The World's Fastest Molecular Dynamics, Dept. of Biophysical Chemistry, Groningen University, http://www.gromacs.org. [9] NAS Parallel Benchmarks, NASA Advanced Supercomputing Division, http://www.nas.nasa.gov/Software/NPB/. [10] LAM/MPI: Enabling Efficient and Productive MPI Development, Indiana University, Bloomington, http://www.lam-mpi.org/. [11] MVAPICH: MPI for InfiniBand over VAPI Layer, Network-Based Computing Lab, Ohio State University, June 2003, http://nowlab-cis.ohio-state.edu/projects/mpi-iba/. [12] J. Boisseau, L. Carter, K. Gatlin, A. Majumdar, and A. Snavely, NAS Benchmarks on the Tera MTA, Proc. of Workshop on Multi-Threaded Execution, Architecture, and Compilers, Las Vegas, NV, February 1-4, 1998.