Experimental Evaluation of Infiniband as Local- and Wide-Area Transport
Steven Carter, Makia Minich, Nageswara Rao
Oak Ridge National Laboratory
{scarter,minich,rao}@ornl.gov
Motivation • The Department of Energy established the Leadership Computing Facility at ORNL's Center for Computational Sciences to field a 1PF supercomputer • The design chosen, the Cray XT series, includes an internal Lustre filesystem capable of sustaining reads and writes of 240GB/s • The problem with making the filesystem part of the machine is that it limits the flexibility of the Lustre filesystem and increases the complexity of the Cray • The problem with decoupling the filesystem from the machine is the high cost involved in connecting it via 10GE at the required speeds
Solution – Infiniband LAN/WAN • The Good: • Cool name • Unified fabric/IO virtualization (i.e. low-latency interconnect, storage, and IP on one wire) • Faster link speeds (4x DDR = 16Gb/s, 4x QDR = 32Gb/s) • HCA does much of the heavy lifting • Nearly 10x lower cost for similar bandwidth • Higher port density switches • The Bad: • IB sounds too much like IP • Lacks some of the accoutrements of IP/Ethernet (e.g. firewalls, routers, sniffers; the latter exist but are hard to come by) • Cray does not support Infiniband (we can fix that)
CCS network roadmap summary • Ethernet core scaled to match wide-area connectivity and archive [O(10GB/s)] • Infiniband core scaled to match central file system and data transfer [O(100GB/s)] [Diagram: Jaguar, Baker, Lustre, Viz, High-Performance Storage System (HPSS), and gateway systems attached to the Ethernet and Infiniband cores]
CCS Network 2007 [Diagram: Infiniband and Ethernet link counts (SDR/DDR) connecting Jaguar, Spider10, Spider60, Viz, HPSS, Devel, and E2E systems]
CCS IB network 2008 [Diagram: planned Infiniband link counts (SDR/DDR) connecting Jaguar, Baker, Spider10, Spider240, Devel, E2E, Viz, and HPSS]
Porting IB to the Cray XT3 • PCI-X HCA required (PCIe should be available in the PF machine) • IB is not a standard option on the Cray XT3. Although the XT3's service nodes are based on SuSE Linux Enterprise 9, Cray kernel modifications make the kernel incompatible with the stock version of OFED. • In order to compile OFED on the XT3 service node, the symbols bad_dma_address and dev_change_flags need to be exported from the I/O node's kernel (see the sketch below). • Furthermore, the OFED source code needs to be modified to recognize the particular version of the kernel run by the XT3's I/O nodes.
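As a rough sketch of what the kernel side of the port involves (file locations are assumptions based on a stock 2.6-era x86_64 tree; Cray's kernel source may place these symbols elsewhere), the missing exports amount to one-line EXPORT_SYMBOL additions that the OFED modules then resolve at load time:

```c
/* Sketch only: the symbol exports OFED needs from the XT3 service-node
 * kernel. File locations are assumptions; Cray's sources may differ. */

/* arch/x86_64/kernel/pci-dma.c (or wherever bad_dma_address is defined) */
EXPORT_SYMBOL(bad_dma_address);   /* exported so the OFED kernel modules can resolve it */

/* net/core/dev.c */
EXPORT_SYMBOL(dev_change_flags);  /* exported so the OFED kernel modules can resolve it */
```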
XT3 IB LAN Testing [Diagram: Rizzo (XT3) and Spider (Linux cluster) connected through a Voltaire 9024 switch] Utilizing a server on Spider (a commodity x86_64 Linux cluster), we were able to show the first successful Infiniband implementation on the XT3. The basic RDMA test (as provided by the Open Fabrics Enterprise Distribution) allows us to see the maximum bandwidth that we could achieve (about 900MB/s unidirectionally) from the XT3's 133MHz PCI-X bus.
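For context, the kind of verbs-level sanity check one runs before the OFED bandwidth tests is sketched below; it simply enumerates the HCAs and reports link state through libibverbs. The port number (1) is an assumption; multi-port HCAs may differ.

```c
/* Minimal sketch: enumerate HCAs and report link state via libibverbs.
 * Build with: gcc -std=gnu99 -o ibcheck ibcheck.c -libverbs
 * Port 1 is an assumption. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx)
            continue;

        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: port 1 %s, width=%d, speed=%d\n",
                   ibv_get_device_name(list[i]),
                   port.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active",
                   port.active_width, port.active_speed);

        ibv_close_device(ctx);
    }

    ibv_free_device_list(list);
    return 0;
}
```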
Verb Level LAN Testing (XT3) • UC/RC: 4.4 Gb/s • IPoIB: 2.2 Gb/s • SDP: 3.6 Gb/s
Verb Level LAN Testing (generic x86_64) • RC/UC: 7.5 Gb/s • IPoIB: 2.4 Gb/s • SDP: 4.6 Gb/s
LAN Observations • XT3's performance is good (better than 10GE) for RDMA • XT3's performance is not as good for verb-level UC/UD/RC tests • XT3's IPoIB/SDP performance is worse than normal • XT3's poor performance might be a result of the PCI-X HCA (known to be sub-optimal) or its relatively anaemic processor (single processor on the XT3 vs. dual processors on the generic x86_64 host) • In general, IB should fit our LAN needs by providing good performance to Lustre and allowing the use of legacy applications such as HPSS over SDP (see the sketch below) • Although the XT3's performance is not ideal, it is as good as 10GE and the XT3 is able to leverage its ~100 I/O nodes to make up the difference.
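The appeal of SDP for legacy codes is that an unmodified sockets program can be moved onto the IB fabric simply by preloading OFED's libsdp (LD_PRELOAD=libsdp.so), with no source changes. The sketch below is just a plain TCP client to make that point; the host name and port are placeholders, and nothing in it is specific to HPSS.

```c
/* A plain TCP sockets client: run normally it uses TCP/IP; run with
 * LD_PRELOAD=libsdp.so the same binary can be redirected over SDP
 * (subject to the rules in libsdp.conf). Host and port are placeholders. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;       /* libsdp intercepts AF_INET stream sockets */
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo("spider-login", "5001", &hints, &res);  /* placeholder host/port */
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }

    const char msg[] = "hello over TCP or SDP\n";
    write(fd, msg, sizeof(msg) - 1);   /* same write() either way; only the transport differs */

    close(fd);
    freeaddrinfo(res);
    return 0;
}
```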
End-to-End IB over WAN Testing [Diagram: Lustre, Voltaire 9288, Voltaire 9024, Obsidian Longbow devices, and Ciena CD-CI nodes (ORNL and SNV) linked by 4x Infiniband SDR and OC-192 SONET across the DOE UltraScience Network] • Placed 2 x Obsidian Longbow devices between the Voltaire 9024 and Voltaire 9288 • Provisioned loopback circuits of various lengths on the DOE UltraScience Network and ran tests • RDMA test results: Local (Longbow to Longbow): 7.5Gbps; ORNL <-> ORNL (0.2 miles): 7.5Gbps; ORNL <-> Chicago (1400 miles): 7.46Gbps; ORNL <-> Seattle (6600 miles): 7.23Gbps; ORNL <-> Sunnyvale (8600 miles): 7.2Gbps
Chicago Loopback (1400 miles) • UC: 7.5 Gb/s • RC: 7.5 Gb/s • IPoIB: ~450 Mb/s • SDP: ~500 Mb/s
Seattle Loopback (6600 miles) • UC: 7.2 Gb/s • RC: 6.4 Gb/s • IPoIB: 120 Mb/s • SDP: 120 Mb/s
Sunnyvale Loopback (8600 miles) • UC: 7.2 Gb/s • RC: 6.8 Gb/s • IPoIB: 75 Mb/s • SDP: 95 Mb/s
Maximum number of messages in flight • Seattle: 250 • Sunnyvale: 240
WAN observations • The Obsidian Longbows appear to be extending sufficient link-level credits (UC works great) • RC only performs well at large message sizes (see the bandwidth-delay sketch below) • There seems to be a maximum number of messages allowed in flight (~250) • RC performance does not increase rapidly enough even when the message cap is not an issue • SDP performance is likely poor due to the small message size used (page size) and poor RC performance • IPoIB performance is likely due to interaction between IB and known TCP/IP problems over long distances • What is the problem?
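To make the message-cap observation concrete, the back-of-the-envelope sketch below computes the bandwidth-delay product of a 7.5 Gb/s path for a few round-trip times and the minimum message size needed to keep the pipe full when only ~240 operations can be in flight. The RTT values are illustrative assumptions, not measurements from the testbed.

```c
/* Back-of-the-envelope sketch: with only ~240 RDMA operations in flight,
 * each message must cover (bandwidth-delay product / 240) bytes to keep a
 * 7.5 Gb/s WAN path full. RTT values are rough assumptions for illustration. */
#include <stdio.h>

int main(void)
{
    const double link_bps        = 7.5e9;   /* observed SDR WAN throughput      */
    const double max_outstanding = 240.0;   /* observed in-flight operation cap */
    const double rtt_ms[] = { 1.0, 25.0, 75.0, 150.0 };   /* assumed RTTs */

    printf("%8s %12s %20s\n", "RTT(ms)", "pipe (MB)", "min msg size (KB)");
    for (int i = 0; i < 4; i++) {
        double pipe_bytes = link_bps / 8.0 * rtt_ms[i] / 1000.0;  /* bandwidth-delay product */
        double min_msg_kb = pipe_bytes / max_outstanding / 1024.0;
        printf("%8.0f %12.1f %20.0f\n", rtt_ms[i], pipe_bytes / 1e6, min_msg_kb);
    }
    return 0;
}
```

At a 150 ms round trip the pipe holds roughly 140 MB, so 240 outstanding operations only fill it if each message is on the order of half a megabyte or more, which is consistent with RC performing well only at large message sizes.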
Obsidian's observations (Jason Gunthorpe) • Simulated distance with delay built into the Longbow • On Mellanox HCAs, the best performance is achieved with 240 operations outstanding at once on a single RC QP. Greater values result in 10% less bandwidth • The behaviour is different on PathScale HCAs (RC, 120ms delay): • 240 outstanding, 2MB message size: 679.002 MB/sec • 240 outstanding, 256K message size: 494.573 MB/sec • 250 outstanding, 256K message size: 515.604 MB/sec • 5000 outstanding, 32K message size: 763.325 MB/sec • 2500 outstanding, 32K message size: 645.292 MB/sec • Patching was required to fix QP timeout and 2K ACK window • Patched code yielded a max of 680 MB/s w/o delay
Conclusion • Infiniband makes a great data center interconnect (supports legacy applications via IPoIB and SDP) • There does not appear to be the same intrinsic problem with IB over long distances as there is with IP/Ethernet • The problem appears to be in the Mellanox HCA • This problem must be addressed to effectively use IB for Lustre and legacy applications over the WAN
Contact
Steven Carter
Network Task Lead
Center for Computational Sciences
Oak Ridge National Laboratory
(865) 576-2672
scarter@ornl.gov