
Presentation Transcript


  1. 10GbE WAN Data Transfers for Science. High Energy/Nuclear Physics (HENP) SIG, Fall 2004 Internet2 Member Meeting. Yang Xia, HEP, Caltech, yxia@caltech.edu. September 28, 2004, 8:00 AM – 10:00 AM

  2. Agenda • Introduction • 10GE NIC comparisons & contrasts • Overview of LHCnet • High TCP performance over wide area networks • Problem statement • Benchmarks • Network architecture and tuning • Networking enhancements in Linux 2.6 kernels • Light paths: UltraLight • FAST TCP protocol development

  3. Introduction • The High Energy Physics LHC computing model projects that data from the experiments will be stored at rates of 100 – 1500 Mbytes/sec throughout the year. • Many Petabytes per year of stored and processed binary data will be accessed and processed repeatedly by the worldwide collaborations. • Network backbone capacities are advancing rapidly into the 10 Gbps range, with seamless integration into SONET infrastructure. • Proliferating GbE adapters on commodity desktops create a bottleneck at GbE switch I/O ports. • More commercial 10GbE adapter products are entering the market, e.g. Intel, S2io, IBM, Chelsio.

  4. IEEE 802.3ae Port Types

  5. 10GbE NICs Comparison (Intel vs. S2io) • Standards support (both NICs): • 802.3ae standard, full duplex only • 64-bit/133MHz PCI-X bus • 1310nm SMF / 850nm MMF • Jumbo frame support • The major differences are in performance features.

  6. LHCnet Network Setup • 10 Gbps transatlantic link extended to Caltech via Abilene and CENIC. The NLR wave local loop is a work in progress. • High-performance end stations (Intel Xeon & Itanium, AMD Opteron) running both Linux and Windows. • We have added a 64x64 non-SONET all-optical switch from Calient to provision dynamic paths via MonALISA, in the context of UltraLight.

  7. LHCnet Topology: August 2004 • Services: IPv4 & IPv6; Layer 2 VPN; QoS; scavenger; large MTU (9k); MPLS; link aggregation; monitoring (MonALISA) • Clean separation of production and R&D traffic based on CCC • Unique multi-platform / multi-technology optical transatlantic test-bed • Powerful Linux farms equipped with 10 GE adapters (Intel; S2io) • Equipment loans and donations; exceptional discounts • NEW: Photonic switch (Glimmerglass T300) evaluation • Circuit (“pure” light path) provisioning • [Topology diagram spanning the Caltech/DoE PoP in Chicago (StarLight) and CERN in Geneva over an OC-192 (production and R&D); equipment labels include Juniper T320, Cisco 7609, Juniper M10, Alcatel 7770 and Procket 8801 routers, Linux farms (20 P4 CPUs, 6 TBytes), 10GE links to American and European partners, internal networks, and the LHCnet testbed]

  8. LHCnet Topology: August 2004 (cont’d) • GMPLS-controlled PXCs and IP/MPLS routers can provide dynamic shortest-path setup, as well as path setup based on link priority. • [Figure: optical switch matrix of the Calient photonic cross-connect switch]

  9. Problem Statement • To get the most bang for the buck on a 10GbE WAN, packet loss is the #1 enemy, because the AIMD algorithm makes TCP slow to respond (see the sketch below): • No loss: cwnd := cwnd + 1/cwnd (per ACK) • Loss: cwnd := cwnd/2 • Fairness: TCP Reno has an MTU & RTT bias; different MTUs and delays lead to very poor sharing of the bandwidth.
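
To make the slow AIMD response concrete, here is a toy model (an illustration, not taken from the slides; the 10 Gbps capacity, 250 ms RTT and 9000-byte MTU are assumed, representative numbers) of a single Reno flow halving its window on one loss and then growing back by one segment per RTT:

    # Toy AIMD (TCP Reno) recovery model: multiplicative decrease of 1/2 on loss,
    # then additive increase of ~1 segment per RTT in congestion avoidance.
    LINK_BPS = 10e9      # assumed path capacity (bits/s)
    RTT = 0.250          # assumed round-trip time (s)
    MSS = 8960           # assumed payload per segment for a 9000-byte MTU (bytes)

    full_cwnd = LINK_BPS * RTT / (8 * MSS)   # segments in flight needed to fill the pipe

    cwnd, rtts = full_cwnd / 2, 0            # window right after a single loss
    while cwnd < full_cwnd:                  # grow back, one segment per RTT
        cwnd += 1
        rtts += 1

    print(f"pipe holds ~{full_cwnd:.0f} segments")
    print(f"recovery after one loss: {rtts} RTTs, about {rtts * RTT / 60:.0f} minutes")

With these assumed numbers a single loss costs over an hour of ramp-up, which is why avoiding loss dominates everything else at this scale.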

  10. Internet2 Land Speed Record (LSR) • IPv6 record: 4.0 Gbps between Geneva and Phoenix (SC2003) • IPv4 multi-stream record with Windows: 7.09 Gbps between Caltech and CERN (11k km) • Single stream: 6.6 Gbps x 16.5k km with Linux • We have exceeded 100 Petabit-m/sec with both Linux & Windows • Testing over different WAN distances doesn’t seem to change the TCP rate: • 7k km (Geneva - Chicago) • 11k km (normal Abilene path) • 12.5k km (Petit Abilene Tour) • 16.5k km (Grande Abilene Tour) • [Plot: monitoring of the Abilene traffic in LA]
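
As a quick arithmetic check of the ">100 Petabit-m/sec" figure, using the single-stream numbers quoted on this slide:

    # LSR metric = throughput x terrestrial distance (bit-metres per second).
    rate_bps = 6.6e9        # single-stream Linux result from this slide
    distance_m = 16.5e6     # 16.5k km path, in metres
    print(rate_bps * distance_m / 1e15, "Petabit-metres/second")   # ~108.9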

  11. Internet2 Land Speed Record (cont’d) • Single-stream IPv4 category

  12. Primary Workstation Summary • Sending stations: • Newisys 4300: 4 x AMD Opteron 248 2.2GHz, 4GB PC3200 per processor, up to 5 x 1GB/s 133MHz/64-bit PCI-X slots, no FSB bottleneck; HyperTransport connects the CPUs (up to 19.2GB/s peak BW per processor); 24-disk SATA RAID system at 1.2GB/s read/write • Opteron white box with Tyan S2882 motherboard: 2 x Opteron 2.4 GHz, 2 GB DDR; AMD8131 chipset PCI-X bus speed ~940MB/s • Receiving station: • HP rx4640: 4 x 1.5GHz Itanium-2, zx1 chipset, 8GB memory; SATA disk RAID system

  13. Linux Tuning Parameters • PCI-X bus parameters (via the setpci command): • Maximum Memory Read Byte Count (MMRBC) controls PCI-X transmit burst lengths on the bus; available values are 512 (default), 1024, 2048 and 4096 bytes • max_split_trans controls the number of outstanding split transactions; available values are 1, 2, 3, 4 • Set latency_timer to 248 • Interrupt coalescence and CPU affinity: coalescing batches interrupts to cut per-packet overhead, and the IRQ affinity mask allows a user to change the CPU affinity of the interrupts in the system • Large window size = BW * Delay (the bandwidth-delay product, BDP; see the sketch below); too large a window will negatively impact throughput • 9000-byte MTU and 64KB TSO
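
A minimal sketch of the window-sizing rule of thumb above (window = bandwidth x delay); the 10 Gbps capacity and 250 ms RTT are assumed, representative values rather than numbers from the slide:

    # Bandwidth-delay product (BDP): bytes that must be in flight to keep
    # a long fat pipe full, i.e. the TCP window/buffer size to aim for.
    link_bps = 10e9     # assumed path capacity
    rtt_s = 0.250       # assumed round-trip time

    bdp_bytes = link_bps * rtt_s / 8
    print(f"BDP ~ {bdp_bytes / 2**20:.0f} MB of socket buffer per connection")   # ~298 MB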

  14. Linux Tuning Parameters (cont’d) • Use the sysctl command to modify /proc parameters and increase the TCP memory limits (illustrative example below).
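
The slide does not list the values actually used; purely as an illustration, the sketch below writes BDP-sized limits into the /proc TCP memory entries mentioned above (the same knobs that sysctl -w net.core.rmem_max=... and friends would set). The sizes are assumptions, and root privileges are required:

    # Illustrative only: push BDP-sized buffer limits into the kernel's
    # TCP memory knobs. Values are assumed, not the ones from the record runs.
    BDP = 256 * 2**20   # ~256 MB, roughly a 10 Gbps x 200 ms path

    settings = {
        "/proc/sys/net/core/rmem_max": str(BDP),
        "/proc/sys/net/core/wmem_max": str(BDP),
        "/proc/sys/net/ipv4/tcp_rmem": f"4096 87380 {BDP}",   # min default max
        "/proc/sys/net/ipv4/tcp_wmem": f"4096 65536 {BDP}",   # min default max
    }

    for path, value in settings.items():
        with open(path, "w") as f:   # equivalent to `sysctl -w`
            f.write(value)
        print(path, "=", value)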

  15. 10GbE Network Testing Tools • In Linux: • Iperf: • Version 1.7.0 doesn’t work by default on the Itanium2 machine. Workarounds: 1) compile using RedHat’s gcc 2.96, or 2) make it single-threaded • UDP send rate is limited to 2 Gbps because of a 32-bit data type • Nttcp: measures the time required to send a preset chunk of data • Netperf (v2.1): sends as much data as it can in an interval and collects the results at the end of the test; great for end-to-end latency tests • Tcpdump: a challenging task on a 10GbE link • In Windows: • NTttcp: uses Windows APIs • Microsoft Network Monitoring Tool • Ethereal
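
The slide names the tools but not their invocations; as a rough stand-in for what nttcp measures (the time to send a preset chunk of data, memory to memory), here is a minimal throughput probe. The port, transfer size and host name are arbitrary assumptions:

    # Minimal nttcp-style memory-to-memory throughput probe (illustrative).
    # Run "python probe.py server" on the receiver and
    #     "python probe.py client <host>" on the sender.
    import socket, sys, time

    PORT, CHUNK, TOTAL = 5001, 1 << 20, 8 << 30   # 1 MB writes, 8 GB total (assumed)

    def server():
        with socket.create_server(("", PORT)) as srv:
            conn, _ = srv.accept()
            with conn:
                while conn.recv(CHUNK):      # drain and discard
                    pass

    def client(host):
        buf = bytearray(CHUNK)
        with socket.create_connection((host, PORT)) as s:
            start, sent = time.time(), 0
            while sent < TOTAL:
                s.sendall(buf)
                sent += CHUNK
            # Caveat: bytes still queued in the socket buffer count as "sent".
            elapsed = time.time() - start
        print(f"{sent * 8 / elapsed / 1e9:.2f} Gbit/s in {elapsed:.1f} s")

    if __name__ == "__main__":
        server() if sys.argv[1] == "server" else client(sys.argv[2])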

  16. Networking Enhancements in Linux 2.6 • The 2.6.x Linux kernel includes many general improvements to system performance, scalability and hardware drivers: • Improved POSIX threading support (NGPT and NPTL) • AMD 64-bit (x86-64) support and improved NUMA support • TCP Segmentation Offload (TSO) • Network interrupt mitigation: improved handling of high network loads • Zero-copy networking and NFS: one system call with sendfile(sd, fd, &offset, nbytes) • NFS version 4
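
The zero-copy sendfile() call named above is exposed to user space (and, in Python, as os.sendfile()); here is a minimal sketch of sending a file over TCP without a user-space copy, where the file path, host and port are hypothetical:

    # Zero-copy file send over TCP using the sendfile(sd, fd, &offset, nbytes)
    # call from the slide; the kernel moves file pages straight to the socket.
    import os, socket

    def send_file_zero_copy(path, host, port, chunk=1 << 20):
        with open(path, "rb") as f, socket.create_connection((host, port)) as sock:
            offset, size = 0, os.fstat(f.fileno()).st_size
            while offset < size:
                offset += os.sendfile(sock.fileno(), f.fileno(), offset, chunk)

    # Hypothetical file and destination, for illustration only.
    send_file_zero_copy("/data/run0001.dat", "receiver.example.org", 5001)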

  17. TCP Segmentation Offload • Must have hardware support in the NIC. • It’s a sender-only option. It allows the TCP layer to hand a larger-than-normal segment of data, e.g. 64KB, to the driver and then the NIC. The NIC then fragments the large packet into smaller (<= MTU) packets. • TSO is disabled in multiple places in the TCP functions: when SACKs are received (in tcp_sacktag_write_queue) and when a packet is retransmitted (in tcp_retransmit_skb). However, TSO is never re-enabled in the current 2.6.8 kernel when the TCP state changes back to normal (TCP_CA_Open), so the kernel needs a patch to re-enable TSO. • Benefits: • TSO can reduce CPU overhead by 10%~15%. • Increases TCP responsiveness: p = (C*RTT*RTT)/(2*MSS), where p is the time to recover to full rate, C the capacity of the link, RTT the round-trip time and MSS the maximum segment size (worked numbers below).
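
Plugging representative numbers into the slide's responsiveness formula shows the effect of segment size; the 10 Gbps capacity and 250 ms RTT are assumptions, and the last row treats the 64 KB TSO super-segment as the effective MSS, as the benefit list above suggests:

    # Time to recover to full rate after one loss: p = C * RTT^2 / (2 * MSS).
    C = 10e9       # assumed link capacity, bits/s
    RTT = 0.250    # assumed round-trip time, s

    for label, mss_bytes in [("1500-byte MTU", 1460),
                             ("9000-byte MTU", 8960),
                             ("64 KB TSO segment", 65536)]:
        p = C * RTT**2 / (2 * mss_bytes * 8)    # MSS converted to bits
        print(f"{label:>18}: ~{p / 60:.0f} minutes to recover")

With a standard 1500-byte MTU the recovery time is measured in hours, which is the motivation for jumbo frames and TSO on these paths.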

  18. Responsiveness with and w/o TSO

  19. The Transfer over 10GbE WAN • With a 9000-byte MTU and a stock Linux 2.6.7 kernel: • LAN: 7.5Gb/s • WAN: 7.4Gb/s (the receiver is CPU-bound) • We’ve reached the PCI-X bus limit with a single NIC. Using bonding (802.3ad) of multiple interfaces we can bypass the PCI-X bus limitation, but only in the multiple-stream case: • LAN: 11.1Gb/s • WAN: ??? (a.k.a. doomsday for Abilene)

  20. UltraLight: Developing Advanced Network Services for Data Intensive HEP Applications • UltraLight (funded by NSF ITR): a next-generation hybrid packet- and circuit-switched network infrastructure. • Packet-switched: cost-effective solution; requires ultrascale protocols to share 10G wavelengths efficiently and fairly • Circuit-switched: scheduled or sudden “overflow” demands handled by provisioning additional wavelengths; use path diversity, e.g. across the US, Atlantic, Canada, … • Extend and augment existing grid computing infrastructures (currently focused on CPU/storage) to include the network as an integral component • Use MonALISA to monitor and manage global systems • Partners: Caltech, UF, FIU, UMich, SLAC, FNAL, MIT/Haystack; CERN, Internet2, NLR, CENIC; Translight, UKLight, Netherlight; UvA, UCL, KEK, Taiwan • Strong support from Cisco and Level(3)

  21. “Ultrascale” protocol development: FAST TCP • FAST TCP: • Based on TCP Vegas • Uses end-to-end delay and loss to dynamically adjust the congestion window • Achieves any desired fairness, expressed by a utility function • Very high utilization (99% in theory) • Comparison with other TCP variants (e.g. BIC, Westwood+): [Plot: single-flow tests of FAST vs. Linux TCP, Linux Westwood+ and Linux BIC TCP on an OC-192 path (capacity 9.5 Gbps, 264 ms round-trip latency); measured bandwidth use ranges from 30% to 79% across the variants]
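
For illustration only, here is a toy iteration of a FAST-style delay-based window update of the form w := min(2w, (1-γ)·w + γ·(baseRTT/RTT·w + α)), run against a crude single-flow path model; α, γ and the path parameters are assumptions, not values from the slide:

    # Toy FAST-style window update against a crude bottleneck model:
    # the window grows while the RTT stays at its base value and settles
    # once queueing delay appears, ending near (pipe capacity + alpha).
    LINK_BPS, MSS, BASE_RTT = 10e9, 8960, 0.250   # assumed path
    PKT_TIME = MSS * 8 / LINK_BPS                 # seconds to serialize one packet
    PIPE = BASE_RTT / PKT_TIME                    # packets the empty pipe holds

    alpha, gamma = 200.0, 0.5                     # assumed packets kept queued / update gain
    w = 10.0
    for _ in range(400):                          # one update per RTT
        queued = max(0.0, w - PIPE)               # this flow's backlog at the bottleneck
        rtt = BASE_RTT + queued * PKT_TIME        # queueing inflates the measured RTT
        w = min(2 * w, (1 - gamma) * w + gamma * (BASE_RTT / rtt * w + alpha))

    print(f"window settles near {w:.0f} packets (pipe {PIPE:.0f} + alpha {alpha:.0f})")

Because the update reacts to delay rather than waiting for a loss, the window converges to a stable operating point instead of oscillating the way AIMD does.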

  22. Summary and Future Approaches • Full TCP offload engines will be available for 10GbE in the near future; there is a trade-off between offloading work from the CPU and ensuring data integrity. • Develop and provide the cost-effective transatlantic network infrastructure and services required to meet the HEP community’s needs: • a highly reliable, high-performance production network, with rapidly increasing capacity and a diverse workload • an advanced research backbone for network and Grid developments, including operations and management assisted by agent-based software (MonALISA) • Concentrate on reliable Terabyte-scale file transfers, to drive development of an effective Grid-based Computing Model for LHC data analysis.
