
Presentation Transcript


  1. High Performance Data Transfer over TransPAC The 3rd International HEP DataGrid Workshop August 26, 2004 Kyungpook National Univ., Daegu, Korea Masaki Hirabaru masaki@nict.go.jp NICT

  2. Acknowledgements • NICT Kashima Space Research Center Yasuhiro Koyama, Tetsuro Kondo • MIT Haystack Observatory David Lapsley, Alan Whitney • APAN Tokyo NOC • JGN II NOC • NICT R&D Management Department • Indiana U. Global NOC

  3. Contents • e-VLBI • Performance Measurement • TCP test over TransPAC • TCP test in the Laboratory

  4. Motivations
  • MIT Haystack – NICT Kashima e-VLBI experiment on August 27, 2003 to measure UT1-UTC within 24 hours
  - 41.54 GB CRL => MIT at 107 Mbps (~50 mins); 41.54 GB MIT => CRL at 44.6 Mbps (~120 mins)
  - RTT ~220 ms, UDP throughput 300-400 Mbps; however, TCP only ~6-8 Mbps (per session, tuned)
  - BBFTP with 5 x 10 TCP sessions used to gain performance
  • HUT – NICT Kashima Gigabit VLBI experiment
  - RTT ~325 ms, UDP throughput ~70 Mbps; however, TCP ~2 Mbps (as is), ~10 Mbps (tuned)
  - Netants (5 TCP sessions with an FTP stream-restart extension)
  These applications need high-speed, real-time, reliable transfer of huge data volumes over long-haul paths.

  5. VLBI (Very Long Baseline Interferometry)
  • e-VLBI: geographically distributed observation, interconnecting radio antennas around the world
  • Gigabit / real-time VLBI: multi-gigabit-rate sampling of the radio signal from a star
  • Data rate 512 Mbps and up: a high bandwidth-delay product network issue
  [Diagram: at each antenna an A/D converter with a reference clock samples the signal; the data are sent over the Internet to a correlator, with the delay between stations indicated.]
  (NICT Kashima Radio Astronomy Applications Group)

  6. Recent Experiment of UT1-UTC Estimation between NICT Kashima and MIT Haystack (via Washington DC)
  • July 30, 2004, 4am-6am JST (test experiment)
  • Kashima was upgraded to 1G through the JGN II 10G link
  • All processing done in ~4.5 hours (last time ~21 hours)
  • Average ~30 Mbps transfer by bbftp (under investigation)

  7. Network Diagram for e-VLBI and test servers
  [Diagram: bwctl/perf test servers and the e-VLBI server along the Japan-US path Kashima - Koganei - Tokyo XP - TransPAC / JGN II (2.5G-10G, ~9,000 km) - Chicago / Los Angeles - Abilene (10G, ~4,000 km) - Indianapolis / Washington DC - MIT Haystack (1G), plus the APII/JGNII 2.5G SONET link from Kitakyushu / Fukuoka (Genkai XP) to Busan and the KOREN path Busan - Taegu - Daejon / Kwangju - Seoul XP (10G) in Korea.]
  e-VLBI:
  • Done: 1 Gbps upgrade at Kashima
  • On-going: 2.5 Gbps upgrade at Haystack
  • Experiments using 1 Gbps or more
  • Using real-time correlation
  *Info and key exchange page needed, like: http://e2epi.internet2.edu/pipes/ami/bwctl/

  8. APAN JP maps, written in Perl and fig2dev

  9. Purposes
  • Measure, analyze, and improve end-to-end performance in high bandwidth-delay product networks
  - to support networked science applications
  - to help operations in finding a bottleneck
  - to evaluate advanced transport protocols (e.g. Tsunami, SABUL, HSTCP, FAST, XCP, [ours])
  • Improve TCP under easier conditions
  - with a single TCP stream
  - memory to memory
  - a bottleneck but no cross traffic
  • Consume all the available bandwidth

  10. Path
  a) Without a bottleneck queue: Sender - Access (B1) - Backbone (B2) - Access (B3) - Receiver, with B1 <= B2 and B1 <= B3, so no queue builds inside the path.
  b) With a bottleneck queue: same path, but B1 > B2 or B1 > B3, so packets queue at the bottleneck link.

  11. TCP on a path with a bottleneck queue
  • The sender may generate burst traffic, overflowing the bottleneck queue and causing loss.
  • The sender recognizes the overflow only after a feedback delay (less than one RTT).
  • The bottleneck may change over time.

  12. Limiting the Sending Rate
  a) With congestion on the path, a sender transmitting at 1 Gbps gets only ~20 Mbps of throughput.
  b) With the same congestion, a sender limited to 100 Mbps gets ~90 Mbps of throughput: better!
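
To make the rate-limiting idea concrete, here is a minimal sketch (not from the talk) of how a target sending rate translates into an inter-burst pacing gap. The 444 Mbps example value is an assumption, chosen so the result matches the 270 us / 10-packet rate-control setting used in the later laboratory runs, assuming 1500-byte packets.

```python
# Hypothetical helper (not from the talk): the gap a paced sender must leave
# between bursts so that its average rate equals a target value.
def pacing_gap_us(target_rate_bps, packet_bytes, burst_packets=1):
    bits_per_burst = packet_bytes * 8 * burst_packets
    return bits_per_burst / target_rate_bps * 1e6  # microseconds

# Assumed example: 1500-byte packets sent 10 at a time at ~444 Mbps
# need roughly a 270 us gap between bursts (cf. the later rate-control run).
print(round(pacing_gap_us(444e6, 1500, 10)))  # -> 270
```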

  13. Web100 (http://www.web100.org)
  • A kernel patch for monitoring/modifying TCP metrics in the Linux kernel
  • We need to know TCP behavior to identify a problem.
  • Iperf (http://dast.nlanr.net/Projects/Iperf/): TCP/UDP bandwidth measurement
  • bwctl (http://e2epi.internet2.edu/bwctl/): a wrapper for iperf with authentication and scheduling

  14. 1st Step: Tuning a Host with UDP
  • Remove any bottlenecks on a host: CPU, Memory, Bus, OS (driver), …
  • Dell PowerEdge 1650 (*not enough power): Intel Xeon 1.4 GHz x1 (2), Memory 1 GB, Intel Pro/1000 XT onboard PCI-X (133 MHz)
  • Dell PowerEdge 2650: Intel Xeon 2.8 GHz x1 (2), Memory 1 GB, Intel Pro/1000 XT PCI-X (133 MHz)
  • Iperf UDP throughput 957 Mbps
  - GbE wire rate; headers: UDP (8 B) + IP (20 B) + Ethernet II (38 B)
  • Linux 2.4.26 (RedHat 9) with web100
  • PE1650: TxIntDelay=0
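
As a cross-check, the 957 Mbps UDP figure follows directly from Gigabit Ethernet framing overhead; the sketch below assumes iperf's default 1470-byte UDP payload and standard Ethernet II framing (preamble, header, FCS, inter-frame gap).

```python
# Sketch: why iperf UDP tops out near 957 Mbps on GbE
# (assumes iperf's default 1470-byte datagrams; values are illustrative).
GBE_RATE = 1_000_000_000       # bits/s on the wire
UDP_HDR, IP_HDR = 8, 20        # bytes
ETH_OVERHEAD = 38              # 14 header + 4 FCS + 8 preamble + 12 inter-frame gap

payload = 1470                 # iperf default UDP payload size (bytes)
wire_bytes = payload + UDP_HDR + IP_HDR + ETH_OVERHEAD
goodput = GBE_RATE * payload / wire_bytes
print(round(goodput / 1e6))    # -> 957 (Mbps)
```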

  15. 2nd Step: Tuning a Host with TCP
  • Maximum socket buffer size (TCP window size)
  - net.core.wmem_max, net.core.rmem_max (64 MB)
  - net.ipv4.tcp_wmem, net.ipv4.tcp_rmem (64 MB)
  • Driver descriptor length
  - e1000: TxDescriptors=1024, RxDescriptors=256 (default)
  • Interface queue length
  - txqueuelen=100 (default)
  - net.core.netdev_max_backlog=300 (default)
  • Interface queue discipline
  - fifo (default)
  • MTU
  - mtu=1500 (IP MTU)
  • Iperf TCP throughput 941 Mbps
  - GbE wire rate; headers: TCP (32 B) + IP (20 B) + Ethernet II (38 B)
  • Linux 2.4.26 (RedHat 9) with web100
  • Web100 settings (incl. HighSpeed TCP)
  - net.ipv4.web100_no_metric_save=1 (do not store TCP metrics in the route cache)
  - net.ipv4.WAD_IFQ=1 (do not send a congestion signal on buffer full)
  - net.ipv4.web100_rbufmode=0, net.ipv4.web100_sbufmode=0 (disable auto-tuning)
  - net.ipv4.WAD_FloydAIMD=1 (HighSpeed TCP)
  - net.ipv4.web100_default_wscale=7 (default)
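
Similarly, the 941 Mbps TCP figure is the GbE wire rate minus per-segment overhead, and the 64 MB socket buffers cover roughly twice the bandwidth-delay product of the ~220 ms trans-Pacific RTT quoted earlier. A small sketch, assuming a 1500-byte IP MTU and the TCP timestamp option:

```python
# Sketch: TCP goodput at GbE wire rate and the socket buffer sizing
# (assumes a 1500-byte IP MTU and TCP timestamps, i.e. a 32-byte TCP header).
GBE_RATE = 1e9
ETH_OVERHEAD = 38                    # bytes of Ethernet framing per packet
mss = 1500 - 20 - 32                 # IP header + TCP header with timestamps
print(round(GBE_RATE * mss / (1500 + ETH_OVERHEAD) / 1e6))   # -> 941 (Mbps)

rtt = 0.220                          # ~220 ms Kashima--Haystack RTT
bdp_bytes = GBE_RATE * rtt / 8       # ~27.5 MB of data in flight at 1 Gbps
print(round(2 * bdp_bytes / 2**20))  # -> 52 (MB), hence the 64 MB buffers
```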

  16. Network Diagram for TransPAC/I2 Measurement (Oct. 2003)
  [Diagram: sender (1G) at Koganei / Tokyo XP - TransPAC 2.5G (~9,000 km) - Los Angeles - Abilene 10G (~4,000 km) - Indianapolis / Washington DC - receiver (1G) at the I2 venue; MIT Haystack (1G) and the Kashima servers (general 0.1G and e-VLBI, ~100 km from Tokyo) also attach to this path.]
  • Sender: PE1650, Linux 2.4.22 (RH 9), Xeon 1.4 GHz, Memory 1 GB, GbE Intel Pro/1000 XT
  • Receiver: Mark5, Linux 2.4.7 (RH 7.1), P3 1.3 GHz, Memory 256 MB, GbE SK-9843
  • Iperf UDP ~900 Mbps (no loss)

  17. TransPAC/I2 #1: High Speed (60 mins)

  18. TransPAC/I2 #2: Reno (10 mins)

  19. TransPAC/I2 #3: High Speed (Win 12MB)

  20. Test in a laboratory – with bottleneck
  Setup: Sender (PE 2650) - L2SW (FES12GCF) - PacketSphere (bandwidth 800 Mbps, buffer 256 KB, delay 88 ms, loss 0) - Receiver (PE 1650); GbE/T and GbE/SX links; 2*BDP = 16 MB
  • #1: Reno => Reno
  • #2: High Speed TCP => Reno
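
A quick calculation (a sketch using the emulator settings above, not from the slides) shows how small the 256 KB PacketSphere buffer is relative to the emulated path's bandwidth-delay product, which is why the bottleneck queue overflows easily in these runs.

```python
# Sketch: BDP of the emulated path vs. the 256 KB bottleneck buffer.
rate_bps, rtt_s = 800e6, 0.088
bdp_bytes = rate_bps * rtt_s / 8
print(round(bdp_bytes / 2**20, 1))             # BDP ~ 8.4 MB (2 x BDP ~ 16.8 MB)
print(round(256 * 1024 / bdp_bytes * 100, 1))  # buffer is only ~3% of one BDP
```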

  21. Laboratory #1, #2: 800M bottleneck (Reno vs. HighSpeed)

  22. Laboratory #3, #4, #5: High Speed (Limiting)
  • #3: Window size limit (16 MB), with limited slow-start (1000)
  • #4: Rate control (270 us every 10 packets), with limited slow-start (1000)
  • #5: Cwnd clamp (95%), with limited slow-start (100)

  23. How to know when the bottleneck changed
  • The end host probes periodically (e.g. with a packet train), as sketched below
  • The router notifies the end host (e.g. XCP)
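
The slide only names the packet-train idea; below is a minimal, hypothetical sketch of how an end host could turn the arrival spacing (dispersion) of back-to-back probe packets into a bottleneck-capacity estimate. The function name and timestamps are illustrative, not part of the talk.

```python
# Hypothetical sketch of packet-train dispersion (not from the talk):
# back-to-back probes leave the bottleneck spaced by its service time,
# so the arrival spacing at the receiver reveals the bottleneck capacity.
def estimate_bottleneck_bps(arrival_times_s, packet_bytes):
    dispersion = arrival_times_s[-1] - arrival_times_s[0]
    bits = packet_bytes * 8 * (len(arrival_times_s) - 1)
    return bits / dispersion

# Ten 1500-byte probes arriving 15 us apart suggest an ~800 Mbps bottleneck.
arrivals = [i * 15e-6 for i in range(10)]
print(round(estimate_bottleneck_bps(arrivals, 1500) / 1e6))  # -> 800
```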

  24. Another approach: enough buffer on the router
  • At least 2 x BDP (bandwidth-delay product), e.g. 1 Gbps x 200 ms x 2 = 400 Mb ~ 50 MB
  • Replace fast SRAM with DRAM in order to reduce space and cost

  25. Test in a laboratory – with bottleneck (2)
  Setup: Sender (PE 2650) - L2SW (FES12GCF) - Network Emulator (bandwidth 800 Mbps, buffer 64 MB, delay 88 ms, loss 0) - Receiver (PE 1650); GbE/T and GbE/SX links; 2*BDP = 16 MB
  • #6: High Speed TCP => Reno

  26. Laboratory #6: 800M bottleneck (HighSpeed)

  27. Report on MTU
  • Increasing the MTU (packet size) results in better performance. The standard MTU is 1500 B. A 9 KB MTU is available throughout the Abilene, TransPAC, and APII backbones.
  • On Aug 25, 2004, the remaining 1500 B link at Tokyo XP was upgraded to 9 KB, so a 9 KB MTU is now available from Busan to Los Angeles.
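
A rough sketch of what a 9 KB MTU buys on a GbE path (illustrative arithmetic only, assuming Ethernet II framing and TCP timestamps): framing efficiency improves a little, and far fewer segments are needed to fill a 2 x BDP window, which eases per-packet processing.

```python
# Sketch: framing efficiency and segments-in-flight for 1500 B vs. 9 KB MTU
# (assumes Ethernet II framing and TCP with timestamps; numbers are illustrative).
ETH_OVERHEAD, HEADERS = 38, 52     # Ethernet framing; IP (20) + TCP w/ timestamps (32)
WINDOW = 55e6                      # ~2 x BDP for 1 Gbps x 220 ms, in bytes
for mtu in (1500, 9000):
    mss = mtu - HEADERS
    efficiency = mss / (mtu + ETH_OVERHEAD)
    print(mtu, f"{efficiency:.1%}", round(WINDOW / mss))
# 1500 B -> ~94.1% of wire rate, ~38,000 segments in flight
# 9000 B -> ~99.0% of wire rate,  ~6,100 segments in flight
```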

  28. Current and Future Plans of e-VLBI
  • KOA (Korean Observatory of Astronomy) has one existing radio telescope, though in a different band from ours, and is building another three radio telescopes.
  • A dedicated light path from Europe to Asia through the US is being considered.
  • An e-VLBI demonstration at SuperComputing 2004 (November) is being planned, interconnecting radio telescopes in Europe, the US, and Japan.
  • A gigabit A/D converter is ready; a 10G version is now being implemented.
  • Our performance measurement infrastructure will be merged into a framework of a Global (Network) Observatory maintained by NOC people (Internet2 piPEs, APAN CMM, and e-VLBI).

  29. Questions?
  • See http://www2.nict.go.jp/ka/radioastro/index.html for VLBI
