High Performance Data Transfer over TransPAC The 3rd International HEP DataGrid Workshop August 26, 2004 Kyungpook National Univ., Daegu, Korea Masaki Hirabaru masaki@nict.go.jp NICT
Acknowledgements • NICT Kashima Space Research Center Yasuhiro Koyama, Tetsuro Kondo • MIT Haystack Observatory David Lapsley, Alan Whitney • APAN Tokyo NOC • JGN II NOC • NICT R&D Management Department • Indiana U. Global NOC
Contents • e-VLBI • Performance Measurement • TCP test over TransPAC • TCP test in the Laboratory
Motivations • MIT Haystack – NICT Kashima e-VLBI experiment on August 27, 2003 to measure UT1-UTC within 24 hours • 41.54 GB CRL => MIT at 107 Mbps (~50 mins); 41.54 GB MIT => CRL at 44.6 Mbps (~120 mins) • RTT ~220 ms, UDP throughput 300-400 Mbps; however, TCP only ~6-8 Mbps (per session, tuned) • BBFTP with 5 x 10 TCP sessions used to gain performance • HUT – NICT Kashima Gigabit VLBI experiment • RTT ~325 ms, UDP throughput ~70 Mbps; however, TCP ~2 Mbps (as is), ~10 Mbps (tuned) • NetAnts (5 TCP sessions with FTP stream restart extension) These applications need high-speed, real-time, reliable transfer of huge data volumes over long-haul paths.
VLBI (Very Long Baseline Interferometry) • e-VLBI: geographically distributed observation, interconnecting radio antennas around the world • Gigabit / real-time VLBI: multi-gigabit rate sampling of the radio signal from a star • A high bandwidth-delay product network issue [Diagram: antennas with A/D converters and clocks send sampled data over the Internet to a correlator, which measures the delay between stations; data rate 512 Mbps and up] (NICT Kashima Radio Astronomy Applications Group)
Recent Experiment of UT1-UTC Estimation between NICT Kashima and MIT Haystack (via Washington DC) • July 30, 2004, 4am-6am JST • Kashima was upgraded to 1G through the JGN II 10G link • All processing done in ~4.5 hours (last time ~21 hours) • Average ~30 Mbps transfer by bbftp (under investigation)
Network Diagram for e-VLBI and test servers [Diagram: Kashima e-VLBI server (1G, 10G planned) – Koganei / Tokyo XP – TransPAC / JGN II (2.5G SONET and 10G, ~9,000 km) – Chicago / Los Angeles – Abilene 10G (~4,000 km) – Washington DC – 2.4G (x2) – MIT Haystack (1G, 10G planned); plus the APII/JGNII – KOREN path via Kitakyushu, Fukuoka / Genkai XP, Busan, Taegu, Daejon, and Seoul XP, with bwctl and perf servers along the way] e-VLBI status: • Done: 1 Gbps upgrade at Kashima • On-going: 2.5 Gbps upgrade at Haystack • Experiments using 1 Gbps or more • Using real-time correlation *An info and key exchange page is needed, like http://e2epi.internet2.edu/pipes/ami/bwctl/
APAN JP maps, written in Perl and fig2dev
Purposes • Measure, analyze, and improve end-to-end performance in high bandwidth-delay product networks • to support networked science applications • to help operations find a bottleneck • to evaluate advanced transport protocols (e.g. Tsunami, SABUL, HSTCP, FAST, XCP, [ours]) • Improve TCP under easier conditions • with a single TCP stream • memory to memory • a bottleneck but no cross traffic • Goal: consume all the available bandwidth
Path [Diagram: Sender – Access (B1) – Backbone (B2) – Access (B3) – Receiver] a) without a bottleneck queue: B1 <= B2 and B1 <= B3 b) with a bottleneck queue: B1 > B2 or B1 > B3; a queue builds at the bottleneck link
TCP on a path with a bottleneck queue [Diagram: packets queue at the bottleneck and are lost on overflow] • The sender may generate burst traffic. • The sender recognizes the overflow only after a delay (< RTT). • The bottleneck may change over time.
Limiting the Sending Rate [Diagram: a) sending at 1 Gbps into the path causes congestion and yields only 20 Mbps throughput; b) limiting the sender to 100 Mbps avoids congestion and yields 90 Mbps throughput – better!]
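As a minimal sketch of the idea above (not part of the original experiment), a Linux sender can be clamped below the bottleneck rate with a token-bucket qdisc; the interface name eth0 and the 100 Mbps / burst / latency values are illustrative assumptions.

```bash
# Hypothetical example: limit the sender's output to ~100 Mbps with tc's
# token bucket filter so the flow stays below the congested link's capacity.
tc qdisc add dev eth0 root tbf rate 100mbit burst 64kb latency 50ms

# Remove the limit again when the test is done.
tc qdisc del dev eth0 root
```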
Web100 (http://www.web100.org) • A kernel patch for monitoring/modifying TCP metrics in Linux kernel • We need to know TCP behavior to identify a problem. • Iperf (http://dast.nlanr.net/Projects/Iperf/) • TCP/UDP bandwidth measurement • bwctl (http://e2epi.internet2.edu/bwctl/) • Wrapper for iperf with authentication and scheduling
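As a usage sketch for the tools above (the hostnames and option values are assumptions about a typical invocation, not commands from the original tests):

```bash
# iperf memory-to-memory test: start the receiver, then drive it from the sender.
iperf -s                                     # on the receiver
iperf -c receiver.example.org -t 30 -i 5     # 30 s TCP test, report every 5 s
iperf -c receiver.example.org -u -b 900M     # UDP test at ~900 Mbps offered load

# bwctl wraps iperf with authentication and scheduling; run a test
# toward a remote bwctl server.
bwctl -c bwctl.example.org -t 30
```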
1st Step: Tuning a Host with UDP • Remove any bottlenecks on the host • CPU, memory, bus, OS (driver), … • Dell PowerEdge 1650 (*not enough power) • Intel Xeon 1.4GHz x1(2), Memory 1GB • Intel Pro/1000 XT onboard PCI-X (133 MHz) • Dell PowerEdge 2650 • Intel Xeon 2.8GHz x1(2), Memory 1GB • Intel Pro/1000 XT PCI-X (133 MHz) • Iperf UDP throughput 957 Mbps • GbE wire rate minus headers: UDP(8B)+IP(20B)+Ethernet II(38B) • Linux 2.4.26 (RedHat 9) with web100 • PE1650: TxIntDelay=0
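A minimal sketch of the host-side steps above; the values mirror the slide, but treat the interface/driver handling and the receiver hostname as assumptions for a comparable setup.

```bash
# Reload the e1000 driver with transmit interrupt delay disabled (as on the PE1650).
modprobe -r e1000
modprobe e1000 TxIntDelay=0

# Verify raw UDP throughput host-to-host before any TCP tuning.
iperf -s -u                                        # on the receiver
iperf -c receiver.example.org -u -b 1000M -t 30    # on the sender
```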
2nd Step: Tuning a Host with TCP • Maximum socket buffer size (TCP window size) • net.core.wmem_max, net.core.rmem_max (64MB) • net.ipv4.tcp_wmem, net.ipv4.tcp_rmem (64MB) • Driver descriptor length • e1000: TxDescriptors=1024, RxDescriptors=256 (default) • Interface queue length • txqueuelen=100 (default) • net.core.netdev_max_backlog=300 (default) • Interface queue discipline • fifo (default) • MTU • mtu=1500 (IP MTU) • Iperf TCP throughput 941 Mbps • GbE wire rate minus headers: TCP(32B)+IP(20B)+Ethernet II(38B) • Linux 2.4.26 (RedHat 9) with web100 • Web100 (incl. HighSpeed TCP) • net.ipv4.web100_no_metric_save=1 (do not store TCP metrics in the route cache) • net.ipv4.WAD_IFQ=1 (do not send a congestion signal on buffer full) • net.ipv4.web100_rbufmode=0, net.ipv4.web100_sbufmode=0 (disable auto tuning) • net.ipv4.WAD_FloydAIMD=1 (HighSpeed TCP) • net.ipv4.web100_default_wscale=7 (default)
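The settings above can be applied with sysctl; a sketch assuming the same Linux 2.4 + web100 kernel. The 64 MB limits follow the slide, the min/default values in tcp_wmem/tcp_rmem are illustrative, and the web100 variables require the web100-patched kernel.

```bash
# Socket buffer limits (64 MB) and per-connection TCP buffer ranges.
sysctl -w net.core.wmem_max=67108864
sysctl -w net.core.rmem_max=67108864
sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"

# web100 knobs from the slide (only exist with the web100 patch applied).
sysctl -w net.ipv4.web100_no_metric_save=1
sysctl -w net.ipv4.WAD_IFQ=1
sysctl -w net.ipv4.web100_rbufmode=0
sysctl -w net.ipv4.web100_sbufmode=0
sysctl -w net.ipv4.WAD_FloydAIMD=1
sysctl -w net.ipv4.web100_default_wscale=7

# Driver descriptor ring sizes (e1000 module parameters).
modprobe e1000 TxDescriptors=1024 RxDescriptors=256
```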
Network Diagram for TransPAC/I2 Measurement (Oct. 2003) [Diagram: sender at Koganei / Tokyo XP – TransPAC 2.5G (~9,000 km) – Los Angeles – Abilene 10G (~4,000 km) – Indianapolis / Washington DC – receiver at the Internet2 venue and MIT Haystack; Kashima connected to Tokyo XP at 0.1G (general) and 1G (e-VLBI)] • Sender: PE1650, Linux 2.4.22 (RH 9), Xeon 1.4GHz, Memory 1GB, GbE Intel Pro/1000 XT • Receiver: Mark5, Linux 2.4.7 (RH 7.1), P3 1.3GHz, Memory 256MB, GbE SK-9843 • Iperf UDP ~900 Mbps (no loss)
Test in a laboratory – with bottleneck • Setup: PE 2650 (sender) – L2SW (FES12GCF) – PacketSphere emulator – PE 1650 (receiver), GbE/T and GbE/SX links • Emulated path: bandwidth 800 Mbps, buffer 256 KB, delay 88 ms, loss 0; 2*BDP = 16 MB • #1: Reno => Reno • #2: HighSpeed TCP => Reno
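The tests used a PacketSphere hardware emulator. Purely as an illustrative alternative (not what was used here), a similar 800 Mbps / 88 ms / small-buffer bottleneck could be approximated on a Linux box placed in the path; the interface name and buffer/burst figures are assumptions.

```bash
# Hypothetical software emulation of the bottleneck used in tests #1/#2:
# 88 ms of delay plus an 800 Mbps rate limit with a ~256 KB queue.
# "eth1" is the forwarding interface toward the receiver (an assumption).
tc qdisc add dev eth1 root handle 1: netem delay 88ms
tc qdisc add dev eth1 parent 1:1 handle 10: tbf rate 800mbit burst 128kb limit 262144
```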
Laboratory #1, #2: 800 Mbps bottleneck [Throughput plots for Reno and HighSpeed TCP]
Laboratory #3, #4, #5: HighSpeed TCP with limiting • #3: Window size clamp (16 MB) with limited slow-start (1000) • #4: Rate control, 270 us every 10 packets, with limited slow-start (1000) • #5: Cwnd clamp (95%) with limited slow-start (100)
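The window-size clamp in test #3 can be approximated at the application level; a sketch assuming iperf and a 16 MB window equal to 2*BDP for the 800 Mbps x 88 ms path (this is an approximation of the kernel-level clamp, not the exact mechanism used).

```bash
# Limit the TCP window to roughly 2*BDP (16 MB) so the flow cannot
# overrun the emulated bottleneck; -w sets the socket buffer size.
iperf -s -w 16M                               # receiver
iperf -c receiver.example.org -w 16M -t 60    # sender
```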
How to know when the bottleneck has changed • The end host probes periodically (e.g. with a packet train) • The router notifies the end host (e.g. XCP)
Another approach: enough buffer on the router • At least 2 x BDP (bandwidth-delay product), e.g. 1 Gbps x 200 ms x 2 = 400 Mb ~ 50 MB • Replace fast SRAM with DRAM in order to reduce space and cost
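For reference, the buffer-sizing arithmetic above as a one-line check (values are the slide's 1 Gbps / 200 ms example):

```bash
# 2 x BDP for a 1 Gbps path with 200 ms RTT, in megabytes: 2*1000 Mbit/s * 0.2 s / 8
echo "2 * 1000 * 0.200 / 8" | bc -l    # => 50 (MB), i.e. 400 Mb
```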
Test in a laboratory – with bottleneck (2) • Setup: PE 2650 (sender) – L2SW (FES12GCF) – network emulator – PE 1650 (receiver), GbE/T and GbE/SX links • Emulated path: bandwidth 800 Mbps, buffer 64 MB, delay 88 ms, loss 0; 2*BDP = 16 MB • #6: HighSpeed TCP => Reno
Laboratory #6: 800 Mbps bottleneck [Throughput plot for HighSpeed TCP]
Report on MTU • Increasing the MTU (packet size) results in better performance. The standard MTU is 1500 B; a 9 KB MTU is available throughout the Abilene, TransPAC, and APII backbones. • On Aug 25, 2004, the remaining 1500 B link at Tokyo XP was upgraded to 9 KB, so a 9 KB MTU is now available from Busan to Los Angeles.
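A sketch of enabling and verifying a 9 KB MTU on the end hosts (the interface name and peer hostname are assumptions):

```bash
# Raise the interface MTU to 9000 bytes on both end hosts.
ip link set dev eth0 mtu 9000

# Verify that the whole path carries 9000-byte IP packets without fragmentation:
# 8972 B of ICMP payload + 8 B ICMP header + 20 B IP header = 9000 B.
ping -M do -s 8972 -c 3 remote.example.org
```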
Current and Future Plans for e-VLBI • KOA (Korean Observatory of Astronomy) has one existing radio telescope, but in a different band from ours; they are building another three radio telescopes. • Using a dedicated light path from Europe to Asia through the US is being considered. • An e-VLBI demonstration at SuperComputing 2004 (November) is being planned, interconnecting radio telescopes in Europe, the US, and Japan. • A gigabit A/D converter is ready, and a 10G version is now being implemented. • Our performance measurement infrastructure will be merged into the Global (Network) Observatory framework maintained by NOC people (Internet2 piPEs, APAN CMM, and e-VLBI).
Questions? • See http://www2.nict.go.jp/ka/radioastro/index.html for VLBI