Transmission Rate Controlled TCP in Data Reservoir
- Software control approach -
Mary Inaba
University of Tokyo / Fujitsu Laboratories / Fujitsu Computer Technologies
Data intensive scientific computation through global networks
[Figure: data sources such as the X-ray astronomy satellite ASUKA, nuclear experiments, Nobeyama Radio Observatory (VLBI), the Belle experiments, the Digital Sky Survey, the SUBARU telescope, and Grape6 simulations feed Data Reservoir nodes; distributed shared files move over a very high-speed network, with local accesses and data analysis at the University of Tokyo]
Dream Computing System for real Scientists
• Fast CPUs, huge memory and disks, good graphics
  • Cluster technology, DSM technology, graphics processors
  • Grid technology
• Very fast remote file access
  • Global file systems, data-parallel file systems, replication facilities
• Transparency to local computation
  • No complex middleware; no or only small modifications to existing software
• Real scientists are not computer scientists
  • Computer scientists are not a workforce for real scientists
Objectives of Data Reservoir
• Sharing scientific data between distant research institutes
  • Physics, astronomy, earth science, simulation data
• Very high-speed single file transfer on a Long Fat pipe Network
  • > 10 Gbps, > 20,000 km (12,500 miles), > 400 ms RTT
• High utilization of available bandwidth
  • Transferred file data rate > 90% of available bandwidth
  • Including header overheads and initial negotiation overheads
• OS and file system transparency
• Storage-level data sharing (high-speed iSCSI protocol on stock TCP)
• Fast single file transfer
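As a rough illustration of why this regime stresses stock TCP, the bandwidth-delay product (BDP) of the target path can be worked out from the figures above. This is a sketch of my own with an assumed segment size, not material from the slides:

    /* Rough bandwidth-delay product for the target path:
     * 10 Gbps at ~400 ms RTT. Values are illustrative. */
    #include <stdio.h>

    int main(void) {
        double bandwidth_bps = 10e9;   /* 10 Gbps target link */
        double rtt_s = 0.4;            /* ~400 ms US-Japan RTT */
        double mss_bytes = 1460.0;     /* assumed TCP payload per 1500 B MTU */

        double bdp_bits  = bandwidth_bps * rtt_s;   /* bits that must be in flight */
        double bdp_bytes = bdp_bits / 8.0;
        double cwnd_pkts = bdp_bytes / mss_bytes;   /* window expressed in segments */

        printf("BDP: %.1f Gbit (%.0f MB), cwnd of roughly %.0f segments\n",
               bdp_bits / 1e9, bdp_bytes / 1e6, cwnd_pkts);
        return 0;
    }

Keeping a window of this size full, and refilling it after a single loss, is exactly where stock TCP struggles on such a path.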
Basic Architecture
[Figure: two Data Reservoir sites connected by a high-latency, very high-bandwidth network; disk-block level, parallel, multi-stream transfer between the cache disks; local file accesses at each site; distributed shared data in a DSM-like architecture]
Data Reservoir Features
• Data sharing in a low-level protocol
  • Use of the iSCSI protocol
  • Efficient data transfer (optimization of disk head movements)
• File system transparency
  • Single file image
• Multi-level striping for performance scalability
• Local file accesses through the LAN and global disk transfer through the WAN, unified by the iSCSI protocol
File accesses on Data Reservoir
[Figure: scientific detectors and user programs access four file servers (1st level striping); the file servers reach four disk servers through IP switches by iSCSI disk accesses (2nd level striping); servers are IBM x345 (2.6 GHz x 2)]
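The two striping levels in the figure admit a simple block-to-server mapping. The sketch below is a hypothetical round-robin layout for illustration only; the server counts, stripe order, and the mapping itself are assumptions, not the actual Data Reservoir layout:

    /* Hypothetical two-level round-robin striping:
     * 1st level across file servers, 2nd level across disk servers and disks. */
    #include <stdio.h>

    #define N_FILE_SERVERS 4
    #define N_DISK_SERVERS 4   /* per file server */
    #define N_DISKS        4   /* per disk server */

    struct location {
        int file_server;
        int disk_server;
        int disk;
        long block_on_disk;
    };

    static struct location map_block(long logical_block) {
        struct location loc;
        loc.file_server = (int)(logical_block % N_FILE_SERVERS);   /* 1st level */
        long fs_block   = logical_block / N_FILE_SERVERS;
        loc.disk_server = (int)(fs_block % N_DISK_SERVERS);        /* 2nd level */
        long ds_block   = fs_block / N_DISK_SERVERS;
        loc.disk        = (int)(ds_block % N_DISKS);
        loc.block_on_disk = ds_block / N_DISKS;
        return loc;
    }

    int main(void) {
        struct location loc = map_block(1234567);
        printf("file server %d, disk server %d, disk %d, block %ld\n",
               loc.file_server, loc.disk_server, loc.disk, loc.block_on_disk);
        return 0;
    }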
Global Data Transfer
[Figure: same configuration as above, with the disk servers at the two sites performing iSCSI bulk transfer through the IP switches and the global network]
BW behavior
[Figure: two bandwidth-versus-time plots (Mbps vs. seconds), Data Reservoir transfer compared with transfer through a file system]
Problems of the BWC2002 experiments
• Low TCP bandwidth due to packet losses
  • TCP congestion window size control
  • Very slow recovery from the fast recovery phase (> 20 min)
• Unbalance among parallel iSCSI streams
  • Packet scheduling by switches and routers
  • The user and other network users care only about the total behavior of the parallel TCP streams
Fast Ethernet vs. GbE
• iperf runs of 30 seconds
• Min/Avg bandwidth: Fast Ethernet > GbE
[Figure: throughput traces for FE and GbE]
Packet Transmission Rate
• Bursty behavior
  • Transmission within 20 ms of the 200 ms RTT
  • Idle for the remaining 180 ms
[Figure: packet trace; packet loss occurred]
Packet Spacing
• Ideal story: transmit one packet every RTT/cwnd (see the sketch below)
  • 24 μs interval for 500 Mbps (MTU 1500 B)
• High load for a software-only implementation
  • Overhead stays low because pacing is used only during the slow start phase
[Figure: packets spaced at RTT/cwnd intervals across one RTT]
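A minimal sketch of the spacing arithmetic, using the 500 Mbps / 200 ms RTT / 1500 B MTU figures above (illustrative values only, not the Data Reservoir pacing code):

    /* Inter-packet gap for software pacing: one MTU-sized packet every
     * RTT/cwnd, or equivalently every MTU*8/target_rate seconds. */
    #include <stdio.h>

    int main(void) {
        double rtt_s      = 0.2;        /* 200 ms RTT */
        double mtu_bits   = 1500 * 8.0; /* 1500 B MTU */
        double target_bps = 500e6;      /* 500 Mbps */

        /* Gap derived from the target rate. */
        double gap_from_rate_us = mtu_bits / target_bps * 1e6;

        /* The same gap expressed as RTT/cwnd, where cwnd covers one BDP. */
        double cwnd_pkts        = target_bps * rtt_s / mtu_bits;
        double gap_from_cwnd_us = rtt_s / cwnd_pkts * 1e6;

        printf("gap = %.1f us (from rate) = %.1f us (RTT/cwnd, cwnd = %.0f packets)\n",
               gap_from_rate_us, gap_from_cwnd_us, cwnd_pkts);
        return 0;
    }

Both expressions give the 24 μs interval quoted on the slide, which is why a software-only pacer is demanding but feasible when it is active only during slow start.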
Example Case of 8 IPG
• Successful fast retransmit
• Smooth transition to congestion avoidance
• Congestion avoidance takes 28 minutes to recover to 550 Mbps (see the estimate below)
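The 28-minute figure is consistent with Reno-style congestion avoidance, which grows cwnd by roughly one segment per RTT. The estimate below assumes recovery starts from a small window (my assumption); it is a back-of-the-envelope check, not the measured trace:

    /* Why congestion avoidance needs tens of minutes on a 200 ms RTT path. */
    #include <stdio.h>

    int main(void) {
        double rtt_s      = 0.2;        /* 200 ms RTT */
        double mtu_bits   = 1500 * 8.0; /* segment size for a 1500 B MTU */
        double target_bps = 550e6;      /* bandwidth to climb back to */

        /* Window needed to sustain the target rate at this RTT. */
        double target_cwnd = target_bps * rtt_s / mtu_bits;

        /* Congestion avoidance adds ~1 segment per RTT, so recovery from a
         * small window takes about target_cwnd RTTs. */
        double recovery_s = target_cwnd * rtt_s;

        printf("target cwnd ~ %.0f segments, recovery ~ %.0f s (~%.0f min)\n",
               target_cwnd, recovery_s, recovery_s / 60.0);
        return 0;
    }

This yields roughly 1,800 seconds, the same order as the 28 minutes observed.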
Best Case of 1023 B IPG
• Behaves like the Fast Ethernet case
• Proper transmission rate
• Spurious retransmits due to reordering
Performance Divergence on LFN
• Parallel streams
• Differences between streams grow rather than even out
• The slowest stream determines total performance
Unbalance within parallel TCP streams
• Unbalance among parallel iSCSI streams
  • Packet scheduling by switches and routers
  • Meaningless unfairness among parallel streams
  • The user and other network users care only about the total behavior of the parallel TCP streams
• Our approach (sketched below)
  • Keep Σi cwnd_i constant for fair TCP usage toward other network users
  • Balance the individual cwnd_i by communicating between the parallel TCP streams
[Figure: bandwidth vs. time, before and after cwnd balancing]
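A minimal userspace sketch of the balancing idea: keep the sum of the congestion windows fixed while nudging each stream toward the common mean. This is illustrative only (assumed stream count and adjustment rule), not the actual kernel modification:

    /* Rebalance cwnd across parallel streams without changing their sum,
     * so the aggregate stays as fair to other traffic as before. */
    #include <stdio.h>

    #define NSTREAMS 4

    static void rebalance(double cwnd[], int n) {
        double total = 0.0;
        for (int i = 0; i < n; i++)
            total += cwnd[i];

        double avg = total / n;
        for (int i = 0; i < n; i++)
            cwnd[i] += 0.5 * (avg - cwnd[i]);   /* move halfway toward the mean */
        /* The adjustments sum to zero, so total cwnd is preserved. */
    }

    int main(void) {
        double cwnd[NSTREAMS] = { 9000.0, 2000.0, 5000.0, 400.0 };  /* segments */
        rebalance(cwnd, NSTREAMS);
        for (int i = 0; i < NSTREAMS; i++)
            printf("stream %d: cwnd = %.0f segments\n", i, cwnd[i]);
        return 0;
    }

Because the total is held constant, other network users see the same aggregate pressure as before, while the slowest stream no longer dictates the transfer time.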
BWC2003 US-Japan experiments
• 24,000 km (15,000 miles) distance (~400 ms RTT)
• Route: Phoenix → Tokyo → Portland → Tokyo
• Transfer of a ~1 TB file
• 32 servers, 128 iSCSI disks
[Figure: network map; Data Reservoir sites connected through Phoenix, L.A., Seattle, Portland, Chicago, N.Y., and Tokyo over OC-48, OC-192, GbE, and 10G Ethernet links on Abilene, IEEAF/WIDE, NTT Com, APAN, and SUPER-SINET]
[Figure: route detail; 24,000 km (15,000 miles) total, including 15,680 km (9,800 miles) and 8,320 km (5,200 miles) segments over OC-48 x 3, GbE x 4, and OC-192 links through a Juniper T320]
SC2002 BWC2002 result
• 560 Mbps (200 ms RTT), 95% utilization of available bandwidth
• U. of Tokyo ⇔ SCinet (Maryland, USA)
• ⇒ Data Reservoir can saturate a 10 Gbps network once one becomes available for the US-Japan connection
Results
• BWC2002
  • Tokyo → Baltimore, 10,800 km (6,700 miles)
  • Peak bandwidth (on the network): 600 Mbps
  • Average file transfer bandwidth: 560 Mbps
  • Bandwidth-distance product: 6,048 terabit-meters/second
• BWC2003 results (pre-test)
  • Phoenix → Tokyo → Portland → Tokyo, 24,000 km (15,000 miles)
  • Peak bandwidth (on the network): > 8 Gbps
  • Average file transfer bandwidth: > 7 Gbps
  • Bandwidth-distance product: > 168 petabit-meters/second
• More than 25 times the BWC2002 performance (bandwidth-distance product), as the arithmetic below confirms
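The quoted products correspond to bit-meters per second (the arithmetic only works out with distance expressed in meters, not kilometers); the sketch below simply redoes that arithmetic for both experiments:

    /* Re-deriving the bandwidth-distance products as bit-meters per second. */
    #include <stdio.h>

    int main(void) {
        double p2002 = 560e6 * 10800e3;   /* 560 Mbps over 10,800 km */
        double p2003 = 7e9   * 24000e3;   /* ~7 Gbps over 24,000 km  */

        printf("BWC2002: %.0f terabit-meters/s\n", p2002 / 1e12);  /* ~6,048 */
        printf("BWC2003: %.0f petabit-meters/s\n", p2003 / 1e15);  /* ~168   */
        printf("improvement: %.1fx\n", p2003 / p2002);             /* ~27.8  */
        return 0;
    }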
Bad News
• Network cut on 11/8
  • The US-Japan north route connection has been completely out of service
  • 2~3 weeks are needed to repair the undersea fibers
• Planned BW = 11.2 Gbps (OC-48 x 3 + GbE x 4); actual maximum BW ≈ 8.2 Gbps (OC-48 x 3 + GbE x 1)
How your science benefits from high-performance, high-bandwidth networking
• Easy and transparent access to remote scientific data
  • Without special programming (normal NFS-style accesses)
  • Purely software approach with IA servers
• Utilization of the high-BW network for the scientist's data
  • 17 minutes for a 1 TB file transfer from the opposite side of the earth
  • High utilization factor (> 90%)
• Good for both scientists and network agencies
  • Scientists can concentrate on their own research topics
• Good for both scientists and computer scientists
Summary
• The most distant data transfer at BWC2003 (24,000 km)
• Software techniques for improving efficiency and stability
  • Transfer rate control on TCP
  • cwnd balancing across parallel TCP streams
  • Based on the stock TCP algorithm
• Possibly the highest bandwidth-distance product for file transfer between two points
  • Still high utilization of available bandwidth
The BWC2003 experiment is supported by NTT / VERIO