Explore TCP/IP usage on long-distance optical networks, addressing network efficiency and bandwidth optimization techniques employed in real applications. Delve into TCP functionalities, congestion control mechanisms, and tuning for maximizing throughput over high-speed connections. Learn from practical experiences presented in a mini-symposium at the University of Manchester by Richard Hughes-Jones in August 2005.
Using TCP/IP on High Bandwidth Long Distance Optical Networks: Real Applications on Real Networks
Richard Hughes-Jones, University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks” then look for “Rank”
Mini-Symposium on Optical Data Networking, August 2005, R. Hughes-Jones Manchester
Bandwidth Challenge at SC2004
• Working with S2io, Sun, Chelsio
• Setting up the BW Bunker
• SCINet
• The BW Challenge at the SLAC Booth
The Bandwidth Challenge: SC2004
• The peak aggregate bandwidth from the booths was 101.13 Gbit/s
• That is 3 full-length DVDs per second!
• 4 times greater than SC2003! (with its 4.4 Gbit transatlantic flows)
• Saturated ten 10 Gigabit Ethernet waves
• SLAC Booth: Sunnyvale to Pittsburgh, LA to Pittsburgh and Chicago to Pittsburgh (with UKLight)
TCP has been around for ages and it just works fine.
So what’s the problem?
The users complain about the network!
TCP provides reliability
[Diagram: sender/receiver time sequence. Segment n (Sequence 1024, Length 1024) is answered after one RTT by the ACK of segment n (Ack 2048); segment n+1 (Sequence 2048, Length 1024) is answered after one RTT by the ACK of segment n+1 (Ack 3072).]
• Positive acknowledgement (ACK) of each received segment
• Sender keeps a record of each segment sent
• Sender awaits an ACK: “I am ready to receive byte 2048 and beyond”
• Sender starts a timer when it sends a segment, so it can re-transmit
• Inefficient: the sender has to wait
Flow Control: Sender, Congestion Window
[Diagram: the TCP cwnd slides along the send buffer. Regions: data sent and ACKed; sent data buffered waiting for an ACK; unsent data that may be transmitted immediately; data to be sent, waiting for the window to open (the application writes here). A received ACK advances the trailing edge, the receiver’s advertised window advances the leading edge, and the sending host advances a marker as data is transmitted.]
• Uses the congestion window, cwnd, a sliding window to control the data flow
• Byte count giving the highest byte that can be sent without an ACK
• Transmit buffer size and advertised receive buffer size are important
• An ACK gives the next sequence number to receive AND the available space in the receive buffer
• Timer kept for each packet
How it works: TCP Slowstart
[Plot: cwnd vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; timeout; retransmit: slow start again.]
• Probe the network: get a rough estimate of the optimal congestion window size
• The larger the window size, the higher the throughput
• Throughput = Window size / Round-trip Time
• Exponentially increase the congestion window size until a packet is lost
• cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
• Send 1st packet, get 1 ACK, increase cwnd to 2
• Send 2 packets, get 2 ACKs, increase cwnd to 4
• Time to reach cwnd size W = RTT*log2(W)
• Rate doubles each RTT
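The two relations on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, cwnd is counted in whole segments, and one doubling per RTT is assumed (no delayed ACKs, no loss).

```python
import math


def slow_start_rtts(target_window_segments, initial_window=1):
    """Count the RTTs needed for cwnd to grow from initial_window to at
    least target_window_segments, assuming cwnd doubles every RTT."""
    cwnd = initial_window
    rtts = 0
    while cwnd < target_window_segments:
        cwnd *= 2
        rtts += 1
    return rtts


def window_limited_throughput_bps(window_bytes, rtt_s):
    """Throughput = window size / round-trip time, in bits per second."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    w = 1024  # target window in segments (hypothetical value)
    print("RTTs by simulation:", slow_start_rtts(w))
    print("RTTs by log2(W):   ", math.log2(w))
    # A 64 KiB window on a 100 ms path (hypothetical values)
    print("Throughput bit/s:  ", window_limited_throughput_bps(64 * 1024, 0.1))
```

For window sizes that are a power of two the simulated count equals log2(W); for other sizes it rounds up to the next whole RTT.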
How it works: TCP AIMD Congestion Avoidance
[Plot: cwnd vs time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; timeout; retransmit: slow start again.]
• Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
• cwnd is increased by a/cwnd for each ACK, i.e. by about one segment per RTT, giving a linear increase in rate: cwnd -> cwnd + a / cwnd (Additive Increase, a = 1)
• TCP takes packet loss as an indication of congestion!
• Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
• Standard TCP halves cwnd: cwnd -> cwnd – b (cwnd) (Multiplicative Decrease, b = ½)
• The transition from slow start to congestion avoidance is determined by ssthresh
• Packet loss is a killer
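The AIMD update rules can be written down directly. This is a minimal sketch, assuming cwnd is measured in segments and that one ACK arrives per segment in flight; the helper names are hypothetical and it is not a model of any particular kernel implementation.

```python
def reno_on_ack(cwnd, a=1.0):
    """Additive increase: cwnd -> cwnd + a / cwnd for each ACK."""
    return cwnd + a / cwnd


def reno_on_loss(cwnd, b=0.5):
    """Multiplicative decrease: cwnd -> cwnd - b * cwnd on packet loss."""
    return cwnd - b * cwnd


def one_loss_free_rtt(cwnd):
    """Apply the per-ACK rule once for each segment in flight
    (assumption: one ACK per segment, none lost)."""
    for _ in range(int(cwnd)):
        cwnd = reno_on_ack(cwnd)
    return cwnd


if __name__ == "__main__":
    cwnd = 100.0
    print("after one RTT:", one_loss_free_rtt(cwnd))  # close to 101
    print("after a loss: ", reno_on_loss(cwnd))       # half the window
```

The per-ACK increments of a/cwnd add up to roughly one segment per round trip, which is why the growth is linear in time.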
TCP (Reno): Details of problem
• The time for TCP to recover its throughput from 1 lost 1500 byte packet is given by: [formula not preserved]
• for rtt of ~200 ms: 2 min
[Plot labels: UK 6 ms, Europe 25 ms, USA 150 ms; 1.6 s, 26 s, 28 min]
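The slide's formula did not survive in this transcript. A commonly used estimate for standard (Reno) TCP, offered here as an assumption rather than as the author's exact expression, follows from the AIMD rules: a full window is W = C·RTT/MSS segments, a loss halves it, and it grows back by one segment per RTT, so recovery takes about W/2 round trips, i.e. C·RTT²/(2·MSS). The link capacity of 1 Gbit/s used below is also an assumption.

```python
def reno_recovery_time_s(capacity_bps, rtt_s, mss_bytes=1500):
    """Estimated time for Reno to regain full rate after one loss:
    tau = C * RTT^2 / (2 * MSS), with MSS converted to bits."""
    mss_bits = mss_bytes * 8
    return capacity_bps * rtt_s ** 2 / (2 * mss_bits)


if __name__ == "__main__":
    capacity = 1e9  # assumed 1 Gbit/s path
    for label, rtt in [("6 ms", 0.006), ("25 ms", 0.025),
                       ("150 ms", 0.150), ("200 ms", 0.200)]:
        print(f"rtt {label}: {reno_recovery_time_s(capacity, rtt):.1f} s")
```

Under these assumptions the estimate gives about 1.5 s at 6 ms, about 26 s at 25 ms and about 28 minutes at 200 ms, which is close to the 1.6 s, 26 s and 28 min labels on the slide.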
TCP: Simple Tuning - Filling the Pipe
[Diagram: sender/receiver time sequence showing the RTT, the returning ACK, and segment time on wire = bits in segment / BW.]
• Remember, TCP has to hold a copy of data in flight
• Optimal (TCP buffer) window size depends on:
  • Bandwidth end to end, i.e. min(BWlinks), AKA bottleneck bandwidth
  • Round Trip Time (RTT)
• The number of bytes in flight to fill the entire path:
  • Bandwidth*Delay Product, BDP = RTT*BW
• Can increase bandwidth by orders of magnitude
• Windows are also used for flow control
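As a quick worked example of BDP = RTT*BW, the sketch below converts a bottleneck bandwidth and round-trip time into the buffer size needed to keep the path full. The function name is illustrative, and the 1 Gbit/s figure is an assumed bottleneck rate.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth*Delay Product in bytes: the data in flight needed
    to fill a path of the given bottleneck bandwidth and RTT."""
    return bandwidth_bps * rtt_s / 8


if __name__ == "__main__":
    # Assumed 1 Gbit/s bottleneck; rtt values as quoted elsewhere in the talk
    for rtt_ms in (6, 20, 150, 177):
        size = bdp_bytes(1e9, rtt_ms / 1000)
        print(f"rtt {rtt_ms} ms -> {size / 1e6:.2f} Mbytes")
```

For the 177 ms London-Chicago-London path described later in the talk, this gives about 22 Mbytes, consistent with the 22 Mbyte socket size quoted there.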
Investigation of new TCP Stacks
• The AIMD Algorithm: Standard TCP (Reno)
  • For each ACK in an RTT without loss: cwnd -> cwnd + a / cwnd (Additive Increase, a = 1)
  • For each window experiencing loss: cwnd -> cwnd – b (cwnd) (Multiplicative Decrease, b = ½)
• High Speed TCP: a and b vary depending on the current cwnd, using a table
  • a increases more rapidly with larger cwnd, so the stack returns to the ‘optimal’ cwnd size for the network path sooner
  • b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
• Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd
  • a = 1/100: the increase is greater than TCP Reno
  • b = 1/8: the decrease on loss is less than TCP Reno
  • Scalable over any link speed
• Fast TCP: uses round trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
• HSTCP-LP, Hamilton-TCP, BiC-TCP
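To see why fixed adjustments make Scalable TCP "scalable", the sketch below counts the round trips each stack needs to regain its window after a single loss. It is a toy model under stated assumptions: cwnd in segments, one ACK per segment in flight, no further losses, and Scalable TCP modelled as cwnd + a per ACK with a = 1/100 and b = 1/8 as on the slide. All function names are hypothetical.

```python
def reno_ack(cwnd):
    return cwnd + 1.0 / cwnd       # a = 1, scaled by 1/cwnd


def reno_loss(cwnd):
    return cwnd - 0.5 * cwnd       # b = 1/2


def scalable_ack(cwnd):
    return cwnd + 0.01             # fixed a = 1/100 per ACK


def scalable_loss(cwnd):
    return cwnd - 0.125 * cwnd     # fixed b = 1/8


def rtts_to_recover(window, on_ack, on_loss):
    """RTTs needed to climb back to `window` after one loss."""
    cwnd = on_loss(window)
    rtts = 0
    while cwnd < window:
        for _ in range(int(cwnd)):  # one ACK per segment in flight
            cwnd = on_ack(cwnd)
        rtts += 1
    return rtts


if __name__ == "__main__":
    for w in (100, 1000):
        print(w, "segments: Reno",
              rtts_to_recover(w, reno_ack, reno_loss),
              "RTTs, Scalable",
              rtts_to_recover(w, scalable_ack, scalable_loss), "RTTs")
```

In this model Reno's recovery time grows in proportion to the window (about half the window size in RTTs), while Scalable TCP's stays in the range of 10 to 20 RTTs whatever the window, because cwnd grows by about 1% per round trip.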
Let’s check out this theory about new TCP stacks.
Does it matter? Does it work?
Packet Loss with new TCP Stacks (MB-NG Managed Bandwidth)
• TCP Response Function
• Throughput vs loss rate: further to the right means faster recovery
• Packets dropped in the kernel
[Plots: MB-NG, rtt 6 ms; DataTAG, rtt 120 ms]
High Throughput Demonstration
[Diagram: man03 at Manchester (Geneva) and lon01 at London (Chicago), each a dual Xeon 2.2 GHz PC on 1 GEth, connected via Cisco 7609 and Cisco GSR routers across the 2.5 Gbit SDH MB-NG core. Steps shown: send data with TCP, drop packets, monitor TCP with Web100.]
High Performance TCP: DataTAG
• Different TCP stacks tested on the DataTAG network
• rtt 128 ms
• Drop 1 in 10^6
• High-Speed: rapid recovery
• Scalable: very fast recovery
• Standard: recovery would take ~20 mins
Throughput for real users: Transfers in the UK for BaBar using MB-NG and SuperJANET4
Topology of the MB-NG Network
[Diagram: three administrative domains joined over the UKERNA Development Network. Manchester Domain: man01, man02, man03, Boundary Router Cisco 7609. UCL Domain: lon01, lon02, lon03, Boundary Router Cisco 7609. RAL Domain: ral01, ral02, Boundary Router Cisco 7609. Also shown: Edge Router Cisco 7609 and HW RAID on two hosts. Key: Gigabit Ethernet; 2.5 Gbit POS access; MPLS; admin. domains.]
Topology of the Production Network
[Diagram: man01 (HW RAID) in the Manchester Domain, behind 3 routers and 2 switches, connected to ral01 (HW RAID) in the RAL Domain, behind routers and switches. Key: Gigabit Ethernet; 2.5 Gbit POS access; 10 Gbit POS.]
iperf Throughput + Web100
• BaBar on Production network
  • Standard TCP
  • 425 Mbit/s
  • DupACKs 350-400: re-transmits
• SuperMicro on MB-NG network
  • HighSpeed TCP
  • Linespeed 940 Mbit/s
  • DupACK? <10 (expect ~400)
Applications: Throughput Mbit/s
• HighSpeed TCP, 2 GByte file, RAID5
• SuperMicro + SuperJANET
• Applications compared: bbcp, bbftp, Apache, Gridftp
• Previous work used RAID0 (not disk limited)
bbftp: What else is going on? (Scalable TCP)
• BaBar + SuperJANET
  • Instantaneous 200-600 Mbit/s
  • Disk-mem ~590 Mbit/s
• SuperMicro + SuperJANET
  • Instantaneous 220-625 Mbit/s
• Congestion window: duplicate ACK
• Throughput variation not TCP related?
  • Disk speed / bus transfer
  • Application
Average Transfer Rates Mbit/s
[Chart annotations: rate decreases; new stacks give more throughput]
Transatlantic Disk to Disk Transfers with UKLight, SuperComputing 2004
SC2004 UKLIGHT Overview
[Diagram labels: SLAC Booth (Cisco 6509); Caltech Booth (UltraLight IP, Caltech 7600); SC2004; Chicago Starlight; Amsterdam; ULCC UKLight; UCL network, UCL HEP; Manchester; MB-NG 7600 OSR; K2 and Ci nodes; NLR Lambda NLR-PITT-STAR-10GE-16; UKLight 10G, four 1GE channels; Surfnet/EuroLink 10G, two 1GE channels.]
Transatlantic Ethernet: TCP Throughput Tests
• Supermicro X5DPE-G2 PCs
• Dual 2.9 GHz Xeon CPU, FSB 533 MHz
• 1500 byte MTU
• 2.6.6 Linux kernel
• Memory-memory TCP throughput
• Standard TCP
• Wire rate throughput of 940 Mbit/s
[Plot: first 10 sec]
• Work in progress to study:
  • Implementation detail
  • Advanced stacks
  • Effect of packet loss
  • Sharing
SC2004 Disk-Disk bbftp
• bbftp file transfer program uses TCP/IP
• UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
• MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
• Move a 2 Gbyte file
• Web100 plots:
  • Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
  • Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
• Disk-TCP-Disk at 1 Gbit/s
Network & Disk Interactions (work in progress)
• Hosts:
  • Supermicro X5DPE-G2 motherboards
  • Dual 2.8 GHz Xeon CPUs with 512 k byte cache and 1 M byte memory
  • 3Ware 8506-8 controller on 133 MHz PCI-X bus, configured as RAID0
  • Six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64k byte stripe size
• Measure memory to RAID0 transfer rates with & without UDP traffic
  • Disk write: 1735 Mbit/s
  • Disk write + 1500 MTU UDP: 1218 Mbit/s, a drop of 30%
  • Disk write + 9000 MTU UDP: 1400 Mbit/s, a drop of 19%
[Plot axis: % CPU kernel mode]
Remote Computing Farms in the ATLAS TDAQ Experiment
Remote Computing Concepts
[Diagram labels: ATLAS Detectors, Level 1 Trigger (~PByte/sec); Experimental Area; ROBs; Data Collection Network; Level 2 Trigger (L2PUs); Event Builders (SFIs); Back End Network; Local Event Processing Farms (PFs); SFOs; Mass storage at CERN B513 (320 MByte/sec); Switch; GÉANT lightpaths; Remote Event Processing Farms (PFs) at Copenhagen, Edmonton, Krakow, Manchester.]
ATLAS Application Protocol
[Diagram: time sequence between the Event Filter Daemon (EFD) and the SFI and SFO: request event, send event data, process event, request buffer, send OK, send processed event. The request-response time is histogrammed.]
• Event Request
  • EFD requests an event from SFI
  • SFI replies with the event, ~2 Mbytes
• Processing of event
• Return of computation
  • EF asks SFO for buffer space
  • SFO sends OK
  • EF transfers results of the computation
• tcpmon: an instrumented TCP request-response program that emulates the Event Filter EFD to SFI communication
tcpmon: TCP Activity Manc-CERN Req-Resp
• Web100 instruments the TCP stack
• Round trip time 20 ms
• 64 byte request (green), 1 Mbyte response (blue)
• TCP in slow start
• 1st event takes 19 rtt or ~380 ms
• TCP congestion window gets re-set on each request
  • TCP stack implementation detail to reduce cwnd after inactivity
  • Even after 10 s, each response takes 13 rtt or ~260 ms
• Transfer achievable throughput 120 Mbit/s
tcpmon: TCP Activity Manc-CERN Req-Resp, TCP stack tuned
• Round trip time 20 ms
• 64 byte request (green), 1 Mbyte response (blue)
• TCP starts in slow start
• 1st event takes 19 rtt or ~380 ms
• TCP congestion window grows nicely
• Response takes 2 rtt after ~1.5 s
• Rate ~10/s (with 50 ms wait)
• Transfer achievable throughput grows to 800 Mbit/s
tcpmon: TCP Activity Alberta-CERN Req-Resp, TCP stack tuned
• Round trip time 150 ms
• 64 byte request (green), 1 Mbyte response (blue)
• TCP starts in slow start
• 1st event takes 11 rtt or ~1.67 s
• TCP congestion window in slow start to ~1.8 s, then congestion avoidance
• Response in 2 rtt after ~2.5 s
• Rate 2.2/s (with 50 ms wait)
• Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
Time Series of Request-Response Latency
• Manchester to CERN
  • Round trip time 20 ms
  • 1 Mbyte of data returned
  • Stable for ~18 s at ~42.5 ms
  • Then alternate points at 29 & 42.5 ms
• Alberta to CERN
  • Round trip time 150 ms
  • 1 Mbyte of data returned
  • Stable for ~150 s at 300 ms
  • Falls to 160 ms with ~80 μs variation
Radio Astronomy e-VLBI
e-VLBI at the GÉANT2 Launch, Jun 2005
[Map labels: Jodrell Bank, UK; Medicina, Italy; Torun, Poland; Dwingeloo; DWDM link]
e-VLBI UDP Data Streams
UDP Performance: 3 Flows on GÉANT
• Throughput: 5 hour run
  • Jodrell to JIVE: 2.0 GHz dual Xeon to 2.4 GHz dual Xeon, 670-840 Mbit/s
  • Medicina (Bologna) to JIVE: 800 MHz PIII to mark623 1.2 GHz PIII, 330 Mbit/s, limited by sending PC
  • Torun to JIVE: 2.4 GHz dual Xeon to mark575 1.2 GHz PIII, 245-325 Mbit/s, limited by security policing (>400 Mbit/s 20 Mbit/s)?
• Throughput: 50 min period
  • Period is ~17 min
UDP Performance: 3 Flows on GÉANT
• Packet Loss & Re-ordering
• Jodrell: 2.0 GHz Xeon
  • Loss 0-12%
  • Reordering significant
• Medicina: 800 MHz PIII
  • Loss ~6%
  • Reordering insignificant
• Torun: 2.4 GHz Xeon
  • Loss 6-12%
  • Reordering insignificant
18 Hour Flows on UKLight: Jodrell to JIVE, 26 June 2005
• Throughput:
  • Jodrell to JIVE: 2.4 GHz dual Xeon to 2.4 GHz dual Xeon, 960-980 Mbit/s
  • Traffic through SURFnet
• Packet loss
  • Only 3 groups with 10-150 lost packets each
  • No packets lost the rest of the time
• Packet re-ordering
  • None
Summary & Conclusions (MB-NG)
• The end hosts themselves
  • The performance of motherboards, NICs, RAID controllers and disks matters
  • Plenty of CPU power is required to sustain Gigabit transfers, for the TCP/IP stack as well as the application
  • Packets can be lost in the IP stack due to lack of processing power
• New TCP stacks
  • are stable and give better response & performance
  • Still need to set the TCP buffer sizes!
  • Check other kernel settings, e.g. window-scale
  • Take care over the difference between the protocol and the implementation
• Packet loss is a killer
  • Check on campus links & equipment, and access links to backbones
• Applications
  • Architecture & implementation are also important
• The work is applicable to other areas including:
  • Remote iSCSI
  • Remote database accesses
  • Real-time Grid computing, e.g. real-time interactive medical image processing
• Interaction between HW, protocol processing, and disk sub-system is complex
More Information: Some URLs
• Real-Time Remote Farm site: http://csr.phys.ualberta.ca/real-time
• UKLight web site: http://www.uklight.ac.uk
• DataTAG project web site: http://www.datatag.org/
• UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/ (Software & Tools)
• Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
  “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
• TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
• TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004, http://www.hep.man.ac.uk/~rich/ (Publications)
• PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
• Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Any Questions?
Backup Slides
Multi-Gigabit flows at SC2003 BW Challenge
• Three server systems with 10 Gigabit Ethernet NICs
• Used the DataTAG altAIMD stack, 9000 byte MTU
• Sent mem-mem iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
• Palo Alto PAIX
  • rtt 17 ms, window 30 MB
  • Shared with Caltech booth
  • 4.37 Gbit HighSpeed TCP, I=5%
  • Then 2.87 Gbit, I=16%
  • Fall when 10 Gbit on link
  • 3.3 Gbit Scalable TCP, I=8%
  • Tested 2 flows, sum 1.9 Gbit, I=39%
• Chicago Starlight
  • rtt 65 ms, window 60 MB
  • Phoenix CPU 2.2 GHz
  • 3.1 Gbit HighSpeed TCP, I=1.6%
• Amsterdam SARA
  • rtt 175 ms, window 200 MB
  • Phoenix CPU 2.2 GHz
  • 4.35 Gbit HighSpeed TCP, I=6.9%
  • Very stable
• Both used Abilene to Chicago
Latency Measurements
• UDP/IP packets sent between back-to-back systems
  • Processed in a similar manner to TCP/IP
  • Not subject to flow control & congestion avoidance algorithms
  • Used the UDPmon test program
• Latency
  • Round trip times measured using Request-Response UDP frames
  • Latency as a function of frame size
    • Slope is given by: mem-mem copy(s) + pci + Gig Ethernet + pci + mem-mem copy(s)
    • Intercept indicates: processing times + HW latencies
  • Histograms of ‘singleton’ measurements
• Tells us about:
  • Behavior of the IP stack
  • The way the HW operates
  • Interrupt coalescence
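Extracting the slope and intercept from latency-versus-frame-size measurements is an ordinary least-squares line fit. The sketch below shows one way to do it in plain Python. The data points are synthetic, invented purely for illustration, and `fit_line` is a hypothetical helper, not part of UDPmon.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic example: frame sizes in bytes, latencies in microseconds
    frame_sizes = [64, 256, 512, 1024, 1400]
    latencies_us = [0.02 * size + 30.0 for size in frame_sizes]
    slope, intercept = fit_line(frame_sizes, latencies_us)
    print(f"slope     = {slope:.4f} us/byte (per-byte transfer costs)")
    print(f"intercept = {intercept:.2f} us (fixed processing + HW latency)")
```

The fitted slope can then be compared with the sum of the per-byte costs listed on the slide (memory copies, PCI transfers and the Gigabit Ethernet wire time).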
Throughput Measurements
• UDP Throughput
• Send a controlled stream of UDP frames spaced at regular intervals
[Diagram: sender/receiver time sequence. Zero stats, OK done; send data frames at regular intervals (time to send, time to receive, inter-packet time histogrammed); signal end of test, OK done; get remote statistics, send statistics: no. received, no. lost + loss pattern, no. out-of-order, CPU load & no. int, 1-way delay. Inset: number of packets, n bytes per packet, wait time between packets.]
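The spacing between frames sets the offered rate, and the receive time sets the measured rate. The arithmetic is sketched below, counting only the bytes handed to the socket (an assumption: protocol headers and Ethernet framing overheads are ignored). The function names are illustrative and are not taken from UDPmon.

```python
def inter_packet_wait_s(frame_bytes, target_bps):
    """Spacing between frame starts needed to offer target_bps."""
    return frame_bytes * 8 / target_bps


def achieved_throughput_bps(n_frames, frame_bytes, elapsed_s):
    """Throughput seen by the receiver: bits received / time to receive."""
    return n_frames * frame_bytes * 8 / elapsed_s


if __name__ == "__main__":
    # Hypothetical test: 1472 byte frames offered at 1 Gbit/s
    wait = inter_packet_wait_s(1472, 1e9)
    print(f"inter-packet spacing: {wait * 1e6:.3f} us")
    # Hypothetical result: 10000 frames received in 0.125 s
    print(f"achieved: {achieved_throughput_bps(10000, 1472, 0.125) / 1e6:.1f} Mbit/s")
```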
PCI Bus & Gigabit Ethernet Activity
[Diagram: sending and receiving PCs, each with CPU, chipset, mem, NIC and PCI bus, marked as possible bottlenecks; a Gigabit Ethernet probe on the link between them feeds a logic analyser display.]
• PCI Activity
• Logic Analyzer with:
  • PCI probe cards in sending PC
  • Gigabit Ethernet fiber probe card
  • PCI probe cards in receiving PC
“Server Quality” Motherboards
• SuperMicro P4DP8-2G (P4DP6)
• Dual Xeon
• 400/533 MHz front side bus
• 6 PCI / PCI-X slots
• 4 independent PCI buses
  • 64 bit 66 MHz PCI
  • 100 MHz PCI-X
  • 133 MHz PCI-X
• Dual Gigabit Ethernet
• Adaptec AIC-7899W dual channel SCSI
• UDMA/100 bus master/EIDE channels
  • Data transfer rates of 100 MB/sec burst
“Server Quality” Motherboards
• Boston/Supermicro H8DAR
• Two dual core Opterons
• 200 MHz DDR memory
  • Theory BW: 6.4 Gbit
• HyperTransport
• 2 independent PCI buses
  • 133 MHz PCI-X
• 2 Gigabit Ethernet
• SATA
• (PCI-e)