Bandwidth Challenges or "How fast can we really drive a Network?". Richard Hughes-Jones The University of Manchester www.hep.man.ac.uk/~rich/ then “Talks”. Collaboration at SC|05. Caltech Booth. The BWC at the SLAC Booth. SCINet. Storcloud. ESLEA Boston Ltd. & Peta-Cache Sun.
Bandwidth Challenges or "How fast can we really drive a Network?" • Richard Hughes-Jones, The University of Manchester • www.hep.man.ac.uk/~rich/ then “Talks” • Networkshop 34 4-6 Apr 2006, R. Hughes-Jones Manchester
Collaboration at SC|05 • Caltech Booth • The BWC at the SLAC Booth • SCINet • StorCloud • ESLEA Boston Ltd. & Peta-Cache Sun
Bandwidth Challenge wins Hat Trick • (SC2004: 101 Gbit/s) • The maximum aggregate bandwidth was >151 Gbit/s • 130 DVD movies in a minute • Serve 10,000 MPEG2 HDTV movies in real time • 22 10 Gigabit Ethernet waves, Caltech & SLAC/FERMI booths • In 2 hours transferred 95.37 TByte • In 24 hours moved ~475 TBytes • Showed real-time particle event analysis • SLAC Fermi UK Booth: • 1 10 Gbit Ethernet to UK, NLR & UKLight: transatlantic HEP disk to disk; VLBI streaming • 2 10 Gbit links to SLAC: rootd low-latency file access application for clusters; Fibre Channel StorCloud • 4 10 Gbit links to Fermi: dCache data transfers • [Plot: traffic in to booth and out of booth]
ESLEA and UKLight • 6 × 1 Gbit transatlantic Ethernet layer 2 paths, UKLight + NLR • Disk-to-disk transfers with bbcp, Seattle to UK • Set TCP buffer and application to give ~850 Mbit/s • One stream of data 840-620 Mbit/s • Stream UDP VLBI data, UK to Seattle • 620 Mbit/s • [Plot label: Reverse TCP]
SLAC 10 Gigabit Ethernet • 2 Lightpaths: • Routed over ESnet • Layer 2 over Ultra Science Net • 6 Sun V20Z systems per λ • dCache remote disk data access • 100 processes per node • Node sends or receives • One data stream 20-30 Mbit/s • Used Neterion NICs & Chelsio TOE • Data also sent to StorCloud using Fibre Channel links • Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node, 8.5-9 Gbit/s on trunk
LightPath Topologies
Switched LightPaths [1] • Lightpaths are a fixed point-to-point path or circuit • Optical links (with FEC) have a BER of 10^-16, i.e. a packet loss rate of 10^-12, or 1 loss in about 160 days • In SJ5 LightPaths are known as Bandwidth Channels • Lab to Lab Lightpaths: many applications share; classic congestion points; TCP stream sharing and recovery; advanced TCP stacks • Host to host Lightpath: one application; no congestion; advanced TCP stacks for large Delay Bandwidth Products
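The step from bit error rate to packet loss rate to days between losses can be checked with a short calculation. This is a sketch under assumptions the slide does not state: 1500 byte packets and a link kept full at 1 Gbit/s. The function names are illustrative only.

```python
import math

def packet_loss_probability(ber, packet_bytes=1500):
    """Probability that at least one bit of a packet is corrupted."""
    bits = packet_bytes * 8
    # expm1/log1p keep precision when ber is far below machine epsilon
    return -math.expm1(bits * math.log1p(-ber))

def days_between_losses(ber, link_bps=1e9, packet_bytes=1500):
    """Mean time between lost packets on a link kept full (assumed load)."""
    packets_per_second = link_bps / (packet_bytes * 8)
    losses_per_second = packets_per_second * packet_loss_probability(ber, packet_bytes)
    return 1.0 / losses_per_second / 86400.0

p = packet_loss_probability(1e-16)
days = days_between_losses(1e-16)
print(f"packet loss rate ~ {p:.2e}, one loss every ~ {days:.0f} days")
```

With these assumed values the loss rate comes out near 10^-12 and the interval is of the order of a hundred days, the same order as the 160 days quoted; the exact figure depends on the packet size and line rate chosen.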
Switched LightPaths [2] • User Controlled Lightpaths • Grid scheduling of CPUs & network • Many application flows • No congestion on each path • Lightweight framing possible • Some applications suffer when using TCP and may prefer to use UDP, DCCP, XCP … • E.g. with e-VLBI the data wave-front gets distorted and correlation fails
Network Transport Layer Issues
Problem #1 Packet Loss: is it important?
TCP (Reno) – Packet loss and Time • TCP takes packet loss as an indication of congestion • Time for TCP to recover its throughput from 1 lost 1500 byte packet given by: [equation not reproduced] • for rtt of ~200 ms @ 1 Gbit/s: • [Plot labels: 2 min; UK 6 ms, Europe 25 ms, USA 150 ms; 1.6 s, 26 s, 28 min]
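The slide's own equation did not survive in this transcript. A commonly used approximation for standard (Reno) TCP, assumed here rather than taken from the slide, is that after one loss the window is halved and then regrows by one packet per round trip, giving a recovery time of C·RTT²/(2·MSS) for link capacity C. A minimal sketch:

```python
def reno_recovery_time_s(link_bps, rtt_s, mss_bytes=1500):
    """Approximate time for Reno to regain full rate after one loss.

    Assumed model: the window of C*RTT/MSS packets is halved, then grows
    by one packet per RTT, so recovery takes (window / 2) round trips.
    """
    window_packets = link_bps * rtt_s / (mss_bytes * 8)
    return (window_packets / 2) * rtt_s

for label, rtt in [("6.2 ms", 0.0062), ("25 ms", 0.025), ("200 ms", 0.200)]:
    print(f"rtt {label}: {reno_recovery_time_s(1e9, rtt):.1f} s")
```

At 1 Gbit/s this model gives about 1.6 s for 6.2 ms, 26 s for 25 ms and roughly 28 minutes for 200 ms, which line up with values quoted on the slide, although how the slide pairs its labels with round-trip times cannot be recovered from the text.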
Packet Loss and new TCP Stacks • TCP Response Function • Throughput vs Loss Rate – further to right: faster recovery • UKLight London-Chicago-London rtt 177 ms • Drop packets in kernel • 2.6.6 Kernel • Agreement with theory good • Some new stacks good at high loss rates
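For standard TCP, the "theory" curve of a response function plot is usually the Mathis et al. approximation, throughput ≈ (MSS/RTT)·C/√p with C ≈ 1.22. The slide does not say which model was plotted, so the sketch below is an assumption, as are the 1460 byte MSS and the function name.

```python
import math

def standard_tcp_throughput_bps(loss_rate, rtt_s, mss_bytes=1460, c=1.22):
    """Mathis-style response function for standard TCP (assumed model)."""
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))

# Loss rates swept at the 177 ms rtt quoted for the UKLight path
for p in (1e-4, 1e-5, 1e-6, 1e-7):
    mbps = standard_tcp_throughput_bps(p, 0.177) / 1e6
    print(f"loss 1 in {round(1 / p):>10,}: {mbps:8.1f} Mbit/s")
```

The inverse square-root dependence means a hundredfold drop in loss rate buys only a tenfold gain in throughput, which is why long-rtt paths need either extremely low loss or a stack with a different response function.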
High Throughput Demonstrations • Send data with TCP • Drop packets • Monitor TCP with Web100 • [Diagram: Chicago to Geneva, rtt 128 ms; hosts man03 and lon01, dual Xeon 2.2 GHz, each on 1 GEth; Cisco 7609 and Cisco GSR routers; 2.5 Gbit SDH MB-NG core]
TCP Throughput – DataTAG • Different TCP stacks tested on the DataTAG Network • rtt 128 ms • Drop 1 in 10^6 • High-Speed: rapid recovery • Scalable: very fast recovery • Standard: recovery would take ~20 mins
Problem #2 Is TCP fair? Look at Round Trip Times & Max Transfer Unit
MTU and Fairness • Two TCP streams share a 1 Gb/s bottleneck • RTT = 117 ms • MTU = 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s • MTU = 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s • Link utilization: 70.7 % • [Diagram: two hosts at CERN (GVA) and two at Starlight (Chi) on 1 GE links; GbE switch; routers joined by POS 2.5 Gbps; 1 GE bottleneck] • Sylvain Ravot, DataTAG 2003
RTT and Fairness • Two TCP streams share a 1 Gb/s bottleneck • CERN <-> Sunnyvale RTT = 181 ms: avg. throughput over a period of 7000 s = 202 Mb/s • CERN <-> Starlight RTT = 117 ms: avg. throughput over a period of 7000 s = 514 Mb/s • MTU = 9000 bytes • Link utilization = 71.6 % • [Diagram: hosts at CERN (GVA), Starlight (Chi) and Sunnyvale on 1 GE links; GbE switch; routers joined by POS 2.5 Gb/s and POS 10 Gb/s, 10GE; 1 GE bottleneck] • Sylvain Ravot, DataTAG 2003
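The link utilization figures on the MTU and RTT fairness slides follow directly from the two average throughputs and the 1 Gb/s bottleneck. A small sketch of that arithmetic, with illustrative function names, also prints how unevenly the two streams split the bandwidth:

```python
def link_utilisation(throughputs_mbps, capacity_mbps=1000.0):
    """Fraction of the bottleneck used by all streams together."""
    return sum(throughputs_mbps) / capacity_mbps

def share_ratio(throughputs_mbps):
    """Smallest stream throughput relative to the largest (1.0 = equal)."""
    return min(throughputs_mbps) / max(throughputs_mbps)

cases = {
    "MTU 3000 vs 9000 bytes": [243, 464],
    "RTT 181 ms vs 117 ms": [202, 514],
}
for name, streams in cases.items():
    print(f"{name}: utilisation {link_utilisation(streams):.1%}, "
          f"share ratio {share_ratio(streams):.2f}")
```

Both cases leave close to 30 % of the bottleneck unused, and in both the disadvantaged stream gets roughly half, or less, of what its competitor achieves.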
Problem #n Do TCP Flows Share the Bandwidth?
Test of TCP Sharing: Methodology (1 Gbit/s) • Chose 3 paths from SLAC (California): Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms) • Used iperf/TCP and UDT/UDP to generate traffic • Each run was 16 minutes, in 7 regions • [Diagram: SLAC to CERN through a bottleneck; TCP/UDP traffic from iperf or UDT; ICMP ping at 1/s; regions of 2 mins and 4 mins] • Les Cottrell & RHJ, PFLDnet 2005
TCP Reno single stream • Low performance on fast long distance paths • AIMD (add a=1 pkt to cwnd per RTT, decrease cwnd by factor b=0.5 on congestion) • Net effect: recovers slowly, does not effectively use available bandwidth, so poor throughput • Unequal sharing • [Plot, SLAC to CERN, annotations: Congestion has a dramatic effect; Recovery is slow; RTT increases when achieves best throughput; Remaining flows do not take up slack when flow removed; Increase recovery rate] • Les Cottrell & RHJ, PFLDnet 2005
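The AIMD rule quoted for Reno (add a=1 packet per round trip, multiply by b=0.5 on congestion) can be traced round trip by round trip. This is a minimal sketch, not a model of any real stack: slow start, timeouts and the bottleneck queue are ignored, and the function name is illustrative.

```python
def aimd_trace(cwnd_start, n_rtts, loss_at, a=1.0, b=0.5):
    """Congestion window (in packets) after each round trip.

    loss_at is the set of round-trip indices in which a loss is detected.
    """
    cwnd = float(cwnd_start)
    trace = []
    for rtt in range(n_rtts):
        if rtt in loss_at:
            cwnd = max(1.0, cwnd * b)  # multiplicative decrease
        else:
            cwnd += a                  # additive increase, one packet per RTT
        trace.append(cwnd)
    return trace

trace = aimd_trace(cwnd_start=100, n_rtts=60, loss_at={0})
recovered_after = next(i for i, w in enumerate(trace) if w >= 100)
print(f"window back to 100 packets after {recovered_after} round trips")
```

The recovery takes half the original window in round trips, so a window of about 15,000 packets (1 Gbit/s at 180 ms with 1500 byte packets, assumed here) needs around 7,500 round trips, which is the slow recovery the slide describes.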
Hamilton TCP • One of the best performers • Throughput is high • Big effects on RTT when achieves best throughput • Flows share equally • [Plot, SLAC-CERN, annotations: Two flows share equally; Appears to need >1 flow to achieve best throughput; >2 flows appears less stable]
Problem #n+1 To SACK or not to SACK?
The SACK Algorithm • SACK Rationale • Non-contiguous blocks of data can be ACKed • Sender transmits just lost packets • Helps when multiple packets lost in one TCP window • The SACK processing is inefficient for large bandwidth delay products • Sender write queue (linked list) walked for: each SACK block; to mark lost packets; to re-transmit • Processing takes so long that the input queue becomes full • Get timeouts • [Plots: Standard SACKs, rtt 150 ms; SACKs updated, rtt 150 ms; HS-TCP; Dell 1650 2.8 GHz, PCI-X 133 MHz, Intel Pro/1000] • Doug Leith, Yee-Ting Li
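Why walking the write queue once per SACK block hurts at large bandwidth delay products can be seen from a toy model. The code below is purely illustrative and is not the kernel's implementation: a Python list stands in for the linked list, and the segment layout and names are assumptions.

```python
def mark_sacked(queue, sack_blocks):
    """Mark segments covered by SACK blocks; return how many nodes were visited.

    queue: list of [seq_start, seq_end, sacked] entries, one per unacked segment.
    sack_blocks: list of (start, end) byte ranges reported by the receiver.
    """
    visited = 0
    for block_start, block_end in sack_blocks:
        for segment in queue:  # full walk of the queue for every block
            visited += 1
            if block_start <= segment[0] and segment[1] <= block_end:
                segment[2] = True
    return visited

MSS = 1448
# Roughly one window at 1 Gbit/s and 150 ms: about 13,000 segments in flight
queue = [[i * MSS, (i + 1) * MSS, False] for i in range(13_000)]
blocks = [(100 * MSS, 200 * MSS), (300 * MSS, 350 * MSS), (900 * MSS, 1000 * MSS)]
print(mark_sacked(queue, blocks), "nodes visited for one ACK carrying 3 SACK blocks")
```

The work per incoming ACK scales with the number of SACK blocks times the queue length, so on a long fat path each ACK can cost tens of thousands of list steps, consistent with the slide's point that processing takes long enough for the input queue to fill.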
What does the User & Application make of this? The view from the Application
SC2004: Disk-Disk bbftp • bbftp file transfer program uses TCP/IP • UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0 • MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off • Move a 2 Gbyte file • Web100 plots: • Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s) • Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead) • Disk-TCP-Disk at 1 Gbit/s is here!
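The 22 Mbyte socket size and the ~4.5 s bbcp overhead can both be reproduced with simple arithmetic. The sketch assumes the socket buffer was sized to the bandwidth delay product of a 1 Gbit/s path and that "2 Gbyte" means 2×10^9 bytes; neither is stated explicitly on the slide.

```python
def bdp_bytes(link_bps, rtt_s):
    """Bandwidth delay product: bytes in flight needed to fill the path."""
    return link_bps * rtt_s / 8

def transfer_time_s(file_bytes, rate_bps):
    """Time to move a file at a constant average rate."""
    return file_bytes * 8 / rate_bps

print(f"BDP at 1 Gbit/s, 177 ms: {bdp_bytes(1e9, 0.177) / 1e6:.1f} Mbytes")

file_bytes = 2e9  # assumed meaning of "2 Gbyte"
overhead = transfer_time_s(file_bytes, 701e6) - transfer_time_s(file_bytes, 875e6)
print(f"extra time at 701 vs 875 Mbit/s: {overhead:.1f} s")
```

A buffer smaller than the bandwidth delay product caps throughput at buffer/rtt regardless of the TCP stack in use, which is why the socket size appears alongside the MTU and rtt in the test conditions.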
SC|05 HEP: Moving data with bbcp • What is the end-host doing with your network protocol? • Look at the PCI-X • 3Ware 9000 controller RAID0 • 1 Gbit Ethernet link • 2.4 GHz dual Xeon • ~660 Mbit/s • Power needed in the end hosts • Careful application design • [Traces: PCI-X bus with RAID controller, read from disk for 44 ms every 100 ms; PCI-X bus with Ethernet NIC, write to network for 72 ms]
VLBI: TCP Stack & CPU Load • Real user problem! • End host TCP flow at 960 Mbit/s with rtt 1 ms falls to 770 Mbit/s when rtt 15 ms • 1.2 GHz PIII: • TCP iperf rtt 1 ms: 960 Mbit/s, 94.7% kernel mode, idle 1.5% • TCP iperf rtt 15 ms: 777 Mbit/s, 96.3% kernel mode, idle 0.05% • CPULoad with nice priority: throughput falls as priority increases • No loss, no timeouts • Not enough CPU power • 2.8 GHz Xeon, rtt 1 ms: • TCP iperf 916 Mbit/s, 43% kernel mode, idle 55% • CPULoad with nice priority: throughput constant as priority increases • No loss, no timeouts • Kernel mode includes TCP stack and Ethernet driver
ATLAS Remote Computing: Application Protocol • Event Request • EFD requests an event from SFI • SFI replies with the event, ~2 Mbytes • Processing of event • Return of computation • EF asks SFO for buffer space • SFO sends OK • EF transfers results of the computation • tcpmon, an instrumented TCP request-response program, emulates the Event Filter EFD to SFI communication • [Diagram: time sequence between SFI and SFO and the Event Filter Daemon (EFD): Request event; Send event data; Process event; Request Buffer; Send OK; Send processed event; Request-Response time (histogram)]
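The request-response pattern described above can be mimicked with a few lines of socket code. This is not tcpmon and has none of its Web100 instrumentation; it is a hypothetical loopback sketch in which a small request triggers a large response and the client times each exchange. The sizes, names and loopback address are assumptions.

```python
import socket
import threading
import time

REQUEST_SIZE = 64          # bytes, small request
RESPONSE_SIZE = 1_000_000  # bytes, large response (assumed 1 Mbyte)

def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks, remaining = [], n
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def serve(listener, n_events):
    """Answer each fixed-size request with a fixed-size response."""
    conn, _ = listener.accept()
    with conn:
        for _ in range(n_events):
            recv_exact(conn, REQUEST_SIZE)
            conn.sendall(b"\0" * RESPONSE_SIZE)

def run_exchange(n_events=5, wait_s=0.05):
    """Return the request-response time of each event, in seconds."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve, args=(listener, n_events), daemon=True)
    server.start()
    times = []
    with socket.create_connection(("127.0.0.1", port)) as client:
        for _ in range(n_events):
            start = time.perf_counter()
            client.sendall(b"R" * REQUEST_SIZE)
            recv_exact(client, RESPONSE_SIZE)
            times.append(time.perf_counter() - start)
            time.sleep(wait_s)       # idle gap, standing in for event processing
    server.join()
    listener.close()
    return times

if __name__ == "__main__":
    for i, t in enumerate(run_exchange()):
        print(f"event {i}: {t * 1e3:.2f} ms")
```

On loopback the times are tiny; the interesting behaviour appears only over a real path with tens of milliseconds of rtt, where the idle gap between events interacts with the congestion window.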
tcpmon: TCP Activity Manc-CERN Req-Resp • Web100 hooks for TCP status • Round trip time 20 ms • 64 byte Request (green), 1 Mbyte Response (blue) • TCP in slow start • 1st event takes 19 rtt or ~380 ms • TCP congestion window gets reset on each request • TCP stack RFC 2581 & RFC 2861 reduction of cwnd after inactivity • Even after 10 s, each response takes 13 rtt or ~260 ms • Transfer achievable throughput 120 Mbit/s • Event rate very low • Application not happy!
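How many round trips a response needs when it starts from a small congestion window can be estimated with an idealised slow-start count. The model is an assumption, not what tcpmon or Web100 computes: the window grows by a fixed factor each round trip, with no losses, no receiver window limit and no request or handshake round trips counted.

```python
import math

def slow_start_rtts(nbytes, mss=1448, init_cwnd=2, growth=2.0):
    """Round trips needed to send nbytes, window growing by `growth` per RTT."""
    segments = math.ceil(nbytes / mss)
    cwnd = float(init_cwnd)
    sent = 0
    rtts = 0
    while sent < segments:
        sent += int(cwnd)  # whole segments sent in this round trip
        cwnd *= growth
        rtts += 1
    return rtts

print("doubling each rtt:      ", slow_start_rtts(1_000_000))
print("growth 1.5 (delayed ACK):", slow_start_rtts(1_000_000, growth=1.5))
```

Even this optimistic count gives several round trips per 1 Mbyte response, so an application that goes idle between requests and has its window reduced each time pays that cost on every event; the measured figures on the slide are higher than the idealised doubling case.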
tcpmon: TCP Activity Manc-CERN Req-Resp, no cwnd reduction • Round trip time 20 ms • 64 byte Request (green), 1 Mbyte Response (blue) • TCP starts in slow start • 1st event takes 19 rtt or ~380 ms • TCP congestion window grows nicely • Response takes 2 rtt after ~1.5 s • Rate ~10/s (with 50 ms wait) • Transfer achievable throughput grows to 800 Mbit/s • Data transferred WHEN the application requires the data • [Plot annotations: 3 Round Trips; 2 Round Trips]
HEP: Service Challenge 4 • Objective: demo 1 Gbit/s aggregate bandwidth between RAL and 4 Tier 2 sites • RAL has SuperJANET4 and UKLight links • RAL capped firewall traffic at 800 Mbit/s • SuperJANET sites: Glasgow, Manchester, Oxford, QMWL • UKLight site: Lancaster • Many concurrent transfers from RAL to each of the Tier 2 sites • [Plot labels: ~700 Mbit UKLight; Peak 680 Mbit SJ4] • Applications able to sustain high rates • SuperJANET5, UKLight & new access links very timely
Summary & Conclusions • Well, you CAN fill the links at 1 and 10 Gbit/s, but it's not THAT simple • Packet loss is a killer for TCP • Check on campus links & equipment, and access links to backbones • Users need to collaborate with the Campus Network Teams • Dante PERT • New stacks are stable and give better response & performance • Still need to set the TCP buffer sizes! • Check other kernel settings, e.g. window-scale maximum • Watch for “TCP Stack implementation Enhancements” • TCP tries to be fair • Large MTU has an advantage • Short distances, small RTT, have an advantage • TCP does not share bandwidth well with other streams • The End Hosts themselves • Plenty of CPU power is required for the TCP/IP stack as well as the application • Packets can be lost in the IP stack due to lack of processing power / CPU scheduling • Interaction between HW, protocol processing, and disk sub-system is complex • Application architecture & implementation are also important • The TCP protocol dynamics strongly influence the behaviour of the application.
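Setting the TCP buffer sizes from an application is done with the standard `SO_SNDBUF` and `SO_RCVBUF` socket options. The sketch below is a hedged illustration using Python's `socket` module; the helper name is hypothetical, and the size actually granted depends on operating system limits (on Linux, the `net.core.wmem_max` and `net.core.rmem_max` sysctls).

```python
import socket

def open_tuned_socket(buffer_bytes):
    """Create a TCP socket and request send/receive buffers of buffer_bytes.

    Returns the socket plus the sizes the kernel reports back, which may
    differ from the request (capped by system limits, doubled on Linux).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request the buffers before connecting so the window scale negotiated
    # at connection setup can reflect them.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted_snd, granted_rcv

if __name__ == "__main__":
    sock, snd, rcv = open_tuned_socket(4 * 1024 * 1024)
    print(f"granted send buffer {snd} bytes, receive buffer {rcv} bytes")
    sock.close()
```

Comparing the granted values with the bandwidth delay product of the path (about 22 Mbytes for 1 Gbit/s at 177 ms) shows quickly whether the system limits also need raising.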
More Information: Some URLs • UKLight web site: http://www.uklight.ac.uk • ESLEA web site: http://www.eslea.uklight.ac.uk • MB-NG project web site: http://www.mb-ng.net/ • DataTAG project web site: http://www.datatag.org/ • UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net • Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/ ; “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/ • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html • TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004 • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/ • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
Any Questions?
Backup Slides
More Information: Some URLs 2 • Lectures, tutorials etc. on TCP/IP: • www.nv.cc.va.us/home/joney/tcp_ip.htm • www.cs.pdx.edu/~jrb/tcpip.lectures.html • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html • www.jbmelectronics.com/tcp.htm • Encyclopaedia: http://www.freesoft.org/CIE/index.htm • TCP/IP Resources: www.private.org.il/tcpip_rl.html • Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html • Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt • Assigned protocols, ports etc (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols
Packet Loss with new TCP Stacks • TCP Response Function • Throughput vs Loss Rate – further to right: faster recovery • Drop packets in kernel • [Plots: MB-NG, rtt 6 ms; DataTAG, rtt 120 ms]
High Performance TCP – MB-NG • Drop 1 in 25,000 • rtt 6.2 ms • Recover in 1.6 s • [Plots: Standard; HighSpeed; Scalable]
FAST • As well as packet loss, FAST uses RTT to detect congestion • RTT is very stable: σ(RTT) ~ 9ms vs 37±0.14ms for the others • [Plot, SLAC-CERN, annotations: 2nd flow never gets equal share of bandwidth; Big drops in throughput which take several seconds to recover from]
SACK … • Look into what’s happening at the algorithmic level with Web100 • Strange hiccups in cwnd: the only correlation is with SACK arrivals • [Plot: Scalable TCP on MB-NG with 200 Mbit/s CBR background] • Yee-Ting Li
10 Gigabit Ethernet: Tuning PCI-X • 16080 byte packets every 200 µs • Intel PRO/10GbE LR Adapter • PCI-X bus occupancy vs mmrbc • Measured times • Times based on PCI-X times from the logic analyser • Expected throughput ~7 Gbit/s • Measured 5.7 Gbit/s • [Logic analyser traces for mmrbc 512 bytes, 1024 bytes, 2048 bytes, and 4096 bytes, 5.7 Gbit/s; phases: CSR access, PCI-X sequence, data transfer, interrupt & CSR update]