Protocols: Recent and Current Work
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/~rich/ then “Talks”
ESLEA Technical Collaboration Meeting, 20-21 Jun 2006, R. Hughes-Jones, Manchester
Outline
• SC|05: TCP and UDP memory-2-memory & disk-2-disk flows; 10 Gbit Ethernet
• VLBI: Jodrell Mark5 problem – see Matt’s talk
• Data delay on a TCP link – how suitable is TCP? (4th-year MPhys project, Stephen Kershaw & James Keenan)
• Throughput on the 630 Mbit JB-JIVE UKLight link
• 10 Gbit in FABRIC
• ATLAS: network tests on Manchester T2 farm; the Manc-Lanc UKLight link; ATLAS remote farms
• RAID tests: HEP server, 8-lane PCIe RAID card
Collaboration at SC|05
• Caltech booth
• The BWC at the SLAC booth
• SCINet
• StorCloud
• ESLEA, Boston Ltd. & Peta-Cache Sun
SC|05 151 Gbit/s Bandwidth Challenge wins Hat Trick
• The maximum aggregate bandwidth was >151 Gbit/s – 130 DVD movies in a minute, or 10,000 MPEG2 HDTV movies served in real time
• 22 × 10 Gigabit Ethernet waves, Caltech & SLAC/FERMI booths
• In 2 hours transferred 95.37 TByte; in 24 hours moved ~475 TBytes
• Showed real-time particle event analysis
• SLAC Fermi UK booth:
• 1 × 10 Gbit Ethernet to UK over NLR & UKLight: transatlantic HEP disk-to-disk, VLBI streaming
• 2 × 10 Gbit links to SLAC: rootd low-latency file access application for clusters; Fibre Channel StorCloud
• 4 × 10 Gbit links to Fermi: dCache data transfers
(plots: traffic into and out of the booth)
ESLEA and UKLight
• 6 × 1 Gbit transatlantic Ethernet layer-2 paths, UKLight + NLR
• Disk-to-disk transfers with bbcp, Seattle to UK: set TCP buffer and application to give ~850 Mbit/s; one stream of data at 840-620 Mbit/s (reverse TCP)
• Stream UDP VLBI data, UK to Seattle: 620 Mbit/s
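“Set TCP buffer … to give ~850 Mbit/s” means sizing the socket buffers to at least the bandwidth-delay product of the transatlantic path. A minimal sketch, assuming a Linux-like stack (the kernel may silently cap the values at net.core.wmem_max / rmem_max; the numbers are illustrative, not the bbcp settings actually used):

```python
import socket

def make_tuned_socket(buf_bytes):
    """TCP socket with enlarged send/receive buffers for a long fat network."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The kernel may silently cap these at net.core.wmem_max / rmem_max.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_bytes)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_bytes)
    return s

# Buffer >= bandwidth-delay product, e.g. 850 Mbit/s at an assumed ~180 ms RTT:
target = int(850e6 * 0.180 / 8)          # ~19 MB
s = make_tuned_socket(target)
print("granted send buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```

Checking what `getsockopt` actually returns matters: a silently capped buffer is the classic cause of a single stream stalling well below the path capacity.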
SLAC 10 Gigabit Ethernet
• 2 lightpaths: routed over ESnet; layer 2 over Ultra Science Net
• 6 Sun V20Z systems per λ
• dCache remote disk data access, 100 processes per node; each node sends or receives; one data stream 20-30 Mbit/s
• Used Neterion NICs & Chelsio TOE
• Data also sent to StorCloud using fibre channel links
• Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node, 8.5-9 Gbit/s on the trunk
VLBI Work: TCP Delay and VLBI Transfers
Manchester 4th-year MPhys project by Stephen Kershaw & James Keenan
VLBI Network Topology
VLBI Application Protocol: TCP & Network
(timing diagram: sender emits Data1, Data2, … with Timestamp1, Timestamp2, …; an ACK returns after one RTT; packet loss delays later data; segment time on wire = bits in segment / BW)
• Remember the Bandwidth*Delay Product: BDP = RTT * BW
• VLBI data is Constant Bit Rate
• tcpdelay: instrumented TCP program emulates sending CBR data and records relative 1-way delay
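Both quantities on this slide are one-line computations: the bandwidth-delay product, and the constant-bit-rate departure schedule that tcpdelay emulates. A sketch (not the real tcpdelay tool; rates and sizes are illustrative):

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth*Delay Product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8

def cbr_send_times(rate_bps, msg_bytes, n_msgs):
    """Ideal departure times for constant-bit-rate messages -- the schedule
    tcpdelay emulates.  Any send time later than i * spacing means TCP
    (buffering, ACK clocking, loss recovery) delayed the CBR stream."""
    spacing_s = msg_bytes * 8 / rate_bps
    return [i * spacing_s for i in range(n_msgs)]

print(bdp_bytes(1e9, 0.026) / 1e6, "MB in flight on a 1 Gbit/s, 26 ms RTT path")
print(cbr_send_times(512e6, 1448, 3))   # 512 Mbit/s CBR, 1448-byte messages
```

Comparing measured send and arrival times against this ideal schedule is exactly how the 1-way-delay plots on the following slides are read.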
Check the Send Time
(plot: send time in s vs message number, 10,000 packets; 1 s scale)
• 10,000 messages
• Message size: 1448 bytes
• Wait time: 0
• TCP buffer 64 kB
• Route: Man-ukl-JIVE-prod-Man
• RTT ~26 ms
• Slope 0.44 ms/message
• From TCP buffer size & RTT expect ~42 messages/RTT, i.e. ~0.6 ms/message
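The expected slope can be checked by hand: once the send buffer is full, roughly buffer ÷ message-size messages drain per RTT. (The slide quotes ~42 messages; the raw division gives ~45, the difference plausibly being per-message bookkeeping in the buffer.)

```python
buf = 64 * 1024      # TCP send buffer (bytes)
msg = 1448           # message size (bytes)
rtt = 0.026          # round-trip time (s)

msgs_per_rtt = buf // msg              # messages that fit in the full buffer
ms_per_msg = rtt / msgs_per_rtt * 1e3  # average pacing once buffer-limited
print(msgs_per_rtt, "messages/RTT ->", round(ms_per_msg, 2), "ms/message")
```

The measured 0.44 ms/message being below this estimate is consistent with the link occasionally draining more than one buffer-full per RTT.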
Send Time Detail
(plot annotations: “26 messages”, “about 25 μs”, “one RTT”; message 76 and message 102 marked; 100 ms scale)
• TCP send buffer limited
• After slow start the buffer is full
• Packets sent out in bursts each RTT
• Program blocked on sendto()
1-Way Delay
(plot: 1-way delay vs message number, 10,000 packets; 100 ms scale)
• 10,000 messages
• Message size: 1448 bytes
• Wait time: 0
• TCP buffer 64 kB
• Route: Man-ukl-JIVE-prod-Man
• RTT ~26 ms
1-Way Delay Detail
(plot annotations: steps at 26 ms = 1 × RTT and at 1.5 × RTT, ≠ 0.5 × RTT; 10 ms scale)
• Why not just 1 RTT?
• After slow start the TCP buffer is full
• Messages at the front of the TCP send buffer have to wait for the next burst of ACKs – 1 RTT later
• Messages further back in the TCP send buffer wait for 2 RTT
1-Way Delay with Packet Drop
• Route: LAN gig8-gig1
• Ping 188 μs
• 10,000 messages
• Message size: 1448 bytes
• Wait time: 0 μs
• Drop 1 in 1000
• Manc-JIVE tests show times increasing with a “saw-tooth” around 10 s
(plot annotations: 5 ms, 28 ms, 800 μs; 1-way delay, 10 ms scale, vs message number)
10 Gbit in FABRIC
FABRIC 4 Gbit Demo
• 4 Gbit lightpath between GÉANT PoPs
• Collaboration with Dante
• Continuous (days) data flows – VLBI_UDP and multi-gigabit TCP tests
10 Gigabit Ethernet: UDP Data Transfer on PCI-X
• Sun V20z, 1.8 GHz to 2.6 GHz dual Opterons
• Connect via 6509
• XFrame II NIC
• PCI-X mmrbc 2048 bytes, 66 MHz
• One 8000 byte packet: 2.8 μs for CSRs, 24.2 μs data transfer – effective rate 2.6 Gbit/s
• 2000 byte packet, wait 0 μs: ~200 ms pauses
• 8000 byte packet, wait 0 μs: ~15 ms between data blocks
ATLAS
ESLEA: ATLAS on UKLight
• 1 Gbit lightpath Lancaster-Manchester
• Disk-to-disk transfers
• Storage Element with SRM using distributed disk pools, dCache & xrootd
udpmon: Lanc-Manc Throughput
• Lanc → Manc: plateau ~640 Mbit/s wire rate, no packet loss
• Manc → Lanc: ~800 Mbit/s but packet loss
• Send times: pause of 695 μs every 1.7 ms, so expect ~600 Mbit/s
• Receive times (Manc end): no corresponding gaps
udpmon: Manc-Lanc Throughput
• Manc → Lanc: plateau ~890 Mbit/s wire rate
• Packet loss: large frames 10% at line rate, small frames 60% at line rate
• 1-way delay
ATLAS Remote Computing: Application Protocol
(timing diagram: Event Filter Daemon EFD ↔ SFI and SFO – request event, send event data, process event, request buffer, send OK, send processed event; request-response time histogram)
• Event request: EFD requests an event from SFI; SFI replies with the event, ~2 Mbytes
• Processing of event
• Return of computation: EF asks SFO for buffer space; SFO sends OK; EF transfers the results of the computation
• tcpmon: instrumented TCP request-response program emulates the Event Filter EFD to SFI communication
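The EFD↔SFI exchange that tcpmon emulates is a plain request-response loop over one TCP connection: a tiny request out, a ~2 MByte event back, timed end to end. A self-contained sketch of that pattern, run over loopback (tcpmon itself is not reproduced here; names and sizes are illustrative):

```python
import socket
import threading
import time

REQ_BYTES = 64
RESP_BYTES = 2 * 1024 * 1024   # ~2 MB "event", as in the EFD<->SFI exchange

def sfi_server(listen_sock):
    """Reply to each 64-byte request with one event of RESP_BYTES."""
    conn, _ = listen_sock.accept()
    with conn:
        while True:
            req = conn.recv(REQ_BYTES)
            if not req:
                break
            conn.sendall(b"\0" * RESP_BYTES)

def request_event(conn):
    """One request-response cycle; returns (elapsed_s, bytes_received)."""
    t0 = time.perf_counter()
    conn.sendall(b"R" * REQ_BYTES)
    got = 0
    while got < RESP_BYTES:
        chunk = conn.recv(65536)
        if not chunk:
            break
        got += len(chunk)
    return time.perf_counter() - t0, got

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=sfi_server, args=(srv,), daemon=True).start()

cli = socket.create_connection(srv.getsockname())
elapsed, nbytes = request_event(cli)
print(f"{nbytes} bytes in {elapsed * 1000:.1f} ms")
```

On a 20 ms-RTT WAN the same loop exposes the slow-start and cwnd-reset behaviour shown on the next two slides, because each response restarts from a small congestion window.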
tcpmon: TCP Activity Manc-CERN Req-Resp
• Web100 hooks for TCP status
• Round trip time 20 ms
• 64 byte request (green), 1 Mbyte response (blue)
• TCP in slow start: 1st event takes 19 RTT or ~380 ms
• TCP congestion window gets re-set on each request
• TCP stack RFC 2581 & RFC 2861: reduction of cwnd after inactivity
• Even after 10 s, each response takes 13 RTT or ~260 ms
• Transfer achievable throughput 120 Mbit/s
• Event rate very low – application not happy!
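The many-RTT cost of each response is what slow start charges: the congestion window must grow from a couple of segments before a 1 MByte response fits in one window. A toy model of that growth (ideal doubling, plus a slower delayed-ACK-like growth factor; it underestimates the measured 19 RTT, which also pays for the RFC 2861 cwnd reduction and receive-window effects):

```python
def rtts_to_deliver(resp_bytes, mss=1448, init_cwnd_segs=2, growth=2.0):
    """Estimate round trips to deliver a response under slow start.

    growth=2 is textbook doubling; with delayed ACKs real growth is
    nearer 1.5x per RTT.  This is a sketch, not a TCP simulator.
    """
    segs_needed = -(-resp_bytes // mss)       # ceil division
    cwnd, sent, rtts = init_cwnd_segs, 0, 0
    while sent < segs_needed:
        sent += cwnd                          # one window per round trip
        cwnd = int(cwnd * growth) or 1
        rtts += 1
    return rtts

print(rtts_to_deliver(1_000_000))                 # ideal doubling
print(rtts_to_deliver(1_000_000, growth=1.5))     # delayed-ACK-like growth
```

The gap between the model and the measured count is the point of the slide: every extra mechanism that shrinks or resets cwnd between requests multiplies the response latency.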
tcpmon: TCP Activity Manc-CERN Req-Resp, no cwnd reduction
• Round trip time 20 ms
• 64 byte request (green), 1 Mbyte response (blue)
• TCP starts in slow start: 1st event takes 19 RTT or ~380 ms (3 round trips per response at first, then 2)
• TCP congestion window grows nicely
• Response takes 2 RTT after ~1.5 s
• Rate ~10/s (with 50 ms wait)
• Transfer achievable throughput grows to 800 Mbit/s
• Data transferred WHEN the application requires the data
Recent RAID Tests: Manchester HEP Server
“Server Quality” Motherboards
• Boston/Supermicro H8DCi
• Two dual-core Opterons, 1.8 GHz
• 550 MHz DDR memory
• HyperTransport
• Chipset: nVidia nForce Pro 2200/2050
• AMD 8132 PCI-X bridge
• I/O: 2 × 16-lane PCIe buses, 1 × 4-lane PCIe, 133 MHz PCI-X
• 2 × Gigabit Ethernet
• SATA
Disk_test: Areca PCI-Express 8-port RAID controller
• Maxtor 300 GB SATA disks
• RAID0, 5 disks: read 2.5 Gbit/s, write 1.8 Gbit/s
• RAID5, 5 data disks: read 1.7 Gbit/s, write 1.48 Gbit/s
• RAID6, 5 data disks: read 2.1 Gbit/s, write 1.0 Gbit/s
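Figures like these come from timed sequential reads and writes of large files. A simplified stand-in for such a disk test (Disk_test itself is not reproduced here; without O_DIRECT the page cache inflates small-file numbers, so real tests use files much larger than RAM):

```python
import os
import tempfile
import time

def seq_rate_gbps(path, nbytes, block=1 << 20, write=True):
    """Time a sequential write or read of nbytes; return Gbit/s."""
    buf = b"\0" * block
    t0 = time.perf_counter()
    if write:
        with open(path, "wb") as f:
            for _ in range(nbytes // block):
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())      # include the flush-to-disk cost
    else:
        with open(path, "rb") as f:
            while f.read(block):
                pass
    return nbytes * 8 / (time.perf_counter() - t0) / 1e9

with tempfile.NamedTemporaryFile(delete=False) as tf:
    path = tf.name
w = seq_rate_gbps(path, 8 << 20, write=True)
r = seq_rate_gbps(path, 8 << 20, write=False)
print(f"write {w:.2f} Gbit/s, read {r:.2f} Gbit/s")
os.unlink(path)
```

Varying the block size here plays the same role as varying the user read/write size in the RAID-parameter scans mentioned later in the deck.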
Any Questions?
More Information: Some URLs 1
• UKLight web site: http://www.uklight.ac.uk
• MB-NG project web site: http://www.mb-ng.net/
• DataTAG project web site: http://www.datatag.org/
• UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
• Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/ – “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS special issue 2004, http://www.hep.man.ac.uk/~rich/
• TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
• TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004
• PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
• Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
More Information: Some URLs 2
• Lectures, tutorials etc. on TCP/IP:
• www.nv.cc.va.us/home/joney/tcp_ip.htm
• www.cs.pdx.edu/~jrb/tcpip.lectures.html
• www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
• www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
• www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
• www.jbmelectronics.com/tcp.htm
• Encyclopaedia: http://www.freesoft.org/CIE/index.htm
• TCP/IP resources: www.private.org.il/tcpip_rl.html
• Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
• Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
• Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols
Backup Slides
SuperComputing
SC2004: Disk-Disk bbftp
• bbftp file transfer program uses TCP/IP
• UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
• MTU 1500 bytes; socket size 22 Mbytes; RTT 177 ms; SACK off
• Move a 2 GByte file; Web100 plots:
• Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
• Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
• Disk-TCP-Disk at 1 Gbit/s is here!
SC|05 HEP: Moving Data with bbcp
• What is the end-host doing with your network protocol? Look at the PCI-X buses:
• PCI-X bus with RAID controller: read from disk for 44 ms every 100 ms
• PCI-X bus with Ethernet NIC: write to network for 72 ms
• 3Ware 9000 controller RAID0
• 1 Gbit Ethernet link
• 2.4 GHz dual Xeon
• ~660 Mbit/s
• Power needed in the end hosts
• Careful application design
10 Gigabit Ethernet: UDP Throughput
• 1500 byte MTU gives ~2 Gbit/s
• Used 16144 byte MTU, max user length 16080
• DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz, PCI-X mmrbc 512 bytes – wire-rate throughput of 2.9 Gbit/s
• CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64-bit Itanium CPU, FSB 400 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.7 Gbit/s
• SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.4 Gbit/s
10 Gigabit Ethernet: Tuning PCI-X
(logic-analyser traces for mmrbc 512, 1024, 2048 and 4096 bytes: CSR access, PCI-X sequence, data transfer, interrupt & CSR update; 5.7 Gbit/s at mmrbc 4096)
• 16080 byte packets every 200 µs
• Intel PRO/10GbE LR adapter
• PCI-X bus occupancy vs mmrbc
• Measured times based on PCI-X times from the logic analyser
• Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s
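The mmrbc effect can be captured with a crude efficiency model: each PCI-X read burst pays a roughly fixed arbitration/attribute-phase cost, so a larger Maximum Memory Read Byte Count leaves more of the 64-bit, 133 MHz bus for data. The overhead figure below is an illustrative guess, not a number taken from the logic-analyser measurements:

```python
def pcix_throughput_gbps(mmrbc, bus_mhz=133.0, bus_bytes=8, overhead_bytes=40):
    """Crude PCI-X efficiency model: each mmrbc-byte read burst pays a
    fixed arbitration/attribute-phase cost.  overhead_bytes is an
    illustrative assumption, not a measured figure."""
    raw_bps = bus_mhz * 1e6 * bus_bytes * 8   # raw bit rate of a 64-bit bus
    efficiency = mmrbc / (mmrbc + overhead_bytes)
    return raw_bps * efficiency / 1e9

for m in (512, 1024, 2048, 4096):
    print(f"mmrbc {m:5d}: {pcix_throughput_gbps(m):.2f} Gbit/s")
```

The model reproduces the qualitative trend of the traces: quadrupling mmrbc recovers most of the per-burst overhead, which is why the jump from 512 to 4096 bytes matters far more than any further tuning.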
10 Gigabit Ethernet: TCP Data Transfer on PCI-X
• Sun V20z, 1.8 GHz to 2.6 GHz dual Opterons
• Connect via 6509
• XFrame II NIC
• PCI-X mmrbc 4096 bytes, 66 MHz
• Two 9000 byte packets back-to-back: average rate 2.87 Gbit/s
• Burst of packets, length 646.8 μs
• Gap between bursts 343 μs
• 2 interrupts / burst
TCP on the 630 Mbit Link: Jodrell – UKLight – JIVE
TCP Throughput on 630 Mbit UKLight
• Manchester gig7 – JBO mk5 606
• 4 Mbyte TCP buffer
• Test 0: dup ACKs seen, other reductions
• Test 1
• Test 2
Comparison of Send Time & 1-Way Delay
(plot annotations: bursts of 26 messages between message 76 and message 102; send time in s vs message number; 100 ms scale)
1-Way Delay, 1448 byte msg
• Route: Man-ukl-ams-prod-man
• RTT 27 ms
• 10,000 messages
• Message size: 1448 bytes
• Wait time: 0 μs
• BDP = 3.4 MByte
• TCP buffer 10 MByte
• Web100 plot: starts after 5.6 s due to clock sync
• ~400 pkts/10 ms – rate similar to iperf
(plot: 1-way delay, 50 ms scale, vs message number)
Related Work: RAID, ATLAS Grid
• RAID0 and RAID5 tests: 4th-year MPhys project last semester
• Throughput and CPU load
• Different RAID parameters: number of disks, stripe size, user read/write size
• Different file systems: ext2, ext3, XFS
• Sequential file write, read
• Sequential file write, read with continuous background read or write
• Status: need to check some results & document; independent RAID controller tests planned
HEP: Service Challenge 4
• Objective: demo 1 Gbit/s aggregate bandwidth between RAL and 4 Tier 2 sites
• RAL has SuperJANET4 and UKLight links; RAL firewall capped traffic at 800 Mbit/s
• SuperJANET sites: Glasgow, Manchester, Oxford, QMUL
• UKLight site: Lancaster
• Many concurrent transfers from RAL to each of the Tier 2 sites (~700 Mbit/s on UKLight; peak 680 Mbit/s on SJ4)
• Applications able to sustain high rates
• SuperJANET5, UKLight & new access links very timely
Network Switch Limits Behaviour
• End-to-end UDP packets from udpmon
• Only 700 Mbit/s throughput
• Lots of packet loss
• Packet loss distribution shows throughput limited