MB - NG
High Performance Networking for ALL
Members of GridPP are in many Network collaborations including:
Close links with: SLAC, UKERNA, SURFNET and other NRNs, Dante, Internet2, Starlight, Netherlight, GGF, Ripe, Industry…
UKLIGHT
GridPP Meeting Edinburgh 4-5 Feb 04 R. Hughes-Jones Manchester
Network Monitoring [1]
• Architecture
• DataGrid WP7 code extended by Gareth (Manc)
• Technology transfer to UK e-Science
• Developed by Mark Lees (DL)
• Fed back into DataGrid by Gareth
• Links to: GGF NM-WG, Dante, Internet2
• Characteristics, Schema & web services
• Success
Network Monitoring [2]
• 24 Jan to 4 Feb 04, TCP iperf, RAL to HEP: only 2 sites >80 Mbit/s
• 24 Jan to 4 Feb 04, TCP iperf, DL to HEP
• HELP!
High bandwidth, Long distance… Where is my throughput?
Robin Tasker, CCLRC, Daresbury Laboratory, UK [r.tasker@dl.ac.uk]
DataTAG is a project sponsored by the European Commission - EU Grant IST-2001-32459
RIPE-47, Amsterdam, 29 January 2004
Throughput… What’s the problem?
One Terabyte of data transferred in less than an hour
On February 27-28 2003, the transatlantic DataTAG network was extended to run CERN - Chicago - Sunnyvale (>10000 km). For the first time, a terabyte of data was transferred across the Atlantic in less than one hour using a single TCP (Reno) stream. The transfer was accomplished from Sunnyvale to Geneva at a rate of 2.38 Gbit/s.
Internet2 Land Speed Record
On October 1 2003, DataTAG set a new Internet2 Land Speed Record by transferring 1.1 Terabytes of data in less than 30 minutes from Geneva to Chicago across the DataTAG provision, corresponding to an average rate of 5.44 Gbit/s using a single TCP (Reno) stream.
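The two record figures can be cross-checked with simple arithmetic. The sketch below assumes decimal terabytes (10^12 bytes) and a constant rate over the whole transfer; neither assumption is stated on the slides, and the function name is illustrative only.

```python
def transfer_time_s(n_bytes: float, rate_bit_s: float) -> float:
    """Seconds needed to move n_bytes at a sustained rate given in bit/s."""
    return n_bytes * 8 / rate_bit_s

TERABYTE = 1e12  # assumption: decimal terabyte, not 2**40 bytes

# February 2003: 1 Terabyte at 2.38 Gbit/s (Sunnyvale to Geneva)
feb_2003_s = transfer_time_s(1.0 * TERABYTE, 2.38e9)

# October 2003: 1.1 Terabytes at 5.44 Gbit/s (Geneva to Chicago)
oct_2003_s = transfer_time_s(1.1 * TERABYTE, 5.44e9)

print(f"Feb 2003: {feb_2003_s / 60:.1f} minutes")
print(f"Oct 2003: {oct_2003_s / 60:.1f} minutes")
```

Both results are consistent with the claims of "less than an hour" and "less than 30 minutes".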
So how did we do that?
Management of the End-to-End Connection:
• Memory-to-Memory transfer; no disk system involved
• Processor speed and system bus characteristics
• TCP Configuration – window size and frame size (MTU)
• Network Interface Card and associated driver and their configuration
• End-to-End “no loss” environment from CERN to Sunnyvale!
• At least a 2.5 Gbit/s capacity pipe on the end-to-end path
• A single TCP connection on the end-to-end path
• No real user application
That’s to say: not the usual User experience!
Realistically – what’s the problem & why do network research?
End System Issues
• Network Interface Card and Driver and their configuration
• TCP and its configuration
• Operating System and its configuration
• Disk System
• Processor speed
• Bus speed and capability
Network Infrastructure Issues
• Obsolete network equipment
• Configured bandwidth restrictions
• Topology
• Security restrictions (e.g., firewalls)
• Sub-optimal routing
• Transport Protocols
Network Capacity and the influence of Others!
• Many, many TCP connections
• Mice and Elephants on the path
• Congestion
End Hosts: Buses, NICs and Drivers
• Use UDP packets to characterise Intel PRO/10GbE Server Adaptor
• SuperMicro P4DP8-G2 motherboard
• Dual Xeon 2.2 GHz CPU
• 400 MHz System bus
• 133 MHz PCI-X bus
[Plots: Throughput, Latency, Bus Activity]
End Hosts: Understanding NIC Drivers
• Linux driver basics – TX
  • Application system call
  • Encapsulation in UDP/TCP and IP headers
  • Enqueue on device send queue
  • Driver places information in DMA descriptor ring
  • NIC reads data from main memory via DMA and sends on wire
  • NIC signals to processor that TX descriptor sent
• Linux driver basics – RX
  • NIC places data in main memory via DMA to a free RX descriptor
  • NIC signals RX descriptor has data
  • Driver passes frame to IP layer and cleans RX descriptor
  • IP layer passes data to application
• Linux NAPI driver model
  • On receiving a packet, NIC raises interrupt
  • Driver switches off RX interrupts and schedules RX DMA ring poll
  • Frames are pulled off DMA ring and are processed up to application
  • When all frames are processed RX interrupts are re-enabled
  • Dramatic reduction in RX interrupts under load
• Improving the performance of a Gigabit Ethernet driver under Linux: http://datatag.web.cern.ch/datatag/papers/drafts/linux_kernel_map/
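The claim that NAPI gives a dramatic reduction in RX interrupts under load can be illustrated with a toy model. This is a sketch, not kernel code: the function name, the fixed per-frame processing time and the arrival patterns are all hypothetical. An interrupt is counted only when a frame arrives while the ring is empty and RX interrupts are enabled; frames arriving while the poll loop is still busy are picked up without an interrupt.

```python
def count_rx_interrupts(arrivals_us, service_us):
    """Return (per_packet_irqs, napi_irqs) for sorted frame arrival times in µs.

    Toy model: the poll loop needs service_us per frame, and RX interrupts
    stay disabled until the ring has been drained.
    """
    per_packet_irqs = len(arrivals_us)  # classic model: one interrupt per frame
    napi_irqs = 0
    ring_drained_at = 0.0  # time when polling finishes and IRQs are re-enabled
    for t in arrivals_us:
        if t >= ring_drained_at:
            # Ring is empty and interrupts are enabled: the NIC raises one.
            napi_irqs += 1
            ring_drained_at = t + service_us
        else:
            # Poll loop still running: the frame is queued behind the others.
            ring_drained_at += service_us
    return per_packet_irqs, napi_irqs

# Hypothetical loads: one frame every 100 µs versus one frame every 1 µs,
# with 5 µs of processing per frame.
light_load = [i * 100.0 for i in range(1000)]
heavy_load = [i * 1.0 for i in range(1000)]

print("light:", count_rx_interrupts(light_load, 5.0))
print("heavy:", count_rx_interrupts(heavy_load, 5.0))
```

In this model the lightly loaded case still takes one interrupt per frame, while the saturated case takes a single interrupt for the whole burst, because the poll loop never catches up and interrupts are never re-enabled.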
Protocols: TCP (Reno) – Performance
• AIMD and High Bandwidth – Long Distance networks
Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm:
• For each ack in a RTT without loss: cwnd -> cwnd + a / cwnd (Additive Increase, a = 1)
• For each window experiencing loss: cwnd -> cwnd - b * cwnd (Multiplicative Decrease, b = ½)
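The two AIMD rules can be written down directly, which makes it easy to see why a single loss is so costly on a long, fat path. The following is a minimal sketch: the window of 1000 segments and the 100 ms RTT are hypothetical values chosen for illustration, and one RTT is approximated as one ack per whole segment in the current window.

```python
def reno_on_ack(cwnd: float, a: float = 1.0) -> float:
    """Additive increase: cwnd -> cwnd + a / cwnd for each ack."""
    return cwnd + a / cwnd

def reno_on_loss(cwnd: float, b: float = 0.5) -> float:
    """Multiplicative decrease: cwnd -> cwnd - b * cwnd for a lossy window."""
    return cwnd - b * cwnd

def rtts_to_recover(cwnd_before_loss: float) -> int:
    """Count RTTs until cwnd is back at its pre-loss value after one loss."""
    cwnd = reno_on_loss(cwnd_before_loss)
    rtts = 0
    while cwnd < cwnd_before_loss:
        # Approximate one RTT as one ack per whole segment in the window.
        for _ in range(int(cwnd)):
            cwnd = reno_on_ack(cwnd)
        rtts += 1
    return rtts

WINDOW_SEGMENTS = 1000.0  # hypothetical window size
RTT_S = 0.1               # hypothetical round-trip time

recovery_rtts = rtts_to_recover(WINDOW_SEGMENTS)
print(f"{recovery_rtts} RTTs, about {recovery_rtts * RTT_S:.0f} s at {RTT_S * 1e3:.0f} ms RTT")
```

Because the window grows by roughly one segment per RTT, recovery takes about half the pre-loss window in RTTs, so the recovery time scales with both the window size and the RTT.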
Protocols: HighSpeed TCP & Scalable TCP
• Adjusting the AIMD Algorithm – TCP Reno
  • For each ack in a RTT without loss: cwnd -> cwnd + a / cwnd (Additive Increase, a = 1)
  • For each window experiencing loss: cwnd -> cwnd - b * cwnd (Multiplicative Decrease, b = ½)
• High Speed TCP: a and b vary depending on current cwnd, where
  • a increases more rapidly with larger cwnd and, as a consequence, returns to the ‘optimal’ cwnd size sooner for the network path; and
  • b decreases less aggressively and, as a consequence, so does the cwnd. The effect is that there is not such a decrease in throughput.
• Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd such that the increase is greater than TCP Reno, and the decrease on loss is less than TCP Reno
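The difference in recovery behaviour can be sketched with a per-RTT model. The slides do not give the Scalable TCP constants; the values used here (an increase of 0.01 segment per ack, i.e. about 1% per RTT, and a decrease of 1/8 on loss) are taken from Tom Kelly's Scalable TCP proposal and should be read as an assumption. HighSpeed TCP is not modelled because its a and b come from a cwnd-dependent table.

```python
def recovery_rtts(cwnd_before_loss, decrease, grow_per_rtt):
    """RTTs needed to regain the pre-loss window after a single loss."""
    cwnd = cwnd_before_loss * (1 - decrease)
    rtts = 0
    while cwnd < cwnd_before_loss:
        cwnd = grow_per_rtt(cwnd)
        rtts += 1
    return rtts

def reno_growth(cwnd):
    # Reno: a = 1, so the window gains about one segment per RTT.
    return cwnd + 1.0

def scalable_growth(cwnd):
    # Assumed Scalable TCP constant: 0.01 segment per ack, about 1% per RTT.
    return cwnd * 1.01

RENO_B = 0.5        # from the slide
SCALABLE_B = 0.125  # assumption, from the Scalable TCP proposal

for window in (100, 1000, 10000):  # hypothetical window sizes in segments
    reno = recovery_rtts(window, RENO_B, reno_growth)
    scalable = recovery_rtts(window, SCALABLE_B, scalable_growth)
    print(f"cwnd={window:>6}: Reno {reno:>5} RTTs, Scalable {scalable:>3} RTTs")
```

Under these assumptions Reno's recovery time grows linearly with the window, while Scalable TCP recovers in the same number of RTTs whatever the window size.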
Protocols: HighSpeed TCP & Scalable TCP
[Plots: HighSpeed TCP, Scalable TCP]
• Success
• HighSpeed TCP implemented by Gareth (Manc)
• Scalable TCP implemented by Tom Kelly (Camb)
• Integration of stacks into DataTAG Kernel: Yee (UCL) + Gareth
Some Measurements of Throughput: CERN - SARA
• Using the GÉANT Backup Link
• 1 GByte file transfers
• Plot legend: blue = data, red = TCP ACKs
• Standard TCP: average throughput 167 Mbit/s
  • Users see 5 - 50 Mbit/s!
• High-Speed TCP: average throughput 345 Mbit/s
• Scalable TCP: average throughput 340 Mbit/s
Users, The Campus & the MAN [1]
Pete White, Pat Meyrs
• NNW – to – SJ4 Access: 2.5 Gbit PoS, hits 1 Gbit (50 %)
• Man – NNW Access: 2 * 1 Gbit Ethernet
Users, The Campus & the MAN [2]
• Message:
  • Continue to work with your network group
  • Understand the traffic levels
  • Understand the Network Topology
• LMN to site 1 Access: 1 Gbit Ethernet
• LMN to site 2 Access: 1 Gbit Ethernet
10 GigEthernet: Tuning PCI-X
10 GigEthernet at SC2003 BW Challenge (Phoenix)
• Three Server systems with 10 GigEthernet NICs
• Used the DataTAG altAIMD stack, 9000 byte MTU
• Streams from SLAC/FNAL booth in Phoenix to:
  • Palo Alto PAIX: 17 ms rtt
  • Chicago Starlight: 65 ms rtt
  • Amsterdam SARA: 175 ms rtt
Helping Real Users [1]: Radio Astronomy VLBI
PoC with NRNs & GÉANT
1024 Mbit/s 24 on 7 NOW
VLBI Project: Throughput, Jitter & 1-way Delay
• 1472 byte Packets Manchester -> Dwingeloo JIVE
• 1472 byte Packets man -> JIVE
• FWHM 22 µs (B2B 3 µs)
• 1-way Delay: note the packet loss (points with zero 1-way delay)
VLBI Project: Packet Loss Distribution
• Measure the time between lost packets in the time series of packets sent.
• Lost 1410 in 0.6 s
• Is it a Poisson process?
• Assume Poisson is stationary: λ(t) = λ
• Use Prob. Density Function: P(t) = λ e^(-λt)
• Mean λ = 2360 / s [426 µs]
• Plot log: slope -0.0028, expect -0.0024
• Could be additional process involved
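The expected slope follows from taking the logarithm of the density: ln P(t) = ln λ - λt, so a log plot against t is a straight line of slope -λ. The sketch below recomputes the rate from the quoted loss count; it assumes the time axis of the plot is in microseconds, which is what the quoted slopes suggest, and the variable names are illustrative.

```python
import math

lost_packets = 1410  # from the slide
window_s = 0.6       # from the slide

# Stationary Poisson process: constant loss rate lambda.
rate_per_s = lost_packets / window_s
mean_gap_us = 1e6 / rate_per_s

# ln P(t) = ln(lambda) - lambda * t, so the log plot has slope -lambda.
# Assumption: t is measured in microseconds on the plot.
expected_slope_per_us = -rate_per_s / 1e6

def log_pdf(t_us: float) -> float:
    """Natural log of the exponential density, with t in microseconds."""
    lam_per_us = rate_per_s / 1e6
    return math.log(lam_per_us) - lam_per_us * t_us

print(f"rate      = {rate_per_s:.0f} /s")
print(f"mean gap  = {mean_gap_us:.1f} µs")
print(f"log slope = {expected_slope_per_us:.5f} per µs")
```

The rate and mean gap come out close to the 2360 /s and 426 µs quoted on the slide, and the slope agrees with the expected -0.0024. The measured -0.0028 is steeper, meaning short gaps between losses are more common than a pure Poisson process predicts, which fits the slide's remark that an additional process could be involved.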
VLBI Traffic Flows – Only testing!
• Manchester – NetNorthWest – SuperJANET Access links
  • Two 1 Gbit/s
• Access links: SJ4 to GÉANT, GÉANT to SurfNet
Throughput & PCI transactions on the Mark5 PC:
• Mark5 uses Supermicro P3TDLE
• 1.2 GHz PIII
• Mem bus 133/100 MHz
• 2 * 64-bit 66 MHz PCI
• 4 * 32-bit 33 MHz PCI
[Diagram labels: Read / Write n bytes, Wait time, time; Ethernet NIC, IDE Disc Pack, SuperStor Input Card, Logic Analyser Display]
PCI Activity: Read Multiple data blocks, 0 wait
• Read 999424 bytes
• Each Data block:
  • Setup CSRs
  • Data movement
  • Update CSRs
• For 0 wait between reads:
  • Data blocks ~600 µs long, take ~6 ms
  • Then 744 µs gap
• PCI transfer rate 1188 Mbit/s (148.5 Mbytes/s)
• Read_sstor rate 778 Mbit/s (97 Mbyte/s)
• PCI bus occupancy: 68.44%
• Concern about Ethernet Traffic: 64 bit 33 MHz PCI needs ~82% for 930 Mbit/s. Expect ~360 Mbit/s
[Trace labels: Data transfer, Data Block 131,072 bytes, CSR Access, PCI Burst 4096 bytes]
PCI Activity: Read Throughput
• Flat then 1/t dependence
• ~860 Mbit/s for Read blocks >= 262144 bytes
• CPU load ~20%
• Concern about CPU load needed to drive Gigabit link
Helping Real Users [2]: HEP
BaBar & CMS Application Throughput
BaBar Case Study: Disk Performance
• BaBar Disk Server
  • Tyan Tiger S2466N motherboard
  • 1 64-bit 66 MHz PCI bus
  • Athlon MP2000+ CPU
  • AMD-760 MPX chipset
  • 3Ware 7500-8 RAID5
  • 8 * 200 GB Maxtor IDE 7200 rpm disks
• Note the VM parameter readahead max
• Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
• Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s) [not as fast as Raid0]
BaBar: Serial ATA Raid Controllers
• ICP 66 MHz PCI
• 3Ware 66 MHz PCI
BaBar Case Study: RAID Throughput & PCI Activity
• 3Ware 7500-8 RAID5 parallel EIDE
• 3Ware forces PCI bus to 33 MHz
• BaBar Tyan to MB-NG SuperMicro: network mem-mem 619 Mbit/s
• Disk to disk throughput with bbcp: 40-45 Mbytes/s (320 - 360 Mbit/s)
• PCI bus effectively full!
[Traces: Read from RAID5 Disks, Write to RAID5 Disks]
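To put these rates in context, the theoretical peak of a parallel PCI bus is its width multiplied by its clock, one transfer per clock. The sketch below uses the nominal 33 and 66 MHz figures quoted on the slides; the function names are illustrative. The figures are ceilings only: CSR accesses, arbitration and gaps between bursts reduce the usable rate well below them.

```python
def pci_peak_mbit_s(width_bits: int, clock_mhz: int) -> float:
    """Theoretical peak of a parallel PCI bus: one transfer per clock."""
    return float(width_bits * clock_mhz)

def mbyte_to_mbit(mbyte_s: float) -> float:
    """Convert Mbytes/s to Mbit/s."""
    return mbyte_s * 8

peak_64bit_33mhz = pci_peak_mbit_s(64, 33)  # bus as forced by the 3Ware card
peak_64bit_66mhz = pci_peak_mbit_s(64, 66)  # bus as fitted on the Tyan board

bbcp_low = mbyte_to_mbit(40)
bbcp_high = mbyte_to_mbit(45)

print(f"64-bit 33 MHz PCI peak: {peak_64bit_33mhz:.0f} Mbit/s")
print(f"64-bit 66 MHz PCI peak: {peak_64bit_66mhz:.0f} Mbit/s")
print(f"bbcp disk to disk: {bbcp_low:.0f} - {bbcp_high:.0f} Mbit/s")
```

Dropping the bus from 66 MHz to 33 MHz halves the ceiling, and both the network traffic and the RAID traffic have to share whatever usable capacity is left on the single bus.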
MB – NG SuperJANET4 Development Network: BaBar Case Study
[Network diagram: MAN, MCC, RAL; OSM-1OC48-POS-SS; Gigabit Ethernet; 2.5 Gbit POS Access; 2.5 Gbit POS core; MPLS Admin. Domains; SJ4 Dev; BaBar PC; PCs with 3ware RAID5]
Status / Tests:
• Manc host has DataTAG TCP stack
• RAL Host now available
• BaBar-BaBar mem-mem
• BaBar-BaBar real data MB-NG
• BaBar-BaBar real data SJ4
• Mbng-mbng real data MB-NG
• Mbng-mbng real data SJ4
• Different TCP stacks already installed
Study of Applications: MB – NG SuperJANET4 Development Network
[Network diagram: MAN, MCC, UCL; OSM-1OC48-POS-SS; Gigabit Ethernet; 2.5 Gbit POS Access; 2.5 Gbit POS core; MPLS Admin. Domains; SJ4 Dev; PCs with 3ware RAID0]
MB - NG: 24 Hours HighSpeed TCP mem-mem
• TCP mem-mem lon2-man1
• Tx 64, Tx-abs 64
• Rx 64, Rx-abs 128
• 941.5 Mbit/s +- 0.5 Mbit/s
MB - NG: Gridftp Throughput, HighSpeed TCP
• Int Coal 64 128
• Txqueuelen 2000
• TCP buffer 1 Mbyte (rtt*BW = 750 kbytes)
• Plots: interface throughput, acks received, data moved
• 520 Mbit/s
• Same for B2B tests
• So it's not that simple!
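The quoted rtt*BW product is the bandwidth-delay product, the amount of data that must be in flight to keep the path full. The sketch below assumes a 1 Gbit/s bottleneck (Gigabit Ethernet end hosts) and 1 Mbyte = 10^6 bytes; the RTT itself is not given on the slide and is only back-calculated here from those assumptions.

```python
def bdp_bytes(bandwidth_bit_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bit_s * rtt_s / 8

def rtt_for_bdp(bdp: float, bandwidth_bit_s: float) -> float:
    """RTT implied by a given bandwidth-delay product."""
    return bdp * 8 / bandwidth_bit_s

LINK_BIT_S = 1e9           # assumption: 1 Gbit/s bottleneck
QUOTED_BDP = 750e3         # from the slide: rtt*BW = 750 kbytes
TCP_BUFFER = 1_000_000     # from the slide: 1 Mbyte, taken as 10**6 bytes

implied_rtt_s = rtt_for_bdp(QUOTED_BDP, LINK_BIT_S)
headroom = TCP_BUFFER / bdp_bytes(LINK_BIT_S, implied_rtt_s)

print(f"implied RTT     = {implied_rtt_s * 1e3:.1f} ms")
print(f"buffer / BDP    = {headroom:.2f}")
```

Since the buffer is larger than the bandwidth-delay product, the TCP window should not be what holds GridFTP to 520 Mbit/s, which is consistent with the slide's conclusion that the limit lies elsewhere.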
MB - NG: Gridftp Throughput + Web100
• Throughput Mbit/s:
  • See alternate 600/800 Mbit/s and zero
• Cwnd smooth
• No dup Ack / send stall / timeouts
MB - NG: http data transfers, HighSpeed TCP
• Apache web server out of the box!
• Prototype client: curl http library
• 1 Mbyte TCP buffers
• 2 Gbyte file
• Throughput 72 MBytes/s
• Cwnd: some variation
• No dup Ack / send stall / timeouts
More Information: Some URLs
• MB-NG project web site: http://www.mb-ng.net/
• DataTAG project web site: http://www.datatag.org/
• UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
• Motherboard and NIC Tests: www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
• TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html