Optimizing Network Performance
Alan Whinery, U. Hawaii ITS, April 7, 2010
IP, TCP, ICMP • When you transfer a file with HTTP or FTP • A TCP connection is set up between sender and receiver • The sending computer hands the file to TCP, which slices it into pieces, called segments, to which it assigns numbers, called Sequence Numbers • TCP hands each segment to IP, which makes datagrams • IP hands each datagram to the Ethernet driver, which transmits frames • (continued >>> )
IP, TCP, ICMP • Ethernet carries each frame (through switches) to a router, which: • takes the IP datagram out of the Ethernet frame • decides where the datagram should go next • checks the route cache, OR queues the datagram for the CPU • if the datagram is not forwarded*, the router may send an ICMP message back to the sender to tell it why • hands the datagram to a different Ethernet driver • etc. • (...) * reasons routers neglect to forward: no route, expired TTL, failed IP checksum, access-list drop, input-queue flushes, selective discard
IP, TCP, ICMP • The last router delivers the datagrams to the receiving computer by sending them in frames across the final link • the receiving computer extracts the datagrams from the frames • extracts the segments from the datagrams • sends a TCP acknowledgement covering the segment's Sequence Number back to the sender • good segments are handed to the application (e.g. a web browser), which writes them to a file on disk
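To make the layering concrete, here is a minimal Python sketch of the sending side, with a hypothetical host name, port, and file name: the application only writes a byte stream, and the OS performs the segmentation, IP encapsulation, and framing described above.

import socket

# The application hands TCP one byte stream; the kernel slices it into
# MSS-sized segments, wraps them in IP datagrams, and the driver sends frames.
with socket.create_connection(("receiver.example.edu", 5001)) as conn:
    with open("bigfile.bin", "rb") as f:
        while True:
            chunk = f.read(65536)   # application-sized blocks, not segments
            if not chunk:
                break
            conn.sendall(chunk)     # TCP handles sequencing, ACKs, retransmission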
elements on each end computer • Disk – data rate, errors • DMA – data rate, errors • Ethernet (link) driver – link negotiation, speed/duplex, errors • Features: Interrupt Coalescing, Checksum Offload, Segmentation Offload • buffer sizes, frame size • FCS check • TCP (OS) – transport, error/congestion recovery • Features: Congestion Avoidance, buffer sizes, SACK, ECN, Timestamps • parameters – MSS, buffer/window sizes • IPv4 (OS) – MTU, TTL, Checksum • IPv6 (OS) – MTU, Hop Limit • Cable or transmission space
Brain teaser • A packet capture near a major UHNet ingress/egress point will observe IP datagrams with good checksums carrying TCP segments with bad checksums • On the order of a dozen or so per hour • How can this be? • It's either an unimaginable coincidence, OR • The source host has bit errors between the calculation of the TCP checksum and that of the IP checksum (the IPv4 checksum covers only the IP header, while the TCP checksum covers the TCP header and payload, so payload corruption leaves the IP checksum intact)
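For reference, IPv4 and TCP use the same 16-bit ones'-complement Internet checksum (RFC 1071), just computed over different coverage; a minimal Python sketch, with an arbitrary sample input:

def internet_checksum(data: bytes) -> int:
    # RFC 1071: sum 16-bit words with end-around carry, then complement.
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

print(hex(internet_checksum(b"example payload")))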
elements on each switch (L2/bridge) • link negotiation/physical • input queue • output queue • VLAN tagging/processing • FCS check • Spanning Tree (changes/port-change-blocking)
elements on each router • Everything the switch has, plus • route table/route cache • changing, possibly temporarily invalid • when the cache changes, “process routing” adds latency • ARP
TCP • Like pouring water from a bucket into a two-liter soda bottle. • (important to take the cap off first) :^) • If you pour too fast, some water gets lost • when loss occurs, you pour more slowly • TCP continues re-trying until all of the water is in the bottle
Round Trip Time • RTT, similar to the round trip time reported by “ping”, is how long it takes a packet to traverse the network from the sender to the receiver and then back to the sender.
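One way to estimate RTT from an application is to time a TCP connect, since the SYN / SYN-ACK handshake takes roughly one round trip; a sketch with a placeholder host and port:

import socket, time

# Time the three-way handshake; connect() returns after about one RTT.
t0 = time.perf_counter()
socket.create_connection(("www.example.edu", 80), timeout=5).close()
print(f"approx. RTT: {(time.perf_counter() - t0) * 1000:.1f} ms")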
Bandwidth * Delay Product • BDP is the useful “bottleneck” transmission rate (BW) of the network path times the one-way delay • 0.5 * RTT is the usual estimate of one-way delay • Equal to the amount of data that will be “in flight” in a “full pipe” from the sender to the receiver when the earliest possible ACK is received.
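A worked example, assuming a 1 Gbit/s bottleneck and a 100 ms RTT:

# Assumed example path: 1 Gbit/s bottleneck, 100 ms round-trip time.
bw_bps = 1_000_000_000            # bits per second
rtt = 0.100                       # seconds
one_way_delay = rtt / 2           # the 0.5 * RTT estimate from above

bdp_bytes = bw_bps * one_way_delay / 8
print(f"BDP = {bdp_bytes / 1e6:.2f} MB in flight")   # prints 6.25 MB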
How TCP works • S = sender, R = receiver • S & R set up a “connection” • S & R negotiate RWIN, MSS, etc. • S starts sending segments not larger than MSS • R starts acknowledging segments as they are received in good condition • Acknowledgments are cumulative (one ACK can cover several segments), so R does not acknowledge every single segment • S limits unacknowledged “data in flight” to R's advertised RWIN
How TCP works • TCP performance on a connection is limited by the following three numbers: • Sender's socket buffer (you can set this) • Must hold 2 * BDP of data to “fill pipe” • Congestion Window (calculated during transfer) • Sender's estimate of the available bandwidth • Scratchpad number kept by sender based on ACK/loss history • Receiver's Receive Window (you can set this) • must equal ~ BDP to “fill pipe” • These can be specified with nuttcp and iperf • OS defaults can be specified in each OS
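As an illustration, a hypothetical Python sketch that requests sender/receiver socket buffers sized from the BDP example above; the OS may clamp the values to its configured limits:

import socket

BDP = 6_250_000   # bytes, from the worked BDP example (assumed path)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask for ~2 * BDP of send buffer and ~BDP of receive window, per the
# guidance above; set these before connecting.
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 2 * BDP)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BDP)
print("granted send buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))

With iperf and nuttcp, the equivalent knob is the -w window/buffer option.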
How TCP works • original TCP • was unable to deal with out-of-order segments • was forced to throw away received segments that occurred after a lost segment • Modern TCP Has • SACK (selective acknowledgements) • Timestamps • Explicit Congestion Notification
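On Linux, whether these features are enabled can be read from sysctls; a sketch assuming the standard /proc paths:

# 1 generally means enabled; tcp_ecn also accepts 2 (use ECN only if asked).
for feature in ("tcp_sack", "tcp_timestamps", "tcp_ecn"):
    with open("/proc/sys/net/ipv4/" + feature) as f:
        print(feature, "=", f.read().strip())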
TCP Congestion Avoidance • Early TCP performed poorly in the face of lost packets, a problem which became more serious as transfer rates increased • Although bit-rates went up, RTT remained the same. • Many TCP variants have been customized for large bandwidth-delay products • HSTCP, FAST TCP, BIC TCP, CUBIC TCP, H-TCP, Compound TCP
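On Linux, the variant in use is also a sysctl; a sketch that prints the current and the available algorithms (CUBIC has been the Linux default since kernel 2.6.19):

base = "/proc/sys/net/ipv4/"
for name in ("tcp_congestion_control", "tcp_available_congestion_control"):
    with open(base + name) as f:
        print(name, "=", f.read().strip())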
Modern Ethernet drivers • Current Ethernet devices offer several optimizations • TCP/IP checksum offloading • NIC chipset does checksumming for TCP and IPv4 • TCP segmentation offloading • OS sends large blocks of data to the NIC, which chops them up • Implies TCP checksum offloading • Interrupt Coalescing • After receiving an Ethernet frame, the NIC waits for more before raising an interrupt to the interrupt controller
Modern Ethernet drivers • Optimizing the NIC's switch connection(s) • Teaming • Combining more than one NIC into one “link” • Flow-control (PAUSE frames) • Allowing the switch to pause the NIC's sending • I have not found an example of negative effects • Can band-aid problem NICs by smoothing rate and preventing queue drops (and therefore keeping TCP from seeing congestion) • VLANs • Very useful on some servers, as you can set up several interfaces on one NIC • Although it is offered in some Windows drivers, I have only made it work in Linux
Modern Ethernet drivers • Optimizing the driver's use of the bus/DMA/etc., or of the Ethernet switch • Scatter-gather • Multipart DMA transfers • Write-combining • Data-transfer “coalescing” • Message Signaled Interrupts • PCI 2.2 and PCI-E messages that expand the number of available interrupts and relieve the need for interrupt connector pins • Multiple receive queues (hardware steering)
Modern Ethernet drivers • Although there are gains to be had from tweaking offloading and other options • Always baseline a system with defaults before changing things • Sometimes, disabling all offloading and coalescing can stabilize performance (perhaps exposing a bug) • Segmentation offloading affects a machine's perspective when capturing its own frames on its own interface (the capture sees large pre-segmentation blocks rather than wire-size frames)
ethtool • Linux utility for interacting with Ethernet drivers • Support and output format varies between drivers • Shows useful statistics • View or set features (offloading, coalescing, etc) • Set Ethernet driver ring buffer sizes • Blink LEDs for NIC identification • Show link condition, speed, duplex, etc.
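The view/set operations above use ethtool's -k / -K flags; a small sketch that shells out to ethtool, assuming an interface named eth0 and sufficient privileges:

import subprocess

# "ethtool -k eth0" lists offload/coalescing feature state; changing one
# would use e.g. ["ethtool", "-K", "eth0", "tso", "off"] (requires root).
print(subprocess.check_output(["ethtool", "-k", "eth0"], text=True))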
ethtool • Linux utility for interacting with Ethernet drivers

root@bongo:~# ethtool eth0
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 1
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes
ethtool • Linux utility for interacting with Ethernet drivers

root@bongo:~# ethtool -i eth0
driver: forcedeth
version: 0.61
firmware-version:
Bus-info: 0000:00:14.0

root@uhmanoa:/home/whinery# ethtool eth2
Settings for eth2:
        Supported ports: [ ]
        Supported link modes:
        Supports auto-negotiation: No
        Advertised link modes:  Not reported
        Advertised auto-negotiation: No
        Speed: Unknown! (10000)
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Current message level: 0x00000004 (4)
        Link detected: yes
modinfo • Extract status and documentation from Linux modules (like Ethernet drivers)

root@bongo:~# modinfo forcedeth
filename:       /lib/modules/2.6.24-26-rt/kernel/drivers/net/forcedeth.ko
license:        GPL
description:    Reverse Engineered nForce ethernet driver
author:         Manfred Spraul <manfred@colorfullife.com>
srcversion:     9A02DCF1CF871DD11BB129E
alias:          pci:v000010DEd00000AB3sv*sd*bc*sc*i*
(...)
depends:
vermagic:       2.6.24-26-rt SMP preempt mod_unload
parm:           max_interrupt_work:forcedeth maximum events handled per interrupt (int)
parm:           optimization_mode:In throughput mode (0), every tx & rx packet will generate an interrupt. In CPU mode (1), interrupts are controlled by a timer. (int)
parm:           poll_interval:Interval determines how frequent timer interrupt is generated by [(time_in_micro_secs * 100) / (2^10)]. Min is 0 and Max is 65535. (int)
parm:           msi:MSI interrupts are enabled by setting to 1 and disabled by setting to 0. (int)
parm:           msix:MSIX interrupts are enabled by setting to 1 and disabled by setting to 0. (int)
parm:           dma_64bit:High DMA is enabled by setting to 1 and disabled by setting to 0. (int)
NDT • Network Diagnostic Tool written by Rich Carlson of US Dept. of Energy Argonne Lab/Internet2 • Server written in C, primary client is a Java Applet
NPAD (Network Path and Application Diagnosis) • By Matt Mathis and John Heffner, Pittsburgh Supercomputing Center • Analyzes network loss and throughput against a user-specified target rate and RTT • Attempts to guide the user to a solution of network problems
Iperf • Command-line throughput test server/client • Works on Linux, Windows, Mac OS X, etc. • Originally developed by NLANR/DAST • Performs unicast TCP and UDP tests • Performs multicast UDP tests • Allows setting TCP parameters • Original development ended in 2002 • The SourceForge fork project has produced mixed results
Nuttcp • Command-line throughput test server/client • Runs on Linux, Windows, Mac OS X, etc. • By Bill Fink, Rob Scott • Does everything iperf does • Also does third-party testing (between two remote hosts) • Bidirectional traceroutes • More extensive output
Nuttcp • nuttcp -T30 -i1 -vv 192.168.222.5 • 30 second TCP send from this host to target • nuttcp -T30 -i1 -vv 192.168.2.1 192.168.2.2 • 30 second TCP send from 2.1 to 2.2 • This host is neither 2.1 nor 2.2 • Each of the slaves must be running “nuttcp -S”
Nuttcp (or iperf) and periodic reports

C:\bin\nuttcp>nuttcp.exe -i1 -T10 128.171.6.156
   22.1875 MB /   1.00 sec =  186.0967 Mbps
    7.3125 MB /   1.00 sec =   61.3394 Mbps
   14.0000 MB /   1.00 sec =  117.4402 Mbps
   12.8125 MB /   1.00 sec =  107.4796 Mbps
    7.1250 MB /   1.00 sec =   59.7715 Mbps
    6.4375 MB /   1.00 sec =   53.9991 Mbps
   10.7500 MB /   1.00 sec =   90.1771 Mbps
    4.8750 MB /   1.00 sec =   40.8945 Mbps
    9.5625 MB /   1.00 sec =   80.2164 Mbps
    1.9375 MB /   1.00 sec =   16.2529 Mbps
   97.0625 MB /  10.11 sec =   80.5500 Mbps 3 %TX 6 %RX

• Seeing 10 1-second samples tells you more about a test than one 10-second average
Testing notes • Neither iperf nor nuttcp uses TCP auto-tuning • Buffer/window sizes therefore come from the tool's defaults or from explicit -w settings
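For comparison, the ranges the Linux kernel uses when it does auto-tune ordinary sockets live in sysctls; a sketch that prints them:

# Each file holds three values: min, default, and max buffer size in bytes.
for name in ("tcp_rmem", "tcp_wmem"):
    with open("/proc/sys/net/ipv4/" + name) as f:
        print(name, "=", f.read().split())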