Experiences Tuning Cluster Hosts: 1GigE and 10GbE
Paul Hyder
Cooperative Institute for Research in Environmental Sciences, CU Boulder
(CIRES at NOAA/ESRL/GSD High Performance Computing)
Paul.Hyder at noaa.gov

Tuning Focus
• Cluster Front Ends and Cron Server Hosts
• File transfer servers (scponly)
• BWCTL host
• Remote client hosts
• 10GbE Testbed (7.2 Gb/sec uses ~49% of one 3 GHz CPU)

How We Apply the Well Known Rules
• Jumbo Frames
    • 8K on hosts
    • 9K on network
• Tune TCP to match BDP (bandwidth-delay product)
• Encourage application writers to use large read and write buffers
• Install tuned applications
    • PSC.edu patch to ssh
      OpenSSH channels.h:
        #define CHAN_TCP_PACKET_DEFAULT (32*1024)
        #define CHAN_TCP_WINDOW_DEFAULT (4*CHAN_TCP_PACKET_DEFAULT)
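
The "tune TCP to match BDP" and "large read and write buffers" rules can be made concrete with a little arithmetic. The following is a minimal Python sketch, not the tuning procedure used on these hosts. The 1 Gb/s rate and 70 ms round-trip time are illustrative assumptions, and `request_buffers` is a hypothetical helper name.

```python
import socket

def bdp_bytes(rate_bits_per_sec: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes that must be in flight to fill the path."""
    return rate_bits_per_sec * rtt_ms // (8 * 1000)

def request_buffers(sock: socket.socket, nbytes: int) -> None:
    """Ask the kernel for send/receive buffers of nbytes.

    The kernel may clamp the request to its configured maximums,
    so the host limits have to be raised first for this to help.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, nbytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)

# Assumed example path: 1 Gb/s with a 70 ms round-trip time.
window = bdp_bytes(10**9, 70)
print(window)  # 8750000 bytes, roughly 8.3 MiB

# Compare: the OpenSSH defaults above give 4 * 32 KiB = 131072 bytes.
ssh_window = 4 * 32 * 1024
print(ssh_window)
```

Under these assumptions the path needs a window of about 8.3 MiB, while a fixed 128 KiB application window would cap throughput far below line rate regardless of how the host is tuned.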

Throughput Testing
• Iperf (2.0.2) from shell scripts
    • Vary buffer (-l) and window (-w)
    • Modify ifconfig and PCI configuration
    • Loop takes 3 days
• Bwctl with remote hosts
    • Anyone on NLR?
• Use scp/sftp/rsync as final test
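
The buffer and window sweep can be sketched as a small command generator. This is an illustration in Python rather than the shell scripts the slide refers to. The server name, the size lists, and the 10-second duration are assumptions. Only the iperf 2 client flags `-c`, `-l`, `-w`, and `-t` are used, and the commands are printed rather than executed.

```python
from itertools import product

def iperf_commands(server, lengths, windows, seconds=10):
    """Build one iperf 2 client command per (-l, -w) combination."""
    return [
        ["iperf", "-c", server, "-l", length, "-w", window, "-t", str(seconds)]
        for length, window in product(lengths, windows)
    ]

# Hypothetical sweep: three buffer lengths by three window sizes.
cmds = iperf_commands("hostB", ["8K", "64K", "128K"], ["256K", "2M", "16M"])
for cmd in cmds:
    print(" ".join(cmd))
```

Each added dimension (MTU, txqueuelen, PCI settings) multiplies the number of runs, which is how a full loop grows to days.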

I’m Curious
• How much TCP tuning information do you provide users and admins?
• Are hosts being tuned?
• Does your internal LAN support jumbo frames?

GSD Cluster GigE Defaults
• [wr]mem_default 2MB
• [wr]mem_max 16MB
• ipv4/tcp_[wr]mem 64KB 2MB 16MB
• optmem_max 512K
• txqueuelen 10000
• netdev_max_backlog 3000
• ipv4/tcp_sack and ipv4/tcp_timestamps on
• Don’t touch ipv4/tcp_mem
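
As a rough illustration, the defaults above can be written out as sysctl.conf-style lines. This sketch assumes that the `[wr]mem_*`, `optmem_max`, and `netdev_max_backlog` entries live under `net.core`, and that KB and MB mean powers of two. The function name is hypothetical. `txqueuelen` is a per-interface setting rather than a sysctl, so it is omitted, and `tcp_mem` is deliberately left alone.

```python
KB = 1024
MB = 1024 * KB

def gige_sysctl_lines():
    """Render the GigE defaults from the slide as 'key = value' lines."""
    tcp_triple = f"{64 * KB} {2 * MB} {16 * MB}"  # min, default, max
    settings = {
        "net.core.rmem_default": 2 * MB,
        "net.core.wmem_default": 2 * MB,
        "net.core.rmem_max": 16 * MB,
        "net.core.wmem_max": 16 * MB,
        "net.ipv4.tcp_rmem": tcp_triple,
        "net.ipv4.tcp_wmem": tcp_triple,
        "net.core.optmem_max": 512 * KB,
        "net.core.netdev_max_backlog": 3000,
        "net.ipv4.tcp_sack": 1,
        "net.ipv4.tcp_timestamps": 1,
    }
    return [f"{key} = {value}" for key, value in settings.items()]

for line in gige_sysctl_lines():
    print(line)
```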

What doesn’t work
• Jumbo Frames
    • Switch Fabrics
    • High density cards
    • Complex VLAN configurations
    • Stand alone GigE switches
    • Firewalls
• ICMP for path MTU discovery
    • Disabled completely
    • Network devices don’t respond

Linux 2.6 and Jumbos
IP hostA.52434 > hostB.22: S 544:544(0) win 16304 <mss 8152,...>
IP hostB.22 > hostA.52434: S 207:207(0) ack 545 win 5792 <mss 1460,...>
...
IP hostA.52434 > hostB.22: . 2255:6599(4344) ack 2293 win 16304 <...>
IP hostA.52434 > hostB.22: P 6599:10943(4344) ack 2293 win 16304 <...>
IP router > hostA: icmp 36: hostB unreachable - need to frag (mtu 1500)
IP hostA.52434 > hostB.22: . 2255:3703(1448) ack 2293 win 16304 <...>
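
The segment sizes in this trace follow from simple header arithmetic, sketched below in Python. The sketch assumes IPv4 and TCP headers without extra options (20 bytes each) and assumes the TCP timestamp option is in use, which costs 12 bytes per segment. The trace elides the options, so that part is an inference, and the 8192-byte host MTU is likewise inferred from the advertised MSS.

```python
IPV4_HEADER = 20
TCP_HEADER = 20
TIMESTAMP_OPTION = 12  # 10-byte option padded to a 4-byte boundary

def mss_for_mtu(mtu: int) -> int:
    """MSS a host advertises in its SYN for a given interface MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER

def payload_per_segment(mss: int, timestamps: bool = True) -> int:
    """Data bytes per segment once per-segment TCP options are subtracted."""
    return mss - (TIMESTAMP_OPTION if timestamps else 0)

print(mss_for_mtu(8192))          # 8152, as in hostA's SYN
print(mss_for_mtu(1500))          # 1460, as in hostB's SYN
print(payload_per_segment(1460))  # 1448, the size of the retransmission
print(3 * payload_per_segment(1460))  # 4344, the size of the earlier segments
```

Read this way, the 4344-byte segments are too large for the 1500-byte hop, the router's ICMP "need to frag" message reports the smaller MTU, and hostA resends the data in 1448-byte pieces. When that ICMP message is filtered or never sent, the sender has nothing to react to.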

Host Side Checks
• Interrupt Aggregation (Linux NAPI)
• Memory to match buffer tuning
• More than one CPU
• Static ARP entries

Network Device Settings
• Static ARP entries or increase timeout
• Increase FDB timeouts
• Verify jumbo frame configuration

10GbE Quick Notes
• Know your PCI hardware (MMRBC, Latency timer, and Splits)
• TCP stack is ~0.200ms
• Increase netdev_max_backlog to 30000
    • throughput = backlog * 100 Hz * ave_bytes_pkt
• Set *_cong to CERN values
• Write buffers in code ~128KB
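
The backlog rule of thumb can be checked numerically. This Python sketch reads the factor of 100 in the formula as a drain rate of 100 backlog queues per second, which is an assumption about what the slide intends, and it assumes an average packet of 1500 bytes. The function name is hypothetical.

```python
def backlog_throughput_bits(backlog_packets: int,
                            drains_per_sec: int,
                            avg_bytes_per_packet: int) -> int:
    """Upper bound on receive throughput in bits/s if the input queue
    holds backlog_packets and is emptied drains_per_sec times a second."""
    return backlog_packets * drains_per_sec * avg_bytes_per_packet * 8

gige = backlog_throughput_bits(3000, 100, 1500)
tengbe = backlog_throughput_bits(30000, 100, 1500)
print(gige / 1e9)    # 3.6 Gb/s ceiling with the GigE default of 3000
print(tengbe / 1e9)  # 36.0 Gb/s ceiling with 30000
```

Under these assumptions a backlog of 3000 is comfortable for GigE but caps a 10GbE interface well below line rate, while 30000 leaves headroom even when the average packet is considerably smaller.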

Reference URLs
• http://www.psc.edu/networking/projects/hpn-ssh/
• http://dast.nlanr.net/Projects/Iperf/
• http://www.sublimation.org/scponly/
• http://e2epi.internet2.edu/bwctl/
• http://abilene.internet2.edu/ami/bwctl_status.cgi/TCP/now
• http://www.tcptrace.org/
• http://ultralight.caltech.edu/
• http://staff.science.uva.nl/~delaat/articles/2003-7-10gige.pdf
• http://www.csm.ornl.gov/~dunigan/netperf/netlinks.html
• http://www.psc.edu/networking/projects/tcptune/
• http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26310.pdf