The Performance Bottleneck Application, Computer, or Network. Richard Carlson <rcarlson@internet2.edu> eVLBI Workshop – Performance Tuning Tutorial September 17, 2006. Outline. Why there is a problem What can be done to find/fix problems Tools you can use. Basic Premise.
The Performance Bottleneck: Application, Computer, or Network Richard Carlson <rcarlson@internet2.edu> eVLBI Workshop – Performance Tuning Tutorial September 17, 2006
Outline • Why there is a problem • What can be done to find/fix problems • Tools you can use
Basic Premise • Application performance should meet your expectations! • If it doesn’t, you should complain! • But you have to complain effectively.
Questions • How many times have you said: • What’s wrong with the network? • Why is the network so slow? • Do you have any way to find out? • Tools to check local host • Tools to check local network • Tools to check end-to-end path
Unfortunate Reality • Every problem, regardless of cause, exhibits the same symptom • The application performance doesn’t meet the user’s expectations!
Possible Bottlenecks • Network infrastructure • Host computer/appliance • Application design
Simple Network Picture • [Figure: Bob’s Host connected through the Network Infrastructure to Carol’s Host]
Network Infrastructure • [Figure: network diagram showing routers R1 to R9 and Switches 1 to 4]
Network Infrastructure Bottlenecks • Links too small • Using FastEthernet instead of Gigabit Ethernet • Links congested • Too many hosts crossing this link • Scenic routing • End-to-end path is longer than it needs to be • Broken equipment • Bad NIC, broken wire/cable, cross-talk • Administrative restrictions • Firewalls, Filters, shapers, restrictors
Host Computer Bottlenecks • CPU utilization • What else is the processor doing? • Memory limitations • Main memory and network buffers • I/O bus speed • Getting data into and out of the NIC • Disk access speed
Application Behavior Bottlenecks • Chatty protocol • Lots of short messages between peers • High reliability protocol • Send packet and wait for reply before continuing • No run-time tuning options • Use only default settings • Blaster protocol • Ignore congestion control feedback
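A rough way to see why chatty and send-and-wait protocols suffer on long paths: if every message must be acknowledged before the next is sent, each exchange costs at least one round trip. The sketch below assumes a hypothetical workload of 1,000 small messages and a 50 msec round-trip time; both numbers are illustrative, and the model ignores transmission and processing time.

```python
def stop_and_wait_seconds(message_count: int, rtt_seconds: float) -> float:
    """Lower bound on elapsed time when each message waits for its reply.

    Simplified model: one full round trip per message, nothing overlapped.
    """
    return message_count * rtt_seconds


# Hypothetical workload: 1,000 request/reply exchanges.
lan_time = stop_and_wait_seconds(1000, 0.001)   # assumed 1 msec LAN round trip
wan_time = stop_and_wait_seconds(1000, 0.050)   # assumed 50 msec long-haul round trip

print(f"LAN: {lan_time:.1f} s, WAN: {wan_time:.1f} s")
```

The same application that feels instant on a LAN spends most of its time waiting once the round-trip time grows.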
Problems, Problems, Problems • Problems can exist at multiple levels • Network infrastructure • Host computer • Application design • Multiple problems can exist at the same time • All problems must be found and fixed before things get better
Transport Protocols 101 • Transmission Control Protocol (TCP) • Provides applications with a reliable in-order delivery service • The most widely used Internet transport protocol • Web, File transfers, email, P2P, Remote login • User Datagram Protocol (UDP) • Provides applications with an unreliable delivery service • RTP, DVTS, DNS
Outline • Why there is a problem • What can be done to find/fix problems • Tools you can use
Remote Image Processing • Carol is analyzing astronomical images. Bob needs to send a data file containing digital images (50 MB per file) to Carol every ½ hour. Bob and Carol are 2,000 miles apart. How long should each transfer take? • 5 minutes? • 1 minute? • 5 seconds?
What should we expect? • Assumptions: • 100 Mbps Fast Ethernet is the slowest link • 50 msec round trip time • Bob & Carol calculate: • 50 MB * 8 = 400 Mbits • 400 Mb / 100 Mb/sec = 4 seconds
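Bob and Carol's estimate can be written as a small calculation. This is a minimal sketch of the arithmetic on the slide; the function name is hypothetical, and the result is a best case that ignores protocol overhead and TCP ramp-up.

```python
def ideal_transfer_seconds(file_megabytes: float, link_mbps: float) -> float:
    """Best-case transfer time when the slowest link is the only limit."""
    file_megabits = file_megabytes * 8  # 8 bits per byte
    return file_megabits / link_mbps


# 50 MB file over a 100 Mbps Fast Ethernet bottleneck link.
print(ideal_transfer_seconds(50, 100))  # 4.0
```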
Initial Test Results • 18 Minutes!!! This is unacceptable! • First look for network infrastructure problem • Use NDT tester to examine both hosts
NDT Found Duplex Mismatch • Investigation shows that the switch port is configured for 100 Mbps Full-Duplex operation. • Network administrator corrects the configuration and asks for a re-test
Intermediate Results • Time dropped from 18 minutes to 40 seconds. • Is this acceptable??? • Remember your calculations said it should take 4 seconds. • 400 Mb / 40 sec = 10 Mbps • Why are we limited to 10 Mbps? • Are you satisfied with 1/10th of the possible performance?
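Turning a measured transfer time back into throughput makes the gap easy to see. The sketch below uses the 50 MB file size and 100 Mbps link from the example; the helper name is hypothetical.

```python
def achieved_mbps(file_megabytes: float, elapsed_seconds: float) -> float:
    """Average throughput in Mbps for a completed transfer."""
    return file_megabytes * 8 / elapsed_seconds


LINK_MBPS = 100  # slowest link assumed in the example

for label, seconds in [("before duplex fix", 18 * 60), ("after duplex fix", 40)]:
    rate = achieved_mbps(50, seconds)
    print(f"{label}: {rate:.2f} Mbps ({rate / LINK_MBPS:.1%} of the link)")
```

The 18-minute transfer averaged well under 1 Mbps, and the 40-second transfer 10 Mbps, a tenth of what the link can carry.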
Calculating the Window Size • Remember Bob found the round-trip time was 50 msec • Calculate window size limit • 85.3KB * 8 b/B = 698777 b • 698777 b / .050 s = 13.98 Mbps • Stated another way • 698777 b / 100 Mb/s = 6.99 msec • 43 msec of idle time every RTT
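The window limit above follows from the fact that TCP can have at most one window of unacknowledged data in flight per round trip. A minimal sketch of that arithmetic, using the 85.3 KB window, 50 msec RTT, and 100 Mbps link from the slide (function names are hypothetical):

```python
def window_limited_mbps(window_bytes: float, rtt_seconds: float) -> float:
    """Maximum throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e6


def idle_msec_per_rtt(window_bytes: float, rtt_seconds: float, link_bps: float) -> float:
    """Time per round trip the sender spends waiting for acknowledgements."""
    send_seconds = window_bytes * 8 / link_bps
    return (rtt_seconds - send_seconds) * 1000


window = 85.3 * 1024  # 85.3 KB expressed in bytes

print(f"{window_limited_mbps(window, 0.050):.2f} Mbps")
print(f"{idle_msec_per_rtt(window, 0.050, 100e6):.1f} msec idle per RTT")
```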
Calculating the Window Size • Calculate new window size • (100 Mb/s * .050 s) / 8 b/B = 610.3 KB • Use 8MB for testing purposes
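The new window size is the bandwidth-delay product: the amount of data that must be in flight to keep the link busy for a full round trip. A small sketch of the calculation, assuming 1 KB = 1024 bytes as the slide does (the function name is hypothetical):

```python
def bandwidth_delay_product_bytes(link_bps: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to fill the path for one round trip."""
    return link_bps * rtt_seconds / 8


bdp = bandwidth_delay_product_bytes(100e6, 0.050)
print(f"{bdp:.0f} bytes, about {bdp / 1024:.0f} KB")
```

Any buffer at or above this value removes the window limit on this path, which is why a generous 8 MB setting is convenient for testing.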
Intermediate Results • Use application specific options to manually reset buffer size • Fixes problem for this application • Doesn’t fix problem for other applications • Need better ‘default behavior’ for all applications
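Many applications expose the buffer size through the standard socket options `SO_SNDBUF` and `SO_RCVBUF`. The sketch below shows one way an application might request larger buffers; the helper name and the 8 MB value are illustrative, and the operating system may cap, adjust, or reject the request depending on its own limits.

```python
import socket

REQUESTED_BYTES = 8 * 1024 * 1024  # 8 MB, the test value used in this example


def open_tuned_socket(buffer_bytes: int = REQUESTED_BYTES) -> socket.socket:
    """Create a TCP socket and ask for larger send/receive buffers.

    Buffers are requested before connect() so the larger window can be
    taken into account when the connection is set up.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for option in (socket.SO_SNDBUF, socket.SO_RCVBUF):
        try:
            sock.setsockopt(socket.SOL_SOCKET, option, buffer_bytes)
        except OSError:
            # Some systems reject requests above their configured maximum;
            # in that case the default buffer size stays in effect.
            pass
    return sock


sock = open_tuned_socket()
granted_send = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
granted_recv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"send buffer: {granted_send} bytes, receive buffer: {granted_recv} bytes")
sock.close()
```

Reading the values back with `getsockopt` matters: the granted size is often different from the requested one, and system-wide maximums may need to be raised before a large request takes effect.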
Steps so far • Found and fixed Duplex Mismatch • Network Infrastructure problem • Found and fixed TCP window size values • Host configuration problem • Are we done yet?
Intermediate Results • SCP still runs slower than expected • Hint: SSH uses internal buffers • Design choice by Application Developers limits performance • Patch available from PSC
Final Results • Fixed infrastructure problem • Fixed host configuration problem • Fixed Application configuration problem • Achieved target time of 4 seconds to transfer 50 MB file over 2000 miles
Follow-up questions • What would have happened if I tried the patched SCP version before fixing the TCP buffer problem? • Would not have been able to see improvement. • Discard patch because “it didn’t work”?
Why is it hard to Find/Fix Problems? • Network infrastructure is complex • Network infrastructure is shared • Network infrastructure consists of multiple components
Shared Infrastructure • Other applications accessing the network • Remote disk access • Automatic email checking • Heartbeat facilities • Other computers are attached to the closet switch • Uplink to facility infrastructure • Other users on and off site • Uplink from facility to gigapop/backbone
Other Network Components • DHCP (Dynamic Host Configuration Protocol) • At least 2 packets exchanged to configure your host • DNS (Domain Name System) • At least 2 packets exchanged to translate FQDN into IP address • Multiple addresses require a sequential search • Network Security Devices • Intrusion Detection, VPN, Firewall
Why is it hard to Find/Fix Problems? • Computers have multiple components • Each Operating System (OS) has a unique set of tools to tune the network stack • Network Interface Cards also have tuning options • Application Appliances come with few knobs and limited options
Computer Components • Main CPU (clock speed) • Front & Back side bus • Main Memory • I/O Bus (ATA, SCSI, SATA) • Disk (access speed and size)
Computer Issues • Lots of internal components with multi-tasking OS • Lots of tunable TCP/IP parameters that need to be ‘right’ for each possible connection
Why is it hard to Find/Fix Problems? • Applications depend on default system settings • Problems scale with distance • More access to remote resources • 80/20 rule: since the early 1990s, 80% of your traffic leaves your local network
Default System Settings • For Linux 2.6.13 there are: • 11 tunable IP parameters • 45 tunable TCP parameters • 148 Web100 variables (TCP MIB) • Currently no OS ships with default settings that work well over trans-continental distances • Some applications allow run-time setting of some options • 30 settable/viewable IP parameters • 24 settable/viewable TCP parameters • There are no standard ways to set run-time option ‘flags’
Application Issues • Setting tunable parameters to the ‘right’ value • Getting the protocol ‘right’
Outline • Why there is a problem • What can be done to find/fix problems • Tools you can use
Tools, Tools, Tools • Ping • Traceroute • Iperf • Tcpdump • Tcptrace • BWCTL • NDT • OWAMP • AMP • Advisor • Thrulay • Web100 • MonALISA • pathchar • NPAD • Pathdiag • Surveyor • Ethereal • CoralReef • MRTG • Skitter • Cflowd • Cricket • Net100
Active Measurement Tools • Tools that inject packets into the network to measure some value • Available Bandwidth • Delay/Jitter • Loss • May require bi-directional traffic or synchronized hosts • May require running test program on both hosts
Passive Measurement Tools • Tools that monitor existing traffic on the network and extract some information • Bandwidth used • Jitter • Loss rate • May generate some privacy and/or security concerns
How do you set realistic Expectations? • Assume network bandwidth exists or find out what the limits are • Local LAN connection • Site Access link • Monitor the link utilization occasionally • Weathermap • MRTG graphs • Look at your host config/utilization • What is the CPU utilization?
Distance Matters • It’s harder to go fast over a long distance • TCP congestion control requires numerous round trips to prevent flooding the network • TCP buffer limits can stop the sender from injecting new data into the network • Applications can exhibit poor behavior when used over long distances
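To get a feel for the "numerous round trips", consider a simplified model of TCP slow start in which the congestion window doubles every round trip until it covers the path. The segment size (1460 bytes), initial window (2 segments), and loss-free doubling are assumptions for illustration; real TCP stacks differ in detail.

```python
def slow_start_round_trips(target_bytes: float,
                           mss_bytes: int = 1460,
                           initial_segments: int = 2) -> int:
    """Round trips needed for a doubling window to reach target_bytes.

    Simplified model: no loss, no delayed ACKs, window doubles each RTT.
    """
    cwnd = initial_segments * mss_bytes
    round_trips = 0
    while cwnd < target_bytes:
        cwnd *= 2
        round_trips += 1
    return round_trips


# 625,000 bytes fills a 100 Mbps path with a 50 msec round-trip time.
trips = slow_start_round_trips(625_000)
print(f"{trips} round trips, about {trips * 50} msec at 50 msec RTT")
```

On a 1 msec LAN the same number of round trips passes almost unnoticed, while on a long path the ramp-up is a visible part of every short transfer.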