1. Developing the Web100 Based Network Diagnostic Tool (NDT) Internet2 piPEs Tutorial
Rich Carlson
RCarlson@internet2.edu
2. 11/9/04 2 Demo http://ndt-newyork.abilene.ucaid.edu:7123
3. 11/9/04 3 Normal operation in campus
4. 11/9/04 4 Duplex Mismatch Detected
5. 11/9/04 5 Low throughput from remote host
6. 11/9/04 6 Increase TCP buffer size
7. 11/9/04 7 Motivation for work Measure performance to user’s desktop
Develop “single shot” diagnostic tool that doesn’t use historical data
Combine numerous Web100 variables to analyze connection
Develop network signatures for ‘typical’ network problems
This page describes the motivations for this work. The major objective is to provide some method of testing to the desktop. The speaker should note that it is difficult or impossible for a campus admin to run repetitive tests to every desktop on site. Even if it were done, running enough tests to get a statistically valid baseline is extremely difficult. A better approach is needed to allow testing on an as-needed basis. The NDT tester meets these design goals.
8. 11/9/04 8 Web100 Project Joint PSC/NCAR project funded by NSF
‘First step’ to gather TCP data
Kernel Instrument Set (KIS)
Requires patched Linux kernel
Geared toward wide area network performance
Future steps will automate tuning to improve application performance
This page gives some additional background: how the underlying Web100 project came into being and how the NDT uses it as the basic data gathering methodology. These are Web100 project goals, not NDT goals.
9. 11/9/04 9 Web Based Performance tool Operates on Any client with a Java enabled Web browser
What it can do
Positively state if Sender, Receiver, or Network is operating properly
Provide accurate application tuning info
Suggest changes to improve performance
This is the web based Java client. The applet downloads automatically into the client, eliminating the need to pre-install software on the client machine. This is a plus when a new user wants to test or complain: he or she doesn’t need to pre-load any software before a test can begin.
It is also important to define the NDT strengths and deficiencies (next slide). Note that the performance tuning info is based on getting to the NDT server, not the real application host. Thus the tuning info may be suspect, but it should provide the right trends in setting buffer sizes.
10. 11/9/04 10 Web Based Performance tool What it can’t do
Tell you where in the network the problem is
Tell you how other servers perform
Tell you how other clients will perform
As noted in the previous slide, it’s important to define what the NDT can’t do. There is enough variation in the Internet and in individual hosts that running a test to one desktop will not help determine how another computer will operate. Neither does it tell you how your desktop will operate when talking to a different server: system load and file system constraints both play a role in the wall clock time required to complete a specific task.
11. 11/9/04 11 Internet2 piPEs Project Develop E2E measurement infrastructure capable of finding network problems
Tools include
BWCTL: Bandwidth Control wrapper for NLANR Iperf
OWAMP: One-Way Active Measurement
NDT: Network Diagnostic Tool
This slide and the next show how the NDT fits into the rest of the piPEs architecture.
12. 11/9/04 12 piPEs Integration
Boxes in black are working and deployed (either released or in prototype form). Boxes in red are under development.
These are the software components that make up the piPEs measurement framework. Some are released (BWCTL, OWAMP), some are in “prototype format” (Database, Traceroute, PMP, PMC, web service, network monitoring), and some are under development (“Detective Applet”, Discovery module, Analysis module, MDI, NDT). The Measurement Domain Interface (MDI) is a web services interface that speaks the GGF NMWG Request/Report schema and handles authentication and authorization. It is being designed to be interoperable with other measurement frameworks (current and future). The Network Diagnostic Tool (NDT) is an existing tool that the original author is integrating into piPEs. It is designed to detect common problems in the first mile (the common case for most network “issues”).
13. 11/9/04 13 Bottleneck Link Detection What is the slowest link in the end-2-end path?
Monitors packet arrival times using libpcap routine
Use TCP dynamics to create packet pairs
Quantize results into link type bins (no fractional or bonded links)
Cisco URP grant work
This is the first major task. The issue is: what is the bottleneck link speed? For example, suppose you have a 10/100/1000 interface card and the intra-building network is GigE based, but you get plugged into a FastE network port. The NDT will tell you that the bottleneck is a FastE link somewhere in the path. Another example: suppose the path takes you through a slow exchange point where a backup Ethernet link is being used while the normal FastE link is down for some reason. The NDT will report that a bottleneck Ethernet link exists.
The NDT uses packet dispersion techniques, i.e., it measures the interpacket arrival times for all data and ACK packets sent or received. It also knows the packet size, so it can calculate the speed for each pair of packets sent or received. The results are then quantized, meaning that the NDT doesn’t recognize fractional link speeds. It’s either Ethernet, T3, or FastE. It wouldn’t detect a bonded EtherChannel interface.
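The packet-pair calculation described in these notes can be sketched in a few lines of Python. This is an illustration only, not the NDT source: the function names are hypothetical, the GigE bin and the nominal speeds are assumptions added to the Ethernet, T3, and FastE types named above, and picking the nearest bin on a log scale and then the most common bin is one plausible way to quantize, not necessarily the one the NDT uses.

```python
import math
from collections import Counter

# Nominal link speeds in Mbps. The bin list is an assumption for illustration.
LINK_BINS_MBPS = {
    "Ethernet": 10.0,
    "T3": 45.0,
    "FastE": 100.0,
    "GigE": 1000.0,
}


def pair_speed_mbps(packet_bytes, gap_seconds):
    """Speed implied by one packet pair: packet bits divided by the arrival gap."""
    if gap_seconds <= 0:
        raise ValueError("arrival gap must be positive")
    return packet_bytes * 8 / gap_seconds / 1e6


def quantize(speed_mbps):
    """Return the link type whose nominal speed is nearest on a log scale."""
    return min(
        LINK_BINS_MBPS,
        key=lambda name: abs(math.log(speed_mbps / LINK_BINS_MBPS[name])),
    )


def bottleneck(packet_bytes, arrival_times):
    """Quantize every consecutive packet pair and return the most common bin."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    bins = [quantize(pair_speed_mbps(packet_bytes, g)) for g in gaps if g > 0]
    if not bins:
        raise ValueError("need at least two packets with increasing timestamps")
    return Counter(bins).most_common(1)[0][0]


# 1500-byte packets arriving 120 microseconds apart imply about 100 Mbps.
print(bottleneck(1500, [0.0, 0.00012, 0.00024, 0.00036]))
```

Because each pair is forced into a bin, a fractional or bonded link would be reported as whichever standard link type is closest, which matches the limitation stated on the slide.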
14. 11/9/04 14 Duplex Mismatch Detection Developed analytical model to describe how Ethernet responds (no prior art?)
Expanding model to describe UDP and TCP flows
Develop practical detection algorithm
Test models in LAN, MAN, and WAN environments
NIH/NLM grant funding
Improving the detection of this problem has been the focus of recent work. We have an analytical model, and a detection algorithm was created based on this model.
15. 11/9/04 15 Future enhancements WiFi detection
Faulty Hardware detection
Congestion modification
Full/Half duplex detection
16. 11/9/04 16 Additional Functions and Features Provide basic tuning information
Basic Features
Basic configuration file
FIFO scheduling of tests
Simple server discovery protocol
Federation mode support
Command line client support
Created sourceforge.net project page
Finally, a list of basic features.
Configuration files: the admin can store run time options in a config file
FIFO scheduling: the NDT handles multiple requests in a first-come, first-served manner; other users wait in a queue for service
Simple discovery protocol: allows multiple servers to find each other when operating in Federated mode
Federated mode: allows multiple servers to redirect clients to the ‘closest’ NDT server
Command line client: allows the admin to run tests remotely without access to a web browser
The speaker should note that this is a SourceForge project.
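The first-come, first-served behavior listed above can be pictured with a small queue. This is a minimal sketch with hypothetical names (FifoTestQueue, request, next_client); it is not how web100srv is written, only an illustration of the scheduling policy.

```python
from collections import deque


class FifoTestQueue:
    """Hypothetical first-come, first-served queue of waiting test clients."""

    def __init__(self):
        self._waiting = deque()

    def request(self, client):
        """Add a client and return how many clients are ahead of it."""
        self._waiting.append(client)
        return len(self._waiting) - 1

    def next_client(self):
        """Return the longest-waiting client, or None if nobody is waiting."""
        return self._waiting.popleft() if self._waiting else None


queue = FifoTestQueue()
queue.request("desktop-a")
queue.request("desktop-b")
print(queue.next_client())  # desktop-a is served before desktop-b
```

Serving one test at a time keeps concurrent tests from competing with each other for the server's link, at the cost of making later users wait.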
17. 11/9/04 17 Availability Open Source Development project
http://www.sourceforge.net/projects/ndt
Tools available from
http://e2epi.internet2.edu/ndt/download.html
Contains source code
Email discussion list ndt-users@internet2.edu
Go to the http://e2epi.internet2.edu/ndt web site and click
ndt-users – General discussion on NDT tool
ndt-announce – Announcements on new features
Finally, where you can go to get the source and email support.
18. 11/9/04 18 NDT Flow Chart
This is the basic flow chart for the NDT program.
The process starts with the user opening a browser and entering the NDT server’s URL.
An optional step is to point to a well-known server and accept a redirect message (Federated mode).
Otherwise the URL points to the NDT server itself (either an Apache or the fakewww process answers the request).
The web server responds by returning the page with an embedded Java applet (the class or jar file is also returned).
The user must then manually request a test by clicking the “start” button.
The applet then opens a connection back to the server’s testing engine (the web100srv process).
A child process is created to handle the test, and the parent goes back to listening for more test requests. The parent also keeps the FIFO queue needed to process multiple requests.
A control channel is then created between the server and client to control the client’s actions and synchronize the start of the various tests.
The client then opens 2 new data channels back to the server; having the client open the connections allows the tests to get past client-side firewall boxes.
The client opens and closes a connection to perform the middlebox test.
The client then streams data back to the server to measure the client’s upload speed.
The client then opens another connection, and the server streams data back to the client to measure the client’s download speed.
The server then extracts the Web100 data and analyzes the connection for faults.
The results are recorded in the server’s log file and returned to the client for display to the user.
19. 11/9/04 19 NDT servers
This is a list of servers on the Abilene network, and other public servers. Note that this is not a complete list, and more are being added as they become available. The latest public server is located in Russia, and there is a server located at StarLight. In addition, several institutions run private servers, notably DOD and possibly DOE NNSA. There are no restrictions on use, just the University of Chicago public license requirement.
20. 11/9/04 20 Results and Observations Changing desktop affects performance
Faulty Hardware identification
Mathis et al. formula fails
Other topics and observations found after running a public server for several years.
21. 11/9/04 21 10 Mbps NIC
Throughput 6.8/6.7 Mbps send/receive
RTT 20 ms
Retransmission/Timeouts 25/3
100 Mbps NIC
Throughput 84/86 Mbps send/receive
RTT 10 ms
Retransmission/Timeouts 0/0
This slide shows why it’s important to test to the user’s desktop and why having the network staff show up with a ‘good’ laptop doesn’t help much. In this case one laptop client with a 10 Mbps Ethernet NIC saw 7 Mbps (70% utilization), which is good for a half-duplex connection. Note that some timeouts and retransmissions occurred (probably due to the half-duplex nature of the link). When informed of this loss, the network admin came in with a ‘tuned’ laptop and ran a test with a 100 Mbps NIC: good throughput (85% utilization) and no loss. His conclusion was that there was no network problem and that I should not have reported one. What is a typical user going to think? (I see a problem and the network staff says no problem found.)
22. 11/9/04 22
Link          Speed   Ave RTT   %loss
100 Mbps FD   94.09      5.41    0.00
100 Mbps FD   22.50      1.38    0.78
100 Mbps FD   82.66      6.16    0.00
100 Mbps FD   33.61     14.82    0.00
10 Mbps        6.99     72.80    0.01
10 Mbps        7.15      8.84    0.75
This example comes from a lab setup: 12 desktop computers all connected to the same Cisco switch with 2 VLANs and a Cisco router between the VLANs. Note that these tests are VLAN to VLAN, i.e., through the router, and in the first 4 cases everything is 100 Mbps full duplex. Note the order of magnitude change in RTT, a factor of 4 change in speed, and no correlation between speed and RTT. Also note that loss never reaches 1%. In the last 2 cases, one of the hosts was changed to a 10 Mbps link. Note the order of magnitude change in RTT, but speed remains constant and loss is again below 1%.
Next page describes the network conditions present during the test.
23. 11/9/04 23
Link          Speed   Condition     Ave RTT   %loss   loss/sec
100 Mbps FD   94.09   Good             5.41    0.00       0.03
100 Mbps FD   22.50   Bad NIC          1.38    0.78      15.11
100 Mbps FD   82.66   Bad reverse      6.16    0.00       0.03
100 Mbps FD   33.61   Congestion      14.82    0.00       0.10
10 Mbps        6.99   Good            72.80    0.01       0.03
10 Mbps        7.15   Bad NIC          8.84    0.75       4.65
Test results.
Case 1: everything is operating normally with 100 Mbps full duplex links.
Case 2: the router had a bad interface module, and it was reporting these errors in the router logs. Note the loss/sec rate.
Case 3: the TCP traffic is flowing in the opposite direction, but the bad router interface is still present. (Who would report a problem?)
Case 4: three pairs of hosts are testing at once, causing congestion on the shared router links (should be reported as normal).
Case 5: one of the hosts is set to 10 Mbps (normal operation).
Case 6: the faulty router interface is again in the path. Note the increased loss/second rate, but speed is still good.
Imagine what happens with GigE attached servers and FastE attached clients. Would anyone complain?
24. 11/9/04 24 Mathis et al. Formula fails
Estimate = (K * MSS) / (RTT * sqrt(loss))
old-loss = (Retrans - FastRetran) / (DataPktsOut - AckPktsOut)
new-loss = CongestionSignals / PktsOut
Estimate < Measured (K = 1)
old-loss 91/443 (20.54%)
new-loss 35/443 (7.90%)
This formula describes the normal operating mode for a Reno TCP connection. As noted, the NDT server is reporting that some connections don’t conform to this model. It isn’t clear why this discrepancy exists.
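The estimate and the two loss definitions from this slide can be written out as a short Python sketch. It is an illustration under stated assumptions, not the NDT source: the function names, the conversion from bytes per second to Mbps, and the sample values (1460-byte MSS, 20 ms RTT, 0.01% loss) are all assumed; only the formulas and the counter names come from the slide.

```python
import math


def mathis_estimate_mbps(mss_bytes, rtt_seconds, loss, k=1.0):
    """Estimate = (K * MSS) / (RTT * sqrt(loss)), converted here to Mbps."""
    if loss <= 0 or rtt_seconds <= 0:
        raise ValueError("loss and RTT must be positive")
    return k * mss_bytes * 8 / (rtt_seconds * math.sqrt(loss)) / 1e6


def old_loss(retrans, fast_retran, data_pkts_out, ack_pkts_out):
    """old-loss = (Retrans - FastRetran) / (DataPktsOut - AckPktsOut)."""
    return (retrans - fast_retran) / (data_pkts_out - ack_pkts_out)


def new_loss(congestion_signals, pkts_out):
    """new-loss = CongestionSignals / PktsOut."""
    return congestion_signals / pkts_out


# Assumed sample values: 1460-byte MSS, 20 ms RTT, 0.01% loss, K = 1.
estimate = mathis_estimate_mbps(1460, 0.020, 0.0001)
print(f"{estimate:.1f} Mbps")  # about 58.4 Mbps
```

A connection counts as a failure of the formula, in the sense used on this slide, when its measured throughput is higher than the estimate computed this way.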
25. 11/9/04 25 NDT Hardware Requirements Minimum requirements
500 MHz Intel or AMD CPU
64 MB of RAM
Fast Ethernet
Buying something now
2 GHz or better processor
256 MB of RAM
Gigabit Ethernet
Disk space for executables and log files
No disk I/O involved during test
26. 11/9/04 26 NDT Software Requirements Web100 enhancements
Linux kernel
User library
Other 3rd party SW needed to compile source
Java SDK
pcap library
Client uses Java JRE (beware of version mismatch)
NDT source file
Test engine (web100srv) requires root authority
27. 11/9/04 27 Recommended Settings There are no settings or options for the Web based java applet.
It allows the user to run a fixed set of tests for a limited time period
Test engine settings
Turn on admin view (-a option)
If multiple network interfaces exist use -i option to specify correct interface to monitor (ethx)
Simple Web server (fakewww)
Use -l fn option to create log file
28. 11/9/04 28 Potential Risks Non-standard kernel required
GUI tools can be used to monitor other ports
Public servers generate trouble reports from remote users
Respond or ignore emails
Test streams can trigger IDS alarms
Configure IDS to ignore NDT server
29. 11/9/04 29 Possible Alternatives Other tools that can perform client testing
Several web sites offer the ability for a user to check PC upload/download speed.
Internet2/Surfnet Detective
NCSA Advisor
30. 11/9/04 30 Supplemental information
31. 11/9/04 31 NDT’s Web100 Based Approach Simple bi-directional test to gather E2E data
Gather multiple data variables from server
Compare measured performance to analytical values
Translate network values into plain text messages
Geared toward campus area network
These are NDT goals.
An analogy is that repetitive tests build up a historical record that can point out when changes occur (a depth of measurement data). The NDT relies on multiple data variables (a breadth of measurement data) to achieve similar results.
32. 11/9/04 32 NDT Benefits End-user based view of network
Can identify configuration problems
Can identify performance bottlenecks
Provides some ‘hard evidence’ to users and network administrators to reduce finger pointing
Doesn’t rely on historical data
These are some of the benefits of the NDT system.
Providing hard evidence is an important part of making the user feel that something can be done to improve things.
33. 11/9/04 33 NDT methodology Identify specific problem(s) that affect end users
Analyze problem to determine ‘Network Signature’ for this problem
Provide testing tool to automate detection process
This introduces the audience to the NDT operation methodology. The next few slides provide the details.
34. 11/9/04 34 IEEE 802.11 (WiFi) Detection Detect when host is connected via wireless (wifi) link
Radio signal changes strength
NICs implement power saving features
Multiple standards (a/b/g/n)
Some data has been collected
35. 11/9/04 35 Faulty Hardware/Link Detection Detect non-congestive loss due to
Faulty NIC/switch interface
Bad Cat-5 cable
Dirty optical connector
Preliminary work shows that it is possible to distinguish between congestive and non-congestive loss
There has been some preliminary work done on detecting this problem. At one point I did find a bad router interface in a test network.
36. 11/9/04 36 Full/Half Link Duplex setting Detect half-duplex link in E2E path
Identify when throughput is limited by half-duplex operations
Preliminary work shows detection possible when link transitions between blocking states
This is also an area where more work needs to be performed. The issue is max performance: a half-duplex link will not achieve as high a speed as a full-duplex link. Note that old Ethernet hubs require half-duplex operation.
37. 11/9/04 37 Normal congestion detection Shared network infrastructures will cause periodic congestion episodes
Detect/report when TCP throughput is limited by cross traffic
Detect/report when TCP throughput is limited by own traffic
This is another area where more work is required. The issue is to detect when your traffic is sharing the network infrastructure with other users. In this case you should get 1/Nth of the bottleneck link speed. It would also be nice to know when TCP is entering the congestion avoidance phase.
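The 1/Nth expectation can be checked against the lab numbers on slide 23, where three pairs of hosts shared 100 Mbps links and one test measured 33.61 Mbps. The sketch below is only arithmetic with hypothetical function names; it assumes the flows split the nominal link speed evenly, which real TCP flows do only approximately.

```python
def fair_share_mbps(bottleneck_mbps, n_flows):
    """Expected per-flow throughput when n_flows split the bottleneck evenly."""
    if n_flows < 1:
        raise ValueError("need at least one flow")
    return bottleneck_mbps / n_flows


def share_ratio(measured_mbps, bottleneck_mbps, n_flows):
    """Measured throughput as a fraction of the even-split expectation."""
    return measured_mbps / fair_share_mbps(bottleneck_mbps, n_flows)


# Slide 23, congestion case: 33.61 Mbps measured with three flows on 100 Mbps.
print(f"{share_ratio(33.61, 100.0, 3):.2f}")  # 1.01
```

A ratio near 1 suggests the test is limited by ordinary sharing rather than by a fault, which is why that case should be reported as normal.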