Developing the Web100 Based Network Diagnostic Tool NDT

1. Developing the Web100 Based Network Diagnostic Tool (NDT)

Internet2 piPEs Tutorial
Rich Carlson, RCarlson@internet2.edu

2. 11/9/04 2 Demo http://ndt-newyork.abilene.ucaid.edu:7123

3. Normal operation in campus

4. Duplex Mismatch Detected

5. Low throughput from remote host

6. Increase TCP buffer size

7. Motivation for work

- Measure performance to the user's desktop
- Develop a "single shot" diagnostic tool that doesn't use historical data
- Combine numerous Web100 variables to analyze a connection
- Develop network signatures for "typical" network problems

This page describes the motivations for this work. The major objective is to provide some method of testing to the desktop. The speaker should note that it is difficult or impossible for a campus admin to run repetitive tests to every desktop on site. Even if that were done, running enough tests to get a statistically valid baseline is extremely difficult. A better approach is needed to allow testing on an as-needed basis. The NDT tester meets these design goals.

8. Web100 Project

- Joint PSC/NCAR project funded by NSF
- "First step" to gather TCP data
- Kernel Instrument Set (KIS)
- Requires patched Linux kernel
- Geared toward wide area network performance
- Future steps will automate tuning to improve application performance

This page gives some additional background: how the underlying Web100 project came into being and how the NDT uses it as its basic data gathering methodology. These are Web100 project goals, not NDT goals.

9. Web Based Performance tool

Operates on any client with a Java enabled Web browser.

What it can do:
- Positively state if Sender, Receiver, or Network is operating properly
- Provide accurate application tuning info
- Suggest changes to improve performance

This is the web based Java client. Being web based means the applet downloads into the client automatically, eliminating the need to pre-install software on the client machine. This is a plus when a new user wants to test or complain: he or she doesn't need to pre-load any software before a test can begin. It is also important to define the NDT's strengths and deficiencies (next slide). Note that the performance tuning info is based on the path to the NDT server, not to the real application host. Thus the tuning info may be suspect, but it should show the right trends for setting buffer sizes.

10. Web Based Performance tool

What it can't do:
- Tell you where in the network the problem is
- Tell you how other servers perform
- Tell you how other clients will perform

As noted in the previous slide, it's important to define what the NDT can't do. There is enough variation in the Internet and in individual hosts that running a test to one desktop will not help in determining how another computer will operate. Nor does it tell you how your desktop will operate when talking to a different server: system load and file system constraints all play a role in the wall clock time required to complete a specific task.

11. Internet2 piPEs Project

Develop E2E measurement infrastructure capable of finding network problems.

Tools include:
- BWCTL: Bandwidth Control wrapper for NLANR Iperf
- OWAMP: One-Way Active Measurement
- NDT: Network Diagnostic Tool

This slide and the next show how the NDT fits into the rest of the piPEs architecture.

12. piPEs Integration

Boxes in black are working and deployed (either released or in prototype form). Boxes in red are under development.

These are the software components that make up the piPEs measurement framework. Some are released (BWCTL, OWAMP), some are in "prototype format" (Database, Traceroute, PMP, PMC, web service, network monitoring), and some are under development ("Detective Applet", Discovery module, Analysis module, MDI, NDT).

The Measurement Domain Interface (MDI) is a web services interface that speaks the GGF NMWG Request/Report schema and handles authentication and authorization. It is being designed to be interoperable with other measurement frameworks (current and future).

The Network Diagnostic Tool (NDT) is an existing tool that the original author is integrating into piPEs. It is designed to detect common problems in the first mile (the common case for most network "issues").

13. Bottleneck Link Detection

- What is the slowest link in the end-2-end path?
- Monitors packet arrival times using libpcap routine
- Use TCP dynamics to create packet pairs
- Quantize results into link type bins (no fractional or bonded links)
- Cisco URP grant work

This is the first major task. The issue is: what is the bottleneck link speed? For example, suppose you have a 10/100/1000 interface card and the intra-building network is GigE based, but you get plugged into a FastE network port. The NDT will tell you that the bottleneck is a FastE link somewhere in the path. Another example: suppose the path takes you through a slow exchange point where a backup Ethernet link is being used while the normal FastE link is down for some reason. The NDT will report that a bottleneck Ethernet link exists.

The NDT uses packet dispersion techniques; that is, it measures the inter-packet arrival times for all data and ACK packets sent or received. It also knows the packet size, so it can calculate the speed for each pair of packets sent or received. The results are then quantized, meaning that the NDT doesn't recognize fractional link speeds. It's either Ethernet, T3 or FastE. It wouldn't detect a bonded EtherChannel interface.
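The packet-pair calculation described in these notes can be sketched in a few lines of Python. This is only an illustration of the idea (speed = packet size / inter-arrival gap, then snap to a link type), not the NDT's actual implementation. The bin list, the nearest-bin rule on a log scale, and the majority vote across pairs are all assumptions made for this sketch, as are the function names.

```python
import math
from collections import Counter

# Hypothetical link-type bins (nominal rates in bits per second).
# The slide names Ethernet, T3 and FastE; GigE is added for illustration.
LINK_BINS = [
    ("Ethernet", 10e6),
    ("T3", 44.736e6),
    ("FastE", 100e6),
    ("GigE", 1000e6),
]

def pair_speed_bps(packet_bytes, gap_seconds):
    """Speed implied by one packet pair: bits in the packet / arrival gap."""
    return packet_bytes * 8 / gap_seconds

def quantize(speed_bps):
    """Snap a measured speed to the closest bin (closest on a log scale)."""
    name, _ = min(LINK_BINS, key=lambda b: abs(math.log(speed_bps / b[1])))
    return name

def bottleneck(arrivals):
    """arrivals: list of (timestamp_seconds, packet_bytes) in arrival order.

    Returns the link type that most packet pairs vote for, or None if
    there are no usable pairs.
    """
    votes = Counter()
    for (t0, _), (t1, size) in zip(arrivals, arrivals[1:]):
        gap = t1 - t0
        if gap > 0:  # skip pairs with identical or reordered timestamps
            votes[quantize(pair_speed_bps(size, gap))] += 1
    return votes.most_common(1)[0][0] if votes else None

# 1500-byte packets arriving 120 microseconds apart imply about 100 Mbps.
sample = [(i * 0.00012, 1500) for i in range(5)]
print(bottleneck(sample))
```

Because every measurement is forced into one of the bins, a fractional or bonded link would simply be reported as its nearest standard type, which matches the limitation stated on the slide.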

14. Duplex Mismatch Detection

- Developed analytical model to describe how Ethernet responds (no prior art?)
- Expanding model to describe UDP and TCP flows
- Develop practical detection algorithm
- Test models in LAN, MAN, and WAN environments
- NIH/NLM grant funding

Improving the detection of this problem has been the focus of recent work. We have an analytical model, and a detection algorithm was created based on this model.

15. Future enhancements

- WiFi detection
- Faulty Hardware detection
- Congestion modification
- Full/Half duplex detection

16. Additional Functions and Features

Provide basic tuning information.

Basic Features:
- Basic configuration file
- FIFO scheduling of tests
- Simple server discovery protocol
- Federation mode support
- Command line client support
- Created sourceforge.net project page

Finally, a list of basic features.
- Configuration files: the admin can store run time options in a config file.
- FIFO scheduling: the NDT handles multiple requests in a first-come, first-served manner; other users wait in a queue for service.
- Simple discovery protocol: allows multiple servers to find each other when operating in Federated mode.
- Federated mode: allows multiple servers to redirect clients to the "closest" NDT server.
- Command line client: allows an admin to run a test remotely without access to a web browser.

The speaker should note that this is a SourceForge project.
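As a rough illustration of the first-come, first-served scheduling described above, here is a minimal Python sketch of a test queue. The class and method names are hypothetical and are not taken from the NDT source; the real server keeps its queue inside the web100srv parent process.

```python
from collections import deque

class FifoTestQueue:
    """Hypothetical first-come, first-served queue of waiting clients."""

    def __init__(self):
        self._waiting = deque()

    def request(self, client_id):
        """Add a client to the back of the queue.

        Returns the client's position (1 means it will be served next).
        """
        self._waiting.append(client_id)
        return len(self._waiting)

    def next_client(self):
        """Remove and return the client that has waited longest, or None."""
        return self._waiting.popleft() if self._waiting else None

queue = FifoTestQueue()
queue.request("client-a")
queue.request("client-b")
print(queue.next_client())  # the earliest request is served first
```

A deque is used here because removing from the front of a Python list is linear in the queue length, while `popleft` is constant time.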

17. Availability

- Open Source Development project: http://www.sourceforge.net/projects/ndt
- Tools available from http://e2epi.internet2.edu/ndt/download.html (contains source code)
- Email discussion list: ndt-users@internet2.edu
- Go to the http://e2epi.internet2.edu/ndt web site and click:
  - ndt-users: general discussion on the NDT tool
  - ndt-announce: announcements on new features

Finally, where you can go to get the source and email support.

18. NDT Flow Chart

This is the basic flow chart for the NDT program.

- The process starts with the user opening a browser and entering the NDT server's URL.
- An optional step is to point to a well known server and accept a redirect message (Federated mode).
- Otherwise the URL points to the NDT server itself (either an Apache server or the fakewww process answers the request).
- The web server responds by returning the page with an embedded Java applet (the class or jar file is also returned).
- The user must then manually request a test by clicking the "start" button.
- The applet then opens a connection back to the server's testing engine (the web100srv process).
- A child process is created to handle the test, and the parent goes back to listening for more test requests. The parent also keeps the FIFO queue needed to process multiple requests.
- A control channel is then created between the server and client to control the client's actions and synchronize the start of the various tests.
- The client then opens 2 new data channels back to the server. Having the client open the connections allows the tests to get past client side firewall boxes.
- The client opens and closes a connection to perform the middlebox test.
- The client then streams data back to the server to measure the client's upload speed.
- The client then opens another connection and the server streams data back to the client, measuring the client's download speed.
- The server then extracts the Web100 data and analyzes the connection for faults.
- The results are recorded in the server's log file and returned to the client for display to the user.

19. NDT servers

This is a list of servers on the Abilene network, plus other public servers. Note that this is not a complete list; more are added as they become available. The latest public server is located in Russia, and there is a server located at StarLight. In addition, several institutions run private servers, notably DOD and possibly DOE NNSA. There are no restrictions on use, just the University of Chicago public license requirement.

20. Results and Observations

- Changing desktop affects performance
- Faulty Hardware identification
- Mathis et al. formula fails

Other topics and observations found after running a public server for several years.

21. 10 Mbps NIC vs. 100 Mbps NIC

10 Mbps NIC
- Throughput: 6.8/6.7 Mbps send/receive
- RTT: 20 ms
- Retransmissions/Timeouts: 25/3

100 Mbps NIC
- Throughput: 84/86 Mbps send/receive
- RTT: 10 ms
- Retransmissions/Timeouts: 0/0

This slide shows why it's important to test to the user's desktop, and why having the network staff show up with a "good" laptop doesn't help much. In this case one laptop client with a 10 Mbps Ethernet NIC saw 7 Mbps (70% utilization), which is good for a half-duplex connection. Note that some timeouts and retransmissions occurred (probably due to the half-duplex nature of the link). When informed of this loss, the network admin came in with a "tuned" laptop and ran a test with a 100 Mbps NIC: good throughput (85% utilization) and no loss. His conclusion was that there is no network problem and that I should not have reported one. What is a typical user going to think? (I see a problem and the network staff says no problem found.)

22. Lab test results

| Link        | Speed | Ave RTT | %loss |
| 100 Mbps FD | 94.09 | 5.41    | 0.00  |
| 100 Mbps FD | 22.50 | 1.38    | 0.78  |
| 100 Mbps FD | 82.66 | 6.16    | 0.00  |
| 100 Mbps FD | 33.61 | 14.82   | 0.00  |
| 10 Mbps     | 6.99  | 72.80   | 0.01  |
| 10 Mbps     | 7.15  | 8.84    | 0.75  |

This example comes from a lab setup: 12 desktop computers all connected to the same Cisco switch with 2 VLANs and a Cisco router between the VLANs. Note that these tests are VLAN to VLAN, i.e., through the router, and in the first 4 cases everything is 100 Mbps full duplex. Note the order of magnitude change in RTT, the factor of 4 change in speed, and the lack of correlation between speed and RTT. Also note that loss never reaches 1%. In the last 2 cases, one of the hosts was changed to a 10 Mbps link. Note the order of magnitude change in RTT while speed remains constant and loss is again below 1%. The next page describes the network conditions present during the tests.

23. Lab test results with network conditions

| Link        | Speed | Ave RTT | %loss | loss/sec | Condition   |
| 100 Mbps FD | 94.09 | 5.41    | 0.00  | 0.03     | Good        |
| 100 Mbps FD | 22.50 | 1.38    | 0.78  | 15.11    | Bad NIC     |
| 100 Mbps FD | 82.66 | 6.16    | 0.00  | 0.03     | Bad reverse |
| 100 Mbps FD | 33.61 | 14.82   | 0.00  | 0.10     | Congestion  |
| 10 Mbps     | 6.99  | 72.80   | 0.01  | 0.03     | Good        |
| 10 Mbps     | 7.15  | 8.84    | 0.75  | 4.65     | Bad NIC     |

Test results, case by case:
- Case 1: everything is operating normally with 100 Mbps full duplex links.
- Case 2: the router had a bad interface module, and it was reporting these errors in the router logs. Note the loss/sec rate.
- Case 3: the TCP traffic is flowing in the opposite direction, but the bad router interface is still present. (Who would report a problem?)
- Case 4: three pairs of hosts are testing at once, causing congestion on the shared router links (should be reported as normal).
- Case 5: one of the hosts is set to 10 Mbps (normal operation).
- Case 6: the faulty router interface is again in the path. Note the increased loss/sec rate, but speed is still good.

Imagine what happens with GigE attached servers and FastE attached clients. Would anyone complain?

24. Mathis et al. formula fails

Estimate = (K * MSS) / (RTT * sqrt(loss))

old-loss = (Retrans - FastRetran) / (DataPktsOut - AckPktsOut)
new-loss = CongestionSignals / PktsOut

Estimate < Measured (K = 1):
- old-loss: 91/443 (20.54%)
- new-loss: 35/443 (7.90%)

This formula describes the normal operating mode for a Reno TCP connection. As noted, the NDT server is reporting that some connections don't conform to this model. It isn't clear why this discrepancy exists.
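To make the formula and the two loss definitions concrete, here is a small Python sketch. The function and argument names mirror the variable names on the slide, but the units are assumptions, since the slide does not state them: MSS in bytes, RTT in seconds, loss as a fraction between 0 and 1, and the estimate returned in bits per second. The sample MSS, RTT and loss values are invented for illustration and are not measurements from the talk.

```python
import math

def mathis_estimate_bps(mss_bytes, rtt_seconds, loss, k=1.0):
    """Estimate = (K * MSS) / (RTT * sqrt(loss)), returned in bits/second.

    With zero loss the formula has no finite value, so infinity is returned.
    """
    if loss <= 0:
        return float("inf")
    return (k * mss_bytes * 8) / (rtt_seconds * math.sqrt(loss))

def old_loss(retrans, fast_retran, data_pkts_out, ack_pkts_out):
    """old-loss = (Retrans - FastRetran) / (DataPktsOut - AckPktsOut)."""
    return (retrans - fast_retran) / (data_pkts_out - ack_pkts_out)

def new_loss(congestion_signals, pkts_out):
    """new-loss = CongestionSignals / PktsOut."""
    return congestion_signals / pkts_out

# Illustrative values only: 1460-byte MSS, 20 ms RTT, 0.01% loss.
estimate = mathis_estimate_bps(1460, 0.020, 0.0001)
print(f"estimate: {estimate / 1e6:.1f} Mbps")

# Share of connections where Estimate < Measured, as quoted on the slide.
print(f"old-loss: {100 * 91 / 443:.2f}%  new-loss: {100 * 35 / 443:.2f}%")
```

A connection would be counted as not conforming to the model when its measured throughput exceeds this estimate with K = 1, which is the comparison the slide tallies under each loss definition.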

25. NDT Hardware Requirements

- Minimum requirements
  - 500 MHz Intel or AMD CPU
  - 64 MB of RAM
  - Fast Ethernet
- Buying something now
  - 2 GHz or better processor
  - 256 MB of RAM
  - Gigabit Ethernet
- Disk space for executables and log files
- No disk I/O involved during test

26. NDT Software Requirements

- Web100 enhancements
  - Linux kernel
  - User library
- Other 3rd party SW needed to compile source
  - Java SDK
  - pcap library
- Client uses Java JRE (beware of version mismatch)
- NDT source file
- Test engine (web100srv) requires root authority

27. Recommended Settings

- There are no settings or options for the Web based Java applet. It allows the user to run a fixed set of tests for a limited time period.
- Test engine settings
  - Turn on admin view (-a option)
  - If multiple network interfaces exist, use the -i option to specify the correct interface to monitor (ethx)
- Simple Web server (fakewww)
  - Use the -l fn option to create a log file

28. Potential Risks

- Non-standard kernel required
- GUI tools can be used to monitor other ports
- Public servers generate trouble reports from remote users
  - Respond or ignore emails
- Test streams can trigger IDS alarms
  - Configure IDS to ignore NDT server

29. Possible Alternatives

Other tools that can perform client testing:
- Several web sites offer the ability for a user to check PC upload/download speed
- Internet2/Surfnet Detective
- NCSA Advisor

30. Supplemental information

31. NDT's Web100 Based Approach

- Simple bi-directional test to gather E2E data
- Gather multiple data variables from server
- Compare measured performance to analytical values
- Translate network values into plain text messages
- Geared toward campus area network

These are NDT goals. By way of comparison, repetitive tests build up a historical record that can point out when changes occur (a depth of measurement data). The NDT relies on multiple data variables (a breadth of measurement data) to achieve similar results.

32. NDT Benefits

- End-user based view of network
- Can identify configuration problems
- Can identify performance bottlenecks
- Provides some "hard evidence" to users and network administrators to reduce finger pointing
- Doesn't rely on historical data

These are some of the benefits of the NDT system. Providing hard evidence is an important part of making the user feel that something can be done to improve things.

33. NDT methodology

- Identify specific problem(s) that affect end users
- Analyze problem to determine "Network Signature" for this problem
- Provide testing tool to automate detection process

This introduces the audience to the NDT operation methodology. The next few slides provide the details.

34. IEEE 802.11 (WiFi) Detection

- Detect when host is connected via wireless (WiFi) link
- Radio signal changes strength
- NICs implement power saving features
- Multiple standards (a/b/g/n)
- Some data has been collected

35. Faulty Hardware/Link Detection

- Detect non-congestive loss due to
  - Faulty NIC/switch interface
  - Bad Cat-5 cable
  - Dirty optical connector
- Preliminary work shows that it is possible to distinguish between congestive and non-congestive loss

There has been some preliminary work done on detecting this problem. At one point I did find a bad router interface in a test network.

36. Full/Half Link Duplex setting

- Detect half-duplex link in E2E path
- Identify when throughput is limited by half-duplex operations
- Preliminary work shows detection possible when link transitions between blocking states

This is also an area where more work needs to be done. The issue is maximum performance: a half-duplex link will not achieve as high a speed as a full-duplex link. Note that old Ethernet hubs require half-duplex operation.

37. Normal congestion detection

- Shared network infrastructures will cause periodic congestion episodes
- Detect/report when TCP throughput is limited by cross traffic
- Detect/report when TCP throughput is limited by own traffic

This is another area where more work is required. The issue is to detect when your traffic is sharing the network infrastructure with other users. In this case you should get 1/Nth of the bottleneck link speed. It would also be nice to know when TCP is entering the congestion avoidance phase.
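The 1/N rule of thumb can be checked against the congestion case from the lab results (slide 23), where three pairs of hosts shared 100 Mbps links and one flow measured 33.61. The sketch below assumes that figure is in Mbps and that the three flows shared the bottleneck equally; the function names are hypothetical and this is not a detection algorithm used by the NDT.

```python
def fair_share_bps(link_bps, n_flows):
    """Expected per-flow throughput when n_flows share a bottleneck equally."""
    if n_flows < 1:
        raise ValueError("n_flows must be at least 1")
    return link_bps / n_flows

def share_ratio(measured_bps, link_bps, n_flows):
    """Measured throughput as a fraction of the equal share (1.0 = exact)."""
    return measured_bps / fair_share_bps(link_bps, n_flows)

# Congestion case: 3 simultaneous flows on 100 Mbps links, 33.61 measured.
ratio = share_ratio(33.61e6, 100e6, 3)
print(f"measured / fair share = {ratio:.3f}")
```

A ratio close to 1.0 is consistent with the throughput being limited by cross traffic rather than by a fault, which is why that lab case "should be reported as normal".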
