
Network Path and Application Diagnostics

NPAD/Pathdiag is a one-click automatic performance diagnosis tool designed for end-users. It accurately identifies and distinguishes between host and path flaws, helping users understand and resolve network performance problems. It also provides backup documentation for support escalation and recalibrates user expectations.



Presentation Transcript


  1. Network Path and Application Diagnostics • Matt Mathis, John Heffner, Ragu Reddy • 7/17/06 • http://www.psc.edu/~mathis/papers/PathDiag20060717.ppt (Corrected)

  2. Outline • NPAD/Pathdiag - Why should you care? • What are the real performance problems? • Automatic diagnosis • Deployment

  3. NPAD/Pathdiag - Why should you care? • One click automatic performance diagnosis • Designed for (non-expert) end users • Accurate end-systems and last mile diagnosis • Eliminate most false pass results • Accurate distinction between host and path flaws • Accurate and specific identification of most flaws • Basic networking tutorial info • Help the end user understand the problem • Help train 1st tier support (sysadmin or netadmin) • Backup documentation for support escalation • Empower the user to get it fixed • The same reports for users and admins

  4. Recalibrate user expectations • Long history of very poor network performance • Users do not know what to expect • Users have become completely numb • Goal for new baseline user expectations: • 1 Gigabyte in less than 2 minutes (~67 Mb/s) • Everyone should be able to reach these rates by default • People who can’t should know why or be very angry
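
The baseline figure on this slide is simple arithmetic. A minimal sketch, assuming a decimal gigabyte (10^9 bytes) and the hypothetical helper name `required_rate_mbps`:

```python
def required_rate_mbps(size_bytes: float, seconds: float) -> float:
    """Average rate, in megabits per second, needed to move size_bytes in seconds."""
    return size_bytes * 8 / seconds / 1e6


GIGABYTE = 1e9  # assumption: decimal gigabyte, not 2**30 bytes

# 1 gigabyte in 2 minutes, the baseline expectation from the slide
baseline = required_rate_mbps(1 * GIGABYTE, 120)
print(f"{baseline:.1f} Mb/s")
```

The result rounds to the slide's "~67 Mb/s".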

  5. The Wizard Gap

  6. The Wizard Gap Updated • Experts have topped out end systems & links • 10 Gb/s NIC bottleneck • 40 Gb/s “link” bandwidth (striped) • Median I2 bulk rate is 3 Mbit/s • See http://netflow.internet2.edu/weekly/ • Current Gap is about 3000:1 • Closing the first factor of 30 should now be “easy”
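
The ratios on this slide can be checked directly. A small sketch using only the two rates quoted above; the variable names are illustrative:

```python
EXPERT_RATE_BPS = 10e9  # 10 Gb/s NIC bottleneck, from the slide
MEDIAN_RATE_BPS = 3e6   # median I2 bulk rate of 3 Mbit/s, from the slide

# Ratio between what experts reach and what the median user gets
gap = EXPERT_RATE_BPS / MEDIAN_RATE_BPS

# Closing "the first factor of 30" lifts the median rate to this value
after_first_step_bps = MEDIAN_RATE_BPS * 30

# Gap still left once that first factor has been closed
remaining_gap = gap / 30

print(f"gap about {gap:.0f}:1")
print(f"after first factor of 30: {after_first_step_bps / 1e6:.0f} Mb/s")
print(f"remaining gap about {remaining_gap:.0f}:1")
```

Closing the first factor of 30 would put the median user near 90 Mb/s, the same range as the baseline expectation on slide 4.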

  7. TCP tuning requires expert knowledge • By design TCP/IP hides the ‘net from upper layers • TCP/IP provides basic reliable data delivery • The “hour glass” between applications and networks • This is a good thing, because it allows: • Invisible recovery from data loss, etc. • Old applications to use new networks • New applications to use old networks • But then (nearly) all problems have the same symptom • Less than expected performance • The details are hidden from nearly everyone

  8. TCP tuning is painful debugging • All problems reduce performance • But the specific symptoms are hidden • Any one problem can prevent good performance • Completely masking all other problems • Trying to fix the weakest link of an invisible chain • General tendency is to guess and “fix” random parts • Repairs are sometimes “random walks” • Repair one problem at a time at best • The solution is to instrument TCP

  9. The Web100 project • Instrumentation and autotuning for TCP • TCP has the ideal diagnostic vantage point • TCP-ESTATS-MIB now past IETF WG last-call • Will be a standards-track RFC soon • Prototypes for Linux (www.Web100.org) and Windows Vista • TCP Autotuning • Automatically adjusts TCP buffers • Linux 2.6.17 default maximum window size is 4 MBytes • Announced for Vista - details unknown • New insight • Nearly all symptoms scale with round trip time

  10. Nearly all symptoms scale with RTT • For example • TCP Buffer Space, Network loss and reordering, etc • On a short path TCP can compensate for the flaw • Local Client to Server: all applications work • Including all standard diagnostics • Remote Client to Server: all applications fail • Leading to faulty implication of other components
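
The TCP buffer space case shows the scaling most directly: a sender can have at most one window of data in flight per round trip, so throughput is bounded by window / RTT. A minimal sketch, assuming a hypothetical untuned host with a 64 KiB window and illustrative RTT values that do not come from the slides:

```python
def window_limited_rate_mbps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on TCP throughput when the window is the only limit."""
    return window_bytes * 8 / rtt_s / 1e6


WINDOW_BYTES = 64 * 1024  # assumption: untuned host with a 64 KiB buffer
TARGET_MBPS = 67.0        # baseline expectation from slide 4

# Illustrative RTTs (assumptions): same LAN, regional path, long-haul path
for label, rtt_ms in [("local", 1), ("regional", 10), ("long haul", 70)]:
    rate = window_limited_rate_mbps(WINDOW_BYTES, rtt_ms / 1000)
    verdict = "meets target" if rate >= TARGET_MBPS else "below target"
    print(f"{label:9s} RTT {rtt_ms:3d} ms: {rate:7.1f} Mb/s ({verdict})")
```

The same undersized buffer is invisible on the local path and crippling on the long one, which is why a local test gives a false pass.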

  11. The confounded problems • For nearly all network flaws • The only symptom is reduced performance • But the reduction is scaled by RTT • Therefore, flaws are undetectable on short paths • False pass for even the best conventional diagnostics • Leads to faulty inductive reasoning about flaw locations • Diagnosis often relies on tomography and complicated inference techniques • This is the real end-to-end problem

  12. The NPAD solution: • For applications (and upper layers) • Bench test over an (emulated) ideal long path • Topic of a future talk • “Pathdiag” tests short path sections to localize a flaw • Use Web100 to collect detailed statistics • Loss, delay, queuing properties, etc • Use models to extrapolate results to the full path • Assume that the rest of the path is ideal • You have to specify the end-to-end performance goal • Data rate and RTT • Pass/Fail on the basis of the extrapolated performance
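
The slides do not say which models pathdiag uses for the extrapolation. As one illustration of the idea, the sketch below uses the well-known macroscopic TCP model, rate ≈ (MSS / RTT) · C / √p with C = √(3/2), to turn an end-to-end goal into a loss budget and then judge a loss rate measured on a short section against it. The function names, the 1460-byte MSS, and the sample numbers are all assumptions.

```python
import math

MODEL_C = math.sqrt(3 / 2)  # constant from the simplest form of the model


def max_loss_rate(target_bps: float, rtt_s: float, mss_bytes: int = 1460) -> float:
    """Largest loss probability at which the model still predicts target_bps."""
    window_pkts = target_bps * rtt_s / (mss_bytes * 8)  # packets in flight at target
    return (MODEL_C / window_pkts) ** 2


def extrapolated_pass(measured_loss: float, target_bps: float, rtt_s: float,
                      mss_bytes: int = 1460) -> bool:
    """Pass if the measured section, extended by an ideal path, meets the goal."""
    return measured_loss <= max_loss_rate(target_bps, rtt_s, mss_bytes)


measured = 1e-4  # hypothetical loss rate observed on the short section
for rtt_s in (0.001, 0.070):
    ok = extrapolated_pass(measured, target_bps=67e6, rtt_s=rtt_s)
    print(f"goal RTT {rtt_s * 1000:4.0f} ms: {'pass' if ok else 'fail'}")
```

Under these assumptions the same measured loss passes when the goal RTT is 1 ms and fails at 70 ms, so the verdict depends on the specified end-to-end goal rather than on the short path that was actually tested.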

  13. Deploy as a Diagnostic Server • Use pathdiag in a Diagnostic Server (DS) • Specify End to End target performance • From server (S) to client (C) (RTT and data rate) • Measure the performance from DS to C • Use Web100 in the DS to collect detailed statistics • On both the path and client • Extrapolate performance assuming ideal backbone • Pass/Fail on the basis of extrapolated performance

  14. Demo Laptop PSC

  15. Key NPAD/pathdiag features • Results are intended for end-users • Provides a list of specific items to be corrected • Failed tests are showstoppers for fast apps • Includes explanations and tutorial information • Clear differentiation between client and path problems • Accurate escalation to network or system admins • The reports are public and can be viewed by either • Coverage for a majority of OS and last-mile network flaws • Most of the remaining flaws can be detected with pathdiag in the client or traceroute • Eliminates nearly all(?) false pass results

  16. More features • Tests become more sensitive as the path gets shorter • Conventional diagnostics become less sensitive • Depending on models, perhaps too sensitive • New problem is false fail (e.g. queue space tests) • Flaws no longer completely mask other flaws • A single test often detects several flaws • E.g. find both OS and network flaws in the same test • They can be repaired concurrently • Archived DS results include raw web100 data • Can reprocess with updated reporting SW • New reports from old data • Critical feedback for the NPAD project • We really want to collect “interesting” failures

  17. NPAD/pathdiag deployment • Why should a campus networking organization care? • “Zero effort” solution to mis-tuned end-systems • Accurate reports of real problems • You have the same view as the user • Saves time when there really is a problem • You can document reality for management • Suggestion: • Require pathdiag reports for all performance problems

  18. What about impact of the test traffic? • NPAD/pathdiag is single threaded • Only one test at a time • Same load as any well tuned TCP application • Protected by TCP “fairness” • Large flows are generally “softer” than small flows • Large flows are easily disturbed by small flows

  19. Impact • Automatically diagnose first level problems • Easily expose all path bottlenecks that limit performance to less than 100 Mb/s • Easily expose all end-system/OS problems that limit performance to less than 100 Mb/s • (Will become moot as autotuning is deployed) • Empower the users to apply the proper motivation • Still need to recalibrate user expectations • Less than 1 gigabyte / 2 minutes is too slow • Many paths should support 5 gigabytes/minute • (About 667 Mb/s, still less than 1 Gb/s)

  20. Download and install • User documentation: http://www.psc.edu/networking/projects/pathdiag/ • Follow the link to “Installing a Server” • Easily customized with a site specific skin • Designed to be easily upgraded with new releases • Roughly every 2 months • Improving reports through ongoing field experience • Drops into existing NDT servers • Plans for future integration • Enjoy!

  21. Backup slides

  22. Blast from the past • Same base algorithm as “Windowed Ping” [Mathis, INET’94] • Aka “mping” • See http://www.psc.edu/~mathis/wping/ • Killer diagnostic in use at PSC in the early 90s • Stopped working with the advent of “fast path” routers • Use a simple fixed window protocol • Scan window size in 1 second steps • Measure data rate, loss rate, RTT, etc as window changes
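
The window scan can be illustrated with a toy model. This is not the mping code: it is a sketch under the assumption of a single bottleneck with a fixed capacity, a fixed base RTT, and a drop-tail queue, with each loop iteration standing in for one 1-second step at a given window size. All names and parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    window_pkts: int
    rate_pps: float       # delivered packets per second
    rtt_s: float          # round trip time including queueing delay
    loss_fraction: float  # share of each window that overflows the queue


def scan_windows(capacity_pps: float, base_rtt_s: float,
                 queue_pkts: int, max_window: int) -> list[Sample]:
    """Sweep a fixed window over a toy single-bottleneck path."""
    bdp = capacity_pps * base_rtt_s  # packets needed to fill the pipe
    samples = []
    for w in range(1, max_window + 1):
        queued = min(max(w - bdp, 0.0), queue_pkts)  # standing queue
        dropped = max(w - bdp - queue_pkts, 0.0)     # overflow beyond the queue
        rtt = base_rtt_s + queued / capacity_pps
        rate = min((w - dropped) / rtt, capacity_pps)
        samples.append(Sample(w, rate, rtt, dropped / w))
    return samples


# Hypothetical path: 1000 packets/s, 10 ms base RTT, 5-packet queue
for s in scan_windows(1000.0, 0.010, 5, 20):
    print(f"w={s.window_pkts:2d} rate={s.rate_pps:6.0f} pps "
          f"rtt={s.rtt_s * 1000:5.1f} ms loss={s.loss_fraction:.2f}")
```

In this model the data rate rises linearly until the window fills the pipe, then the RTT grows as the queue fills, and finally loss appears once the queue overflows. Those three regimes are the kind of signature a window scan is meant to expose.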
