Network Path and Application Diagnostics Matt Mathis John Heffner Ragu Reddy 7/17/06 http://www.psc.edu/~mathis/papers/PathDiag20060717.ppt (Corrected)
Outline • NPAD/Pathdiag - Why should you care? • What are the real performance problems? • Automatic diagnosis • Deployment
NPAD/Pathdiag - Why should you care? • One-click automatic performance diagnosis • Designed for (non-expert) end users • Accurate end-system and last-mile diagnosis • Eliminates most false pass results • Accurate distinction between host and path flaws • Accurate and specific identification of most flaws • Basic networking tutorial info • Helps the end user understand the problem • Helps train 1st-tier support (sysadmin or netadmin) • Backup documentation for support escalation • Empowers the user to get it fixed • The same reports for users and admins
Recalibrate user expectations • Long history of very poor network performance • Users do not know what to expect • Users have become completely numb • Goal for new baseline user expectations: • 1 Gigabyte in less than 2 minutes (~67 Mb/s) • Everyone should be able to reach these rates by default • People who can’t should know why or be very angry
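A quick sanity check of the ~67 Mb/s figure (a minimal sketch; decimal units are assumed for both the gigabyte and the link rate):

```python
# Sanity check: 1 gigabyte in 2 minutes, expressed as a link rate.
# Assumes decimal units (1 GB = 8e9 bits), as is usual for network rates.

def rate_mbps(n_bytes, seconds):
    """Average transfer rate in megabits per second."""
    return n_bytes * 8 / seconds / 1e6

print(f"{rate_mbps(1e9, 120):.1f} Mb/s")  # 66.7 Mb/s, the ~67 Mb/s above
```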
The Wizard Gap Updated • Experts have topped out end systems & links • 10 Gb/s NIC bottleneck • 40 Gb/s “link” bandwidth (striped) • Median I2 bulk rate is 3 Mbit/s • See http://netflow.internet2.edu/weekly/ • Current Gap is about 3000:1 • Closing the first factor of 30 should now be “easy”
TCP tuning requires expert knowledge • By design, TCP/IP hides the ’net from upper layers • TCP/IP provides basic reliable data delivery • The “hourglass” between applications and networks • This is a good thing, because it allows: • Invisible recovery from data loss, etc. • Old applications to use new networks • New applications to use old networks • But then (nearly) all problems have the same symptom • Less than expected performance • The details are hidden from nearly everyone
TCP tuning is painful debugging • All problems reduce performance • But the specific symptoms are hidden • Any one problem can prevent good performance • Completely masking all other problems • Trying to fix the weakest link of an invisible chain • General tendency is to guess and “fix” random parts • Repairs are sometimes “random walks” • Repairing one problem at a time, at best • The solution is to instrument TCP
The Web100 project • Instrumentation and autotuning for TCP • TCP has the ideal diagnostic vantage point • TCP-ESTATS-MIB now past IETF WG last call • Will be a standards-track RFC soon • Prototypes for Linux (www.Web100.org) and Windows Vista • TCP autotuning • Automatically adjusts TCP buffers • Linux 2.6.17 default maximum window size is 4 MBytes • Announced for Vista - details unknown • New insight • Nearly all symptoms scale with round-trip time
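As a concrete illustration (a minimal sketch, Linux-specific; the /proc path is standard but the values vary by kernel and configuration), one can read the receive-buffer limits that autotuning works within:

```python
# Minimal sketch (Linux-only): read the kernel's TCP receive-buffer
# autotuning limits. The third field of tcp_rmem is the largest window
# autotuning will grow to (4 MBytes by default in 2.6.17).

def tcp_rmem_limits(path="/proc/sys/net/ipv4/tcp_rmem"):
    with open(path) as f:
        min_b, default_b, max_b = (int(v) for v in f.read().split())
    return min_b, default_b, max_b

if __name__ == "__main__":
    _, _, ceiling = tcp_rmem_limits()
    print(f"autotuning ceiling: {ceiling / 2**20:.1f} MiB")
```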
Nearly all symptoms scale with RTT • For example • TCP buffer space, network loss and reordering, etc. • On a short path TCP can compensate for the flaw • Local client to server: all applications work • Including all standard diagnostics • Remote client to server: all applications fail • Leading to falsely implicating other components
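The buffer-space case is just the bandwidth-delay product; a sketch with illustrative numbers (not from the slides):

```python
# Why symptoms scale with RTT: the window TCP needs is rate * RTT
# (the bandwidth-delay product), so a fixed flaw such as a small
# buffer is invisible locally but dominates on a long path.

def required_window_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: the window needed to fill the pipe."""
    return rate_bps * rtt_s / 8

for rtt_ms in (1, 10, 70):
    w = required_window_bytes(67e6, rtt_ms / 1e3)  # 67 Mb/s target
    print(f"RTT {rtt_ms:3d} ms -> window {w / 1024:7.1f} KiB")
# A 64 KiB window passes easily at 1 ms but caps the rate near
# 7 Mb/s at 70 ms: same host, same flaw, very different symptom.
```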
The confounded problems • For nearly all network flaws • The only symptom is reduced performance • But the reduction is scaled by RTT • Therefore, flaws are undetectable on short paths • False pass for even the best conventional diagnostics • Leads to faulty inductive reasoning about flaw locations • Diagnosis often relies on tomography and complicated inference techniques • This is the real end-to-end problem
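For loss, the RTT scaling can be made concrete with the macroscopic TCP model (Mathis et al., CCR 1997), rate ≈ (MSS/RTT)·(C/√p); the numbers below are illustrative only:

```python
import math

# Loss-limited TCP rate per the macroscopic model: MSS/RTT * C/sqrt(p).
# The same loss rate p that passes a short-path test can fail the full
# path, because the achievable rate falls as 1/RTT.

def tcp_model_bps(mss_bytes, rtt_s, p, c=math.sqrt(1.5)):
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(p))

p = 1e-4     # 0.01% loss: invisible to most conventional tests
for rtt_ms in (1, 70):
    rate = tcp_model_bps(1460, rtt_ms / 1e3, p)
    print(f"RTT {rtt_ms:2d} ms -> {rate / 1e6:7.1f} Mb/s")
# ~1430 Mb/s at 1 ms, ~20 Mb/s at 70 ms: a textbook false pass.
```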
The NPAD solution: • For applications (and upper layers) • Bench test over an (emulated) ideal long path • Topic of a future talk • “Pathdiag” tests short path sections to localize a flaw • Use Web100 to collect detailed statistics • Loss, delay, queuing properties, etc • Use models to extrapolate results to the full path • Assume that the rest of the path is ideal • You have to specify the end-to-end performance goal • Data rate and RTT • Pass/Fail on the basis of the extrapolated performance
Deploy as a Diagnostic Server • Use pathdiag in a Diagnostic Server (DS) • Specify end-to-end target performance • From server (S) to client (C) (RTT and data rate) • Measure the performance from DS to C • Use Web100 in the DS to collect detailed statistics • On both the path and the client • Extrapolate performance assuming an ideal backbone • Pass/Fail on the basis of extrapolated performance
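A hypothetical sketch of the extrapolation step (loss only, using the macroscopic model above; the real pathdiag also tests queue space and other path properties): project the measured section's loss onto the full-path target and compare.

```python
import math

# Hypothetical sketch of pathdiag-style extrapolation (not the actual
# NPAD code, and loss-only): assume the rest of the path is ideal and
# project the rate the full path could sustain at the target RTT.

def projected_rate_bps(mss_bytes, target_rtt_s, measured_loss):
    c = math.sqrt(1.5)  # constant from the macroscopic TCP model
    return (mss_bytes * 8 / target_rtt_s) * (c / math.sqrt(measured_loss))

def verdict(target_rate_bps, target_rtt_s, measured_loss, mss=1460):
    rate = projected_rate_bps(mss, target_rtt_s, measured_loss)
    return ("PASS" if rate >= target_rate_bps else "FAIL"), rate

result, rate = verdict(67e6, 0.07, 1e-4)  # 67 Mb/s at 70 ms end-to-end
print(result, f"(projected {rate / 1e6:.1f} Mb/s)")  # FAIL (~20 Mb/s)
```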
Demo • [Figure: laptop client testing to a Diagnostic Server at PSC]
Key NPAD/pathdiag features • Results are intended for end-users • Provides a list of specific items to be corrected • Failed tests are showstoppers for fast apps • Includes explanations and tutorial information • Clear differentiation between client and path problems • Accurate escalation to network or system admins • The reports are public and can be viewed by either • Coverage for a majority of OS and last-mile network flaws • Most of the remaining flaws can be detected with pathdiag in the client or traceroute • Eliminates nearly all(?) false pass results
More features • Tests become more sensitive as the path gets shorter • Conventional diagnostics become less sensitive • Depending on models, perhaps too sensitive • New problem is false fail (e.g. queue space tests) • Flaws no longer completely mask other flaws • A single test often detects several flaws • E.g. find both OS and network flaws in the same test • They can be repaired concurrently • Archived DS results include raw Web100 data • Can reprocess with updated reporting software • New reports from old data • Critical feedback for the NPAD project • We really want to collect “interesting” failures
NPAD/pathdiag deployment • Why should a campus networking organization care? • “Zero effort” solution to mis-tuned end systems • Accurate reports of real problems • You have the same view as the user • Saves time when there really is a problem • You can document reality for management • Suggestion: • Require pathdiag reports for all performance problems
What about impact of the test traffic? • NPAD/pathdiag is single threaded • Only one test at a time • Same load as any well tuned TCP application • Protected by TCP “fairness” • Large flows are generally “softer” than small flows • Large flows are easily disturbed by small flows
Impact • Automatically diagnose first-level problems • Easily expose all path bottlenecks that limit performance to less than 100 Mb/s • Easily expose all end-system/OS problems that limit performance to less than 100 Mb/s • (Will become moot as autotuning is deployed) • Empower the users to apply the proper motivation • Still need to recalibrate user expectations • Less than 1 gigabyte / 2 minutes is too slow • Many paths should support 5 gigabytes/minute • Still less than 1 Gb/s
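Checking that last figure (decimal units again, as in the earlier sketch):

```python
# 5 gigabytes per minute expressed as a link rate, decimal units.
print(f"{5 * 8e9 / 60 / 1e6:.1f} Mb/s")  # 666.7 Mb/s: still under 1 Gb/s
```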
Download and install • User documentation: http://www.psc.edu/networking/projects/pathdiag/ • Follow the link to “Installing a Server” • Easily customized with a site specific skin • Designed to be easily upgraded with new releases • Roughly every 2 months • Improving reports through ongoing field experience • Drops into existing NDT servers • Plans for future integration • Enjoy!
Blast from the past • Same base algorithm as “Windowed Ping” [Mathis, INET’94] • Aka “mping” • See http://www.psc.edu/~mathis/wping/ • Killer diagnostic in use at PSC in the early 90s • Stopped working with the advent of “fast path” routers • Uses a simple fixed-window protocol • Scans the window size in 1-second steps • Measures data rate, loss rate, RTT, etc. as the window changes
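A hypothetical sketch of the fixed-window scan idea (not the original mping code; assumes a reachable RFC 862 UDP echo responder, which few hosts run today, and the host name below is a placeholder):

```python
import socket
import time

# Hypothetical sketch of the fixed-window scan behind mping: keep
# `window` probes outstanding against a UDP echo responder, step the
# window up every second, and report rate and RTT per step.

def windowed_ping(host, port=7, max_window=64, step_time=1.0, size=64):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(0.2)
    seq, in_flight, sent_at = 0, set(), {}
    for window in range(1, max_window + 1):
        acked, rtts = 0, []
        deadline = time.monotonic() + step_time
        while time.monotonic() < deadline:
            while len(in_flight) < window:  # top up the window
                sock.sendto(seq.to_bytes(8, "big").ljust(size, b"\0"),
                            (host, port))
                sent_at[seq] = time.monotonic()
                in_flight.add(seq)
                seq += 1
            try:
                data, _ = sock.recvfrom(2048)
            except socket.timeout:
                in_flight.clear()  # count the outstanding window as lost
                continue
            s = int.from_bytes(data[:8], "big")
            if s in in_flight:
                in_flight.discard(s)
                rtts.append(time.monotonic() - sent_at.pop(s))
                acked += 1
        avg_rtt = sum(rtts) / len(rtts) if rtts else float("nan")
        print(f"window {window:3d}: {acked * size * 8 / step_time / 1e3:8.1f} "
              f"kb/s, avg RTT {avg_rtt * 1e3:6.2f} ms")

# windowed_ping("echo.example.net")  # hypothetical echo host
```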