Web100. Wendy Huntoon - PSC Jim Ferguson - NCSA I2 Members Meeting May 2002. Outline. Project Overview Motivation: What is the problem Web100 Collaboration Progress to Date Standardization Process Code Release Code Capabilities Overview of Users Web100 Resources.
Web100 Wendy Huntoon - PSC Jim Ferguson - NCSA I2 Members Meeting May 2002
Outline • Project Overview • Motivation: What is the problem • Web100 Collaboration • Progress to Date • Standardization Process • Code Release • Code Capabilities • Overview of Users • Web100 Resources
Motivations: What’s the Problem? • High performance flows slower than line rate • Delays continue/increase even with higher bandwidth • TCP tuning issues are non-trivial • Poorly conceived stacks • Router/switch buffer queues inadequate • Slow start and AIMD algorithm • Eliminate/dramatically reduce the “wizard gap” • Need for kernel instrumentation set for TCP variables
The Wizard Gap • TCP over a long-haul path
Year | Wizards | Non-wizards | Ratio
     | 1Mb/s   | 300kb/s     | 3:1
     | 10Mb/s  |             |
1995 | 100Mb/s |             |
     | 1Gb/s   | 3Mb/s       | 300:1
• Scientists/researchers not happy with this
TCP tuning is painful debugging • All problems limit performance • IP routing, long round trip times • Improper MSS negotiations or path MTU discovery • IP Packet reordering • Packet losses, congestion, lame hardware • TCP sender or receive buffer space • Inefficient applications • Any one problem can mask all the others and confound all but the best (and few) tuning gurus • Need for better diagnostics and visibility into problems
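One item on this slide, TCP buffer space, lends itself to a quick worked check. A connection can keep at most one window of data in flight per round trip, so its throughput is capped at window / RTT. The Python sketch below illustrates that arithmetic; the function names and the example numbers (64 KiB window, 70 ms path, 1 Gb/s target) are assumptions chosen for illustration, not figures from the talk.

```python
def window_limited_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on throughput (bits/s) when the window is the only limit."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to sustain target_bps."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return target_bps * rtt_seconds / 8


if __name__ == "__main__":
    # Assumed example values, not measurements.
    cap = window_limited_throughput_bps(64 * 1024, 0.070)
    print(f"64 KiB window at 70 ms RTT caps throughput near {cap / 1e6:.1f} Mb/s")
    need = window_needed_bytes(1e9, 0.070)
    print(f"1 Gb/s at 70 ms RTT needs roughly {need / 1e6:.2f} MB in flight")
```

A small default buffer on a long path is enough on its own to produce a gap of the kind the Wizard Gap slide describes, whatever the link speed.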
Goal and Method • Make it “easy” (transparent) for non-experts to achieve higher throughput performance • Enhance TCP capabilities with better (finer grain) kernel instrumentation and automatic controls • Real time triage capability determines sender, receiver, and/or network bottlenecks
Why Focus on TCP • TCP has an ideal vantage point into throughput problem space • TCP can identify bottleneck subsystem(s) • TCP already measures the network (some) • TCP can measure the application • TCP can adjust itself (auto-tuning feedback)
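As an illustration of "TCP already measures the network": every TCP sender maintains a smoothed round-trip-time estimate to set its retransmission timer. The sketch below follows the standard estimator described in RFC 6298 (gains of 1/8 and 1/4, timeout of SRTT + 4 × RTTVAR). It is a simplified model for illustration, not Web100 code: the class name is hypothetical and the clock-granularity term from the RFC is omitted.

```python
class RttEstimator:
    """Smoothed RTT and retransmission timeout in the style of RFC 6298."""

    ALPHA = 1 / 8  # gain applied to new samples for the smoothed RTT
    BETA = 1 / 4   # gain applied to new samples for the RTT variance

    def __init__(self, min_rto: float = 1.0) -> None:
        self.srtt = None      # smoothed RTT in seconds, None until first sample
        self.rttvar = None    # RTT variance estimate in seconds
        self.min_rto = min_rto

    def update(self, sample: float) -> None:
        """Fold one RTT measurement (seconds) into the estimate."""
        if self.srtt is None:
            # First measurement initializes both values.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # The variance is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample

    @property
    def rto(self) -> float:
        """Retransmission timeout in seconds, rounded up to min_rto."""
        if self.srtt is None:
            return self.min_rto
        return max(self.min_rto, self.srtt + 4 * self.rttvar)


if __name__ == "__main__":
    est = RttEstimator()
    for rtt in (0.100, 0.200, 0.150):
        est.update(rtt)
        print(f"sample={rtt:.3f}s srtt={est.srtt:.4f}s rto={est.rto:.3f}s")
```

Values like these already exist inside the stack; the point of the slide is that exposing them costs little and gives diagnostic tools a view no external probe has.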
Web100 Collaboration • Funded by the NSF • Currently Year 2 of a 3 Year grant. • Cisco URP for initial seed funding. • Collaborators • PSC (Matt Mathis, R. Reddy, Janet Brown, John Heffner) • NCAR (Peter O’Neil, Marla Meehl) • NCSA (John Estabrook, Tanya Brethour, Stephen Engelhardt, Jim Ferguson)
What is in the code • Web100 software consists of: • TCP Kernel Instrument Set (TCP-KIS) • Instruments coded directly into the operating system kernel. • Derived Instrument Set (DIS) • Information that is collected based on KIS parameters. • Application Code • Tools, applications, etc. that use the information provided by the KIS and DIS.
Kernel Instrument Set • Definition • Set of instruments designed to collect as much of the information as possible to enable a user to isolate the performance problems of a TCP connection. • How it is implemented • Each instrument is a variable in a "stats" structure that is linked through the kernel socket structure. • The Linux /proc interface is used to expose these instruments outside the kernel.
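To make the /proc idea concrete, the sketch below shows how a user-space tool might discover connections if the kernel exposed one numbered directory per connection. That layout is an assumption made for illustration, and both function names are hypothetical; the slides do not specify the actual file layout or format. The demo runs against a temporary directory so it works on any machine.

```python
import os
import tempfile
from typing import Iterable, List


def list_connection_ids(proc_root: str) -> List[int]:
    """Return the numeric subdirectory names under proc_root, sorted."""
    ids = []
    for name in os.listdir(proc_root):
        if name.isdigit() and os.path.isdir(os.path.join(proc_root, name)):
            ids.append(int(name))
    return sorted(ids)


def make_fake_proc_tree(connection_ids: Iterable[int]) -> str:
    """Build a throwaway directory that mimics the assumed layout."""
    root = tempfile.mkdtemp(prefix="fake_proc_")
    for cid in connection_ids:
        os.mkdir(os.path.join(root, str(cid)))
    # A non-numeric entry, to show that it is skipped.
    open(os.path.join(root, "header"), "w").close()
    return root


if __name__ == "__main__":
    root = make_fake_proc_tree([12, 3, 7])
    print(list_connection_ids(root))
```

Exposing instruments through the filesystem means ordinary file operations are enough to reach them, which is why command-line and GUI tools can share one small access library.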
What is the TCP-KIS? • TCP-KIS instruments group naturally into categories. • Currently roughly 19 categories. • Already more than 125 instruments have been developed. • For each instrument: • Precise (standards-ready) definition. • Instrument code in the kernel. • Implementation verification tests • Does the kernel implementation meet the definition? • Prototype diagnostic tool(s) to demonstrate functionality and effectiveness.
TCP-KIS • Basic instrumentation examples • Connection ID: 5-tuple that uniquely identifies a connection. • State: determines what protocol features or algorithms are enabled. • Traffic out: statistics that aggregate the packets and traffic sent out on a connection.
Local Sender Triage • Group of instruments associated with the local sender. • Determine what subsystems are throttling TCP data transmission. • Three parallel sets of instruments that measure: • Receiver Window • Network Congestion • Sender's Availability
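The triage idea can be sketched as a simple accounting exercise: if the sender records how long it was held back by each of the three causes, the largest share points at the subsystem to investigate first. The class, field names, and units below are hypothetical and stand in for whatever the real instruments record.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SenderTriage:
    """Time (seconds) the sender spent limited by each cause. Hypothetical fields."""
    receiver_window_s: float     # held back by the receiver's advertised window
    network_congestion_s: float  # held back by the congestion window
    sender_s: float              # held back by the sender itself (nothing to send)

    def fractions(self) -> Dict[str, float]:
        """Share of the total limited time attributed to each cause."""
        parts = {
            "receiver window": self.receiver_window_s,
            "network congestion": self.network_congestion_s,
            "sender": self.sender_s,
        }
        total = sum(parts.values())
        if total <= 0:
            raise ValueError("no limited time recorded")
        return {name: value / total for name, value in parts.items()}

    def dominant(self) -> str:
        """Name of the cause with the largest share."""
        shares = self.fractions()
        return max(shares, key=shares.get)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    triage = SenderTriage(receiver_window_s=1.0, network_congestion_s=6.0, sender_s=3.0)
    for name, share in triage.fractions().items():
        print(f"{name}: {share:.0%}")
    print("look first at:", triage.dominant())
```

Keeping the three measurements parallel matters: they share a unit and a time base, so they can be compared directly instead of being inferred from unrelated counters.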
Local Sender Groups • Other groups of instruments associated with the Local Sender: • Local Sender Congestion Model • Local Sender Loss Model • Local Sender Re-order Model • Local Sender RTT • Local Sender Segment Size • Local Sender Bottlenecks • Local Sender Tuning
Other Instruments • Similar instruments for the Local Receiver. • Observed Receiver instruments • Often inferred from the data stream. • E.g., Observed Receiver: the receiver's state is inferred from the ACK stream. • Application Interface • Future instruments to collect statistics on how the application is using the network.
Userland Distribution • Released asynchronously with kernel distribution • Currently at Alpha 1.1 • Version 1.2 release imminent • Consists of • The web100 library • Command line utilities • GUI utilities
Web100 Library • Web100 kernel exposes critical TCP variables/instruments through /proc • Web100 library provides the necessary access functions to access these variables/instruments • Functions • Read the value of a variable/instrument • Snap shot of a group (facilitates atomic reading of a group of variables) • Modify tunable variables (ex. send buffer size) • Etc …
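The value of a group snapshot is that all variables in it are read at the same instant, so quantities derived from them are mutually consistent. The sketch below models that with plain dictionaries. It does not use the Web100 library API; the function names are hypothetical and the counter names are illustrative.

```python
from typing import Dict


def snapshot(live_counters: Dict[str, int]) -> Dict[str, int]:
    """Copy every counter at once so later changes do not affect the copy."""
    return dict(live_counters)


def delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-counter change between two snapshots (counters present in both)."""
    return {name: after[name] - before[name] for name in before if name in after}


def retransmit_ratio(change: Dict[str, int]) -> float:
    """Fraction of packets in the interval that were retransmissions."""
    sent = change.get("PktsOut", 0)
    return change.get("PktsRetrans", 0) / sent if sent else 0.0


if __name__ == "__main__":
    # Stand-in for kernel state; values are made up.
    live = {"PktsOut": 1000, "PktsRetrans": 10, "DataBytesOut": 1_448_000}
    first = snapshot(live)
    live["PktsOut"] += 500
    live["PktsRetrans"] += 25
    live["DataBytesOut"] += 724_000
    second = snapshot(live)
    change = delta(first, second)
    print(change)
    print(f"retransmit ratio over the interval: {retransmit_ratio(change):.1%}")
```

Reading the counters one at a time would let packets slip in between reads, so a ratio such as retransmissions per packet sent could be computed from values that never coexisted.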
Utilities • Command line utilities • Useful in batch scripts • Serve as demo codes for the usage of web100 library • GUI utilities • Based on GTK+ • Useful for troubleshooting network applications • Serve as examples for application developers
Timeline - Year 1 • Alpha code development • Establish User Support • www.web100.org • Initial User Community • Very limited to begin with. • Knowledgeable users, expected to provide technical input on the code. • Understand and develop applications.
Timeline - Year 2 • Began standardization process. • Develop MIB • Submit to IETF • Develop public code • Fix bugs in alpha versions • Add instrumentation • Code release • Continue code development • Identify and add new instruments
Code Releases - To date • Initial Release • Alpha 0.2, released May 23, 2001 • Alpha 0.3, released Sept. 19, 2001 • Alpha 1.0: separation of kernel and userland code • Kernel Patch: • Alpha 1.1 for Linux 2.4.16, released March 18, 2002 • Alpha 1.0, released March 1, 2002 • Alpha 1.0, released February 26, 2002 • Userland: • Alpha 1.1, released February 28, 2002 • Alpha 1.0, released February 26, 2002
Timeline - Year 3 • New pathprobe diagnostic tool (wip, unreleased). • Add another 10-12 instruments. • Review instruments and code with other wizards. • Gain vendor support for ideas and code. • Finalize IETF draft by December IETF meeting.
Milestones • Over a year of testing with ~30 alpha testers • Including: SLAC, ORNL, LBNL, and universities • www.net100.org • Modified Linux kernel supports 2.4.16 • Separation between KIS and library functions • draft-ietf-tsvwg-tcp-mib-extension-00.txt • draft-ietf-ipngwg-rfc2012-update-01.txt
Web100 Collaborator Activity • Rich Carlson, ANL • Tom Dunnigan, ORNL • Tom Hacker, U. of Michigan • Doug Chang, SLAC • Andreas Burkhardt & Matt Grob, Qualcomm • Larry Dunn & Scott Dier, Cisco/U. of Minnesota • Jason Lee, LBL
Collaborator Assistance • Bugs! • Kernel • Utilities • Release • Request new features • Review and criticize documentation • Way too easy on us
Collaborator Activity • Carlson/ANL is working on a troubleshooting guide for LANs. • Set up a network of 13 identically equipped PIII machines connected via a Cisco 5500 network switch, running Web100-enabled Linux. • Introduces typical network faults (duplex mismatches, other config errors) and analyzes data for “signatures” of these faults. • Modified Iperf 1.2 to collect variables and reverse flow.
Collaborator Activity • Dunnigan/ORNL has found web100 helpful in seeing losses/retransmission and congestion avoidance parameters of individual TCP flows, and for tuning flows • Has developed a Web100-enabled ttcp • Has developed a daemon that logs web100 variables for designated paths when a flow closes • Has developed an autotuning daemon that uses web100 to tune flows, including modifications to web100 to support "event notification", so the daemon knows when a new flow/socket is opened
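The core decision an autotuning daemon has to make can be sketched as follows: size the send buffer to a multiple of the path's bandwidth-delay product, within sane limits, then apply it to the socket. This is not the ORNL daemon and does not use Web100; it uses only the standard sockets option SO_SNDBUF, and the function names, headroom factor, floor, and ceiling are assumptions chosen for illustration.

```python
import socket


def target_buffer_bytes(
    bandwidth_bps: float,
    rtt_seconds: float,
    headroom: float = 2.0,
    floor: int = 64 * 1024,
    ceiling: int = 16 * 1024 * 1024,
) -> int:
    """Buffer size: headroom x bandwidth-delay product, clamped to [floor, ceiling]."""
    bdp_bytes = bandwidth_bps * rtt_seconds / 8
    return int(min(ceiling, max(floor, headroom * bdp_bytes)))


def apply_send_buffer(sock: socket.socket, size_bytes: int) -> int:
    """Request a send buffer size and return what the OS actually granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


if __name__ == "__main__":
    # Assumed path characteristics: 100 Mb/s bottleneck, 50 ms round trip.
    wanted = target_buffer_bytes(100e6, 0.050)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        granted = apply_send_buffer(sock, wanted)
    print(f"requested {wanted} bytes, OS reports {granted} bytes")
```

The value the operating system reports back may differ from the request, because kernels apply their own limits and accounting. A daemon built on per-connection instruments could pick the bandwidth and RTT inputs from measurements instead of fixed guesses, which is where the event notification mentioned above would come in.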
Collaborator Activity • Hacker/U.Michigan has been using the Web100 software to help tune and diagnose end-to-end network performance problems across the U-M campus network as well as across Abilene for the Visible Human and Atlas projects at U-M. • Chang/SLAC is looking to fix a performance problem between Linux and Solaris machines.
Collaborator Activity • Qualcomm is using Web100 to measure TCP performance over certain types of high-speed wireless links under development. Web100 is partially integrated into some of Qualcomm's other tools, in the sense that output reports are published automatically in a format similar to the one those tools use. • Dunn/Cisco is currently using Web100 for a class at U.Minnesota. Includes accounts on a test machine at NCSA.
Collaborator Activity • Lee/LBL has obtained accounts at SLAC and ANL for WAN testing, and has co-located one of LBL's machines in Washington D.C. to do testing over SuperNet. Still in the process of testing all this out. • Keith Jackson at LBL has written Python wrappers to the Web100 calls using SWIG.
Web100 Summary • Main WWW site: www.web100.org • Freely available software distribution • www.web100.org/download • hundreds of downloads • Please be cognizant of impacts on others • Please use, test, provide feedback, contribute code • IETF standards process to benefit all • Attention turning to working with OS vendors to incorporate standards enhancements into their stacks