Network-aware OS

Network-aware OS DOE/MICS Project Final Review September 16, 2004 Tom Dunigan thd@ornl.gov Matt Mathis mathis@psc.edu Brian Tierney bltierney@lbl.gov ORNL: Florence Fowler, Steven Carter, Nagi Rao, Bill Wing PSC: Raghu Reddy, John Heffner, Janet Brown LBNL: Jason Lee, Martin Stouffer

Roadmap www.net100.org • Motivation & Background • Net100 project components • Web100 • network probes & sensors • protocol analysis and tuning • Results • TCP tuning daemon • Tuning experiments • Project contributions • DOE-funded project (Office of Science) • $2.6M, 3 yrs beginning 9/01 • LBNL, ORNL, PSC, NCAR • Net100 project objectives: (network-aware operating systems) • measure, understand, and improve end-to-end network/application performance • tune network protocols and applications (grid and bulk transfer) • emphasis: TCP bulk transfer over high delay/bandwidth nets

Motivation • Poor network application performance • High bandwidth paths, but apps are slow • Is it the application? OS? network? … Yes • Often need a network “wizard” • Changing: bandwidths • 9.6 Kbs … 1.5 Mbs … 45 … 100 … 1000 … ? Gbs • Unchanging: TCP • speed of light (RTT) • packet size (MSS/MTU) still 1500 bytes • TCP congestion control • TCP is lossy by design! • 2x overshoot at startup, sawtooth • Recovery rate proportional to MSS/RTT^2 • recovery after a loss can be very slow on today’s high delay/bandwidth links -- unacceptable on tomorrow’s links: • 10 Gbs cross country: recovery time > 1 hr.!! [Figure: ORNL to NERSC ftp, GigE/OC12 (600 Mbs), 80 ms RTT. Plot labels: instantaneous bandwidth; average bandwidth; 8 Mbs; 40 seconds; early startup losses; linear recovery at 0.5 Mb/s!]
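The "recovery time > 1 hr" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes standard TCP (cwnd halved on loss, then +1 packet per RTT), 1500-byte packets, and a 100 ms cross-country RTT; the RTT value and the function names are assumptions for illustration, not taken from the slides.

```python
def recovery_time_seconds(bandwidth_bps, rtt_s, mss_bytes=1500):
    """Time for standard TCP to regrow cwnd after a single loss halves it."""
    # Packets in flight needed to fill the path: bandwidth * RTT / packet size.
    full_window = bandwidth_bps * rtt_s / (mss_bytes * 8)
    # cwnd is halved, then additive increase restores 1 packet per RTT.
    rtts_needed = full_window / 2
    return rtts_needed * rtt_s


def recovery_slope_bps_per_s(rtt_s, mss_bytes=1500):
    """Rate at which throughput climbs during linear recovery: MSS / RTT^2."""
    return mss_bytes * 8 / rtt_s ** 2


if __name__ == "__main__":
    t = recovery_time_seconds(10e9, 0.100)  # 10 Gbs, assumed 100 ms RTT
    print(f"10 Gbs, 100 ms RTT: about {t / 60:.0f} minutes to recover")
```

Under these assumptions a single loss costs roughly 69 minutes of recovery. Delayed ACKs, which slow additive increase further, would lengthen it.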

TCP 101 • adaptable and fair • flow-controlled by sender/receiver buffer sizes • self-clocking with positive ACK’s of in-sequence data • sensitive to packet size (MTU) and RTT • slow start -- +1 packet per each packet ACK’d (exponential) • congestion window (cwnd)-- max packets that can be in flight • packet loss: 3 dup ACKs or timeout (AIMD) • cut cwnd in half (Multiplicative Decrease) • add 1 packet to cwnd per RTT (Additive Increase) • Workarounds: • parallel streams • non-TCP (UDP) applications • Net100 (no changes to applications)
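The slow-start and AIMD rules listed above can be illustrated with a toy per-RTT model. This is a simplified sketch, not the Linux or Web100 implementation: it tracks cwnd in packets, treats each loss as a single halving, and ignores timeouts and flow control. The function and parameter names are hypothetical.

```python
def simulate_cwnd(rtts, ssthresh, loss_rtts=()):
    """Return cwnd (in packets) at the start of each RTT."""
    cwnd = 1.0
    losses = set(loss_rtts)
    history = []
    for t in range(rtts):
        history.append(cwnd)
        if t in losses:
            # Multiplicative decrease: cut cwnd in half and leave slow start.
            cwnd = max(cwnd / 2, 1.0)
            ssthresh = cwnd
        elif cwnd < ssthresh:
            # Slow start: +1 packet per ACK doubles cwnd every RTT.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Additive increase: +1 packet per RTT.
            cwnd += 1
    return history


if __name__ == "__main__":
    print(simulate_cwnd(7, ssthresh=8, loss_rtts=[4]))
```

Plotting the returned list for a long run with periodic losses gives the familiar sawtooth.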

Net100 components • Web100 Linux kernel (NSF) • instrumented TCP stack (IETF MIB draft) • Path characterization • Network Tuning and Analysis Framework (NTAF) • both active and passive measurement tools • data base of measurements • TCP protocol analysis and tuning • simulation/emulation • ns • TCP-over-UDP (atou) • NISTNet • kernel tuning extensions • tuning daemon (WAD)

Web100 • NSF funded (PSC/NCAR/NCSA) web100.org • Modified Linux kernel • instrumented kernel to read/set TCP variables for a specific flow • readable: RTT, counts (bytes, pkts, retransmits, dups), state (SACKs, windowscale, cwnd, ssthresh) • settable: buffer sizes • 100+ TCP variables (IETF MIB) (/proc/web100/) • GUI to display/modify a flow’s TCP variables in real time • API for network-aware applications or tuning daemon • Net100 extensions: • additional tuning variables and algorithms • event notification for a tuning daemon • Java bandwidth tester http://cruise.ornl.gov:7123

Network Tool Analysis Framework (NTAF) • Configure and launch network tools • measure bandwidth/latency (iperf, pchar, pipechar) • augment tools to report Web100 data • Collect and transform tool results • use NetLogger to transform results to a common format • Save results for short-term auto-tuning and archive for later analysis • compare predicted to actual performance • measure effectiveness of tools and auto-tuning • provide data that can be used to predict future performance • invaluable for comparing tools (pathload/pchar/netest) • Net100 hosts at: LBNL, ORNL, PSC, NCAR, NERSC, SLAC, UT, CERN, Amsterdam, ANL

TCP flow visualization - Web interface for data archive and visualization

TCP tuning • “enable” high speed • need buffer = bandwidth*RTT -- autotune • ORNL/NERSC (80 ms, OC12) needs 6 MB • faster slow-start • avoid losses • modified slow-start • reduce bursts • anticipate loss (ECN, Vegas?) • reorder threshold • speed recovery • bigger MTU or “virtual MSS” • modified AIMD (0.5,1) (Floyd, Kelly) • delayed ACKs, initial window, slow-start increment • avoid congestion collapse, be fair (?) … intranets, QoS • Net100: ns simulation, NISTNet emulation, “almost TCP over UDP” (atou), WAD/Internet [Figure: ns simulation, 500 Mbs link, 80 ms RTT. Packet loss early in slow start. Standard TCP with delayed ACKs takes 10 minutes to recover!]
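The "buffer = bandwidth*RTT" rule is simple enough to compute directly. The sketch below reproduces the 6 MB figure for the ORNL/NERSC path, using the 600 Mbs OC12 rate quoted earlier in the deck, and shows how an application could request that size with the standard socket options. The helper names are hypothetical, and the kernel may clamp the request to its configured maximum.

```python
import socket


def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes: the buffer needed to keep the path full."""
    return bandwidth_bps * rtt_s / 8


def request_buffers(sock, nbytes):
    """Ask the kernel for send/receive buffers of nbytes (it may grant less)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, int(nbytes))
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, int(nbytes))


if __name__ == "__main__":
    need = bdp_bytes(600e6, 0.080)  # OC12 at 600 Mbs, 80 ms RTT
    print(f"buffer needed: {need / 1e6:.1f} MB")
```

Calling `request_buffers` on a socket before `connect()` is the classic hand-tuning step; the point of the WAD is to get the same effect without touching the application.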

TCP Tuning Daemon • Work-around Daemon (WAD) • tune unknowing sender/receiver at startup and/or during flow • Web100 kernel extensions • pre-set windowscale to allow dynamic tuning • uses netlink to alert daemon of socket open/close (or poll) • besides existing Web100 buffer tuning, new tuning parameters and algorithms • knobs to disable Linux 2.4 caching, burst mgt., and sendstall • config file with static tuning data • mode specifies dynamic tuning (AIMD options, NTAF buffer size, concurrent streams) • daemon periodically polls NTAF for fresh tuning data • can do out-of-kernel tuning (e.g., Floyd) • written in C (also Python version)

WAD config file:

    [bob]
    src_addr: 0.0.0.0
    src_port: 0
    dst_addr: 10.5.128.74
    dst_port: 0
    mode: 1
    sndbuf: 2000000
    rcvbuf: 100000
    wadai: 6
    wadmd: 0.3
    maxssth: 100
    divide: 1
    reorder: 9
    sendstall: 0
    delack: 0
    floyd: 1
    kellyai: 0
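The WAD config entry on this slide uses an INI-like "section plus key: value" layout. As an illustration only, such an entry can be read with Python's standard `configparser`. The slides do not show the WAD's own parser, so treating the file as strict INI and the helper name `load_wad_entries` are both assumptions.

```python
import configparser

# Sample entry copied from the slide.
SAMPLE = """
[bob]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 10.5.128.74
dst_port: 0
mode: 1
sndbuf: 2000000
rcvbuf: 100000
wadai: 6
wadmd: 0.3
maxssth: 100
divide: 1
reorder: 9
sendstall: 0
delack: 0
floyd: 1
kellyai: 0
"""


def load_wad_entries(text):
    """Parse INI-style text into {entry_name: {key: string_value}}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


if __name__ == "__main__":
    bob = load_wad_entries(SAMPLE)["bob"]
    print(bob["dst_addr"], int(bob["sndbuf"]), float(bob["wadmd"]))
```

Values come back as strings, so callers convert the numeric ones (`sndbuf`, `wadai`, `wadmd`, and so on) as needed.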

Experimental results • Evaluating the tuning daemon in the wild • emphasis: bulk transfers over high delay/bandwidth nets (Internet2, ESnet) • tests over: 10GigE/OC192, OC48, OC12, OC3, ATM/VBR, GigE+jumbo frame, FDDI, 100/10T, cable, ISDN, wireless (802.11b), dialup • tests over NISTNet testbed (speed, loss, delay) • Various TCP tuning options • buffer tuning (static, auto, and dynamic/NTAF) • AIMD mods (including Floyd, Kelly, Vegas, static, virtual MSS) • slow-start mods • parallel streams vs single tuned vs UDP transports [Figure: NISTNet host]

Buffer tuning • Classic buffer tuning • network-challenged app. gets 10 Mbs • same app., WAD/NTAF tuned buffer gets 143 Mbs ORNL to PSC, OC12, 80ms RTT • Autotuning buffers (kernel) • Linux 2.4, Feng’s Dynamic Right Sizing • Net100 autotuning • receiver estimates RTT • receiver advertises window 2 times data recv’d in RTT • buffer size grows dynamically to 2x bandwidth*RTT • separate application buffers from kernel buffers ORNL to PSC, OC192, 30 ms RTT
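The receiver-side autotuning rule above (advertise twice the data received in the last RTT) can be sketched as a small per-RTT model. This is a simplified illustration, not the Net100 kernel code: it assumes the sender always fills the advertised window, the path delivers at most one bandwidth-delay product per RTT, and there is no loss. All names are hypothetical.

```python
def next_advertised_window(current_window, bytes_received_last_rtt, max_window):
    """Advertise 2x what arrived in the last RTT; never shrink, never exceed the cap."""
    target = 2 * bytes_received_last_rtt
    return min(max(current_window, target), max_window)


def grow_window(initial_window, path_bdp, max_window, rtts):
    """Trace the advertised window over a number of RTTs."""
    window = initial_window
    trace = []
    for _ in range(rtts):
        # The sender can deliver at most one window, and the path at most one BDP.
        received = min(window, path_bdp)
        window = next_advertised_window(window, received, max_window)
        trace.append(window)
    return trace


if __name__ == "__main__":
    # 64 KB initial window, 6 MB path BDP, 16 MB cap (all illustrative values).
    print(grow_window(64 * 1024, 6_000_000, 16_000_000, 10))
```

In this model the window doubles each RTT while the transfer is window-limited and then settles at twice the bandwidth*RTT product, matching the behavior described on the slide.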

Speeding recovery • Virtual MSS • tune TCP’s additive increase (WAD_AI) • add k segments per RTT during recovery • k=6 like GigE jumbo frame, but: • interrupt rate not reduced • doesn’t do k segments for initial window Selectable TCP AIMD algorithms: Floyd HS TCP: as cwnd grows increase AI and decrease MD, do the reverse when cwnd shrinks Kelly scalable TCP: use MD of 1/8 instead of 1/2 and add % of cwnd (e.g. 1%) each RTT Amsterdam-Chicago GigE via 10GigE, 100 ms RTT UDP burst
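To see why these AIMD variants speed recovery, compare the number of RTTs each needs to regain its pre-loss cwnd. The sketch below covers standard TCP, the virtual-MSS setting (k=6), and Kelly's scalable TCP with the 1/8 decrease and 1% per-RTT increase quoted above. Floyd's HS TCP is omitted because its parameters vary with cwnd. The function names and the example cwnd are assumptions.

```python
import math


def additive_recovery_rtts(cwnd, md=0.5, ai=1.0):
    """RTTs to regain cwnd after cutting it by fraction md and adding ai packets per RTT."""
    return math.ceil(cwnd * md / ai)


def scalable_recovery_rtts(md=0.125, growth=0.01):
    """RTTs to regain cwnd when it is cut by md and then grows by a fixed fraction per RTT."""
    return math.ceil(math.log(1 / (1 - md)) / math.log(1 + growth))


if __name__ == "__main__":
    cwnd = 40000  # packets; illustrative window for a fast, long path
    print("standard TCP:      ", additive_recovery_rtts(cwnd))
    print("virtual MSS (k=6): ", additive_recovery_rtts(cwnd, ai=6))
    print("scalable TCP:      ", scalable_recovery_rtts())
```

The additive schemes recover in time proportional to cwnd, while the proportional increase recovers in a fixed number of RTTs regardless of window size.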

WAD tuning • Modified slow-start and AI • often losses in slow-start • WAD-tuned Floyd slow-start and fixed AI (6) [Figure: ORNL to NERSC, OC12, 80 ms RTT] • WAD-tuned AIMD and slow-start • parallel streams AIMD (1/(2k),k) • exploit TCP’s fairness • WAD-tuned single stream (0.125,4) • WAD-tuned single stream (0.125,4) + Floyd slow-start [Figure: ORNL to CERN, OC12, 150 ms RTT]

Workaround: parallel streams • Takes advantage of TCP’s fairness • Faster startup, k buffers • faster recovery • often only 1 stream loses a packet • MD: 1/(2k) rather than 1/2 • AI: k times faster linear phase • BUT • requires rewrite of applications • how many streams? Buffer size? • GridFTP, bbftp, psocket lib [Figures: “Alice and Bob sharing”; “Clever Alice -- 3 streams. Bad girl ...”]
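The "MD: 1/(2k), AI: k" equivalence follows from what happens when only one of k equal streams loses a packet. A minimal sketch, with hypothetical function names:

```python
def equivalent_single_stream(k):
    """(MD, AI) for one stream that mimics k parallel streams when only one stream loses."""
    return 1 / (2 * k), k


def aggregate_after_single_loss(total_cwnd, k):
    """Aggregate cwnd after one of k equal streams halves its own share."""
    per_stream = total_cwnd / k
    return total_cwnd - per_stream / 2


if __name__ == "__main__":
    md, ai = equivalent_single_stream(4)
    print(f"k=4 -> MD={md}, AI={ai}")
    print(aggregate_after_single_loss(800, 4))
```

For k=4 this gives (0.125, 4), the same pair used for the WAD-tuned single stream in the preceding slide.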

GridFTP tuning
Can a tuned single stream compete with parallel streams? Mostly not with “equivalence” tuning, but sometimes…. Parallel streams have a slow-start advantage. WAD can divide buffer among concurrent flows -- fairer/faster? Tests inconclusive. Testing on the real Internet is problematic. Is there a “congestion metric”? Per unit of time?

Flow       Mbs   congestion   re-xmits
untuned     28        4           30
tuned       74        5          295
parallel    52       30          401

untuned     25        7           25
tuned       67        2          420
parallel    88       17          440

Buffers: 64K I/O, 4MB TCP. Data/plots from Web100 tracer.

Recent Net100 research • more user-friendly WAD, WAD-lite • No NTAF, bandwidth test thread • invited to submit Web100/Net100 mods to Linux 2.6 • port to Cray X1 • Linux network front-end • added Net100 kernel, 4x improvement in wide-area TCP! • port to SGI Altix • TCP Vegas • Vegas avoids loss (if RTT increasing, Vegas backs off) • can be configured to compete with standard TCP (Feng) • Caltech’s FAST (adjusts alpha dynamically) • comparison with other “work arounds” • parallel streams • non-TCP (SABUL, FOBS, TSUNAMI, RBUDP, SCTP) • additional accelerants • slow-start initial/increment • reorder resilience • delayed ACKs
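The Vegas behavior mentioned above (back off when RTT is increasing) can be sketched with the textbook Vegas rule: estimate how many packets are queued in the network from the gap between expected and actual throughput, and nudge cwnd to keep that estimate between two thresholds. The alpha and beta values below are illustrative defaults, not the settings used in the Net100 experiments.

```python
def vegas_adjust(cwnd, base_rtt, current_rtt, alpha=1.0, beta=3.0):
    """Return the next cwnd (packets) under a simplified Vegas congestion-avoidance rule."""
    expected = cwnd / base_rtt       # throughput if nothing were queued
    actual = cwnd / current_rtt      # throughput actually achieved
    queued = (expected - actual) * base_rtt  # estimated packets sitting in queues
    if queued < alpha:
        return cwnd + 1              # path looks idle: probe for more bandwidth
    if queued > beta:
        return cwnd - 1              # RTT has grown: back off before a loss occurs
    return cwnd


if __name__ == "__main__":
    print(vegas_adjust(100, base_rtt=0.080, current_rtt=0.080))
    print(vegas_adjust(100, base_rtt=0.080, current_rtt=0.100))
```

Raising alpha and beta makes the flow more aggressive, which is one way Vegas can be configured to hold its own against loss-based TCP.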

TCP tuning for other OS’s • Reorder threshold • seeing more out-of-order packets • future: multipath or bonded NICs • WAD tunes a bigger reorder threshold for the path • 40x improvement! • Linux 2.4 does a good job already • adjusts and caches reorder threshold • “undo” congestion avoidance • UDP transports don’t handle re-ordering well [Figure: LBL to ORNL (using our TCP-over-UDP): dup3 case had 289 retransmits, but all were unneeded!] • Delayed ACKs • WAD could turn off delayed ACKs • 2x improvement in recovery rate and slow-start • Linux 2.4 already turns off delayed ACKs for initial slow-start [Figure: ns simulation, 500 Mbs link, 80 ms RTT. Packet loss early in slow-start. Standard TCP with delayed ACKs takes 10 minutes to recover! NOTE aggressive static AIMD (Floyd pre-tune)]
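The effect of the reorder threshold can be seen in a toy receiver model: every out-of-order arrival generates a duplicate ACK, and the sender fast-retransmits once the duplicate count reaches the threshold, even if nothing was lost. The sketch below counts such unneeded retransmits for a given arrival order. It is a simplification (no SACK, no real loss, one packet per sequence number), and the function name is hypothetical.

```python
def spurious_fast_retransmits(arrival_order, dup_threshold=3):
    """Count fast retransmits triggered purely by reordering (no packet is lost)."""
    received = set()
    next_expected = 0
    dup_acks = 0
    retransmits = 0
    for seq in arrival_order:
        received.add(seq)
        if seq == next_expected:
            # The hole is filled: the cumulative ACK advances past everything buffered.
            while next_expected in received:
                next_expected += 1
            dup_acks = 0
        else:
            # Out-of-order arrival: receiver repeats its last cumulative ACK.
            dup_acks += 1
            if dup_acks == dup_threshold:
                retransmits += 1
    return retransmits


if __name__ == "__main__":
    order = [1, 2, 3, 0, 4, 6, 7, 8, 5, 9]  # packets 0 and 5 each arrive 3 places late
    print("threshold 3:", spurious_fast_retransmits(order, 3))
    print("threshold 9:", spurious_fast_retransmits(order, 9))
```

With the default threshold of 3 both late packets trigger an unneeded retransmit (and a cwnd cut); with a threshold of 9, as in the sample WAD config, neither does.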

Interactions • Scientific applications: SciDAC supernova and global climate; data grids (CERN, SLAC) • Middleware: Globus/gridFTP; HSI/HPSS • Network measurement: Internet2 end-to-end; Pinger (Cottrell); Claffy/Dovrolis pathload; netest (Guojun); SCNM • Protocol research: Dynamic Right-Sizing (Feng); HS TCP (Floyd); Scalable TCP (Kelly); TCP Vegas (Feng, Low); Tsunami/SABUL/FOBS/RBUDP; parallel streams (Hacker) • OS vendors: Linux; IBM AIX/Linux; Cray X1; SGI Altix • Talks/papers/software: www.net100.org

Insights • Parallel streams are quite effective • No kernel mods, but need new app’s • Bypass system buffer limits • Faster slow-start and recovery, and still TCP-like • Rate-based UDP is effective • No kernel mods, but need new app’s • Sensitive to re-ordering • Many duplicate packets • Does software-based rate control in the application layer scale? • WAD and WAD-lite: nice for experimenting or QoS, hard for user • Configure auto-tuning and Floyd’s HS TCP • Vote for bigger MTUs

Summary • Novel approaches • non-invasive dynamic/auto tuning of legacy applications • out-of-kernel tuning • using TCP to tune TCP • tuning on a per flow/destination basis, based on recent path metrics or policy (QoS) • Effective evaluation framework • protocol analysis and tuning • network/application/OS debugging • path characterization tools, archive, and visualization tools • Performance improvements • WAD tuned: • buffers: 10x • AIMD: 2x to 10x • delayed ACK: 2x • slow-start: 3x • reorder: 40x • Timely -- needed for science on today’s and tomorrow’s networks
