Infrastructure and Protocols for Dedicated Bandwidth Channels. Nagi Rao Computer Science and Mathematics Division Oak Ridge National Laboratory raons@ornl.gov. March 14, 2005 1st Annual Workshop of Cyber Security and Information Infrastructure Research Group (CSIIR) and Information Operations Center (IOC)
Infrastructure and Protocols for Dedicated Bandwidth Channels Nagi Rao Computer Science and Mathematics Division Oak Ridge National Laboratory raons@ornl.gov March 14, 2005 1st Annual Workshop of Cyber Security and Information Infrastructure Research Group (CSIIR) and Information Operations Center (IOC) Oak Ridge, TN Research Sponsored by Department of Energy National Science Foundation Defense Advanced Research Projects Agency
Collaborators • Steven Carter, Oak Ridge National Laboratory • Leon O. Chua, University of California at Berkeley • Jianbo Gao, University of Florida • Qishi Wu, Oak Ridge National Laboratory • William Wing, Oak Ridge National Laboratory Sponsors • Department of Energy • High-Performance Networking Program • National Science Foundation • Advanced Network Infrastructure Program • Defense Advanced Research Projects Agency • Network Modeling and Simulation Program • Oak Ridge National Laboratory • Laboratory Directed R&D Program
Outline of Presentation • Network Infrastructure Projects • DOE UltraScienceNet • NSF CHEETAH • Dynamics and Control of Transport Protocols • TCP AIMD Dynamics • Analytical Results • Experimental Results • New Class of Protocols • Throughput Stabilization for Control • Transport Protocol • Probabilistic Quickest Path Problem • Quickest path algorithm • Probabilistic algorithm
Motivation for Networking Projects: Terascale Supernova Initiative (TSI) DOE large-scale science application • Science Objective: Understand supernova evolution • DOE SciDAC Project: ORNL and 8 universities • Teams of field experts across the country collaborate on computations • Experts in hydrodynamics, fusion energy, high energy physics • Massive computational code • A terabyte is currently generated per day • Archived at nearby HPSS • Visualized locally on clusters – only archival data • Desired network capabilities • Archive and supply massive amounts of data to supercomputers and visualization engines • Monitor, visualize, collaborate and steer computations [Figure labels: visualization channel, visualization control channel, steering channel]
DOE UltraScience Net The Need • DOE large-scale science applications on supercomputers and experimental facilities require high-performance networking • Petabyte data sets, collaborative visualization and computational steering • Application areas span the disciplinary spectrum: high energy physics, climate, astrophysics, fusion energy, genomics, and others Promising Solution • High-bandwidth, agile network capable of providing on-demand dedicated channels: from 150 Mbps up to multiple tens of Gbps • Protocols are simpler for high throughput and control channels Challenges: Several technologies need to be (fully) developed • User-/application-driven agile control plane: • Dynamic scheduling and provisioning • Security – encryption, authentication, authorization • Protocols, middleware, and applications optimized for dedicated channels Contacts: Bill Wing (wrw@ornl.gov) Nagi Rao (raons@ornl.gov)
DOE UltraScience Net Connects ORNL, Chicago, Seattle and Sunnyvale: • Dynamically provisioned dedicated dual 10Gbps SONET links • Proximity to several DOE locations: SNS, NLCF, FNL, ANL, NERSC • Peering with ESnet, NSF CHEETAH and other networks Data Plane User Connections: • Direct connections to core switches (SONET channels) and MSPPs (Ethernet channels) • Utilize UltraScience Net hosts Funding: • U.S. DOE High-Performance Networking Program at Oak Ridge National Laboratory • $4.5M for 3 years
Control-Plane • Phase I • Centralized VPN connectivity • TL1-based communication with core switches and MSPPs • User access via centralized web-based scheduler • Phase II • GMPLS direct enhancements and wrappers for TL1 • User access via GMPLS and web to bandwidth scheduler • Inter-domain GMPLS-based interface Bandwidth Scheduler • Computes path with target bandwidth • Is bandwidth available now? • Extension of Dijkstra’s algorithm • Provide all available slots • Extension of closed semiring structure to sequences of reals • Both are polynomial-time algorithms • GMPLS does not have this capability Web-based User Interface and API • Allows users to log on to website • Request dedicated circuits • Based on CGI scripts
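The "Is bandwidth available now?" query can be illustrated with a small sketch. This is not the UltraScience Net scheduler itself: it assumes each link carries a delay and a currently available bandwidth, skips links that fall below the target, and runs ordinary Dijkstra on what remains. The function name, the graph layout and the link values are hypothetical.

```python
import heapq

def path_with_bandwidth(graph, src, dst, target_bw):
    """Lowest-delay path from src to dst using only links that can carry target_bw.

    graph: {node: [(neighbor, delay, available_bw), ...]}  (hypothetical layout)
    Returns (total_delay, [nodes...]) or None if no feasible path exists.
    """
    best = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            # Walk the predecessor links back to the source.
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return dist, path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, delay, bw in graph.get(node, []):
            if bw < target_bw:
                continue  # link cannot carry the requested bandwidth
            cand = dist + delay
            if cand < best.get(nbr, float("inf")):
                best[nbr] = cand
                prev[nbr] = node
                heapq.heappush(heap, (cand, nbr))
    return None

# Hypothetical four-node topology: (neighbor, delay in ms, available Gbps).
demo_graph = {
    "A": [("B", 10.0, 10.0), ("C", 4.0, 1.0)],
    "B": [("D", 10.0, 10.0)],
    "C": [("D", 4.0, 1.0)],
}
print(path_with_bandwidth(demo_graph, "A", "D", 0.5))
print(path_with_bandwidth(demo_graph, "A", "D", 5.0))
```

A small request takes the short, thin route through C, while a 5 Gbps request is forced onto the longer route through B. Listing all available time slots, as the slide describes, would need per-link reservation calendars, which this sketch does not model.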
NSF CHEETAH: Circuit-switched High-speed End-to-End Transport ArcHitecture Objective: Develop the infrastructure and networking technologies to support a broad class of eScience projects and specifically the Terascale Supernova Initiative. Main Technical Components: • Optical network testbed • Transport protocols • Middleware and applications • Collaborative Project: $3.5M for 3 years • U. Virginia, ORNL, NC State, CUNY • Sponsor: National Science Foundation Contacts: Malathi Veeraraghavan (mv@cs.virginia.edu) Nagi Rao (raons@ornl.gov)
CHEETAH Project concept • Network: • Create a network that offers on-demand end-to-end dedicated bandwidth channels to applications • Operate a PARALLEL network to existing high-speed IP networks – NOT AN ALTERNATIVE! • Transport protocols: • Design to take advantage of dedicated and dual end-to-end paths • IP path and dedicated channel • eScience Application Requirements: • High-throughput file/data transfers • Interactive remote visualization • Remote computational steering • Multipoint collaborative computation
CHEETAH: Initial Configuration • Implements GMPLS protocols [Figure: network diagram with nodes at ORNL, NC (NCSU, NCSU/MCNC/NLR) and Atlanta (NLR/SOX), plus a link to DC (Dragon); each node has an MSPP with a control card, OC192 cards and a GbE/10GbE card, connected through a GbE/10GbE Ethernet switch to hosts]
Peering: UltraScience Net - CHEETAH • Peering • Coast-to-coast dedicated channels • Access to ORNL supercomputers and storage Applications: • TSI on larger scale
Transport Dynamics are Important • Data Transport: High bandwidth for large data transfers over dedicated channels • maintain suitable sending rate to achieve effective throughput • Control of end devices: Remote control of visualizations, computations and instruments • Jittery dynamics will destabilize the control loops • Will not be able to effectively execute interactive simulations
Study of Transport Dynamics Understanding of transport dynamics: • Analytically showed that TCP-AIMD contains chaotic regimes • concept of w-update map • Internet traces are shown to be both chaotic and stochastic • underlying process is anomalous diffusion. Development and tuning of protocols: • Protocols for stable flows of fixed rate: ONTCOU • Based on classical Robbins-Monro method • Transport protocols with statistical stability: RUNAT • Combination of AIAD and Kiefer-Wolfowitz method
Complicated TCP AIMD Dynamics - History Simulation Results: TCP-AIMD exhibits “complicated” trajectories • TCP streams competing with each other (Veres and Boda 2000) • TCP competing with UDP (Rao and Chua 2002) Analytical Results (Rao and Chua 2002): TCP-AIMD has chaotic regimes • Developed state space analysis and Poincaré maps Internet Measurements (2004): TCP-AIMD traces are a complicated mixture of stochastic and chaotic components Working Definition of Chaotic Trajectories: • Nearby starting points result in trajectories that move far apart • at a rate determined by a positive Lyapunov exponent • Trajectories are non-periodic for some starting points • The attractor is geometrically complicated
Simplified View: Dynamics of TCP • Transmission Control Protocol Outline • Uses window mechanism to send W bytes/RTT • Dynamically adjusts W to network and receiver state • Keeps increasing if no losses • Keeps shrinking if losses are detected • Slow start phase: • W increases exponentially until a threshold is reached or a loss occurs • Congestion Control: • Additively increase W with delivered packets • Multiplicatively decrease with loss [Figure: window size vs. time plots; labels: early loss slows throughput; slow start: a; congestion control: 1/w]
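The window rules above can be sketched as a per-RTT simulation. This is a deliberately simplified model rather than any real TCP implementation: it assumes a loss occurs whenever the window exceeds a fixed capacity, doubles the window during slow start, adds one unit per RTT in congestion avoidance, and halves the window on loss. The function name and all parameter values are hypothetical.

```python
def aimd_trace(capacity, steps, w0=1.0, ssthresh=32.0):
    """Return the window size at each RTT for an idealized slow-start + AIMD sender."""
    w = w0
    trace = []
    for _ in range(steps):
        trace.append(w)
        if w > capacity:
            # Loss inferred: multiplicative decrease, and remember the new threshold.
            ssthresh = max(w / 2.0, 1.0)
            w = ssthresh
        elif w < ssthresh:
            # Slow start: exponential growth, one doubling per RTT.
            w *= 2.0
        else:
            # Congestion avoidance: additive increase of one unit per RTT.
            w += 1.0
    return trace

trace = aimd_trace(capacity=20.0, steps=40, ssthresh=8.0)
print(trace[:6])
print(max(trace))
```

With a fixed capacity and no competing traffic this model produces the textbook saw-tooth; the later slides argue that measured Internet traces do not look like this.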
Chaotic Dynamics of TCP • Competing TCP streams: Window dynamics are chaotic • Hard to predict – resemblance to random noise • Hard to conclude from experiments – nearby orbits move far apart later • Hard to characterize – chaotic attractor • Poincaré map of two window sizes • Two-stream case • Four-stream case Veres and Boda (2000) did not rigorously establish chaos in a formal sense • Attractor could have been generated by a periodic orbit with a large period • We repeated the simulation and found only quasi-periodic trajectories
Noisy Nature of TCP (simulation) • Simple random traffic generates complicated attractors • TCP reacts to network traffic randomness • Jittery end-to-end delays • Do not need chaos to generate complicated attractors • Poincaré map of message delay vs. window size [Figure: TCP source, router with uniform random drops, destination]
TCP Competing with UDP (ns-2 simulation) • TCP competing with UDP/CBR at the router generates a variety of dynamics as the CBR rate is varied • Poincaré phase plot: window size W(t) vs. packet end-to-end delay D(t) [Figure: ns-2 topology in which TCP/Reno and UDP/CBR sources feed a router connected to a sink; link labels: 2Mb, 10ms, DT; 2Mb, 10ms, DT; 1.7Mb, 10ms, DT. Plots of W(t) vs. time and W(t) vs. D(t) for UDP/CBR = 1 Mbps]
TCP Competing with UDP [Figure: six phase-plot panels titled UDP/CBR: 0 Mbps, 1.0 Mbps, 1.75 Mbps, 1.7 Mbps, 0.5 Mbps, 1.7 Mbps]
Summary of Our Analytical Results State-Space of TCP: • congestion window; packet delay including re-transmits; • acknowledgements since last MD; losses inferred since last AI • TCP-AIMD dynamics have two qualitatively different regimes • Regime one: highlighted in usual TCP literature • increased with while • Regime two: highlighted by and • decreases with • Its effect and duration are enhanced by network delay and high buffer occupancy • Trajectories move back and forth between these two regimes • We define a Poincaré map that updates w: the w-update map M • M is 1-dimensional if Regime Two is short-lived • M is 2-dimensional and complicated if Regime Two is significant • M is qualitatively similar to the tent map, which generates chaotic trajectories
Dynamics of Transitions Between Regimes • map for long TCP transfers • Both regimes are unstable – Eigenvalue analysis [Figure: w vs. t plots for Regime 1 and Regime 2]
M: w-update map Given a value of w, M gives its next updated value after some time period (not fixed) Regime 1: Regime 2: depends on the number of dropped packets - buffer occupancy at that time - delay between source and bottleneck buffer Result: M is parametrized, and each piece resembles a twisted version of the classical tent map Rao, Gao and Chua, chapter in Complex Dynamics in Communications Networks, 2004
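Since the w-update map is compared to the classical tent map, a short sketch of the tent map itself shows the behavior the working definition of chaos refers to: two orbits that start almost together separate quickly. This illustrates the textbook tent map only, not the w-update map M. The slope value 1.99 is an assumption, chosen slightly below 2 because a slope of exactly 2 collapses to zero in binary floating point.

```python
import math

def tent(x, mu=1.99):
    """Classical tent map on [0, 1] with slope magnitude mu."""
    return mu * x if x < 0.5 else mu * (1.0 - x)

def orbit(x0, n, mu=1.99):
    """Return x0 followed by n iterates of the tent map."""
    xs = [x0]
    for _ in range(n):
        xs.append(tent(xs[-1], mu))
    return xs

def lyapunov_estimate(x0, n, mu=1.99):
    """Average of log|f'(x)| along an orbit; for this map it equals log(mu)."""
    total = 0.0
    x = x0
    for _ in range(n):
        slope = mu if x < 0.5 else -mu
        total += math.log(abs(slope))
        x = tent(x, mu)
    return total / n

a = orbit(0.3, 100)
b = orbit(0.3 + 1e-9, 100)
print(max(abs(p - q) for p, q in zip(a, b)))  # largest separation reached
print(lyapunov_estimate(0.3, 1000), math.log(1.99))
```

Because the map is piecewise linear with the same slope magnitude on both pieces, the Lyapunov exponent is simply log(mu), which is positive for any slope above 1.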
Internet Measurements – Joint work with Jianbo Gao Question 1: How relevant are previous simulation and analytical results on chaotic trajectories? Answer: Relevant from an analysis perspective to a certain extent. Question 2: Do actual Internet TCP measurements exhibit chaotic behavior? Answer: Yes. They are more complicated than chaotic (deterministic).
Internet Measurements Internet (net100) traces show that TCP-AIMD dynamics are a complicated mixture of chaotic and stochastic regimes: • Chaotic – TCP-AIMD dynamics • Stochastic – TCP response to network traffic Basic Point: TCP traces collected on all Internet connections showed complicated dynamics • The classical “saw-tooth” profile is not seen even once • This is not a criticism of TCP; it was not intended for smooth dynamics
Cwnd time series for ORNL-LSU connection Connection: OC192 to Atlanta-Sox; Internet2 to Houston; LAnet to LSU Time series: cwnd=x(t) Collected at approximately 1 ms resolution using net100 instruments
Time-Dependent Exponent Plots • Informally, a measure of how separated close-by states become in time • Exponential separation is characteristic of a chaotic regime Form state vectors of size m from the sampled time series x(1), x(2), …. For pairs of state vectors whose initial separation satisfies a closeness condition, the time-dependent exponent is defined from the growth of that separation over time. [Figure panels: uniform random series (curves spread out); Lorenz system, chaotic (curves form a common envelope)]
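The slide's formulas are not preserved, so the sketch below assumes one common form of the time-dependent exponent: embed the series into vectors of size m, select pairs whose initial distance lies in a shell [eps_lo, eps_hi], and average the log of the distance growth after k steps. The function names, the shell bounds, the `min_sep` exclusion of temporally adjacent pairs, and the logistic-map test series are all assumptions made for illustration.

```python
import math

def time_dependent_exponent(x, m, delay, eps_lo, eps_hi, k_max, min_sep=10):
    """Return [Lambda(0), ..., Lambda(k_max)] for the scalar series x.

    Lambda(k) is the mean of log(d_k / d_0) over vector pairs whose initial
    distance d_0 lies in [eps_lo, eps_hi]; NaN if no pair qualifies.
    """
    span = (m - 1) * delay
    n_vec = len(x) - span - k_max  # vectors that can still be advanced k_max steps

    def dist(i, j):
        return math.sqrt(sum((x[i + a * delay] - x[j + a * delay]) ** 2
                             for a in range(m)))

    sums = [0.0] * (k_max + 1)
    counts = [0] * (k_max + 1)
    for i in range(n_vec):
        for j in range(i + min_sep + 1, n_vec):
            d0 = dist(i, j)
            if not (eps_lo <= d0 <= eps_hi):
                continue
            for k in range(k_max + 1):
                dk = dist(i + k, j + k)
                if dk > 0.0:  # skip exact coincidences, where the log is undefined
                    sums[k] += math.log(dk / d0)
                    counts[k] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]

def logistic_series(n, x0=0.3):
    """Chaotic test signal: x -> 4x(1-x)."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(4.0 * xs[-1] * (1.0 - xs[-1]))
    return xs

lam = time_dependent_exponent(logistic_series(800), m=1, delay=1,
                              eps_lo=1e-3, eps_hi=2e-3, k_max=3)
print(lam)
```

For a chaotic signal the curves computed for different shells are expected to grow along a common envelope, while for noise they stay spread out; repeating the call with several shells gives the kind of plot the slide refers to. The double loop is quadratic in the series length, so it suits short traces only.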
Internet cwnd Measurements: Both Stochastic and Chaotic Parts are Dominant • TCP traces have: • Common envelope – chaotic • Spread out – stochastic • at certain scales • Observations: • From analysis, chaotic dynamics are from AIMD • Stochastic component is in response to network traffic: losses and RTT variations Gao and Rao, IEEE Comm Letters, 2005, in press
Design of Transport Protocols with Smooth Dynamics Observation 1: Avoid AIMD-like behavior to avoid chaotic dynamics Challenge: Randomness is inherent in Internet connections – will not go away even if protocol is non-chaotic. Our Solution: Explicitly account for randomness in the protocol design – stochastic approximation
Throughput Stabilization • Niche Application Requirement: Provide stable throughput at a target rate - typically much below peak bandwidth • High-priority channels • Commands for computational steering and visualization • Control loops for remote instrumentation • TCP AIMD is not suited for stable throughput • Complicated dynamics • Underflows with sustained traffic • Important Consideration • Stochasticity of Internet connections must be explicitly accounted for Rao, Wu and Iyengar, IEEE Comm Letters, 2004
Stochastic Approximation: UDP window-based method Transport control loop Objective: adjust source rate to achieve (almost) fixed goodput at the destination application Difficulty: data packets and acks are subject to random processes Approach: Rely on statistical properties of data paths
UDP-Based Framework Send datagrams and wait for period Source Sending rate: Destination goodput: Loss rate Goodput regression: Loss regression:
Channel Throughput Profile Plot of receiving rate as a function of sending rate Its precise interpretation depends on: • Sending and receiving mechanisms • Definition of rates For protocol optimizations, it is important to use the protocol's own sending mechanism to generate the profile Window-based sending process for UDP datagrams: Send datagrams in one step – window size Wait for time called idle-time or wait-time Sending rate at time resolution : This is an ad hoc mechanism facilitated by 1GigE NIC
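The window-based sending process can be sketched as a loop that sends a burst and then idles. The slide's rate formula is not preserved, so the nominal rate below assumes the usual reading: bits sent per cycle divided by the cycle time (burst time plus idle time). The function names are hypothetical, and the `send` and `sleep` callables are injected so the sketch runs without a network; in practice `send` could wrap a UDP socket's `sendto`.

```python
import time

def nominal_rate_mbps(window, datagram_bytes, send_time_s, idle_time_s):
    """Assumed sending rate: bits per cycle over cycle duration, in Mbps."""
    bits = window * datagram_bytes * 8
    return bits / (send_time_s + idle_time_s) / 1e6

def send_cycles(send, payload, window, idle_time_s, cycles, sleep=time.sleep):
    """Send `window` datagrams in one burst, idle, and repeat `cycles` times.

    Returns the measured sending rate in Mbps over the whole run.
    """
    start = time.monotonic()
    sent_bytes = 0
    for _ in range(cycles):
        for _ in range(window):
            send(payload)
            sent_bytes += len(payload)
        sleep(idle_time_s)  # the idle-time (wait-time) between bursts
    elapsed = time.monotonic() - start
    return sent_bytes * 8 / elapsed / 1e6 if elapsed > 0 else float("inf")

# Dry run with a list standing in for the network.
outbox = []
measured = send_cycles(outbox.append, b"x" * 1250, window=10,
                       idle_time_s=0.001, cycles=5)
print(len(outbox), measured)
print(nominal_rate_mbps(10, 1250, 0.0, 0.01))
```

Either knob changes the rate: a larger window sends more per cycle, and a longer idle time stretches the cycle.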
Throughput Profile: Throughput and loss rates vs. sending rate (window size, cycle time) Objective: adjust source rate to yield the desired throughput at destination [Figure: profiles for a typical day and for Christmas day, with the peak zone and stabilization zone marked]
Adaptation of source rate • Sending process: send datagrams and wait for duration • Adjust the window size • Adjust cycle-time • Both are special cases of the classical Robbins-Monro method (annotated terms in the update equations: target throughput, noisy estimate)
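A Robbins-Monro style window adjustment can be sketched as follows. The slide's update equations are not preserved, so this assumes the textbook form: move the window against the error between the noisy goodput estimate and the target, with a step size a / n^alpha that shrinks over time. The gain a = 0.8 matches the value quoted on the results slides; the exponent alpha, the function name and the toy linear channel are assumptions.

```python
import random

def robbins_monro_window(target, measure_goodput, w0, a=0.8, alpha=0.8, steps=200):
    """Adapt the window so that the noisy goodput settles near `target`.

    measure_goodput(w) returns one noisy goodput sample for window w.
    Step sizes a / n**alpha satisfy the usual conditions for 0.5 < alpha <= 1.
    """
    w = w0
    history = []
    for n in range(1, steps + 1):
        g = measure_goodput(w)          # noisy estimate
        step = a / n ** alpha
        w = max(1.0, w - step * (g - target))
        history.append(g)
    return w, history

# Toy channel (hypothetical): goodput grows linearly with the window, plus noise.
rng = random.Random(0)
channel = lambda w: 0.5 * w + rng.gauss(0.0, 0.1)

w_final, samples = robbins_monro_window(2.0, channel, w0=10.0, steps=500)
print(w_final, 0.5 * w_final)
```

The same update applied to the idle time instead of the window size gives the cycle-time variant; only the sign of the correction changes, because goodput falls as the idle time grows.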
Performance Guarantees • Summary: Stabilization is achieved with a high probability with a very simple estimation of source rate • Basic result: for the general update • We have
Internet Measurements • ORNL-LSU connection (before recent upgrade) • Hosts with 10 M NIC • 2000 mile network distance • ORNL-NYC – ESnet • NYC-DC-Hou – Abilene • HOU-LSU – Local n/s • ORNL-GaTech Connection • Hosts with GigE NICS • ORNL-Juniper router – 1Gig link • Juniper- ATL Sox – OC192 (1Gig link) • Sox-GaTech – 1Gig link
Goodput Stabilization: ORNL-LSU Experimental Results • Case 1: Target goodput = 1.0 Mbps, rate control through congestion window, a = 0.8 • Case 2: Target goodput = 2.0 Mbps, rate control through congestion window, a = 0.8 [Plots for both cases: datagram acknowledging time vs. source rate (Mbps) and goodput (Mbps)]
Throughput Stabilization: ORNL-GaTech • Target goodput = 20.0 Mbps, a = 0.8, adjust congestion window size • Target goodput = 2.0 Mbps, a = 0.8, adjust sleep time
RUNAT: Reliable UDP-based Network Adaptive Transport • Transport protocol • Maximize connection utilization: Track peak goodput • Uses Kiefer-Wolfowitz stochastic approximation to handle ACKs and losses Features: • Tailored to random loss rate and RTT • Segmented rate control • 3 control zones: bottleneck link is underutilized, saturated, or overloaded • Explicit accounting for random components • Use stochastic approximation methods based on goodput estimates • TCP-friendliness • Rate-increasing and rate-decreasing coefficients are dynamically adjusted • Adaptable to diverse network environments • Measurement and control periods are not constant, but link-specific (use RTT). Wu and Rao, INFOCOM 2005
Three Zones of Goodput Profile • Three control zones • Zone I: Adaptive Increase • Bottleneck link is underutilized • Low packet loss due to occasional congestion or transmission errors • Fixed with increasing source rate • Zone II (transitional): dynamic KWSA method • Bottleneck link is saturated • Peak goodput falls within this zone • SA determines whether to increase or decrease source rate • Zone III: Adaptive Decrease • Bottleneck link is overloaded • Large packet loss due to network congestion • Back off to recover from congestion collapse [Figure: goodput regression vs. sending rate r, showing Zone I (~zero loss), Zone II (low loss, where the sending rate is stabilized) and Zone III (high loss)]
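The Kiefer-Wolfowitz step used in Zone II can be sketched in its textbook form: estimate the slope of the noisy goodput profile with a central difference and move the rate uphill with a shrinking gain. This is not the RUNAT implementation. The gain sequences a/n and c/n^(1/3), the function name and the quadratic toy profile are assumptions.

```python
import random

def kiefer_wolfowitz_peak(measure, r0, steps=300, a=1.0, c=1.0,
                          r_min=0.0, r_max=float("inf")):
    """Search for the rate that maximizes a noisy goodput profile.

    measure(r) returns one noisy goodput sample at sending rate r.
    """
    r = r0
    for n in range(1, steps + 1):
        a_n = a / n                     # step-size sequence
        c_n = c / n ** (1.0 / 3.0)      # probe-width sequence
        # Central-difference estimate of the goodput slope at r.
        slope = (measure(r + c_n) - measure(r - c_n)) / (2.0 * c_n)
        r = min(r_max, max(r_min, r + a_n * slope))
    return r

# Toy profile (hypothetical): concave goodput with its peak at r = 10, plus noise.
rng = random.Random(1)
profile = lambda r: 10.0 - (r - 10.0) ** 2 / 10.0 + rng.gauss(0.0, 0.1)

print(kiefer_wolfowitz_peak(profile, r0=2.0, steps=500, a=5.0))
```

The sign of the estimated slope is what decides whether the rate goes up or down, which matches the slide's remark that SA determines the direction in Zone II. The gain `a` needs to be matched to the curvature of the profile; too small a value makes convergence very slow.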
Segmented Rate Control Algorithm Loss rate estimate: Basic Idea: Control sending rate based on loss rate estimate to achieve peak goodput when when when
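The segmented control idea can be sketched as a zone classifier plus one update rule per zone. The slide's actual conditions and update formulas are not preserved, so the loss thresholds (1% and 10%), the increase and decrease amounts, and the function names below are all hypothetical placeholders rather than RUNAT's values.

```python
def classify_zone(loss_rate, low=0.01, high=0.10):
    """Map a loss-rate estimate to a control zone (thresholds are assumptions)."""
    if loss_rate <= low:
        return "I"      # bottleneck underutilized
    if loss_rate >= high:
        return "III"    # bottleneck overloaded
    return "II"         # bottleneck saturated, peak goodput nearby

def next_rate(rate, loss_rate, goodput_slope, gain,
              low=0.01, high=0.10, increase=1.0, decrease=0.5, min_rate=0.1):
    """One control step of a three-zone rate controller.

    Zone I: additive increase. Zone III: additive decrease scaled by the
    congestion level. Zone II: stochastic-approximation step along the
    estimated goodput slope.
    """
    zone = classify_zone(loss_rate, low, high)
    if zone == "I":
        new_rate = rate + increase
    elif zone == "III":
        # Higher congestion levels produce larger rate drops.
        new_rate = rate - decrease * (loss_rate / high)
    else:
        new_rate = rate + gain * goodput_slope
    return max(min_rate, new_rate)

# Loss rates quoted in the experimental illustration: 0%, 3.33%, 37.33%.
for loss in (0.0, 0.0333, 0.3733):
    print(loss, classify_zone(loss), next_rate(10.0, loss, -0.2, 0.5))
```

Keeping the classifier separate from the update makes it easy to swap in measured thresholds or to let the increase and decrease coefficients adapt, as the slides describe.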
Convergence Properties of RUNAT • Informal Statement: • If in zone I or III, it will exit to zone II • If in zone II, it will converge to maximum throughput Condition A1: loss statistics vary slowly Condition A2: loss regression is differentiable and its derivative is monotonically increasing with respect to r in zone II. Result: Starting in zone I or III, RUNAT enters zone II in a finite number of steps almost surely; in zone II, RUNAT almost surely converges to the peak goodput
Experimental Results on link between ozy4 (ORNL) and robot (LSU)- Illustration of microscopic RUNAT behaviors during transfer of 20MB data The increment of source rate is determined by congestion levels (local loss rate measurements) and . The decrement of source rate upon packet loss is determined by congestion levels (local loss rate measurements) and : higher congestion levels result in larger rate drops. When far away from the saturation (peak) point, is adjusted to large values to quickly move towards the peak point. When approaching the saturation (peak) point, is adjusted to small values to slowly converge to and remain at the peak point. Zone I (loss rate: 0%) Zone III (loss rate: 37.33%) Slow Start Zone II (loss rate: 3.33%)
Experimental Results on link between ozy4 (ORNL) and robot (LSU)- RUNAT transport performance during transfer of 2GB data with concurrent TCP transfer of 50MB data Case 1: run RUNAT & TCP concurrently RUNAT throughput: 10.49Mbps Note: The low throughputs were due to the high traffic volume at the time of experiments. In a normal day with regular traffic volume, TCP is able to achieve 3~6Mbps and RUNAT may reach 15~30Mbps at lower loss rates without significantly affecting concurrent TCP on this link. TCP throughput: 0.376Mbps Case 2: run a single TCP only Single TCP throughput: 0.377Mbps
Experimental Results on link from ozy4 (ORNL) to orbitty (NC State)
ORNL-Atlanta-ORNL 1Gbps Channel • Host to Router • Dedicated 1GigE NIC • ORNL Router • Filter-based forwarding to override both at input and middle queues and disable other traffic to GigE interfaces • IP packets on both GigE interfaces are forwarded to out-going SONET port • Atlanta-SOX router • Default IP loopback • Only 1Gbps on OC192 link is used for production traffic – 9Gbps spare capacity [Figure: Dell dual Xeon 3.2GHz and dual Opteron 2.2 GHz hosts connected via GigE to the GigE blade of the Juniper M160 router at ORNL; SONET blades at ORNL and Atlanta joined by the OC192 ORNL-ATL link; IP loop at the Juniper M160 router at Atlanta]
1Gbps Dedicated IP Channel • Non-Uniform Physical Channel: • GigE – SONET – GigE • ~500 network miles • End-to-End IP Path • Both GigE links are dedicated to the channel • Other host traffic is handled through second NIC • Routers, OC192 and hosts are lightly loaded • IP-based Applications and Protocols are readily executed [Figure: Dell dual Xeon 3.2GHz and dual Opteron 2.2 GHz hosts connected via GigE to the GigE blade of the Juniper M160 router at ORNL; SONET blades at ORNL and Atlanta joined by the OC192 ORNL-ATL link; IP loopback at the Juniper M160 router at Atlanta]