Investigating New Loss-Based TCP Stacks for High Throughput Networks

Experience with Loss-Based Congestion Controlled TCP Stacks • Yee-Ting Li • University College London

Introduction • Transport of Data for next generation applications • Network hardware is capable of Gigabits per second • Current ‘Vanilla’ TCP cannot sustain high throughput over long distances • New TCP Stacks have been introduced to rectify the problem • Investigation into the performance, bottlenecks and deploy-ability of the new algorithms

Transmission Control Protocol • Connection orientated • Reliable Transport of Data • Window based • Congestion and Flow Control to prevent network collapse • Provides ‘fairness’ between competing streams • 20 Years old • Originally designed for kbit/sec pipes

TCP Algorithms • Based on two algorithms to determine rate at which data is to be sent • Slowstart: probe for initial bandwidth • Congestion Avoidance: maintain a steady state transfer rate • Focus on Steady State: probe for increases in available bandwidth, whilst backing off if congestion is detected (through loss). • Maintained through a ‘congestion window’ cwnd that regulates the number of unacknowledged packets allowed on connection. • Size of window approx equals Bandwidth delay product • Determines the appropriate window size to set to obtain a bandwidth under a certain delay • Window = Bandwidth x Delay
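As a worked example of Window = Bandwidth x Delay, the sketch below computes the window needed to fill a 1Gb/sec bottleneck at the two RTTs quoted for the networks under test (6msec and 120msec). The 1500-byte packet size and the function name bdp_window are assumptions made for illustration only.

```python
def bdp_window(bandwidth_bps: float, rtt_s: float, packet_bytes: int = 1500):
    """Return (window_bytes, window_packets) needed to keep a path full.

    packet_bytes = 1500 is an assumed packet size, not a figure from the slides.
    """
    window_bytes = bandwidth_bps * rtt_s / 8      # bits -> bytes
    window_packets = window_bytes / packet_bytes  # bytes -> packets
    return window_bytes, window_packets


if __name__ == "__main__":
    for label, rtt in (("RTT 6msec", 0.006), ("RTT 120msec", 0.120)):
        wb, wp = bdp_window(1e9, rtt)
        print(f"1Gb/sec, {label}: {wb / 1e6:.2f} MB, about {wp:.0f} packets")
```

Under these assumptions the long-RTT path needs a window twenty times larger than the short one, which is why the same TCP stack behaves so differently on the two networks.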

Algorithms • Congestion Avoidance • For every packet (ack) received by sender • cwnd ← cwnd + 1/cwnd • When loss is detected (through dupacks) • cwnd ← cwnd / 2 • Growth of cwnd determined by: • The RTT of the connection • When RTT is high, cwnd grows slowly (because growth is clocked by acks) • The loss rate on the line • High loss means that cwnd never achieves a large value • Capacity of the link • Allows for large cwnd value (when loss is low)
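The two update rules can be sketched as below to show why recovery is slow at large windows. This is a simplified model, not kernel code: it assumes one RTT delivers int(cwnd) acks, ignores slowstart and timeouts, and all function names are hypothetical.

```python
def vanilla_on_ack(cwnd: float) -> float:
    """Congestion avoidance: grow by 1/cwnd for every ack received."""
    return cwnd + 1.0 / cwnd


def vanilla_on_loss(cwnd: float) -> float:
    """Loss detected through dupacks: halve the window."""
    return cwnd / 2.0


def rtts_to_recover(target_cwnd: float) -> int:
    """Count RTTs for cwnd to climb back to target_cwnd after a single loss.

    Assumption: each RTT delivers int(cwnd) acks.
    """
    cwnd = vanilla_on_loss(target_cwnd)
    rtts = 0
    while cwnd < target_cwnd:
        for _ in range(int(cwnd)):
            cwnd = vanilla_on_ack(cwnd)
        rtts += 1
    return rtts


def recovery_time_estimate(target_cwnd: float, rtt_s: float) -> float:
    """Closed-form estimate: about one packet gained per RTT, so cwnd/2 RTTs."""
    return (target_cwnd / 2.0) * rtt_s


if __name__ == "__main__":
    print("RTTs to recover a 200-packet window:", rtts_to_recover(200))
    print("Estimated seconds for 10000 packets at 120msec:",
          recovery_time_estimate(10000, 0.120))
```

With the closed-form estimate, a 10000-packet window at 120msec RTT takes roughly 5000 RTTs, or about 600 seconds, to recover from one loss.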

Current Methods of Achieving High Throughput • Advantages • Achieves good throughput • No changes to kernels required • Disadvantages • Have to manually tune the number of flows • May induce extra loss on lossy networks • Need to reprogram/recompile software

New TCP Stacks • Modify the congestion control algorithm to improve response times • All based on modifying the cwnd growth and decrease values • Define: • a = increase of data packets per window of acks • b = decrease factor upon congestion • To maintain compatibility (and hence network stability and fairness), for small cwnd values: • Mode switch from Vanilla to New TCP

HSTCP • Designed by Sally Floyd • Determine a and b as a function of cwnd • a ← a(cwnd) • b ← b(cwnd) • Gradual improvement in throughput as we approach larger bandwidth delay products • Current implementation focused on performance up to 10Gb/sec – set linear relation between loss and throughput (response function)
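The slide does not give the functional forms of a(cwnd) and b(cwnd). As a rough sketch, the version below uses the formulas and default parameters from RFC 3649 (HighSpeed TCP), including its approximation p(w) = 0.078 / w^1.2 for the response function. These are taken from the RFC, not from this deck, and the function names are hypothetical. Note that here b multiplies the amount removed from cwnd on loss, and a is the increase per window of acks.

```python
import math

# RFC 3649 default parameters (assumed here; the slides do not list them).
LOW_WINDOW = 38        # at or below this cwnd, behave like Vanilla TCP
HIGH_WINDOW = 83000
HIGH_DECREASE = 0.1


def hstcp_b(cwnd: float) -> float:
    """Decrease factor: 0.5 at LOW_WINDOW, falling to 0.1 at HIGH_WINDOW."""
    if cwnd <= LOW_WINDOW:
        return 0.5
    frac = (math.log(cwnd) - math.log(LOW_WINDOW)) / (
        math.log(HIGH_WINDOW) - math.log(LOW_WINDOW))
    return (HIGH_DECREASE - 0.5) * frac + 0.5


def hstcp_p(cwnd: float) -> float:
    """Loss rate at which the response function yields this cwnd (RFC approximation)."""
    return 0.078 / cwnd ** 1.2


def hstcp_a(cwnd: float) -> float:
    """Increase in packets per window of acks."""
    if cwnd <= LOW_WINDOW:
        return 1.0
    b = hstcp_b(cwnd)
    return cwnd ** 2 * hstcp_p(cwnd) * 2.0 * b / (2.0 - b)


if __name__ == "__main__":
    for w in (38, 1000, 10000, 83000):
        print(f"cwnd={w}: a={hstcp_a(w):.2f}, b={hstcp_b(w):.2f}")
```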

Scalable TCP • Designed by Tom Kelly • Define a and b to be constant: • a: cwnd ← cwnd + a (per ack) • b: cwnd ← cwnd – b × cwnd • Intrinsic scaling property that has the same performance over any link (beyond the initial threshold) • Recommended settings • a = 1/100 • b = 1/8
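The scaling property can be illustrated with a short calculation. Assuming one RTT delivers about cwnd acks, cwnd grows by a factor of (1 + a) per RTT, so the number of RTTs needed to recover from one loss does not depend on cwnd. The function names below are hypothetical and the model is a sketch only.

```python
import math

A = 1 / 100   # recommended increase per ack
B = 1 / 8     # recommended decrease factor


def scalable_on_ack(cwnd: float, a: float = A) -> float:
    """cwnd <- cwnd + a for every ack."""
    return cwnd + a


def scalable_on_loss(cwnd: float, b: float = B) -> float:
    """cwnd <- cwnd - b * cwnd on congestion."""
    return cwnd - b * cwnd


def scalable_recovery_rtts(a: float = A, b: float = B) -> float:
    """RTTs to regain the pre-loss window.

    Assumption: about cwnd acks per RTT, so growth is a factor (1 + a) per RTT.
    The result is independent of the window size.
    """
    return math.log(1.0 / (1.0 - b)) / math.log(1.0 + a)


if __name__ == "__main__":
    print(f"Recovery takes about {scalable_recovery_rtts():.1f} RTTs at any cwnd")
```

With the recommended settings this comes to a little over 13 RTTs, compared with cwnd/2 RTTs for Vanilla TCP.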

H-TCP • Designed by Doug Leith and Robert Shorten • Define a mode switch so that after congestion we do normal Vanilla • After a predefined period ∆L, switch to a high performance a • ∆i ≤ ∆L: a = 1 • ∆i > ∆L: a = 1 + (∆i - ∆L) + [(∆i - ∆L)/20]^2 • Upon loss drop by • | [B_i^max(k+1) - B_i^max(k)] / B_i^max(k) | > 0.2: b = 0.5 • Else: b = RTTmin/RTTmax
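The H-TCP rules above can be written as two small functions. This sketch follows the constants exactly as they appear on the slide and makes several assumptions: the trailing 2 in the a formula is read as a square, ∆ is taken to be the time in seconds since the last congestion event, and Bmax(k) is taken to be the throughput measured just before the k-th congestion event. Function and parameter names are hypothetical.

```python
def htcp_a(delta: float, delta_l: float) -> float:
    """Increase parameter a as a function of time since the last congestion event.

    delta   : elapsed time since last congestion (assumed to be in seconds)
    delta_l : the predefined period after which the high performance mode starts
    """
    if delta <= delta_l:
        return 1.0                      # Vanilla behaviour
    t = delta - delta_l
    return 1.0 + t + (t / 20.0) ** 2    # constants as given on the slide


def htcp_b(bmax_prev: float, bmax_next: float,
           rtt_min: float, rtt_max: float) -> float:
    """Decrease parameter b chosen at a loss event.

    If the measured peak throughput changed by more than 20% between
    consecutive congestion events, fall back to 0.5; otherwise use
    RTTmin/RTTmax.
    """
    if abs((bmax_next - bmax_prev) / bmax_prev) > 0.2:
        return 0.5
    return rtt_min / rtt_max


if __name__ == "__main__":
    print("a shortly after loss:", htcp_a(0.5, 1.0))
    print("a well after loss   :", htcp_a(21.0, 1.0))
    print("b on a stable path  :", htcp_b(100.0, 110.0, 0.100, 0.125))
```

In the H-TCP literature the factor multiplies the window on loss (cwnd ← b × cwnd), unlike the Scalable TCP convention where b × cwnd is subtracted; the sketch only computes the value and leaves its application to the caller.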

Implementation • All New Stacks have their own implementation • Small differences between implementations mean that we are comparing the kernel differences rather than just the algorithmic differences • Led to development of ‘test platform’ kernel → altAIMD • Implements all three stacks via simple sysctl switch • Also incorporates switches for certain undesirable kernel ‘features’ • moderate_cwnd() • IFQ • Added extra features for testing/evaluation purposes • Appropriate Byte Counting (RFC3465) • Inducible packet loss (at recv) • Web100 TCP logging (cwnd etc)

Networks Under Test • MB-NG (UCL to Manchester): Bottleneck Capacity 1Gb/sec, RTT 6msec • DataTAG (StarLight to CERN): Bottleneck Capacity 1Gb/sec, RTT 120msec • Routers in the paths: Cisco 7600, Juniper

Graph/Demo • Mode switch between stacks on constant packet drop • Vanilla TCP • Scalable TCP • HS-TCP

Comparison against theory • Response function

Self Similar Background Tests • Results skewed • Not comparing differences in TCP algorithms! • Not useful results!

SACK … • Look into what’s happening at the algorithmic level: • Strange hiccups in cwnd → only correlation is SACK arrivals • Figure: Scalable TCP on MB-NG with 200mbit/sec CBR Background

SACKS • Supplies the sender with information about which segments the receiver has • Sender infers the missing packets to resend • Aids recovery during loss and prevents timeouts • Current implementation in 2.4 and 2.6 does a walk through the entire SACK list for each SACK • Very CPU intensive • Can be interrupted by arrival of next SACK, which causes the SACK implementation to misbehave • Tests conducted with Tom Kelly’s SACK fast-path patch • Improves SACK processing, but still not sufficient

SACK Processing overhead • Periods of web100 silence due to high CPU utilization • Logging done in userspace: kernel time taken up by TCP SACK processing • TCP resets cwnd

Congestion Window Moderation • Linux TCP implementation adds ‘feature’ of moderate_cwnd() • Idea is to prevent large bursts of data packets under ‘dubious’ conditions • When an ACK acknowledges more than 3 packets (typically 2) • Adjusts cwnd to known number of packets ‘in-flight’ (plus extra 3 packets) • Under large cwnd sizes (high bandwidth delay products), throughput can be diminished as result
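The effect of the adjustment can be sketched with a simple model: cwnd is clamped to the number of packets in flight plus a burst allowance of 3, as described above. This is a Python illustration of the idea, not the kernel code, and the parameter names are hypothetical.

```python
MAX_BURST = 3  # extra packets allowed beyond those in flight (from the slide)


def moderate_cwnd(cwnd: int, packets_in_flight: int,
                  max_burst: int = MAX_BURST) -> int:
    """Clamp cwnd to packets in flight plus a small burst allowance."""
    return min(cwnd, packets_in_flight + max_burst)


if __name__ == "__main__":
    # Small window: the clamp rarely bites.
    print(moderate_cwnd(cwnd=20, packets_in_flight=18))
    # Large window with fewer packets in flight: a large cut in one step.
    print(moderate_cwnd(cwnd=10000, packets_in_flight=6000))
```

At large cwnd the gap between cwnd and the packets actually in flight can be thousands of packets, so a single moderation event can remove far more window than a small-cwnd connection would ever notice.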

CPU Load and Throughput

moderate_cwnd(): Vanilla TCP • Plots of CWND and Throughput with moderate_cwnd ON and OFF (90% TCP AF)

moderate_cwnd(): HS-TCP • Plots with moderate_cwnd OFF and ON (70% TCP AF and 90% TCP AF)

moderate_cwnd(): Scalable-TCP • Plots with moderate_cwnd OFF and ON (70% TCP AF and 90% TCP AF)

Multiple Streams • Plots: Aggregate BW and CoV

10 TCP Flows versus Self-Similar Background • Plots: Aggregate BW and CoV

10 TCP Flows versus Self-Similar Background • Plot: BG Loss per TCP BW

Impact • Fairness: ratio of throughput achieved by one stack against another • Fairness against Vanilla TCP is therefore defined by how much more throughput a new stack gets than Vanilla • Doesn’t really consider deploy-ability of the stacks in real life: how do these stacks affect the existing traffic (mostly Vanilla TCP)? • Redefine fairness in terms of the Impact: • Consider the effect on the background traffic only under different stacks • Vary the number of TCP flows to determine the impact on the Vanilla flows • BW impact = (throughput of n Vanilla flows) / (throughput of (n-1) Vanilla flows + 1 new TCP flow)

Impact of 1 TCP Flow • Plots: Throughput and Throughput Impact

1 New TCP Impact • Plot: CoV

Impact of 10 TCP Flows • Plots: Throughput and Throughput Impact

10 TCP Flows Impact • Plot: CoV

WAN Tests

Summary • Comparison of actual TCP differences through test platform kernel • Problems with SACK implementations mean that it is difficult under loss to maintain high throughput (>500Mbit/sec) • Other problems exist with kernel implementation that hinder performance • Compare stacks under different artificial (and hence repeatable) conditions • Single stream: • Multiple stream: • Need to study over wider range of networks • Move tests onto real production environments
