Y. Kodama, R. Takano, F. Okazaki, and T. Kudoh

Improvement of Communication Performance of Linux TCP/IP by Fixing a Problem in Detection of Loss of Retransmission
Y. Kodama, R. Takano, F. Okazaki, and T. Kudoh
National Institute of Advanced Industrial Science and Technology (AIST), Japan
PFLDnet 2008 @Manchester, 7 March 2008

Background (1)
We observed a large variation in goodput in this simple environment using Linux 2.6.17.
[Figure: Node A and Node B, each attached by GbE, connected through a bottleneck link of 500 Mbps with an RTT of 10 ms]
• 1 GByte of data was transferred from Node A to Node B.
• Goodput varied widely between runs, from 300 Mbps down to 70 Mbps.
• The minimum goodput was less than 1/4 of the maximum goodput.

Background (2)
[Figure: bandwidth (Mbps) versus time (sec)]
• Communication frequently stopped for several seconds.
• We found that these communication stops were caused by several problems in the TCP implementation of Linux 2.6.17.

Background (3)
• Linux kernel 2.6.23 fixed the following problems:
  • With ABC (Appropriate Byte Counting) in use, cwnd was not increased in the Loss state after an RTO.
  • Some SACK blocks were destroyed while the SACK blocks were being sorted.
[Figure annotations: X 5, X 2, 20%]

Background (4)
[Figure: expanded view]
• There were many 200 ms communication stops caused by RTOs (Retransmission Timeouts).
• This talk focuses on why the RTOs occurred and how we improved the goodput.

Outline
• Evaluation environment and detailed examination of the RTOs
• Reason for the RTOs and proposal of a fix to eliminate them
• Evaluations of the fix in several network environments
• Conclusion

Evaluation Environment
[Figure: Node A and Node B, each attached by GbE, connected through a Router (Cisco GSR12404) and a WAN Emulator (AIST GtrcNET-1)]
• 500 Mbps by policing
• 5 ms one-way delay
Policing (token bucket):
• Tokens are incremented according to the target rate.
• Tokens are decremented according to the size of each transferred packet.
• For each input data packet: if token >= packet size, the packet is transferred; otherwise it is discarded.
• The bucket allows bursts up to a maximum burst size. This burst tolerance is the reason why the bandwidth we measured was larger than the bottleneck link rate.
[Figure: bandwidth versus time for the output, showing the target rate and the max burst]
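The token-bucket policing rule above can be sketched in a few lines of Python. This is only an illustration of the rule on the slide, not the router's implementation: the class name, the parameter names (`rate_bps`, `max_burst_bytes`) and the choice to start with a full bucket are assumptions.

```python
class TokenBucketPolicer:
    """Illustrative token-bucket policer (names and defaults are assumptions)."""

    def __init__(self, rate_bps, max_burst_bytes):
        self.rate = rate_bps / 8.0         # token refill rate in bytes per second
        self.max_burst = max_burst_bytes   # bucket depth, i.e. the burst tolerance
        self.tokens = max_burst_bytes      # assumption: the bucket starts full
        self.last = 0.0                    # time of the previous packet (seconds)

    def offer(self, now, size):
        """Return True if a packet of `size` bytes arriving at `now` is transferred."""
        # Increment tokens according to the target rate, capped at the max burst.
        self.tokens = min(self.max_burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            # Decrement tokens according to the transferred packet size.
            self.tokens -= size
            return True
        return False                       # not enough tokens: discard


policer = TokenBucketPolicer(rate_bps=8000, max_burst_bytes=3000)
results = [policer.offer(t, 1500) for t in (0.0, 0.0, 0.0, 1.0, 1.5)]
print(results)
```

In this run the first two back-to-back packets pass even though they momentarily exceed the target rate. That is the burst tolerance: over a short measurement interval the observed bandwidth can exceed the policed rate.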

GtrcNET
• A network testbed built around a large-scale FPGA
• Functions are easy to program while keeping wire-rate speed
• GtrcNET-1: GbE (GBIC) x 4 ports + 16 MBytes memory/port
• Many functions for network experiments have been implemented:
  • Bandwidth measurement (aggregate, per-stream)
  • Delay emulation
  • Packet capture
  • Test packet generation
  • Rate control (pacing, shaping, policing)
• GtrcNET-10: 10GbE (XENPAK) x 3 ports + 1 GBytes memory/port
http://www.gtrc.aist.go.jp/gnet/

Detailed examination
• We examined the detailed behavior from many points of view to figure out why the RTOs occurred.
• Bandwidth every 10 ms
• Sampling of kernel variables, such as Cwnd, Ssthresh and RTO, every 10 ms by Web100
• Data sequences, ACK sequences and SACK blocks by packet capturing
• Dumping of kernel variables at every function invocation
  • We could get very precise data.
  • We had to recompile the kernel every time we changed the variables to be dumped.
  • The dump size was limited by the kernel memory.

Bandwidth every 10ms
Bandwidth was measured every 10 ms by GtrcNET.

Cwnd every 10ms
Cwnd and other variables were sampled every 10 ms by Web100.
[Figure label: Timeouts]
• Communication almost stopped for 200 ms, and an RTO occurred.
• Variables could not be sampled, because the kernel load was high.

Data and ACK/SACK sequence
Packets from/to Node A were captured to obtain the data/ACK sequences.
[Figure: data sequence versus time (sec), with an expanded view; legend: Data Seq, ACK Seq, SACK Start, SACK End, X markers]
• SACKs for the retransmitted packets were received, but no packets were retransmitted again, and an RTO occurred.
• SACKs were received and the not-yet-SACKed packets were retransmitted.

SACK (Selective ACK)
• TCP/IP uses acknowledgement packets (ACKs) for reliable communication.
• If the same ACK is received three times, the sender detects a packet loss.
• An ACK carries no information about which packets are lost.
  • All packets after the DACK sequence are retransmitted, including packets that have already been received.
  • Retransmission performance is low.
• The Selective ACK mechanism was proposed to overcome this limitation.
  • A SACK carries information about which packets were received after the packet losses.
  • The sender can retransmit only the lost packets.

Example of SACK
[Figure: sequence space 0 to 13 between snd_una and snd_nxt; legend: ACK received, received, sent, not sent yet, packet loss (X)]
• If a packet loss is detected, the receiver sends an ACK packet with SACK blocks.
• A SACK block specifies the consecutive packet segments that are received.
• When a new SACK block is sent, previous SACK blocks are also sent in the ACK packet.
Received packet → ACK sent with SACK:
P(3) → ACK(0), SACK {(3,4)}
P(4) → ACK(0), SACK {(3,5)}
P(5) → ACK(0), SACK {(3,6)}
P(6) → ACK(0), SACK {(3,7)}
P(8) → ACK(0), SACK {(8,9),(3,7)}
P(9) → ACK(0), SACK {(8,10),(3,7)}
P(1) → ACK(0), SACK {(1,2),(8,10)}
P(2) → ACK(0), SACK {(1,3),(8,10)}
P(10) → ACK(0), SACK {(8,11),(1,3)}
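The receiver-side rules above (report consecutive received segments as blocks, newest block first, repeat older blocks) can be sketched as follows. This is a simplified model with one sequence number per packet. The class name `SackReceiver` and the `max_blocks` limit are assumptions for illustration, and it is only meant to reproduce the ACKs for P(3) through P(9) in the example.

```python
class SackReceiver:
    """Simplified SACK-generating receiver; one sequence number per packet."""

    def __init__(self, max_blocks=4):
        self.rcv_nxt = 0            # next in-order packet expected (cumulative ACK)
        self.blocks = []            # out-of-order blocks, most recently changed first
        self.max_blocks = max_blocks

    def receive(self, seq):
        """Process packet `seq`; return (ack, sack_blocks) with exclusive block ends."""
        if seq < self.rcv_nxt:
            return self.rcv_nxt, self.blocks[:self.max_blocks]  # old duplicate
        start, end = seq, seq + 1
        rest = []
        for s, e in self.blocks:
            if s <= end and start <= e:      # block touches or overlaps the new data
                start, end = min(start, s), max(end, e)
            else:
                rest.append((s, e))
        if start == self.rcv_nxt:
            self.rcv_nxt = end               # hole at the front is filled
            self.blocks = rest
        else:
            self.blocks = [(start, end)] + rest  # newest block goes first
        return self.rcv_nxt, self.blocks[:self.max_blocks]


rx = SackReceiver()
acks = [rx.receive(p) for p in (3, 4, 5, 6, 8, 9)]
for pkt, (ack, sack) in zip((3, 4, 5, 6, 8, 9), acks):
    print(f"P({pkt}) -> ACK({ack}), SACK {sack}")
```

Because the most recently updated block is listed first, the order of blocks inside an ACK is not sorted by sequence number, which matters for the sender-side processing discussed next.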

Example of detecting loss of retransmission
[Figure: scoreboarding process between snd_una and snd_nxt, for a) transmit order and b) ACK/SACK. Legend: n = transmitted packet; n/m = retransmitted packet n, with m: ack_seq; states: Lost, Lost | Retransmitted, Sacked; SACK(start_seq, end_seq). Packets are retransmitted as 0/10 and 7/11. On a) SACK{(3,4)} and b) SACK{(8,11)}, snd_nxt < SACK(8,11) holds for 0/10, so packet 0 is retransmitted again as 0/13: Retransmit (Good).]
• The current Linux kernel can detect the loss of a retransmission before an RTO occurs.
  → The retransmission of packet 0 is detected as lost.
• However, in our real experiments, we observed some RTOs.

Behavior of the problem we found
[Figure: scoreboarding process between snd_una and snd_nxt. Legend: n = transmitted packet; n/m = retransmitted packet n, with m: ack_seq; states: Lost, Lost | Retransmit, Sacked; SACK(start_seq, end_seq). Packets are retransmitted as 0/10 and 7/11. The ACK carries SACK{(8,11),(1,3)}, where SACK(1,3) < SACK(8,11); packet 0 is not retransmitted: No retransmit (NG).]
• Packet 0 remains in the not-yet-SACKed state until an RTO occurs.
• This is the problem we found in detecting loss of retransmission.

Example of an application of the fix
[Figure: scoreboarding process between snd_una and snd_nxt. Legend: n = transmitted packet; n/m = retransmitted packet n, with m: ack_seq; states: Lost, Lost | Retransmit, Sacked; SACK(start_seq, end_seq). The ACK carries SACK{(8,11),(1,3)}. FIX: the max over SACK(1,3) and SACK(8,11) is used, giving 11; since 10 < 11 for 0/10, packet 0 is retransmitted again as 0/13: Retransmit.]
• With this very simple fix, the sender can detect the loss of a retransmission correctly.
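The comparison on these slides can be mimicked with a small Python sketch. It is not the kernel code: the function name and data layout are hypothetical. The pre-fix branch assumes that only the end of the last SACK block processed is compared, which is one way to reproduce the symptom shown and is not confirmed by the slides. The fixed branch uses the maximum end over all SACK blocks, as on the slide.

```python
def detect_lost_retransmits(retransmitted, sack_blocks, use_fix=True):
    """Return the retransmitted packets that should be marked lost again.

    retransmitted: {packet: snd_nxt recorded when it was retransmitted},
                   i.e. the slides' n/m notation (0/10 -> {0: 10}).
    sack_blocks:   (start_seq, end_seq) pairs in the order carried by the ACK,
                   with exclusive ends.
    """
    if not sack_blocks:
        return []
    if use_fix:
        # Fix from the slide: compare against the highest SACKed sequence.
        received_upto = max(end for _, end in sack_blocks)
    else:
        # Assumed pre-fix behaviour (illustrative only): use the last block seen.
        received_upto = sack_blocks[-1][1]

    def is_sacked(pkt):
        return any(start <= pkt < end for start, end in sack_blocks)

    # A retransmission is judged lost if data sent after it has been SACKed
    # while the retransmitted packet itself is still not SACKed.
    return sorted(pkt for pkt, mark in retransmitted.items()
                  if not is_sacked(pkt) and mark < received_upto)


rexmits = {0: 10, 7: 11}
print(detect_lost_retransmits(rexmits, [(8, 11)]))                         # single block
print(detect_lost_retransmits(rexmits, [(8, 11), (1, 3)], use_fix=False))  # problem case
print(detect_lost_retransmits(rexmits, [(8, 11), (1, 3)], use_fix=True))   # with the fix
```

With the single block SACK{(8,11)} both variants flag packet 0. With SACK{(8,11),(1,3)} only the max-based variant does, matching the "No retransmit (NG)" and "Retransmit" outcomes on the slides. Packet 7 (7/11) is not flagged, because 11 is not less than 11.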

Evaluation (1) goodput
* our fix is incorporated

Evaluation (2) bandwidth
• Without our fix, bandwidth became 0 after a packet loss, and it took 200 ms before slow start began.
• With our fix, bandwidth did not drop to 0 after a packet loss.

Evaluation (3) cwnd/RTO
[Figure panels: 2.6.23 and 2.6.23 + fix, for cwnd and for RTO]
• Without our fix, RTOs occurred and cwnd became 0 after a packet loss.
• With our fix, there were no RTOs, and cwnd did not drop to 0.

Evaluation (4) CUBIC
[Figure panels: 2.6.23 and 2.6.23 + fix]
• Without our fix, the result was almost the same as with BIC.
• With our fix, there were no RTOs, but cwnd increased very slowly.
• The congestion window size of CUBIC is dominated by the elapsed time since the last loss event.
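To illustrate why the CUBIC window depends on the elapsed time since the last loss, here is a minimal sketch of the CUBIC window function in the form given in RFC 8312. The constants `c=0.4` and `beta=0.7` are that RFC's defaults and are an assumption here; the 2.6.23 kernel used in the evaluation may have used different values and scaling.

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """CUBIC window (in segments) t seconds after the last loss event.

    w_max: window size just before the loss.
    c, beta: RFC 8312 defaults, assumed here for illustration only.
    """
    # K is the time at which the window climbs back to w_max.
    k = (w_max * (1.0 - beta) / c) ** (1.0 / 3.0)
    return c * (t - k) ** 3 + w_max


w_max = 1000.0
for t in (0.0, 1.0, 5.0, 9.0):
    print(f"t={t:4.1f}s  cwnd={cubic_window(t, w_max):7.1f}")
```

With these assumed constants and `w_max = 1000`, the window starts at `beta * w_max` and needs about 9 seconds to return to `w_max`, independent of the RTT. On a 10 ms RTT path that is many round trips, which is in line with the slow cwnd growth noted above.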

Evaluation (5) 100ms RTT
[Figure annotation: 1/100]
• With 100 ms RTT, the effects of our fix were small.
• The variation of goodput was still large even with our fix.

Conclusion
• When we used kernel 2.6.23 or earlier, we observed a large variation in goodput if retransmitted packets were lost.
• We found that it was caused by a failure to detect the loss of retransmissions.
• We fixed the problem, and the goodput improved by about 30%.
• The fix has been incorporated into the latest Linux kernel, 2.6.24.

Several hints for fixing problems on networks
• It is difficult to find problems on a real network.
• Network emulation is useful.
  • It provides a reproducible experimentation environment.
• Sampling of kernel variables sometimes fails when the kernel is very busy.
• Packet capturing by hardware is useful.
  • Data can be retrieved packet by packet, independently of the kernel.
