TCP Incast in Data Center Networks: A study of the problem and proposed solutions
Outline • TCP Incast - Problem Description • Motivation and challenges • Proposed Solutions • Evaluation of proposed solutions • Conclusion • References
TCP Incast – Problem Description • Incast terminology: • Barrier-synchronized workload • SRU (Server Request Unit) • Goodput vs. throughput • MTU (Maximum Transmission Unit) • BDP (Bandwidth-Delay Product) • and TCP acronyms such as RTT, RTO, CA (congestion avoidance), AIMD, etc.
TCP Incast – Problem A typical deployment scenario in data centers
TCP Incast - Problem • Many-to-one barrier-synchronized workload (sketched below): • The receiver requests k blocks of data from S storage servers • Each block of data is striped across the S storage servers • Each server responds with a "fixed" amount of data (fixed-fragment workload) • The client won't request block k+1 until all the fragments of block k have been received • Datacenter scenario: • k = 100 • S = 1-48 • fragment size: 256 KB
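A minimal sketch of this request pattern (illustrative only; fetch_fragment is a placeholder for a real TCP request to a storage server):

```python
# Barrier-synchronized, fixed-fragment workload: all S servers answer a block
# request at once, and block k+1 is not requested until block k is complete.
from concurrent.futures import ThreadPoolExecutor

K_BLOCKS = 100              # k: blocks requested in total
NUM_SERVERS = 48            # S: storage servers each block is striped across
FRAGMENT_SIZE = 256 * 1024  # SRU: fixed fragment returned by each server

def fetch_fragment(server_id: int, block_id: int) -> bytes:
    # Placeholder: a real client would issue a TCP request to one server here.
    return b"\x00" * FRAGMENT_SIZE

def fetch_block(block_id: int) -> bytes:
    # All S fragments are requested simultaneously, so all servers transmit
    # into the same bottleneck port at the same time.
    with ThreadPoolExecutor(max_workers=NUM_SERVERS) as pool:
        fragments = list(pool.map(lambda s: fetch_fragment(s, block_id),
                                  range(NUM_SERVERS)))
    # Barrier: the block completes only when every fragment has arrived.
    return b"".join(fragments)

for k in range(K_BLOCKS):
    fetch_block(k)          # the next block is not requested until this returns
```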
TCP Incast - Problem Goodput Collapse
TCP Incast - Problem • Switch buffers are inherently small: typically 32 KB-128 KB per port • The bottleneck switch buffer is overwhelmed by the servers sending data synchronously, so the switch drops packets • The RTT is typically 1-2 ms in datacenters while RTOmin is 200 ms; this gap means dropped packets are not retransmitted soon • All the other senders that have already delivered their data must wait until the dropped packet is retransmitted • The large RTO delays that retransmission, so goodput collapses (a back-of-the-envelope example follows)
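A rough back-of-the-envelope calculation, assuming a 1 Gbps bottleneck link and the 48 x 256 KB block from the workload above, shows why a single 200 ms timeout is so damaging:

```python
# Effect of one 200 ms RTO on a single block transfer (assumed parameters).
LINK_BPS = 1e9                      # 1 Gbps bottleneck link
SERVERS = 48
FRAGMENT_BYTES = 256 * 1024
RTO_MIN_S = 0.200                   # common default minimum RTO

block_bits = SERVERS * FRAGMENT_BYTES * 8     # ~100 Mbit per block
ideal_time = block_bits / LINK_BPS            # ~0.1 s at line rate
stalled_time = ideal_time + RTO_MIN_S         # one dropped fragment sits out the
                                              # RTO while every other sender idles
print(f"ideal goodput : {block_bits / ideal_time / 1e9:.2f} Gbps")
print(f"with one RTO  : {block_bits / stalled_time / 1e9:.2f} Gbps")
# ~1.00 Gbps vs ~0.33 Gbps: a single timeout wastes roughly two thirds of the link.
```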
Outline • TCP Incast - Problem Description • Motivation and challenges • Proposed Solutions • Evaluation of proposed solutions • Conclusion • References
Motivation • Internet datacenters support a myriad of services and applications • Google, Microsoft, Yahoo, Amazon • The vast majority of datacenters use TCP for communication between nodes • Companies like Facebook have adopted UDP as their transport protocol to avoid TCP incast, delegating flow control to application-layer protocols • The unique workloads (e.g., MapReduce, Hadoop), scale, and environment of internet datacenters violate the WAN assumptions under which TCP was originally designed • Ex: In a web-search application, many workers respond nearly simultaneously to a query; likewise, key-value pairs from many Mappers are transferred to the appropriate Reducers during the MapReduce shuffle stage
Incast in Bing (Microsoft) Ref: Slide from Albert Greenberg's (Microsoft) presentation at SIGCOMM '10
Challenges • Solutions should require minimal changes to the TCP implementation • Cannot decrease RTOmin below about 1 ms, since operating systems generally lack the high-resolution timers needed for such fine-grained RTOs • Have to address both internal and external flows • Cannot afford large buffers at the switch because they are costly • The solution needs to be easily deployable and cost effective
Outline • TCP Incast - Problem Description • Motivation and challenges • Proposed Solutions • Evaluation of proposed solutions • Conclusion • References
Proposed Solutions Solutions can be divided into • Application-level solutions • Transport-layer solutions • Transport-layer solutions aided by the switch's ECN and QCN capabilities Alternative ways to categorize the solutions • Avoid timeouts in TCP • Reduce RTOmin • Replace TCP • Rely on lower-layer mechanisms such as Ethernet flow control
Understanding the problem… • A collaborative study by UC Berkeley EECS and Intel Labs [1] • The study focused on • showing that the problem is general • deriving an analytical model • studying the impact of various TCP modifications on incast behavior
Different RTO Timers Observations: • The initial goodput minimum occurs at the same number of servers • A smaller RTO timer value gives a faster goodput "recovery" rate • The rate of decrease after the local maximum is the same across different minimum RTO settings
Decreasing the RTO gives a proportional increase in goodput • Surprisingly, a 1 ms RTO with delayed ACKs enabled performed better • With delayed ACKs disabled at a 1 ms RTO, the high rate of ACKs causes the sender's congestion window to be overridden, resulting in fluctuations in the smoothed RTT
Quantitative model • D: total amount of data to be sent (100 blocks of 256 KB) • L: total transfer time of the workload without any RTO events • R: the number of RTO events during the transfer • S: the number of servers • r: the minimum RTO timer value • I: inter-packet wait time • R and I were modeled from empirically observed behavior • Net goodput: G = D / (L + R x (r + I)) (a sketch of this model follows)
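A small sketch of the model as reconstructed above; in [1], R and I are functions of the number of servers S fitted to measurements, while here they are supplied directly (all example values below are illustrative):

```python
# Goodput model: G = D / (L + R * (r + I)), with R and I taken as inputs here.
D = 100 * 256 * 1024 * 8          # total data: 100 blocks of 256 KB, in bits

def goodput_bps(L, r, R, I):
    """L: ideal transfer time (s), r: min RTO (s), R: RTO events, I: inter-packet wait (s)."""
    return D / (L + R * (r + I))

# ~0.21 s ideal transfer time on a 1 Gbps link, 5 RTO events:
print(goodput_bps(L=0.21, r=0.200, R=5, I=0.001) / 1e6)   # ~173 Mbps with a 200 ms RTOmin
print(goodput_bps(L=0.21, r=0.001, R=5, I=0.001) / 1e6)   # ~953 Mbps with a 1 ms RTOmin
```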
Key Observations • A smaller minimum RTO timer value means a larger goodput value at the initial minimum • The initial goodput minimum occurs at the same number of senders, regardless of the minimum RTO timer value • The second-order goodput peak occurs at a higher number of senders for a larger RTO timer value • The smaller the RTO timer value, the faster the rate of recovery between the goodput minimum and the second-order goodput maximum • After the second-order goodput maximum, the slope of the goodput decrease is the same for different RTO timer values
Application-level solution [5] • No changes required to the TCP stack or network switches • Based on scheduling server responses to the same data block so that no data loss occurs (a staggering sketch follows) • Caveats: • Genuine retransmissions can still trigger cascading timeouts • Scheduling at the application level cannot be easily synchronized • Limited control over the transport layer
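One way to realize such scheduling is to let the client stagger its requests so that no more servers respond concurrently than the bottleneck buffer can absorb. The sketch below is an illustration of that idea, not the scheduler of [5]; the buffer and fragment sizes are assumed:

```python
# Application-level staggering: limit how many servers respond at once so
# their combined burst fits the bottleneck switch buffer (assumed sizes).
PORT_BUFFER_BYTES = 64 * 1024
FRAGMENT_BYTES = 256 * 1024
SERVERS = 48

def request_fragment(server: int, block_id: int) -> bytes:
    # Placeholder for a real TCP request to one storage server.
    return b"\x00" * FRAGMENT_BYTES

# How many simultaneous responders the buffer can absorb (at least one).
BATCH = max(1, PORT_BUFFER_BYTES // FRAGMENT_BYTES)

def fetch_block(block_id: int) -> bytes:
    fragments = []
    for start in range(0, SERVERS, BATCH):
        # Only BATCH servers are asked at a time; the next group is not
        # contacted until the current responses have drained from the buffer.
        for server in range(start, min(start + BATCH, SERVERS)):
            fragments.append(request_fragment(server, block_id))
    return b"".join(fragments)
```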
ICTCP: Incast Congestion Control for TCP in Data Center Networks [8] • Features • Solution based on dynamically adjusting the TCP receive window • Can be implemented on the receiver side only • Focuses on avoiding packet losses before incast congestion occurs • Test implementation built on Windows NDIS • Novelties in the solution: • Uses the available bandwidth to coordinate the receive-window increases of all incoming connections • Per-flow congestion control is performed independently in slotted time on the order of the RTT • The receive-window adjustment is based on the ratio of the difference between expected and measured throughput to the expected throughput
Design considerations • The receiver knows how much throughput each flow achieves and how much bandwidth remains available • An overly tight window control may constrain TCP performance, while too loose a control does not prevent incast congestion • Only low-latency flows (RTT below 2 ms) are considered • The receive-window increase is determined by the available bandwidth • The frequency of receive-window based congestion control should be determined per flow • The receive-window based scheme should adjust the window according to both link congestion and the application's requirement
ICTCP Algorithm • Control trigger: available bandwidth • Calculate the available bandwidth and estimate the potential per-flow throughput increase before increasing a receive window • Time is divided into slots, each consisting of two sub-slots • For each network interface, measure the available bandwidth in the first sub-slot and allocate the quota for window increases in the second sub-slot • Ensure the total receive-window increase stays within the available bandwidth measured in the first sub-slot (see the sketch below)
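A minimal sketch of this trigger, assuming (as an illustration of [8]) that the available bandwidth is computed as BWA = max(0, alpha x C - BWT) with a headroom factor alpha = 0.9, where C is the link capacity and BWT is the measured incoming traffic; the measurement plumbing is stubbed out:

```python
# Sketch of ICTCP's control trigger on the receiver's network interface.
# Window increases in the second sub-slot draw from the quota measured in the
# first sub-slot. Parameters below are assumptions for illustration.
LINK_CAPACITY_BPS = 1e9     # C: interface capacity
ALPHA = 0.9                 # assumed headroom factor

class InterfaceQuota:
    def __init__(self):
        self.quota_bps = 0.0

    def end_of_first_subslot(self, measured_incoming_bps: float):
        # Remaining available bandwidth on the interface becomes the quota.
        self.quota_bps = max(0.0, ALPHA * LINK_CAPACITY_BPS - measured_incoming_bps)

    def try_consume(self, requested_increase_bps: float) -> bool:
        # A connection may enlarge its receive window only if the sum of all
        # increases stays within the quota measured in the first sub-slot.
        if requested_increase_bps <= self.quota_bps:
            self.quota_bps -= requested_increase_bps
            return True
        return False
```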
ICTCP Algorithm • Per-connection control interval: 2 x RTT • To estimate the throughput of a TCP connection for receive-window adjustment, the shortest usable time scale is one RTT of that connection • The control interval for a TCP connection in ICTCP is therefore 2 x RTT: • one RTT of latency for the adjusted window to take effect • one additional RTT for measuring the throughput with the newly adjusted window • For any TCP connection, if the current time falls in the second global sub-slot and more than 2 x RTT has passed since its last receive-window adjustment, the window may be increased based on the newly observed TCP throughput and the currently available bandwidth
ICTCP Algorithm • Window adjustment for a single connection (sketched below) • The receive window is adjusted based on the connection's measured incoming throughput • The measured throughput b_m reflects the application's current demand on that TCP connection • The expected throughput b_e is the throughput the connection could achieve if it were constrained only by the receive window (roughly rwnd / RTT) • Define the throughput-difference ratio d_b = (b_e - b_m) / b_e • Adjust the receive window based on the following conditions: • If d_b is at most a small threshold γ1: increase the receive window by one MSS, provided the current time is in the second global sub-slot and there is enough quota of available bandwidth on the interface; decrease the quota correspondingly when the window is increased • If d_b exceeds a larger threshold γ2 for three continuous RTTs: decrease the receive window by one MSS; the minimal receive window is 2 MSS • Otherwise: keep the current receive window
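A sketch of this per-connection adjustment, continuing the InterfaceQuota sketch above; the threshold values γ1 = 0.1 and γ2 = 0.5 and the MSS are assumptions chosen for illustration:

```python
# Per-connection ICTCP receive-window adjustment (assumed thresholds and MSS).
MSS = 1460                 # bytes, assumed
GAMMA1, GAMMA2 = 0.1, 0.5  # assumed small/large thresholds on d_b

class Connection:
    def __init__(self, rwnd: int = 2 * MSS):
        self.rwnd = rwnd
        self.decrease_streak = 0   # RTTs for which the decrease condition held

    def adjust(self, measured_bps, rtt_s, in_second_subslot, quota):
        # `quota` is any object with try_consume(bps) -> bool, e.g. the
        # InterfaceQuota sketch above.
        # Expected throughput if only the receive window were the constraint.
        expected_bps = max(measured_bps, self.rwnd * 8 / rtt_s)
        d_b = (expected_bps - measured_bps) / expected_bps

        if d_b <= GAMMA1:
            # The window is the bottleneck: grow by one MSS if we are in the
            # second global sub-slot and the interface quota allows it.
            if in_second_subslot and quota.try_consume(MSS * 8 / rtt_s):
                self.rwnd += MSS
            self.decrease_streak = 0
        elif d_b > GAMMA2:
            # Demand is far below the window: shrink by one MSS, but only after
            # the condition has held for three continuous RTTs (floor: 2 MSS).
            self.decrease_streak += 1
            if self.decrease_streak >= 3:
                self.rwnd = max(2 * MSS, self.rwnd - MSS)
                self.decrease_streak = 0
        else:
            self.decrease_streak = 0   # otherwise keep the current window
```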
ICTCP Algorithm • Fairness controller for multiple connections • Fairness is considered only among low-latency flows • Decrease windows for fairness only when BWA < 0.2 x C • For a window decrease, cut the receive window by one MSS on selected TCP connections (sketched below) • Select the connections whose receive window is larger than the average window of all connections • For window increase, fairness is achieved automatically by the window-adjustment rule above
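A small sketch of this fairness step, reusing the Connection class and MSS constant from the sketch above:

```python
def fairness_decrease(connections, bwa_bps, capacity_bps):
    # Trigger only when the available bandwidth is scarce (BWA < 0.2 * C).
    if bwa_bps >= 0.2 * capacity_bps or not connections:
        return
    avg_rwnd = sum(c.rwnd for c in connections) / len(connections)
    for c in connections:
        # Shrink only connections whose window exceeds the average, by one MSS,
        # never going below the 2 * MSS floor.
        if c.rwnd > avg_rwnd:
            c.rwnd = max(2 * MSS, c.rwnd - MSS)
```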
ICTCP Experimental Results • Testbed • 47 servers • 1 LB4G 48-port Gigabit Ethernet switch • Gigabit Ethernet Broadcom NIC at the hosts • Windows Server 2008 R2 Enterprise 64-bit version
Issues with ICTCP • Scalability to a large number of TCP connections is an issue, because the receive window may need to shrink below 1 MSS, degrading TCP performance • Extending ICTCP to handle congestion in general cases, where the sender and receiver are not under the same switch and the bottleneck link is not the last hop to the receiver, remains open • Supporting future high-bandwidth, low-latency networks is also an open question
DCTCP • Features • A TCP-like protocol for data centers • Uses ECN (Explicit Congestion Notification) to provide multi-bit feedback to the end hosts • The claim is that DCTCP provides better throughput than TCP while using 90% less buffer space • Provides high burst tolerance and low latency for short flows • Can also handle a 10x increase in foreground and background traffic without a significant performance penalty
DCTCP • Overview • Applications in data centers largely require: • low latency for short flows • high burst tolerance • high utilization for long flows • Short flows carry real-time deadlines of roughly 10-100 ms, hence the need for low latency • High utilization for long flows is essential because they continuously update internal data structures • The study analyzed production traffic from approximately 6000 servers (about 150 TB of traffic) over a period of one month • Query traffic (responses of 2 KB to 20 KB) experiences the incast impairment
DCTCP • Overview (contd.) • The proposed DCTCP uses the ECN capability available in most modern switches • Derives multi-bit feedback on congestion from the single-bit stream of ECN marks • The essence of the proposal is to keep switch buffer occupancies persistently low while maintaining high throughput • To control the queue length at switches, it takes an Active Queue Management (AQM) approach that uses explicit feedback from congested switches • The claim is that only about 30 lines of code added to TCP and the setting of a single parameter on switches are needed • DCTCP addresses three problems: • Incast (our focus here) • Queue buildup • Buffer pressure
DCTCP • Algorithm • Reacts to the extent of congestion rather than merely its presence • The key step is deriving multi-bit feedback from the single-bit sequence of marks • Three components of the algorithm: • simple marking at the switch • ECN-Echo at the receiver • a controller at the sender
DCTCP • Simple marking at the switch • An arriving packet is marked with the CE (Congestion Experienced) codepoint if the queue occupancy is greater than K (the marking threshold) • Marking is based on the instantaneous queue length, not an average • ECN-Echo at the receiver: • In standard TCP, the receiver sets ECN-Echo on all ACKs until it receives CWR from the sender • A DCTCP receiver sets ECN-Echo only on ACKs for packets that carried the CE codepoint
DCTCP • Controller at the sender: • The sender maintains an estimate α of the fraction of packets that are marked, updated once per window of data: α ← (1 - g) x α + g x F, where F is the fraction of packets marked in the last window and g is a small weight (0 < g < 1) • α close to 0 indicates low congestion; α close to 1 indicates high congestion • While TCP cuts its window in half, DCTCP uses α to size the cut (sketched below): • cwnd ← cwnd x (1 - α/2)
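A minimal sketch of this sender-side controller; the weight g = 1/16 and the starting window are assumed values for illustration:

```python
# DCTCP sender controller: alpha tracks the marked fraction; the window is cut
# in proportion to alpha instead of being halved.
G = 1.0 / 16                     # assumed EWMA weight g

class DctcpSender:
    def __init__(self, cwnd_pkts: float = 10.0):
        self.cwnd = cwnd_pkts    # congestion window, in packets
        self.alpha = 0.0         # running estimate of the marked fraction

    def on_window_acked(self, acked_pkts: int, marked_pkts: int):
        # F: fraction of packets carrying ECN-Echo in the last window of data.
        f = marked_pkts / acked_pkts if acked_pkts else 0.0
        # alpha <- (1 - g) * alpha + g * F, updated once per window.
        self.alpha = (1 - G) * self.alpha + G * f
        if marked_pkts:
            # Cut the window in proportion to the estimated congestion extent.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
```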
DCTCP • Analysis: model α at the point where the window reaches the critical size W*, i.e., when the queue length reaches the marking threshold K • The maximum queue size Q_max depends on the number N of synchronously sending servers • A lower bound for K can be derived: K > (C x RTT) / 7, where C is the link capacity (a worked example follows)
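Plugging assumed data-center numbers (a 100 μs RTT and 1 or 10 Gbps links, chosen for illustration) into this lower bound shows how small the marking threshold can be:

```python
# Lower bound on the DCTCP marking threshold: K > C * RTT / 7 (assumed inputs).
RTT_S = 100e-6        # 100 microsecond round-trip time
PKT_BYTES = 1500

for gbps in (1, 10):
    k_bytes = (gbps * 1e9 * RTT_S) / 7 / 8        # bits -> bytes
    print(f"{gbps:>2} Gbps: K > {k_bytes / PKT_BYTES:.1f} packets "
          f"(~{k_bytes / 1024:.1f} KB)")
# 1 Gbps: K > ~1.2 packets (~1.7 KB); 10 Gbps: K > ~11.9 packets (~17.4 KB)
```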
DCTCP • How does DCTCP solve incast? • TCP suffers from timeouts when N > 10 • DCTCP senders receive ECN marks and slow their rate • DCTCP still suffers timeouts when N is large enough to overwhelm the static buffer size • The remedy is dynamic buffer allocation at the switch
Outline • TCP Incast - Problem Description • Motivation and challenges • Proposed Solutions • Evaluation of proposed solutions • Conclusion • References
Evaluation of proposed solutions • Application-level solution • Genuine retransmissions can still lead to cascading timeouts and congestion • Scheduling at the application level cannot be easily synchronized • Limited control over the transport layer • ICTCP: a solution that needs minimal changes and is cost effective • Scalability to a large number of TCP connections is an issue • Extending ICTCP to handle congestion in general cases remains a limited solution • ICTCP for future high-bandwidth, low-latency networks will need extra support from link-layer technologies • DCTCP: a solution that needs minimal changes but requires switch support • DCTCP requires dynamic buffering for a larger number of senders
Conclusion • No solution completely solves the problem, short of lowering the minimum RTO • Existing solutions pay little attention to foreground and background traffic together • We need solutions that are cost effective, require minimal changes to the environment, and, of course, solve incast!
References
1. Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph, "Understanding TCP Incast Throughput Collapse in Datacenter Networks", in Proc. of ACM WREN, 2009.
2. S. Kulkarni and P. Agrawal, "A Probabilistic Approach to Address TCP Incast in Data Center Networks", Distributed Computing Systems Workshops (ICDCSW), 2011.
3. Peng Zhang, Hongbo Wang, and Shiduan Cheng, "Shrinking MTU to Mitigate TCP Incast Throughput Collapse in Data Center Networks", Communications and Mobile Computing (CMC), 2011.
4. Yan Zhang and N. Ansari, "On Mitigating TCP Incast in Data Center Networks", INFOCOM Proceedings, IEEE, 2011.
5. Maxim Podlesny and Carey Williamson, "An Application-Level Solution for the TCP-Incast Problem in Data Center Networks", IWQoS '11: Proceedings of the 19th International Workshop on Quality of Service, IEEE, June 2011.
6. Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan, "Data Center TCP (DCTCP)", SIGCOMM '10: Proceedings of the ACM SIGCOMM, August 2010.
7. Hongyun Zheng, Changjia Chen, and Chunming Qiao, "Understanding the Impact of Removing TCP Binary Exponential Backoff in Data Centers", Communications and Mobile Computing (CMC), 2011.
8. Haitao Wu, Zhenqian Feng, Chuanxiong Guo, and Yongguang Zhang, "ICTCP: Incast Congestion Control for TCP in Data Center Networks", Co-NEXT '10: Proceedings of the 6th International Conference, ACM, November 2010.