
TCP Incast in Data Center Networks




  1. TCP Incast in Data Center Networks A study of the problem and proposed solutions

  2. Outline • TCP Incast - Problem Description • Motivation and challenges • Proposed Solutions • Evaluation of proposed solutions • Conclusion • References

  3. Outline • TCP Incast - Problem Description • Motivation and challenges • Proposed Solutions • Evaluation of proposed solutions • Conclusion • References

  4. TCP Incast – Problem Description • Incast terminology: • Barrier-synchronized workload • SRU (Server Request Unit) • Goodput vs. throughput • MTU • BDP • and TCP acronyms such as RTT, RTO, CA, AIMD, etc.

  5. TCP Incast – Problem A typical deployment scenario in data centers

  6. TCP Incast - Problem • Many-to-one barrier-synchronized workload: • Receiver requests k blocks of data from S storage servers. • Each block of data is striped across the S storage servers. • Each server responds with a “fixed” amount of data (fixed-fragment workload). • Client won’t request block k+1 until all the fragments of block k have been received. • Datacenter scenario: • k = 100 • S = 1-48 • fragment size: 256KB
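
A minimal sketch of this fixed-fragment, barrier-synchronized pattern (the helper names and the thread-pool client are illustrative, not taken from any of the referenced systems):

```python
import concurrent.futures

BLOCKS = 100                 # k: number of data blocks the client requests
SERVERS = 8                  # S: number of storage servers
FRAGMENT = 256 * 1024        # fixed fragment (SRU) size in bytes

def fetch_fragment(server_id: int, block_id: int) -> bytes:
    # Placeholder for a network read of one SRU from one server.
    return bytes(FRAGMENT)

def fetch_block(block_id: int) -> bytes:
    # Barrier synchronization: issue S requests in parallel and do not move
    # on to block k+1 until every fragment of block k has arrived.
    with concurrent.futures.ThreadPoolExecutor(max_workers=SERVERS) as pool:
        parts = pool.map(lambda s: fetch_fragment(s, block_id), range(SERVERS))
        return b"".join(parts)

data = b"".join(fetch_block(k) for k in range(BLOCKS))
```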

  7. TCP Incast - Problem Goodput Collapse

  8. TCP Incast - Problem • Switch buffers are inherently small, typically 32KB-128KB per port • The bottleneck switch buffer gets overwhelmed by the servers’ synchronized sending, and the switch consequently drops packets • RTT is typically 1-2ms in datacenters while RTOmin is 200ms; this gap means dropped packets are not retransmitted soon • All the other senders that have already sent their data must wait until the dropped packet is retransmitted • But the large RTO delays that retransmission, resulting in a decrease in goodput
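
Back-of-envelope arithmetic behind this point (illustrative values: a 1 Gbps bottleneck, 1 ms RTT, 200 ms RTOmin):

```python
LINK_BPS = 1e9        # bottleneck link rate (1 Gbps)
RTT_S = 1e-3          # typical datacenter round-trip time
RTO_MIN_S = 0.2       # default TCP minimum retransmission timeout

# While one sender waits out a 200 ms RTO, the barrier keeps the other
# senders idle, so the bottleneck link carries essentially nothing.
idle_rtts = RTO_MIN_S / RTT_S
unused_bytes = LINK_BPS / 8 * RTO_MIN_S

print(f"one timeout idles the link for ~{idle_rtts:.0f} RTTs, "
      f"wasting ~{unused_bytes / 1e6:.0f} MB of link capacity")
```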

  9. Outline • TCP Incast - Problem Description • Motivation and challenges • Proposed Solutions • Evaluation of proposed solutions • Conclusion • References

  10. Motivation • Internet datacenters support a myriad of services and applications • Google, Microsoft, Yahoo, Amazon • The vast majority of datacenters use TCP for communication between nodes • Companies like Facebook have adopted UDP as their transport-layer protocol to avoid TCP incast, pushing the responsibility for flow control up to application-layer protocols • The unique workloads (e.g., MapReduce, Hadoop), scale, and environment of internet datacenters violate the WAN assumptions under which TCP was originally designed • Ex: in a web search application, many workers respond nearly simultaneously to search queries, and key-value pairs from many Mappers are transferred to the appropriate Reducers during the shuffle stage

  11. Incast in Bing (Microsoft) Ref: slide from Albert Greenberg's (Microsoft) presentation at SIGCOMM '10

  12. Challenges • Minimal changes to the TCP implementation should be needed • Cannot decrease RTOmin below about 1ms, because operating systems do not support such high-resolution timers for the RTO • Must handle both internal and external flows • Cannot afford large buffers at the switch because they are costly • The solution needs to be easily deployable and cost effective

  13. Outline • TCP Incast - Problem Description • Characteristics of the problem and challenges • Proposed Solutions • Evaluation of proposed solutions • Conclusion • References

  14. Proposed Solutions Solutions can be divided into • Application-level solutions • Transport-layer solutions • Transport-layer solutions aided by switches’ ECN and QCN capabilities. An alternative way to categorize the solutions: • Avoiding timeouts in TCP • Reducing RTOmin • Replacing TCP • Calling on lower-layer functionality such as Ethernet flow control for help

  15. Understanding the problem… • Collaborative study by Berkeley EECS and Intel Labs [1] • Their study focused on • showing that the problem is general, • deriving an analytical model, and • studying the impact of various modifications to TCP on incast behavior.

  16. Different RTO Timers Observations: • The initial goodput minimum occurs at the same number of servers. • A smaller RTO timer value gives a faster goodput “recovery” rate. • The rate of decrease after the local maximum is the same across different minimum RTO settings.

  17. Decreasing the RTO – proportional increase in goodput • Surprisingly, a 1ms RTO with delayed ACKs enabled was the better performer • With delayed ACKs disabled at a 1ms RTO, the high rate of ACKs overdrives the sender’s TCP congestion window and causes fluctuations in the smoothed RTT

  18. QUANTITATIVE MODEL: • D: total amount of data to be sent, 100 blocks of 256KB • L: total transfer time of the workload without any RTO events • R: the number of RTO events during the transfer • S: number of servers • r: the value of the minimum RTO timer • I: inter-packet wait time • Modeling of R and I was done based on empirically observed behavior. Net goodput: goodput ≈ D / (L + R·r), i.e., each of the R timeout events stalls the barrier-synchronized transfer for roughly the minimum RTO r.
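
As a rough illustration, the simplified form goodput ≈ D / (L + R·r) can be evaluated directly (the values of R and I must come from the empirical models in [1]; the numbers below are placeholders):

```python
def goodput_mbps(d_bytes: float, l_s: float, r_events: int, rto_s: float) -> float:
    # d: total data, l: transfer time with no RTO events,
    # r_events: number of RTO events, rto_s: minimum RTO timer value.
    return d_bytes * 8 / (l_s + r_events * rto_s) / 1e6

D = 100 * 256 * 1024          # 100 blocks of 256 KB
L = D * 8 / 1e9               # ideal time on a 1 Gbps bottleneck, ~0.21 s
print(goodput_mbps(D, L, r_events=0, rto_s=0.2))   # ~1000 Mbps with no timeouts
print(goodput_mbps(D, L, r_events=5, rto_s=0.2))   # ~173 Mbps with 5 timeouts
```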

  19. Key Observations • A smaller minimum RTO timer value means larger goodput values at the initial minimum. • The initial goodput minimum occurs at the same number of senders, regardless of the value of the minimum RTO timer. • The second-order goodput peak occurs at a higher number of senders for a larger RTO timer value. • The smaller the RTO timer value, the faster the rate of recovery between the goodput minimum and the second-order goodput maximum. • After the second-order goodput maximum, the slope of the goodput decrease is the same for different RTO timer values.

  20. Application-level solution [5] • No changes required to the TCP stack or network switches • Based on scheduling server responses to the same data block so that no data loss occurs • Caveats: • Genuine retransmissions remain a problem • Scheduling at the application level cannot be easily synchronized • Limited control over the transport layer
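
A rough sketch of the scheduling idea (this is not the algorithm from [5]; the batching rule and parameters are illustrative): stagger the servers' responses so that the data in flight at any instant fits in the bottleneck buffer.

```python
def response_schedule(servers: int, fragment_bytes: int,
                      buffer_bytes: int, link_bps: float):
    # Allow only as many concurrent responses as the buffer can absorb,
    # and start each batch after the previous one has drained.
    batch = max(1, buffer_bytes // fragment_bytes)
    drain_s = batch * fragment_bytes * 8 / link_bps
    return [(s, (s // batch) * drain_s) for s in range(servers)]

for server, start in response_schedule(8, 256 * 1024, 512 * 1024, 1e9):
    print(f"server {server}: start sending at t = {start * 1e3:.1f} ms")
```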

  21. Application level solution

  22. Application level solution

  23. ICTCP - Incast Congestion Control for TCP in Data Center Networks [8] • Features • Solution based on modifying the congestion window dynamically • Can be implemented on the receiver side only • Focuses on avoiding packet losses before incast congestion occurs • Test implementation on Windows NDIS • Novelties in the solution: • Using the available bandwidth to coordinate the receive-window increase across all incoming connections • Per-flow congestion control is performed independently in slotted time on the order of the RTT • The receive-window adjustment is based on the ratio of the difference between measured and expected throughput to the expected throughput

  24. Design considerations • The receiver knows how much throughput is achieved and how much bandwidth is available • An overly aggressive window control may constrain TCP performance, while too loose a control does not prevent incast congestion • Only low-latency flows (RTT less than 2ms) are considered • The receive-window increase is determined by the available bandwidth • The frequency of the receive-window-based congestion control should be set per flow • The receive-window-based scheme should adjust the window according to both link congestion and application requirements

  25. ICTCP Algorithm • Control trigger: available bandwidth • Calculate the available bandwidth • Estimate the potential throughput increase per flow before increasing its receive window • Time is divided into slots, each consisting of two sub-slots • For each network interface, measure the available bandwidth in the first sub-slot and spend the resulting quota on window increases in the second sub-slot • Ensure the total increase in receive windows stays within the available bandwidth measured in the first sub-slot
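
A condensed sketch of this control trigger, with the headroom factor written as a named constant (the exact fraction of interface capacity ICTCP reserves is a parameter of the paper's algorithm; 0.9 here is only illustrative):

```python
HEADROOM = 0.9   # fraction of interface capacity ICTCP may fill (illustrative)

def window_increase_quota(capacity_bps: float, incoming_bps: float) -> float:
    # incoming_bps is measured over the first sub-slot; the result is the
    # bandwidth budget that all receive-window increases in the second
    # sub-slot must stay within.
    return max(0.0, HEADROOM * capacity_bps - incoming_bps)

print(window_increase_quota(1e9, 7.5e8) / 1e6, "Mbps of quota this slot")
```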

  26. ICTCP Algorithm • Per-connection control interval: 2*RTT • To estimate the throughput of a TCP connection for receive-window adjustment, the shortest usable time scale is one RTT of that connection • The control interval for a TCP connection in ICTCP is therefore 2*RTT: • one RTT of latency for the adjusted window to take effect • one additional RTT for measuring the throughput achieved with the newly adjusted window • For any TCP connection, if the current time falls in the second global sub-slot and more than 2*RTT has elapsed since its last receive-window adjustment, the window may be increased based on the newly observed TCP throughput and the currently available bandwidth.

  27. ICTCP Algorithm • Window adjustment on a single connection • The receive window is adjusted based on the connection's measured incoming throughput • The measured throughput reflects the current requirement of the application over that TCP connection • The expected throughput is the throughput the connection would achieve if it were constrained only by the receive window • Define the ratio of throughput difference: d_b = (expected - measured) / expected • The receive window is adjusted according to the following conditions: • if d_b is small (below a low threshold), increase the receive window by one MSS, provided the current time is in the global second sub-slot and there is enough quota of available bandwidth on the network interface; decrease the quota correspondingly when the window is increased • if d_b is large (above a high threshold) for three continuous RTTs, decrease the receive window by one MSS; the minimal receive window is 2*MSS • otherwise, keep the current receive window
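
A per-connection sketch of this adjustment rule, assuming expected throughput = rwnd/RTT and writing the two thresholds as gamma1 and gamma2 (placeholder values; the paper fixes its own small constants):

```python
MSS = 1460  # bytes

def adjust_rwnd(rwnd: int, measured_bps: float, rtt_s: float,
                quota_bps: float, rtts_over: int,
                gamma1: float = 0.1, gamma2: float = 0.5):
    """Return (new_rwnd, remaining_quota, rtts_over_counter) for one flow."""
    expected_bps = rwnd * 8 / rtt_s                    # window-limited throughput
    d_b = max(0.0, (expected_bps - measured_bps) / expected_bps)
    increase_bps = MSS * 8 / rtt_s                     # cost of one more MSS

    if d_b <= gamma1 and quota_bps >= increase_bps:
        # Throughput is close to the window limit: grow by one MSS and
        # charge the increase against the interface-wide quota.
        return rwnd + MSS, quota_bps - increase_bps, 0
    if d_b > gamma2:
        # Window is far larger than needed; shrink only if this persists
        # for three consecutive RTTs, never going below 2*MSS.
        if rtts_over + 1 >= 3:
            return max(2 * MSS, rwnd - MSS), quota_bps, 0
        return rwnd, quota_bps, rtts_over + 1
    return rwnd, quota_bps, 0
```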

  28. ICTCP Algorithm • Fairness controller for multiple connections • Fairness is considered only for low-latency flows • Decrease windows for fairness only when BWA < 0.2C (the available bandwidth falls below 20% of link capacity) • For a window decrease, cut the receive window by one MSS for some selected TCP connections • Select those connections whose receive window is larger than the average window value of all connections • Window increases toward fairness are achieved automatically by the per-connection window adjustment
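
A sketch of the fairness step under the same assumptions (connection windows in bytes, a simple average as the fair-share reference):

```python
def fairness_cut(windows: dict[str, int], avail_bps: float,
                 capacity_bps: float, mss: int = 1460) -> dict[str, int]:
    # Throttle for fairness only when the link is nearly saturated.
    if avail_bps >= 0.2 * capacity_bps:
        return windows
    avg = sum(windows.values()) / len(windows)
    # Cut one MSS from connections holding more than the average window.
    return {cid: (max(2 * mss, w - mss) if w > avg else w)
            for cid, w in windows.items()}

print(fairness_cut({"a": 8 * 1460, "b": 2 * 1460}, avail_bps=1e8, capacity_bps=1e9))
```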

  29. ICTCP Experimental Results • Testbed • 47 servers • 1 LB4G 48-port Gigabit Ethernet switch • Broadcom Gigabit Ethernet NICs at the hosts • Windows Server 2008 R2 Enterprise, 64-bit

  30. Issues with ICTCP • ICTCP's scalability to a large number of TCP connections is an issue, because the receive window may need to drop below 1 MSS, degrading TCP performance • Extending ICTCP to handle congestion in general cases, where the sender and receiver are not under the same switch and the bottleneck link is not the last hop to the receiver, remains open • ICTCP for future high-bandwidth, low-latency networks

  31. DCTCP • Features • A TCP-like protocol for data centers • Uses ECN (Explicit Congestion Notification) to provide multi-bit feedback to the end hosts • The claim is that DCTCP provides better throughput than TCP while using 90% less buffer space • Provides high burst tolerance and low latency for short flows • Can also handle a 10X increase in foreground and background traffic without a significant performance hit

  32. DCTCP • Overview • Applications in data centers largely require • low latency for short flows • high burst tolerance • high utilization for long flows • Short flows have real-time deadlines of approximately 10-100ms • To avoid continuously modifying internal data structures, high utilization for long flows is essential • The study analyzed production traffic from approximately 6000 servers, roughly 150 TB of traffic, over a period of 1 month • Query traffic (2KB to 20KB) experiences the incast impairment

  33. DCTCP • Overview (contd.) • The proposed DCTCP uses the ECN capability available in most modern switches • Derives multi-bit feedback on congestion from the single-bit stream of ECN marks • The essence of the proposal is to keep switch buffer occupancies persistently low while maintaining high throughput • To control queue length at switches, it uses an Active Queue Management (AQM) approach with explicit feedback from congested switches • The claim is also that only about 30 LoC of changes to TCP and the setting of a single parameter on switches are needed • DCTCP targets 3 problems • Incast (the focus here) • Queue buildup • Buffer pressure

  34. DCTCP • Algorithm • Concentrates on the extent of congestion rather than just its presence • Derives multi-bit feedback from the single-bit sequence of marks • Three components of the algorithm • Simple marking at the switch • ECN-Echo at the receiver • Controller at the sender

  35. DCTCP • Simple marking at the switch • An arriving packet is marked with the CE (Congestion Experienced) codepoint if the queue occupancy is greater than K (the marking threshold) • Marking is based on the instantaneous queue length, not an average • ECN-Echo at the receiver: • Normally, TCP sets ECN-Echo on all packets until the receiver gets a CWR from the sender • A DCTCP receiver sets ECN-Echo only on packets that carry the CE codepoint

  36. DCTCP • Controller at the sender: • The sender maintains an estimate α of the fraction of packets that are marked, updated once per window of data: α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last window and g is a fixed weight • α close to 0 indicates low congestion; α close to 1 indicates high congestion • While TCP cuts its window in half, DCTCP uses α to scale the sender's window cut • cwnd ← cwnd × (1 − α/2)
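
A minimal sketch of the sender-side controller; the EWMA weight g below is illustrative (the paper treats it as a configuration parameter):

```python
class DctcpSender:
    def __init__(self, cwnd_bytes: float, g: float = 1 / 16):
        self.cwnd = cwnd_bytes
        self.alpha = 0.0          # running estimate of the marked fraction
        self.g = g                # EWMA weight for the alpha update

    def on_window_end(self, pkts_sent: int, pkts_marked: int) -> None:
        # Once per window of data: fold the observed marked fraction F into
        # alpha, then cut cwnd in proportion to the extent of congestion.
        F = pkts_marked / max(1, pkts_sent)
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if pkts_marked > 0:
            self.cwnd *= 1 - self.alpha / 2

s = DctcpSender(cwnd_bytes=64 * 1460)
s.on_window_end(pkts_sent=64, pkts_marked=16)   # 25% of packets carried CE marks
print(round(s.alpha, 4), int(s.cwnd))            # small, proportional window cut
```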

  37. DCTCP • Modeling of α at the point where the window reaches the critical size W* (the window at which queue occupancy crosses the marking threshold K) • The maximum queue size Qmax depends on the number of synchronously sending servers N • A lower bound for K can be derived from this analysis; the paper's guideline is K > C × RTT / 7, with the link capacity C expressed in packets per second
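
Evaluating that guideline for typical datacenter numbers (the packet size and RTTs below are assumptions, not from the slide):

```python
def min_marking_threshold_pkts(capacity_bps: float, rtt_s: float,
                               pkt_bytes: int = 1500) -> float:
    # K > C * RTT / 7, with the capacity C expressed in packets per second.
    pkts_per_s = capacity_bps / (8 * pkt_bytes)
    return pkts_per_s * rtt_s / 7

print(min_marking_threshold_pkts(1e9, 300e-6))    # ~3.6 packets at 1 Gbps, 300 us RTT
print(min_marking_threshold_pkts(10e9, 300e-6))   # ~36 packets at 10 Gbps
```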

  38. DCTCP • How does DCTCP solve incast? • TCP suffers from timeouts when N > 10 • DCTCP senders receive ECN marks early and slow their rate • DCTCP still suffers timeouts when N is large enough to overwhelm the static buffer size • The remedy is dynamic buffering

  39. Outline • TCP Incast - Problem Description • Motivation and challenges • Proposed Solutions • Evaluation of proposed solutions • Conclusion • References

  40. Evaluation of proposed solutions • Application-level solution • Genuine retransmissions → cascading timeouts → congestion • Scheduling at the application level cannot be easily synchronized • Limited control over the transport layer • ICTCP - a solution that needs minimal change and is cost effective • ICTCP's scalability to a large number of TCP connections is an issue • Extending ICTCP to handle congestion in general cases offers only a limited solution • ICTCP for future high-bandwidth, low-latency networks will need extra support from link-layer technologies • DCTCP - a solution that needs minimal change but requires switch support • DCTCP requires dynamic buffering for larger numbers of senders

  41. Conclusion • No solution completely solves the problem other than configuring a smaller RTO • The solutions give little attention to foreground and background traffic together • We need solutions that are cost effective, require minimal changes to the environment, and, of course, solve incast!

  42. References
  [1] Y. Chen, R. Griffith, J. Liu, R. H. Katz, and A. D. Joseph, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," in Proc. of ACM WREN, 2009.
  [2] S. Kulkarni and P. Agrawal, "A Probabilistic Approach to Address TCP Incast in Data Center Networks," Distributed Computing Systems Workshops (ICDCSW), 2011.
  [3] Peng Zhang, Hongbo Wang, and Shiduan Cheng, "Shrinking MTU to Mitigate TCP Incast Throughput Collapse in Data Center Networks," Communications and Mobile Computing (CMC), 2011.
  [4] Yan Zhang and N. Ansari, "On Mitigating TCP Incast in Data Center Networks," IEEE INFOCOM Proceedings, 2011.
  [5] Maxim Podlesny and Carey Williamson, "An Application-Level Solution for the TCP-Incast Problem in Data Center Networks," IWQoS '11: Proceedings of the 19th International Workshop on Quality of Service, IEEE, June 2011.
  [6] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan, "Data Center TCP (DCTCP)," SIGCOMM '10: Proceedings of ACM SIGCOMM, August 2010.
  [7] Hongyun Zheng, Changjia Chen, and Chunming Qiao, "Understanding the Impact of Removing TCP Binary Exponential Backoff in Data Centers," Communications and Mobile Computing (CMC), 2011.
  [8] Haitao Wu, Zhenqian Feng, Chuanxiong Guo, and Yongguang Zhang, "ICTCP: Incast Congestion Control for TCP in Data Center Networks," Co-NEXT '10: Proceedings of the 6th International Conference, ACM, November 2010.
