Solving the TCP-incast Problem with Application-Level Scheduling
Maxim Podlesny, University of Waterloo
Carey Williamson, University of Calgary
Motivation
• Emerging IT paradigms
  • Data centers, grid computing, HPC, multi-core
  • Cluster-based storage systems, SAN, NAS
  • Large-scale data management “in the cloud”
  • Data manipulation via “services-oriented computing”
• Cost and efficiency advantages from IT trends, economies of scale, specialization marketplace
• Performance advantages from parallelism
  • Partition/aggregation, MapReduce, BigTable, Hadoop
  • Think RAID at Internet scale! (1000x)
Problem Statement
• High-speed, low-latency network (RTT ≤ 0.1 ms)
• Highly-multiplexed link (e.g., 1000 flows)
• Highly-synchronized flows on bottleneck link
• Limited switch buffer size (e.g., 32 KB)
These conditions trigger TCP retransmission timeouts and TCP throughput degradation.
How to provide high goodput for data center applications?
Related Work
• E. Krevat et al., “On Application-based Approaches to Avoiding TCP Throughput Collapse in Cluster-based Storage Systems”, Proc. SuperComputing 2007
• A. Phanishayee et al., “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”, Proc. FAST 2008
• Y. Chen et al., “Understanding TCP Incast Throughput Collapse in Datacenter Networks”, Proc. WREN 2009
• V. Vasudevan et al., “Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication”, Proc. ACM SIGCOMM 2009
• M. Alizadeh et al., “Data Center TCP”, Proc. ACM SIGCOMM 2010
• A. Shpiner et al., “A Switch-based Approach to Throughput Collapse and Starvation in Data Centers”, Proc. IWQoS 2010
Summary of Related Work
• Data centers have specific network characteristics
• The TCP-incast throughput collapse problem emerges
• Possible solutions:
  • Tweak TCP timers and/or parameters for this environment
  • Redesign (or replace!) TCP in this environment
  • Rewrite applications for this environment (Facebook)
  • Increase switch buffer sizes (extra queueing delay!)
  • Smart edge coordination for uploads/downloads
Data Center System Model
[Figure: N servers connected through a switch with a small buffer B to a single client over a link of capacity C]
• Logical data block (S) (e.g., 1 MB)
• Server Request Unit (SRU) (e.g., 32 KB)
• Packet size S_DATA
Performance Comparisons
• Internet vs. data center network:
  • Internet propagation delay: 10-100 ms
  • Data center propagation delay: 0.1 ms
• Packet size 1 KB, link capacity 1 Gbps -> packet transmission time of about 0.01 ms
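As a sanity check, the quoted figure is just packet size divided by link capacity:

```latex
t_{tx} = \frac{S_{DATA}}{C}
       = \frac{1\,\mathrm{KB} \times 8\,\mathrm{bits/byte}}{10^{9}\,\mathrm{bits/s}}
       = 8\,\mu\mathrm{s} \approx 0.01\,\mathrm{ms}
```

So in a data center, per-packet transmission time is within an order of magnitude of the propagation delay itself, which is far from true in the wide-area Internet.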
Analysis Overview (1 of 2)
• Determine maximum TCP flow concurrency (n) that can be supported without any packet loss
• Arrange the servers into k groups of (at most) n servers each, staggering the group scheduling
Analysis Overview (2 of 2)
• Determine maximum TCP flow concurrency (n) that can be supported without any packet loss:
  • Determine flow size in packets (based on SRU and MSS)
  • Determine maximum outstanding packets per flow (Wmax)
  • Determine maximum flow concurrency (based on B and Wmax)
• Arrange the servers into k groups of (at most) n servers each, staggering the group scheduling
Determining Wmax
• Recall TCP slow start dynamics:
  • Initial TCP congestion window (cwnd) is 1 packet
  • ACKs cause cwnd to double every RTT (1, 2, 4, 8, 16, …)
• Consider a TCP transfer of an arbitrary SRU (e.g., 21 packets)
  • Determine peak power-of-2 cwnd value (WA)
  • Determine the “residual window” for the last RTT (WB)
  • Wmax depends on both WA and WB (e.g., WA + WB/2)
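A minimal Python sketch of this window estimate, assuming the doubling dynamics above; the combining rule Wmax = WA + WB/2 is taken from the example on this slide and may not be the paper's exact formula:

```python
def compute_wmax(flow_size_pkts):
    """Estimate the peak number of outstanding packets (Wmax) for a
    TCP flow of flow_size_pkts packets during slow start, where cwnd
    doubles every RTT (1, 2, 4, 8, ...)."""
    sent, cwnd, wa = 0, 1, 0
    while sent + cwnd <= flow_size_pkts:
        sent += cwnd
        wa = cwnd                  # WA: peak full power-of-2 window
        cwnd *= 2
    wb = flow_size_pkts - sent     # WB: residual window in last RTT
    return wa + wb / 2             # combining rule from the slide

# Example from the slide: a 21-packet SRU sends full windows
# 1, 2, 4, 8 (15 packets), leaving WB = 6, so Wmax = 8 + 3 = 11
print(compute_wmax(21))            # -> 11.0
```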
Scheduling Overview
[Figure: the N servers partitioned into successive groups of n, with each group's responses scheduled in turn]
Scheduling Details
• Lossless scheduling of server responses:
  • at most n servers responding simultaneously
  • k groups of responding servers, scheduled in sequence
• Server i (1 ≤ i ≤ N) starts responding at: t_i = ⌊(i − 1)/n⌋ · T̄, where T̄ is the SRU completion time used for scheduling
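A minimal Python sketch of the grouping and start-time computation described above, under one stated assumption: lossless operation requires the n concurrent peak windows to fit in the switch buffer, i.e., n · Wmax · S_DATA ≤ B. Function and parameter names are illustrative.

```python
import math

def schedule_servers(N, B, s_data, wmax, t_sched):
    """Split N servers into k groups of at most n, and stagger group
    start times by the scheduled SRU completion time t_sched (T-bar).

    Assumption: no loss if the n peak windows fit in the switch
    buffer: n * wmax * s_data <= B.
    """
    n = max(1, int(B // (wmax * s_data)))  # max lossless concurrency
    k = math.ceil(N / n)                   # number of server groups
    # Server i (1 <= i <= N) is in group floor((i-1)/n); each group
    # starts one scheduled SRU completion time after the previous one.
    start = {i: ((i - 1) // n) * t_sched for i in range(1, N + 1)}
    return n, k, start

# Example: 64 servers, 32 KB buffer, 1 KB packets, Wmax = 11 packets,
# scheduled SRU completion time of 0.3 ms
n, k, start = schedule_servers(64, 32 * 1024, 1024, 11, 0.3e-3)
print(n, k, start[1], start[64])   # -> 2 32 0.0 ~0.0093
```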
Theoretical Results
• Maximum goodput of an application in a data center with lossless scheduling is:
  G = S / ((k − 1)·T̄ + T + dmax)
• where:
  • S - size of a logical data block
  • T - actual completion time of an SRU
  • T̄ - SRU completion time used for scheduling
  • k - number of server groups
  • dmax - real system scheduling variance
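A quick numeric check of this expression: the k groups start at 0, T̄, …, (k − 1)·T̄, and the last group needs T plus the variance dmax to finish. The parameter values below (1 MB block as 32 SRUs, k = 16 groups of n = 2, T̄ = T ≈ 0.52 ms at 1 Gbps, dmax = 0.05 ms) are assumptions for illustration only:

```python
def max_goodput(s_bits, k, t_sched, t_actual, d_max):
    """Goodput of a lossless k-group schedule: completion time is
    (k - 1) * t_sched for the stagger, plus t_actual and d_max for
    the last group to finish."""
    return s_bits / ((k - 1) * t_sched + t_actual + d_max)

# 1 MB block, 16 groups, 0.524 ms scheduled/actual group SRU time,
# 0.05 ms variance -> roughly 0.99 Gbps, close to link capacity
print(max_goodput(8 * 2**20, 16, 0.524e-3, 0.524e-3, 0.05e-3))
```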
Analytical Model Results
[Figures: model goodput results omitted]
Summary and Conclusion
• Application-level scheduling for the TCP-incast throughput collapse problem
• Main idea: schedule server responses so that no packets are lost
• Analytical model of the maximum goodput achievable with lossless scheduling
• Goodput is non-monotonic and highly sensitive to network configuration parameters
Future Work • Implementing and testing our solution in real data centers • Evaluating our solution for different application traffic scenarios