vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload

Presentation Transcript


  1. vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload • Ardalan Kangarlou, Sahan Gamage, Ramana Kompella, Dongyan Xu • Department of Computer Science, Purdue University

  2. Cloud Computing and HPC

  3. Background and Motivation • Virtualization: A key enabler of cloud computing • Amazon EC2, Eucalyptus • Increasingly adopted in other real systems: • High performance computing • NERSC’s Magellan system • Grid/cyberinfrastructure computing • In-VIGO, Nimbus, Virtuoso

  4. VM Consolidation: A Common Practice • Multiple VMs hosted by one physical host • Multiple VMs sharing the same core • Flexibility, scalability, and economy • [Figure: VM 1 through VM 4 running on the virtualization layer atop the hardware] • Key Observation: VM consolidation negatively impacts network performance!

  5. Investigating the Problem • [Figure: experiment setup with a client/sender connected to a server hosting VM 1, VM 2, and VM 3 on top of the virtualization layer and hardware]

  6. Q1: How does CPU Sharing Affect RTT? • [Figure: RTT (ms) vs. number of VMs sharing a core (2 to 5), with reference RTTs for US East to West, US East to Europe, and US West to Australia] • RTT increases in proportion to the VM scheduling slice (30ms)

  7. Q2: What is the Cause of the RTT Increase? • VM scheduling latency dominates virtualization overhead! • [Figure: three VMs sharing a core with per-VM buffers in the driver domain (dom0); CDF comparing dom0 processing time with packet wait time in the buffer, showing an RTT increase of about 30ms]
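
A back-of-the-envelope model (not from the slides; it assumes round-robin scheduling with a fixed slice) of why the RTT increase tracks the 30ms slice: with n VMs sharing a core, a packet arriving for a descheduled VM sits in its dom0 buffer until that VM runs again, so roughly

    RTT_{VM} \approx RTT_{net} + t_{dom0} + t_{wait}, \qquad 0 \le t_{wait} \le (n - 1)\, S_{slice}

With S_{slice} = 30ms (Xen's default credit-scheduler time slice), the wait term quickly dwarfs both the network RTT and dom0 processing, which matches the observed growth of RTT with the number of consolidated VMs.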

  8. Q3: What is the Impact on TCP Throughput? • [Figure: TCP throughput measured at dom0 vs. at the VM] • Connection to the VM is much slower than to dom0!

  9. Our Solution: vSnoop • Alleviates the negative effect of VM scheduling on TCP throughput • Implemented within the driver domain to accelerate TCP connections • Does not require any modifications to the VM • Does not violate end-to-end TCP semantics • Applicable across a wide range of VMMs • Xen, VMware, KVM, etc.

  10. TCP Connection to a VM • Sender establishes a TCP connection to VM1 • The SYN is buffered in the driver domain until VM1 is next scheduled, so the SYN,ACK (and every later acknowledgement) is delayed by the VM scheduling latency, inflating the connection's RTT • [Figure: timeline of the handshake; the SYN waits in VM1's buffer while VM2 and VM3 run]

  11. Key Idea: Acknowledgement Offload • With vSnoop in the driver domain, in-order packets are acknowledged on VM1's behalf as soon as they are placed in the shared buffer, rather than after VM1 is scheduled • Faster progress during TCP slow start • [Figure: timeline of the same connection with vSnoop; acknowledgements return to the sender without waiting for VM1 to run]

  12. vSnoop’s Impact on TCP Flows • TCP Slow Start • Early acknowledgements help connections progress faster • Most significant benefit for short transfers, which are prevalent in data centers [Kandula IMC’09], [Benson WREN’09] (an illustrative calculation follows below) • TCP congestion avoidance and fast retransmit • Large flows in the steady state can also benefit from vSnoop • Benefit is not as large as during Slow Start
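
An illustrative slow-start calculation (not from the slides; it assumes an initial congestion window of 3 segments of roughly 1.4KB each, doubling every round trip): sending 100KB takes about 5 rounds, since

    3 + 6 + 12 + 24 + 48 = 93 \ \text{segments} \approx 134\,\mathrm{KB} \ge 100\,\mathrm{KB}

If every round trip is inflated by a ~30ms scheduling delay, the transfer needs on the order of 150ms; if early acknowledgements keep each round near the sub-millisecond LAN RTT, it needs only a few milliseconds. That order-of-magnitude gap is consistent with the roughly 30x median throughput improvement reported later for 100KB transfers.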

  13. Challenges • Challenge 1: Out-of-order and special packets (SYN, FIN) • Solution: Let the VM handle these packets • Challenge 2: Packet loss after vSnoop • Solution: vSnoop acknowledges a packet only if there is room for it in the buffer • Challenge 3: ACKs generated by the VM • Solution: Suppress or rewrite VM ACKs for data already acknowledged by vSnoop • Challenge 4: Keeping the sender from overrunning the buffer (and vSnoop online) • Solution: Throttle the advertised receive window according to the available buffer space • (A sketch of this per-packet logic follows below)
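
The following is a minimal sketch, in C, of the per-packet decision these challenges imply: acknowledge early only when the packet is in order and fits in the buffer, clamp the advertised window to the remaining buffer space, and suppress VM-generated ACKs for data vSnoop already acknowledged. All names are hypothetical; this is not the actual vSnoop/Xen code.

    /* Sketch of vSnoop-style early acknowledgement logic (hypothetical names).
     * One instance of this state is kept per TCP flow destined to a VM.      */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct flow_state {
        uint32_t next_seq;      /* next in-order sequence number expected    */
        uint32_t acked_by_us;   /* highest sequence number ACKed early       */
        size_t   buf_space;     /* free space in the VM's shared buffer      */
    };

    /* Decide whether to acknowledge a data packet on the VM's behalf.
     * Early ACKs are sent only when the packet fits in the buffer, so an
     * acknowledged packet can never be dropped before the VM receives it.   */
    static bool early_ack(struct flow_state *f, uint32_t seq, size_t len,
                          uint32_t *ack_seq, uint16_t *adv_window)
    {
        if (seq != f->next_seq)      /* out of order: let the VM handle it   */
            return false;
        if (len > f->buf_space)      /* no room: do not acknowledge          */
            return false;

        f->buf_space -= len;         /* packet is copied into the buffer     */
        f->next_seq  += len;
        f->acked_by_us = f->next_seq;

        *ack_seq = f->next_seq;
        /* Challenge 4: never advertise more window than the buffer space
         * actually left, so the sender cannot overrun the shared buffer.    */
        *adv_window = (uint16_t)(f->buf_space > UINT16_MAX ? UINT16_MAX
                                                           : f->buf_space);
        return true;
    }

    /* Challenge 3: when the VM later emits its own ACK for data that was
     * already acknowledged early, drop (or rewrite) it instead of forwarding
     * a duplicate to the sender.                                             */
    static bool suppress_vm_ack(const struct flow_state *f, uint32_t vm_ack_seq)
    {
        return vm_ack_seq <= f->acked_by_us;
    }

    int main(void)
    {
        struct flow_state f = { .next_seq = 1000, .acked_by_us = 0,
                                .buf_space = 4096 };
        uint32_t ack; uint16_t win;

        if (early_ack(&f, 1000, 1448, &ack, &win))
            printf("early ACK %u, advertised window %u\n",
                   (unsigned)ack, (unsigned)win);
        printf("suppress VM ACK 2448? %s\n",
               suppress_vm_ack(&f, 2448) ? "yes" : "no");
        return 0;
    }

Because an early acknowledgement is generated only after the packet has been placed in the buffer, an acknowledged packet cannot be lost between the driver domain and the VM, which is how the end-to-end TCP semantics claimed on the earlier slide are preserved.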

  14. State Machine Maintained Per-Flow • Early acknowledgements for in-order packets • Active (online): an in-order packet with buffer space available is acknowledged early; the flow stays Active • No-buffer: an in-order packet arrives but no buffer space is available; don't acknowledge • Unexpected sequence (offline): an out-of-order packet arrives; don't acknowledge, pass it to the VM • [Figure: state diagram with transitions among the Active, No-buffer, and Unexpected-sequence states] • (A code sketch of these transitions follows below)
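
A compact sketch of the three-state machine above (hypothetical names, not the actual vSnoop code; details such as resynchronizing the expected sequence number once a flow goes offline are omitted):

    /* Per-flow vSnoop state machine sketch: ACTIVE is the online state in
     * which in-order packets are acknowledged early; NO_BUFFER and
     * UNEXPECTED_SEQ are offline states in which packets are simply passed
     * to the VM without an early acknowledgement.                           */
    #include <stdbool.h>
    #include <stdio.h>

    enum vsnoop_state { ACTIVE, NO_BUFFER, UNEXPECTED_SEQ };

    struct flow {
        enum vsnoop_state state;
        unsigned int      next_seq;  /* next expected in-order sequence no. */
    };

    /* Returns true if vSnoop should acknowledge this packet early.
     * in_order:  the packet starts at flow->next_seq
     * has_space: the VM's shared buffer can hold the packet                 */
    static bool on_packet(struct flow *f, bool in_order, bool has_space)
    {
        if (!in_order) {
            f->state = UNEXPECTED_SEQ;  /* offline: pass to VM, no ACK      */
            return false;
        }
        if (!has_space) {
            f->state = NO_BUFFER;       /* offline: in order but no buffer  */
            return false;
        }
        f->state = ACTIVE;              /* (re-)enter online state and ACK  */
        return true;
    }

    int main(void)
    {
        struct flow f = { .state = ACTIVE, .next_seq = 1 };
        printf("in-order, space:    ack=%d\n", on_packet(&f, true,  true));
        printf("in-order, no space: ack=%d\n", on_packet(&f, true,  false));
        printf("out-of-order:       ack=%d\n", on_packet(&f, false, true));
        return 0;
    }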

  15. vSnoop Implementation in Xen • vSnoop resides in the driver domain (dom0), between the bridge and each VM's netback, and is combined with netfront buffer tuning • [Figure: VM1, VM2, and VM3 each with a netfront driver and a per-VM buffer; netback instances, the bridge, and vSnoop inside dom0]

  16. Evaluation • Overheads of vSnoop • TCP throughput speedup • Application speedup • Multi-tier web service (RUBiS) • MPI benchmarks (Intel, High-Performance Linpack)

  17. Evaluation – Setup • VM hosts • 3.06GHz Intel Xeon CPUs, 4GB RAM • Only one core/CPU enabled • Xen 3.3 with Linux 2.6.18 for the driver domain (dom0) and the guest VMs • Client machine • 2.4GHz Intel Core 2 Quad CPU, 2GB RAM • Linux 2.6.19 • Gigabit Ethernet switch

  18. vSnoop Overhead • Profiling per-packet vSnoop overhead using Xenoprof [Menon VEE’05] • Per-packet CPU overhead measured for vSnoop routines in dom0 • Minimal aggregate CPU overhead

  19. TCP Throughput Improvement • 3 VMs consolidated, 1000 transfers of a 100KB file • Configurations: vanilla Xen, Xen+tuning, Xen+tuning+vSnoop • Median throughput: 0.192 MB/s (vanilla Xen), 0.778 MB/s (Xen+tuning), 6.003 MB/s (Xen+tuning+vSnoop), a 30x improvement • [Figure: distribution of per-transfer TCP throughput for the three configurations]

  20. TCP Throughput: 1 VM/Core • [Figure: normalized TCP throughput for Xen, Xen+tuning, and Xen+tuning+vSnoop, for transfer sizes of 50KB, 100KB, 250KB, 500KB, 1MB, 10MB, and 100MB]

  21. TCP Throughput: 2 VMs/Core • [Figure: normalized TCP throughput for Xen, Xen+tuning, and Xen+tuning+vSnoop, for transfer sizes of 50KB, 100KB, 250KB, 500KB, 1MB, 10MB, and 100MB]

  22. TCP Throughput: 3 VMs/Core • [Figure: normalized TCP throughput for Xen, Xen+tuning, and Xen+tuning+vSnoop, for transfer sizes of 50KB, 100KB, 250KB, 500KB, 1MB, 10MB, and 100MB]

  23. TCP Throughput: 5 VMs/Core • [Figure: normalized TCP throughput for Xen, Xen+tuning, and Xen+tuning+vSnoop, for transfer sizes of 50KB, 100KB, 250KB, 500KB, 1MB, 10MB, and 100MB] • vSnoop’s benefit rises with higher VM consolidation

  24. TCP Throughput: Other Setup Parameters • CPU load for VMs • Number of TCP connections to the VM • Driver domain on a separate core • Sender being a VM • vSnoop consistently achieves significant TCP throughput improvement

  25. Application-Level Performance: RUBiS • [Figure: RUBiS client threads on a client machine drive Apache on Server1 and MySQL on Server2; each server hosts two guest VMs (dom1, dom2) with vSnoop in dom0]

  26. RUBiS Results

  27. Application-Level Performance: MPI Benchmarks • Intel MPI Benchmark: network intensive • High-Performance Linpack: CPU intensive • [Figure: four servers (Server1 through Server4), each hosting two guest VMs (dom1, dom2) with vSnoop in dom0; one VM per server runs an MPI node]

  28. Intel MPI Benchmark Results: Broadcast • [Figure: normalized execution time for Xen, Xen+tuning, and Xen+tuning+vSnoop, for message sizes of 64KB, 128KB, 256KB, 512KB, 1MB, 2MB, 4MB, and 8MB] • 40% improvement

  29. Intel MPI Benchmark Results: All-to-All • [Figure: normalized execution time for Xen, Xen+tuning, and Xen+tuning+vSnoop, for message sizes of 64KB, 128KB, 256KB, 512KB, 1MB, 2MB, 4MB, and 8MB]

  30. HPL Benchmark Results • [Figure: Gflops for Xen and Xen+tuning+vSnoop across problem size and block size combinations (N,NB) from (4K,2) to (8K,16); up to 40% improvement]

  31. Related Work • Optimizing virtualized I/O path • Menon et al. [USENIX ATC’06,’08; ASPLOS’09] • Improving intra-host VM communications • XenSocket [Middleware’07], XenLoop [HPDC’08], Fido [USENIX ATC’09], XWAY [VEE’08], IVC [SC’07] • I/O-aware VM scheduling • Govindan et al. [VEE’07], DVT [SoCC’10]

  32. Conclusions • Problem: VM consolidation degrades TCP throughput • Solution: vSnoop • Leverages acknowledgment offloading • Does not violate end-to-end TCP semantics • Is transparent to applications and OS in VMs • Is generically applicable to many VMMs • Results: • 30x improvement in median TCP throughput • About 30% improvement in RUBiS benchmark • 40-50% reduction in execution time for Intel MPI benchmark

  33. Thank you. For more information: http://friends.cs.purdue.edu/dokuwiki/doku.php?id=vsnoop Or Google “vSnoop Purdue”

  34. TCP Benchmarks cont. • Testing different scenarios: • a) 10 concurrent connections • b) Sender also subject to VM scheduling • c) Driver domain on a separate core • [Figure: throughput results for scenarios a, b, and c]

  35. TCP Benchmarks cont. • Varying CPU load for 3 consolidated VMs • [Figure: throughput results at 40%, 60%, and 80% CPU load]
