
Towards Predictable Datacenter Networks

This presentation covers work on making datacenter network performance predictable, since variable network performance leads to unstable application performance. It introduces two virtual network abstractions, the virtual cluster and the virtual oversubscribed cluster, together with Oktopus, a system that implements them to guarantee application performance, reduce tenant costs, and boost provider revenue. The allocation algorithms for both abstractions are detailed, along with the rate-limiting mechanism that enforces them. Design considerations covering network management, routing, and failure resilience are also discussed. Lastly, a simulation-based evaluation compares the abstractions with a purely VM-based allocation baseline.

ajonson

Presentation Transcript


  1. Towards Predictable Datacenter Networks • Hitesh Ballani, Paolo Costa, Thomas Karagiannis, Ant Rowstron • SIGCOMM 2011 • Presenter: Lili Sun • 2020/1/5

  2. Outline • Motivation and Goals • Virtual Network Abstractions • Oktopus • Evaluation • Conclusion • Discussion Clues

  3. Background • Datacenter • Cloud datacenter • Production datacenter • Interface • Computing resources • Storage resources • (Figure: the interface between provider and tenant, covering computing and storage resources in cloud and production datacenters; a virtual network of VMs runs on top of the physical network)

  4. Motivation and Goals • Motivation: Network performance variability • Cloud datacenter (system load and VM placement) • Production datacenter (variable network bandwidth) • Challenges • Unstable application performance • Unpredictable tenant costs • Loss of provider revenue • Goals • Guaranteed application performance • Lower tenant costs • Higher provider revenue

  5. Virtual Network Abstractions • Virtual network abstractions • Virtual cluster (VC) • Virtual oversubscribed cluster (VOC) • Design goals • Tenant suitability: an intuitive way for tenants to reason about network performance • Provider flexibility: providers can multiplex many virtual networks on their physical network

  6. Virtual cluster • Tenant request: <N,B> (N VMs, each with bandwidth B) • All-to-all traffic patterns • Suitable for data-intensive applications

  7. Virtual oversubscribed cluster • Tenant request: <N,B,S,O> (N VMs in groups of size S, per-VM bandwidth B, oversubscription factor O) • Local communication patterns • Suitable for applications whose communication is mostly local to a group

  8. Oktopus • Tenants can opt for • Virtual cluster • Virtual oversubscribed cluster • No virtual network • Two main components • Management plane (requests and accounts for network resources and maintains bandwidth reservations) • Data plane (enforces the available bandwidth) • Network manager • Meets the tenants' bandwidth demands • Maximizes the number of tenants accommodated

  9. Cluster Allocation • A virtual cluster request r: <N,B> • Topology: tree-like physical network • Bandwidth required on link L • (Figure: example tree topology with link labels of 200 Mbps and 100 Mbps)
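
The formula for the bandwidth required on link L did not survive in the transcript. As a hedged illustration, the reasoning behind a virtual cluster can be sketched as follows: if a link separates m of the tenant's N VMs from the other N − m, and every VM sends or receives at most B, then the traffic crossing that link cannot exceed min(m, N − m) · B. The function name and the example numbers are hypothetical and are not taken from the slide's figure.

```python
def vc_link_bandwidth(n_vms: int, vms_below: int, per_vm_bw: float) -> float:
    """Bandwidth to reserve on a link for a virtual cluster request <N, B>.

    Assumption: the link splits the tenant's VMs into `vms_below` on one side
    and `n_vms - vms_below` on the other, and each VM is capped at `per_vm_bw`.
    The smaller side bounds the traffic that can cross the link.
    """
    if not 0 <= vms_below <= n_vms:
        raise ValueError("vms_below must be between 0 and n_vms")
    return min(vms_below, n_vms - vms_below) * per_vm_bw


# Illustrative request: N = 6 VMs at B = 100 Mbps each.
print(vc_link_bandwidth(6, 1, 100))  # one VM below the link  -> 100
print(vc_link_bandwidth(6, 2, 100))  # two VMs below the link -> 200
print(vc_link_bandwidth(6, 6, 100))  # all VMs below the link -> 0
```

Note that placing all N VMs under one link needs no reservation on that link, which is why packing a request into a small sub-tree is attractive.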

  10. Allocation Algorithm • Allocate VMs to a sub-tree (a machine, a rack, a pod) • Constraints: the number of empty VM slots in the sub-tree and the residual bandwidth on the physical link • For a machine • Among sub-trees at the same level • Choose the sub-tree with the least residual bandwidth • Across different levels • Start from the lowest level • Level order: physical machine < rack < pod • Goals • Keep more outbound bandwidth available • Accommodate more future tenants
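
The search described on this slide can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, its field names, and the helper functions are hypothetical; the link demand uses the min(m, N − m) · B bound for a virtual cluster; and the code only picks the sub-tree that would host the request, without assigning VMs to individual machines.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    residual_bw: float                 # residual bandwidth on the uplink to the parent
    empty_slots: int = 0               # free VM slots (only used for machines, i.e. leaves)
    children: list[Node] = field(default_factory=list)


def link_demand(m: int, n: int, b: float) -> float:
    # Bandwidth needed on a link that has m of the request's n VMs below it.
    return min(m, n - m) * b


def feasible_counts(node: Node, n: int, b: float) -> set[int]:
    """VM counts that can be placed under `node` without overloading any link."""
    if not node.children:
        candidates = set(range(min(node.empty_slots, n) + 1))
    else:
        candidates = {0}
        for child in node.children:
            child_counts = feasible_counts(child, n, b)
            candidates = {x + y for x in candidates for y in child_counts if x + y <= n}
    return {m for m in candidates if link_demand(m, n, b) <= node.residual_bw}


def height(node: Node) -> int:
    return 0 if not node.children else 1 + max(height(c) for c in node.children)


def subtrees(node: Node):
    yield node
    for child in node.children:
        yield from subtrees(child)


def pick_subtree(root: Node, n: int, b: float) -> Node | None:
    """Lowest-level sub-tree that fits all n VMs; ties go to least residual bandwidth."""
    fits = [v for v in subtrees(root) if n in feasible_counts(v, n, b)]
    if not fits:
        return None
    return min(fits, key=lambda v: (height(v), v.residual_bw))


def machine(name: str) -> Node:
    return Node(name, residual_bw=100, empty_slots=2)


rack_a = Node("rackA", residual_bw=200, children=[machine("a1"), machine("a2")])
rack_b = Node("rackB", residual_bw=150, children=[machine("b1"), machine("b2")])
root = Node("root", residual_bw=float("inf"), children=[rack_a, rack_b])

chosen = pick_subtree(root, n=3, b=100)
print(chosen.name if chosen else "rejected")  # rackB
```

In this toy topology no single machine has three slots, so the request moves up one level; both racks fit, and the one with less residual uplink bandwidth is chosen, following the slide's same-level rule.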

  11. Oversubscribed Cluster Allocation • An oversubscribed cluster request r: <N,B,S,O> • The total bandwidth required by group i on link L: • The bandwidth to be reserved on link L for request r is the sum across all the groups

  12. Allocation Algorithm • Each individual group behaves like a virtual cluster • Reuse the cluster allocation algorithm • Conditional bandwidth needed for the jth group of request r on link L: • The bandwidth required by groups [1,…,i] on L: • Allocate VMs to sub-tree v:

  13. Enforcing Virtual Network • Rate limiting mechanism • Traditional approach: bandwidth reservation at switches • Oktopus: endhost-based rate enforcement • Design • Enforcement module: measures the traffic rate to other VMs and sends it to the controller VM • Controller VM: calculates the max-min fair share and returns the rates • Enforcement module: uses per-destination-VM rate limiters to enforce them • Advantages • Calculating rates at one controller VM per tenant reduces control traffic • Enforcement modules enable distributed rate limiting • Tenant-specific computation reduces the scale of the problem: rates are computed separately for each virtual network • (Figure: enforcement modules EM1 … EMi next to VM1 … VMi measure traffic rates and send them to the controller VM, which calculates max-min fair shares and returns them to the per-destination-VM limiters; labels: enough BW, minimal rate, maximal rate, fair BW)
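
The max-min fair share that the controller VM calculates can be illustrated with the standard water-filling procedure. The sketch below is a simplification under stated assumptions: it considers a single bottleneck (for example, several senders sharing one destination VM's bandwidth B), whereas the controller described on the slide works across a whole virtual network. The function name and the example demands are hypothetical.

```python
def max_min_fair(demands: dict[str, float], capacity: float) -> dict[str, float]:
    """Max-min fair allocation of `capacity` among flows with the given demands.

    Flows are served in order of increasing demand. A flow whose demand is at
    most the current equal share gets its full demand; the leftover is then
    split equally among the flows that are still unsatisfied.
    """
    allocation: dict[str, float] = {}
    remaining = capacity
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        share = remaining / len(pending)
        name, demand = pending[0]
        if demand <= share:
            allocation[name] = demand
            remaining -= demand
            pending.pop(0)
        else:
            # Every remaining flow wants more than the equal share.
            for flow_name, _ in pending:
                allocation[flow_name] = share
            break
    return allocation


# Three senders (rates in Mbps) towards a destination VM limited to 120 Mbps.
rates = max_min_fair({"vm1": 20, "vm2": 60, "vm3": 100}, capacity=120)
print(rates)  # {'vm1': 20, 'vm2': 50.0, 'vm3': 50.0}
```

The small sender keeps its full rate, while the two larger senders split what is left; each computed rate would then be installed in the corresponding per-destination-VM limiter.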

  14. Enforcing Virtual Network • Tenants without a virtual network • Two-level priorities • Traffic from tenants with a virtual network is high priority • Other traffic is low priority (fair share) • Unused capacity of a VM with a virtual network • Weighted sharing mechanisms • Unused capacity is distributed among all tenants

  15. Design Discussion • NM and Routing • The network manager (NM) assumes that the datacenter has a simple tree topology • For topologies with limited path diversity: multiple physical links can be treated as a single aggregate link • For richer network topologies: the NM can control datacenter routing to build tenant-specific trees • Failures • For failures of physical links and switches, the allocation algorithms can be extended to determine the tenant VMs that need to be migrated and reallocated

  16. Evaluation • Simulation setup • Tc: minimum compute time for the job • Tn: the time for the last flow to finish • T = max(Tc, Tn): the completion time • Tn < Tc: minimizes the tenant's cost • Baseline: purely VM-based resource allocation • Locality-aware allocation algorithm • A flow's bandwidth is calculated according to max-min fairness • Virtual network request • A baseline request <N> can be expressed as <N,B> or <N,B,S,O> • Simulation breadth • The entire space for most parameters of interest in today's datacenters • Tenant bandwidth requirements, datacenter load, and physical topology oversubscription
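
A small worked example of T = max(Tc, Tn) may help. It is a sketch under an assumption that is not stated on the slide: the network time Tn is approximated as the data the slowest flow must move divided by the bandwidth it receives. All names and numbers are hypothetical.

```python
def network_time(data_bits: float, bandwidth_bps: float) -> float:
    # Assumed model: time for the last flow = data to move / bandwidth it gets.
    return data_bits / bandwidth_bps


def completion_time(compute_time_s: float, network_time_s: float) -> float:
    # T = max(Tc, Tn), as defined on the slide.
    return max(compute_time_s, network_time_s)


tc = 100.0      # Tc: minimum compute time, in seconds
data = 5e9      # bits moved by the slowest flow

# With 100 Mbps the network finishes first, so the job is compute-bound.
print(completion_time(tc, network_time(data, 100e6)))  # 100.0

# With 25 Mbps the network is the bottleneck and the job takes twice as long.
print(completion_time(tc, network_time(data, 25e6)))   # 200.0
```

Under this model, any bandwidth of at least data / Tc (here 50 Mbps) keeps Tn from exceeding Tc, so the tenant pays for no more VM time than the computation itself requires.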

  17. Production Datacenter Experiment • Job completion time

  18. Production Datacenter Experiment • Utilization • the allocation of VMs does not account for network demands

  19. Production Datacenter Experiment • Diverse communication patterns. • each tenant VM requires a different bandwidth

  20. Cloud Datacenter Experiment • Rejected Requests • tenant dynamics with requests arriving over time • admission control scheme

  21. Cloud Datacenter Experiment • Tenant costs and provider revenue • Tenants are charged based on the time they occupy their VMs

  22. Cloud Datacenter Experiment • Charging for bandwidth • Virtual network abstractions allow explicitly charging for network bandwidth • For a request <N,B> held for time T, the tenant cost is given by one of two formulas (not preserved in the transcript)
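
Since the cost formulas are missing from the transcript, the sketch below shows one plausible linear pricing model and should be read as an assumption, not as the slide's content: VM-only pricing charges N · T · k_v, while bandwidth-aware pricing charges N · T · (k_v + k_b · B). The unit prices `k_v` and `k_b` and all example numbers are hypothetical.

```python
def vm_only_cost(n_vms: int, hours: float, k_v: float) -> float:
    # Assumed VM-time pricing: every VM is billed k_v per hour.
    return n_vms * hours * k_v


def vm_plus_bandwidth_cost(n_vms: int, hours: float, bw_mbps: float,
                           k_v: float, k_b: float) -> float:
    # Assumed bandwidth-aware pricing: k_b per Mbps per hour on top of k_v.
    return n_vms * hours * (k_v + k_b * bw_mbps)


# Hypothetical request <N=10, B=100 Mbps> held for T = 2 hours.
print(vm_only_cost(10, 2, k_v=0.5))                                    # 10.0
print(round(vm_plus_bandwidth_cost(10, 2, 100, k_v=0.5, k_b=0.001), 2))  # 12.0
```

Under such a model a tenant can weigh a higher hourly price against a shorter completion time, which is the trade-off between performance and cost that the conclusion slide refers to.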

  23. Results and conclusion • Virtual network abstractions • practical, can be efficiently implemented and provide significant benefits • provide a simple way of information exchange between tenants and providers • Tenant • expose network requirement and pick the trade-off between the performance of applications and cost • Provider • account for the network resources and improve their revenue

  24. Discussion clues • Actual bandwidth requirement: Many tenants do not know exactly how much bandwidth their various applications need, so how should this be handled? Unlike computing and storage resources, one tenant's bandwidth use affects other tenants because the total bandwidth is limited. So besides the pricing model, how can we make sure that a tenant's bandwidth request is appropriate, neither too much nor too little (for example, a monitoring system that reports actual demand to tenants)? • Failure of tenant VMs: In a virtual oversubscribed cluster, if a tenant VM fails, does only the failed VM need to be migrated and reallocated, or all of the tenant's VMs in the same group? Communication between the reallocated VM and the other VMs will increase the bandwidth required from the underlying physical infrastructure. • Description of network bandwidth resources: How can bandwidth requirements be described? There are no datasets describing job bandwidth requirements. • Network security: Compared to a physical switch, a virtual switch has weaker monitoring capability, so how can network security be ensured?

  25. Thank you!
