
zUpdate : Updating Data Center Networks with Zero Loss






Presentation Transcript


  1. zUpdate: Updating Data Center Networks with Zero Loss — Hongqiang Harry Liu (Yale University), Xin Wu (Duke University), Ming Zhang, Lihua Yuan, Roger Wattenhofer, Dave Maltz (Microsoft)

  2. DCN is constantly in flux • Upgrade → Reboot • New Switch • Traffic Flows

  3. DCN is constantly in flux • Traffic Flows • Virtual Machines

  4. Network updates are painful for operators (scenario: switch upgrade; Bob: an operator) • Complex planning — two weeks before the update, Bob has to: coordinate with application owners, prepare a detailed update plan, and review and revise the plan with colleagues • Unexpected performance faults — on the night of the update, Bob executes the plan by hand, but application alerts are triggered unexpectedly and switch failures force him to backpedal several times • Laborious process — eight hours later, Bob is still stuck with the update: no sleep overnight, numerous application complaints, and no quick fix in sight

  5. Congestion-free DCN update is the key • Applications want network updates to be seamless: reachability, low network latency (propagation, queuing), no packet drops • Congestion-free updates are hard: many switches are involved, a multi-step plan is needed, different scenarios have distinct requirements, and changes in the network interact with changes in traffic demand

  6. A Clos network with ECMP — all switches run Equal-Cost Multi-Path (ECMP); link capacity: 1000. [Figure: each ToR splits its traffic evenly (e.g. 600 into 300 + 300); the bottleneck link carries 620 + 150 + 150 = 920.]
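
The arithmetic in this figure can be sketched directly. The numbers (1000-unit links, 620 units of background traffic, two flows split by ECMP) come from the slide; the function name `ecmp_split` is illustrative, not from zUpdate:

```python
# Sketch of ECMP load computation on the slide's Clos example.
# ECMP splits a flow evenly across its equal-cost paths.

def ecmp_split(demand, num_paths):
    """Return the per-path load when `demand` is split evenly."""
    return [demand / num_paths] * num_paths

CAPACITY = 1000
background = 620            # traffic already on the bottleneck link

f1 = ecmp_split(300, 2)     # ToR1's flow: 150 per path
f2 = ecmp_split(300, 2)     # ToR5's flow: 150 per path

# The bottleneck link carries the background plus one subflow of each.
bottleneck = background + f1[0] + f2[0]
print(bottleneck)           # 920.0 — under the 1000-unit capacity
assert bottleneck <= CAPACITY
```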

  7. Switch upgrade: a naïve solution triggers congestion. [Figure: link capacity 1000; draining AGG1 reroutes its traffic, and the remaining link carries 620 + 300 + 150 = 1070 > 1000.]

  8. Switch upgrade: a smarter solution seems to be working. [Figure: link capacity 1000; with Weighted ECMP (splitting 600 into 500 + 100), draining AGG1 leaves the link at 620 + 300 + 50 = 970 < 1000.]

  9. Traffic distribution transition — from the initial traffic distribution (congestion-free) to the final traffic distribution (congestion-free). Is the transition simple? NO! Asynchronous switch updates get in the way. [Figure values: initial 300/0 and 600/300; final 500/100 and 300/300.]

  10. Asynchronous changes can cause transient congestion — when ToR1 has been updated but ToR5 has not yet, the link carries 620 + 300 + 150 = 1070, over its 1000-unit capacity. [Figure: drain AGG1; ToR1 already sends its full load one way while ToR5 still splits 300 + 300.]

  11. Solution: introducing an intermediate step — Initial → Intermediate → Final, where each of the two transitions is congestion-free regardless of asynchrony. [Figure values: initial 300/0 and 600/300; intermediate 200/400 and 450/150; final 500/100 and 300/300.]
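
During an asynchronous transition each flow may still be on its old path or already on its new one, so the worst-case load on a link is the sum of per-flow maxima. A minimal sketch, using the slides' bottleneck numbers (620 background, 1000 capacity, a flow growing 150 → 300 and one shrinking 150 → 50); the intermediate split here is one hand-picked safe choice, not zUpdate's computed plan:

```python
# Worst-case load on one link while flows move asynchronously between
# two distributions: each flow contributes max(old, new).

CAPACITY = 1000
BACKGROUND = 620   # traffic on the link that does not move

def worst_case(old_loads, new_loads):
    """Worst possible load on the link during the transition."""
    return BACKGROUND + sum(max(o, n) for o, n in zip(old_loads, new_loads))

initial      = [150, 150]   # per-flow load on the bottleneck link
final        = [300, 50]
intermediate = [150, 50]    # shrink flow 2 first, then grow flow 1

print(worst_case(initial, final))          # 1070 -> transient congestion
print(worst_case(initial, intermediate))   # 920  -> safe
print(worst_case(intermediate, final))     # 970  -> safe
```

The direct transition can transiently exceed capacity, while routing through the intermediate distribution keeps every reachable state under 1000.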

  12. How zUpdate performs congestion-free update — the operator supplies the update scenario; zUpdate derives the update requirements and computes a sequence of traffic distributions (Current → Intermediate → … → Intermediate → Target), realized on the data center network through routing-weight reconfigurations.

  13. Key technical issues • Describing traffic distribution • Representing update requirements • Defining conditions for congestion-free transition • Computing an update plan • Implementing an update plan

  14. Describing traffic distribution — l_f(v,u): flow f's load on the link from switch v to u. A traffic distribution is the collection of these loads over all flows and links. [Figure: example loads of 150 and 300 for a 600-unit flow split by ECMP.]

  15. Representing update requirements — to upgrade switch s2 (drain s2): constrain l_f(s2,u) = 0 for every flow f and every link (s2,u); when s2 recovers, to restore ECMP: constrain the target distribution to equal the ECMP distribution.

  16. Switch asynchronization exponentially inflates the possible load values — during a transition from the old traffic distribution to the new one, asynchronous updates of the switches along flow f's paths can produce exponentially many possible load values on a link. In large networks, checking every possible load value against link capacity is infeasible. [Figure: flow f from ingress to egress across switches 1–8.]

  17. Two-phase commit reduces the possible load values to two — with two-phase commit, f's load on a link has only two possible values throughout a transition: its old value or its new value. The ingress stamps each packet with a version; flipping the version moves the whole flow at once. [Figure: flow f across switches 1–8, with a version flip at the ingress.]
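
The version-flip mechanism can be sketched in a few lines. This is a toy model, not zUpdate's switch code: internal switches hold both rule versions, and the ingress stamp decides which one applies, so a flow is never partly old and partly new:

```python
# Toy model of two-phase commit for one flow at one internal switch.
# The switch keeps both rule versions; packets carry a version tag.

rules = {            # version -> next hop for flow f at this switch
    "old": "AGG1",
    "new": "AGG2",
}

def forward(packet):
    """Internal switch: match on the version stamped at the ingress."""
    return rules[packet["version"]]

# Phase 1: install the "new" rules everywhere (unused until the flip).
# Phase 2: flip the version at the ingress; all of f's packets move at once.
print(forward({"flow": "f", "version": "old"}))   # AGG1
print(forward({"flow": "f", "version": "new"}))   # AGG2
```

Because every packet of f carries exactly one version, f's load on any link is entirely the old value or entirely the new value, never a mixture.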

  18. Flow asynchronization exponentially inflates the possible load values — even with two-phase commit per flow, asynchronous updates to N independent flows can result in up to 2^N possible load values on a link. [Figure: flows f1 and f2 sharing a link, load f1 + f2.]

  19. Handling flow asynchronization — basic idea: on each link, reserve the larger of each flow's old and new loads. [Congestion-free transition constraint] There is no congestion throughout a transition if and only if, for every link e: Σ_f max(l_f_old(e), l_f_new(e)) ≤ c(e), where c(e) is the capacity of link e.
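
The slide's condition can be written as a network-wide check. The data structures (per-flow dicts of link loads, including the background traffic as a fixed "flow") are illustrative, not zUpdate's actual representation:

```python
# Congestion-free transition check: for every link e,
#   sum over flows f of max(old[f][e], new[f][e]) must not exceed c(e).

def congestion_free(old, new, capacity):
    """True iff the transition old -> new can never congest any link."""
    for e, cap in capacity.items():
        worst = sum(max(old[f].get(e, 0), new[f].get(e, 0)) for f in old)
        if worst > cap:
            return False
    return True

capacity = {"agg2-core": 1000}
old = {"f1": {"agg2-core": 150}, "f2": {"agg2-core": 150},
       "bg": {"agg2-core": 620}}
new = {"f1": {"agg2-core": 300}, "f2": {"agg2-core": 50},
       "bg": {"agg2-core": 620}}

print(congestion_free(old, new, capacity))  # False: 620+300+150 = 1070 > 1000
```

Note the condition is both necessary and sufficient: if it holds, no interleaving of flow updates can congest; if it fails, some interleaving does.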

  20. Computing a congestion-free transition plan — Linear Programming • Constant: current traffic distribution • Variables: target traffic distribution and one or more intermediate traffic distributions • Constraints: congestion-free, update requirements, deliver all traffic, flow conservation
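
zUpdate solves this as a linear program; as a toy stand-in on the single-bottleneck example, one can exhaustively search per-flow intermediate loads such that both transitions satisfy the congestion-free constraint. This sketch omits the LP's deliver-all-traffic and flow-conservation constraints, so it only illustrates the congestion-free part:

```python
# Toy stand-in for zUpdate's LP: grid-search an intermediate split of
# two flows on one bottleneck link (50-unit granularity) such that
# initial -> intermediate and intermediate -> final are both safe.
from itertools import product

CAP, BG = 1000, 620
initial, final = [150, 150], [300, 50]

def safe(a, b):
    """Congestion-free transition between per-flow loads a and b."""
    return BG + sum(max(x, y) for x, y in zip(a, b)) <= CAP

candidates = [
    m for m in product(range(0, 351, 50), repeat=2)
    if safe(initial, list(m)) and safe(list(m), final)
]
print(len(candidates) > 0)   # True: feasible intermediates exist
```

The real LP picks among such feasible points while also conserving each flow's total demand across its paths; here (150, 50) is one of the feasible intermediates found.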

  21. Implementing an update plan — only critical flows (those traversing bottleneck links) use Weighted-ECMP; other flows stay on plain ECMP. This keeps computation time down, respects switch table size limits, and reduces update overhead; the implementation also handles failures during transition and traffic demand variation.

  22. Evaluations • Testbed experiments • Large-scale trace-driven simulations

  23. Testbed setup — [Figure: drain AGG1; ToR6 and ToR7 send 6.2 Gbps; ToR5 and ToR8 send 6 Gbps each.]

  24. zUpdate achieves congestion-free switch upgrade — [Figure: link loads move Initial (3, 3, 3, 3 Gbps) → Intermediate (2, 4, 4.5, 1.5 Gbps) → Final (0, 6, 5, 1 Gbps).]

  25. One-step update causes transient congestion — [Figure: link loads jump directly from Initial (3, 3, 3, 3 Gbps) to Final (0, 6, 5, 1 Gbps).]

  26. Large-scale trace-driven simulations — a production DCN topology; 1% of the flows serve as test flows.

  27. zUpdate beats alternative solutions — [Bar chart: transition and post-transition loss rates (0–15%) for zUpdate, zUpdate-OneStep, ECMP-OneStep, and ECMP-Planned; the number of update steps per scheme ranges from 1 to 300+.]

  28. Conclusion • Switch and flow asynchronization can cause severe congestion during DCN updates • We present zUpdate for congestion-free DCN updates • Novel algorithms to compute update plans • Practical implementation on commodity switches • Evaluations on a real DCN topology and update scenarios — The End

  29. Thanks & Questions?

  30. Updating DCN is a painful process — [Figure: Bob, an operator, faces a switch upgrade while owners of interactive applications ask: Any performance disruption? How bad will the latency be? How long will the disruption last? What servers will be affected? Bob: Uh?…]

  31. Network update: a tussle between applications and operators • Applications want network updates to be fast and seamless: updates can happen on demand, with no performance disruption • Network update is time-consuming: today, an update is planned and executed by hand, with rollbacks in unplanned cases • Network update is risky: human errors and accidents

  32. Challenges in congestion-free DCN update • Many switches are involved: a multi-step plan is needed • Different scenarios have distinct requirements: switch upgrade/failure recovery, new switch on-boarding, load balancer reconfiguration, VM migration • Changes in routing (network) and traffic demand (application) must be coordinated

  33. Related work • SWAN [SIGCOMM'13]: maximizes network utilization; tunnel-based traffic engineering • Reitblatt et al. [SIGCOMM'12]: control-plane consistency during network updates; per-packet and per-flow consistency cannot guarantee "no congestion" • Raza et al. [ToN'11], Ghorbani et al. [HotSDN'12]: each targets one specific scenario (IGP update, VM migration), with one link-weight change or one VM migration at a time
