1 / 36

NetPilot : Automating Datacenter Network Failure Mitigation

NetPilot : Automating Datacenter Network Failure Mitigation. Xin Wu , Daniel Turner, Chao- Chih Chen, David A. Maltz , Xiaowei Yang, Lihua Yuan, Ming Zhang. Failures are Common and Harmful. Network failures are common. 10,000+ switches. Failures are Common and Harmful.

debra
Download Presentation

NetPilot : Automating Datacenter Network Failure Mitigation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang

  2. Failures are Common and Harmful • Network failures are common 10,000+ switches

  3. Failures are Common and Harmful Six-month failure logs of production datacenters • Network failures are common • Failures cause long down times 25% of failures take 13+hours to repair Time from detection to repair (minutes)

  4. Failures are Common and Harmful • Failures are common due to VERY large datacenters • Failures cause long down times • Long failure duration  large revenue loss

  5. How to Shorten Failure Recovery Time?

  6. Previous Work • Conventional failure recovery takes 3 steps • Failure localization/diagnosis • [M. K. Aguilera, SOSP’03] • [M. Y. Chen, NSDI’04] • [R.R Kompella, NSDI ’05] • [P.Bahl, SIGCOMM’07] • [S. Kandula, SIGCOMM’09]… Detection Diagnosis Repair passive ping active

  7. Automating Failure Diagnosis is Challenging • Root causes are deep in network stack • Diagnosis involves multiple parties

  8. Six-month failure logs fromseveral production DCNs Failure Diagnosis Requires Human Intervention ! • 2. Diagnosis involves multiple parties • 1. Root causes are deep in the network stack

  9. Can we do something other than failure diagnosis?

  10. NetPilot: Mitigating rather than Diagnosing Failures • Mitigatefailure symptoms ASAP, at the cost of reduced capacity Diagnosis Repair Automated Mitigation Detection

  11. NetPilot Benefits • Short recovery time • Small network disruption • Low operation cost Automated Mitigation Diagnosis Repair Detection

  12. Failure Mitigation is Effective • Most failures can be mitigated by simpleactions • Mitigation is feasible due to redundancy

  13. 68% of failures can be mitigated by simple actions

  14. Mitigation Made Possible by Redundancy Internet • Redundancy  deactivation unlikely to partition / overload the network CORE AGG ToR

  15. Outline • Automating failure diagnosis is challenging • Failure mitigation is effective • How to automate mitigation? • NetPilot evaluations • Conclusion

  16. A StrawmanNetPilot: Trial-and-error Network failure Localization Execute an action Roll back if necessary No Failure mitigated? End Yes

  17. NetPilot: Challenges & Solutions Network failure 1. Blind trial-and-error takes a long time Localization Localization Roll back if necessary Failure specific localization Execute an action No Failure mitigated? End Yes

  18. NetPilot: Challenges & Solutions Network failure Localization Localization 2. Partition/overload network Estimate impact Impact estimation Roll back if necessary Execute an action No Failure mitigated? End Yes

  19. NetPilot: Challenges & Solutions Network failure Localization Localization Estimate impact 3. Different actions have different side-effects Rank actions Roll back if necessary Rank actions based on impact Execute an action No Failure mitigated? End Yes

  20. Failure Specific Localization • Limited # of failure types • Domain knowledge improves accuracy

  21. Example: FrameCheckSequence (FCS) Errors • 13% of allthefailures • Cut-throughswitching • Forward frames before checksums are verified • Increaseapplicationlatency

  22. Localizing FCS Errors frames corrupted by L • xL: link corruption rate • # of variables = # of equations = # of links • Corrupted links: xL> 0 error frames seen on L frames corrupted by other links & traverse L

  23. NetPilotOverview Network failure Localization Estimate impact Rank actions Roll back if necessary Execute an action No Failure mitigated? End Yes

  24. Impact Metrics • Derived from Service Level Agreement (SLA) • Availability: online_server_ratio • Packet loss: total_lost_pkt • latency: max_link_utilization • Small link utilization  small (queuing) delay • Total_lost_pkt & max_link_utilization derived from utilization of individual links

  25. Estimating Link Utilization Action • # of flows >> redundant paths • Traffic evenly distributed under ECMP • Estimate the load contributed by each flow on each link • Sum up the loads to compute utilization Impact Estimator Link utilization Traffic Topology

  26. Link Utilization Estimation is Highly Accurate • 1-month traffic from a 8000-server network • Log socket events on each server • Ground truth: SNMP counters

  27. NetPilotOverview Network failure Localization Estimate impact Choose the action with the least impact Rank actions Roll back if necessary Execute an action No Failure mitigated? End Yes

  28. Outline • Automating failure diagnosis is challenging • Failure mitigation is effective • How to automate mitigation? • Localization  impact estimation  ranking • NetPilot evaluations • Mitigating load imbalance • Mitigating FCS errors • Mitigating overload • Conclusion

  29. Load Imbalance corea coreb • Agga stops receiving traffic • Localize to 4 suspects Aggb Agga

  30. Mitigating Load Imbalance corea coreb Aggb Agga corea -> aggb corea -> agga Mitigation confirmed coreb -> aggb coreb -> agga Detected & reboot coreb Agga stops receiving traffic Reboot corea Reboot Agga Load evenly splitted

  31. Fast FCS Error Mitigation 3.5 hours  15 minutes Human operator: after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated NetPilot: deactivates 2 links in 1 trial within 15 minutes

  32. Mitigating Link Overload • Mitigate overload by deactivatinghealthy links 1.5 1.5 core1 core2 agg core1 3 agg

  33. Mitigating Link Overload • Mitigate overload by deactivatinghealthy links • Many candidate links in production networks • Choose the link(s) with the least impact 0 1 1.5 3 1.5 1.5 core1 core1 core1 core2 core2 core2 agg agg agg lost 0.5 3 3 3

  34. Action Ranking Lowers Link Utilization • Replay 97 overload incidents due to link failures

  35. Conclusion • Mitigation shortens failure recovery time • Simple actions are effective • Made possible by redundancy • NetPilot: automating failure mitigation • Recovery time: hour  minutes • Several mitigation scenarios deployed in Bing

  36. Thank You! NetPilot: Automated Mitigation Detection Diagnosis Repair netpilot@microsoft.com

More Related