
Hedera: Dynamic Flow Scheduling for Data Center Networks

Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat. Presented by 馬嘉伶.


Presentation Transcript


  1. Hedera: Dynamic Flow Scheduling for Data Center Networks • Mohammad Al-Fares • Sivasankar Radhakrishnan • Barath Raghavan • Nelson Huang • Amin Vahdat • Presented by 馬嘉伶

  2. Easy to understand problem • MapReduce-style DC applications need bandwidth • DC networks have many ECMP paths between servers • Flow-hash-based load balancing is insufficient

  3. ECMP Paths • Many equal-cost paths going up to the core switches • Only one path down from each core switch • Paths are allocated to flows pseudo-randomly, using a hash of the flow • Agnostic to available resources • Long-lasting collisions between long (elephant) flows (Figure: paths from source S to destination D)

  4. Collisions of elephant flows • Collisions possible in two different ways • Upward path • Downward path (Figure: sources S1-S4 sending to destinations D1-D4)

  5. Collisions of elephant flows • Average of 61% of bisection bandwidth wasted on a network of 27K servers (Figure: sources S1-S4 sending to destinations D1-D4)
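
The collision problem on slides 3-5 can be illustrated with a minimal sketch of static hash-based path selection. The SHA-256 hash, the 5-tuple layout, and the function names below are assumptions chosen for illustration; real switches use their own hash functions.

```python
import hashlib
from collections import Counter

def ecmp_path(flow, n_paths):
    """Pick a path index from a static hash of the flow's header fields.

    The choice depends only on the header fields, never on current load,
    which is why two elephant flows can stay on the same path for their
    whole lifetime.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_paths

def path_load(flows, n_paths):
    """Count how many flows the hash places on each path."""
    return Counter(ecmp_path(flow, n_paths) for flow in flows)

# Hypothetical flows: (src IP, dst IP, src port, dst port, protocol).
flows = [("10.0.0.%d" % i, "10.1.0.%d" % i, 5000 + i, 80, "tcp") for i in range(5)]
load = path_load(flows, n_paths=4)
```

With five flows and only four paths, at least two flows must share a path whatever the hash does; with fewer flows a collision is still possible because the hash ignores available capacity.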

  6. Hedera Scheduler • Hedera: Dynamic Flow Scheduling - Optimize achievable bisection bandwidth by assigning flows non-conflicting paths - Uses flow demand estimation + placement heuristics to find good flow-to-core mappings (Figure: pipeline of Detect Large Flows, Estimate Flow Demands, Place Flows)

  7. Hedera Scheduler • Detect Large Flows • Flows that need bandwidth but are network-limited • Estimate Flow Demands • Use max-min fairness to allocate flows between src-dst pairs • Place Flows • Use estimated demands to heuristically find better placement of large flows on the ECMP paths (Figure: pipeline of Detect Large Flows, Estimate Flow Demands, Place Flows)

  8. Elephant Detection • Scheduler continually polls edge switches for flow byte-counts • Flows exceeding a B/s threshold are “large” • > 10% of the host’s link capacity (i.e. > 100 Mbps on GigE) • What if there are only “small” flows? • Default ECMP load-balancing is efficient

  9. Elephant Detection • Hedera complements ECMP! - Default forwarding uses ECMP - Hedera schedules large flows that cause bisection bandwidth problems
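
The threshold test on slide 8 can be sketched as follows. This is an illustrative sketch, not the scheduler's actual code: the function name, the dictionary-of-byte-counts input, and the polling interval are assumptions. Only the 10% threshold and the GigE link speed come from the slide.

```python
def detect_elephants(prev_bytes, curr_bytes, interval_s, link_bps=1e9, fraction=0.10):
    """Return flows whose rate over the last poll exceeds fraction * link capacity.

    prev_bytes / curr_bytes map a flow id to its cumulative byte count at two
    successive polls of the edge switch (hypothetical input format).
    """
    threshold_bps = fraction * link_bps
    large = []
    for flow, count in curr_bytes.items():
        delta_bytes = count - prev_bytes.get(flow, 0)
        rate_bps = delta_bytes * 8 / interval_s  # bytes per poll -> bits per second
        if rate_bps > threshold_bps:
            large.append(flow)
    return large

prev = {"a": 0, "b": 0}
curr = {"a": 50_000_000, "b": 1_000_000}  # bytes seen after a 1-second poll
elephants = detect_elephants(prev, curr, interval_s=1.0)
```

Here flow "a" runs at 400 Mbps and is flagged, while flow "b" runs at 8 Mbps and stays on its default ECMP path.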

  10. Demand Estimation • Flows can be constrained in two ways • Host-limited (at source, or at destination) • Network-limited • Measured flow rate is misleading • Need to find a flow’s “natural” bandwidth requirement when not limited by the network • Forget network, just allocate capacity between flows using max-min fairness

  11. Demand Estimation • Given the traffic matrix of large flows, modify each flow’s size at its source and destination iteratively… • Sender equally distributes bandwidth among outgoing flows that are not receiver-limited • Network-limited receivers decrease exceeded capacity equally between incoming flows • Repeat until all flows converge • Guaranteed to converge in O(|F|) time

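  12. Demand Estimation (Figure: hosts A, X, B, Y, C; senders step)

  13. Demand Estimation (Figure: hosts A, X, B, Y, C; receivers step)

  14. Demand Estimation (Figure: hosts A, X, B, Y, C; senders step)

  15. Demand Estimation (Figure: hosts A, X, B, Y, C; receivers step)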
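
The alternating sender and receiver steps from slide 11 can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: flows are (src, dst) pairs, every host NIC has capacity 1.0, and the function name and `max_rounds` guard are hypothetical.

```python
def estimate_demands(flows, max_rounds=100):
    """Estimate each flow's demand as a fraction of NIC capacity (1.0)."""
    demand = [0.0] * len(flows)
    converged = [False] * len(flows)  # True once a receiver has capped the flow
    sources = sorted({s for s, _ in flows})
    receivers = sorted({d for _, d in flows})
    for _ in range(max_rounds):
        before = list(demand)
        # Sender step: split leftover capacity among flows not yet capped.
        for src in sources:
            out = [i for i, (s, _) in enumerate(flows) if s == src]
            fixed = sum(demand[i] for i in out if converged[i])
            free = [i for i in out if not converged[i]]
            for i in free:
                demand[i] = (1.0 - fixed) / len(free)
        # Receiver step: if oversubscribed, cap the largest flows equally.
        for dst in receivers:
            inc = [i for i, (_, d) in enumerate(flows) if d == dst]
            if sum(demand[i] for i in inc) <= 1.0:
                continue
            limited = set(inc)
            small_total = 0.0
            share = 1.0 / len(limited)
            while True:
                small = {i for i in limited if demand[i] < share}
                if not small:
                    break
                # Flows below the fair share keep their demand.
                small_total += sum(demand[i] for i in small)
                limited -= small
                share = (1.0 - small_total) / len(limited)
            for i in limited:
                demand[i] = share
                converged[i] = True
        if demand == before:
            break
    return demand

# Hypothetical traffic: A sends to X and Y; B and C both send to Y.
demands = estimate_demands([("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")])
```

In this example receiver Y is oversubscribed, so its three incoming flows are capped at an equal share, and sender A then gives the capacity it freed up to its flow towards X.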

  16. Flow placement • Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized • That is, maximize utilization of the theoretically available bandwidth • Two approaches • Global First Fit: Greedily choose a path that has sufficient unreserved bandwidth • Simulated Annealing: Iteratively find a globally better mapping of paths to flows

  17. Global First-Fit Scheduler • New flow detected, linearly search all possible paths from S to D • Place flow on first path whose component links can fit that flow (Figure: Flows A, B, C and core switches 0-3)

  18. Global First-Fit Scheduler • Flows placed upon detection are not moved • Once a flow ends, entries + reservations time out (Figure: Flows A, B, C and core switches 0-3)
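
A minimal sketch of the first-fit search from slides 17-18, assuming each path is a list of link names and every link has capacity 1.0. The function name, the link-naming scheme, and the choice to return `None` when nothing fits are illustrative assumptions.

```python
def global_first_fit(demand, paths, reserved, capacity=1.0):
    """Reserve `demand` on the first path whose links can all fit it.

    paths: candidate paths from source to destination, each a list of link ids.
    reserved: dict of link id -> bandwidth already reserved (updated in place).
    Returns the index of the chosen path, or None if no path fits.
    """
    for index, links in enumerate(paths):
        if all(reserved.get(link, 0.0) + demand <= capacity for link in links):
            for link in links:
                reserved[link] = reserved.get(link, 0.0) + demand
            return index
    return None

# Hypothetical topology: one uplink and one downlink per core switch 0-3.
paths = [["up%d" % core, "down%d" % core] for core in range(4)]
reserved = {}
choice_a = global_first_fit(0.6, paths, reserved)
choice_b = global_first_fit(0.6, paths, reserved)
choice_c = global_first_fit(0.3, paths, reserved)
```

Flow A takes the path through core 0, Flow B no longer fits there and takes core 1, and the smaller Flow C still fits alongside Flow A on core 0. Releasing reservations when a flow times out is omitted from this sketch.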

  19. Simulated Annealing • Probabilistic search for good flow-to-core mappings - Goal: maximize achievable bisection bandwidth • Current flow-to-core mapping generates a neighbor state - Calculate total exceeded bandwidth capacity - Accept move to neighbor state if it yields a bisection BW gain • A few thousand iterations for each scheduling round - Avoid local minima: non-zero probability of moving to a worse state

  20. Simulated Annealing

  21. Simulated Annealing Scheduler • Example run: 3 flows, 3 iterations, core switches 0-3 • Flow A: core 2, 2, 2 • Flow B: core 1, 0, 0 • Flow C: core 0, 2, 3

  22. Simulated Annealing Scheduler • Final state is published to the switches and used as the initial state for the next round • Final state: Flow A on core 2, Flow B on core 0, Flow C on core 3

  23. Fault-Tolerance Scheduler • Link / Switch failure • Use PortLand’s fault notification protocol • Hedera routes around failed components (Figure: Flows A, B, C and core switches 0-3)

  24. Fault-Tolerance Scheduler • Scheduler failure • Soft-state, not required for correctness (connectivity) • Switches fall back to ECMP (Figure: Flows A, B, C and core switches 0-3)
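
The search loop described on slides 19-22 can be sketched roughly as follows. This is a heavily simplified illustration, not Hedera's scheduler: it assumes each core switch behaves like a single link of capacity 1.0, uses a linear cooling schedule, and all names and parameters are hypothetical. The energy is the total exceeded capacity, as on slide 19.

```python
import math
import random

def energy(mapping, demands, n_cores, capacity=1.0):
    """Total bandwidth by which core loads exceed capacity (lower is better)."""
    load = [0.0] * n_cores
    for flow, core in mapping.items():
        load[core] += demands[flow]
    return sum(max(0.0, used - capacity) for used in load)

def anneal(demands, n_cores, initial=None, iterations=1000, t0=1.0, seed=0):
    """Search flow-to-core mappings; returns (best mapping, its energy)."""
    rng = random.Random(seed)
    flows = sorted(demands)
    current = dict(initial) if initial else {flow: 0 for flow in flows}
    e_cur = energy(current, demands, n_cores)
    best, e_best = dict(current), e_cur
    for step in range(iterations):
        temp = t0 * (1 - step / iterations)  # stays above zero for all steps
        # Neighbor state: move one randomly chosen flow to a random core.
        neighbor = dict(current)
        neighbor[rng.choice(flows)] = rng.randrange(n_cores)
        e_new = energy(neighbor, demands, n_cores)
        # Always accept improvements; accept worse states with a probability
        # that shrinks as the temperature drops, to escape local minima.
        if e_new <= e_cur or rng.random() < math.exp((e_cur - e_new) / temp):
            current, e_cur = neighbor, e_new
            if e_cur < e_best:
                best, e_best = dict(current), e_cur
    return best, e_best

demands = {"A": 0.8, "B": 0.8, "C": 0.8}
best, e_best = anneal(demands, n_cores=4)
```

Passing the previous round's result as `initial` mirrors slide 22, where the final state seeds the next scheduling round.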

  25. Evaluation • 16-host testbed - k=4 fat-tree data-plane - 20 machines; 4-port NetFPGAs / OpenFlow - Parallel 48-port non-blocking Quanta switch (debug & control) • 1 Scheduler machine - Dynamic traffic monitoring - OpenFlow routing control

  26. Evaluation

  27. Evaluation Data Shuffle • 16 hosts: 120 GB all-to-all in-memory shuffle • Hedera achieves 39% better bisection BW than ECMP, 88% of an ideal non-blocking switch

  28. Conclusions • Simulated Annealing delivers significant bisection BW gains over standard ECMP • Hedera complements ECMP • If you are running MapReduce/Hadoop jobs on your network, you stand to benefit greatly from Hedera; tiny investment!
