Hedera: Dynamic Flow Scheduling for Data Center Networks • Mohammad Al-Fares • Sivasankar Radhakrishnan • Barath Raghavan • Nelson Huang • Amin Vahdat • Presented by 馬嘉伶
Easy-to-understand problem: • MapReduce-style DC applications need bandwidth • DC networks have many ECMP paths between servers • Flow-hash-based load balancing is insufficient
ECMP Paths • Many equal-cost paths going up to the core switches • Only one path down from each core switch • Paths are randomly allocated to flows using a hash of the flow • Agnostic to available resources • Long-lasting collisions between long (elephant) flows
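To see why hashing is agnostic to load, consider a minimal, hypothetical sketch (not the paper's code) of hash-based ECMP path selection: the switch hashes the flow's 5-tuple to pick an uplink, so two elephant flows can collide on the same link no matter how large they are.

```python
# Hypothetical sketch of hash-based ECMP uplink selection.
import zlib

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    """Deterministically pick one equal-cost uplink from the 5-tuple hash."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

uplinks = ["core0", "core1", "core2", "core3"]
# The hash ignores flow size, so two elephants can land on the same uplink:
print(ecmp_uplink("10.0.0.1", "10.2.0.1", 4231, 80, "tcp", uplinks))
print(ecmp_uplink("10.1.0.1", "10.3.0.1", 5180, 80, "tcp", uplinks))
```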
Collisions of elephant flows • Collisions are possible in two different ways • On the upward path • On the downward path
Collisions of elephant flows • An average of 61% of bisection bandwidth is wasted on a network of 27K servers
Hedera Scheduler • Hedera: dynamic flow scheduling • Optimizes achievable bisection bandwidth by assigning flows non-conflicting paths • Uses flow demand estimation + placement heuristics to find good flow-to-core mappings • Pipeline: Detect Large Flows → Estimate Flow Demands → Place Flows
Hedera Scheduler • Detect Large Flows: flows that need bandwidth but are network-limited • Estimate Flow Demands: use max-min fairness to allocate bandwidth among flows between src-dst pairs • Place Flows: use estimated demands to heuristically find a better placement of large flows on the ECMP paths
Elephant Detection • Scheduler continually polls edge switches for per-flow byte counts • Flows exceeding a byte-rate threshold are "large" • > 10% of the host's link capacity (i.e., > 100 Mbps on GigE) • What if there are only "small" flows? • Default ECMP load balancing is efficient
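The detection step can be pictured with a small, illustrative sketch; the poll interval, flow keys, and threshold constant below are assumptions rather than Hedera's actual OpenFlow code.

```python
# Illustrative elephant detector: compare successive byte counters and
# flag flows whose rate exceeds 10% of a 1 Gbps host link.
THRESHOLD_BPS = 0.10 * 1e9   # 100 Mbps, 10% of GigE (as in the talk)
POLL_INTERVAL = 5.0          # seconds between polls (assumed value)

def detect_large_flows(prev_bytes, curr_bytes):
    """Return the flows whose observed rate crossed the elephant threshold."""
    large = set()
    for flow, count in curr_bytes.items():
        rate_bps = 8 * (count - prev_bytes.get(flow, 0)) / POLL_INTERVAL
        if rate_bps > THRESHOLD_BPS:
            large.add(flow)
    return large
```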
Elephant Detection • Hedera complements ECMP! • Default forwarding uses ECMP • Hedera schedules the large flows that cause bisection bandwidth problems
Demand Estimation • Flows can be constrained in two ways • Host-limited (at the source or at the destination) • Network-limited • Measured flow rate is misleading • Need to find a flow's "natural" bandwidth requirement when it is not limited by the network • Forget the network; just allocate capacity among flows using max-min fairness
Demand Estimation • Given the traffic matrix of large flows, modify each flow's size at its source and destination iteratively… • Each sender equally distributes bandwidth among outgoing flows that are not receiver-limited • Network-limited receivers decrease the exceeded capacity equally among incoming flows • Repeat until all flows converge • Guaranteed to converge in O(|F|) time
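A simplified Python rendition of this alternation is sketched below, assuming NIC capacities normalized to 1.0 and flows given as (src, dst) pairs; it follows the sender/receiver steps described above but is not the paper's exact pseudocode.

```python
# Simplified max-min demand estimator (hypothetical sketch).
from collections import defaultdict

def estimate_demands(flows, eps=1e-6):
    demand = {f: 0.0 for f in flows}
    rl = {f: False for f in flows}          # receiver-limited flags
    while True:
        # Sender step: split each sender's capacity equally among its
        # flows that are not receiver-limited.
        by_src = defaultdict(list)
        for f in flows:
            by_src[f[0]].append(f)
        for fs in by_src.values():
            fixed = sum(demand[f] for f in fs if rl[f])
            free = [f for f in fs if not rl[f]]
            for f in free:
                demand[f] = max(0.0, 1.0 - fixed) / len(free)
        # Receiver step: if a receiver is oversubscribed, cap the largest
        # incoming flows at their fair share (water-filling).
        changed = False
        by_dst = defaultdict(list)
        for f in flows:
            by_dst[f[1]].append(f)
        for fs in by_dst.values():
            if sum(demand[f] for f in fs) <= 1.0 + eps:
                continue
            capacity, remaining = 1.0, len(fs)
            for f in sorted(fs, key=lambda f: demand[f]):
                share = capacity / remaining
                if demand[f] > share + eps:
                    demand[f] = share
                    rl[f] = True
                    changed = True
                capacity -= demand[f]
                remaining -= 1
        if not changed:                      # no receiver re-limited a flow
            return demand
```

For example, with flows = [("A", "X"), ("B", "X"), ("B", "Y")], the estimator converges to a demand of 0.5 for every flow.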
Demand Estimation • Worked example (animated in the original slides): senders A, B, C and receivers X, Y alternate sender and receiver steps until all flow demands converge
Flow Placement • Find a good allocation of paths for the set of large flows, such that the average bisection bandwidth of the flows is maximized • That is, maximum utilization of the theoretically available bandwidth • Two approaches • Global First-Fit: greedily choose the first path that has sufficient unreserved bandwidth • Simulated Annealing: iteratively find a globally better mapping of flows to paths
Global First-Fit Scheduler • When a new flow is detected, linearly search all possible paths from S→D • Place the flow on the first path whose component links can fit it
Global First-Fit Scheduler • Flows placed upon detection are not moved afterwards • Once a flow ends, its entries and reservations time out
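A compact sketch of this first-fit search follows; `paths_between(src, dst)` (enumerating each equal-cost path as a list of links) and the `reserved` map are assumed helpers, not Hedera's actual API.

```python
# Hypothetical sketch of Global First-Fit placement.
def global_first_fit(src, dst, demand, paths_between, reserved, cap=1.0):
    """Place a flow on the first path that fits its estimated demand."""
    for path in paths_between(src, dst):        # linear search over paths
        if all(reserved.get(link, 0.0) + demand <= cap for link in path):
            for link in path:                   # reserve along the whole path
                reserved[link] = reserved.get(link, 0.0) + demand
            return path                         # scheduler installs this path
    return None                                 # no fit: flow stays on ECMP
```

A reservation made here stays in place until the flow's entries time out, matching the slide above.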
Simulated Annealing • Probabilistic search for good flow-to-core mappings • Goal: maximize achievable bisection bandwidth • The current flow-to-core mapping generates a neighbor state • Calculate the total exceeded bandwidth capacity • Accept the move to the neighbor state if it gains bisection bandwidth • A few thousand iterations per scheduling round • Avoids local minima: non-zero probability of moving to a worse state
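A bare-bones version of that loop is sketched below; `cost(state)` (total exceeded link capacity of a flow-to-core mapping) and `neighbor(state)` (reassign one random flow to another core) are assumed helpers, and the linear cooling schedule is illustrative rather than the paper's exact choice.

```python
# Illustrative simulated-annealing search over flow-to-core mappings.
import math
import random

def anneal(initial, cost, neighbor, iters=10000, temp0=1.0):
    state = dict(initial)                       # flow -> core mapping
    best = dict(initial)
    for i in range(iters):
        temp = max(temp0 * (1 - i / iters), 1e-9)   # cool toward zero
        cand = neighbor(state)
        delta = cost(cand) - cost(state)
        # Always accept improvements; accept worse states with
        # probability exp(-delta / temp) to escape local minima.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            state = cand
            if cost(state) < cost(best):
                best = dict(state)
    return best
```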
Simulated Annealing Scheduler • Example run: 3 flows, 3 iterations • Each iteration proposes new core assignments for the flows, converging to Flow A → core 2, Flow B → core 0, Flow C → core 3
Simulated Annealing Scheduler • The final state is published to the switches and used as the initial state for the next round
Fault-Tolerance Scheduler • Link / switch failure • Use PortLand's fault notification protocol • Hedera routes around failed components
Fault-Tolerance Scheduler • Scheduler failure • Soft state, not required for correctness (connectivity) • Switches fall back to ECMP
Evaluation • 16-host testbed • k=4 fat-tree data plane • 20 machines; 4-port NetFPGAs / OpenFlow • Parallel 48-port non-blocking Quanta switch (debug & control) • 1 scheduler machine • Dynamic traffic monitoring • OpenFlow routing control
Evaluation: Data Shuffle • 16 hosts: 120 GB all-to-all in-memory shuffle • Hedera achieves 39% better bisection bandwidth than ECMP, 88% of an ideal non-blocking switch
Conclusions • Simulated Annealing delivers significant bisection bandwidth gains over standard ECMP • Hedera complements ECMP • If you are running MapReduce/Hadoop jobs on your network, you stand to benefit greatly from Hedera; a tiny investment!