A Scalable, Commodity Data Center Network Architecture Jingyang Zhu
Outline • Motivation • Background • Fat Tree Architecture • Topology • Routing • Fault Tolerance • Results
Motivation • MapReduce • Large data shuffle
Intuitive Approach • High-end hardware (e.g., InfiniBand)
Alternative Approach • A dedicated interconnection network • Scalability • Cost • Compatibility (i.e., app, OS, hardware)
Clos Network • (m, n, r) = (5, 3, 4) • Strictly non-blocking: m >= 2n - 1 • Rearrangeably non-blocking: m >= n • E.g., here m = 5 = 2*3 - 1, so the example network is just strictly non-blocking
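As a concrete check of these two conditions, here is a minimal Python sketch (the function name is mine; note that r does not enter either bound):

def clos_nonblocking(m, n):
    """Classify a Clos(m, n, r) network by its middle-stage switch count m."""
    if m >= 2 * n - 1:
        return "strictly non-blocking"       # any new request routes greedily
    if m >= n:
        return "rearrangeably non-blocking"  # may need to reroute existing circuits
    return "blocking"

print(clos_nonblocking(5, 3))  # strictly non-blocking, matching the example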
Benes Network • A rearrangeably non-blocking Clos network built recursively from 2x2 switches
Fat Tree • Multi-path: several equal-cost routes between any pair of hosts • Routing: uplink half (toward the core) + downlink half (toward the destination) • Oversubscription: ratio of the ideal bandwidth to the bandwidth actually available at the host end; e.g., 1:1 is good, 5:1 is bad • Example: Node 1 (0001) -> Node 6 (0110) has 2 possible paths
Topology of a Data Center - Hierarchy [Figure: conventional hierarchical data-center topology; GigE links at the hosts, 10 GigE links toward the core, multi-path = 2]
Topology of a Data Center - Fat Tree • (k/2)^2 k-port core switches • k pods, each with k/2 k-port aggregation switches and k/2 k-port edge switches • # of hosts: k^3 / 4; e.g., k = 48 => 27648 hosts (scalability!) [Figure: fat-tree topology for k = 4]
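Both the element counts and the multi-path property are easy to sanity-check; a minimal sketch, with function names of my own choosing:

def fat_tree_sizes(k):
    """Element counts for a k-ary fat tree (k must be even)."""
    assert k % 2 == 0
    return {
        "core_switches": (k // 2) ** 2,  # (k/2)^2 k-port core switches
        "pod_switches": k * k,           # k pods x (k/2 aggregation + k/2 edge)
        "hosts": k ** 3 // 4,            # (k/2)^2 hosts per pod x k pods
    }

def fat_tree_paths(k, same_pod=False, same_edge=False):
    """Equal-cost shortest paths between two hosts in a k-ary fat tree."""
    if same_edge:
        return 1              # both hosts hang off the same edge switch
    if same_pod:
        return k // 2         # one path per aggregation switch in the pod
    return (k // 2) ** 2      # one path per core switch

print(fat_tree_sizes(48)["hosts"])   # 27648
print(fat_tree_paths(4))             # 4 inter-pod paths when k = 4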
Addressing - Compatibility! • Pod switches: 10.pod.switch.1 • Core switches: 10.k.j.i (k - the switch radix; <j, i> - the core switch's grid coordinates, j, i = 1, 2, ..., k/2) [Figure: pod-switch addresses for k = 4, e.g., 10.0.0.1 through 10.3.3.1]
Addressing (cont'd) • Hosts: 10.pod.switch.ID, with ID = 2, ..., k/2 + 1 • The addressing format exists to support the routing scheme that follows [Figure: example host addresses for k = 4, e.g., 10.0.0.2, 10.0.0.3, 10.1.1.2, 10.1.1.3, hanging off edge switches 0 and 1]
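The whole address plan can be enumerated mechanically; a sketch under the paper's scheme (the function name is mine):

def fat_tree_addresses(k):
    """Enumerate the paper's address scheme for a k-ary fat tree."""
    pod_switches = [f"10.{pod}.{sw}.1"
                    for pod in range(k) for sw in range(k)]   # edge: sw < k/2
    core_switches = [f"10.{k}.{j}.{i}"
                     for j in range(1, k // 2 + 1)
                     for i in range(1, k // 2 + 1)]
    hosts = [f"10.{pod}.{sw}.{hid}"
             for pod in range(k) for sw in range(k // 2)
             for hid in range(2, k // 2 + 2)]                 # ID in [2, k/2+1]
    return pod_switches, core_switches, hosts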
2-level table routing - pod switch • First level: /24 prefixes (match on the 24 MSBs) -> downlinks toward the hosts • Second level: /8 suffixes (match on the 8 LSBs, i.e., the host ID) -> uplinks toward the core • Traffic diffusion occurs only in the first half of a packet's journey; once a packet reaches a core switch, exactly one downward path leads to its destination
Generation of routing tables • addPrefix(pod switch, prefix, port) • addSuffix(pod switch, suffix, port)
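For one aggregation switch, the generated two-level table looks roughly like this (a sketch following the paper's generation algorithm; the function name and the dict-of-port-numbers representation are mine):

def aggregation_switch_table(k, pod, z):
    """Two-level table for aggregation switch z (k/2 <= z < k) in pod `pod`.

    First level: terminating /24 prefixes send intra-pod traffic down.
    Second level: /8 host-ID suffixes spread inter-pod traffic over the uplinks.
    """
    prefixes = {f"10.{pod}.{subnet}.0/24": subnet            # down port = subnet index
                for subnet in range(k // 2)}
    suffixes = {f"0.0.0.{i}/8": (i - 2 + z) % (k // 2) + k // 2   # up ports k/2..k-1
                for i in range(2, k // 2 + 2)}
    return prefixes, suffixes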
Routing Table Implementation • Content-Addressable Memory (CAM) • Input: a search key (the destination address); output: the location of the matching entry (or a miss)
Routing Table Implementation (cont'd) [Figure: TCAM lookup for host address 10.2.0.3; the matching entry's RAM address selects the output port]
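A software emulation of that lookup, reusing the table sketch above (again with my own names; a real switch does this in TCAM hardware):

def route(prefixes, suffixes, dst_ip):
    """Emulate the TCAM lookup: longest (/24) prefix first, /8 suffix otherwise."""
    octets = dst_ip.split(".")
    first_level = ".".join(octets[:3]) + ".0/24"
    if first_level in prefixes:
        return prefixes[first_level]           # terminating prefix -> down port
    return suffixes[f"0.0.0.{octets[3]}/8"]    # suffix on the host ID -> up port

# Usage (k = 4, pod 2, aggregation switch z = 2):
prefixes, suffixes = aggregation_switch_table(4, pod=2, z=2)
print(route(prefixes, suffixes, "10.2.0.3"))   # down port 0 (same pod)
print(route(prefixes, suffixes, "10.0.1.2"))   # an uplink port (other pod)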
Routing Example: Hierarchical Tree • 10.0.1.2 -> 10.2.0.3 • 10.0.1.3 -> 10.2.0.2 • Both flows climb the same tree and contend for the shared uplinks
Routing Example: Fat Tree • 10.0.1.2 -> 10.2.0.3 • 10.0.1.3 -> 10.2.0.2 • The two-level tables spread the flows over different core switches - no contention!
Dynamic Routing • So far, routing has used static tables - can we do better? • Yes: dynamic routing, in two flavors • Flow classification • Flow scheduling
Dynamic Routing 1 - Flow Classification • Flow: a set of packets whose order must be preserved • Goals: avoid reordering packets within a flow, and reassign a minimal number of flows to minimize the load disparity between ports • Flow classifier: identifies which flow a packet belongs to
Flow Classification • Identify flows by their source & destination addresses • Balance the port load dynamically: every t seconds, rearrange at most 3 flows • Rearranging a flow risks reordering it - acceptable, because classification is a performance optimization, not a correctness requirement
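A toy version of the rebalancing step (the paper minimizes the number of reassigned flows; this sketch uses a simple smallest-flow-first heuristic, and all names are mine):

def rebalance(port_load, flow_size, flow_port, max_moves=3):
    """Every t seconds: move up to `max_moves` flows from the most-loaded
    uplink to the least-loaded one. Moving a flow may reorder its packets,
    which is why this is a performance optimization, not a correctness one."""
    for _ in range(max_moves):
        hot = max(port_load, key=port_load.get)
        cold = min(port_load, key=port_load.get)
        candidates = [f for f, p in flow_port.items() if p == hot]
        if not candidates or port_load[hot] <= port_load[cold]:
            break
        f = min(candidates, key=flow_size.get)   # smallest flow: least disruption
        flow_port[f] = cold
        port_load[hot] -= flow_size[f]
        port_load[cold] += flow_size[f]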
Dynamic Routing 2 - Flow Scheduling • Large flows are critical - schedule them individually

// edge switch
if (length(flow_in) > threshold)
    notify central scheduler
else
    route as normal

// central scheduler
on notification:
    foreach possible path:
        if (path not reserved)
            reserve the path & notify the switches along it
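The same logic as a runnable Python sketch (the threshold value, class name, and path representation are assumptions, not the paper's):

LARGE_FLOW_THRESHOLD = 1 << 20  # bytes; a hypothetical cutoff, not from the paper

class CentralScheduler:
    """Toy central scheduler: greedily reserves the first free core path."""

    def __init__(self, paths):
        self.paths = paths       # (src_pod, dst_pod) -> list of paths (tuples)
        self.reserved = {}       # flow id -> reserved path

    def notify(self, flow_id, src_pod, dst_pod):
        """Called by an edge switch once a flow crosses the size threshold."""
        taken = set(self.reserved.values())
        for path in self.paths[(src_pod, dst_pod)]:
            if path not in taken:
                self.reserved[flow_id] = path  # would also install table entries
                return path                    # on every switch along the path
        return None                            # no free path: keep default route

def edge_switch_forward(scheduler, flow_id, flow_bytes, src_pod, dst_pod):
    """Edge-switch half: big flows go to the scheduler, the rest route normally."""
    if flow_bytes > LARGE_FLOW_THRESHOLD:
        return scheduler.notify(flow_id, src_pod, dst_pod)
    return None  # fall through to the static two-level table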
Discussion • Which one is better? • Flow classification: acts locally, at the pod switches • Flow scheduling: acts globally, across all paths and switches
Fault Tolerance • How do we detect failed links or switches? Bidirectional Forwarding Detection (BFD)
Fault Tolerance (cont'd) • Basic ideas • Mark a failed link as unavailable for routing, e.g., set its load to infinity in the flow classifier • Broadcast the fault to the other switches so they route around it
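A sketch of how a switch might react to a BFD timeout (the hook names are hypothetical, not from the paper):

INF = float("inf")

def on_bfd_timeout(port_load, failed_port, neighbor_switches):
    """Mark the failed link locally, then tell the neighbors to avoid it."""
    port_load[failed_port] = INF              # the flow classifier never picks it
    for switch in neighbor_switches:
        switch.mark_link_down(failed_port)    # hypothetical broadcast hook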
Cost [Figure: network cost comparison at 1:1 oversubscription]
Power & Heat [Figure: power consumption and heat dissipation for different switches, 10 GigE vs. commodity GigE]
Performance [Figure: percentage of ideal bisection bandwidth achieved under different benchmarks]
Conclusion • Fat tree for data-center interconnection • Scalable • Cost-efficient • Compatible • Routing handled both locally & globally (flow classification & flow scheduling) • Fault tolerant