Data Center Networking CS 6250: Computer Networking, Fall 2011
Cloud Computing • Elastic resources • Expand and contract resources • Pay-per-use • Infrastructure on demand • Multi-tenancy • Multiple independent users • Security and resource isolation • Amortize the cost of the (shared) infrastructure • Flexible service management • Resiliency: isolate failures of servers and storage • Workload movement: move work to other locations
Data Center Trends • Data centers will grow ever larger in the cloud computing era to benefit from economies of scale • Tens of thousands, growing to hundreds of thousands, of servers • Most important concerns in data center management • Economies of scale • High utilization of equipment • Maximize revenue • Amortize administration cost • Low power consumption (http://www.energystar.gov/ia/partners/prod_development/downloads/EPA_Datacenter_Report_Congress_Final1.pdf) Sources: J. Nicholas Hoover, InformationWeek, June 17, 2008; http://gigaom.com/cloud/google-builds-megadata-center-in-finland/ (a roughly 200 million Euro facility); http://www.datacenterknowledge.com
Cloud Service Models • Software as a Service • Provider licenses applications to users as a service • e.g., customer relationship management, email, … • Avoid costs of installation, maintenance, patches, … • Platform as a Service • Provider offers a software platform for building applications • e.g., Google’s App-Engine • Avoid worrying about scalability of the platform • Infrastructure as a Service • Provider offers raw computing, storage, and network • e.g., Amazon’s Elastic Compute Cloud (EC2) • Avoid buying servers and estimating resource needs
Multi-Tier Applications • Applications consist of tasks • Many separate components • Running on different machines • Commodity computers • Many general-purpose computers • Not one big mainframe • Easier scaling [Diagram: front-end server feeding a layer of aggregators, which fan out to many workers]
Enabling Technology: Virtualization • Multiple virtual machines on one physical machine • Applications run unmodified, as on a real machine • A VM can migrate from one computer to another
Top-of-Rack Architecture • Rack of servers • Commodity servers • And top-of-rack switch • Modular design • Preconfigured racks • Power, network, and storage cabling • Aggregate to the next level
Modularity • Containers • Many containers
Data Center Challenges • Traffic load balancing • Support for VM migration • Achieving bisection bandwidth • Power savings / cooling • Network management (provisioning) • Security (dealing with multiple tenants)
Data Center Costs (Monthly Costs) • Servers: 45% • CPU, memory, disk • Infrastructure: 25% • UPS, cooling, power distribution • Power draw: 15% • Electrical utility costs • Network: 15% • Switches, links, transit http://perspectives.mvdirona.com/2008/11/28/CostOfPowerInLargeScaleDataCenters.aspx
Common Data Center Topology [Diagram: Internet connects to the data center core (Layer-3 routers), then to the aggregation layer (Layer-2/3 switches), then to the access layer (Layer-2 switches), then to the servers]
Data Center Network Topology [Diagram: Internet connects to core routers (CR), then access routers (AR), then Ethernet switches (S), then racks of servers (A); ~1,000 servers/pod] • Key • CR = Core Router • AR = Access Router • S = Ethernet Switch • A = Rack of app. servers
Requirements for future data centers • To keep pace with the trend toward mega data centers, DCN technology should meet the following requirements • High scalability • Transparent VM migration (high agility) • Easy deployment requiring little human administration • Efficient communication • Loop-free forwarding • Fault tolerance • Current DCN technology cannot meet these requirements • Layer 3 protocols cannot support transparent VM migration • Current Layer 2 protocols are not scalable, due to forwarding-table size and broadcast-based address resolution
Problems with Common Topologies • Single point of failure • Oversubscription of links higher up in the topology • Tradeoff between cost and provisioning
Capacity Mismatch [Diagram: oversubscription of roughly 5:1 from the access switches down to the racks (A), roughly 40:1 at the aggregation layer (AR/S), and roughly 200:1 toward the core routers (CR); a rough bandwidth calculation follows]
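To make these ratios concrete, here is a rough back-of-the-envelope sketch of worst-case server-to-server bandwidth under oversubscription; the 1 Gbps NIC rate is an illustrative assumption, not a figure from the slides.

```python
# Illustrative sketch: effective server-to-server bandwidth under oversubscription.
# The 1 Gbps NIC rate is an assumption for illustration; the ratios are from the slide.

NIC_GBPS = 1.0                      # assumed server NIC line rate
oversubscription = {                # approximate ratios from the slide above
    "access (ToR uplinks)": 5,
    "aggregation layer": 40,
    "core layer": 200,
}

for layer, ratio in oversubscription.items():
    worst_case = NIC_GBPS / ratio   # per-server bandwidth if all servers send at once
    print(f"{layer:>22}: {ratio:>3}:1 -> ~{worst_case * 1000:.0f} Mbps per server worst case")
```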
Data-Center Routing [Diagram: Internet connects to CR (DC-Layer 3), then AR, then S (DC-Layer 2), then racks A; each pod of ~1,000 servers == one IP subnet] • Key • CR = Core Router (L3) • AR = Access Router (L3) • S = Ethernet Switch (L2) • A = Rack of app. servers
Reminder: Layer 2 vs. Layer 3 • Ethernet switching (layer 2) • Cheaper switch equipment • Fixed addresses and auto-configuration • Seamless mobility, migration, and failover • IP routing (layer 3) • Scalability through hierarchical addressing • Efficiency through shortest-path routing • Multipath routing through equal-cost multipath • So, like in enterprises… • Data centers often connect layer-2 islands by IP routers
Need for Layer 2 • Certain monitoring apps require servers with the same role to be on the same VLAN • Using the same IP address on dual-homed servers • Allows organic growth of server farms • Migration is easier
Review of Layer 2 & Layer 3 • Layer 2 • One spanning tree for entire network • Prevents loops • Ignores alternate paths • Layer 3 • Shortest path routing between source and destination • Best-effort delivery
Fat-Tree-Based Solution • Connect end hosts together using a fat-tree topology • Infrastructure consists of cheap devices • Each port supports the same speed as an end host • All devices can transmit at line speed if packets are distributed over the existing paths • A k-port fat tree can support k³/4 hosts (see the sizing sketch below)
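A minimal sketch of the fat-tree sizing arithmetic from the slide above, assuming the standard k-ary fat-tree construction (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)² core switches); the helper name is hypothetical.

```python
# Hypothetical helper: sizes of a k-ary fat tree built from k-port switches.
def fat_tree_sizes(k: int) -> dict:
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,                           # k pods
        "edge_switches": k * (k // 2),       # k/2 edge switches per pod
        "aggregation_switches": k * (k // 2),
        "core_switches": (k // 2) ** 2,
        "hosts": (k ** 3) // 4,              # k^3/4 hosts, as stated on the slide
    }

print(fat_tree_sizes(4))    # tiny example: 4 pods, 16 hosts
print(fat_tree_sizes(48))   # 27,648 hosts with commodity 48-port switches
```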
Problems with a Vanilla Fat-tree • Layer 3 will only use one of the existing equal cost paths • Packet re-ordering occurs if layer 3 blindly takes advantage of path diversity
Modified Fat Tree • Enforce a special addressing scheme in the DC • Allows hosts attached to the same switch to route only through that switch • Allows intra-pod traffic to stay within the pod • unused.PodNumber.switchnumber.Endhost • Use two-level lookups to distribute traffic and maintain packet ordering
Two-Level Lookups • First level is a prefix lookup • Used to route down the topology to the end host • Second level is a suffix lookup • Used to route up towards the core • Diffuses and spreads out traffic • Maintains packet ordering by using the same ports for the same end host (a toy sketch follows)
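A toy illustration of the two-level lookup idea, assuming the pod.switch.host addressing scheme from the previous slide; the table contents and the lookup helper are hypothetical, not the paper's exact data structures.

```python
# Toy two-level routing table: try prefix entries first (route "down" toward a
# known switch in this pod), otherwise fall back to a suffix match on the host
# byte (route "up" toward the core, spreading traffic by destination host ID).
import ipaddress

prefix_table = {                      # hypothetical entries for one pod switch
    "10.2.0.0/24": "port 0",          # down to edge switch 0 in this pod
    "10.2.1.0/24": "port 1",          # down to edge switch 1 in this pod
}
suffix_table = {                      # suffix match on the host byte
    2: "port 2",                      # hosts ending in .2 go up via port 2
    3: "port 3",                      # hosts ending in .3 go up via port 3
}

def lookup(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    for prefix, port in prefix_table.items():
        if addr in ipaddress.ip_network(prefix):
            return port                       # first level: prefix match, route down
    host_byte = int(dst.split(".")[-1])
    return suffix_table[host_byte]            # second level: suffix match, route up

print(lookup("10.2.1.3"))   # intra-pod destination: routed down via port 1
print(lookup("10.4.1.2"))   # inter-pod destination: routed up via port 2
```

Every packet to the same end host leaves through the same port, which is how ordering is preserved while traffic to different hosts is diffused across the uplinks.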
Diffusion Optimizations • Flow classification • Eliminates local congestion • Assigns traffic to ports on a per-flow basis instead of a per-host basis (see the sketch below) • Flow scheduling • Eliminates global congestion • Prevents long-lived flows from sharing the same links • Assigns long-lived flows to different links
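A minimal sketch of per-flow (rather than per-host) port assignment, hashing the flow five-tuple over the available uplinks; this is a simplification under stated assumptions and omits the paper's periodic reassignment of flows to lightly loaded ports.

```python
# Minimal per-flow port assignment: all packets of one flow hash to the same
# uplink (preserving ordering), while different flows spread across the links.
import hashlib

UPLINKS = ["uplink0", "uplink1", "uplink2", "uplink3"]   # assumed 4 equal-cost uplinks

def classify_flow(src_ip, dst_ip, src_port, dst_port, proto="tcp") -> str:
    five_tuple = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}/{proto}".encode()
    digest = int(hashlib.md5(five_tuple).hexdigest(), 16)
    return UPLINKS[digest % len(UPLINKS)]

# Two flows between the same pair of hosts may take different uplinks,
# but each individual flow always uses one uplink.
print(classify_flow("10.0.1.2", "10.4.1.3", 5555, 80))
print(classify_flow("10.0.1.2", "10.4.1.3", 5556, 80))
```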
Drawbacks • No inherent support for VLAN traffic • Data center size is fixed • Connectivity to the Internet is not addressed • Wastes address space • Requires NAT at the border
Data Center Traffic Engineering Challenges and Opportunities
Wide-Area Network [Diagram: multiple data centers, each with routers and servers; clients on the Internet are directed to a data center by a DNS server via DNS-based site selection]
Wide-Area Network: Ingress Proxies [Diagram: multiple data centers, each with routers and servers; client traffic enters through ingress proxies]
Traffic Engineering Challenges • Scale • Many switches, hosts, and virtual machines • Churn • Large number of component failures • Virtual Machine (VM) migration • Traffic characteristics • High traffic volume and dense traffic matrix • Volatile, unpredictable traffic patterns • Performance requirements • Delay-sensitive applications • Resource isolation between tenants
Traffic Engineering Opportunities • Efficient network • Low propagation delay and high capacity • Specialized topology • Fat tree, Clos network, etc. • Opportunities for hierarchical addressing • Control over both network and hosts • Joint optimization of routing and server placement • Can move network functionality into the end host • Flexible movement of workload • Services replicated at multiple servers and data centers • Virtual Machine (VM) migration
PortLand: Main Idea • Key features • Layer 2 protocol based on a tree topology • PMACs encode position information • Data forwarding proceeds based on PMACs • Edge switches are responsible for the mapping between PMAC and AMAC • The fabric manager is responsible for address resolution • Edge switches make PMACs invisible to end hosts • Each switch node can identify its position by itself • The fabric manager keeps information about the overall topology and notifies affected nodes when a fault occurs [Diagram: adding a new host; transferring a packet] (a PMAC-encoding sketch follows)
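A sketch of how a position-encoding pseudo MAC (PMAC) could be packed, assuming the pod.position.port.vmid layout (16/8/8/16 bits) described in the PortLand paper; the helper names are hypothetical.

```python
# Hypothetical PMAC packing: pod(16 bits).position(8).port(8).vmid(16) -> 48-bit MAC.
def encode_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

def decode_pmac(pmac: str) -> tuple:
    value = int(pmac.replace(":", ""), 16)
    return ((value >> 32) & 0xFFFF, (value >> 24) & 0xFF,
            (value >> 16) & 0xFF, value & 0xFFFF)

pmac = encode_pmac(pod=2, position=1, port=3, vmid=7)
print(pmac)                 # 00:02:01:03:00:07
print(decode_pmac(pmac))    # (2, 1, 3, 7)
```

The edge switch would rewrite between this PMAC and the host's actual MAC (AMAC), so end hosts never see PMACs.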
Questions from the discussion board [A Scalable, Commodity Data Center Network Architecture, Mohammad Al-Fares et al., SIGCOMM '08] • Questions about the fabric manager • How do we make the fabric manager robust? • How do we build a scalable fabric manager? Redundant deployment or a clustered fabric manager could be a solution • Question about the realism of the baseline tree topology • Is the tree topology common in the real world? Yes, the multi-rooted tree has been a traditional data center topology
Discussion points • Is PortLand applicable to other topologies? The idea of central ARP management could be applicable; to solve the forwarding-loop problem, a TRILL-like header would be necessary. Would PMAC still be beneficial? It would require a larger forwarding table. • Questions about benefits from VM migration • VM migration helps reduce traffic going through aggregation/core switches • What about changing user requirements? • What about power consumption? • Other topics • Feasibility of mimicking PortLand with a layer 3 protocol • What about using pseudo IP addresses? • Delay to boot up the whole data center
Status Quo: Conventional DC Network [Diagram: Internet connects to CR (DC-Layer 3), then AR, then S (DC-Layer 2), then racks A; each pod of ~1,000 servers == one IP subnet] • Key • CR = Core Router (L3) • AR = Access Router (L3) • S = Ethernet Switch (L2) • A = Rack of app. servers Reference: "Data Center: Load Balancing Data Center Services", Cisco, 2004
Conventional DC Network Problems [Diagram: oversubscription of roughly 5:1 at the access layer, 40:1 at the aggregation layer, and 200:1 at the core] • Dependence on high-cost proprietary routers • Extremely limited server-to-server capacity
And More Problems … [Diagram: racks partitioned into IP subnet (VLAN) #1 and IP subnet (VLAN) #2 beneath the oversubscribed CR/AR/S hierarchy] • Resource fragmentation, significantly lowering utilization (and cost-efficiency)
And More Problems … [Same diagram, annotated with the complicated manual L2/L3 re-configuration required across IP subnets (VLANs) #1 and #2] • Resource fragmentation, significantly lowering cloud utilization (and cost-efficiency)
All We Need Is Just a Huge L2 Switch, or an Abstraction of One [Diagram: the entire CR/AR/S hierarchy replaced by the abstraction of a single giant layer-2 switch connecting all servers]
VL2 Approach • Layer 2 based, using future commodity switches • Two levels in the hierarchy: • access switches (top of rack) • load-balancing switches • Eliminate spanning tree • Flat routing • Allows the network to take advantage of path diversity • Prevent MAC address learning • 4D architecture to distribute data-plane information • ToR switches: only need to learn addresses of the intermediate switches • Core switches: only learn the ToR switch addresses • Support efficient grouping of hosts (VLAN replacement)
VL2 Components • Top-of-rack switch • Aggregates traffic from 20 end hosts in a rack • Performs IP-to-MAC translation • Intermediate switch • Disperses traffic • Balances traffic among switches • Used for Valiant load balancing • Decision element • Places routes in switches • Maintains a directory service of IP-to-MAC mappings • End host • Performs IP-to-MAC lookup
Routing in VL2 • The end host checks its flow cache for the MAC of the flow • If not found, it asks the agent to resolve it • The agent returns the MACs for the server and the MACs for the intermediate switches • Traffic is sent to the top-of-rack switch • Traffic is triple-encapsulated • Traffic is sent to an intermediate switch • Traffic is then sent to the destination's top-of-rack switch (a sending-side sketch follows)
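A rough sketch of the sending-side logic listed above: consult the flow cache, resolve via the agent if needed, then encapsulate toward a randomly chosen intermediate switch and the destination's ToR. The directory contents and names are hypothetical.

```python
# Rough sketch of VL2 sending-side logic: flow cache, agent/directory resolution,
# then encapsulation toward an intermediate switch and the destination ToR.
import random

directory = {                      # application address -> (ToR locator, intermediate locators)
    "10.0.0.9": ("tor-23", ["int-1", "int-2", "int-3"]),
}
flow_cache = {}

def send(src, dst, payload):
    if dst not in flow_cache:                    # miss: resolve via the VL2 agent/directory
        tor, intermediates = directory[dst]
        flow_cache[dst] = (tor, random.choice(intermediates))
    tor, intermediate = flow_cache[dst]
    # Wrap headers for the intermediate switch and the destination ToR around the
    # original packet (the triple encapsulation mentioned on the slide).
    return {"outer_dst": intermediate,
            "inner_dst": tor,
            "packet": {"src": src, "dst": dst, "payload": payload}}

print(send("10.0.0.5", "10.0.0.9", b"hello"))
```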
The Illusion of a Huge L2 Switch [Diagram: all servers attached to one virtual layer-2 switch] 1. L2 semantics 2. Uniform high capacity 3. Performance isolation
Objectives and Solutions • 1. Layer-2 semantics: approach is to employ flat addressing; solution is name-location separation and a resolution service • 2. Uniform high capacity between servers: approach is to guarantee bandwidth for hose-model traffic; solution is flow-based random traffic indirection (Valiant LB) • 3. Performance isolation: approach is to enforce the hose model using existing mechanisms only; solution is TCP
Name/Location Separation • Goal: cope with host churn with very little overhead • Servers use flat names; the VL2 directory service maps names to locations • Switches run link-state routing and maintain only switch-level topology • Allows use of low-cost switches • Protects the network and hosts from host-state churn • Obviates host and switch reconfiguration [Diagram: the directory maps x to ToR2, y to ToR3, z to ToR4; a sender performs a lookup and gets a response for y and z, and when z migrates the entry is updated to ToR3; packets carry the ToR locator plus the payload] (a toy sketch follows)
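A toy sketch of the name/location split: flat application addresses map to ToR locators in a directory, and only that mapping changes when a host or VM moves. The function names are hypothetical.

```python
# Toy VL2-style directory: flat application addresses (AAs) map to ToR locators (LAs).
# When a VM migrates, only this mapping changes; switches and other hosts are untouched.
directory = {"x": "ToR2", "y": "ToR3", "z": "ToR4"}

def resolve(aa: str) -> str:
    return directory[aa]                 # lookup used by the sender's agent

def migrate(aa: str, new_tor: str) -> None:
    directory[aa] = new_tor              # the only state that changes on migration

print(resolve("z"))      # ToR4
migrate("z", "ToR3")     # z moves, as in the diagram above
print(resolve("z"))      # ToR3
```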
Clos Network Topology (VL2) • Offers huge aggregate capacity and many paths at modest cost [Diagram: intermediate (Int) switches connected to K aggregation (Aggr) switches with D ports each, connected to ToR switches; each ToR serves 20 servers, for 20*(DK/4) servers in total]
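For completeness, the server-count formula from the slide, 20*(DK/4), as a one-line helper; the D and K values below are illustrative assumptions only.

```python
# Server capacity of the VL2 Clos topology from the slide: 20 * (D*K / 4),
# i.e. D*K/4 ToR switches, each hosting 20 servers. D and K below are examples.
def vl2_servers(d_ports: int, k_aggr_switches: int) -> int:
    return 20 * (d_ports * k_aggr_switches // 4)

print(vl2_servers(d_ports=144, k_aggr_switches=144))   # 103,680 servers
```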
Valiant Load Balancing: Indirection • Goal: cope with arbitrary traffic matrices with very little overhead • [ECMP + IP anycast] • Harness the huge bisection bandwidth • Obviate esoteric traffic engineering or optimization • Ensure robustness to failures • Work with switch mechanisms available today [Diagram: up-paths go through an anycast intermediate address (I_ANY) chosen by equal-cost multipath forwarding; down-paths lead to the destination ToR (T1..T6); packets to y and z bounce off different intermediates] • 1. Must spread traffic • 2. Must ensure destination independence (see the sketch below)
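A minimal sketch of the indirection step: each flow is hashed, ECMP-style, to one of the anycast intermediate switches, so the chosen up-path is independent of the destination. The switch names and the hash choice are assumptions.

```python
# Minimal VLB indirection sketch: hash each flow to one of the intermediate
# switches (standing in for ECMP over an anycast address), so every flow takes
# a randomized up-path regardless of where its destination sits.
import zlib

INTERMEDIATES = ["I1", "I2", "I3"]     # hypothetical intermediate switches behind I_ANY

def pick_intermediate(src, dst, src_port, dst_port) -> str:
    flow_key = f"{src}:{src_port}->{dst}:{dst_port}".encode()
    return INTERMEDIATES[zlib.crc32(flow_key) % len(INTERMEDIATES)]

print(pick_intermediate("10.0.0.5", "10.0.0.9", 4242, 80))
print(pick_intermediate("10.0.0.5", "10.0.0.9", 4243, 80))   # a different flow may bounce off another switch
```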
Properties of Desired Solutions • Backwards compatible with existing infrastructure • No changes to applications • Support for layer 2 (Ethernet) • Cost-effective • Low power consumption and heat emission • Cheap infrastructure • Allows host communication at line speed • Zero-configuration • No loops • Fault-tolerant