Optimizing Cloud Data Centers with VL2 Control System

CS434/534: Topics in Network Systems • Cloud Data Centers: VL2 Control; VLB/ECMP Load Balancing Routing • Yang (Richard) Yang, Computer Science Department, Yale University, 208A Watson • Email: yry@cs.yale.edu • http://zoo.cs.yale.edu/classes/cs434/ • Acknowledgement: slides include content from classes by M. Alizadeh and from the Presto authors.

Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Overview • Topology • Control • layer 2 semantics • ECMP/VLB load balancing/performance isolation • Extension: Presto

Admin • PS1 status • Please set up meetings on potential projects

Recap: Data Centers • Largest cost component of data center (DC) is servers, but utilization of servers is often low • Goal of a DC infrastructure: agility • Turn the servers into a single large fungible pool • Dynamically expand and contract service footprint as needed

Recap: Problems of Conventional DC • [Figure: Internet at the top; core routers (CR) and access routers (AR) in DC-Layer 3; Ethernet switches (S) and racks of app servers (A) in DC-Layer 2; ~1,000 servers/pod == IP subnet] • Key: CR = Core Router (L3); AR = Access Router (L3); S = Ethernet Switch (L2); A = Rack of app. servers • Heterogeneous server-to-server capacity • Partition by IP subnet limits agility • Poor reliability

Recap: Objectives of VL2 • Layer-2 semantics: • Easily assign any server to any service • Assigning servers to service should be independent of network topology • Configure server with whatever IP address the service expects • VM keeps the same IP address even after migration • Uniform high capacity: • Maximum rate of server to server traffic flow should be limited only by capacity of network cards • Performance isolation: • Traffic of one service should not be affected by traffic of other services (need the above bound)

Recap: Generic K-ary Fat Tree Topo • Motivated by non-blocking Clos networks • K-ary fat tree: three-layer topology (edge, aggregation and core) • K^3/4 servers • Same number of links, K^3/4, between each two adjacent layers (Core-Aggr, Aggr-Edge, Edge-Serv) • http://www.cs.cornell.edu/courses/cs5413/2014fa/lectures/08-fattree.pdf
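The fat-tree counts above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the standard construction (K pods, each with K/2 edge and K/2 aggregation switches, plus (K/2)^2 core switches); the function name `fat_tree_sizes` and the dictionary keys are made up for illustration.

```python
def fat_tree_sizes(k: int) -> dict:
    """Element counts for a k-ary fat tree built from k-port switches.

    Assumes the standard construction: k pods, each with k/2 edge and
    k/2 aggregation switches, and (k/2)^2 core switches.
    """
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive even port count")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,
        "aggr_switches": k * half,
        "core_switches": half * half,
        # Each edge switch serves k/2 servers: k * (k/2) * (k/2) = k^3/4.
        "servers": k * half * half,
        # Core-Aggr, Aggr-Edge and Edge-Server each carry k^3/4 links.
        "links_per_layer": k * half * half,
    }


if __name__ == "__main__":
    for ports in (4, 8, 48):
        print(ports, fat_tree_sizes(ports))
```

With 48-port switches the same formula gives 27,648 servers, which shows how quickly the topology scales with the switch port count.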

Recap: VL2 Topology • Assume each Int switch has D_I ports and each Aggr switch has D_A ports • D_A/2 Int switches • D_I Aggr switches • D_A·D_I/2 links between Int and Aggr, and D_A·D_I/2 links between Aggr and ToR • D_I·D_A/4 ToR switches, each with 20 servers • 20·(D_I·D_A/4) servers in total • Q: Why not the same number (as in the fat tree)?
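The VL2 counts on this slide follow from the two port counts. The sketch below reproduces them; the function name `vl2_sizes` and the key names are hypothetical, and the link counts assume every Int port connects to an Aggr switch and every Aggr switch uses half of its ports toward the ToR layer.

```python
def vl2_sizes(d_i: int, d_a: int, servers_per_tor: int = 20) -> dict:
    """Element counts for a VL2 topology.

    d_i: ports per Int switch; d_a: ports per Aggr switch.
    Assumption: each Aggr switch uses d_a/2 ports up (one per Int switch)
    and d_a/2 ports down toward ToR switches.
    """
    if d_a % 2 != 0 or (d_a * d_i) % 4 != 0:
        raise ValueError("d_a must be even and d_a * d_i divisible by 4")
    tors = d_a * d_i // 4
    return {
        "int_switches": d_a // 2,
        "aggr_switches": d_i,
        "tor_switches": tors,
        "servers": servers_per_tor * tors,
        "int_aggr_links": d_a * d_i // 2,
        "aggr_tor_links": d_a * d_i // 2,
    }


if __name__ == "__main__":
    print(vl2_sizes(8, 8))      # the 8-port example used later in the slides
    print(vl2_sizes(144, 144))  # a larger, hypothetical switch size
```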

Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Overview • Topology • Control

FatTree Topology is Great, But… • Is using a fat-tree topology to interconnect racks of servers sufficient in itself, so that we can use any control plane? • How about traditional layer 2 switching (ARP + learning)? • Host churn and ARP flooding; spanning tree removes most of the capacity! • How about traditional layer 3 IP routing? • Shortest-path routing to each server • Constructing a path for each server as a destination needs large flow tables • Assume 10 million virtual endpoints on 500,000 servers in a data center => 10 million entries, but a typical switch has only 640K of switch memory, enough for 32-64K flow entries • Aggregation reduces flow table size • But then a VM cannot move easily, because its address becomes a locator

VL2 Solution to Addressing and Routing: Name-Location Separation • Whole network as an L2 domain without a scaling bottleneck • Servers use flat names • Routing uses locator (ToR address) • VL2 Directory Service maps names to locators: x → ToR2, y → ToR3, z → ToR4 (a second view shows x → ToR2, y → ToR3, z → ToR3) • The sender does Lookup & Response for y, z, then encapsulates: [ToR3 | y | payload] and [ToR4 | z | payload] ([ToR3 | z | payload] in the second view) • [Figure: ToR1 to ToR4 with servers x, y, z]
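The name-location separation on this slide can be sketched as a lookup followed by encapsulation. This is a toy model, not the VL2 implementation: the `Directory` class, its methods, and the dictionary-based packet are all hypothetical names chosen for illustration.

```python
class Directory:
    """Toy directory service: flat server name -> locator (ToR address)."""

    def __init__(self) -> None:
        self._locator = {}

    def update(self, name: str, tor: str) -> None:
        # Called when a server is placed or migrates to another rack.
        self._locator[name] = tor

    def lookup(self, name: str) -> str:
        return self._locator[name]


def encapsulate(directory: Directory, dst_name: str, payload: bytes) -> dict:
    """Wrap a packet addressed to a flat name with the ToR locator."""
    return {
        "outer_dst": directory.lookup(dst_name),  # used for routing
        "inner_dst": dst_name,                    # unchanged flat name
        "payload": payload,
    }


if __name__ == "__main__":
    ds = Directory()
    for name, tor in (("x", "ToR2"), ("y", "ToR3"), ("z", "ToR4")):
        ds.update(name, tor)
    print(encapsulate(ds, "z", b"hello"))
    ds.update("z", "ToR3")  # z moves; its name stays the same
    print(encapsulate(ds, "z", b"hello"))
```

Only the directory entry changes when z moves, so senders keep addressing z by the same name.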

Discussion • Requirements on the Directory System? • What is a possible design? • [Figure: Directory Service tables, x → ToR2, y → ToR3, z → ToR4 and x → ToR2, y → ToR3, z → ToR3]

VL2 Directory System • Write-optimized Replicated State Machines (RSM) using Paxos for reliable updates • Read-optimized Directory Servers (DS) for fast lookups • Lookup path: 1. Agent sends Lookup to DS; 2. DS replies • Update path: 1. Agent sends Update to a DS; 2. DS issues Set to the RSM servers; 3. RSM servers replicate; 4. RSM acks the DS; 5. DS acks the agent; (6. Disseminate to the other DSs) • Q: Stale mappings?

Routing Design Option • [Figure: Int switches I0, I1, I2; ToR switches T1 to T6; x sends to y and z; header fields shown: T3, T5, y, z, payload] • Remaining issue: What are the path(s) for each srcToR/dstToR pair?

Example • Assume 8-port switches; each link is 10G • [Figure: Int and Aggr layers] • Q: What is an example routing that can lead to contention (no isolation)?

Example • Assume 8-port switches; each link is 10G • [Figure: Int and Aggr layers, with labels 6 and 8] • Objective: spread traffic so that no such contention can happen, as long as each host is bounded by its interface card rate.

Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Overview • Topology • Control • layer 2 semantics • VLB/ECMP load balancing/performance isolation

Offline: Traditional Valiant Load Balancing for Hose Model http://duda.imag.fr/3at/valiant-2pages.pdf

Valiant Load Balancing Intuition in VL2 Setting (Aggr-Int) • [Figure: Aggr switch a and Int switch i, with labels 6 and 8] • Alg: spread traffic uniformly to the Int switches • Q: Effect on the example? • Q: Bound (assume D_I = D_A = 8): • a -> i traffic • ¼ of total a upstream traffic (<= 10G) • i -> a traffic • ¼ of total traffic going down to a (<= 10G)
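The ¼ bound can be reproduced with simple arithmetic. The sketch below assumes, as in the VL2 topology recap, that there are D_A/2 Int switches and that an Aggr switch has D_A/2 ToR-facing ports, each limited to the link rate; the function name is hypothetical.

```python
def vlb_load_per_int_gbps(d_a: int, link_gbps: float) -> float:
    """Worst-case a -> i load when Aggr switch a spreads uniformly.

    Assumptions: d_a/2 Int switches, d_a/2 ToR-facing ports on a, and
    every ToR-facing port sending upstream at the full link rate.
    """
    n_int = d_a // 2
    tor_facing_ports = d_a // 2
    max_upstream = tor_facing_ports * link_gbps  # total a upstream traffic
    return max_upstream / n_int                  # uniform share per Int


if __name__ == "__main__":
    share = vlb_load_per_int_gbps(d_a=8, link_gbps=10.0)
    print(f"a -> i carries at most {share} Gbps on a 10 Gbps link")
```

With D_A = 8 there are 4 Int switches, so each a -> i link carries one quarter of at most 40G, which fits within the 10G link. The same argument applies in the i -> a direction.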

VLB Realization 1 • [Figure: Int switches I0, I1, I2; ToR switches T1 to T6; x sends to y and z; header fields shown: I0, T5, T3, y, z, payload] • The end host picks a random Int switch (e.g., I0) and encapsulates • The network uses ECMP routing; Int switches and ToR switches decapsulate • How may ECMP use multiple paths? • Q: all upstream paths and downstream paths?

VLB Realization 1: Problem • [Figure: Int switches I0, I1, I2; ToR switches T1 to T6; x sends to y and z; header fields shown: I0, T5, T3, y, z, payload] • Problem: Need to update each host if an Int switch changes state.

Final VLB Realization • VL2: all Int switches are assigned the same anycast address (IANY) • [Figure: three Int switches, all labeled IANY; ToR switches T1 to T6; x sends to y and z; header fields shown: IANY, T5, T3, y, z, payload] • Q: all upstream paths and downstream paths?

Offline Thinking • [Figure: Int and Aggr layers] • What is a bad setting if there is no second encap?

Implementation Question • [Figure: three Int switches, all labeled IANY; ToR switches T1 to T6; x sends to y and z; header fields shown: IANY, T5, T3, y, z, payload] • What are the tables and actions at each network node?

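VL2 Agent in Action • The VL2 agent double-encapsulates each packet (outermost header first): • Outer header: dst IP = Int LA (10.1.1.1), src IP = H(ft) • Middle header: dst IP = dst ToR LA (20.0.0.1), src IP = H(ft) • Inner packet: dst AA (10.0.0.6), src AA (10.0.0.4), payload • Figure labels: VLB, ECMP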
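The double encapsulation performed by the agent can be sketched as nested headers. This is an illustrative model only: the dictionary layout, the function names, and the use of SHA-256 as H(ft) are assumptions, and reading 10.0.0.4 as the source AA and 10.0.0.6 as the destination AA is an interpretation of the slide's figure.

```python
import hashlib

INT_ANYCAST_LA = "10.1.1.1"  # anycast address shared by all Int switches


def flow_hash(five_tuple: tuple) -> int:
    """H(ft): a stable 32-bit hash of the flow 5-tuple (SHA-256 assumed)."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def agent_encapsulate(five_tuple: tuple, dst_tor_la: str, payload: bytes) -> dict:
    """Build outer (Int), middle (dst ToR) and inner (AA) headers.

    five_tuple is (src AA, dst AA, protocol, src port, dst port).
    H(ft) goes in the source field of both added headers so that ECMP
    hashing in the network sees per-flow entropy.
    """
    h = flow_hash(five_tuple)
    inner = {"src": five_tuple[0], "dst": five_tuple[1], "payload": payload}
    middle = {"src": h, "dst": dst_tor_la, "inner": inner}
    return {"src": h, "dst": INT_ANYCAST_LA, "inner": middle}


if __name__ == "__main__":
    flow = ("10.0.0.4", "10.0.0.6", 6, 5000, 80)
    print(agent_encapsulate(flow, "20.0.0.1", b"data"))
```

Because the hash depends only on the 5-tuple, all packets of one flow carry the same outer source field and therefore follow the same path.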

Question to Think Offline • “In 3.2, the paper states that randomizing large flows won't cause much perpetual congestion if misplaced since large flows are only 100 MB and thus take 1 second to transmit on a 1 Gbps link. Isn't 1 second sufficiently high to harm the isolation that VL2 tries to provide?”

Summary: VL2 Objectives and Solutions • Objective 1, Layer-2 semantics: flat addressing; name-location separation & resolution service • Objective 2, Uniform high capacity between servers: multi-root tree topology • Objective 3, Performance isolation: flow-based random traffic indirection (Valiant LB)

Evaluation • Uniform high capacity: • All-to-all data shuffle stress test: • 75 servers, deliver 500MB • Maximal achievable goodput is 62.3 Gbps • VL2 achieves 58.8 Gbps, a network efficiency of 58.8/62.3 = 94%

Evaluation • Performance isolation: • Two types of services: • Service one: 18 servers do a single TCP transfer all the time • Service two: 19 servers start an 8GB transfer over TCP every 2 seconds • Service two: 19 servers burst short TCP connections

Critique • Extra servers are needed to support the VL2 directory system • Adds device cost • All links and switches are working all the time, which is not power efficient • Effectiveness of isolation (load balancing) through VLB/ECMP randomization depends on the traffic model

Randomization and Load Balancing: Intuition • Load balancing vs item sizes • Workload: 11×1Gbps flows (55% load) • 20×1Gbps uplinks: Prob of 100% throughput = 3.27% • 2×10Gbps uplinks: Prob of 100% throughput = 99.95%
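The 3.27% figure for the 20×1Gbps case can be reproduced as a birthday-style calculation. The sketch below assumes each flow is placed on an uplink independently and uniformly at random, and that a 1Gbps flow fills a 1Gbps uplink, so 100% throughput requires all flows to land on distinct uplinks. The function name is hypothetical.

```python
from fractions import Fraction


def prob_all_distinct(n_flows: int, n_links: int) -> Fraction:
    """Probability that n_flows random placements hit n_flows distinct links.

    Assumes independent, uniform placement and that one flow saturates
    one link, so any two flows sharing a link lose throughput.
    """
    p = Fraction(1)
    for already_used in range(n_flows):
        # The next flow must avoid the links already taken.
        free = max(n_links - already_used, 0)
        p *= Fraction(free, n_links)
    return p


if __name__ == "__main__":
    p = prob_all_distinct(n_flows=11, n_links=20)
    print(f"{float(p) * 100:.2f}%")
```

The product is (20/20)·(19/20)·…·(10/20), which is why many small links perform so much worse under randomization than a few large ones at the same 55% load.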

Randomization and Load Balancing: In Implementation • VL2 realizes randomization through an ECMP hash of the flow 5-tuple • Flows collide on a link when their hashes map to the same next hop, e.g., H(f) % 3 = 0 for more than one flow
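A hash-based next-hop choice of the form H(f) % n can be sketched as follows. Real switches use vendor-specific hash functions; SHA-256 here is only a stand-in, and the flow tuples and function names are made up for illustration.

```python
import hashlib


def ecmp_next_hop(five_tuple: tuple, n_paths: int) -> int:
    """Pick a next-hop index as H(f) % n_paths (SHA-256 as a stand-in hash)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_paths


def link_loads(flows: list, n_paths: int) -> list:
    """Sum the offered rate (Gbps) that lands on each next hop."""
    loads = [0.0] * n_paths
    for five_tuple, rate_gbps in flows:
        loads[ecmp_next_hop(five_tuple, n_paths)] += rate_gbps
    return loads


if __name__ == "__main__":
    # Four hypothetical 1 Gbps flows hashed onto three equal-cost paths.
    flows = [(("10.0.0.4", "10.0.0.6", 6, 5000 + i, 80), 1.0) for i in range(4)]
    print(link_loads(flows, n_paths=3))
```

With four flows and three paths at least two flows must share a path regardless of the hash, and with long-lived elephant flows such a collision persists for the lifetime of the flows.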

Discussion • When may randomization lb perform badly? • How to reduce/avoid bad lb?

Outline • Admin and recap • Cloud data center (CDC) networks • Background, high-level goal • Traditional CDC vs the one-big switch abstraction • VL2 design and implementation • Overview • Topology • Control • layer 2 semantics • ECMP/VLB load balancing/performance isolation • Extension: Presto

Presto in Context • ECMP: Per-flow lb • Elephant collisions • Per-packet • High computational overhead • Heavy reordering including mice flows • Flowlets • Burst of packets separated by inactivity timer • Effectiveness depends on workloads • Small inactivity timer: a lot of reordering; mice flows fragmented • Large inactivity timer: large flowlets (hash collisions)

Presto LB Granularity: Flowcells • What is flowcell? • A set of TCP segments with bounded byte count • How to choose flowcell size? • Implementation feasibility • TCP Segmentation Offload (TSO) size • Maximize the benefit of TSO for high speed • 64KB in implementation

Intro to TSO • TCP/IP hands a large segment to the NIC; the NIC performs segmentation & checksum offload and emits MTU-sized Ethernet frames • TSO is important for software: w/o TSO, a host incurs 100% utilization of one CPU core and can only achieve around 5.5 Gbps • Example: TCP segments of 25KB, 30KB, 30KB; the first two form a 55KB flowcell, and the third starts a new flowcell
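Grouping TCP segments into flowcells with a bounded byte count can be sketched with a running counter. This is a minimal illustration under the 64KB bound mentioned in the slides; the function name and the choice to start a new flowcell when the next segment would exceed the bound are assumptions, not a description of the Presto source code.

```python
FLOWCELL_LIMIT_BYTES = 64 * 1024  # 64KB bound, matching the maximum TSO size


def assign_flowcells(segment_sizes: list, limit: int = FLOWCELL_LIMIT_BYTES) -> list:
    """Return a flowcell ID (starting at 1) for each TCP segment, in order.

    A segment joins the current flowcell if it fits under the byte limit;
    otherwise it starts a new flowcell. Segments are never split.
    """
    ids = []
    cell_id, used = 1, 0
    for size in segment_sizes:
        if used > 0 and used + size > limit:
            cell_id += 1  # next flowcell, possibly sent on another path
            used = 0
        used += size
        ids.append(cell_id)
    return ids


if __name__ == "__main__":
    kb = 1024
    print(assign_flowcells([25 * kb, 30 * kb, 30 * kb]))  # [1, 1, 2]
    print(assign_flowcells([50 * kb, 60 * kb]))           # [1, 2]
```

The first call mirrors the 25KB, 30KB, 30KB example: the first two segments total 55KB and share a flowcell, while the third would push the count past 64KB and so opens a new one.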

Presto at a High Level • Network (spine and leaf switches): set up multiple paths • Sender: breaks data into flowcells • Receiver: masks packet reordering due to multipathing below the transport layer • [Figure: two hosts, each with TCP/IP, vSwitch and NIC, connected through leaf and spine switches]

Presto Sender • Controller installs label-switched paths • [Figure: Host A and Host B, each with TCP/IP, vSwitch and NIC, connected through leaf and spine switches]

Presto Sender • vSwitch receives TCP segment #1 (50KB) • Flowcell #1: vSwitch encodes flowcell ID, rewrites label • NIC uses TSO and chunks segment #1 into MTU-sized packets • [Figure: Host A and Host B, each with TCP/IP, vSwitch and NIC, connected through leaf and spine switches]

Presto Sender • vSwitch receives TCP segment #2 (60KB) • Flowcell #2: vSwitch encodes flowcell ID, rewrites label • NIC uses TSO and chunks segment #2 into MTU-sized packets • [Figure: Host A and Host B, each with TCP/IP, vSwitch and NIC, connected through leaf and spine switches]

Benefits • Most flows smaller than 64KB [Benson, IMC’11] • the majority of mice are not exposed to reordering • Most bytes from elephants [Alizadeh, SIGCOMM’10] • traffic routed on uniform sizes • Fine-grained and deterministic scheduling over disjoint paths • near optimal load balancing

Discussion • [Figure: three Int switches, all labeled IANY; ToR switches T1 to T6; x sends to y and z; header fields shown: IANY, T5, T3, y, z, payload] • Is it possible for Presto to still send too much traffic on a link?

Backup Slides

Presto Receiver • Major challenges • Packet reordering for large flows due to multipath • Distinguish loss from reordering • Fast (10G and beyond) • Light-weight

Intro to GRO • Generic Receive Offload (GRO) • The reverse process of TSO

Intro to GRO • [Figure: GRO sits in the OS between the NIC hardware and TCP/IP; MTU-sized packets P1 to P5 queue at the NIC, and GRO merges them, starting from the queue head, before passing them up to TCP/IP]
