Data Center Networks CS 401/601 Computer Network Systems Mehmet Gunes Slides modified from: Mohammad Alizadeh, Albert Greenberg, Changhoon Kim, Srinivasan Seshan
What are Data Centers? • Large facilities with 10s of thousands of networked servers • Compute, storage, and networking working in concert • “Warehouse-Scale Computers” • Huge investment: ~$0.5 billion for a large datacenter
Data Center Costs • Source: “The Cost of a Cloud: Research Problems in Data Center Networks”, Greenberg, Hamilton, Maltz, Patel, SIGCOMM CCR 2009. • *3-year amortization for servers, 15-year amortization for infrastructure; 5% cost of money
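The amortization footnote above can be turned into a small worked calculation. The sketch below assumes a standard fixed-payment (annuity) formula for the "5% cost of money"; the slide does not say which formula the authors used, and the purchase prices are hypothetical placeholders, not figures from the paper.

```python
def annual_payment(price, rate, years):
    """Fixed yearly payment that repays `price` over `years` at interest `rate`.

    Assumption: standard annuity formula; the slide only states the
    amortization periods and the 5% cost of money.
    """
    if rate == 0:
        return price / years
    return price * rate / (1 - (1 + rate) ** -years)


# Hypothetical purchase prices, for illustration only.
server_price = 2_000.0           # one server, amortized over 3 years
infrastructure_price = 10_000.0  # power/cooling share, amortized over 15 years

server_monthly = annual_payment(server_price, 0.05, 3) / 12
infra_monthly = annual_payment(infrastructure_price, 0.05, 15) / 12
print(f"server: {server_monthly:.2f}/month, infrastructure: {infra_monthly:.2f}/month")
```

The short server lifetime is why servers weigh so heavily in the monthly bill: the same capital is repaid over 3 years instead of 15.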
Server Costs 30% utilization considered “good” in most data centers! • Uneven application fit • Each server has CPU, memory, disk: • most applications exhaust one resource, stranding the others • Uncertainty in demand • Demand for a new service can spike quickly • Risk management • Not having spare servers to meet demand brings failure just when success is at hand
Goal: Agility – Any service, Any Server • Turn the servers into a single large fungible pool • Dynamically expand and contract service footprint as needed • Benefits • Lower cost (higher utilization) • Increase developer productivity • Achieve high performance and reliability
Datacenter Networks • Provide the illusion of “One Big Switch” with 10,000s of ports • Connect compute and storage (Disk, Flash, …)
Datacenter Traffic Growth • Today: Petabits/s in one DC • More than core of the Internet! • Source: “Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network”, SIGCOMM 2015.
Latency is King • Traditional application: app logic and data structures on a single machine, << 1µs latency • Large-scale web application: app tier and data tier connected by the data center fabric, 10μs-1ms latency • Example: building a page for user Alice (Who does she know? What has she done?) touches many data-tier servers (friends, pics, videos, apps) • 1 user request → 1000s of messages over DC network • Microseconds of latency matter • Even at the tail (e.g., 99.9th percentile) • Based on slide by John Ousterhout (Stanford)
Datacenter Arms Race • Amazon, Google, Microsoft, Yahoo!, … race to build next-gen mega-datacenters • Industrial-scale Information Technology • 100,000+ servers • Located where land, water, fiber-optic connectivity, and cheap power are available
DC Networks • [Figure: Internet → core routers (CR) → access routers (AR) → Ethernet switches (S) → racks of app. servers (A)] • DC-Layer 3: CR and AR • DC-Layer 2: S • Key • CR = Core Router (L3) • AR = Access Router (L3) • S = Ethernet Switch (L2) • A = Rack of app. servers • ~1,000 servers/pod == IP subnet • L2 pros, cons? • L3 pros, cons? • Reference: “Data Center: Load balancing Data Center Services”, Cisco 2004
Reminder: Layer 2 vs. Layer 3 • Ethernet switching (layer 2) • Fixed IP addresses and auto-configuration (plug & play) • Seamless mobility, migration, and failover • Broadcast limits scale (ARP) • No multipath (Spanning Tree Protocol) • IP routing (layer 3) • Scalability through hierarchical addressing • Multipath routing through equal-cost multipath • Can’t migrate w/o changing IP address • Complex configuration
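The equal-cost multipath bullet above can be illustrated with a minimal sketch: hash the flow's 5-tuple and use the result to pick one of the equal-cost next hops, so every packet of a flow follows the same path while different flows spread across paths. The function name, the next-hop labels, and the use of SHA-256 are assumptions for illustration; real switches use their own hardware hash functions.

```python
import hashlib


def ecmp_next_hop(flow, next_hops):
    """Pick one equal-cost next hop by hashing the flow 5-tuple.

    `flow` is (src_ip, dst_ip, protocol, src_port, dst_port).
    SHA-256 is a stand-in for whatever hash a real switch implements.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical flow and next-hop names.
flow = ("10.0.1.5", "10.0.2.9", 6, 51512, 80)
hops = ["agg-1", "agg-2", "agg-3", "agg-4"]
print(ecmp_next_hop(flow, hops))
```

Hashing per flow rather than per packet avoids reordering within a TCP connection, at the cost of uneven load when a few large flows collide on the same path.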
Data center networks • load balancer: application-layer routing • receives external client requests • directs workload within data center • returns results to external client (hiding data center internals from client) • [Figure: Internet → border router → access router and load balancers → tier-1 switches → tier-2 switches → TOR switches → server racks 1-8]
Scaling a LAN network • Self-learning Ethernet switches work great at small scales, but buckle at larger scales • Broadcast overhead of self-learning linear in the total number of interfaces • Broadcast storms possible in non-tree topologies • Goals • Scalability to a very large number of machines • Isolation of unwanted traffic from unrelated subnets • Ability to accommodate general types of workloads (Web, database, MapReduce, scientific computing, etc.)
Data center networks • rich interconnection among switches, racks: • increased throughput between racks (multiple routing paths possible) • increased reliability via redundancy • [Figure: tier-1 switches → tier-2 switches → TOR switches → server racks 1-8]
Broad questions • How are massive numbers of commodity machines networked inside a data center? • Virtualization: How to effectively manage physical machine resources across client virtual machines? • Operational costs: • Server equipment • Power and cooling
PortLand: Location Discovery Protocol • Location Discovery Messages (LDMs) exchanged between neighboring switches • Switches self-discover location on boot up
Data Center Packet Transport • Large purpose-built DCs • Huge investment: R&D, business • Transport inside the DC • TCP rules (99.9% of traffic)
TCP in the Data Center • TCP does not meet demands of apps. • Suffers from bursty packet drops, Incast, ... • Builds up large queues: • Adds significant latency. • Wastes precious buffers, esp. bad with shallow-buffered switches. • Operators work around TCP problems • Ad-hoc, inefficient, often expensive solutions • No solid understanding of consequences, tradeoffs
Partition/Aggregate Application Structure • Example query: “Picasso” • Top-level aggregator (TLA): deadline = 250ms • Mid-level aggregators (MLA): deadline = 50ms • Worker nodes: deadline = 10ms • Each worker returns its ranked matches (e.g., 1. “Art is a lie that makes us realize the truth.” 2. “The chief enemy of creativity is good sense.” 3. …), which the aggregators merge into one ranked list • Time is money • Strict deadlines (SLAs) • Missed deadline → lower quality result
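The effect of a missed deadline can be sketched with a small deterministic simulation: an aggregator keeps only the worker responses that arrive within its deadline and returns a partial answer otherwise. The function name and the sample latencies are hypothetical; only the 10ms worker deadline comes from the slide.

```python
def aggregate(responses, deadline_ms):
    """Merge worker responses that arrived within the deadline.

    `responses` is a list of (latency_ms, result) pairs.
    Returns (results_used, fraction_of_workers_included).
    """
    on_time = [result for latency, result in responses if latency <= deadline_ms]
    quality = len(on_time) / len(responses) if responses else 0.0
    return on_time, quality


# Hypothetical worker latencies; the last worker hit a congested queue.
responses = [(4, "result-A"), (7, "result-B"), (9, "result-C"), (38, "result-D")]
results, quality = aggregate(responses, deadline_ms=10)
print(results, quality)
```

A single slow worker does not delay the answer, but its results are dropped, which is why tail latency in the network translates directly into lower result quality.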
Generality of Partition/Aggregate • The foundation for many large-scale web applications. • Web search, Social network composition, Ad selection, etc. • Example: Facebook • Partition/Aggregate ~ Multiget • Aggregators: Web Servers • Workers: Memcached Servers • [Figure: Internet → Web Servers → (Memcached protocol) → Memcached Servers]
Workloads • Partition/Aggregate (Query): delay-sensitive • Short messages [50KB-1MB] (Coordination, Control state): delay-sensitive • Large flows [1MB-50MB] (Data update): throughput-sensitive
Tension Between Requirements • Requirements: High Throughput, Low Latency, High Burst Tolerance • Deep buffers: queuing delays increase latency • Shallow buffers: bad for bursts & throughput • Reduced RTOmin: doesn’t help latency • AQM (RED): average queue not fast enough for Incast • Objective: Low Queue Occupancy & High Throughput
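Review: The TCP/ECN Control Loop • ECN = Explicit Congestion Notification • [Figure: Sender 1 and Sender 2 send through a shared switch to the Receiver; the switch sets an ECN mark (1 bit) on packets when it is congested]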
Two Key Ideas • React in proportion to the extent of congestion, not its presence • Reduces variance in sending rates, lowering queuing requirements • Mark based on instantaneous queue length. • Fast feedback to better deal with bursts.
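The two ideas above can be sketched in a few lines. The update rules below follow the published DCTCP algorithm as commonly described: the switch marks a packet when the instantaneous queue exceeds a threshold K, the sender keeps a running estimate alpha of the fraction of marked packets, and it cuts its window by alpha/2 instead of by half. The function names and the gain g = 1/16 are assumptions for illustration, not taken from the slides.

```python
def should_mark(queue_len_bytes, k_bytes):
    """Switch side: mark on instantaneous queue length, not an average."""
    return queue_len_bytes > k_bytes


def update_alpha(alpha, marked_pkts, total_pkts, g=1 / 16):
    """Sender side: moving average of the fraction of ECN-marked packets.

    Called once per window of data; g is the averaging gain (assumed 1/16).
    """
    fraction = marked_pkts / total_pkts if total_pkts else 0.0
    return (1 - g) * alpha + g * fraction


def dctcp_cwnd(cwnd, alpha):
    """React in proportion to the extent of congestion."""
    return max(1.0, cwnd * (1 - alpha / 2))


def tcp_ecn_cwnd(cwnd):
    """Classic TCP with ECN reacts to the presence of congestion: halve."""
    return max(1.0, cwnd / 2)


# One window in which 2 of 20 packets were marked, starting from alpha = 0.
alpha = update_alpha(0.0, marked_pkts=2, total_pkts=20)
print(dctcp_cwnd(100.0, alpha), tcp_ecn_cwnd(100.0))
```

With light marking the DCTCP window barely shrinks, while sustained marking drives alpha toward 1 and the cut toward the classic halving. That is the source of the lower variance in sending rates.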
DCTCP in Action • [Figure: plot with axis in KBytes] • Setup: Win 7, Broadcom 1Gbps switch • Scenario: 2 long-lived flows, marking threshold K = 30KB
Why it Works • 1. High Burst Tolerance • Large buffer headroom → bursts fit • Aggressive marking → sources react before packets are dropped • 2. Low Latency • Small buffer occupancies → low queuing delay • 3. High Throughput • ECN averaging → smooth rate adjustments, low variance
Current solutions for increasing data center network bandwidth • FatTree, BCube • Drawbacks: 1. Hard to construct 2. Hard to expand
Fat-Tree • Inter-connect racks (of servers) using a fat-tree topology • Fat-Tree: a special type of Clos network (after C. Clos) • K-ary fat tree: three-layer topology (edge, aggregation and core) • each pod consists of (k/2)^2 servers & 2 layers of k/2 k-port switches • each edge switch connects to k/2 servers & k/2 aggr. switches • each aggr. switch connects to k/2 edge & k/2 core switches • (k/2)^2 core switches: each connects to k pods
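The counts in the bullets above follow directly from the port budget of a k-port switch. A minimal sketch that derives them for a given k (the function and key names are hypothetical):

```python
def fat_tree_sizes(k):
    """Element counts for a k-ary fat tree built from k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even number >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches_per_pod": half,
        "aggr_switches_per_pod": half,
        "servers_per_pod": half * half,   # k/2 edge switches x k/2 servers each
        "core_switches": half * half,     # (k/2)^2, each with one link to every pod
        "total_servers": k * half * half, # k pods x (k/2)^2 servers = k^3/4
    }


for k in (4, 6, 48):
    sizes = fat_tree_sizes(k)
    print(k, sizes["total_servers"], sizes["core_switches"])
```

Because the server count grows with the cube of the port count, commodity 48-port switches are already enough for tens of thousands of hosts.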
Fat-Tree • [Figure: fat-tree with K=4]
Why Fat-Tree? • Fat tree has identical bandwidth at any bisection • Each layer has the same aggregated bandwidth • Can be built using cheap devices with uniform capacity • Each port supports same speed as end host • All devices can transmit at line speed if packets are distributed uniformly along available paths • Great scalability: k-port switch supports k^3/4 servers • [Figure: fat tree network with K = 6 supporting 54 hosts]