High Performance Routing. Nick McKeown, Assistant Professor of Electrical Engineering and Computer Science, Stanford University; Abrizio/PMC-Sierra Inc. nickm@stanford.edu http://www.stanford.edu/~nickm 1
Outline • Outline • Review: What is a Router? • The Evolution of Routers • Single-stage switching: The Fork-Join Router 2
Outline • Switching is the bottleneck in a router. • The trend has been to overcome limitations in memory bandwidth: • Shared memory -> Single-stage, crossbar-based, combined input and output queued (CIOQ). • …and reduce power per-rack & per-system: • Single box systems -> Multi-rack systems (LCS). 3
Outline (2) • What comes next? • Multistage switches solve the wrong problem: • N^2 is not the problem. • Multistage switches are more blocking, more power-hungry and less predictable. • Parallel single-stage switches (e.g. the Fork-Join Router) are non-blocking, use less power, can achieve equally high capacity, and can be predictable. 4
Basic Architectural Components • Control Plane: Routing Protocols, Routing Table, Admission Control, Reservation. • Datapath (per-packet processing): Packet Classification, Policing & Access Control, Forwarding Table, Switching, Output Scheduling. • Datapath stages: 1. Ingress 2. Interconnect 3. Egress 6
Basic Architectural Components: Datapath (per-packet processing) • 1. Ingress: each input shown makes a Forwarding Decision using a Forwarding Table, a Classifier Table, and Policing & Access Control. Limitation: Memory b/w. • 2. Interconnect. Limitation: Interconnect b/w, Power & Arbitration. • 3. Egress. Limitation: Memory b/w. 7
First Generation Routers • Components: CPU, Route Table, Buffer Memory, and Line Interfaces (each with a MAC) on a Shared Backplane. • Fixed length “DMA” blocks or cells. Reassembled on egress linecard. • Fixed length cells or variable length packets. • Typically <0.5Gb/s aggregate capacity 9
First Generation Routers: Queueing Structure: Shared Memory • Inputs 1 to N and outputs 1 to N share a large, single, dynamically allocated memory buffer: N writes per “cell” time, N reads per “cell” time. • Limited by memory bandwidth. • A large body of work has proven and made possible: Fairness, Delay Guarantees, Delay Variation Control, Loss Guarantees, Statistical Guarantees. 10
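The "N writes plus N reads per cell time" limit can be turned into a rough number. The sketch below assumes every one of the N ports runs at the same line rate R, so the shared buffer must sustain 2·N·R; the function names and the example port counts and rates are illustrative and not taken from the slides.

```python
def shared_memory_bandwidth_gbps(num_ports: int, line_rate_gbps: float) -> float:
    """Bandwidth one shared buffer must sustain, in Gb/s.

    Per cell time the memory absorbs up to N writes (one per input)
    and serves up to N reads (one per output): 2 * N * R in total.
    """
    return 2 * num_ports * line_rate_gbps


def max_ports(memory_bandwidth_gbps: float, line_rate_gbps: float) -> int:
    """Largest N whose 2 * N * R fits within the given memory bandwidth."""
    return int(memory_bandwidth_gbps // (2 * line_rate_gbps))


if __name__ == "__main__":
    # Hypothetical example: 16 ports at 2.5 Gb/s each.
    print(shared_memory_bandwidth_gbps(16, 2.5))
    # Hypothetical example: ports supported by a 51.2 Gb/s memory at 2.5 Gb/s.
    print(max_ports(51.2, 2.5))
```

The required bandwidth grows linearly with both the port count and the line rate, which is why the shared-memory design runs out of headroom first.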
Second Generation Routers • CPU card (Slow Path): CPU, Route Table, Buffer Memory. • Line Cards: each with Buffer Memory, a Fwding Cache, and a MAC. • Functions labeled in the figure: Drop Policy or Backpressure, Drop Policy, Output Link Scheduling. • Typically <5Gb/s aggregate capacity 11
Second Generation Routers: As caching became ineffective • CPU card (Exception Processor): CPU, Route Table. • Line Cards: each with Buffer Memory, a Fwding Table, and a MAC. 12
Second Generation Routers: Queueing Structure: Combined Input and Output Queueing • Inputs and outputs are connected by a Bus. • 1 write per “cell” time. • 1 read per “cell” time. • Rate of writes/reads determined by bus speed. 13
Third Generation Routers • Switched Backplane connecting Line Cards and a CPU Card. • Line Card: Line Interface, Local Buffer Memory, Fwding Table, MAC. • CPU Card: CPU, Routing Table, Memory. • Typically <50Gb/s aggregate capacity 14
Third Generation Routers: Queueing Structure • Inputs and outputs are connected by a Switch. • 1 write per “cell” time. • 1 read per “cell” time. • Rate of writes/reads determined by switch fabric speedup. 15
Third Generation Routers • Size-constrained: rack 19” or 23” wide, 7’ tall. • Power-constrained: roughly <6kW. Supply: 100A/200A maximum at 48V. • QoS unfriendly: input congestion. 16
Fourth Generation Routers/Switches: The LCS Protocol • Linecards connect to the Switch Core over optical links of up to 2km. 17
Fourth Generation Routers/Switches: The LCS Protocol • What is LCS? • Credit-based flow control: enables separation. • Label-based multicast: enables scaling. • Its Benefits • Large Number of Ports. Separation enables a large number of ports in multiple racks. • Minimizes Switch Core Complexity and Power. Switch core can be bufferless and lossless. QoS, discard etc. performed on linecard. 18
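Credit-based flow control in general works as sketched below: the sender spends one credit per cell and stalls at zero, and the receiver hands a credit back each time it frees a slot, so nothing is ever dropped in between. This is a generic illustration only. It is not the LCS wire format, and the class names, the slot count, and the one-credit-per-cell granularity are assumptions.

```python
from collections import deque


class CreditReceiver:
    """Receiving end with a fixed number of cell slots (assumed size)."""

    def __init__(self, slots: int):
        self.slots = slots
        self.queue = deque()

    def accept(self, cell) -> None:
        # A well-behaved sender never triggers this: credits prevent overflow.
        if len(self.queue) >= self.slots:
            raise OverflowError("sender exceeded its credits")
        self.queue.append(cell)

    def drain(self) -> int:
        """Remove one cell if present; return the number of credits freed."""
        if self.queue:
            self.queue.popleft()
            return 1
        return 0


class CreditSender:
    """Sending end: may transmit only while it holds at least one credit."""

    def __init__(self, receiver: CreditReceiver):
        self.receiver = receiver
        self.credits = receiver.slots  # start with one credit per free slot

    def try_send(self, cell) -> bool:
        if self.credits == 0:
            return False  # hold the cell locally; nothing is lost
        self.credits -= 1
        self.receiver.accept(cell)
        return True

    def add_credits(self, count: int) -> None:
        self.credits += count


if __name__ == "__main__":
    rx = CreditReceiver(slots=2)
    tx = CreditSender(rx)
    print(tx.try_send("a"), tx.try_send("b"), tx.try_send("c"))
    tx.add_credits(rx.drain())  # receiver frees a slot and returns a credit
    print(tx.try_send("c"))
```

Because the sender holds cells back instead of overrunning the receiver, queueing and discard decisions stay on the sending side, which fits the slide's point that QoS and discard are performed on the linecard.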
Fourth Generation Routers/Switches: Queueing Structure • Linecard (ingress): Lookup & Drop Policy, Virtual Output Queues. • Switch Core (Bufferless): Switch Fabric, Switch Arbitration. • Linecard (egress): Output Scheduling. • 1 write per “cell” time. • 1 read per “cell” time. • Rate of writes/reads determined by switch fabric speedup. • Typically <5Tb/s aggregate capacity 19
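Virtual output queues keep one FIFO per output at each input, so a cell waiting for a busy output does not hold up cells bound for other outputs. A minimal sketch of that data structure follows; the class and method names are hypothetical, and the arbitration step that picks which request to grant is left out.

```python
from collections import deque


class VirtualOutputQueues:
    """Per-input buffer holding one FIFO for each output port."""

    def __init__(self, num_outputs: int):
        self.queues = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output: int) -> None:
        """Store a cell that the lookup stage has tagged with its output."""
        self.queues[output].append(cell)

    def requests(self) -> list:
        """Outputs this input could send to now (input to the arbiter)."""
        return [out for out, q in enumerate(self.queues) if q]

    def dequeue(self, output: int):
        """Release the head cell for an output the arbiter has granted."""
        return self.queues[output].popleft()


if __name__ == "__main__":
    voq = VirtualOutputQueues(num_outputs=3)
    voq.enqueue("a", 0)
    voq.enqueue("b", 0)
    voq.enqueue("c", 2)
    # Even if output 0 is busy, the cell for output 2 can still be served.
    print(voq.requests())
    print(voq.dequeue(2))
```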
Myths about CIOQ-based crossbar switches 1. “Input-queued crossbars have low throughput” • An input-queued crossbar can have as high throughput as any switch. 2. “Crossbars don’t support multicast traffic well” • A crossbar inherently supports multicast efficiently. 3. “Crossbars don’t scale well” • Today, it is the number of chip I/Os, not the number of crosspoints, that limits the size of a switch fabric. Expect 5Tb/s crossbar switches. 20
Myths about CIOQ-based crossbar switches (2) 4. “Crossbar switches can’t support delay/QoS guarantees” • With an internal speedup of 2, a CIOQ switch can precisely emulate a shared memory switch for all traffic. 21
What makes sense tomorrow? Single-stage (if possible): • Reduces complexity • Minimizes interconnect b/w • Minimizes power 23
Buffer Memory: How Fast Can I Make a Packet Buffer? • Setup: 5ns SRAM buffer memory with 64-byte wide buses. • Rough Estimate: • 5ns per memory operation. • Two memory operations per packet. • Therefore, maximum 51.2Gb/s. • In practice, closer to 40Gb/s. 25
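The 51.2Gb/s figure follows from the bus width and access time: a 64-byte bus moves 512 bits per 5ns operation, and each packet costs two operations (one write, one read). The sketch below reproduces that arithmetic; the function name and parameter names are illustrative.

```python
def packet_buffer_rate_gbps(bus_width_bytes: int = 64,
                            access_time_ns: float = 5.0,
                            ops_per_packet: int = 2) -> float:
    """Upper bound on packet buffer throughput in Gb/s."""
    bits_per_access = bus_width_bytes * 8
    # Bits per nanosecond is numerically the same as Gb/s.
    raw_rate_gbps = bits_per_access / access_time_ns
    # Every packet is written once and read once, so halve the raw rate.
    return raw_rate_gbps / ops_per_packet


if __name__ == "__main__":
    print(packet_buffer_rate_gbps())  # 51.2
```

The gap between this bound and the "closer to 40Gb/s" figure on the slide is not modeled here.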
Buffer Memory: Is It Going to Get Better? [Figure: two plots against time: Specmarks, memory size, and gate density; memory bandwidth (to core).] 26
Fork-Join Router (Sponsored by NSF and ITRI) • How can we: • Increase capacity: increase parallelism. • Reduce power per subsystem: multiple racks. • While at the same time… • Keep the system simple: single-stage buffering. • Support line rates faster than memory bandwidth: pkt-by-pkt load balancing. • Support guaranteed services: Hmmm….? 27
The Fork-Join Router [Figure: N inputs and N outputs, each at rate R, spread across k parallel routers (1, 2, …, k); labeled “Bufferless”.] 28
The Fork-Join Router • Advantages • Single stage of buffering. • As k increases, power per subsystem decreases. • As k increases, memory bandwidth decreases. • As k increases, forwarding table lookup rate decreases. 29
The Fork-Join Router • Questions • Switching: What is the performance? • Forwarding Lookups: How do they work? 30
A Parallel Packet Switch • Arriving packet tagged with egress port. [Figure: N inputs and N outputs, each at rate R, connected to k parallel Output Queued Switches (1, 2, …, k).] 31
Performance Questions • Can it be work-conserving? • Can it emulate a single big output queued switch? • Can it support delay guarantees, strict-priorities, WFQ, …? 32
Work Conservation [Figure: input 1 and output 1, each at rate R, connected to k Output Queued Switches by internal links of rate R/k; labeled “Input Link Constraint” and “Output Link Constraint”.] 33
Work Conservation [Figure: cells numbered 1 to 5 arriving at rate R are spread over the k Output Queued Switches via internal links of rate R/k; labeled “Output Link Constraint”.] 34
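With internal links at rate R/k, a cell that arrives in one cell time at rate R keeps its internal link busy for k cell times, so an input cannot reuse a layer until that link is idle again. The sketch below tracks that constraint for one input. The class name, the `speedup` parameter (which shortens the busy period to ceil(k/S) cell times), and the integer cell-time clock are assumptions, and the policy for choosing among available layers is deliberately left out.

```python
import math


class Demultiplexor:
    """One input of a parallel packet switch spreading cells over k layers."""

    def __init__(self, k: int, speedup: float = 1.0):
        self.k = k
        # Cell times an internal link stays busy carrying one cell.
        self.busy_slots = math.ceil(k / speedup)
        # First cell time at which each internal link is idle again.
        self.free_at = [0] * k

    def available_layers(self, now: int) -> list:
        """Layers this input may send a cell to at cell time `now`."""
        return [layer for layer in range(self.k) if self.free_at[layer] <= now]

    def send(self, now: int, layer: int) -> None:
        """Start sending a cell to `layer`, marking its link busy."""
        if self.free_at[layer] > now:
            raise ValueError("input link constraint violated")
        self.free_at[layer] = now + self.busy_slots


if __name__ == "__main__":
    demux = Demultiplexor(k=3)
    demux.send(now=0, layer=0)
    print(demux.available_layers(now=1))  # layer 0 is still busy
    demux.send(now=1, layer=1)
    demux.send(now=2, layer=2)
    print(demux.available_layers(now=3))  # layer 0 has become idle again
```

A matching constraint applies on the output side, where each layer can deliver cells to an output only at the internal link rate.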
Work Conservation [Figure: N inputs and N outputs, each at rate R, connected to k Output Queued Switches by internal links running at rate S(R/k).] 35
Precise Emulation of an Output Queued Switch [Figure: is an N-port Parallel Packet Switch equivalent (= ?) to an N-port Output Queued Switch?] 36
Parallel Packet Switch: Theorems 1. If S > 2k/(k+2) ≈ 2 then a parallel packet switch can be work-conserving for all traffic. 2. If S > 2k/(k+2) ≈ 2 then a parallel packet switch can precisely emulate a FCFS output-queued switch for all traffic. 37
Parallel Packet Switch: Theorems 3. If S > 3k/(k+3) ≈ 3 then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic. 38
Parallel Packet Switch: Theorems 4. If S > 2 then a parallel packet switch with a small co-ordination buffer at rate R can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic. 39
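The bounds 2k/(k+2) and 3k/(k+3) stay below 2 and 3 respectively and approach them as k grows. The sketch below tabulates both, along with the internal link rate S(R/k) that a given speedup implies. The function names, the sample values of k, and the example line rate are illustrative, and S is treated here as the speedup of the internal links.

```python
def fcfs_speedup_bound(k: int) -> float:
    """Threshold 2k/(k+2) from the work-conservation and FCFS theorems."""
    return 2 * k / (k + 2)


def qos_speedup_bound(k: int) -> float:
    """Threshold 3k/(k+3) from the WFQ / strict-priority theorem."""
    return 3 * k / (k + 3)


def internal_link_rate(line_rate: float, k: int, speedup: float) -> float:
    """Rate S(R/k) of each internal link, in the units of line_rate."""
    return speedup * line_rate / k


if __name__ == "__main__":
    line_rate_gbps = 10.0  # hypothetical external line rate R
    for k in (2, 4, 8, 16, 64):
        s_fcfs = fcfs_speedup_bound(k)
        s_qos = qos_speedup_bound(k)
        print(f"k={k:3d}  2k/(k+2)={s_fcfs:.3f}  3k/(k+3)={s_qos:.3f}  "
              f"link rate at S=2: {internal_link_rate(line_rate_gbps, k, 2.0):.3f} Gb/s")
```

Even at a speedup of 2, each internal link runs at only 2R/k, so adding layers keeps lowering the rate that any single memory has to sustain.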
The Fork-Join Router • Questions • Switching: What is the performance? • Forwarding Lookups: How do they work? 40
The Fork-Join Router: Lookahead Forwarding Table Lookups • Packet tagged with egress port at next router. • Lookup performed in parallel at rate R/k. 41
The Fork-Join Router [Figure: N inputs and N outputs, each at rate R, spread across k parallel routers (1, 2, …, k).] • Expect >50Tb/s aggregate capacity 42
Conclusions • The main problems are power (supply and dissipation) and memory bandwidth. • Multi-stage switches solve the wrong problem. • Single-stage switches are here to stay. • Very high capacity single-stage electronic routers are feasible. 43