IP routers with memory that runs slower than the line rate Nick McKeown Assistant Professor of Electrical Engineering and Computer Science, Stanford University nickm@stanford.edu http://www.stanford.edu/~nickm 1
Outline • Trends in packet switch design • Additional problem: “Data rates may soon exceed memory bandwidth” • The Fork-Join Router & Parallel Packet Switches 2
First Packet Switches: Shared Memory
Numerous works have proven and made possible: • Fairness • Delay Guarantees • Delay Variation Control • Loss Guarantees • Statistical Guarantees
[Figure: Inputs 1…N and Outputs 1…N attached to a large, single, dynamically allocated memory buffer.]
N writes per “cell” time. N reads per “cell” time. Limited by memory bandwidth. 3
Later Packet Switches: Single-stage crossbar with CIOQ and VOQs
[Figure: linecards, each with Lookup & Drop Policy, Virtual Output Queues, and Output Scheduling, connected through a bufferless switch core (Switch Fabric and Switch Arbitration).]
• 1 write per “cell” time • 1 read per “cell” time • Rate of writes/reads determined by switch fabric speedup 4
Myths about CIOQ-based crossbar switches
1. “Input-queued crossbars have low throughput” • An input-queued crossbar can have as high throughput as any switch.
2. “Crossbars don’t support multicast traffic well” • A crossbar inherently supports multicast efficiently.
3. “Crossbars don’t scale well” • Today, it is the number of chip I/Os, not the number of crosspoints, that limits the size of a switch fabric. Expect 5Tb/s crossbar switches. 5
Myths about CIOQ-based crossbar switches (2) 4. “Crossbar switches can’t support delay/QoS guarantees” • With an internal speedup of 2, a CIOQ switch can (in theory) precisely emulate a shared memory switch for all traffic. 6
Summary of trend
[Figure: N inputs and N outputs; switch designs arranged along a “Higher Capacity” axis.]
• Shared memory: limited by memory bandwidth, ~50Gb/s
• Switch fabric with switch arbitration: limited by per-cell arbitration and power, ~5Tb/s
• Toward higher capacity: 1. Multistage (Clos, Banyan, Toroidal…) 2. Less frequent arbitration 8
How Fast Can I Make a Packet Buffer?
[Figure: external line (e.g. OC768c) → 64-byte wide bus → buffer memory (10ns on-chip DRAM) → 64-byte wide bus → switch fabric.]
Rough Estimate: • 10ns per memory operation. • Two memory operations per packet. • Therefore, maximum ~26Gb/s. 9
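The rough estimate above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each packet moves across the 64-byte bus once per memory operation and that the two operations are one write and one read; the function name and parameter names are illustrative, not from the slides.

```python
def max_buffer_rate_gbps(bus_bytes=64, op_ns=10.0, ops_per_packet=2):
    """Upper bound on packet-buffer throughput in Gb/s.

    Assumes one bus-wide transfer (bus_bytes) per packet and
    ops_per_packet memory operations (write + read) of op_ns each.
    """
    bits_per_packet = bus_bytes * 8           # 64 bytes -> 512 bits
    ns_per_packet = op_ns * ops_per_packet    # 2 operations -> 20 ns
    return bits_per_packet / ns_per_packet    # bits per ns equals Gb/s


print(max_buffer_rate_gbps())  # 25.6, i.e. the slide's "~26Gb/s"
```

Under these assumptions the buffer tops out well below a 40Gb/s line, which is the gap the following slides address.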
How can we make routers with 40Gb/s, 160Gb/s,… interfaces? 10
Higher capacity and higher linerates
[Figure: N inputs and N outputs; switch designs arranged along “Higher capacity” and “Higher Linerates” axes.]
• Shared memory: limited by memory bandwidth, ~50Gb/s
• Switch fabric with switch arbitration: limited by per-cell arbitration and power, ~5Tb/s
• Directions: 1. Multistage Switch Fabric 2. Less frequent arbitration 3. More parallelism: Fork-Join Router 11
Fork-Join Router
How can we: • Increase capacity. • Reduce power per subsystem.
While at the same time… • Keep the system simple. • Support line rates faster than memory bandwidth. • Provide delay guarantees.
Increase parallelism. Multiple racks. Single-stage buffering. Pkt-by-pkt load balancing. Hmmm….? 12
The Fork-Join Router
[Figure: N input lines and N output lines, each at rate R, connected through bufferless fork and join stages to k parallel routers (Router 1 … Router k).] 13
The Fork-Join Router • Advantages • Single stage of buffering • k ↑ ⇒ power per subsystem ↓ • k ↑ ⇒ memory bandwidth ↓ • k ↑ ⇒ forwarding table lookup rate ↓ 14
The Fork-Join Router • Questions • Switching: What is the performance? • Forwarding Lookups: How do they work? 15
A Parallel Packet Switch
Arriving packet tagged with egress port.
[Figure: N input lines and N output lines, each at rate R, connected to k parallel Output Queued Switches (1 … k).] 16
Performance Questions • Can it be work-conserving? • Can it emulate a single big output queued switch? • Can it support delay guarantees, strict-priorities, WFQ, …? 17
Work Conservation
[Figure: one input line and one output line at rate R, each connected to the k Output Queued Switches by internal links of rate R/k. The input side is labelled “Input Link Constraint”, the output side “Output Link Constraint”.] 18
Work Conservation
[Figure: cells numbered 1 to 5 arrive at rate R and are spread across the k layers over R/k links; the output side is labelled “Output Link Constraint”.] 19
Work Conservation
[Figure: N input lines and N output lines at rate R connected to k Output Queued Switches; the internal links run at S(R/k), i.e. with speedup S.] 20
Precise Emulation of an Output Queued Switch
[Figure: N×N Parallel Packet Switch =? N×N Output Queued Switch.] 21
Parallel Packet Switch Theorems
1. If S > 2k/(k+2) ≅ 2 then a parallel packet switch can be work-conserving for all traffic.
2. If S > 2k/(k+2) ≅ 2 then a parallel packet switch can precisely emulate a FCFS output-queued switch for all traffic. 22
Parallel Packet Switch Theorems 3. If S > 3k/(k+3) ≅ 3 then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic. 23
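The two speedup bounds in Theorems 1 to 3 have the same shape, c·k/(k+c) with c = 2 or c = 3, and approach c from below as the number of layers k grows. The sketch below simply evaluates them for a few values of k; the function name and the chosen values of k are illustrative assumptions, not from the slides.

```python
def speedup_bound(k, c):
    """Evaluate c*k/(k+c): c=2 gives 2k/(k+2), c=3 gives 3k/(k+3)."""
    return c * k / (k + c)


# Print both bounds for a few layer counts (values of k chosen arbitrarily).
for k in (2, 4, 8, 16, 64, 1024):
    fcfs = speedup_bound(k, 2)   # work conservation / FCFS emulation
    qos = speedup_bound(k, 3)    # WFQ, strict priorities, other QoS
    print(f"k={k:5d}  2k/(k+2)={fcfs:.3f}  3k/(k+3)={qos:.3f}")
```

For small k the required speedup is noticeably below 2 (or 3), so the "≅ 2" and "≅ 3" on the slides are the large-k limits.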
Parallel Packet Switch Theorems 4. If S >= 1 then a parallel packet switch with a small co-ordination buffer at rate R can precisely emulate a FCFS switch for all traffic. 24
Co-ordination buffers
[Figure: input and output lines at rate R, each with a co-ordination buffer of size Nk, connected to the k Output Queued Switches by internal links of rate R/k.] 25
Parallel Packet Switch Theorems 5. If S > 2 then a parallel packet switch with a small co-ordination buffer at rate R can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic. 26
The Fork-Join Router • Questions • Switching: What is the performance? • Forwarding Lookups: How do they work? 27
The Fork-Join Router: Lookahead Forwarding Table Lookups • Packet tagged with egress port at next router • Lookup performed in parallel at rate R/k 28
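To get a feel for what "lookup at rate R/k" buys, the sketch below converts a line rate into worst-case lookups per second for each of the k layers. The 40-byte minimum packet size, the example line rate of 160Gb/s, and k = 10 are illustrative assumptions, not figures from the slides.

```python
def lookups_per_second(line_rate_gbps, k, min_packet_bytes=40):
    """Worst-case forwarding lookups per second required of each layer.

    Assumes back-to-back minimum-size packets at the full line rate and
    an even split of packets across the k parallel layers.
    """
    packets_per_second = line_rate_gbps * 1e9 / (min_packet_bytes * 8)
    return packets_per_second / k


# Hypothetical example: a 160Gb/s line, alone and spread over 10 layers.
print(lookups_per_second(160, 1))    # single lookup engine at rate R
print(lookups_per_second(160, 10))   # each of 10 layers at rate R/k
```

Each layer then needs only 1/k of the lookup rate that a single engine running at the line rate would have to sustain.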
The Fork-Join Router
[Figure: N input lines and N output lines, each at rate R, connected to k parallel routers (Router 1 … Router k).]
• Possibly >100Tb/s aggregate capacity • Linerates in excess of 100Gb/s 29