The Fork-Join Router. Nick McKeown Assistant Professor of Electrical Engineering and Computer Science, Stanford University nickm@stanford.edu http://www.stanford.edu/~nickm. Outline. Quick Background on Packet Switches What’s the problem? “What if data rates exceed memory bandwidth?”
The Fork-Join Router Nick McKeown Assistant Professor of Electrical Engineering and Computer Science, Stanford University nickm@stanford.edu http://www.stanford.edu/~nickm
Outline • Quick Background on Packet Switches • What’s the problem? “What if data rates exceed memory bandwidth?” • The Fork-Join Router • Parallel Packet Switches
First Generation Packet Switches [Figure: line interfaces (MAC, DMA), CPU, and a buffer memory attached to a shared backplane.] • Fixed length “DMA” blocks or cells, reassembled on egress linecard. • Fixed length cells or variable length packets.
Second Generation Packet Switches [Figure: line cards, each with MAC, DMA, and local buffer memory, plus a CPU with its own buffer memory.]
Third Generation Packet Switches [Figure: line cards (MAC, local buffer memory) and a CPU card (line interface, CPU, memory) connected by a switched backplane.]
Two Basic Techniques • Input-queued Crossbar: 1+1 = 2 operations per cell time. • Shared Memory: N+N = 2N operations per cell time.
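The operation counts above translate directly into a memory bandwidth requirement. The following is a minimal sketch, assuming one write and one read per cell per port as on the slide; the function names and the example port count and line rate are illustrative and do not come from the talk.

```python
def input_queued_bandwidth_gbps(line_rate_gbps: float) -> float:
    """Bandwidth each input memory needs: 1 write + 1 read per cell time."""
    return (1 + 1) * line_rate_gbps


def shared_memory_bandwidth_gbps(n_ports: int, line_rate_gbps: float) -> float:
    """Bandwidth the single shared memory needs: N writes + N reads per cell time."""
    return (n_ports + n_ports) * line_rate_gbps


if __name__ == "__main__":
    # Hypothetical example: 16 ports at 10 Gb/s each.
    print(input_queued_bandwidth_gbps(10.0))
    print(shared_memory_bandwidth_gbps(16, 10.0))
```

The shared-memory requirement grows linearly with the port count, while each input-queued memory only has to keep up with its own line.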
Shared Memory: The Ideal [Figure: queues of cells held in a shared memory.] A large body of work has proven and made possible: • Fairness • Delay Guarantees • Delay Variation Control • Loss Guarantees • Statistical Guarantees
Precise Emulation of an Output Queued Switch [Figure: an N-port combined input-output queued switch with a scheduler =? an N-port output queued switch.]
Result Theorem: A speedup of 2-1/N is necessary and sufficient for a combined input- and output-queued switch to precisely emulate an output-queued switch for all traffic. Joint work with Balaji Prabhakar at Stanford.
Outline • Quick Background on Packet Switches • What’s the problem? “What if data rates exceed memory bandwidth?” • The Fork-Join Router • Parallel Packet Switches
Buffer Memory: How Fast Can I Make a Packet Buffer? [Figure: 5ns SRAM buffer memory with a 64-byte wide bus in and a 64-byte wide bus out.] Rough Estimate: • 5ns per memory operation. • Two memory operations per packet. • Therefore, maximum 512 bits / 10ns = 51.2Gb/s. • In practice, closer to 40Gb/s.
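The rough estimate can be reproduced in a few lines. This is a sketch of the arithmetic only: the default values are the ones quoted on the slide, and the function name is an assumption made for illustration.

```python
def max_buffer_throughput_gbps(bus_bytes: int = 64,
                               access_ns: float = 5.0,
                               ops_per_packet: int = 2) -> float:
    """Peak throughput of a packet buffer, in Gb/s.

    Each packet crosses the bus once on write and once on read, so one
    bus-width of data is delivered every ops_per_packet * access_ns.
    """
    bits_per_transfer = bus_bytes * 8
    return bits_per_transfer / (access_ns * ops_per_packet)  # bits/ns == Gb/s


if __name__ == "__main__":
    print(max_buffer_throughput_gbps())
```

Widening the bus or shortening the access time raises the bound; the 40Gb/s practical figure quoted on the slide is not modelled here.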
Buffer Memory: Is It Going to Get Better? [Figure: two trends plotted against time: Specmarks, memory size, and gate density; memory bandwidth (to core).]
Optical Physical Layers… …are Going to Make Things “Worse” DWDM: • More λ’s per fiber → more “ports” per switch. • # ports: 16, …, 1000’s. Data rate: • More b/s per λ → higher capacity. • Data rates: 2.5Gb/s, 10Gb/s, 40Gb/s, 160Gb/s, …
Approach #1: Ping-pong Buffering [Figure: two buffer memories between a 64-byte wide input bus and a 64-byte wide output bus.] Memory bandwidth doubled to ~80 Gb/s
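The slide does not spell out the mechanism, so the following is only one plausible reading of ping-pong buffering: in every memory cycle a read and a write may both proceed as long as they use different banks, so each bank performs a single operation per cycle. The class and method names are hypothetical.

```python
from collections import deque


class PingPongBuffer:
    """Two-bank FIFO packet buffer: one read and one write per cycle, on different banks."""

    def __init__(self):
        self.banks = [deque(), deque()]
        self.order = deque()  # bank index of each stored packet, in arrival order

    def cycle(self, arriving=None, want_read=False):
        """Run one memory cycle and return the departing packet, if any."""
        departed = None
        read_bank = None
        if want_read and self.order:
            read_bank = self.order.popleft()
            departed = self.banks[read_bank].popleft()
        if arriving is not None:
            if read_bank is not None:
                # The read occupies one bank, so the write must use the other.
                write_bank = 1 - read_bank
            else:
                # No read this cycle: pick the emptier bank to keep them balanced.
                write_bank = 0 if len(self.banks[0]) <= len(self.banks[1]) else 1
            self.banks[write_bank].append(arriving)
            self.order.append(write_bank)
        return departed
```

Because the writer always avoids the bank being read, neither bank ever sees two operations in one cycle, which is where the doubling of bandwidth comes from in this model.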
Approach #2: Multiple Parallel Buffers (aka Banking, Interleaving) [Figure: four buffer memories operating in parallel.]
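As a sketch of the interleaving idea, assume cells are written to banks in strict round-robin order by sequence number (the slide does not specify the mapping, and the helper names are illustrative). Under that assumption each bank is written only once every `num_banks` cells, so it can be correspondingly slower.

```python
def bank_for_cell(seq_no: int, num_banks: int) -> int:
    """Round-robin interleaving: consecutive cells go to consecutive banks."""
    return seq_no % num_banks


def min_gap_between_hits(num_cells: int, num_banks: int):
    """Smallest spacing, in cells, between two writes to the same bank."""
    last_seen = {}
    min_gap = None
    for seq_no in range(num_cells):
        bank = bank_for_cell(seq_no, num_banks)
        if bank in last_seen:
            gap = seq_no - last_seen[bank]
            min_gap = gap if min_gap is None else min(min_gap, gap)
        last_seen[bank] = seq_no
    return min_gap
```

This only covers the write side. Reads follow the departure order chosen by the scheduler, so two consecutive reads can land on the same bank; handling such conflicts is outside this sketch.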
Outline • Quick Background on Packet Switches • What’s the problem? “What if data rates exceed memory bandwidth?” • The Fork-Join Router • Parallel Packet Switches
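The Fork-Join Router [Figure: N input links at rate R are spread across k routers (Router 1, 2, …, k) and recombined onto N output links at rate R; the stages around the routers are bufferless.]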
The Fork-Join Router Advantages: • k ↑ → memory bandwidth ↓ • k ↑ → lookup/classification rate ↓ • k ↑ → routing/classification table size ↓ Problems: • How to demultiplex prior to lookup/classification? • How does the system perform/behave? • Can we predict/guarantee performance?
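To make the memory bandwidth advantage concrete, the sketch below assumes each line's traffic is split evenly over the k routers, so every router sees 1/k of the line rate. That even split, the function names, and the example figures are assumptions for illustration rather than results from the talk.

```python
import math


def per_router_rate_gbps(line_rate_gbps: float, k: int) -> float:
    """Rate each of the k routers must sustain per line under an even split."""
    return line_rate_gbps / k


def routers_needed(line_rate_gbps: float, per_router_gbps: float) -> int:
    """Smallest k such that line_rate / k fits within one router's capability."""
    return math.ceil(line_rate_gbps / per_router_gbps)


if __name__ == "__main__":
    # Hypothetical example: a 160 Gb/s line and routers whose buffers sustain 40 Gb/s.
    k = routers_needed(160.0, 40.0)
    print(k, per_router_rate_gbps(160.0, k))
```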
Outline • Quick Background on Packet Switches • What’s the problem? “What if data rates exceed memory bandwidth?” • The Fork-Join Router • Parallel Packet Switches
A Parallel Packet Switch [Figure: N input links at rate R are spread across k parallel output queued switches (1, 2, …, k), which feed N output links at rate R.]
Parallel Packet Switch: Questions • Can it be work-conserving? • Can it emulate a single big output queued switch? • Can it support delay guarantees, strict-priorities, WFQ, …? • What happens with multicast?
Parallel Packet Switch: Work Conservation [Figure: one input and one output at rate R, each connected to layers 1, 2, …, k by internal links of rate R/k. Labels: Input Link Constraint, Output Link Constraint.]
Parallel Packet Switch: Work Conservation [Figure: the same topology with numbered cells 1 to 5 distributed over the k layers on internal links of rate R/k. Label: Output Link Constraint.]
Parallel Packet Switch: Work Conservation [Figure: N inputs and N outputs at rate R connected to k parallel output queued switches by internal links of rate S(R/k).]
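The input link constraint can be sketched as follows: an internal link running at S(R/k) is occupied for k/S external cell times by each cell, so a demultiplexer may not reuse a layer until that time has elapsed. This toy model tracks only that constraint for a single input. It is not the layer-selection algorithm behind the theorems, and all names are hypothetical.

```python
import math


class Demultiplexer:
    """Tracks which of the k internal links from one input are free."""

    def __init__(self, k: int, speedup: float = 1.0):
        # External cell times one cell occupies an internal link of rate S(R/k).
        self.busy_for = math.ceil(k / speedup)
        self.free_at = [0] * k  # first cell time each internal link is free again

    def available_layers(self, t: int):
        """Layers this input is allowed to send to at external cell time t."""
        return [layer for layer, free in enumerate(self.free_at) if free <= t]

    def send(self, t: int, layer: int) -> None:
        """Send the cell arriving at time t to the given layer."""
        if self.free_at[layer] > t:
            raise ValueError(f"link to layer {layer} is busy until {self.free_at[layer]}")
        self.free_at[layer] = t + self.busy_for

    def send_first_available(self, t: int):
        """Pick the lowest-numbered free layer; returns None if all links are busy."""
        candidates = self.available_layers(t)
        if not candidates:
            return None
        self.send(t, candidates[0])
        return candidates[0]
```

With S = 1 and one arrival per cell time, exactly one layer is free at each step, so the choice degenerates to round-robin. A speedup above 1 leaves more than one candidate, which is the freedom a real algorithm would use to respect the output link constraint as well.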
Parallel Packet Switch: Precise Emulation of an Output Queued Switch [Figure: an N-port parallel packet switch =? an N-port output queued switch.]
Parallel Packet Switch: Theorems 1. If S > 2k/(k+2) ≈ 2 then a parallel packet switch can be work-conserving for all traffic. 2. If S > 2k/(k+2) ≈ 2 then a parallel packet switch can precisely emulate an FCFS output-queued switch for all traffic.
Parallel Packet Switch: Theorems 3. If S > 3k/(k+3) ≈ 3 then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
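The speedup thresholds in the theorems approach 2 and 3 only as k grows. The sketch below tabulates them exactly for a few layer counts; the function names and the chosen values of k are illustrative.

```python
from fractions import Fraction


def fcfs_speedup_bound(k: int) -> Fraction:
    """Threshold 2k/(k+2) for work conservation and FCFS emulation."""
    return Fraction(2 * k, k + 2)


def qos_speedup_bound(k: int) -> Fraction:
    """Threshold 3k/(k+3) for emulating WFQ, strict priorities, and similar QoS."""
    return Fraction(3 * k, k + 3)


def bounds_table(ks):
    """List of (k, FCFS threshold, QoS threshold) rows."""
    return [(k, fcfs_speedup_bound(k), qos_speedup_bound(k)) for k in ks]


if __name__ == "__main__":
    for k, fcfs, qos in bounds_table([2, 4, 8, 16, 64]):
        print(f"k={k:3d}  FCFS threshold={float(fcfs):.3f}  QoS threshold={float(qos):.3f}")
```

Both thresholds stay strictly below their limits for every finite k, which is why the slides write them as approximately 2 and approximately 3.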
An aside: Unbuffered Clos Circuit Switch. Expansion factor required = 2-1/N
Clos Network [Figure: three-stage Clos network with input switches I1 … IX (m links each), R middle stage switches, and output switches O1 … OX (m links each), alongside a connection matrix with rows I1, I2, I3, …, Ix and columns O1, O2, O3, …, Ox.] • <= min(R,m) entries in each row • <= min(R,m) entries in each column Define: UIL(Ii) = used links at switch Ii to connect to middle stages. UOL(Oi) = used links at switch Oi to connect to middle stages. If we wish to connect Ii to Oi: When adding connection: |UIL(Ii)| <= m-1 and |UOL(Oi)| <= m-1 Worst-case: |UIL(Ii) ∪ UOL(Oi)| = 2m-2 Therefore, if R >= 2m-1 (that is, R > 2m-2) there is always a free middle stage.
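The counting argument can be checked with a small sketch. In the worst case the input switch and the output switch each already use m-1 distinct middle switches, with no overlap, so 2m-2 are unavailable and at least one more is needed. The function name and the set-based representation of UIL and UOL are assumptions made for illustration.

```python
def free_middle_switch(uil: set, uol: set, num_middle: int):
    """Return a middle switch unused by both the input and the output switch, or None.

    uil: middle switches already used by the input switch (UIL)
    uol: middle switches already used by the output switch (UOL)
    """
    for mid in range(num_middle):
        if mid not in uil and mid not in uol:
            return mid
    return None


if __name__ == "__main__":
    m = 3
    uil = {0, 1}  # m-1 links in use at the input switch
    uol = {2, 3}  # m-1 different links in use at the output switch
    print(free_middle_switch(uil, uol, 2 * m - 1))  # a free middle switch exists
    print(free_middle_switch(uil, uol, 2 * m - 2))  # none left
```

With 2m-1 middle switches the expansion relative to the m links per switch is (2m-1)/m = 2 - 1/m.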
An aside: Unbuffered Clos Circuit Switch. Expansion factor required = 2-1/N. Expansion → 2 - 4/(k+2)
Fork-Join Router Project: What’s next? • Theory: • Extending results to distributed algorithms. • Extending results to multicast. • Implementation/Prototyping: • Under discussion...