240 likes | 334 Views
Making Parallel Packet Switches Practical. Sundar Iyer, Nick McKeown (sundaes,nickm)@stanford.edu Departments of Electrical Engineering & Computer Science, Stanford University http://klamath.stanford.edu/pps. Motivation. To design and analyze:
E N D
Making Parallel Packet Switches Practical Sundar Iyer, Nick McKeown (sundaes,nickm)@stanford.edu Departments of Electrical Engineering & Computer Science, Stanford University http://klamath.stanford.edu/pps
Motivation To design and analyze: • an architecture of a very high capacity packet switch • in which the memories run slower than the line rate” [Ref: S. Iyer, A. Awadallah, N. McKeown, “Analysis of Packet Switch with Memories Running Slower than the Line Rate, Proc. Infocom, Tel Aviv, Mar 2000.] Stanford University
What limits capacity of packet switches today? • Memory bandwidth for packet buffers • Shared memory: B = 2NR • Input queued: B = 2R • Switch Arbitration • At the line rate R • Packet Processing • At the line rate R Stanford University
The building blocks we’d like to use: Large NxN Switch R R Slower NxN Switches R R How can we scale the capacity of switches? What we’d like: R R NxN R R Stanford University
Why this might be a good idea • Larger Capacity • Slower than the line rate • Buffering • Arbitration • Packet Processing • Redundancy Stanford University
R/k 1 R/k R R 2 R R … R R … k Observations and Questions • Random load-balancing: • It’s hard to predict system performance. • Flow-by-flow load-balancing: • Worst-case performance is very poor. • Can we do better? • What if we switch packet by packet? • Can we achieve 100% throughput • Can we give delay guarantees? Stanford University
Architecture of a PPS Definition: A PPS is comprised of multiple identical lower-speed packet-switches operating independently and in parallel. An incoming stream of packets is spread, packet-by-packet, by a demultiplexor across the slower packet-switches, then recombined by a multiplexor at the output. We call this “parallel packet switching” Stanford University
1 2 3 N=4 Architecture of a PPS Demultiplexor OQ Switch Multiplexor (sR/k) (sR/k) R R 1 1 Multiplexor Demultiplexor R R OQ Switch 2 2 Demultiplexor Multiplexor R R 3 OQ Switch Demultiplexor Multiplexor R R k=3 N=4 (sR/k) (sR/k) Stanford University
Output Queued Switch 1 1 R R 2 2 R R Internal BW = 2NR R R N N R R We will compare it to an OQ Switch • Why? • There is no internal contention • No queueing at the inputs • They give the minimum delay • They can give QoS guarantees Stanford University
Definition • Relative Queueing Delay • This is defined as the increased queueing delay faced by a cell in the PPS relative to the delay it receives in a shadow output queued switch • It includes the time difference attributed only due to queueing • A switch is said to emulate an OQ switch if the relative queueing delay is zero Stanford University
Shadow OQ Switch t’ t” C t R R C t’ C R R R R R R Yes No =? PPS C C C C A PPS which a bounded relative delay t” –t’ < Constant Stanford University
Problem Statement Redefined Motivation: “To design and analyze an architecture of a very high capacity packet switch in which the memories run slower than the line rate,whichpreserves the good properties of an OQ switch” This talk: Expanding the capacity of a FIFO packet switch, with a bounded relative queueing delay, using the PPS architecture. Stanford University
5 4 3 2 1 5 4 1 1 4 5 4 1 5 4 1 2 2 2 2 5 4 4 3 3 2 2 1 1 4 3 2 2 1 1 1 1 3 2 3 3 2 3 3 N=4 3 A Bad Scenario for the PPS Layer 1 R R 1 1 R/3 R R Layer 2 2 2 R R 3 R/3 Layer 3 R R N=4 R/3 Stanford University
Parallel Packet SwitchResult Theorem: • If S >= 2 then a PPS can emulate a FIFO OQ switch for all traffic patterns. Stanford University
Is this Practical? • Load Balancing Algorithm • Is Centralized • Requires N2 communication complexity • Ideally we want a distributed algorithm • Speedup • A speedup of 2 is required • We would ideally like no speedup Stanford University
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 N=4 Load Balancing in a PPS Layer 1 R R 3 2 1 1 R/3 R R Layer 2 2 3 2 1 2 R R 3 2 1 3 R/3 Layer 3 R R 3 2 1 N=4 R/3 Stanford University
12 9 8 7 2 1 13 8 5 4 3 1 4 3 2 1 14 13 10 6 3 16 14 11 9 6 2 8 6 5 7 15 12 10 7 12 11 10 9 15 11 5 16 14 13 15 16 4 Distribution of Cells from a Demultiplexor Demultiplexor: Input 1 • Cells from every input to a every output are sent to the center stage switches in a round robin manner FIFOs for all k=3 layers R/3 R R/3 R/3 “No more than 4 consecutive cells can go to the same FIFO i.e. center stage switch” Stanford University
Modification to the PPS • Relax the relative queueing bound • Allows a distributed load balancing arbiter • Run an independent load balancing algorithm on each demultiplexor • Eliminates N2 Communication Complexity • Keep small & bounded delay buffers at the demultiplexor • Eliminates speedup in the links between the demultiplexor and the center stage switches Stanford University
1 1 2 2 3 3 N=4 Cells as seen by the Multiplexor Layer 1 R R 1 1 1 1 3 2 1 1 R/3 R R Layer 2 2 3 2 1 2 2 2 2 2 R R 3 2 1 3 R/3 Layer 3 3 3 3 3 R R 3 2 1 N=4 R/3 Stanford University
Solution • Read • cells from the corresponding queues (which may be out of order) based on the arrival time from all center stage switches to maintain throughput • Introduce • a small and bounded re-sequencing buffer at the multiplexor to re-order cells and send them in sequence • Tolerate • a bounded delay relative to the shadow FIFO OQ switch Stanford University
Properties of the PPS Demultiplexor • Demultiplexors • Cells arrive at combined rate R over all k FIFOs • Each cell has a property: output • Cells to same output are inserted into the k FIFOs in RR. • Cells are written into each FIFO buffer at leaky bucket rate of less than R/Ck + N • Cells are read from each FIFOs at constant service rate R/k • Max delay faced by a cell is N internal time slots Stanford University
Relative Queueing Delay faced by a Cell • Demultiplexors • A maximum relative queueing delay of N internal time slots is encountered by a cell • Multiplexors • A maximum relative queueing delay of N internal time slots is encountered by a cell • Total Relative Queueing Delay • 2N time slots Stanford University
Buffered PPSResults • A PPS with a completely distributed algorithm and no speedup with a buffer of size Nk, can emulate a FIFO output queued switch for all traffic patterns within a relative queueing delay bound of 2N internal time slots I.e. 2Nk time slots. Stanford University
Conclusion • Its possible to expand the capacity of a FIFO packet switch using multiple slower speed packet switches. • There remain a couple of open questions • Making QoS practical. • Making multicasting practical. Stanford University