
Switch Architectures




Presentation Transcript


  1. Switch Architectures: Input Queued, Output Queued, Combined Input and Output Queued

  2. Outline • I. Introduction • II. System Model • III. The Least Cushion First/Most Urgent First Algorithm • IV. Conclusion

  3. Ⅰ. Introduction • Exponential growth of Internet traffic demands large-scale switches • Common switch architectures: • Output Queued • High performance • Easier to provide QoS guarantees • Has a serious scaling problem • Input Queued • More scalable • Suffers from HOL blocking • Virtual Output Queues can improve performance • Difficult to provide QoS guarantees

  4. Output Queued: Shared Bus [Figure: a shared-bus output-queued switch; cells from input ports 1-4 are broadcast on the bus and written into queues at output ports 1-4.]

  5. Output Queued: Shared Memory [Figure: a shared-memory output-queued switch; input ports 1-4 write arriving cells into a common memory from which output ports 1-4 read.]

  6. Input Queued [Figure: an input-queued switch with a single FIFO at each of input ports 1-4 feeding output ports 1-4.]

  7. Input Queued with VOQ [Figure: an input-queued switch in which each of input ports 1-4 keeps a separate virtual output queue for each of output ports 1-4.]

  8. Ⅰ. Introduction Memory BW requirements for the three common switch architectures (S: link speed, N: switch size, N×N): • Output queued: memory must run at (N+1)S (up to N writes plus 1 read per slot) • Input queued: 2S (1 write plus 1 read per slot) • CIOQ with speedup s: (s+1)S • Input queueing is necessary! • Can speed up the switch to improve performance: the CIOQ switch
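For concreteness (numbers chosen only as an illustration): with N = 32 ports and S = 10 Gb/s links, an output-queued memory must run at (32+1) × 10 = 330 Gb/s, an input-queued memory at only 2 × 10 = 20 Gb/s, and a CIOQ memory with speedup 2 at (2+1) × 10 = 30 Gb/s.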

  9. Ⅰ. Introduction Matching algorithms for performance improvement: [Figure: a matching of input ports to output ports across the switch fabric.]

  10. Ⅰ. Introduction Exact emulation: under identical input traffic, the departure time of every cell from the CIOQ switch and from the emulated OQ switch is identical. [Figure: an N×N CIOQ switch and an emulated N×N OQ switch fed identical input traffic and producing identical departure patterns.]

  11. Ⅰ. Introduction • We propose a new scheduling algorithm called the least cushion first / most urgent first (LCF/MUF) algorithm • O(N) complexity with parallel comparators • Exactly emulates an OQ switch with a speedup of 2 • No constraint on the service discipline

  12. Ⅱ. System Model [Figure: a CIOQ switch whose switching fabric runs at speedup = 2.]

  13. Ⅱ. System Model • The switch fabric is sped up by a factor of 2 • There are 2 scheduling phases in slot k, referred to as phase k.1 and phase k.2 • A cell delivered to its destined output port in phase k.1 can be transmitted out of the output port in the same slot (i.e., cut-through) • A cell delivered in phase k.2 can only be transmitted in slot k+1 or later

  14. Ⅱ. System Model

  15. Ⅲ. The Least Cushion First / Most Urgent First Algorithm • Let c(i,j) denote a cell at input port i destined to output port j • Definition 1: The cushion of cell c(i,j): • The number of cells residing in output port j which will depart the emulated OQ switch earlier than cell c(i,j) • Definition 2: The cushion between input port i and output port j, denoted C(i,j): • The minimum of the cushions of all cells at input port i destined to output port j • If there is no cell destined to output port j, then C(i,j) is set to ∞

  16. Ⅲ. The Least Cushion First / Most Urgent First Algorithm • Definition 3: The scheduling matrix of an N×N switch is an N×N square matrix whose (i,j)th entry equals C(i,j) • Definition 4: The input thread of a cell c at input port i: • The set of cells at input port i which have a cushion smaller than or equal to that of c, except cell c itself • Let T(c) denote the size of the input thread of c
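As a small hypothetical example (values invented for illustration), a 3×3 scheduling matrix could be C = [[2, ∞, 0], [1, 3, ∞], [∞, 0, 4]]: C(1,3) = 0 says the least-cushion cell at input 1 destined to output 3 has no cell ahead of it in output port 3 of the emulated OQ switch, while C(1,2) = ∞ says input 1 currently holds no cell for output 2.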

  17. Ⅲ. The Least Cushion First / Most Urgent First Algorithm

  18. Ⅲ. The Least Cushion First / Most Urgent First Algorithm • LCF / MUF Algorithm • Step 1: • Select the (i,j)th entry with the smallest cushion C(i,j) (Least Cushion First). If the selected entry is ∞, then stop. • If more than one entry with the least cushion resides in different columns, then arbitrarily select one column (i.e., an output port). • For the selected column, say column j, determine the row i which holds the most urgent cell among all cells destined to output port j at all input ports (Most Urgent First).

  19. Ⅲ. The Least Cushion First / Most Urgent First Algorithm • LCF / MUF Algorithm • Step 2: • Eliminate the ith row and the jth column of the scheduling matrix (i.e., match output port j to input port i). • If the reduced matrix becomes null, then stop. Otherwise, apply Step 1 to the reduced matrix. • Consider, for example, the scheduling matrix given on page 13
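To make Steps 1 and 2 concrete, here is a minimal Python sketch of one LCF/MUF matching pass. It assumes the cushions C(i,j) of Definition 2 are already computed (math.inf where no cell exists), and it leaves the urgency measure as a caller-supplied hook, since the slides define the input thread (Definition 4) but do not spell out the tie-break here; the function name lcf_muf and the urgency parameter are illustrative, not from the original.

```python
import math

def lcf_muf(cushion, urgency):
    """One LCF/MUF matching pass over an N x N scheduling matrix.

    cushion[i][j] -- cushion between input i and output j
                     (math.inf when input i holds no cell for output j)
    urgency(i, j) -- caller-supplied urgency of input i's cell for
                     output j (hypothetical hook; larger = more urgent)
    Returns a list of (input, output) matches."""
    N = len(cushion)
    free_in, free_out = set(range(N)), set(range(N))
    match = []
    while free_in and free_out:
        # Step 1 (LCF): find the least cushion over the reduced matrix.
        least = min(cushion[i][j] for i in free_in for j in free_out)
        if math.isinf(least):
            break  # selected entry is infinite: nothing left to schedule
        # Arbitrarily pick one column (output port) holding the least cushion.
        j = next(j for j in free_out
                 if any(cushion[i][j] == least for i in free_in))
        # Step 1 (MUF): among inputs holding a cell for output j,
        # take the most urgent one.
        candidates = [i for i in free_in if not math.isinf(cushion[i][j])]
        i = max(candidates, key=lambda i: urgency(i, j))
        match.append((i, j))
        # Step 2: eliminate row i and column j from the matrix.
        free_in.discard(i)
        free_out.discard(j)
    return match
```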

  20. Ⅳ. Conclusion • We propose a new scheduling algorithm, the least cushion first / most urgent first algorithm • Exactly emulates an OQ switch • No constraint on the service discipline • Implementation issues of the LCF / MUF algorithm: • The switch has to know the cushions of all cells and the relative departure order of cells destined to the same output port • It could be difficult to obtain this information for a dynamic priority assignment scheme (e.g., WFQ) • Feasible for static priority assignment schemes

  21. Outline • Systolic Array • Binary Heap • Pipelined Heap • Hardware Design

  22. The Systolic Array Priority Queue [Figure: a chain of blocks 1..n, each with a permanent data register, a temporary register, and a comparator; the highest value sits in block 1, new values enter there, and priority values are non-increasing toward block n.] For n = 1000: hardware required: 1000 comparators, 2000 registers. Performance: constant time.
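Since the original figure is lost, here is a small Python simulation of the systolic-array idea described above: one value per block, kept in non-increasing priority order, with block 1 always holding the highest value. In the real hardware every block has its own comparator and all blocks act in parallel, which is what makes insert and delete constant time; this sequential loop only mimics the data movement (the class name is illustrative).

```python
class SystolicArrayPQ:
    """Software sketch of a systolic-array priority queue.

    blocks[0] plays the role of block 1 (highest value); values are
    kept in non-increasing order toward block n."""

    def __init__(self, n):
        self.blocks = [None] * n  # the permanent data registers

    def insert(self, value):
        # A new value enters at block 1; each block keeps the larger
        # of (its value, the incoming value) and passes the smaller
        # toward block n. In hardware all blocks do this at once.
        for i in range(len(self.blocks)):
            if self.blocks[i] is None:
                self.blocks[i] = value
                return
            if value > self.blocks[i]:
                self.blocks[i], value = value, self.blocks[i]
        # if the array is full, the smallest value falls off the end

    def delete_max(self):
        # The highest value leaves block 1; every block then pulls
        # the value of its right-hand neighbour.
        top = self.blocks[0]
        self.blocks = self.blocks[1:] + [None]
        return top
```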

  23. The Binary Heap Priority Queue [Figure: a max-heap shown both as a binary tree with root 16 and as the array [16, 14, 10, 4, 7, 8, 3, 2, 3, 3, 5, 7] at indices 1-12.] For n = 1000: hardware required: 1 comparator, 1 register, 1 SRAM. Performance: O(log n).

  24. The Pipelined-Heap • Modified binary heap data structure • Constant-time operation. Similar to the Systolic Array. • Good hardware scalability. Similar to the Binary Heap.

  25. P-heap Data Structure (B, T) [Figure: a P-heap consisting of a binary array B (per-level node values: 16 at level 1; 14, 10 at level 2; 4, 7, 7, 3 at level 3; 2, 1, 5, 8 among the level-4 leaves) and a token array T holding one (operation, value, position) token per level; each node of B also records the remaining capacity of its subtree.]

  26. The Enqueue (Insert) Operation [Figure: inserting value 9: (a) local-enqueue(1) places the token (enq, 9, 1) at the root, which keeps 16; (b) local-enqueue(2) moves the token to (enq, 9, 2) after comparison at level 2.]

  27. Enqueue (contd) [Figure: (c) local-enqueue(3): the token reaches level 3, where 9 displaces the smaller value 7 at position 5; (d) local-enqueue(4): the displaced value travels down as token (enq, 7, 10); (e) the final P-heap holds 9 at level 3 and 7 in a level-4 leaf.]

  28. The Dequeue (Delete) Operation [Figure: (a) the initial P-heap with root 16; (b) local-dequeue(1) removes the root value and places the token (deq, -, 1), leaving a hole to be filled from below.]

  29. Dequeue (contd) [Figure: (c) local-dequeue(2): the larger child 14 moves up to the root and the token becomes (deq, -, 2); (d) local-dequeue(3): 8 moves up to level 2 and the hole is pushed to level 3; (e) the final P-heap has root 14 with children 8 and 10.]

  30. Pipelined Operation [Figure: two snapshots of the P-heap pipeline across levels 1-6; consecutive operations occupy different levels at the same time, so a new enqueue or dequeue can be issued every cycle.]

  31. Hardware Requirements • log N SRAMs represent the binary array B, where N = size of the P-heap • log N registers represent the token array T • log N comparators, one for each level of the P-heap
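For instance (illustrative size), a P-heap that can hold N = 1024 values has log N = 10 levels, so the whole priority queue needs just 10 SRAMs, 10 token registers, and 10 comparators, independent of how many values are queued.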

  32. Binary Heap • Left(i) = 2*i • Right(i) = 2*i + 1 • Parent(i) = i / 2 • Heap property: A[i] >= A[Left(i)] and A[i] >= A[Right(i)] [Figure: the heap viewed as a binary tree with root 16, and viewed as the array [16, 11, 12, 8, 11, 9] at indices 1-6.]

  33. Binary Heap: Insert Operation [Figure: inserting 14 into [16, 11, 12, 8, 10, 9]: 14 is appended at index 7 and sifted up, swapping with its parent 12, giving [16, 11, 14, 8, 10, 9, 12]; shown both as a binary tree and as an array.]

  34. Binary Heap: Delete Operation [Figure: deleting the maximum from [16, 11, 14, 8, 10, 9, 12]: the root 16 is removed, the last element 12 moves to the root, and 12 is sifted down, swapping with the larger child 14, giving [14, 11, 12, 8, 10, 9]; shown both as a binary tree and as an array.]

  35. Binary Heap Operations • Both insert and delete are O(log N) operations (i.e., proportional to the number of levels in the tree) • 2*i can be implemented as a left shift • i / 2 can be implemented as a right shift
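A compact Python version of the 1-based array heap from slides 32-34, using the shift trick mentioned above (left child i << 1, parent i >> 1); a software sketch, not the hardware design:

```python
class MaxHeap:
    """1-based array max-heap; index arithmetic matches the slides:
    Left(i) = i << 1, Right(i) = (i << 1) + 1, Parent(i) = i >> 1."""

    def __init__(self, values=()):
        self.a = [None]            # slot 0 unused so the shifts work
        for v in values:
            self.insert(v)

    def insert(self, v):           # O(log N): append, then sift up
        self.a.append(v)
        i = len(self.a) - 1
        while i > 1 and self.a[i >> 1] < self.a[i]:
            self.a[i], self.a[i >> 1] = self.a[i >> 1], self.a[i]
            i >>= 1                # move up to Parent(i)

    def delete_max(self):          # O(log N): move last to root, sift down
        top, last = self.a[1], self.a.pop()
        if len(self.a) > 1:
            self.a[1] = last
            i, n = 1, len(self.a) - 1
            while True:
                l, r, largest = i << 1, (i << 1) + 1, i
                if l <= n and self.a[l] > self.a[largest]:
                    largest = l
                if r <= n and self.a[r] > self.a[largest]:
                    largest = r
                if largest == i:
                    break          # heap property restored
                self.a[i], self.a[largest] = self.a[largest], self.a[i]
                i = largest
        return top
```

This reproduces the slides: h = MaxHeap([16, 11, 12, 8, 10, 9]); h.insert(14) leaves h.a[1:] == [16, 11, 14, 8, 10, 9, 12] as on slide 33, and a subsequent h.delete_max() returns 16 and leaves [14, 11, 12, 8, 10, 9] as on slide 34.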

  36. Some scheduling algorithms • Outline • PIM • RRM • iSLIP (a better solution)

  37. Scheduling Algorithms • When we use a crossbar switch, we require a scheduling algorithm that matches inputs with outputs. • This is equivalent to finding a bipartite matching on a graph with N vertices on each side. • The algorithm configures the fabric during each cell time and decides which inputs will be connected to which outputs.

  38. Scheduling packets [Figure: a crossbar switch with the input side on the left and the output side on the right.] • For example, with P(input #, output #) = order to leave: P(1,1)=1, P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2 • The scheduling algorithm needs to decide the path and order of packets through the crossbar switch

  39. High performance systems • Usually, we design algorithms with the following properties: • High throughput • Starvation free • Fast • Simple to implement

  40. Parallel Iterative Matching (PIM) • PIM operates in three steps • Step 1: Request • Step 2: Grant • Step 3: Accept • Each decision is made randomly.

  41. The mathematical model of the algorithm • We can assume that every input In[i] maintains the following state information: • Table Ri[0] ... Ri[N-1], where Ri[k] = 1 if In[i] has a request for Out[k] (0 otherwise) • Table Gdi[0] ... Gdi[N-1], where Gdi[k] = 1 if In[i] receives a grant from Out[k] (0 otherwise) • Variable Ai, where Ai = k if In[i] accepts the grant from Out[k] (-1 if no output is accepted).

  42. The mathematical model (cont'd) • Every output Out[k] maintains the following state information: • Table Rdk[0] ... Rdk[N-1], where Rdk[i] = 1 if Out[k] receives a request from In[i] (0 otherwise) • Variable Gk, where Gk = i if Out[k] sends a grant to In[i] (-1 if no input is granted) • Variable Adk, where Adk = 1 if the grant from Out[k] is accepted (0 otherwise).

  43. The model of PIM • Therefore, we can represent the PIM algorithm in terms of the state tables above, as in the sketch below
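The pseudocode that originally followed is missing from the transcript, so here is a minimal Python sketch of one request-grant-accept iteration using the state tables of slides 41-42 (the function name pim_iteration is illustrative):

```python
import random

def pim_iteration(R):
    """One request/grant/accept iteration of PIM on an N x N switch.

    R[i][k] = 1 if In[i] has a request for Out[k] (the table Ri of
    slide 41). Returns the matching as a list of (input, output)
    pairs; all choices are uniform random, as PIM specifies."""
    N = len(R)
    # Grant: each output Out[k] picks one requesting input at random
    # (the variable Gk of slide 42).
    grant = {}
    for k in range(N):
        requesters = [i for i in range(N) if R[i][k]]
        if requesters:
            grant[k] = random.choice(requesters)
    # Accept: each input picks one granting output at random
    # (the variable Ai of slide 41).
    offers = {}
    for k, i in grant.items():
        offers.setdefault(i, []).append(k)
    return [(i, random.choice(outs)) for i, outs in offers.items()]
```

Further iterations simply rerun the same three steps restricted to the inputs and outputs left unmatched.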

  44. An example of the PIM algorithm [Figure: the (a) request, (b) grant, and (c) accept phases of a second PIM iteration for the packets P(1,1)=1, P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2.]

  45. Problems with PIM • Hard to implement randomness in hardware • Unfairness occurs among connections when outputs are oversubscribed • Throughput is limited to approximately 63% for a single iteration
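A quick way to see where the 63% comes from (a standard back-of-the-envelope argument, not from the slides): assume all N inputs request all N outputs. Each output grants one of its N requesters uniformly at random, so a given input receives no grant from a particular output with probability 1 - 1/N, and no grant at all with probability (1 - 1/N)^N, which tends to 1/e as N grows. The expected fraction of inputs matched in one iteration is therefore about 1 - 1/e ≈ 0.63.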

  46. The unfairness problem [Figure: a 2×2 example with arrival rates λ1,1 = λ1,2 = λ2,1 = 1; single-iteration PIM serves the connections at rates μ1,1 = 1/4, μ1,2 = 3/4, μ2,1 = 3/4.]

  47. Round-Robin Matching Algorithm (RRM) • Use rotating priority to match inputs and outputs • Each port needs a pointer gi to identify the highest-priority element • Apply rotating priority at both inputs and outputs

  48. The model of RRM • RRM uses the same request/grant/accept structure and state tables as PIM, but replaces the random choices with pointer-based round-robin choices, as in the sketch below
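In the same spirit as the PIM sketch, a minimal Python sketch of one RRM pass, assuming grant pointers g[k] at the outputs and accept pointers a[i] at the inputs (names illustrative). Note that g[k] advances even when the grant is not accepted, which is the behaviour behind the synchronization problem on slide 50; iSLIP's improvement is to advance the grant pointer only when the grant is accepted.

```python
def rrm_iteration(R, g, a):
    """One request/grant/accept pass of RRM on an N x N switch.

    R[i][k] = 1 if input i requests output k; g[k] is output k's
    round-robin grant pointer, a[i] is input i's accept pointer.
    Returns the matching as a list of (input, output) pairs."""
    N = len(R)
    # Grant: each output grants the requesting input nearest to
    # (at or after) its pointer g[k].
    grant = {}
    for k in range(N):
        for d in range(N):
            i = (g[k] + d) % N
            if R[i][k]:
                grant[k] = i
                g[k] = (i + 1) % N  # advances even if not accepted (RRM)
                break
    # Accept: each input accepts the granting output nearest to a[i].
    offers = {}
    for k, i in grant.items():
        offers.setdefault(i, []).append(k)
    match = []
    for i, outs in offers.items():
        for d in range(N):
            k = (a[i] + d) % N
            if k in outs:
                match.append((i, k))
                a[i] = (k + 1) % N
                break
    return match
```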

  49. RRM scheduling [Figure: the (a) request, (b) grant, and (c) accept phases of RRM for the packets P(1,1)=1, P(1,2)=3, P(3,2)=3, P(3,4)=1, P(4,4)=2, with grant pointers (e.g., g2, g4) and accept pointer a1 rotating over ports 1-4.]

  50. Synchronization Problem • When an output receives requests, it grants the one nearest its pointer gi, and gi then moves to a new value whether or not the grant is accepted; this makes the grant pointers of different outputs move in lockstep • For example, with λ1,1 = λ1,2 = λ2,1 = λ2,2 = 1, RRM serves each connection at rate μ = 1/4, so efficiency = 50%
