Packet Scheduling/Arbitration in Virtual Output Queues: Maximal Matching Algorithms (Part II)
Pointer Desynchronization
• Performance: RRM < iSlip < FIRM
• The three algorithms differ only in how they update their pointers
• Observation: iSlip and FIRM can effectively desynchronize their output pointers
• Pointer desynchronization is most effective when it is forced
Static Round Robin Matching (SRR): Achieving Full Desynchronization
• Initialization. The input pointers are set to 0. The output pointers are set to an initial pattern in which no two pointers have the same value.
• One iteration consists of 3 steps:
  • Request. Each input sends a request to every output for which it has a queued cell.
  • Grant. If an output receives any requests, it chooses the one that appears next in a fixed round-robin schedule, starting from the highest-priority element. The output notifies each input whether or not its request was granted. The pointer to the highest-priority element of the round-robin schedule is always incremented by one (modulo N), whether or not there is a grant.
SRR (Cont’d)
  • Accept. If an input receives any grants, it accepts the one that appears next in a fixed round-robin schedule, starting from the highest-priority element. The pointer to the highest-priority element of the round-robin schedule is incremented (modulo N) to one location beyond the accepted one.
• DSRR is an improved version of SRR in which the input pointers are also desynchronized.
• Rotating DSRR (RDSRR):
  • Addresses unfairness among inputs under special traffic models.
  • Outputs search in clockwise and anti-clockwise directions alternately to decide grants.
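The Request, Grant, and Accept steps of SRR can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulator: the names `rr_pick`, `srr_iteration`, `voq`, `grant_ptr`, and `accept_ptr` are hypothetical, `voq[i][j]` is assumed to hold the number of cells queued at input i for output j, and dequeuing of matched cells is left out.

```python
def rr_pick(candidates, pointer, n):
    """Return the first member of `candidates` found when scanning
    round-robin from `pointer`, or None if there is none."""
    for offset in range(n):
        idx = (pointer + offset) % n
        if idx in candidates:
            return idx
    return None


def srr_iteration(voq, grant_ptr, accept_ptr):
    """Run one SRR iteration. Mutates both pointer lists and
    returns the matching as {input: output}."""
    n = len(voq)
    # Request: input i requests every output j for which it has a queued cell.
    requests = [{i for i in range(n) if voq[i][j] > 0} for j in range(n)]
    # Grant: each output grants the next requesting input in round-robin order.
    grants = [set() for _ in range(n)]  # grants[i] = outputs that granted input i
    for j in range(n):
        i = rr_pick(requests[j], grant_ptr[j], n)
        if i is not None:
            grants[i].add(j)
        # The output pointer always advances by one, grant or no grant.
        grant_ptr[j] = (grant_ptr[j] + 1) % n
    # Accept: each input accepts the next granting output in round-robin order.
    matches = {}
    for i in range(n):
        j = rr_pick(grants[i], accept_ptr[i], n)
        if j is not None:
            matches[i] = j
            # The input pointer moves to one location beyond the accepted output.
            accept_ptr[i] = (j + 1) % n
    return matches


voq = [[1] * 4 for _ in range(4)]   # every VOQ holds one cell
grant_ptr = [0, 1, 2, 3]            # no duplication among output pointers
accept_ptr = [0, 0, 0, 0]           # input pointers start at 0
matches = srr_iteration(voq, grant_ptr, accept_ptr)
```

With distinct output pointers and every VOQ occupied, each output grants a different input, so one iteration already matches all four ports. Starting all output pointers at the same value instead makes every output grant the same input, and only one match results.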
Simulation Results: 32x32 switch under uniform traffic
[Figure: relative average delay vs. normalized load (0.1 to 1) for iSlip, FIRM, SRR, DSRR, and RDSRR]
Simulation Results: 32x32 switch under uniform bursty traffic
[Figure: relative average delay vs. normalized load (0.1 to 1) for iSlip, FIRM, SRR, DSRR, and RDSRR]
Simulation Results: 32x32 switch under hotspot traffic
[Figure: relative average delay (log scale, 10^0 to 10^4) vs. normalized load (0.05 to 0.55) for iSlip, FIRM, SRR, DSRR, and RDSRR]
Simulation Results: 32x32 switch under unbalanced traffic
[Figure: average delay (log scale, 10^0 to 10^4) vs. normalized load (0.1 to 1) for iSlip, FIRM, SRR, DSRR, and RDSRR]
Stability Property
• A VOQ switch is considered stable if it approaches a steady state where the expected length of each VOQ is bounded. If it is stable, 100% throughput can be achieved under any admissible traffic pattern.
• RDSRR is more stable than iSlip and FIRM under various traffic patterns.
Stability Property (Cont’d): 32x32 switch under unbalanced traffic
[Figure: throughput (0.94 to 1.01) vs. normalized load (0.8 to 1); legend: iSlip, FIRM, RDSRR, Output]
3-Phase & 2-Phase Algorithms
• iSlip and FIRM are 3-phase algorithms: Request-Grant-Accept
• DRRM is a 2-phase algorithm: Grant-Accept
  • Each input sends one grant
  • Each output sends one accept
• 2-FIRM is the 2-phase version of FIRM
3-Phase & 2-Phase Algorithms: 32x32 switch under uniform traffic
[Figure: relative average delay vs. normalized load (0.1 to 1) for iSlip, DRRM, FIRM, and 2-FIRM]
3-Phase & 2-Phase Algorithms: 32x32 switch under hotspot traffic
[Figure: relative average delay (log scale, 10^0 to 10^4) vs. normalized load (0.05 to 0.55) for iSlip, DRRM, FIRM, and 2-FIRM]
3-Phase & 2-Phase Algorithms
• In the general case, the traffic pattern changes over time
• When the temporary non-uniformity is on the input side, the 3-phase scheme performs better
• When the temporary non-uniformity is on the output side, the 2-phase scheme performs better
2-stage Maximum Size Matching Algorithm: Description
• The 2-stage algorithm works as follows:
1. The pointers at both the input and output sides are kept fully desynchronized.
2. Each iteration has 3 steps:
Step 1: Each input sends a request to every output for which it has a queued cell.
Step 2: Each input selects one non-empty VOQ, the one that appears next starting from its highest-priority output, and sends a grant to that output. Each output selects one of the requests received in step 1, the one that appears next starting from its highest-priority input, and sends a grant to that input.
OutputCount = number of outputs receiving grants from inputs.
InputCount = number of inputs receiving grants from outputs.
2-stage Maximum Size Matching Algorithm: Description
Step 3: If OutputCount ≥ InputCount, each output selects, among the grants received in step 2, the one that appears next starting from its highest-priority input, and sends an accept.
Otherwise, each input selects, among the grants received in step 2, the one that appears next starting from its highest-priority output, and sends an accept.
• In other words, in each time slot the algorithm chooses between the 2-phase and the 3-phase scheme according to which one makes more matches.
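The three steps of the 2-stage algorithm can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `two_stage_iteration`, `in_ptr`, and `out_ptr` are hypothetical, a tie between the two counts is assumed to go to the 2-phase branch, and pointer updates are omitted because the slides only say that the pointers are kept fully desynchronized.

```python
def rr_pick(candidates, pointer, n):
    """First member of `candidates` in round-robin order from `pointer`, or None."""
    for offset in range(n):
        idx = (pointer + offset) % n
        if idx in candidates:
            return idx
    return None


def two_stage_iteration(voq, in_ptr, out_ptr):
    """Return (scheme, {input: output}) for one time slot.
    Pointer updates are deliberately left out (not specified in the slides)."""
    n = len(voq)
    # Step 2, input side: each input sends one grant, for its next non-empty VOQ.
    grants_to_output = {}  # output -> inputs that sent it a grant
    for i in range(n):
        j = rr_pick({j for j in range(n) if voq[i][j] > 0}, in_ptr[i], n)
        if j is not None:
            grants_to_output.setdefault(j, set()).add(i)
    # Step 2, output side: each output grants one of the requests it received.
    grants_to_input = {}   # input -> outputs that sent it a grant
    for j in range(n):
        i = rr_pick({i for i in range(n) if voq[i][j] > 0}, out_ptr[j], n)
        if i is not None:
            grants_to_input.setdefault(i, set()).add(j)
    output_count = len(grants_to_output)
    input_count = len(grants_to_input)
    # Step 3: the side that can produce more matches sends the accepts.
    if output_count >= input_count:  # tie handling is an assumption
        matches = {rr_pick(ins, out_ptr[j], n): j
                   for j, ins in grants_to_output.items()}
        return "2-phase", matches
    matches = {i: rr_pick(outs, in_ptr[i], n)
               for i, outs in grants_to_input.items()}
    return "3-phase", matches


# Three inputs all hold cells for output 0; inputs 0 and 1 also hold one other cell.
scheme_a, matches_a = two_stage_iteration(
    [[1, 1, 0], [1, 0, 1], [1, 0, 0]], in_ptr=[1, 2, 0], out_ptr=[0, 1, 2])
# Input 0 holds cells for every output; inputs 1 and 2 hold one cell each.
scheme_b, matches_b = two_stage_iteration(
    [[1, 1, 1], [1, 0, 0], [0, 1, 0]], in_ptr=[0, 1, 2], out_ptr=[1, 2, 0])
```

Each count equals the number of matches its scheme would produce, because every input sends at most one grant on the input side and every output sends at most one grant on the output side. Comparing the two counts is therefore enough to pick the larger matching.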
2-stage Maximum Size Matching Algorithm: Hardware Implementation
[Figure: block diagram. The state of the input queues (N^2 bits) feeds N grant arbiters and N accept arbiters, serving a 1st and a 2nd group of inputs; an output counter and an input counter feed a comparator, which drives a decision register over 2 physical lines.]
Performance Evaluation: Simulation Study (Uniform Traffic)
[Figures: 2-stage over iSlip; SRR over iSlip]
Performance Evaluation: Simulation Study (Bursty Traffic)
[Figures: 2-stage over iSlip; SRR over iSlip]
Performance Evaluation: Simulation Study (Hotspot Traffic)
[Figures: 2-stage over iSlip; SRR over iSlip]
Performance Evaluation: Simulation Study (Unbalanced Traffic)
[Figures: 2-stage over iSlip; SRR over iSlip]
A New Algorithm: RDESRR
• Real Desynchronized Round Robin Model (RDESRR)
• Based on the 2-phase RRM model (Request and Grant)
• Adds a small shared memory that every output can read and write (called Share Bits)
• The size of the memory is 1 bit per input
• If the bit is set, the corresponding input has already been granted by an output
• If the bit is not set, an output may grant the corresponding input port
RDESRR Conceptual Model
[Figure: 4x4 example with inputs 0 to 3, outputs 0 to 3, a round-robin pointer ring per output, and the Share Bits]
RDESRR Model
• 2 phases only
• Request. Each input sends a request to every output for which it has a queued cell.
• Grant. If an output receives any requests, it chooses the one that appears next in a fixed round-robin schedule, starting from the highest-priority element. The output checks whether the corresponding share bit is set. If it is not set, the output sets the bit and notifies the input that its request was granted. Otherwise, the output moves on to the next request, until all requests have been examined. The pointer g_i to the highest-priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input. If no request is received, the pointer stays unchanged.
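The Request and Grant phases with Share Bits can be sketched as below. This is a hedged illustration only: `rdesrr_slot`, `voq`, and `grant_ptr` are hypothetical names, the outputs are assumed to access the Share Bits one after another in index order (the slides only say first come, first served), and the pointer is assumed to stay unchanged when every requesting input has already been granted elsewhere.

```python
def rdesrr_slot(voq, grant_ptr):
    """One RDESRR time slot. Mutates `grant_ptr`; returns {input: output}."""
    n = len(voq)
    share_bits = [False] * n   # one bit per input, reset at the start of each slot
    matches = {}
    for j in range(n):         # assumed access order: output 0 first
        # Scan requesting inputs in round-robin order from the pointer.
        for offset in range(n):
            i = (grant_ptr[j] + offset) % n
            has_request = voq[i][j] > 0
            if has_request and not share_bits[i]:
                share_bits[i] = True              # claim the input
                matches[i] = j                    # grant its request
                grant_ptr[j] = (i + 1) % n        # one beyond the granted input
                break
            # Otherwise: no request from i, or i already granted; keep scanning.
    return matches


voq = [[1] * 4 for _ in range(4)]   # every VOQ holds one cell
grant_ptr = [0, 0, 0, 0]            # fully synchronized pointers
matches = rdesrr_slot(voq, grant_ptr)

idle_ptr = [2, 2, 2, 2]
idle = rdesrr_slot([[0] * 4 for _ in range(4)], idle_ptr)   # no requests at all
```

Even with all pointers starting at the same input, the Share Bits push each later output on to the next free input, so the slot ends with a full match and the pointers leave the slot desynchronized.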
RDESRR Demo: Request
• Step 1: Request
[Figure: 4x4 example, inputs 0 to 3 and outputs 0 to 3]
RDESRR Demo: Add a Shared Memory at the Outputs
• Step 2: Grant
• Add a small shared memory that every output can read and write (called Share Bits)
[Figure: 4x4 example with output pointers and Share Bits]
RDESRR Demo: Outputs Check the Share Bits
• Step 2: Grant
• Each output checks whether the corresponding share bit is set
[Figure: 4x4 example with output pointers and Share Bits]
RDESRR Demo: When a Share Bit Is Occupied
• Step 2: Grant
• If the bit is not set, the output sets it and notifies the input that its request was granted
• Share bits are allocated first come, first served
[Figure: 4x4 example with output pointers and Share Bits]
RDESRR Demo: Output Looks for the Next Request
• Step 2: Grant
• If the bit is set, the output moves on to the next request, until all requests have been examined
[Figure: 4x4 example with output pointers and Share Bits]
RDESRR Demo: All Share Bits Are Allocated
• Step 2: Grant
• Once every share bit is allocated, every input request has been granted
[Figure: 4x4 example with output pointers and Share Bits]
RDESRR Demo: Pointer Update and Share Bit Reset
• The pointer g_i to the highest-priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input
• If no request is received, the pointer stays unchanged
• The share bits are also reset
[Figure: 4x4 example with updated output pointers and cleared Share Bits]
SIM Results
• Run the test for a 32x32-port switch in SIM using -l 1000000
Input Queueing: Longest Queue First or Oldest Cell First
• Weight = queue length or waiting time
• Maximum weight matching: 100% throughput
[Figure: 4x4 bipartite graph between inputs 1 to 4 and outputs 1 to 4, with edge weights of 1 and 10]
Input Queueing: Why is serving long/old queues better than serving the maximum number of queues?
• When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become longer than others.
• A good algorithm keeps the queue lengths matched, and services a large number of queues.
[Figure: average occupancy per VOQ # under uniform traffic and under non-uniform traffic]
Maximum/Maximal Weight Matching
• 100% throughput for admissible traffic (uniform or non-uniform)
• Maximum Weight Matching
  • OCF (Oldest Cell First): w = cell waiting time
  • LQF (Longest Queue First): w = input queue occupancy
  • LPF (Longest Port First): w = QL (queue length) of the source port + sum of QL from the source port to the destination port
• Maximal Weight Matching (practical algorithms)
  • iOCF
  • iLQF
  • iLPF (the comparators in the critical path of iLQF are removed)
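To make the term "maximum weight matching" concrete, the sketch below finds one by brute force over all input-to-output permutations. It is only practical for very small N and is not how a real scheduler would compute it; the function name `max_weight_matching` and the example weights are hypothetical. With LQF the weight matrix is the VOQ occupancy, with OCF it is the waiting time of the head-of-line cell.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Return the maximum weight matching as {input: output}.
    weights[i][j] is the weight of serving VOQ (i, j); 0 means empty."""
    n = len(weights)
    # Try every one-to-one assignment of inputs to outputs.
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(weights[i][perm[i]] for i in range(n)),
    )
    # Drop pairs whose VOQ is empty: there is nothing to transfer for them.
    return {i: j for i, j in enumerate(best) if weights[i][j] > 0}


# LQF example: weights are queue lengths.
lqf_weights = [[10, 1],
               [10, 0]]
matching = max_weight_matching(lqf_weights)
total = sum(lqf_weights[i][j] for i, j in matching.items())
```

In this example both inputs have a long queue for output 0. Serving input 0 on output 0 would leave input 1 idle (total weight 10), whereas the chosen matching serves both inputs (total weight 11).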
Maximal Weight Matching Algorithms: iLQF • Request. Each unmatched input sends a request word of width bits to each output for which it has a queued cell, indicating the number of cells that it has queued to that output. • Grant. If an unmatched output receives any requests, it chooses the largest valued request. Ties are broken randomly. • Accept. If an unmatched input receives one or more grants, it accepts the one to which it made the largest valued request. Ties are broken randomly.
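The Request, Grant, and Accept steps of iLQF can be sketched as follows. This is a minimal illustration under assumptions, not a reference implementation: the names `ilqf`, `voq`, `iterations`, and `rng` are hypothetical, and the request word is modelled simply as the integer queue length.

```python
import random


def ilqf(voq, iterations=1, rng=random):
    """Run up to `iterations` iLQF iterations; returns {input: output}."""
    n = len(voq)
    matched = {}            # input -> output
    matched_outputs = set()
    for _ in range(iterations):
        # Request + Grant: each unmatched output grants its largest request.
        grants = {}         # input -> outputs that granted it
        for j in range(n):
            if j in matched_outputs:
                continue
            requests = [(voq[i][j], i) for i in range(n)
                        if i not in matched and voq[i][j] > 0]
            if not requests:
                continue
            largest = max(length for length, _ in requests)
            # Ties are broken randomly.
            i = rng.choice([i for length, i in requests if length == largest])
            grants.setdefault(i, []).append(j)
        if not grants:
            break           # no unmatched pair can be added any more
        # Accept: each input accepts the grant for its largest request.
        for i, outputs in grants.items():
            largest = max(voq[i][j] for j in outputs)
            j = rng.choice([j for j in outputs if voq[i][j] == largest])
            matched[i] = j
            matched_outputs.add(j)
    return matched


voq = [[5, 4, 0],
       [0, 3, 0],
       [0, 0, 1]]
after_one = ilqf(voq, iterations=1)
after_two = ilqf(voq, iterations=2)
```

In the first iteration outputs 0 and 1 both grant input 0, which accepts output 0 because its queue of 5 is the longest. Output 1 is left unmatched and is picked up by input 1 in the second iteration.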
Maximal Weight Matching Algorithms: iLQF
• The iLQF algorithm has the following properties:
• Property 1. Independent of the number of iterations, the longest input queue is always served.
• Property 2. As with iSlip, the algorithm converges in at most logN iterations.
• Property 3. For an inadmissible offered load, an input queue may be starved.
Maximal Weight Matching Algorithms: iOCF
• The iOCF algorithm works in a similar fashion to iLQF, and has the following properties:
• Property 1. Independent of the number of iterations, the cell that has been waiting the longest time in the input queues (it must be at the head of its queue) is always served.
• Property 2. As with iLQF, the algorithm converges in at most logN iterations.
• Property 3. No input queue can be starved indefinitely.
• Property 4. It is difficult to keep time stamps on the cells.
iLPF: Implementation
• Complicated hardware
Other Research Efforts
• Packet-based arbitration
• Exhaustive-based arbitration
• Numerous other efforts
Packet Scheduling/Arbitration in Virtual Output Queues: Randomized Algorithms and Others
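Input-Queued Packet Switch ($\sum_i \lambda_{i,j} < 1$; $\sum_j \lambda_{i,j} < 1$)
[Figure: N inputs with per-output queues labeled $\lambda_{1,1}$ ... $\lambda_{i,j}$ ... $\lambda_{N,N}$ and occupancies $X_{i,j}$, a crossbar to N outputs, and a scheduler]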
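The condition in parentheses on this slide is read here as the usual admissibility requirement: for the matrix of arrival rates, every row sum (load offered by one input) and every column sum (load destined to one output) is below 1. The symbol for the rates is lost in the extracted text, so treating them as a matrix `rates[i][j]` is an assumption, as is the function name `is_admissible`.

```python
def is_admissible(rates):
    """Check that no input and no output is oversubscribed.
    rates[i][j] is the assumed arrival rate (cells per slot) from input i to output j."""
    n = len(rates)
    # Row sums: total load offered by each input.
    inputs_ok = all(sum(rates[i]) < 1 for i in range(n))
    # Column sums: total load destined to each output.
    outputs_ok = all(sum(rates[i][j] for i in range(n)) < 1 for j in range(n))
    return inputs_ok and outputs_ok


uniform = [[0.3] * 3 for _ in range(3)]   # each input and output loaded at 0.9
hotspot = [[0.5, 0.1],
           [0.6, 0.1]]                    # output 0 is offered a load of 1.1
```

The uniform pattern passes the check, while the hotspot pattern fails it because output 0 is asked to carry more than one cell per slot, which no scheduler can sustain.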
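Bipartite Graph and Matrix
[Figure: bipartite graph between inputs 1 to 3 and outputs 1 to 3, shown next to its 3x3 0/1 matrix; matrix entries as extracted: 1 0 0 1 1 1 1 1 0]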
Stability of Scheduling
Definition: Let $X_{i,j}(t)$ be the number of packets queued at input i for output j at time slot t. Then an algorithm is stable iff: