

  1. Load-balanced optical packet switch using two-stage time-slot interchangers. IEICE 2004. A. Cassinelli, A. Goulet, M. Ishikawa (University of Tokyo, Department of Information Physics and Computing); M. Naruse, F. Kubota (National Institute of Information and Communications Technology).

  2. Plan of the presentation
  I. Introduction:
  • The ideal packet switch and our goals
  • Some assumptions
  • The BVN switch and its scheduling-complexity bottleneck
  II. The load-balanced BVN switch:
  • The LB stage simplifies the scheduling of a BVN switch
  • It makes the switch performance independent of traffic non-uniformities
  III. Optical implementation of the load-balancing switch:
  The load-balancing architecture allows a simple, deterministic buffer schedule, ideal for an optical implementation using fiber delay-line based TSIs...
  III.1 Single-stage TSI and resulting LBS performance
  III.2 Double-stage TSI and resulting LBS performance
  IV. Conclusion and further research
  V. References

  3. I. Introduction
  The ideal packet switch should:
  • Provide high throughput for any kind of traffic
  • Be stable: queues in the buffers should remain bounded
  • Have low delays
  • Manage priority traffic: provide throughput guarantees for some ports and reduced delays for such traffic
  Our goal here:
  • Develop an "ideal" optical packet switch for TDM, possibly for asynchronous optical networks (WDM remains an additional dimension)
  • Do so without relying on optical RAM, which is not yet a mature technology: only delay lines.

  4. BVN scheduler
  Some preliminary assumptions:
  • Time is "slotted", packets have the same size and are "aligned"
  • At most one packet arrives per time slot at each input line (no WDM)
  • The output lines are not overloaded (traffic is "admissible")
  The BVN switch: given these assumptions, a good switch candidate is the so-called Birkhoff-von Neumann switch, first proposed by Chang [1999] and based on the works of Birkhoff [1946] and von Neumann [1953]. Essentially, it is a crossbar switch that:
  • has Virtual Output Queues (VOQ) to alleviate HOL blocking,
  • relies on an efficient but rather time-consuming O(N^4.5) scheduling algorithm to find the appropriate sequence of crossbar states that services the VOQs, avoids their saturation and reduces packet delay.

  5. ... but today there is an additional constraint: given the speed of today's networks, schedulers are running short of time for computation! (From McKeown, Stanford University.) Clock cycles allowed to schedule a single packet: at 40 Gb/s, about 11 ns per ATM packet, or roughly 10 cycles of a 1 GHz processor... So, the "ideal switch" must also rely on a scheduling algorithm with very low computational complexity.

  6. There is hope...
  • It is relatively easy to prove that if the traffic is uniform, then the BVN decomposition consists of a set of N permutations providing full access. These can be cycled blindly in order to serve the VOQs... this would mean an O(1) scheduling complexity.
  • The only condition on this set of N permutations is that they provide full access (i.e., for any input-output pair, at least one permutation of the set connects that input to that output).
  Example: one cycle of four permutations p0, p1, p2, p3 providing full access for N = 4 (a sketch follows).
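  The original permutation-matrix figure did not survive transcription; below is a minimal sketch (ours, not the authors' code) of one possible full-access cycle for N = 4, using cyclic-shift permutations.

```python
# A minimal sketch of a full-access permutation cycle, assuming cyclic shifts:
# p_k(i) = (i + k) mod N.
N = 4

def p(k):
    """Permutation p_k: input i is connected to output (i + k) mod N."""
    return [(i + k) % N for i in range(N)]

# Full access: for every (input, output) pair, at least one permutation of the
# cycle connects them.
assert all(any(p(k)[i] == o for k in range(N)) for i in range(N) for o in range(N))

for k in range(N):
    print(f"p{k}:", p(k))
```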

  7. So... is there a way to pre-process an irregular traffic load so that the inputs of the switch "see" a uniform load? Answer: yes! It is called "load balancing". There are several ways to do it... The simplest (deterministic) one consists of adding an additional input switch stage, which runs through a periodic sequence of connection patterns that realize full access...

  8. II. Deterministic load balancing
  Deterministic load balancing is achieved by running an input switch through a periodic sequence of connection patterns that realize full access. A "wild" input traffic pattern becomes "subdued" traffic after load balancing:
  (1) input load balancing: the input load is equally distributed over the outputs;
  (2) destination (output) balancing: bursty traffic is also distributed (illustrated in the sketch below).
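  A small illustrative sketch (our own toy model, assuming cyclic-shift connection patterns and a single bursty source on input 0) of how deterministic balancing spreads a burst evenly over the internal lines:

```python
# Illustrative sketch: deterministic load balancing spreads a burst arriving on
# a single input evenly over the N internal lines.
from collections import Counter

N = 4
packets_per_line = Counter()

for t in range(1000):                 # 1000 time slots, one packet per slot on input 0
    k = t % N                         # periodic full-access pattern p_k
    internal_line = (0 + k) % N       # where input 0 is connected at time t
    packets_per_line[internal_line] += 1

print(packets_per_line)               # ~250 packets on each of the 4 internal lines
```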

  9. The load-balanced BVN switch
  The architecture has three stages: a load-balancing stage, a buffer (VOQ) stage in which each buffer maintains N VOQ FIFO queues, and a crossbar (TDM) switching stage.
  • The load-balancing stage runs through a periodic sequence of connection patterns that realize full access, just like the crossbar stage, because the traffic it sees is uniform.
  • Moreover, it is possible to prove that this two-stage architecture provides 100% throughput for a very general class of traffic [Chang & Valiant].

  10. III. Implementation of an optical load-balanced switch
  Why is the deterministic LBS well suited to an optical implementation?
  (1) Given the particularly simple interconnection requirements (TDM permutation schedule) of the load-balancing and switching stages, both can be efficiently implemented with a guided-wave Stage-Controlled Banyan Network (SC-BN).
  (2) Because of the deterministic, cyclic schedule, it is possible to emulate the VOQ FIFO buffer stage using delay lines instead of real RAM memory... the main topic of this presentation!

  11. (1) Emulation of the load-balancing and TDM switches by a stage-controlled Banyan network (SC-BN)
  • An N x N Banyan network is composed of log2(N) stages.
  • Each stage is made of N/2 2 x 2 switches.
  • In an SC-BN, all switches within a stage are set either to the bar state or to the cross state.
  • The N possible permutations of an SC-BN provide full access.
  (Figure: an 8 x 8 Omega network with ports 0-7 and stages 0, 1, 2.)
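  A minimal software model of a stage-controlled shuffle-exchange (Omega) network, written for this transcript; the topology and the `route` helper are our assumptions, not the authors' hardware description. It checks the two claims above: each stage-control word yields a permutation, and the N settings together provide full access.

```python
# Minimal model of an N x N stage-controlled Omega (Banyan-class) network,
# assuming the standard shuffle-exchange topology. Each of the log2(N) stages
# is driven by one control bit (bar = 0, cross = 1), so there are N possible
# settings, each realizing one permutation.
N = 8
n = N.bit_length() - 1                 # log2(N) stages

def route(inp, control):
    """Output port reached by 'inp' when stage s is set to bit s of 'control'."""
    x = inp
    for s in range(n):
        x = ((x << 1) | (x >> (n - 1))) & (N - 1)   # perfect shuffle (rotate bits left)
        if (control >> s) & 1:                      # cross state: 2x2 exchange
            x ^= 1
    return x

perms = [[route(i, c) for i in range(N)] for c in range(N)]

# Each setting is a permutation, and the N settings together provide full access.
assert all(sorted(p) == list(range(N)) for p in perms)
assert all(any(perms[c][i] == o for c in range(N)) for i in range(N) for o in range(N))
```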

  12. Example: SC-BN with EA gates (8 x 8 Omega network).

  13. (2) Emulation of VOQ buffers using delay lines
  (a) A packet arrives at time t on port 1, with destination port N-1.
  (b) If the LBS were not operating, the packet would be stored in queue N-1 of buffer N-1.
  (c) ... but at time t the LBS permutation was "scrambling" the data, so the packet is stored in queue N-1 of a different buffer.
  (d) Finally, this packet has to wait a deterministic amount of time for the correct permutation to be available at the second TDM stage (plus a multiple of the whole cycle, if some packet was previously scheduled for the same output).

  14. Concretely: a packet arriving on port r at time t with destination d has to be delayed by Δt = Δ + kN time slots, where Δ = (d - r - t) mod N.
  • While Δ is fixed by the packet, the parameter k can be freely tuned by the scheduling algorithm.
  • This "freedom" will be used to avoid collisions with packets previously scheduled for the same output, thus effectively emulating a FIFO queue. The way k is chosen depends on the actual TSI architecture.
  The nice thing is that, because the total delay can be computed in advance, there is no need for real memory buffers: a time-slot interchanger (TSI) architecture relying on delay lines effectively emulates the VOQs!
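  A short sketch (our notation; the formula for Δ is the one given on the contention slide) showing that, whatever k the scheduler picks, the packet leaves the buffer at the same phase of the TDM cycle:

```python
# Sketch of the delay computation: a packet arriving on port r at time t with
# destination d is delayed by delta + k*N slots, delta = (d - r - t) mod N,
# with k a free parameter.
N = 8

def delta(r, d, t):
    """Sub-cycle delay needed to meet the right TDM permutation."""
    return (d - r - t) % N

# Whatever k the scheduler picks, the packet exits the buffer at the same phase
# of the TDM cycle, i.e. it always meets the same crossbar permutation.
r, d, t = 3, 6, 17
phases = {(t + delta(r, d, t) + k * N) % N for k in range(5)}
assert len(phases) == 1
print("exit phase of the TDM cycle:", phases.pop())
```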

  15. III.1: Single-stage TSI architecture
  A 1 x Nb optical switch feeds Nb fiber delay lines (delays 0, 1, ..., Nb - 1 time slots):
  • number of delay lines: N·b
  • delay increment: 1 time slot
  • maximum delay: bN - 1
  • total fiber length: 0 + 1 + ... + (Nb - 1) = Nb(Nb - 1)/2 slot-lengths
  • equivalent VOQ FIFO size (the maximum delay plus one, divided by N): b
  ... the performance of this architecture is strictly equivalent to that of a VOQ-based buffer when using a deterministic schedule!
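  For concreteness, the parameters worked out for the N = 16, b = 30 case used later in the simulations (a simple calculation from the figures above; the fiber-length sum is our derivation from the stated 1-slot increment):

```python
# Worked-out parameters of the single-stage TSI (N*b lines, 1-slot increments,
# maximum delay N*b - 1), for the N = 16, b = 30 simulation case.
N, b = 16, 30

num_delay_lines = N * b                      # 480 fiber delay lines
max_delay = N * b - 1                        # 479 time slots
total_fiber_length = sum(range(N * b))       # N*b*(N*b - 1)/2 = 114960 slot-lengths
equivalent_fifo_size = (max_delay + 1) // N  # b = 30 packets per VOQ

print(num_delay_lines, max_delay, total_fiber_length, equivalent_fifo_size)
```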

  16. Contention resolution
  A packet arriving at time t with destination d at the input of the optical buffer has to be delayed Δt = Δ + kN time slots, where Δ = (d - r - t) mod N.
  Constraint: the packet may collide with another one when exiting the buffer (at point A in the figure), so k has to be chosen so as to avoid contention at the output of the TSI buffer.
  How? The maximum delay that a packet can be given is Nb - 1: we need to keep track of the schedule of the Nb - 1 previous time slots, using an electronic memory of size Nb - 1 (or, more simply, a single counter; but then the strategy does not generalize to multi-stage buffers). Check for a free schedule, i.e., choose a cycle delay k indicating a free space. A maximum of b checks are needed. In our simulations, k is chosen as the smallest index that indicates a free space, so as to minimize packet delay, but a more complex selection can be made to account for packet priorities.
  Remark: if a packet cannot be scheduled, it is discarded (so in fact the switch is a 1 x (Nb + 1) switch whose last line is the discard line).
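  A minimal sketch of this contention-resolution step; the names (`busy`, `schedule`) and the use of an Nb-cell array indexed directly by the candidate delay are our simplifications of the (Nb - 1)-cell memory described above:

```python
# Single-stage contention check (simplified bookkeeping): an array tracks which
# future exit slots at point A are already taken; the smallest free k is chosen,
# otherwise the packet goes to the discard line.
N, b = 4, 3
busy = [False] * (N * b)          # busy[dt]: exit slot t + dt already taken at point A

def schedule(delta):
    """Return the chosen total delay delta + k*N, or None if the packet is discarded."""
    for k in range(b):            # at most b checks => O(b) scheduling
        dt = delta + k * N
        if not busy[dt]:
            busy[dt] = True
            return dt
    return None                   # every candidate exit slot is taken: discard

def advance_one_slot():
    """When time moves to t+1, the tracked window of exit slots shifts by one."""
    busy.pop(0)
    busy.append(False)

print(schedule(2))   # -> 2  (k = 0)
print(schedule(2))   # -> 6  (k = 1): same delta arriving in the same slot takes the next cycle
```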

  17. Example: N = 4, b = 3
  The TDM permutation schedule cycles P0, P1, P2, P3 over time slots t' = t+0, t+1, ..., t+11. A packet arrives at time t, when permutation P3 is applied by the TDM switch, but the packet's destination requires P1; we therefore have Δt = 2 + k·N (Δ = 2). Each cell of the packet-schedule memory (N·b - 1 = 11 cells in total) is either occupied, free, or irrelevant for scheduling this packet; in the depicted schedule, the smallest free cycle delay is k = 2.
  The resulting scheduling algorithm is O(b) (and can be made constant-time using a single counter).

  18. Interesting remark: because contention is resolved by the scheduling algorithm, the following hardware performs equally well, the advantage being a large reduction in the number of fiber delay lines employed: in the first case we need bN(bN - 1)/2, while in the second implementation only Nb. This is important when considering scaling of the number of inputs/outputs (N) or of the amount of buffering (b).

  19. LBS performance using a single-stage TSI (simulation)
  N = 16 inputs/outputs, 10^8 packets per load value.
  Remark: b = 30 corresponds to a FIFO buffer holding a maximum of 30 packets; this is very little compared with the thousands of packets held by some shared-memory buffers on the market...
  (Traffic is assumed to be i.i.d. Bernoulli at the exit of the LB stage.)

  20. LBS performance using a single-stage TSI (simulation)
  N = 16 inputs/outputs, 10^8 packets per load value.
  (Traffic is assumed to be i.i.d. Bernoulli at the exit of the LB stage.)
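  A rough Monte-Carlo sketch of this kind of simulation for a single TSI buffer (our simplified model with Bernoulli arrivals and uniform destinations; it will not reproduce the paper's exact curves):

```python
# Monte-Carlo sketch of one TSI buffer fed by i.i.d. Bernoulli traffic at the
# exit of the LB stage: estimates the packet-loss probability of the
# single-stage scheduler.
import random

def loss_probability(N=16, b=8, load=0.9, slots=200_000, seed=1):
    random.seed(seed)
    busy = [False] * (N * b)              # exit-slot occupancy at point A
    arrivals = lost = 0
    for t in range(slots):
        if random.random() < load:        # Bernoulli arrival on this input line
            arrivals += 1
            # r = 0 for this buffer; uniform destination => uniform phase delta
            delta = (random.randrange(N) - t) % N
            for k in range(b):
                dt = delta + k * N
                if not busy[dt]:
                    busy[dt] = True
                    break
            else:
                lost += 1                 # no free candidate slot: packet discarded
        busy.pop(0)                       # move to the next time slot
        busy.append(False)
    return lost / arrivals

print(loss_probability())
```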

  21. Feasibility problems (single stage)
  (Power-budget diagram of the single-stage implementation: broadcast & select module (EA module), merging module, and fiber delay-line module, with a booster EDFA and a pre-amplifier EDFA between input and output. Labels recoverable from the figure, assuming an input signal level of 0 dBm: EDFA saturated output 20 dBm; EDFA gain +20 dB; EDFA minimum input -10 dBm; 1:32 broadcast loss -15 dB; EA and interfacing loss -10 dB; EA valid range below 13 dBm; fiber and interfacing loss -2.5 dB; waveguide and interfacing loss -2.5 dB; accumulated loss on the order of -30 dB!)

  22. III.2: Double-stage TSI buffer
  Why? For architectural reasons: for a constant total amount of delay, a multistage architecture uses far fewer fiber delay lines, hence smaller switches!
  First stage: b0 FDLs, increment 1 time slot, maximum delay b0 - 1. Second stage: b1 FDLs, increment b0 time slots, maximum delay (b1 - 1)·b0.
  • number of delay lines: b0 + b1 ... vs. b0·b1 in the case of a single stage
  • delay increment (depends on the stage): 1 for the first stage, b0 for the second stage
  • maximum delay: b1·b0 - 1
  • total fiber length: b0(b0 - 1)/2 + b0·b1(b1 - 1)/2 slot-lengths
  • equivalent VOQ FIFO size: b0·b1/N
  By making the delay increment of the second stage equal to the maximum delay of the first stage plus one, we ensure a unique decomposition of the required total delay Δt, which further simplifies the scheduling complexity...

  23. In the following we take b0 = N (the number of inputs/outputs) and let b1 = b be variable, b corresponding to the equivalent size of a VOQ FIFO buffer. One stage acts as the sub-cycle delay stage (providing Δ), the other as the cycle delay stage (providing k):
  • number of delay lines: b + N ... vs. b·N for a single stage
  • delay increment (depends on the stage): 1 for the sub-cycle stage, N for the cycle stage
  • maximum delay: b·N - 1
  • total fiber length: N(N - 1)/2 + N·b(b - 1)/2 slot-lengths ... vs. Nb(Nb - 1)/2 for a single stage
  • equivalent VOQ FIFO size: b = b1
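  A small sketch (our notation) of the unique Δ/k decomposition this choice of b0 = N provides, together with the hardware comparison against the single-stage TSI; the fiber lengths are computed from the stated increments:

```python
# Unique delay decomposition for b0 = N: any total delay Dt in [0, N*b - 1]
# splits uniquely into Dt = delta + k*N, with delta in [0, N-1] served by the
# 1-slot-increment stage and k in [0, b-1] served by the N-slot-increment stage.
N, b = 16, 30

def decompose(dt):
    """(delta, k) pair realizing a total delay of dt time slots."""
    return dt % N, dt // N

print("Dt = 100 ->", decompose(100))      # (4, 6): 4 + 6*16 = 100

# Uniqueness check: exactly one (delta, k) pair per achievable delay.
assert all(sum(1 for delta in range(N) for k in range(b) if delta + k * N == dt) == 1
           for dt in range(N * b))

# Hardware comparison with the single-stage TSI (lengths in slot units).
print("delay lines :", N + b, "vs", N * b)                                   # 46 vs 480
print("fiber length:", sum(range(N)) + N * sum(range(b)), "vs", sum(range(N * b)))
```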

  24. Contention
  Now there are two locations where contention can happen:
  • at the exit A of the first stage (S1)
  • at the exit B of the second (and final) stage (S2)
  Exit of stage S1: the maximum delay that a packet can be given in S1 is (b - 1)N time slots, so we need to keep track of the (b - 1)N previous time slots: an electronic memory MEM_S1 of size (b - 1)N indicates which time slots at the exit of S1 are "busy" or "free".
  Exit of stage S2: the maximum delay that a packet can be given by the whole optical buffer is (b - 1)N + N - 1 = bN - 1, so we need to keep track of the bN - 1 previous time slots: an electronic memory MEM_S2 of size bN - 1 indicates which time slots at the exit of S2 are "busy" or "free".
  Remark: if a packet cannot be scheduled, it is discarded at the first stage (so in fact the first-stage switch is a 1 x (b + 1) switch whose last line is the discard line). Discarding a packet at a stage other than the first would be necessary only if one used another scheduling strategy, for instance a non-unique delay decomposition.
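  A minimal sketch of the resulting two-stage contention check (our simplification: both occupancy arrays are kept at b·N cells, whereas the slide uses (b - 1)N and bN - 1). Following the worked example of the next slides, a packet is delayed k·N in S1 and Δ in S2:

```python
# Two-stage contention check: slot t + k*N must be free at exit A of S1 and
# slot t + k*N + delta must be free at exit B of S2.
N, b = 4, 3
MEM_S1 = [False] * (b * N)        # occupancy of future exit slots of stage S1
MEM_S2 = [False] * (b * N)        # occupancy of future exit slots of stage S2

def schedule(delta):
    """Pick the smallest k whose (S1, S2) slot pair is free; None => discard."""
    for k in range(b):            # b pairs to check => O(b1) scheduling
        s1, s2 = k * N, k * N + delta        # delta is in [0, N-1]
        if not MEM_S1[s1] and not MEM_S2[s2]:
            MEM_S1[s1] = MEM_S2[s2] = True
            return k
    return None

def advance_one_slot():
    """Shift both occupancy windows by one time slot."""
    for mem in (MEM_S1, MEM_S2):
        mem.pop(0)
        mem.append(False)

print(schedule(2))   # -> 0
print(schedule(2))   # -> 1 (same delta arriving in the same slot takes the next pair)
```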

  25. Remark: again, the contention-avoidance schedule enables the following fiber-length-reducing architecture to work equally well (in the example, b1 = b and b0 = N). (Figure: the two delay-line stages, 1 x b and 1 x N, rearranged to reduce total fiber length.)

  26. Temporal diagram of the permutation schedule and of the first and second "crosspoint" schedules (MEM_S1, MEM_S2)
  Example: N = 4, b = 3 (b1 = b, b0 = N). The permutation schedule cycles P0, P1, P2, P3 over time slots t' = t+0, ..., t+11; MEM_S1 has (b1 - 1)·b0 = (b - 1)N = 8 cells and MEM_S2 has b1·b0 - 1 = bN - 1 = 11 cells.
  Remark: these schedule positions do not need to be stored in memory, since they are always free at the start of a scheduling cycle.
  The permutation schedule represents the permutation available at the exit of the TSI buffers at each future time slot t' (there are N possible permutations). The permutation schedule is not computed as a function of the traffic, as it is in a BVN switch; it is deterministic (TDM), so we do not need to store any scheduling memory array for it.

  27. Example: N = 4, b = 3 (b1 = b, b0 = N)
  A packet arrives at time t such that the requested permutation is P1; we then have Δ = 2. Pairs of cells (MEM_S1, MEM_S2) are checked for k = 0, 1, 2; cells are either occupied, free, or irrelevant for this packet, and here k = 2 is the smallest value whose pair is free.
  The packet is scheduled to go through S1 at time t' = t + 2N = t + 8, and exits the network through S2 at time t' = t + 2N + 2 = t + 10. Both cells of the selected pair are marked "busy", and the arrays are then shifted to the left by one.
  b1 pairs to check => an O(b1) schedule!

  28. In the previous example, b1 = 3 pairs had to be taken into consideration (the figure shows b1 = 3 lines in the first stage and b0 = 4 lines in the second stage).
  • In general, a maximum of b1 memory locations have to be checked.
  So the overall complexity of the scheduling algorithm is O(b1). (A strategy using counters is not easy to implement here, and may lead to sub-optimal schedules.)

  29. One vs. two buffer stages (for the same total fiber length)
  This indicates that collision avoidance at the intermediate stage slightly degrades performance: there is a trade-off between architectural considerations and performance.

  30. Conclusion
  The proposed two-stage load-balanced photonic switch:
  • Because it is an LBS, it can achieve high throughput under bursty traffic.
  • Because deterministic balancing is used:
  - guided-wave-integrable stage-controlled Banyan networks can be used both for the switching stage and for the balancing stage;
  - there is no need for optical memories for buffering, only fiber delay lines functioning as a TSI.
  • Has a scheduling complexity of O(b), where b is the equivalent size of an electronic FIFO buffer.
  • Can (potentially) handle traffic priorities by making k priority-dependent.
  • Its performance degrades only slightly compared to a single-stage TSI (*), while:
  - enabling a very large reduction in the number of delay lines,
  - thus using "buffer space" more efficiently.
  • It would be possible to modify the architecture so as to handle asynchronous traffic and variable-length packets using only TSIs, as in [Harai].
  (*) The performance of a single-stage-based photonic switch using Nb - 1 FDLs is strictly equivalent to that of an LBS using RAM buffers composed of N FIFO queues, each of size b.

  31. ... Further research: generic multi-stage delay-line buffers
  There are thousands of ways of implementing a generic multistage buffer (for example the 16-64, 16-8-8 and 8-8-8-8 stage configurations). One that provides a unique decomposition of the scheduled delay, however, is such that b_i = l_{i-1}·b_{i-1} = l_0·l_1·l_2·...·l_{i-1}. For the first stage S_0, b_0 is equal to a delay of one time slot. Hence, the maximum delay that can be given to a packet by the whole TSI is equal to B = l_0·l_1·l_2·...·l_{n-1} (this is also the maximum number of packets that the TSI can hold). For a switch with N ports, it is comparable to N VOQ queues of length B_e = B/N.
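  A short sketch (our code) of this unique decomposition, which is simply a mixed-radix representation of the total delay; the 16-8-8 configuration and N = 64 are taken from the surrounding slides:

```python
# Generic multistage decomposition: with stage increments
# b_i = l_0 * l_1 * ... * l_{i-1} (b_0 = 1), every delay in [0, B - 1],
# B = l_0 * l_1 * ... * l_{n-1}, has a unique per-stage split.
from math import prod

def split_delay(dt, lines):
    """Per-stage delays (in units of each stage's increment) for a total delay dt."""
    digits = []
    for l in lines:               # lines = (l_0, l_1, ..., l_{n-1})
        digits.append(dt % l)
        dt //= l
    return digits                 # stage i contributes digits[i] * b_i time slots

lines = (16, 8, 8)                # the "16-8-8" configuration mentioned above
B = prod(lines)                   # 1024 slots the TSI can hold
increments = [prod(lines[:i]) for i in range(len(lines))]   # b_i = 1, 16, 128

dt = 777
digits = split_delay(dt, lines)
assert sum(d * b for d, b in zip(digits, increments)) == dt
print(digits, increments, B, B // 64)    # B/N = equivalent VOQ length for N = 64
```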

  32. Packet loss probability (N = 64).

  33. Average packet delay (N = 64).
