This presentation introduces a prediction router for low-latency NoCs, compares it with speculative, look-ahead, and bypassing routers, and evaluates prediction hit rates, gate count, and energy consumption in several case studies.
Prediction Router: Yet another low-latency on-chip router architecture
Hiroki Matsutani (Keio Univ., Japan), Michihiro Koibuchi (NII, Japan), Hideharu Amano (Keio Univ., Japan), Tsutomu Yoshinaga (UEC, Japan)
Why is a low-latency router needed?
Tile architecture: many cores (e.g., processors & caches) connected by an on-chip interconnection network (a packet-switched network) [Dally, DAC'01]
[Figure: 16-core tile architecture, with each core attached to an on-chip router]
The on-chip router affects both the performance and the cost of the chip
Why is a low-latency router needed?
As the number of cores increases (e.g., to 64 cores or more), the number of hops increases and communication latency becomes a crucial problem.
Low-latency router architectures have therefore been studied extensively.
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
  • Speculative router
  • Look-ahead router
  • Bypassing router
• Prediction router
  • Architecture and the prediction algorithms
  • Hit rate analysis
• Evaluations
  • Hit rate, gate count, and energy consumption
  • Case study 1: 2-D mesh (small core size)
  • Case study 2: 2-D mesh (large core size)
  • Case study 3: Fat tree network
Wormhole router: Hardware structure
1) Selecting an output channel, 2) arbitration for the selected output channel (GRANT), 3) sending the packet to the output channel
[Figure: 5x5 crossbar router with per-input FIFOs and output ports for X+, X-, Y+, Y-, and the core, plus an arbiter]
Routing, arbitration, and switch traversal are performed in a pipelined manner
Speculative router: VA/SA in parallel [Peh, HPCA'01]
Pipeline structure of a 3-cycle router; at least 3 cycles for traversing a router:
• RC (routing computation)
• VSA (virtual-channel & switch allocation): VA & SA are performed speculatively in parallel
• ST (switch traversal)
[Figure: pipeline diagram of a packet transfer from router A to router C; at each router the head flit goes through RC, VSA, and ST, and each data flit through SA and ST]
At least 12 cycles are needed to transfer a packet (head flit + 3 data flits) from router A to router C.
To perform RC and VSA in parallel, look-ahead routing is used.
Look-ahead router: RC/VA in parallel
At least 3 cycles for traversing a router:
• NRC (next-hop routing computation): routing computation for the next hop; the output port of router i+1 is selected by router i
• VSA (virtual-channel & switch allocation): can be performed without waiting for NRC
• ST (switch traversal)
[Figure: pipeline diagram; at each router the head flit goes through NRC, VSA, and ST, and each data flit through SA and ST]
Look-ahead router: RC/VA in parallel
At least 2 cycles for traversing a router; a typical example of a 2-cycle router [Dally's book, 2004]:
• NRC + VSA (next-hop routing computation / arbitrations): there is no dependency between NRC and VSA, so they are performed in parallel
• ST (switch traversal)
[Figure: pipeline diagram; at each router the head flit goes through NRC+VSA and ST]
Packing NRC, VSA, and ST into a single stage would harm the operating frequency.
At least 9 cycles are needed to transfer a packet from router A to router C.
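As a sanity check on the cycle counts quoted above (12 cycles with the 3-cycle router, 9 with the 2-cycle router), here is a minimal zero-load wormhole latency model in Python, assuming a 4-flit packet (head + 3 data flits) and 3 routers on the path, as in the pipeline diagrams; it is only an illustration, not the simulator used for the evaluations.

```python
def zero_load_latency(routers: int, cycles_per_router: int, flits: int = 4) -> int:
    """Zero-load wormhole latency: the head flit pays the full router
    pipeline at every hop, and the remaining flits follow one per cycle."""
    return routers * cycles_per_router + (flits - 1)

print(zero_load_latency(3, 3))  # speculative 3-cycle router -> 12 cycles
print(zero_load_latency(3, 2))  # look-ahead 2-cycle router  ->  9 cycles
print(zero_load_latency(3, 1))  # prediction hit at every hop ->  6 cycles
```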
Bypassing router: skip some stages
• Bypassing between intermediate nodes, e.g., Express VCs [Kumar, ISCA'07]
[Figure: virtual bypassing paths from SRC to DST; bypassed intermediate routers take 1 cycle instead of 3 cycles]
Bypassing router: skip some stages
• Bypassing between intermediate nodes, e.g., Express VCs [Kumar, ISCA'07]
• Pipeline bypassing utilizing the regularity of DOR, e.g., Mad Postman [Izu, PDP'94]
• Pipeline stages on frequently used paths are skipped, e.g., Dynamic Fast Path [Park, HOTI'07]
• Pipeline stages on user-specified paths are skipped, e.g., Preferred Path [Michelogiannakis, NOCS'07] and DBP [Koibuchi, NOCS'08]
[Figure: virtual bypassing paths from SRC to DST; bypassed intermediate routers take 1 cycle instead of 3 cycles]
We propose a low-latency router based on multiple predictors.
Outline: Prediction router (architecture and the prediction algorithms)
Prediction router for 1-cycle transfer [Yoshinaga, IWIA'06] [Yoshinaga, IWIA'07]
• Each input channel has predictors
• While an input channel is idle:
  • Predict the output port to be used (RC pre-execution)
  • Arbitrate for the predicted port (SA pre-execution)
• RC & VSA are skipped if the prediction hits: 1-cycle transfer
[Figure: pipeline diagram of a packet transfer in which the prediction misses at router A (RC, VSA, ST: 3 cycles) and hits at routers B and C (ST only: 1 cycle each)]
E.g., we can expect a 1.6-cycle transfer per hop if 70% of predictions hit.
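The 1.6-cycle figure is simply the expected per-hop latency given the 1-cycle hit and 3-cycle miss costs above; a one-line Python sketch:

```python
def expected_cycles(hit_rate: float, hit_cycles: int = 1, miss_cycles: int = 3) -> float:
    """Expected per-hop latency of the prediction router."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

print(expected_cycles(0.7))  # 1.6 cycles per hop at a 70% hit rate
```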
Prediction router: Prediction algorithms [Yoshinaga, IWIA'06] [Yoshinaga, IWIA'07]
An efficient predictor is key, and a single predictor isn't enough for applications with different traffic patterns. The prediction router therefore has multiple predictors for each input channel and selects one of them in response to the given network environment.
• Random
• Static Straight (SS): the output channel on the same dimension is selected (exploiting the regularity of DOR)
• Custom: the user can specify which output channel is accelerated
• Latest Port (LP): the previously used output channel is selected
• Finite Context Method (FCM): the most frequently appearing pattern of an n-context sequence (n = 0, 1, 2, ...) [Burtscher, TC'02]
• Sampled Pattern Match (SPM): pattern matching using a record table [Jacquet, TIT'02]
[Figure: each input channel has a set of predictors (A, B, C), one of which is selected]
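As a software illustration (in Python, not the Verilog-HDL used for the actual design) of two of these algorithms, here is a minimal sketch of Latest Port and an order-1 FCM; the table organization, counter widths, and aging policy are assumptions, not details from the talk.

```python
from collections import defaultdict

class LatestPortPredictor:
    """LP: predict the output port this input channel used most recently."""
    def __init__(self):
        self.last_port = None

    def predict(self):
        return self.last_port

    def update(self, actual_port):
        self.last_port = actual_port


class FCMPredictor:
    """Order-1 FCM: predict the output port that has most frequently
    followed the current context (the previously used port)."""
    def __init__(self):
        self.context = None
        self.counts = defaultdict(lambda: defaultdict(int))

    def predict(self):
        followers = self.counts.get(self.context)
        if not followers:
            return None                     # no history yet: no prediction
        return max(followers, key=followers.get)

    def update(self, actual_port):
        self.counts[self.context][actual_port] += 1
        self.context = actual_port
```

A prediction counts as a hit when predict() matches the port that the RC stage later computes; on a hit the flit uses the pre-reserved crossbar port and proceeds directly to switch traversal.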
Basic operation @ correct prediction
• Idle state: output port X+ is selected (predicted) and the crossbar port is reserved
• 1st cycle: the incoming flit is transferred to X+ without RC and VSA; RC is performed in parallel and confirms that the prediction is correct
• 2nd cycle: the next flit is transferred to X+ without RC and VSA
[Figure: 5x5 crossbar router; the predictors reserve the X+ crossbar port ahead of the flit's arrival]
1-cycle transfer using the reserved crossbar port when the prediction hits
Basic operation @ miss prediction
• Idle state: output port X+ is selected (predicted) and the crossbar port is reserved
• 1st cycle: the incoming flit is transferred to X+ without RC and VSA; RC is performed in parallel and finds that the prediction is wrong (X- is correct), so a kill signal to X+ is asserted
• 2nd/3rd cycle: the dead flit is removed and the flit is retransmitted to the correct port
[Figure: 5x5 crossbar router; the kill signal removes the dead flit speculatively sent to X+]
Retransmission costs extra energy, but even on a miss prediction a flit is transferred in 3 cycles, as in the original router
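A minimal Python sketch of this hit/miss handling for one head flit, using the 1-cycle hit and 3-cycle miss costs above; compute_route is a hypothetical stand-in for the RC logic, and the kill/retransmission datapath is modeled only as a list of events.

```python
def transfer_head_flit(predicted_port, compute_route, flit):
    """Return (cycles, events) for one head-flit traversal of the
    prediction router at a single input channel."""
    actual_port = compute_route(flit)      # RC runs in parallel with the speculative ST
    if predicted_port == actual_port:
        return 1, [f"ST via reserved port {actual_port}"]
    return 3, [f"kill dead flit on {predicted_port}",
               f"VSA for {actual_port}",
               f"retransmit via {actual_port}"]

route_to_x_minus = lambda flit: "X-"                        # hypothetical routing function
print(transfer_head_flit("X+", route_to_x_minus, flit={}))  # miss: 3 cycles
print(transfer_head_flit("X-", route_to_x_minus, flit={}))  # hit:  1 cycle
```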
Outline: Prediction router (hit rate analysis)
Prediction hit rate analysis
• Formulas to calculate the prediction hit rates on:
  • 2-D torus (Random, LP, SS, FCM, and SPM)
  • 2-D mesh (Random, LP, SS, FCM, and SPM)
  • Fat tree (Random and LRU)
• Goal: to forecast which prediction algorithm suits a given network environment without simulations
• The accuracy of the analytical model is confirmed through simulations
The derivation of the formulas is omitted in this talk (see Section 4 of our paper for details).
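The closed-form formulas are given in the paper; as a rough stand-in for readers without them, here is a Monte-Carlo estimate (in Python) of the SS hit rate on a k x k mesh with XY routing under uniform random traffic. It ignores injection and ejection ports and is an illustration only, not the paper's analytical model.

```python
import random

def ss_hit_rate_mesh(k: int, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the Static Straight (SS) hit rate at intermediate routers
    for XY routing on a k x k mesh under uniform random traffic."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(trials):
        sx, sy = rng.randrange(k), rng.randrange(k)
        dx, dy = rng.randrange(k), rng.randrange(k)
        if (sx, sy) == (dx, dy):
            continue
        # Build the XY route as a list of unit moves: X hops first, then Y hops.
        moves = [(1 if dx > sx else -1, 0)] * abs(dx - sx)
        moves += [(0, 1 if dy > sy else -1)] * abs(dy - sy)
        # At each intermediate router, SS predicts "keep going in the
        # direction the flit arrived from".
        for prev, cur in zip(moves, moves[1:]):
            total += 1
            hits += (prev == cur)
    return hits / total

print(ss_hit_rate_mesh(8))   # roughly 0.8 for an 8x8 mesh under these simplifications
```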
Outline: Evaluations (hit rate, gate count, and energy consumption; three case studies)
Evaluation items
• Hit rate / communication latency (how many cycles?): flit-level network simulation
• Area (gate count): synthesis with Design Compiler using a Fujitsu 65nm library, place & route with Astro
• Energy consumption [pJ/bit]: simulation with NC-Verilog (SDF/SAIF annotation) and power estimation with Power Compiler
[Tables 1-3 of the paper list the router & network parameters, the process library, and the CAD tools used]
*Topology and traffic are described later
3 case studies of the prediction router
• Case studies 1 & 2: 2-D mesh network
  • The most popular network topology, e.g., MIT's RAW [Taylor, ISCA'04] and Intel's 80-core chip [Vangal, ISSCC'07]
  • Dimension-order routing (XY routing)
  • Here, the results of case studies 1 and 2 are shown together
• Case study 3: Fat tree network
Case study 1: Zero-load communication latency
• Uniform random traffic on 4x4 to 16x16 meshes; 1-cycle transfer on a correct prediction, 3-cycle transfer on a wrong prediction
• Compared routers: original router, prediction router (SS), and prediction router (100% hit)
[Figure: communication latency [cycles] vs. network size (k-ary 2-mesh); simulation results (the analytical model shows the same result)]
Latency is reduced by 35.8% for 8x8 cores and by 48.2% for 16x16 cores: more latency is saved as the network size increases
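A back-of-the-envelope Python sketch of why the saving grows with k: the hop-dependent part of the zero-load latency grows with the average hop count, while the serialization term stays constant. The 80% hit rate and 4-flit packets below are assumptions for illustration; the sketch does not reproduce the simulated 35.8%/48.2% figures, which depend on the actual parameters in Table 1 of the paper.

```python
def avg_hops(k: int) -> float:
    """Average Manhattan distance between two independent uniform
    endpoints on a k x k mesh (uniform random traffic)."""
    return 2.0 * (k * k - 1) / (3.0 * k)

HIT_RATE = 0.8                                        # assumed, not from the talk
FLITS = 4                                             # head + 3 data flits, as in the pipeline diagrams
per_hop_pred = HIT_RATE * 1 + (1 - HIT_RATE) * 3      # 1-cycle hit, 3-cycle miss

for k in (4, 8, 16):
    base = avg_hops(k) * 3 + (FLITS - 1)              # original 3-cycle router
    pred = avg_hops(k) * per_hop_pred + (FLITS - 1)   # prediction router
    print(k, round(1 - pred / base, 2))               # reduction grows with k (about 0.38, 0.45, 0.49)
```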
Case study 2: Hit rate @ 8x8 mesh
Workloads: 7 NAS Parallel Benchmarks programs and 4 synthetic traffic patterns
[Figure: prediction hit rate [%] of each predictor for each workload]
• SS (go straight): efficient for long straight communication
• LP (the last used port): efficient for short repeated communication
• FCM (most frequently used pattern): an all-rounder
However, the effective bypassing policy depends on the traffic pattern:
• Existing bypassing routers use only a static or a single bypassing policy
• The prediction router supports multiple predictors that can be switched in a cycle, to accelerate a wider range of applications
Case study 2: Area & Energy
Area (gate count): original router vs. prediction router (SS+LP) vs. prediction router (SS+LP+FCM)
• Verilog-HDL designs synthesized with the 65nm library
• Router area increases by 6.4 - 15.9%, depending on the type and number of predictors: a light-weight (small) overhead
• FCM is the all-rounder, but requires counters
Energy consumption (flit switching energy [pJ/bit]): original router vs. prediction router (70% hit) vs. prediction router (100% hit)
• Miss predictions consume extra power: a 9.5% increase if the hit rate is 70%
• This estimation is pessimistic: more energy is consumed in the links (so the effect of the router energy overhead is reduced), and applications finish earlier (so more energy is saved)
Overall, latency is reduced by 35.8 - 48.2% with reasonable area and energy overheads
Case study 3: Fat tree network
Prediction algorithms:
1. LRU: the least recently used output port is selected for upward transfers
2. LRU + LP: in addition, LP is used for downward transfers
[Figure: communication latency [cycles] vs. network size (# of cores) under uniform traffic: original router vs. prediction router (LRU) vs. prediction router (LRU + LP)]
Latency is reduced by 30.7% for 256 cores, with a small area overhead (7.8%)
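A minimal Python sketch of an LRU up-port predictor as described above; the initial ordering and the interaction with the fat tree's up-routing are assumptions for illustration.

```python
class LRUPortPredictor:
    """Predict the least recently used of this router's up ports."""
    def __init__(self, up_ports):
        self.order = list(up_ports)     # front of the list = least recently used

    def predict(self):
        return self.order[0]

    def update(self, used_port):
        # Move the port that was actually used to the most-recently-used end.
        self.order.remove(used_port)
        self.order.append(used_port)

pred = LRUPortPredictor(["up0", "up1", "up2", "up3"])
print(pred.predict())   # "up0"
pred.update("up0")
print(pred.predict())   # "up1"
```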
Summary of the prediction router
• Prediction router for low-latency NoCs
  • Multiple predictors, which can be switched in a cycle
  • Architecture and six prediction algorithms
  • Analytical model of the prediction hit rates
• Evaluations of the prediction router
  • Case study 1: 2-D mesh (small core size)
  • Case study 2: 2-D mesh (large core size)
  • Case study 3: Fat tree network
• Results from the three case studies
  1. Latency reduction of up to 48%, with a 6.4% area overhead (SS+LP) and a 9.5% energy overhead in the worst case (case studies 1 & 2)
  2. The prediction router can be applied to various NoCs, and communication latency is reduced with small overheads
  3. A prediction router with multiple predictors can accelerate a wider range of applications
Thank you for your attention. It would be very helpful if you would speak slowly. Thank you in advance.
Prediction router: New modifications
• Predictors for each input channel
• A kill mechanism (KILL signals) to remove dead flits
• A two-level arbiter
  • "Reservation" has the higher priority
  • "Tentative reservation" is made by the pre-execution of VSA
• Currently, the critical path is related to the arbiter
[Figure: 5x5 crossbar router with predictors (A, B, C) on each input channel and KILL signals to the crossbar]
Prediction router: Predictor selection
• Static scheme: a predictor is selected by the user per application
  • Simple, but pre-analysis is needed
• Dynamic scheme: a predictor is selected adaptively
  • A configuration table counts up whenever each predictor hits, and a predictor is selected every n cycles (e.g., n = 10,000)
  • Flexible, but consumes more energy
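A minimal Python sketch of the dynamic scheme described above, assuming all predictors are evaluated in parallel so that their hit counters can be compared, and that each predictor exposes predict()/update() as in the earlier LP/FCM sketches; the window size and tie-breaking are illustrative.

```python
class DynamicSelector:
    """Count hits per predictor and re-select the best one every `window` cycles."""
    def __init__(self, predictors, window=10_000):
        self.predictors = predictors                  # dict: name -> predictor object
        self.window = window
        self.hits = {name: 0 for name in predictors}
        self.cycle = 0
        self.selected = next(iter(predictors))

    def predict(self):
        return self.predictors[self.selected].predict()

    def record(self, actual_port):
        # Every predictor runs in the background; count which ones were right.
        for name, p in self.predictors.items():
            if p.predict() == actual_port:
                self.hits[name] += 1
            p.update(actual_port)
        self.cycle += 1
        if self.cycle % self.window == 0:             # periodic re-selection
            self.selected = max(self.hits, key=self.hits.get)
            self.hits = {name: 0 for name in self.predictors}
```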
Case study 1: Router critical path
• Pipeline stages: RC (routing computation), VSA (arbitration), ST (switch traversal)
• In the prediction router, ST can occur in any of these stages
[Figure: stage delay [FO4s] of the original router vs. the prediction router (SS)]
The critical path delay increases by 6.2% compared with the original router
Case study 2: Hit rate @ 8x8 mesh (including the Custom predictor)
• SS (go straight): efficient for long straight communication
• LP (the last used port): efficient for short repeated communication
• FCM (most frequently used pattern): an all-rounder
• Custom (user-specified path): efficient for simple communication
[Figure: prediction hit rate [%] for the 7 NAS Parallel Benchmarks programs and 4 synthetic traffic patterns]
Case study 4: Spidergon network [Coppola, ISSOC'04]
• Spidergon topology: a ring plus across links; each router has 3 ports; mesh-like 2-D layout
• Across-first routing
• Predictors evaluated: SS (go straight), LP (the last used port), FCM (the most frequently used pattern)
[Figure: prediction hit rate [%] under uniform traffic vs. network size (# of cores); the hit rates of SS and FCM are almost the same]
A high hit rate is achieved: 80% for 64 cores and 94% for 256 cores
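For readers unfamiliar with the topology, a small Python sketch of across-first routing on Spidergon (a ring of N nodes plus an across link from node i to node i + N/2). The routing rule used here, take the across link first when the ring distance exceeds N/4 and then follow the ring in the shorter direction, is the commonly described variant and is an assumption; the talk does not spell out the exact rule.

```python
def across_first_route(src: int, dst: int, n: int):
    """Hop-by-hop node sequence from src to dst on an N-node Spidergon
    using across-first routing (assumed variant)."""
    assert n % 2 == 0
    path, cur = [src], src
    rel = (dst - src) % n
    if n // 4 < rel < 3 * n // 4:       # destination is far: take the across link first
        cur = (cur + n // 2) % n
        path.append(cur)
        rel = (dst - cur) % n
    step = 1 if rel <= n // 2 else -1   # then walk the ring in the shorter direction
    while cur != dst:
        cur = (cur + step) % n
        path.append(cur)
    return path

print(across_first_route(0, 9, 16))     # e.g., [0, 8, 9]: across link, then one ring hop
```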
4 case studies of the prediction router: 2-D mesh network (case studies 1 & 2), fat tree network (case study 3), and Spidergon network (case study 4)