Explore a novel prediction router for low-latency NoC with speculative, look-ahead, and bypassing routers. Evaluate hit rates, gate count, and energy consumption in various case studies. Enhance core communication efficiency with advanced prediction algorithms.
Prediction Router: Yet another low-latency on-chip router architecture
Hiroki Matsutani (Keio Univ., Japan), Michihiro Koibuchi (NII, Japan), Hideharu Amano (Keio Univ., Japan), Tsutomu Yoshinaga (UEC, Japan)
Why is a low-latency router needed?
• Tile architecture [Dally, DAC’01]
  • Many cores (e.g., processors & caches)
  • On-chip interconnection network (packet-switched network)
• [Figure: 16-core tile architecture; each core is attached to a router]
• The on-chip router affects the performance and cost of the chip
Why is a low-latency router needed?
• The number of cores is increasing (e.g., 64 cores or more?)
• The number of hops increases, so communication latency becomes a crucial problem
• Low-latency router architectures have therefore been studied extensively
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
  • Speculative router
  • Look-ahead router
  • Bypassing router
• Prediction router
  • Architecture and the prediction algorithms
  • Hit rate analysis
• Evaluations
  • Hit rate, gate count, and energy consumption
  • Case study 1: 2-D mesh (small core size)
  • Case study 2: 2-D mesh (large core size)
  • Case study 3: Fat tree network
Wormhole router: Hardware structure
• 1) Selecting an output channel
• 2) Arbitration for the selected output channel (GRANT)
• 3) Sending the packet to the output channel
• [Figure: input ports X+, X-, Y+, Y-, and CORE, each with a FIFO, feed a 5x5 crossbar controlled by an arbiter; output ports X+, X-, Y+, Y-, and CORE]
• Routing, arbitration, and switch traversal are performed in a pipelined manner
Speculative router: VA/SA in parallel [Peh, HPCA’01]
• Pipeline structure: 3-cycle router
• At least 3 cycles to traverse a router
  • RC (Routing computation)
  • VSA (Virtual channel & switch allocations); VA and SA are speculatively performed in parallel
  • ST (Switch traversal)
• Example: a packet transfer from router (a) to router (c)
• [Timing diagram, elapsed time in cycles 1 to 12: the head flit goes through RC, VSA, and ST at routers A, B, and C; data flits 1 to 3 follow with SA and ST at each router]
• At least 12 cycles to transfer a packet from router (a) to router (c)
• To perform RC and VSA in parallel, look-ahead routing is used
Look-ahead router: RC/VA in parallel
• At least 3 cycles to traverse a router
  • NRC (Next routing computation): routing computation for the next hop; the output port of router (i+1) is selected by router i
  • VSA (Virtual channel & switch allocations): can be performed without waiting for NRC
  • ST (Switch traversal)
• [Timing diagram, elapsed time in cycles 1 to 12: the head flit goes through NRC, VSA, and ST at routers A, B, and C; data flits 1 to 3 follow with SA and ST at each router]
Look-ahead router: RC/VA in parallel
• At least 2 cycles to traverse a router
  • NRC + VSA (Next routing computation / arbitrations)
  • ST (Switch traversal)
• There is no dependency between NRC and VSA, so they are performed in parallel: a typical example of a 2-cycle router [Dally’s book, 2004]
• [Timing diagram, elapsed time in cycles 1 to 9: the head flit goes through NRC + VSA and ST at routers A, B, and C; data flits 1 to 3 follow]
• At least 9 cycles to transfer a packet from router (a) to router (c)
• Packing NRC, VSA, and ST into a single stage would harm the operating frequency
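The 12-cycle and 9-cycle figures on the two timing diagrams can be checked with a small calculation. This is a minimal sketch, assuming no contention, no link delay, and one cycle between consecutive flits; the function name is hypothetical.

```python
def zero_load_packet_cycles(routers: int, cycles_per_router: int, flits: int) -> int:
    """Cycles until the last flit of a packet leaves the last router.

    Assumptions of this sketch: the head flit spends `cycles_per_router`
    in every router, and each following flit trails the previous one by
    one cycle (no contention, no link delay).
    """
    head_cycles = routers * cycles_per_router
    return head_cycles + (flits - 1)


# One head flit plus three data flits crossing routers A, B, and C.
print(zero_load_packet_cycles(routers=3, cycles_per_router=3, flits=4))  # 12
print(zero_load_packet_cycles(routers=3, cycles_per_router=2, flits=4))  # 9
```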
Bypassing router: skip some stages
• Bypassing between intermediate nodes (virtual bypassing paths)
  • E.g., Express VCs [Kumar, ISCA’07]
• Pipeline bypassing utilizing the regularity of DOR
  • E.g., Mad postman [Izu, PDP’94]
• Pipeline stages on frequently used paths are skipped
  • E.g., Dynamic fast path [Park, HOTI’07]
• Pipeline stages on user-specific paths are skipped
  • E.g., Preferred path [Michelogiannakis, NOCS’07]
  • E.g., DBP [Koibuchi, NOCS’08]
• [Figure: path from SRC to DST; routers normally take 3 cycles each, bypassed routers take 1 cycle]
• We propose a low-latency router based on multiple predictors
Prediction router for 1-cycle transfer [Yoshinaga, IWIA’06] [Yoshinaga, IWIA’07]
• Each input channel has predictors
• When an input channel is idle,
  • Predict an output port to be used (RC pre-execution)
  • Arbitrate for the predicted port (SA pre-execution)
• RC and VSA are skipped if the prediction hits, giving a 1-cycle transfer
• [Timing diagram, elapsed time in cycles 1 to 12: the head flit misses at router A (RC, VSA, ST) and hits at routers B and C (ST only); data flits 1 to 3 follow with ST at each router]
• E.g., we can expect a 1.6-cycle transfer if 70% of predictions hit
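The 1.6-cycle figure is a weighted average of the hit and miss cases. A minimal sketch, assuming 1 cycle on a hit and 3 cycles on a miss as stated in the talk; the function name is hypothetical.

```python
def expected_router_cycles(hit_rate: float, hit_cycles: int = 1, miss_cycles: int = 3) -> float:
    """Expected cycles per router for a given prediction hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# 0.7 * 1 + 0.3 * 3 = 1.6
print(f"{expected_router_cycles(0.7):.1f}")  # 1.6
```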
Prediction router: Prediction algorithms [Yoshinaga, IWIA’06] [Yoshinaga, IWIA’07]
• An efficient predictor is key
• Multiple predictors for each input channel; one of them is selected in response to a given network environment
• [Figure: prediction router with predictors A, B, and C on each input channel]
• Random
• Static Straight (SS)
  • An output channel on the same dimension is selected (exploiting the regularity of DOR)
• Custom
  • User can specify which output channel is accelerated
• Latest Port (LP)
  • Previously used output channel is selected
• Finite Context Method (FCM) [Burtscher, TC’02]
  • The most frequently appeared pattern of n-context sequence (n = 0, 1, 2, …)
• Sampled Pattern Match (SPM) [Jacquet, TIT’02]
  • Pattern matching using a record table
• A single predictor isn’t enough for applications with different traffic patterns
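Three of the predictors listed above can be sketched in software to show how they differ. This is an illustration only, not the hardware design from the talk: the class names, the predict/update interface, and the port-naming convention (an input port is named after the side of the router it sits on, so "straight" means leaving on the opposite side) are assumptions.

```python
from collections import Counter, defaultdict, deque

# Assumed naming: a flit entering on side X- continues straight out of X+.
OPPOSITE = {"X+": "X-", "X-": "X+", "Y+": "Y-", "Y-": "Y+"}


class StaticStraight:
    """SS: always predict the output on the same dimension."""

    def __init__(self, input_port):
        self.input_port = input_port

    def predict(self):
        return OPPOSITE.get(self.input_port)  # None for the CORE input

    def update(self, actual):
        pass  # static: nothing to learn


class LatestPort:
    """LP: predict the output port used by the previous packet."""

    def __init__(self):
        self.last = None

    def predict(self):
        return self.last

    def update(self, actual):
        self.last = actual


class FiniteContext:
    """FCM: predict the port that most often followed the last n ports."""

    def __init__(self, n=1):
        self.history = deque(maxlen=n)
        self.table = defaultdict(Counter)

    def predict(self):
        counts = self.table.get(tuple(self.history))
        return counts.most_common(1)[0][0] if counts else None

    def update(self, actual):
        self.table[tuple(self.history)][actual] += 1
        self.history.append(actual)


def hit_rate(predictor, trace):
    """Fraction of packets in `trace` whose output port was predicted."""
    hits = 0
    for actual in trace:
        hits += predictor.predict() == actual
        predictor.update(actual)
    return hits / len(trace)


alternating = ["X+", "Y+"] * 4  # hypothetical trace of used output ports
print(hit_rate(LatestPort(), alternating))
print(hit_rate(FiniteContext(n=1), alternating))
```

On the alternating trace LP never hits, while FCM starts hitting once it has seen each context, which is one way to see why a single predictor does not suit every traffic pattern.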
Basic operation @ Correct prediction
• Idle state: output port X+ is selected and reserved (the crossbar is reserved)
• 1st cycle: the incoming flit is transferred to X+ without RC and VSA
• 1st cycle: RC is performed at the same time; the prediction is correct!
• 2nd cycle: the next flit is transferred to X+ without RC and VSA
• [Figure: router with predictors A, B, and C on the input channel; the arbiter has reserved the crossbar path from the input FIFO to output X+]
• 1-cycle transfer using the reserved crossbar port when the prediction hits
Basic operation @ Misprediction
• Idle state: output port X+ is selected and reserved
• 1st cycle: the incoming flit is transferred to X+ without RC and VSA
• 1st cycle: RC is performed at the same time; the prediction is wrong (X- is correct), so the kill signal to X+ is asserted
• 2nd/3rd cycle: the dead flit is removed; the flit is retransmitted to the correct port
• [Figure: router with predictors A, B, and C; the dead flit sits at output X+ and the KILL signal is asserted, while X- is the correct output]
• More energy is consumed for the retransmission
• Even on a misprediction, a flit is transferred in 3 cycles, as in the original router
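The hit and miss cases can be combined into a toy model of a head flit crossing several routers. This is a sketch under the cycle counts given above (1 cycle on a hit, 3 cycles on a miss, one dead flit killed per miss); the function and the example ports are hypothetical.

```python
def traverse(path):
    """Head-flit cycles and dead-flit count along a path.

    `path` holds one (predicted_port, routed_port) pair per router.
    A hit uses the reserved crossbar port (1 cycle). A miss sends a dead
    flit to the predicted port, which is killed, and the flit then follows
    the original 3-cycle pipeline.
    """
    cycles = 0
    dead_flits = 0
    for predicted, routed in path:
        if predicted == routed:
            cycles += 1
        else:
            cycles += 3
            dead_flits += 1
    return cycles, dead_flits


# Miss at the first router, hits at the next two.
print(traverse([("X+", "X-"), ("X-", "X-"), ("X-", "X-")]))  # (5, 1)
```

The dead-flit count is a rough proxy for the extra energy spent on retransmission.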
Prediction hit rate analysis
• Formulas to calculate the prediction hit rates on
  • 2-D torus (Random, LP, SS, FCM, and SPM)
  • 2-D mesh (Random, LP, SS, FCM, and SPM)
  • Fat tree (Random and LRU)
• Purpose: to forecast which prediction algorithm suits a given network environment without simulations
• Accuracy of the analytical model is confirmed through simulations
• Derivation of the formulas is omitted in this talk (see Section 4 of our paper for more detail)
Evaluation items
• Hit rate / comm. latency (how many cycles?), from flit-level network simulation
• Area (gate count)
• Energy consumption [pJ / bit]
• [Figure: evaluation flow using Design Compiler (synthesis), Astro (place & route), NC-Verilog (simulation; SDF, SAIF), and Power Compiler with the Fujitsu 65nm library]
• [Tables not recovered: Table 1: Router & network parameters; Table 2: Process library; Table 3: CAD tools used]
• *Topology and traffic are mentioned later
3 case studies of prediction router
• Evaluation items: hit rate / comm. latency, area (gate count), and energy consumption [pJ / bit]
• Case studies 1 & 2: 2-D mesh network
  • The most popular network topology
    • MIT’s RAW [Taylor, ISCA’04]
    • Intel’s 80-core [Vangal, ISSCC’07]
  • Dimension-order routing (XY routing)
  • Here, we show the results of case studies 1 and 2 together
• Case study 3: Fat tree network
Case study 1: Zero-load comm. latency
• Uniform random traffic on 4x4 to 16x16 meshes
• Compared: original router, pred router (SS), and pred router (100% hit)
• (*) 1-cycle transfer for a correct prediction, 3-cycle for a wrong prediction
• [Plot: comm. latency [cycles] vs. network size (k-ary 2-mesh); simulation results, and the analytical model shows the same result]
• 35.8% reduced for 8x8 cores; 48.2% reduced for 16x16 cores
• More latency is reduced as the network size increases (48% for k=16)
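The trend that larger meshes benefit more from SS can be illustrated with a rough zero-load model. The assumptions here belong to this sketch, not to the talk: XY routing, uniform random traffic between distinct nodes, no link or core cycles, and SS hitting only at routers where the flit continues straight in the same dimension. The absolute numbers therefore need not match the 35.8% and 48.2% reported above.

```python
from itertools import product


def mean_router_cycles(k: int, use_ss: bool) -> float:
    """Average head-flit router cycles on a k x k mesh under XY routing."""
    nodes = list(product(range(k), repeat=2))
    total = 0
    pairs = 0
    for (sx, sy), (dx, dy) in product(nodes, nodes):
        if (sx, sy) == (dx, dy):
            continue
        hx, hy = abs(dx - sx), abs(dy - sy)
        routers = hx + hy + 1  # source, intermediate, and destination routers
        if use_ss:
            # Assumed SS behaviour: misses at the source router, at the
            # X-to-Y turn, and at the destination; hits where the flit
            # keeps going straight.
            straight = max(hx - 1, 0) + max(hy - 1, 0)
            total += straight * 1 + (routers - straight) * 3
        else:
            total += routers * 3
        pairs += 1
    return total / pairs


def latency_reduction(k: int) -> float:
    """Fraction of router cycles saved by SS in this simplified model."""
    return 1.0 - mean_router_cycles(k, True) / mean_router_cycles(k, False)


for k in (4, 8, 16):
    print(k, f"{latency_reduction(k):.3f}")
```

In this model the share of straight-through routers grows with k, so the reduction grows with network size, in line with the trend reported on the slide.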
Case study 2: Hit rate @ 8x8 mesh
• [Plot: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics]
• SS (go straight): efficient for long straight comm.
• LP (the last one): efficient for short repeated comm.
• FCM (frequently used pattern): all-rounder!
• However, the effective bypassing policy depends on traffic patterns
• Existing bypassing routers use
  • Only a static or a single bypassing policy
• Prediction router supports
  • Multiple predictors which can be switched in a cycle
  • To accelerate a wider range of applications
Case study 2: Area & Energy
• Area (gate count): original router, pred router (SS + LP), and pred router (SS+LP+FCM)
  • Verilog-HDL designs, synthesized with the 65nm library
  • [Plot: router area [kilo gates]]
  • Light-weight (small overhead); FCM is an all-rounder, but requires counters
  • 6.4% to 15.9% increased, depending on the type and number of predictors
• Energy consumption: original router, pred router (70% hit), and pred router (100% hit)
  • [Plot: flit switching energy [pJ / bit]]
  • Misprediction consumes power; 9.5% increased if the hit rate is 70%
  • This estimation is pessimistic:
    • More energy is consumed in links, so the effect of the router energy overhead is reduced
    • The application will finish earlier, so more energy is saved
• Latency 35.8% to 48.2% saved with reasonable area/energy overheads
3 case studies of prediction router
• Case studies 1 & 2: 2-D mesh network
• Case study 3: Fat tree network
Case study 3: Fat tree network
• 1. LRU algorithm: the LRU output port is selected for upward transfer
• 2. LRU + LP algorithm: plus, LP for downward transfer
• [Figure: fat tree with upward and downward transfers]
• [Plot: comm. latency [cycles] at uniform traffic vs. network size (# of cores) for the original router, pred router (LRU), and pred router (LRU + LP)]
• Latency 30.7% reduced @ 256-core; small area overhead (7.8%)
Summary of the prediction router
• Prediction router for low-latency NoCs
  • Multiple predictors, which can be switched in a cycle
  • Architecture and six prediction algorithms
  • Analytical model of prediction hit rates
• Evaluations of prediction router
  • Case study 1: 2-D mesh (small core size)
  • Case study 2: 2-D mesh (large core size)
  • Case study 3: Fat tree network
• Results (from three case studies)
  • 1. Prediction router can be applied to various NoCs
  • 2. Communication latency reduced with small overheads
  • 3. Prediction router with multiple predictors can accelerate a wider range of applications
• From case studies 1 & 2: area overhead 6.4% (SS+LP); energy overhead 9.5% (worst); latency reduction up to 48%
Thank you for your attention
It would be very helpful if you would speak slowly. Thank you in advance.
Prediction router: New modifications
• Predictors for each input channel
• Kill mechanism to remove dead flits (KILL signals)
• Two-level arbiter
  • “Reservation”: higher priority
  • “Tentative reservation”: made by the pre-execution of VSA
• [Figure: router with predictors A, B, and C on each input channel, input FIFOs, arbiter with KILL signals, and a 5x5 crossbar]
• Currently, the critical path is related to the arbiter
Prediction router: Predictor selection
• Static scheme: a predictor is selected by the user per application (configuration table)
  • Simple
  • Pre-analysis is needed
• Dynamic scheme: a predictor is adaptively selected
  • Count up each time a predictor hits
  • A predictor is selected every n cycles (e.g., n = 10,000)
  • Flexible
  • More energy
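The dynamic scheme can be sketched as per-predictor hit counters that are compared at the end of each interval. This is an illustration under stated assumptions: the class and method names are hypothetical, and the interval is counted in observed packets rather than clock cycles to keep the example small.

```python
class DynamicSelector:
    """Pick the predictor with the most hits in the last interval."""

    def __init__(self, names, interval=10_000):
        if not names:
            raise ValueError("at least one predictor name is required")
        self.names = list(names)
        self.interval = interval
        self.active = self.names[0]  # assumed default before the first interval ends
        self.hits = dict.fromkeys(self.names, 0)
        self.elapsed = 0

    def observe(self, predictions, actual):
        """Record one packet: `predictions` maps predictor name to its guess."""
        for name in self.names:
            if predictions.get(name) == actual:
                self.hits[name] += 1  # count up if this predictor would have hit
        self.elapsed += 1
        if self.elapsed == self.interval:
            # Re-select the best predictor, then restart the counters.
            self.active = max(self.names, key=self.hits.__getitem__)
            self.hits = dict.fromkeys(self.names, 0)
            self.elapsed = 0
        return self.active


selector = DynamicSelector(["SS", "LP"], interval=4)
for _ in range(4):
    selector.observe({"SS": "X+", "LP": "Y+"}, actual="Y+")
print(selector.active)  # LP
```

Every predictor has to keep producing guesses so its counter can be updated, which is consistent with the "more energy" cost noted for the dynamic scheme.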
Case study 1: Router critical path
• RC: Routing comp.
• VSA: Arbitration
• ST: Switch traversal
• [Plot: stage delay [FO4s] for the pred router (SS) and the original router]
• ST can occur in these stages of the prediction router
• Critical path delay increased by 6.2% compared with the original router
Case study 2: Hit rate @ 8x8 mesh
• [Plot: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffics]
• SS (go straight): efficient for long straight comm.
• LP (the last one): efficient for short repeated comm.
• FCM (frequently used pattern): all-rounder!
• Custom (user-specific path): efficient for simple comm.
Case study 4: Spidergon network
• Spidergon topology [Coppola, ISSOC’04]
  • Ring + across links
  • Each router has 3 ports
  • Mesh-like 2-D layout
  • Across-first routing
• Hit rate @ uniform traffic
  • SS: go straight; LP: last used one; FCM: frequently used one
  • [Plot: prediction hit rate [%] vs. network size (# of cores)]
  • Hit rates of SS & FCM are almost the same
  • High hit rate is achieved (80% for 64-core; 94% for 256-core)
4 case studies of prediction router
• Evaluation items: hit rate / comm. latency, area (gate count), and energy consumption [pJ / bit]
• Case studies 1 & 2: 2-D mesh network
• Case study 3: Fat tree network
• Case study 4: Spidergon network