NoC Frequency Scaling with Flexible-Pipeline Routers
Pingqiang Zhou, Jieming Yin, Antonia Zhai, and Sachin S. Sapatnekar
University of Minnesota – Twin Cities

Tile-Based Multicore System
[Figure: tiled multicore with blocks labeled C, L1, L2, R, and MEM]
NoC dissipates substantial system energy
• RAW: 36%
• Intel 80-tile: 28% [Vangal et al. 2008]

VFS and Its Limitations
[Figure: superscalar machine with MEM blocks]
• NoC is
  • Potential performance bottleneck
  • Source of energy consumption
  • Designed for diverse traffic patterns
• VFS to reduce energy
• Limitations of aggressive VFS
  • Reduces throughput
  • Increases latency
  • Works for limited traffic patterns
[Figure: traffic patterns by latency (sensitive to insensitive) and throughput (low to high)]
Can we make VFS work for other important traffic patterns?

Frequency Scaling
[Figure: pipeline stages 1 to 4 over time T, at Frequency = F and at Frequency = 0.5F]
Frequency scaling harms performance

Reconfigure Pipeline
[Figure: pipeline stages 1 to 4 over time T at Frequency = 0.5F, before and after pipeline reconfiguration]
Flexible pipeline can reduce router pipeline delay

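
The two slides above can be checked with a little arithmetic. This is a minimal sketch, assuming per-hop router latency is simply the number of pipeline stages times the clock period, and that merged stages fit within the longer clock period. The function name, the normalized time unit, and the choice of two merged stages are illustrative assumptions, not taken from the talk.

```python
def router_latency(stages: int, freq_scale: float, base_period: float = 1.0) -> float:
    """Per-hop router latency under the assumed model: stages x clock period.

    freq_scale is the clock frequency relative to the baseline F
    (1.0 means F, 0.5 means 0.5F). base_period is the clock period at F.
    """
    if stages < 1 or freq_scale <= 0:
        raise ValueError("stages must be >= 1 and freq_scale must be > 0")
    return stages * base_period / freq_scale


baseline = router_latency(stages=4, freq_scale=1.0)  # 4 stages at F
scaled = router_latency(stages=4, freq_scale=0.5)    # 4 stages at 0.5F
flexible = router_latency(stages=2, freq_scale=0.5)  # 2 merged stages at 0.5F (assumed)

print(f"baseline: {baseline}, scaled only: {scaled}, scaled + merged: {flexible}")
```

Under these assumptions, halving the frequency alone doubles the router latency, while halving the frequency and merging to two stages restores the baseline latency.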
Flexible Pipeline Routers
+ Reduce NoC energy
+ Negligible performance degradation
Target application
• Low throughput
• Latency sensitive
[Figure: traffic patterns by latency (sensitive to insensitive) and throughput (low to high)]
Reduce frequency without increasing router latency

Outline
• Background/Motivation
• Router Design
• Experimental Results
• Related Work
• Conclusion

Baseline Router Architecture
• Head flit pipeline: BW, RC, VA, SA, ST
• Body/tail flit pipeline: BW, SA, ST
[Figure: router block diagram with input ports, Input Controller (BW/RC) with buffers MC 1, VC 1 through MC n, VC 1, Route Computation, VC Allocator (VA), Switch Allocator (SA), Crossbar Switch (ST), and output ports]
How to reconfigure pipeline?

Pipeline Stage Delay
Delay of 4-stage pipeline (τ: inverter delay):
• BW+RC: 100 τ
• VA: 65.5 τ
• SA: 77.7 τ
• ST: 45 τ
Time-borrowing
• Boost pipeline frequency
• Average out stage delays
Tclk = 72.1 τ
The router delay model is presented in [Peh et al., HPCA 2001].

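
The clock period on this slide matches the mean of the four stage delays. A minimal sketch, assuming ideal time-borrowing lets the clock period equal the average stage delay, whereas without borrowing the slowest stage sets the clock; the function names are illustrative.

```python
# Stage delays of the 4-stage pipeline, in units of the inverter delay tau
# (values taken from the slide).
STAGE_DELAYS_TAU = {"BW+RC": 100.0, "VA": 65.5, "SA": 77.7, "ST": 45.0}


def period_without_borrowing(delays: dict) -> float:
    """Without time-borrowing, the slowest stage sets the clock period."""
    return max(delays.values())


def period_with_borrowing(delays: dict) -> float:
    """Assumed ideal time-borrowing: the clock period is the mean stage delay."""
    return sum(delays.values()) / len(delays)


print("no borrowing:", period_without_borrowing(STAGE_DELAYS_TAU), "tau")
print("with borrowing:", period_with_borrowing(STAGE_DELAYS_TAU), "tau")
```

The mean comes out at about 72.05 τ, which the slide reports as 72.1 τ, compared with 100 τ when the BW+RC stage alone determines the clock.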
Pipeline Reconfiguration
• Flex Router: pipeline reconfiguration (τk: inverter delay in the k-stage configuration)
• 4-stage pipeline, Vdd = 1.2 V: BW+RC (100 τ4), VA (65.5 τ4), SA (77.7 τ4), ST (45 τ4); Tclk = 72.1 τ4
• 3-stage pipeline, Vdd = 1.0 V: BW+RC (100 τ3), VA (65.5 τ3), SA+ST (113.7 τ3); Tclk = 93.1 τ3 = 102.1 τ4
• 2-stage pipeline, Vdd = 1.0 V: BW+RC (100 τ2), VA+SA+ST (170.2 τ2); Tclk = 135.1 τ2 = 148.7 τ4
• 1-stage pipeline, Vdd = 0.8 V: BW+RC+VA+SA+ST (270.2 τ1); Tclk = 270.2 τ1 = 337.7 τ4
How much hardware overhead?

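
To see what the four configurations buy, the clock periods on this slide can be compared once they are all expressed in τ4. A minimal sketch, assuming router latency is the stage count times the clock period; the dictionary and function names are illustrative, and the numbers are the τ4-equivalent periods listed on the slide.

```python
# Clock period of each configuration in units of tau_4 (from the slide),
# keyed by the number of pipeline stages.
TCLK_TAU4 = {4: 72.1, 3: 102.1, 2: 148.7, 1: 337.7}


def slowdown(stages: int) -> float:
    """Clock period relative to the 4-stage baseline (higher = slower clock)."""
    return TCLK_TAU4[stages] / TCLK_TAU4[4]


def router_latency_tau4(stages: int) -> float:
    """Assumed router latency: number of stages x clock period, in tau_4."""
    return stages * TCLK_TAU4[stages]


for stages in sorted(TCLK_TAU4, reverse=True):
    print(
        f"{stages}-stage: slowdown x{slowdown(stages):.2f}, "
        f"latency {router_latency_tau4(stages):.1f} tau4"
    )
```

Under this model the 2-stage configuration runs at roughly half the baseline clock frequency while its end-to-end router latency stays within a few percent of the 4-stage baseline.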
Architecture Support
[Figure: baseline router datapath from flits in to flits out, with Input Controller with buffers (BW/RC), Route Computation, VC Allocator (VA), Switch Allocator (SA), and ST, separated by R blocks]
4-stage pipeline: BW+RC, VA, SA, ST

Architecture Support
[Figure: the same router datapath with MUX blocks added around the R blocks, shown configured as a 1-stage, 2-stage, 3-stage, and 4-stage pipeline over BW+RC, VA, SA, and ST]
+ Control logic
Less than 2% overhead in router area

Outline
• Background/Motivation
• Router Design
• Experimental Results
• Related Work
• Conclusion

Experimental Platform
• Simulator
  • Full system simulator: GEMS
  • Power module: Wattch & Orion 2.0
• Infrastructure: 8 cores, single-issue in-order
• Benchmarks
  • From SPEC OMP2001, NU-Mine and PARSEC
[Figure: tile with C, L1, L2, R, and MEM blocks, labeled 1.5 GHz]

Efficacy in Network Energy Saving
[Figure: network energy for Base, Base-2, and Flex-2; annotations: 41%, 2%]
• Dynamic energy decreases quadratically as voltage goes down
• Clock energy reduction is significant (65%)
• Changes in static energy are minimal
Base: Baseline Router
Base-2: VFS, Slowdown Factor of 2
Flex-2: VFS + Flexible-Pipeline Router

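
The quadratic dependence of dynamic energy on voltage follows from the standard CMOS switching-energy relation, E ∝ C·Vdd². A minimal sketch of that relation only, assuming the switched capacitance is unchanged; it uses the supply voltages from the Pipeline Reconfiguration slide and does not attempt to reproduce the measured savings in the figure.

```python
def relative_dynamic_energy(vdd_new: float, vdd_base: float) -> float:
    """Dynamic energy per switching event relative to the baseline.

    Assumes E_dyn is proportional to C * Vdd^2 with constant capacitance C.
    """
    if vdd_new <= 0 or vdd_base <= 0:
        raise ValueError("supply voltages must be positive")
    return (vdd_new / vdd_base) ** 2


BASE_VDD = 1.2  # volts, 4-stage configuration on the reconfiguration slide

for vdd in (1.2, 1.0, 0.8):
    ratio = relative_dynamic_energy(vdd, BASE_VDD)
    print(f"Vdd = {vdd} V: {ratio:.3f} x baseline dynamic energy per event")
```

Dropping from 1.2 V to 1.0 V gives about 0.69 of the baseline dynamic energy per switching event under this assumption, and 0.8 V gives about 0.44.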
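Efficacy in Execution Time
[Figure: execution time for Base, Base-2, and Flex-2, with benchmarks grouped by latency (sensitive to insensitive) and throughput (low to high); annotation: 1.5%]
Average system performance degradation is reduced
Base: Baseline Router
Base-2: VFS
Flex-2: VFS + Flexible-Pipeline Router
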
System-Level Evaluation
• System-level ED² product
  • Cores, caches and the interconnection networks
  • E: system energy
  • D: system delay
[Figure: tradeoff between network delay and network energy, and between system delay and system energy]

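
The ED² metric weights delay more heavily than energy, so a small slowdown can cancel an energy saving. A minimal sketch, using hypothetical normalized energy and delay values that are not results from the talk; the function name is illustrative.

```python
def ed2_product(energy: float, delay: float) -> float:
    """Energy-delay-squared product: E * D^2 (lower is better)."""
    return energy * delay ** 2


# Hypothetical values, normalized to a baseline of energy = 1, delay = 1.
baseline = ed2_product(energy=1.0, delay=1.0)
small_slowdown = ed2_product(energy=0.95, delay=1.015)  # 5% less energy, 1.5% slower
large_slowdown = ed2_product(energy=0.95, delay=1.04)   # 5% less energy, 4% slower

print(f"baseline:       {baseline:.4f}")
print(f"small slowdown: {small_slowdown:.4f}")  # below 1: ED^2 improves
print(f"large slowdown: {large_slowdown:.4f}")  # above 1: ED^2 gets worse
```

With these made-up numbers, the same 5% energy saving improves ED² at a 1.5% slowdown but worsens it at a 4% slowdown.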
Efficacy in System ED² Product
[Figure: system ED² product for Base, Base-2, and Flex-2; annotation: ED² increase]
Base: Baseline Router
Base-2: VFS
Flex-2: VFS + Flexible-Pipeline Router
Frequency tuning should be based on workloads

More Aggressive VFS: Network Energy Saving
[Figure: network energy for Base, Flex-2, and Flex-4; annotations: 43%, 39%]
Flexible-Pipeline Router is scalable in reducing network energy
Base: Baseline Router
Flex-2: Flexible-Pipeline Router + Slowdown Factor of 2
Flex-4: Flexible-Pipeline Router + Slowdown Factor of 4

More Aggressive VFS: Execution Time
[Figure: execution time for Base, Flex-2, and Flex-4]
Base: Baseline Router
Flex-2: Flexible-Pipeline Router + Slowdown Factor of 2
Flex-4: Flexible-Pipeline Router + Slowdown Factor of 4
Performance degradation increases with the slowdown factor

Limits of VFS: System ED² Product
[Figure: system ED² product for Base, Flex-2, and Flex-4]
• Diminishing returns when pushing the frequency scaling limit
• Workload-dependent
Base: Baseline Router
Flex-2: Flexible-Pipeline Router + Slowdown Factor of 2
Flex-4: Flexible-Pipeline Router + Slowdown Factor of 4

Related Work
• “A case for dynamic frequency tuning in on-chip networks” [Mishra '09]
  • Dynamic router VFS for reducing network power consumption
  • Flexible-pipeline routers enable more drastic scaling
• “A variable-pipeline on-chip router optimized to traffic pattern” [Hirata '10]
  • Dynamic router VFS + variable-pipeline routers
  • Flexible-pipeline routers have lower hardware overhead
  • Our work presents system-level evaluation

Conclusions
• Flexible-Pipeline Router
  • Minimal hardware overhead
  • Enables aggressive VFS
• System-Level Implications
  • Considerable energy saving
  • Negligible performance degradation
[Figure: energy and performance]

Thank you! Q & A

Router Delay Model*
• Router stage delay:
  • ti: sequential logic latency
  • h: setup delay
  • τ: inverter delay

Stage    ti            h
BW/RC    constant      0
VA       f(p, v)       9τ
SA       f(p, c, v)    9τ
ST       f(p, ω)       0

p: # of input/output ports
c: # of message classes
v: # of VCs per message class
ω: flit size in bits
[Figure: router block diagram with input ports, Input Controller (BW/RC) with buffers MC 1, VC 1 through MC n, VC 1, Route Computation, VC Allocator (VA), Switch Allocator (SA), Crossbar Switch (ST), and output ports]
*This model is presented in [Peh et al., HPCA 2001].
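
The table above can be tied back to the merged-stage delays on the Pipeline Reconfiguration slide. A rough sketch under two assumptions that are not stated in the talk: a stage's delay is ti + h, and a merged stage pays only the setup delay h of its last sub-stage. The ti values below are back-computed from the 4-stage delays on the slides (for example 65.5 τ minus 9 τ for VA), and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    t: float  # sequential logic latency ti, in units of tau (back-computed)
    h: float  # setup delay h, in units of tau (from the table)


# Assumed decomposition of the slide's 4-stage delays into ti and h.
BW_RC = Stage("BW+RC", t=100.0, h=0.0)
VA = Stage("VA", t=56.5, h=9.0)
SA = Stage("SA", t=68.7, h=9.0)
ST = Stage("ST", t=45.0, h=0.0)


def merged_delay(stages: list) -> float:
    """Delay of one pipeline stage built from consecutive sub-stages.

    Assumption: logic latencies add up, and only the last sub-stage's
    setup delay is paid.
    """
    if not stages:
        raise ValueError("need at least one sub-stage")
    return sum(s.t for s in stages) + stages[-1].h


print("VA alone:       ", round(merged_delay([VA]), 1))
print("SA+ST:          ", round(merged_delay([SA, ST]), 1))
print("VA+SA+ST:       ", round(merged_delay([VA, SA, ST]), 1))
print("BW+RC+VA+SA+ST: ", round(merged_delay([BW_RC, VA, SA, ST]), 1))
```

Under these assumptions the merged delays agree with the 113.7 τ, 170.2 τ, and 270.2 τ figures on the reconfiguration slide, which suggests that merging stages removes the intermediate setup delays.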