Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai University of Minnesota – Twin Cities
Networks-on-Chip Core Core Core Core Core Core Core Core R R R R R R R R • Scalable • Provides high bandwidth • But adds communication latency • And consumes significant energy
Heterogeneous System Data Parallel Data Parallel Data Parallel Data Parallel Super-scalar Super-scalar Super-scalar Super-scalar Only some routers are fully utilized
DVFS for Reducing NoC Energy • Dynamic Voltage and Frequency Scaling • Router energy dominates NoC energy • DVFS reduces router energy, but adds delay • Previous work is conservative in its aggressiveness We need more aggressive DVFS
Limitations of Aggressive DVFS • Aggressive DVFS increases latency • And reduces throughput • It works only for limited traffic patterns (figure: a latency-sensitivity vs. throughput quadrant chart positioning Dynamic Voltage Frequency Scaling, Our Previous Work*, and This Work with respect to latency, throughput, and contention) * Zhou et al., NoC Frequency Scaling with Flexible-Pipeline Routers, ISLPED-2011
Flexible-Pipeline Routers (figure: at frequency 0.5F, a rigid router still spends one slow cycle on each of pipeline stages 1–4, while a flexible pipeline merges stages to fill the longer cycle) Flexible pipeline reduces router pipeline delay
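The stage-merging idea can be sketched numerically. This is a toy model of the slide's example (a 4-stage router pipeline scaled to 0.5F), not the routers' actual implementation:

```python
import math

def traversal_cycles(stages, freq_ratio):
    """Cycles to cross a flexible-pipeline router: each slower cycle is
    1/freq_ratio times longer, so it can absorb that many merged stages."""
    stages_per_cycle = int(1 / freq_ratio)
    return math.ceil(stages / stages_per_cycle)

def traversal_time(stages, freq_ratio):
    """Total router delay, measured in full-speed cycle units."""
    return traversal_cycles(stages, freq_ratio) / freq_ratio

# Rigid pipeline at 0.5F: 4 stages x 2 time units each = 8 units.
# Flexible pipeline at 0.5F: stages merge pairwise, 2 cycles x 2 = 4 units,
# matching the 4 units of a full-speed 4-stage router.
```

In this model the flexible pipeline keeps total router delay at roughly the full-speed value even at 0.5F, which is the sense in which it reduces router pipeline delay relative to a rigid pipeline.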
Exploiting DVFS Opportunity (figure: routers shaded by high / mid / low utilization; (a) minimal path routing sends Src1→Dest1 along path 1 through highly utilized routers, while (b) non-minimal path routing sends it along path 1' through under-utilized routers)
Exploiting DVFS Opportunity (cont.) • Dynamic Energy: EDyn ∝ Vdd² • Static Energy: ESta ∝ Vdd • Clock Energy: EClk ∝ Freq · Vdd² Operating at mid-frequency gets the most benefit
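These proportionalities can be combined into a toy per-packet energy model. The linear Vdd-frequency relation and the unit coefficients are illustrative assumptions, not the paper's measurements:

```python
def packet_energy(f):
    """Relative energy to forward one packet at frequency ratio f,
    assuming Vdd scales linearly with f under DVFS (unit coefficients)."""
    vdd = f
    e_dyn = vdd ** 2                    # dynamic energy per op ~ Vdd^2
    e_sta = vdd * (1.0 / f)             # static power ~ Vdd, paid over 1/f time
    e_clk = (f * vdd ** 2) * (1.0 / f)  # clock power ~ f*Vdd^2, over 1/f time
    return e_dyn + e_sta + e_clk

for f in (1.0, 0.5, 0.25):
    print(f"{f:.2f}F -> {packet_energy(f):.3f}")
# Halving frequency cuts relative energy from 3.0 to 1.5, but a further
# halving only reaches 1.125 while doubling latency again.
```

The diminishing returns below 0.5F, with static energy becoming the dominant term, illustrate why mid-frequency operation gives the most benefit.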
Exploiting DVFS Opportunity (cont.) (figure: routers running at 100% / 50% / 25% frequency; (a) minimal path routing sends Src1→Dest1 along path 1, while (b) non-minimal path routing along path 1' trades 1. performance for savings in 2. dynamic energy and 3. static energy) More benefit with a bigger network
Outline • Introduction • Non-minimal path selection • - Issue • - Solution • - Challenges • Infrastructure (CPU+GPU) • Results • Conclusion
Non-minimal Path Routing (figure: routers shaded by high / mid / low utilization; (a) minimal path routing vs. (b) non-minimal path routing from Src to Dest)
Too Close! (figure: when Src and Dest are only a few hops apart, the detour in (b) non-minimal path routing costs performance, static energy, and dynamic energy compared to (a) minimal path routing)
Too Aggressive! (figure: derouting packets too aggressively from Src1 to Dest1 raises utilization along the alternate path, wasting static and dynamic energy)
Dynamic Network Tuning Packet (per-hop decision): Input → Initial State → if Slack == 1 and (Dx >= 3 || Dy >= 3), route through the least-busy port and set Slack = 0; otherwise route through the min-path port → Output Router: Utilization Monitor → V/F Scaling; busy-information propagation between routers How to determine Slack?
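The flowchart's per-packet branch can be sketched as follows. `Packet` and the port arguments are illustrative names for this sketch, not the paper's actual hardware interface:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    slack: int  # 1 if the packet can tolerate one detour, else 0
    dx: int     # remaining hop distance in x
    dy: int     # remaining hop distance in y

def select_output_port(pkt: Packet, min_path_port: int, least_busy_port: int) -> int:
    """Deroute only packets that have slack and are still far from
    their destination (Dx >= 3 or Dy >= 3)."""
    if pkt.slack == 1 and (pkt.dx >= 3 or pkt.dy >= 3):
        pkt.slack = 0        # spend the slack: at most one deroute per packet
        return least_busy_port
    return min_path_port
```

Clearing the slack bit after one deroute matches the flowchart's "Slack = 0" step and prevents a packet from being detoured repeatedly.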
Busy Information Propagation • Busy Metrics • Buffer utilization • Crossbar utilization • Router utilization • Propagation • Regional congestion awareness • [Grot et al. HPCA08]
Regional Congestion Awareness • Local data collection • Propagation to neighboring routers • Aggregation of local & non-local data
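The three steps above can be sketched per cycle over a chain of routers. The 0.5 aggregation weight is an illustrative choice in the spirit of RCA, not the exact scheme from Grot et al.:

```python
def aggregate_congestion(local_util, propagated, w=0.5):
    """Blend locally measured utilization with the estimate propagated
    by the upstream neighbor, weighting local data more heavily."""
    return local_util + w * propagated

def propagate_chain(local_utils):
    """Local collection, propagation, and aggregation along a 1-D chain:
    each router forwards its aggregated estimate to the next router."""
    estimate = 0.0
    out = []
    for util in local_utils:        # step 1: local data collection
        estimate = aggregate_congestion(util, estimate)  # step 3: aggregate
        out.append(estimate)        # step 2: value propagated downstream
    return out
```

Downstream routers thus see a discounted view of congestion several hops away, which is what lets the port-selection logic steer slack-carrying packets toward under-utilized regions.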
Slack in Applications (figure: Thread 0's read miss overlaps with Threads 1..n executing, until Thread 0 becomes ready and is scheduled again) Slack of a packet: the number of cycles the packet can be delayed without affecting the overall execution time • CPU: slack is not necessarily zero, but is conservatively assumed to be NO slack • GPU: estimated based on the number of threads
Tile-Based Multicore System (figure: a mesh of tiles, each containing a CPU core with its L2 cache (C/L2) or a GPU SM (G), plus memory controllers (M/MEM), all connected by routers (R))
Benchmarks • CPU: afi, ammp, art, equake, kmeans, scalparc • GPU: blackscholes, lps, lib, nn, bfs • Evaluate ALL 30 CPU+GPU combinations • For presentation purposes, classify: • CPU: 1) memory-bound, 2) computation-bound (based on L1 cache miss rate) • GPU: 1) latency-tolerant, 2) latency-intolerant (based on slack cycles)
Benchmark Categorization (quadrants of the latency-sensitivity vs. throughput chart) • memory-bound CPU + latency-tolerant GPU • computation-bound CPU + latency-tolerant GPU • memory-bound CPU + latency-intolerant GPU • computation-bound CPU + latency-intolerant GPU
Network Energy Saving (I) memory-bound CPU + latency-tolerant GPU (II) computation-bound CPU + latency-tolerant GPU (III) memory-bound CPU + latency-intolerant GPU (IV) computation-bound CPU + latency-intolerant GPU Energy saving is significant on certain workloads
Performance Impact (CPU) (I) memory-bound CPU + latency-tolerant GPU (II) computation-bound CPU + latency-tolerant GPU (III) memory-bound CPU + latency-intolerant GPU (IV) computation-bound CPU + latency-intolerant GPU
Performance Impact (GPU) (I) memory-bound CPU + latency-tolerant GPU (II) computation-bound CPU + latency-tolerant GPU (III) memory-bound CPU + latency-intolerant GPU (IV) computation-bound CPU + latency-intolerant GPU Performance penalty is minimal compared to DVFS
Conclusion Non-minimal Path NoC: + Balances on-chip workloads + Reduces NoC energy (figure: on the latency-sensitivity vs. throughput chart, the technique suits workload mixes that are high-throughput and latency-insensitive) Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be judiciously deployed
Exploiting Slack in GPU Predict slack based on # of available warps
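A minimal sketch of this prediction, assuming a binary slack bit (as used by the port-selection step) and an illustrative ready-warp threshold; the exact predictor is not specified on the slide:

```python
def predict_slack(ready_warps: int, threshold: int = 2) -> int:
    """Set the packet's slack bit when enough other warps are ready to
    hide the extra latency of a detour; otherwise mark it latency-critical."""
    return 1 if ready_warps >= threshold else 0
```

A memory reply heading to an SM with many ready warps gets slack == 1 and becomes a candidate for non-minimal routing, while a reply to a starved SM stays on the minimal path.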