Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems Jieming Yin, Pingqiang Zhou, Anup Holey, Sachin S. Sapatnekar, and Antonia Zhai University of Minnesota – Twin Cities
Networks-on-Chip Core Core Core Core Core Core Core Core R R R R R R R R • Scalable • Provides high bandwidth • But adds communication latency • And consumes significant energy
Heterogeneous System Data Parallel Data Parallel Data Parallel Data Parallel Super-scalar Super-scalar Super-scalar Super-scalar Only some routers are fully utilized
DVFS for Reducing NoC Energy • Dynamic Voltage and Frequency Scaling • Router energy dominates NoC energy • DVFS reduces router energy, but adds delay • Previous work is conservative in its aggressiveness We need more aggressive DVFS
Limitations of Aggressive DVFS • Aggressive DVFS increases latency • And reduces throughput • It works only for limited traffic patterns (figure: a latency-sensitivity vs. throughput quadrant chart positioning Dynamic Voltage Frequency Scaling, Our Previous Work*, and This Work with respect to latency, throughput, and contention) * Zhou et al., NoC Frequency Scaling with Flexible-Pipeline Routers, ISLPED-2011
Flexible-Pipeline Routers (figure: at frequency 0.5F, a rigid router still spends one slow cycle on each of pipeline stages 1–4, while a flexible pipeline merges stages to fill the longer cycle) Flexible pipeline reduces router pipeline delay
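The stage-merging idea can be sketched numerically. This is a toy model of the slide's example (a 4-stage router pipeline scaled to 0.5F), not the routers' actual implementation:

```python
import math

def traversal_cycles(stages, freq_ratio):
    """Cycles to cross a flexible-pipeline router: each slower cycle is
    1/freq_ratio times longer, so it can absorb that many merged stages."""
    stages_per_cycle = int(1 / freq_ratio)
    return math.ceil(stages / stages_per_cycle)

def traversal_time(stages, freq_ratio):
    """Total router delay, measured in full-speed cycle units."""
    return traversal_cycles(stages, freq_ratio) / freq_ratio

# Rigid pipeline at 0.5F: 4 stages x 2 time units each = 8 units.
# Flexible pipeline at 0.5F: stages merge pairwise, 2 cycles x 2 = 4 units,
# matching the 4 units of a full-speed 4-stage router.
```

In this model the flexible pipeline keeps total router delay at roughly the full-speed value even at 0.5F, which is the sense in which it reduces router pipeline delay relative to a rigid pipeline.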
Exploiting DVFS Opportunity (figure: routers shaded by high / mid / low utilization; (a) minimal path routing sends Src1→Dest1 along path 1 through highly utilized routers, while (b) non-minimal path routing sends it along path 1' through under-utilized routers)
Exploiting DVFS Opportunity (cont.) • Dynamic Energy: EDyn ∝ Vdd² • Static Energy: ESta ∝ Vdd • Clock Energy: EClk ∝ Freq · Vdd² Operating at mid-frequency gets the most benefit
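These proportionalities can be combined into a toy per-packet energy model. The linear Vdd-frequency relation and the unit coefficients are illustrative assumptions, not the paper's measurements:

```python
def packet_energy(f):
    """Relative energy to forward one packet at frequency ratio f,
    assuming Vdd scales linearly with f under DVFS (unit coefficients)."""
    vdd = f
    e_dyn = vdd ** 2                    # dynamic energy per op ~ Vdd^2
    e_sta = vdd * (1.0 / f)             # static power ~ Vdd, paid over 1/f time
    e_clk = (f * vdd ** 2) * (1.0 / f)  # clock power ~ f*Vdd^2, over 1/f time
    return e_dyn + e_sta + e_clk

for f in (1.0, 0.5, 0.25):
    print(f"{f:.2f}F -> {packet_energy(f):.3f}")
# Halving frequency cuts relative energy from 3.0 to 1.5, but a further
# halving only reaches 1.125 while doubling latency again.
```

The diminishing returns below 0.5F, with static energy becoming the dominant term, illustrate why mid-frequency operation gives the most benefit.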
Exploiting DVFS Opportunity (cont.) (figure: routers running at 100% / 50% / 25% frequency; (a) minimal path routing sends Src1→Dest1 along path 1, while (b) non-minimal path routing along path 1' trades 1. performance for savings in 2. dynamic energy and 3. static energy) More benefit with a bigger network
Outline • Introduction • Non-minimal path selection • - Issue • - Solution • - Challenges • Infrastructure (CPU+GPU) • Results • Conclusion
Non-minimal Path Routing (figure: routers shaded by high / mid / low utilization; (a) minimal path routing vs. (b) non-minimal path routing from Src to Dest)
Too Close! (figure: when Src and Dest are only a few hops apart, the detour in (b) non-minimal path routing costs performance, static energy, and dynamic energy compared to (a) minimal path routing)
Too Aggressive! (figure: derouting packets too aggressively from Src1 to Dest1 raises utilization along the alternate path, wasting static and dynamic energy)
Dynamic Network Tuning Packet (per-hop decision): Input → Initial State → if Slack == 1 and (Dx >= 3 || Dy >= 3), route through the least-busy port and set Slack = 0; otherwise route through the min-path port → Output Router: Utilization Monitor → V/F Scaling; busy-information propagation between routers How to determine Slack?
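The flowchart's per-packet branch can be sketched as follows. `Packet` and the port arguments are illustrative names for this sketch, not the paper's actual hardware interface:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    slack: int  # 1 if the packet can tolerate one detour, else 0
    dx: int     # remaining hop distance in x
    dy: int     # remaining hop distance in y

def select_output_port(pkt: Packet, min_path_port: int, least_busy_port: int) -> int:
    """Deroute only packets that have slack and are still far from
    their destination (Dx >= 3 or Dy >= 3)."""
    if pkt.slack == 1 and (pkt.dx >= 3 or pkt.dy >= 3):
        pkt.slack = 0        # spend the slack: at most one deroute per packet
        return least_busy_port
    return min_path_port
```

Clearing the slack bit after one deroute matches the flowchart's "Slack = 0" step and prevents a packet from being detoured repeatedly.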
Busy Information Propagation • Busy Metrics • Buffer utilization • Crossbar utilization • Router utilization • Propagation • Regional congestion awareness • [Grot et al. HPCA08]
Regional Congestion Awareness • Local data collection • Propagation to neighboring routers • Aggregation of local & non-local data
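The three steps above can be sketched per cycle over a chain of routers. The 0.5 aggregation weight is an illustrative choice in the spirit of RCA, not the exact scheme from Grot et al.:

```python
def aggregate_congestion(local_util, propagated, w=0.5):
    """Blend locally measured utilization with the estimate propagated
    by the upstream neighbor, weighting local data more heavily."""
    return local_util + w * propagated

def propagate_chain(local_utils):
    """Local collection, propagation, and aggregation along a 1-D chain:
    each router forwards its aggregated estimate to the next router."""
    estimate = 0.0
    out = []
    for util in local_utils:        # step 1: local data collection
        estimate = aggregate_congestion(util, estimate)  # step 3: aggregate
        out.append(estimate)        # step 2: value propagated downstream
    return out
```

Downstream routers thus see a discounted view of congestion several hops away, which is what lets the port-selection logic steer slack-carrying packets toward under-utilized regions.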
Slack in Applications (figure: Thread 0's read miss overlaps with Threads 1..n executing, until Thread 0 becomes ready and is scheduled again) Slack of a packet: the number of cycles the packet can be delayed without affecting the overall execution time • CPU: slack is not necessarily zero, but is conservatively assumed to be NO slack • GPU: estimated based on the number of threads
Tile-Based Multicore System (figure: a mesh of tiles, each containing a CPU core with its L2 cache (C/L2) or a GPU SM (G), plus memory controllers (M/MEM), all connected by routers (R))
Benchmarks • CPU: afi, ammp, art, equake, kmeans, scalparc • GPU: blackscholes, lps, lib, nn, bfs • Evaluate ALL 30 CPU+GPU combinations • For presentation purposes, classify: • CPU: 1) memory-bound, 2) computation-bound (based on L1 cache miss rate) • GPU: 1) latency-tolerant, 2) latency-intolerant (based on slack cycles)
Benchmark Categorization (quadrants of the latency-sensitivity vs. throughput chart) • memory-bound CPU + latency-tolerant GPU • computation-bound CPU + latency-tolerant GPU • memory-bound CPU + latency-intolerant GPU • computation-bound CPU + latency-intolerant GPU
Network Energy Saving (I) memory-bound CPU + latency-tolerant GPU (II) computation-bound CPU + latency-tolerant GPU (III) memory-bound CPU + latency-intolerant GPU (IV) computation-bound CPU + latency-intolerant GPU Energy saving is significant on certain workloads
Performance Impact (CPU) (I) memory-bound CPU + latency-tolerant GPU (II) computation-bound CPU + latency-tolerant GPU (III) memory-bound CPU + latency-intolerant GPU (IV) computation-bound CPU + latency-intolerant GPU
Performance Impact (GPU) (I) memory-bound CPU + latency-tolerant GPU (II) computation-bound CPU + latency-tolerant GPU (III) memory-bound CPU + latency-intolerant GPU (IV) computation-bound CPU + latency-intolerant GPU Performance penalty is minimal compared to DVFS
Conclusion Non-minimal Path NoC: + Balances on-chip workloads + Reduces NoC energy (figure: on the latency-sensitivity vs. throughput chart, the technique suits workload mixes that are high-throughput and latency-insensitive) Given the diverse traffic patterns in heterogeneous systems, non-minimal routing should be judiciously deployed
Exploiting Slack in GPU Predict slack based on # of available warps
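A minimal sketch of this prediction, assuming a binary slack bit (as used by the port-selection step) and an illustrative ready-warp threshold; the exact predictor is not specified on the slide:

```python
def predict_slack(ready_warps: int, threshold: int = 2) -> int:
    """Set the packet's slack bit when enough other warps are ready to
    hide the extra latency of a detour; otherwise mark it latency-critical."""
    return 1 if ready_warps >= threshold else 0
```

A memory reply heading to an SM with many ready warps gets slack == 1 and becomes a candidate for non-minimal routing, while a reply to a starved SM stays on the minimal path.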