Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems. Jieming Yin * , Pingqiang Zhou + , Sachin S. Sapatnekar * and Antonia Zhai *. * University of Minnesota, Twin Cities, USA + ShanghaiTech University, China.
Energy-Efficient Time-Division Multiplexed Hybrid-Switched NoC for Heterogeneous Multicore Systems. Jieming Yin*, Pingqiang Zhou+, Sachin S. Sapatnekar* and Antonia Zhai*. * University of Minnesota, Twin Cities, USA. + ShanghaiTech University, China. 28th IEEE International Parallel & Distributed Processing Symposium
Heterogeneous Multicore System
[Figure: CPU cores, GPU cores, L2 caches, and memory (MEM) connected by an interconnection network]
ShanghaiTech
On-chip Traffic Characteristics
• CPU traffic pattern: erratic, random, latency-sensitive. Switching mechanism: packet switching
• GPU traffic pattern: streaming, dedicated, throughput-intensive. Switching mechanism: circuit switching
NoCs must handle different traffic differently
Packet Switching vs. Circuit Switching: Performance Perspective
[Figure: timing diagrams across Src node, Intm. node1, Intm. node2, Intm. node3, and Dest node. Packet-switched: data incurs router pipeline and link traversal at each hop (network delay). Circuit-switched: setup and ack (setup delay), then data with link traversal (network delay).]
Packet Switching vs. Circuit Switching: Energy Perspective
[Figure: packet-switched and circuit-switched data paths, with the allocation & arbitration steps labeled]
Circuit-switched NoC: potentially energy efficient for certain traffic patterns
Packet Switching or Circuit Switching
• Packet switching: flexible and scalable, at a cost in latency and energy
• Circuit switching: better latency and energy, at a cost in setup and maintenance
• Regular frequency, fixed destination: circuit switching
• Erratic frequency, fixed destination: packet switching
• Regular or erratic frequency, random destination: packet switching
NoC with both packet and circuit switching?
Multi-plane vs. Single-plane
• Multi-plane: independent packet-switched (PS) and circuit-switched (CS) planes; increased hardware requirement, low resource utilization
• Single-plane: packet and circuit switching share the same communication fabric (PS+CS)
How can packet and circuit switching share the same fabric?
Space-Division Multiplexing (SDM)
Physically divide a channel into sub-channels
[Figure: a channel split among flows A (4 bits), B (2 bits), C (1 bit), and D (1 bit)]
• K. Lusala et al., IJRC 2012
• S. Secchi et al., DSD 2008
• A. K. Lusala, ReCoSoC 2011
• M. Modarressi et al., DATE 2009
SDM suffers from the packet serialization problem
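A rough cycle count illustrates the serialization problem. The sketch below is an illustration rather than the authors' model: it assumes an 8-bit flit that must travel entirely over its own sub-channel, one sub-channel width per cycle. The function name `cycles_to_send` and the flit size are hypothetical; the sub-channel widths are the ones shown on the slide.

```python
import math

def cycles_to_send(flit_bits, subchannel_bits):
    """Cycles needed to push one flit through a sub-channel of the given width."""
    return math.ceil(flit_bits / subchannel_bits)

# Sub-channel widths from the slide: A=4, B=2, C=1, D=1 bits.
widths = {"A": 4, "B": 2, "C": 1, "D": 1}

# Assumed 8-bit flit (hypothetical size chosen for illustration).
sdm_cycles = {flow: cycles_to_send(8, w) for flow, w in widths.items()}
print(sdm_cycles)
```

Under these assumptions a flit on the 1-bit sub-channels is serialized over eight cycles, while the same flit on an undivided 8-bit channel would take one.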
Time-Division Multiplexing (TDM)
[Figure: flows A, B, C, and D share the full 8-bit channel in time; slots 0 to 7 are assigned A, A, D, C, B, B, A, A]
We propose a TDM-based hybrid-switched NoC!
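The slot assignment on this slide can be written down as a small lookup. This is a minimal sketch: the slot order (A, A, D, C, B, B, A, A for times 0 to 7) is read off the slide, while the names `SCHEDULE` and `owner`, and the assumption that the same eight-slot pattern repeats indefinitely, are illustrative.

```python
from collections import Counter

# Slot-to-flow assignment read off the slide's timeline (time 0..7).
SCHEDULE = ["A", "A", "D", "C", "B", "B", "A", "A"]

def owner(cycle):
    """Flow that owns the full-width channel in the given cycle.

    Assumes the eight-slot pattern simply repeats (an assumption, not stated
    on the slide).
    """
    return SCHEDULE[cycle % len(SCHEDULE)]

# Number of slots each flow receives per eight-slot period.
share = Counter(SCHEDULE)
print(dict(share))
```

Each flow keeps the same share of bandwidth as in the SDM figure (4, 2, 1, and 1 out of 8), but uses the whole channel width in its own slots.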
Outline • Introduction • Design TDM-based Hybrid-switching NoC • Optimizations for Hybrid Switching • Conclusion
Hybrid-switched Router
[Figure: router with routing logic, VC allocator, SW allocator, packet-switched and circuit-switched paths at each input (Input 1 to Input n), slot tables, and a crossbar driving Output 1 to Output n. Packet-switched pipeline: BW, RC, VA, SA, ST. Circuit-switched pipeline: HP, ST.]
Circuit-switched Path Setup
[Figure: a circuit-switched path across routers R0 to R5, with CS time-slot reservations (t0 to t7) shown along the path]
• Set up the path before transmission
• Setup messages are sent through the packet-switched network
• Acknowledge the source upon successful setup
Keep time-slot assignment in Slot Tables
Slot Table Configuration Walkthrough
[Figure: a slot table with slots s0 to s3 and columns v and out for inputs in_1 and in_2, shown at four steps]
• ① setup 1: in_1 → out_4, slot_id = 2, duration = 2 (succeed)
• ② setup 2: in_1 → out_3, slot_id = 3, duration = 1 (fail)
• ③ teardown 1: in_1 → out_4, slot_id = 2, duration = 2
• ④ slot table after teardown
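The walkthrough can be replayed with a small model of one router's slot table. This is a sketch under stated assumptions, not the paper's hardware: the class `SlotTable` and its methods are hypothetical, a reservation is assumed to cover `duration` consecutive slots starting at `slot_id`, and wrap-around past the last slot is an added assumption. Under that reading, setup 2 fails because slot s3 of in_1 is already held by setup 1.

```python
class SlotTable:
    """Hypothetical per-router slot table: one row per time slot, one entry per input."""

    def __init__(self, num_slots, inputs):
        self.num_slots = num_slots
        # table[slot][input] is the reserved output port, or None (slot left to
        # packet switching).
        self.table = [{port: None for port in inputs} for _ in range(num_slots)]

    def _slots(self, slot_id, duration):
        # Consecutive slots starting at slot_id; wrap-around is an assumption.
        return [(slot_id + i) % self.num_slots for i in range(duration)]

    def setup(self, in_port, out_port, slot_id, duration):
        slots = self._slots(slot_id, duration)
        for s in slots:
            row = self.table[s]
            # Conflict if the input is already reserved in this slot, or if
            # another input already holds the requested output.
            if row[in_port] is not None or out_port in row.values():
                return False
        for s in slots:
            self.table[s][in_port] = out_port
        return True

    def teardown(self, in_port, out_port, slot_id, duration):
        for s in self._slots(slot_id, duration):
            if self.table[s][in_port] == out_port:
                self.table[s][in_port] = None


st = SlotTable(4, ["in_1", "in_2"])
setup1_ok = st.setup("in_1", "out_4", slot_id=2, duration=2)  # step 1 on the slide
setup2_ok = st.setup("in_1", "out_3", slot_id=3, duration=1)  # step 2 on the slide
st.teardown("in_1", "out_4", slot_id=2, duration=2)           # step 3 on the slide
```

After the teardown every entry is free again, which is how step ④ is interpreted here.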
Slot Table Size
• Larger slot table: more energy overhead, longer packet waiting time, finer-grain multiplexing
• Smaller slot table: less energy overhead, shorter packet waiting time, coarser-grain multiplexing
[Figure: from an initial size, the active portion of the slot table grows as more requests arrive and returns to the initial size on reset; the remainder stays inactive]
Slot table size should be adjusted dynamically
Circuit-Switched Path Exclusiveness
[Figure: the slot table and the SW allocator drive the crossbar configuration signals]
slot  v  out
s0    1  out_3
s1    1  out_3
s2    0  (PS)
s3    1  out_2
s4    1  out_2
s5    0  (PS)
s6    1  out_1
s7    1  out_1
Slots with v = 1 are exclusively occupied by circuit-switched paths.
• The crossbar must be configured before a circuit-switched flit's arrival.
• A time slot is wasted if the circuit-switched flit is not present.
Time-slot Stealing
[Figure: slot table (v, out), line address decoder, crossbar configuration signals, SW allocator, VC allocator, and a "valid CS flit" enable signal from the upstream router]
Enable path reuse between packet- and circuit-switched data paths
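The idea behind time-slot stealing can be expressed as a per-cycle decision. The sketch below is an illustrative reading of the slide, not the actual router logic: the function `crossbar_grant` and its parameters are hypothetical, with `cs_flit_valid` standing in for the "valid CS flit" signal from the upstream router and `ps_request` for the output requested by the packet-switched flit that won switch allocation (or `None`).

```python
def crossbar_grant(slot_valid, cs_flit_valid, reserved_out, ps_request):
    """Return (source, output) granted the crossbar this cycle, or None if idle."""
    if slot_valid and cs_flit_valid:
        # The reserved slot is used by its circuit-switched flit.
        return ("CS", reserved_out)
    if ps_request is not None:
        # Either the slot is unreserved, or it is reserved but no CS flit
        # arrived; in the second case the packet-switched flit steals it.
        return ("PS", ps_request)
    return None


used = crossbar_grant(True, True, "out_3", "out_1")
stolen = crossbar_grant(True, False, "out_3", "out_1")
idle = crossbar_grant(True, False, "out_3", None)
```

Without stealing, the second case would leave the reserved slot wasted even though a packet-switched flit was waiting.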
Hybrid-switched Network
• Path setup: endpoint selection targets frequent communication pairs; route selection uses adaptive routing, with the routing decision based on the utilization of slot tables in neighbor routers
• Switching decision: made by referring to packet slack*
*J. Yin et al., ISLPED 2012
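One way to picture a routing decision based on neighbor slot-table utilization is shown below. The slide does not give the selection rule, so this sketch assumes the simplest one: among the candidate next hops, prefer the neighbor whose slot table has the fewest reserved slots. The function names and the representation of a slot table as a list of valid bits are hypothetical.

```python
def utilization(slot_valid_bits):
    """Fraction of slots in a slot table that are reserved (valid bit set)."""
    return sum(slot_valid_bits) / len(slot_valid_bits)

def pick_next_hop(candidates):
    """Pick the candidate neighbor with the least-utilized slot table.

    candidates maps a neighbor name to that neighbor's list of slot valid bits.
    Choosing the minimum is an assumed policy, used here only for illustration.
    """
    return min(candidates, key=lambda name: utilization(candidates[name]))


choice = pick_next_hop({
    "east":  [1, 1, 0, 1],   # 3 of 4 slots reserved
    "north": [0, 1, 0, 0],   # 1 of 4 slots reserved
})
```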
Full System Evaluation Platform
[Figure: tiled layout of CPU cores (C), GPU SMs (G), L2 caches (L2), and memory controllers (M) attached to MEM, with routers (R)]
• CPU benchmarks: ammp, applu, art, equake, gafort, mgrid, swim, wupwise
• GPU benchmarks: blackscholes, lps, lib, nn, hotspot, pathfinder, sto
Performance Evaluation
• CPU: ↑ 0.3%. CPU performance impact is negligible
• GPU: ↑ 4.1%. GPU performance is improved
Network Energy Evaluation 6.3% saving
Overall: Basic Hybrid-switched NoC
• 0.3% CPU performance improvement
• 4.1% GPU performance improvement
• 6.3% network energy reduction
Can we do better?
Outline • Introduction • Design TDM-based Hybrid-switching NoC • Optimizations for Hybrid Switching • Conclusion
Opportunity: Low Path Utilization
• Large number of overlapped circuit-switched paths
• Circuit-switched paths are underutilized
• Waste of on-chip resources (slot tables)
Optimization: Path Sharing
[Figure: a circuit-switched path with hitchhiker-sharing sources, and a circuit-switched path with vicinity-sharing destinations]
Enable path reuse among circuit-switched data paths
Performance Evaluation
• CPU: ↑ 0.3% (basic hybrid switching), ↑ 0.2% (with path sharing)
• GPU: ↑ 4.1% (basic hybrid switching), ↑ 3.7% (with path sharing)
Network Energy Evaluation
• Basic hybrid switching: 6.3% saving
• With path sharing: 9.0% saving
Can we do EVEN better?
Opportunity: Lower Buffer Pressure
[Figure: percentage of flits that are circuit-switched versus packet-switched]
Observation: Circuit switching diverts on-chip traffic, alleviating the buffer pressure on packet-switched data paths.
Optimization: Aggressive Power-gating
Circuit-switching some of the packets alleviates buffer pressure, which facilitates more aggressive power gating.
[Figure: Input 1 with its packet-switched and circuit-switched paths and the slot table, with components marked active or inactive]
Reduce dynamic and leakage power dissipation
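Performance Evaluation
• CPU: ↑ 0.3% (basic hybrid switching), ↑ 0.2% (with path sharing), ↓ 1.6% (with aggressive power gating)
• GPU: ↑ 4.1% (basic hybrid switching), ↑ 3.7% (with path sharing), ↑ 2.6% (with aggressive power gating)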
Network Energy Evaluation
• Basic hybrid switching: 6.3% saving
• With path sharing: 9.0% saving
• With aggressive power gating: 17.1% saving
Energy saving is significant
Overall
• 1.6% CPU performance degradation
• 2.6% GPU performance improvement
• 17.1% network energy reduction
Conclusion • TDM-based Hybrid-switched Network • TDM is an efficient way to enable on-chip resource sharing • Hybrid-switched NoC handles different traffic differently • Performance • Energy efficiency • Scalability (in paper)