
Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

Presentation Transcript


  1. Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip
     Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim
     Department of Computer Science and Engineering, Texas A&M University

  2. Multi-Core Wave & Networks-On-Chip
     • Uniprocessors hit the power wall.
     • Multiprocessors provide high performance at a lower power budget.
     • Shared-bus architectures have limited scalability.
     • Networks-On-Chip (NOCs) orchestrate chip-wide communication for future many-core processors.
     Example chips:
     • MIT Raw (0.18 µm, 300 MHz): 16-core chip, four 4x4 mesh networks
     • Intel Polaris (65 nm, 4 GHz): 80-core chip, 8x10 mesh network
     Lei Wang - NOCS 2009

  3. Challenges in On-Chip Communication
     • High performance
       • Low communication latency is critical for high system performance.
     • Bandwidth efficiency
       • Well-designed routing algorithms provide high network throughput.
     • Power and area constraints
       • Simple topologies and slim routers reduce communication power consumption and save chip area.
     • Efficient multicast support
       • Cache coherence protocols rely heavily on multicast or broadcast communication.
     We propose a bandwidth-efficient routing algorithm for multicast communication in NOCs with low latency and low power consumption.

  4. Prior Work in Multicast Communication
     • Routing Evaluation Criteria for Multicast Communication [Ni93]
       • Multicast in multicomputer systems
     • Tree-based Multicast Routing for DSM Multiprocessor [Torrellas96]
       • Short-message multicast in DSM systems
     • Virtual Circuit Tree Multicasting for NOCs [Lipasti08]
       • Demonstrates the necessity of on-chip multicasting
       • Proposes table-based multicast routing
     • Region-based Multicast for CMPs [Duato08]
       • Multicast routing for irregular topologies in CMPs

  5. Outline
     • Motivation
     • Multicast Router Design
       • State-of-the-art Unicast Router Architecture
       • Replication Schemes
       • Destination List Management
     • Recursive Partitioning Multicast (RPM)
       • Network Partitioning
       • Routing Rules
       • Example
       • Deadlock Avoidance
     • Evaluation
     • Conclusion

  6. Different Bandwidth Usage Example
     • The left path requires 11 link traversals, 12 buffer writes, 15 buffer reads, and 15 crossbar traversals.
     • The right path requires 5 link traversals, 6 buffer writes, 10 buffer reads, and 10 crossbar traversals.
     [Figure: two 4x4 meshes (nodes 0-15) marking the source, the destinations, and the left and right multicast paths]
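The gap between the two paths on this slide can be made concrete with a little arithmetic. The sketch below only reuses the counts quoted on the slide; the dictionary names and the `reduction_percent` helper are illustrative and not part of the presentation.

```python
# Resource counts quoted on the slide for the two multicast paths.
LEFT_PATH = {"link traversals": 11, "buffer writes": 12,
             "buffer reads": 15, "crossbar traversals": 15}
RIGHT_PATH = {"link traversals": 5, "buffer writes": 6,
              "buffer reads": 10, "crossbar traversals": 10}


def reduction_percent(before, after):
    """Percentage saved when going from `before` to `after`, rounded to 0.1."""
    return round(100 * (before - after) / before, 1)


# Per-resource savings of the right path relative to the left path.
savings = {name: reduction_percent(LEFT_PATH[name], RIGHT_PATH[name])
           for name in LEFT_PATH}

for name, pct in savings.items():
    print(f"{name}: {pct}% fewer on the right path")
```

For these two paths the saving is largest on link traversals, which is the resource the choice of route affects most directly.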

  7. State-of-Art Wormhole Unicast Router
     [Figure: per-hop pipeline with RC, VA, SA, and ST in the router followed by LT on the link, shown for two consecutive hops]
     • RC: Route Computation
     • VA: VC Allocation
     • SA: Switch Allocation
     • ST: Switch Traversal
     • LT: Link Traversal

  8. What Do We Need in a Multicast Router?
     • Packet Replication
       • Synchronous Replication
       • Asynchronous Replication
     • Destination List Management
       • All-destination Encoding
       • Bit String Encoding
       • Multiple-region Broadcast Encoding
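Of the destination-list options on this slide, bit string encoding is the easiest to illustrate. The following is a minimal sketch, assuming a 16-node network in which bit i of the header field is set when node i is a destination. The function names and the 16-node default are assumptions made for illustration, not details from the presentation.

```python
def encode_destinations(dests, num_nodes=16):
    """Pack destination node IDs into an integer used as a bit string."""
    bits = 0
    for node in dests:
        if not 0 <= node < num_nodes:
            raise ValueError(f"node {node} is outside 0..{num_nodes - 1}")
        bits |= 1 << node  # bit i marks node i as a destination
    return bits


def decode_destinations(bits, num_nodes=16):
    """Recover the sorted list of destination node IDs from the bit string."""
    return [node for node in range(num_nodes) if (bits >> node) & 1]


header = encode_destinations({3, 5, 12})
print(f"{header:016b}")             # one bit per node, node 15 on the left
print(decode_destinations(header))  # destination IDs in ascending order
```

The header size is fixed at one bit per node regardless of how many destinations a packet has, which is the usual trade-off of this encoding.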

  9. Synchronous Replication
     • Packet replication happens at the Switch Traversal stage.
     [Figure: timing diagram over cycles 0-3 showing head (H), middle (M), and tail (T) flits moving from inputs 0-3 to outputs 0-3]

  10. Asynchronous Replication
      [Figure: timing diagram over cycles 0-3 showing head (H), middle (M), and tail (T) flits moving from inputs 0-3 to outputs 0-3]

  11. Network Partitioning
      [Figure: mesh partitioned around the source node, with the N, W, S, and E directions and part numbers placed around the source]
      • Eight Parts
      • Three Parts: (0, 1, 7), (1, 2, 3), (3, 4, 5), or (5, 6, 7)
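The partitioning step can be sketched as a function that classifies each destination by its position relative to the source. This is a minimal sketch under stated assumptions: nodes are (x, y) coordinates with x growing east and y growing north, and parts are numbered counterclockwise starting at the top right (0 = top right, 1 = straight north, 2 = top left, 3 = straight west, 4 = bottom left, 5 = straight south, 6 = bottom right, 7 = straight east). That numbering is an assumption chosen to be consistent with the three-part groups listed on the slide; the names `part_of` and `partition` are hypothetical.

```python
def part_of(src, dst):
    """Return the part (0-7) containing dst relative to src, or None if equal."""
    dx = dst[0] - src[0]  # positive means dst is east of src
    dy = dst[1] - src[1]  # positive means dst is north of src
    if dx == 0 and dy == 0:
        return None
    if dy > 0:
        return 0 if dx > 0 else 1 if dx == 0 else 2
    if dy < 0:
        return 6 if dx > 0 else 5 if dx == 0 else 4
    return 7 if dx > 0 else 3  # same row: straight east or straight west


def partition(src, dests):
    """Group destinations by part; a destination equal to src is skipped."""
    parts = {}
    for dst in dests:
        part = part_of(src, dst)
        if part is not None:
            parts.setdefault(part, []).append(dst)
    return parts


# A source in the bottom-left corner only sees parts 0, 1, and 7.
print(partition((0, 0), [(1, 1), (0, 2), (3, 0)]))
```

Under this numbering, a source in the bottom-left corner can only produce parts 0, 1, and 7, which agrees with one of the three-part cases listed on the slide.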

  12. Basic Routing Rules
      • North: top right corner.
      • West: top left corner.
      • South: bottom left corner.
      • East: bottom right corner.
      [Figure: source and destination nodes with their N, W, S, and E output directions]

  13. Optimized Routing Rules
      [Figure: source and destination nodes, annotated "Deadlock!!!"]
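The basic rules can be read as a fixed mapping from parts to output ports. The sketch below assumes parts are numbered counterclockwise from the top right corner (0 = top right, 1 = straight north, 2 = top left, 3 = straight west, 4 = bottom left, 5 = straight south, 6 = bottom right, 7 = straight east), and that a destination lying straight along an axis leaves through that axis's port. Both the numbering and the names `BASIC_PORT` and `replicate` are illustrative assumptions, and the optimized rules are not modeled here.

```python
# Output port per part under the basic rules: each corner part shares a
# port with the axis part that follows it clockwise.
BASIC_PORT = {
    0: "N", 1: "N",  # top right corner and straight north
    2: "W", 3: "W",  # top left corner and straight west
    4: "S", 5: "S",  # bottom left corner and straight south
    6: "E", 7: "E",  # bottom right corner and straight east
}


def replicate(parts_with_dests):
    """Group destinations by output port: one packet copy per port used."""
    copies = {}
    for part, dests in parts_with_dests.items():
        copies.setdefault(BASIC_PORT[part], []).extend(dests)
    return copies


# Destinations in parts 0 and 1 share the north copy; part 6 goes east.
print(replicate({0: [5], 1: [9], 6: [14]}))
```

Each copy carries only the destinations assigned to its port, and the next router would repeat the same partitioning on that shorter list.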

  14. RPM Example-step 1
      [Figure legend: Multicast Packet (M), Source, Destination, Partitioning]

  15. RPM Example-step 2
      [Figure legend: Multicast Packet (M), Source, Destination, Partitioning, Ejection]

  16. RPM Example-step 3
      [Figure legend: Multicast Packet (M), Source, Destination, Partitioning]

  17. RPM Example-step 4
      [Figure legend: Multicast Packet (M), Source, Destination, Partitioning, Ejection]

  18. RPM Example-step 5
      [Figure legend: Multicast Packet (M), Source, Destination, Partitioning, Ejection]

  19. Deadlock Avoidance
      • RPM has no turn restrictions, which can introduce deadlock.
      • We use virtual networks (VNs) to avoid deadlock.
      • Two VNs share the same physical network.
      • The virtual channels of each port are divided equally between the two virtual networks.
      • The virtual network ID (0 or 1) of each packet is decided at the source.
      [Figure: two 4x4 meshes (nodes 0-15) labeled Virtual Network 0 and Virtual Network 1]
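The equal division of virtual channels between the two virtual networks can be sketched as follows. This is a minimal sketch that assumes each VN gets a contiguous range of VC indices at every port; the slide only says the split is equal, so the contiguous layout and the function name `vcs_for_vn` are assumptions. How the source picks the VN ID is not modeled.

```python
def vcs_for_vn(vn_id, num_vcs):
    """Return the VC indices at one port that belong to virtual network vn_id."""
    if vn_id not in (0, 1):
        raise ValueError("virtual network ID must be 0 or 1")
    if num_vcs < 2 or num_vcs % 2 != 0:
        raise ValueError("need an even number of VCs (at least 2) to split equally")
    half = num_vcs // 2
    # Assumed layout: VN 0 takes the lower half, VN 1 the upper half.
    return list(range(vn_id * half, (vn_id + 1) * half))


print(vcs_for_vn(0, 4))  # VCs usable by packets tagged with VN 0
print(vcs_for_vn(1, 4))  # VCs usable by packets tagged with VN 1
```

Because a packet keeps the VN ID it was given at the source, it only ever competes for the VCs in its own half at each port.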

  20. Evaluation Methodology
      • Performance model: cycle-accurate network simulator
        • Models all router pipeline stages in detail
        • Highly parameterized
      • Power model: Orion, with both dynamic and leakage power models
      [Table: Network configuration]

  21. Uniform Random Traffic
      • Latency improves by around 50% before network saturation.
      • Network throughput is extended by 40%.
      [Figure annotations: 50%, 40%, 40%]

  22. Link Utilization
      • Under low workload, RPM reduces link utilization by 33%.
      • Under high workload, RPM reduces link utilization by 45%.
      [Figure annotations: 33%, 45%]

  23. Dynamic Power Consumption
      [Figure annotations: 50%, 40%]

  24. Scalability Study-Network Size
      [Figure annotation: Over 50%]

  25. Scalability Study-Multicast Traffic Portion

  26. Scalability Study-Destination Number

  27. Conclusion
      • We propose a new multicast routing algorithm, Recursive Partitioning Multicast (RPM)
        • Bandwidth-efficient and scalable
      • Performance improvement
        • Up to 50% latency reduction
        • 33% link utilization reduction
      • Power savings
        • Up to 40% total dynamic power savings
        • 25% crossbar and link power savings

  28. Thank you!

  29. Backup

  30. Hardware Implementation of Routing Logic

  31. Bit Complement Traffic

  32. Transpose Traffic
