
Best of Both Worlds: A Bus-Enhanced Network on-Chip (BENoC)

Ran Manevich, Isask’har (Zigi) Walter, Israel Cidon, and Avinoam Kolodny. Technion – Israel Institute of Technology. May 2009.


Presentation Transcript


  1. Best of Both Worlds: A Bus-Enhanced Network on-Chip (BENoC) • Ran Manevich, Isask’har (Zigi) Walter, Israel Cidon, and Avinoam Kolodny • Technion – Israel Institute of Technology • May 2009

  2. Network on-Chip: the Good News • Interconnect for SoCs, CMPs and FPGAs • Multi-hop, packet-based communication • Efficient resource sharing • Scalable performance and efficiency in: • Power • Area • Design productivity [Figure label: System Bus]

  3. Network on-Chip: the Bad News • Increased and hard-to-predict latency due to multi-hop routing and resource sharing • Time-critical signals • Broadcast? Multicast? • No easy solutions • Slow (tens of cycles) • “I wish I had a bus at hand…”

  4. Solution: Bus-Enhanced NoC (BENoC) • Bus re-introduced as a NoC “add-on” • Use bus for short meta-data • Low bandwidth, low latency • Broadcast, multicast • Use NoC for data • Optimized for high bandwidth • Overhead should be justified! [Figure: NoC routers (R) and modules]

  5. Related Work • In-band support of time-critical communication and in-band multicast/broadcast • Complex router implementation • Suffer from multi-hop latency • Existing bus-NoC hybrids • Form a topological hierarchy • Typically bus used for local communication [Figure: NoC routers (R) and modules]

  6. BENoC Services • Fast unicast and multicast signaling • CMP cache example • Anycast • Find resources that fulfill certain conditions • E.g., “Looking for an idling DSP” or “Where are the 5 closest multipliers?” • Convergecast • Efficient collection of feedback back to the initiator • Barrier synchronization, …
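The anycast and convergecast services listed above can be pictured with a small software model. This is only a sketch under stated assumptions: the `Module` record, its `kind`, `idle`, and `distance` fields, and the AND-style merge of replies are all hypothetical and are not taken from the slides, which do not describe how the services are implemented.

```python
from dataclasses import dataclass


@dataclass
class Module:
    module_id: int
    kind: str      # illustrative label such as "DSP" or "multiplier"
    idle: bool
    distance: int  # hypothetical distance from the requester


def anycast(modules, predicate):
    """Return any one module that satisfies the query, or None."""
    for module in modules:
        if predicate(module):
            return module
    return None


def closest_k(modules, predicate, k):
    """Return up to k matching modules, nearest first."""
    matches = [m for m in modules if predicate(m)]
    return sorted(matches, key=lambda m: m.distance)[:k]


def convergecast_and(replies):
    """Merge per-module feedback into one answer (barrier-style AND)."""
    return all(replies)


# Made-up example system, for illustration only.
modules = [
    Module(0, "DSP", idle=False, distance=1),
    Module(1, "multiplier", idle=True, distance=4),
    Module(2, "DSP", idle=True, distance=3),
    Module(3, "multiplier", idle=True, distance=2),
]

idle_dsp = anycast(modules, lambda m: m.kind == "DSP" and m.idle)
nearest = closest_k(modules, lambda m: m.kind == "multiplier", 5)
```

Here `idle_dsp` models the “Looking for an idling DSP” query and `nearest` models the “closest multipliers” query; `convergecast_and` stands in for a barrier where the initiator only needs to learn that every module has replied positively.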

  7. Additional BENoC Applications • NoC control • Router configuration • E.g., routing table configuration • Adapt NoC routing for load balancing • Fault discovery and recovery • System control • Power management • Resource load balancing • Debug

  8. Outline • Introduction • MetaBus architecture • MetaBus latency and energy analysis • CMP cache use case

  9. Conventional System Buses • Bandwidth optimized • Poor scalability • Not suitable for tasks in BENoC • Figure copied from “AMBA Specifications Rev 2.0” - http://www.arm.com/products/solutions/AMBA_Spec.html

  10. MetaBus Design Requirements • Low area, low power • Low bandwidth • Low latency • Simple • Versatile • Scalable • Multicast and broadcast support • Acknowledgement [Figure: NoC routers (R) and modules, with all modules also attached to the “MetaBus”]

  11. MetaBus Architecture • Many possible implementations • Example: tree topology with distributed arbitration [Figure: tree with a root, five bus stations, and Module#1 to Module#9]

  12. Data Path • Data to root • Data to receivers [Figure: tree with a root, five bus stations, and Module#1 to Module#9]

  13. Example: Broadcast of Two Words • Address word propagates to the root • Data word 1 propagates to the modules • Data word 2 [Figure: tree with a root, five bus stations, and Module#1 to Module#9]
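The two phases of the broadcast example (a word travels up to the root, then the data words travel down to the modules) can be sketched in a few lines. The tree below is a made-up four-module layout, not the nine-module tree of the slide figure, and the choice to skip delivery back to the sender is an assumption.

```python
class Node:
    """A tree node: the root, a bus station, or a leaf module."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def path_to_root(node):
    """Names visited while a word climbs from a module to the root."""
    path = [node.name]
    while node.parent is not None:
        node = node.parent
        path.append(node.name)
    return path


def leaves_below(node):
    """Names of all modules (leaves) in the subtree under a node."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves_below(child))
    return names


def broadcast(source, root, words):
    """Phase 1: climb to the root. Phase 2: deliver words to every other module."""
    up_path = path_to_root(source)
    receivers = [n for n in leaves_below(root) if n != source.name]  # assumption
    return up_path, {name: list(words) for name in receivers}


# Hypothetical layout for illustration.
root = Node("Root")
s1, s2 = Node("BusStation1", root), Node("BusStation2", root)
m1, m2 = Node("Module#1", s1), Node("Module#2", s1)
m3, m4 = Node("Module#3", s2), Node("Module#4", s2)

up_path, delivered = broadcast(m3, root, ["data word 1", "data word 2"])
```

In this model `up_path` lists the stations the address word crosses on its way to the root, and `delivered` maps each receiving module to the data words it sees.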

  14. Distributed Arbitration Mechanism • Bus Request • Bus Grant [Figure: tree with a root, three bus stations, and Module#1 to Module#3]
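One way to picture distributed arbitration on a tree is that each station forwards a request from one of its requesting children toward the root, and the grant then retraces that path downward. The sketch below assumes a fixed-priority policy (the first requesting child wins); the slides do not state the actual policy, and the tree layout and helper names are hypothetical.

```python
def leaf(name):
    """Build a leaf module."""
    return {"name": name, "children": []}


def station(name, *children):
    """Build a bus station (or the root) with the given children."""
    return {"name": name, "children": list(children)}


def arbitrate(node, requesting):
    """Return the grant path from this node down to the winner, or None.

    Each station picks its first child whose subtree holds a request
    (fixed priority, an assumption made only for this sketch).
    """
    if not node["children"]:
        return [node["name"]] if node["name"] in requesting else None
    for child in node["children"]:
        path = arbitrate(child, requesting)
        if path is not None:
            return [node["name"]] + path
    return None


# Hypothetical layout for illustration.
tree = station(
    "Root",
    station("BusStation1", leaf("Module#1"), leaf("Module#2")),
    station("BusStation2", leaf("Module#3")),
)

grant_path = arbitrate(tree, {"Module#2", "Module#3"})
```

With both Module#2 and Module#3 requesting, the grant follows Root, BusStation1, Module#2 in this model; a fairer policy such as round-robin would need per-station state, which is omitted here.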

  15. Masking Saves Power • Unicast from Module#3 to Module#5 • Address word propagates to the root • Data word 1 propagates to the modules [Figure: tree with a root, BusStation 1 to BusStation 5, and Module#1 to Module#9; mask bits Mask1 to Mask5 = 1, 0, 1, 0, 1 for BusStation 1 to BusStation 5]
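The masking idea can be sketched as one bit per station that says whether any destination lies in its subtree, so stations with a zero bit need not drive the data downward. The tree below is a hypothetical seven-module layout and does not reproduce the slide figure or its mask values; treating the root as a maskable station is also an assumption.

```python
def compute_masks(tree, destinations):
    """Return {station: 0 or 1}; 1 means a destination is below that station.

    tree maps each station name to its children; names that are not keys
    of tree are treated as leaf modules.
    """
    masks = {}

    def visit(node):
        if node not in tree:              # leaf module
            return node in destinations
        active = False
        for child in tree[node]:          # visit every child so all masks are set
            if visit(child):
                active = True
        masks[node] = 1 if active else 0
        return active

    visit("Root")
    return masks


# Hypothetical layout for illustration.
tree = {
    "Root": ["BusStation1", "BusStation2"],
    "BusStation1": ["Module#1", "BusStation3"],
    "BusStation3": ["Module#2", "Module#3"],
    "BusStation2": ["BusStation4", "BusStation5"],
    "BusStation4": ["Module#4", "Module#5"],
    "BusStation5": ["Module#6", "Module#7"],
}

unicast_masks = compute_masks(tree, {"Module#5"})
active_stations = sum(unicast_masks.values())
```

For a unicast to Module#5 in this layout only three of the six stations stay active, which is the intuition behind the power saving; a broadcast would set every mask bit.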

  16. (Binary) Bus Station

  17. MetaBus Floorplan – An Example • 64-module balanced binary MetaBus
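For a sense of scale, the size of a balanced binary tree over 64 modules follows from simple arithmetic. The sketch below assumes every internal tree node (including the root) is one station and counts tree edges only; it says nothing about wire length, clock cycles, or the actual floorplan.

```python
def balanced_binary_stats(num_modules):
    """Depth, internal node count, and up-plus-down edge count of the tree."""
    if num_modules < 2 or num_modules & (num_modules - 1):
        raise ValueError("num_modules must be a power of two, at least 2")
    depth = num_modules.bit_length() - 1   # log2 of a power of two
    return {
        "depth": depth,                    # edges from a module up to the root
        "internal_nodes": num_modules - 1, # stations, counting the root
        "broadcast_edges": 2 * depth,      # up to the root, then down to a module
    }


stats = balanced_binary_stats(64)
```

Under these assumptions a 64-module tree is 6 levels deep, so a word crosses 6 edges to reach the root and 6 more to reach any receiver.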

  18. Outline • Introduction • MetaBus architecture • MetaBus Latency and energy analysis • CMP cache use case

  19. Analysis Highlights 1/4 • NoC Broadcast+Unicast Energy/Transaction:

  20. Analysis Highlights 2/4 • MetaBus Broadcast and Unicast Energy/Transaction:

  21. Analysis Highlights 3/4 • NoC unicast and broadcast latency:

  22. Analysis Highlights 4/4 • MetaBus unicast and broadcast latency:

  23. Results - Energy Consumption • Energy consumption for broadcast and unicast transactions of 3 data words • Setup: 10×10 mm chip, 64-module mesh, 1 GHz NoC clock, speed-optimized bus at 0.18 µm • [Figure: bus and NoC unicast and broadcast energy per transaction]

  24. Results - Latencies • Latencies of broadcast and unicast transactions of 3 data words in a system with a frequency and a speed optimized MetaBus • Setup: 10×10 mm chip, 64-module mesh, 1 GHz NoC clock, speed-optimized bus at 0.18 µm • Figure 9: Bus and NoC broadcast latencies

  25. Outline • Introduction • MetaBus architecture • MetaBus Latency and energy analysis • CMP cache use case

  26. Dynamic Non-Uniform Cache Access • Split large cache into independent smaller banks • Non-uniform cache access time (NUCA) • Cache lines are moved to shorten access time • Dynamic NUCA • Before fetching a line into its L1$, a CPU needs to find the L2 cache bank storing the line [Figure: CMP (Chip Multi Processor) with CPUs, each with an L1$, around an array of L2$ banks]
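The search problem in a dynamic NUCA can be sketched as a two-phase read: first locate the bank that currently holds the line, then fetch the data from it. The model below assumes the locate step is the kind of short query the bus would carry while the line itself travels over the NoC; the slides do not spell out this split for the cache use case, and the bank contents, addresses, and function names are made up.

```python
def locate_line(banks, line_addr):
    """Search phase: ask every bank whether it holds the line."""
    for bank_id, lines in banks.items():
        if line_addr in lines:
            return bank_id
    return None


def read_line(banks, line_addr):
    """Locate the line, then fetch it from the owning bank (data phase)."""
    bank_id = locate_line(banks, line_addr)
    if bank_id is None:
        return None                      # line is not in any L2 bank
    return bank_id, banks[bank_id][line_addr]


def migrate_line(banks, line_addr, target_bank):
    """Dynamic NUCA: move a line to another bank, e.g. one nearer its user."""
    source = locate_line(banks, line_addr)
    if source is None or source == target_bank:
        return
    banks[target_bank][line_addr] = banks[source].pop(line_addr)


# Hypothetical banks: bank id -> {line address: data}.
banks = {0: {0x100: "A"}, 1: {}, 2: {0x240: "B"}}

before = read_line(banks, 0x240)
migrate_line(banks, 0x240, 1)
after = read_line(banks, 0x240)
```

Because lines migrate, the owning bank changes between `before` and `after`, which is why a CPU cannot simply compute the bank from the address and needs a search step.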

  27. Simulation Setup • 16 processors, 64 L2 cache banks • PARSEC and SPLASH-2 benchmarks • Vanilla wormhole NoC • Simulations account for bus latency, arbitration time, etc.

  28. Simulation Results • Performance improvement in BENoC compared to a NoC-based CMP: (a) average read transaction latency; (b) application speed

  29. Summary • Current NoCs are largely distributed • Borrowing concepts from off-chip networks • On-chip environment provides an opportunity • Enhancing the network with a bus gives the best of both worlds • Advanced services are easily supported • Anycast, management and control • Cost effective • Power and performance • Analysis and simulation

  30. Bus-Enhanced NoC • QNoC Research Group • Thank you! Questions? • zigi@tx.technion.ac.il
