The Case for Embedded NoCs on FPGAs Mohamed ABDELFATTAH Vaughn BETZ
Outline 1 Why NoCs on FPGAs? 2 Embedded NoCs 3 Area & Power Analysis 4 Comparison Against P2P/Buses
1. Why NoCs on FPGAs? Motivation Logic Blocks Switch Blocks Wires Interconnect
1. Why NoCs on FPGAs? Motivation Logic Blocks Switch Blocks • Hard Blocks: • Memory • Multiplier • Processor Wires
1. Why NoCs on FPGAs? Motivation 1600 MHz Hard Interfaces DDR/PCIe .. Logic Blocks 800 MHz Switch Blocks Interconnect still the same • Hard Blocks: • Memory • Multiplier • Processor Wires 200 MHz
1. Why NoCs on FPGAs? Motivation 1600 MHz Problems: • Bandwidth requirements for hard logic/interfaces • Timing closure DDR3 PHY and Controller PCIe Controller 800 MHz 200 MHz Gigabit Ethernet
1. Why NoCs on FPGAs? Motivation Problems: • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated DDR3 PHY and Controller PCIe Controller Gigabit Ethernet
Source: Google Earth Los Angeles Barcelona Keep the “roads”, but add “freeways”. Logic Cluster Hard Blocks
1. Why NoCs on FPGAs? FPGA with NoC NoC Problems: • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated DDR3 PHY and Controller Router forwards data packet PCIe Controller Links Router moves data to local interconnect Routers Gigabit Ethernet
1. Why NoCs on FPGAs? FPGA with NoC Problems: • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated • Abstraction favours modularity: • Parallel compilation • Partial reconfiguration • Multi-chip interconnect DDR3 PHY and Controller PCIe Controller • High bandwidth endpoints known • Pre-design NoC to requirements Gigabit Ethernet • NoC links are “re-usable” • NoC is heavily “pipelined” • NoC abstraction favors modularity
1. Why NoCs on FPGAs? FPGA with NoC Problems: • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated • Abstraction favours modularity: • Parallel compilation • Partial reconfiguration • Multi-chip interconnect DDR3 PHY and Controller NoCs can simplify FPGA design PCIe Controller How to integrate NoCs in FPGAs? Does the NoC abstraction come at a high area/power cost? How do embedded NoCs compare to current interconnects? Gigabit Ethernet • Latency-tolerant communication • NoC abstraction favors modularity
Outline 1 Why NoCs on FPGAs? 2 Embedded NoCs Mixed NoCs Hard NoCs 3 Area & Power Analysis 4 Comparison Against P2P/Buses
2. Embedded NoCs • “Soft” NoC = Soft Routers + Soft Links • “Mixed” NoC = Hard Routers + Soft Links • “Hard” NoC = Hard Routers + Hard Links
Methodology Soft Mixed Hard FPGA CAD Tools ASIC CAD Tools Area Speed Design Compiler Power? Power HSPICE Gate-level simulation Gate-level simulation Toggle rates
2. Embedded NoCs Mixed NoCs “Mixed” NoC = Hard Routers + Soft Links. Figure labels: FPGA, Logic blocks, Programmable “soft” interconnect, Baseline Router
2. Embedded NoCs Mixed NoCs “Mixed” NoC = Hard Routers + Soft Links. Figure labels: FPGA, Router
2. Embedded NoCs Mixed NoCs Special Feature: Configurable topology • Assumed a mesh • Can form any topology. Figure labels: FPGA, Router
2. Embedded NoCs Hard NoCs “Hard” NoC = Hard Routers + Hard Links. Figure labels: FPGA, Logic blocks, Programmable “soft” interconnect, Dedicated “hard” interconnect, Router
2. Embedded NoCs Hard NoCs “Hard” NoC = Hard Routers + Hard Links. Figure labels: FPGA, Router
2. Embedded NoCs Hard NoCs “Hard” NoC = Hard Routers + Hard Links. Special Feature: Low-V mode (1.1 V, 0.9 V) • Save 33% Dynamic Power • ~15% slower. Figure labels: FPGA, Router
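The 33% figure for the low-voltage mode is consistent with the standard first-order CMOS relation, in which dynamic power scales with the square of the supply voltage. The sketch below checks that arithmetic. It assumes that switching activity, capacitance and clock frequency are held constant, which the slides do not state, and the function name is only illustrative.

```python
# Sketch: first-order CMOS dynamic power scaling, P_dyn ~ a * C * V^2 * f.
# Assumption (not from the slides): activity, capacitance and clock
# frequency stay the same, so only the V^2 term changes.

def dynamic_power_ratio(v_low: float, v_nominal: float) -> float:
    """Return P_dyn(v_low) / P_dyn(v_nominal) at equal frequency."""
    return (v_low / v_nominal) ** 2

# Supply voltages quoted on the slide: 1.1 V nominal, 0.9 V low-V mode.
ratio = dynamic_power_ratio(0.9, 1.1)
saving_percent = (1.0 - ratio) * 100.0
print(f"dynamic power saving: {saving_percent:.0f}%")
```

If the clock were also slowed by the quoted ~15%, the same relation would predict a larger saving, so the slide's number most plausibly refers to the voltage term alone.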
2. Embedded NoCs Fabric Port Bridge NoC and FPGA fabric: • Width adaptation • Frequency adaptation • Voltage adaptation • Bus protocol e.g. AXI
Outline 1 Why NoCs on FPGAs? 2 Embedded NoCs 3 Area & Power Analysis System Area/Power Soft vs. Mixed vs. Hard 4 Comparison Against P2P/Buses
3. Area/Power Analysis Router Microarchitecture • State-of-the-art router architecture from Stanford • The NoC community has excelled at building on-chip routers: we just use one • To meet FPGA bandwidth requirements: high-performance router • Complex functionality such as virtual channels: assigning traffic priority could be useful
3. Area/Power Analysis Routers and Links • Hard Router vs. Soft Router: 30X smaller, 6X faster, 14X lower power • Hard Links vs. Soft Links: 9X smaller, 2.4X faster, 1.4X lower power
3. Area/Power Analysis Soft, Mixed and Hard [65 nm] 64-node NoC on Stratix III
• Area: Hard 448 LBs, Mixed 576 LBs (~1.5% of FPGA); Soft ~12,500 LBs (33% of FPGA)
• Speed: Hard and Mixed 730 – 940 MHz; Soft 166 MHz
• Bisection BW: Hard and Mixed ~50 GB/s; Soft ~10 GB/s
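The bisection bandwidth figures can be roughly reproduced from the clock speeds. The sketch below assumes an 8x8 mesh and 32-bit links, counting both directions across the cut. Neither assumption is stated on this slide (a mesh is assumed elsewhere in the talk, and a 32-bit router appears on a backup slide), so treat the result as a sanity check rather than the authors' calculation.

```python
# Sketch: peak bisection bandwidth of a 2D mesh NoC.
# Assumptions (not stated on this slide): 8x8 mesh for the 64 nodes,
# 32-bit links, and both link directions counted across the cut.

def mesh_bisection_gbytes_per_s(rows: int, link_bits: int, freq_mhz: float) -> float:
    """Bandwidth across a vertical cut through the middle of the mesh."""
    links_across_cut = 2 * rows          # one link per row, in each direction
    bytes_per_cycle = link_bits / 8      # payload moved per link per cycle
    return links_across_cut * bytes_per_cycle * freq_mhz * 1e6 / 1e9

soft_bw = mesh_bisection_gbytes_per_s(rows=8, link_bits=32, freq_mhz=166)
mixed_bw = mesh_bisection_gbytes_per_s(rows=8, link_bits=32, freq_mhz=730)
hard_bw = mesh_bisection_gbytes_per_s(rows=8, link_bits=32, freq_mhz=940)

print(f"soft : {soft_bw:.1f} GB/s")
print(f"mixed: {mixed_bw:.1f} GB/s")
print(f"hard : {hard_bw:.1f} GB/s")
```

Under these assumptions the soft NoC lands near the quoted ~10 GB/s, and the 730 to 940 MHz range brackets the quoted ~50 GB/s.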
3. Area/Power Analysis Soft, Mixed and Hard [65 nm] 64-node NoC on Stratix III
• Area: Hard (Low-V) 448 LBs, Mixed 576 LBs (~1.5% of FPGA); Soft ~12,500 LBs (33% of FPGA)
• Speed: Hard (Low-V) and Mixed 730 – 940 MHz; Soft 166 MHz
• Bisection BW: Hard (Low-V) and Mixed ~50 GB/s; Soft ~10 GB/s
• Provides ~50 GB/s peak bisection bandwidth
• Very cheap! Less than the cost of 3 soft nodes
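The "less than the cost of 3 soft nodes" claim follows from the logic-block counts on the slide. The short sketch below works through it. The device size is not given in the slides, so it is inferred here from the approximate "~12,500 LBs is 33% of FPGA" figure and should be read as an estimate.

```python
# Sketch: area arithmetic using the logic-block (LB) counts from the slide.
SOFT_NOC_LBS = 12_500      # approximate, as quoted
MIXED_NOC_LBS = 576
HARD_NOC_LBS = 448
NODES = 64
SOFT_FRACTION_OF_FPGA = 0.33

soft_lbs_per_node = SOFT_NOC_LBS / NODES          # about 195 LBs
three_soft_nodes = 3 * soft_lbs_per_node          # about 586 LBs

# Device size implied by the soft NoC occupying 33% of the FPGA (an estimate).
implied_fpga_lbs = SOFT_NOC_LBS / SOFT_FRACTION_OF_FPGA

mixed_fraction = MIXED_NOC_LBS / implied_fpga_lbs
hard_fraction = HARD_NOC_LBS / implied_fpga_lbs

print(f"3 soft nodes  : {three_soft_nodes:.0f} LBs")
print(f"mixed NoC area: {mixed_fraction:.1%} of FPGA")
print(f"hard NoC area : {hard_fraction:.1%} of FPGA")
```

Both the 448 LB and the 576 LB NoCs come in under the roughly 586 LBs that three soft nodes would occupy, and the mixed NoC works out to about 1.5% of the implied device.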
3. Area/Power Analysis NoC Power Budget 250 GB/s total bandwidth 123% How much is used for system-level communication? 17.4 W Largest Stratix-III device Typical FPGA Dynamic Power
3. Area/Power Analysis NoC Power Budget 250 GB/s total bandwidth 123% 15% NoC 17.4 W Typical FPGA Dynamic Power
3. Area/Power Analysis NoC Power Budget 250 GB/s total bandwidth 11% 123% 15% NoC 17.4 W Typical FPGA Dynamic Power
3. Area/Power Analysis NoC Power Budget 250 GB/s total bandwidth 7% 11% 123% 15% NoC 17.4 W Typical FPGA Dynamic Power
3. Area/Power Analysis Bandwidth in Perspective DDR3 Module 1 PCIe Module 2 14.6 GB/s Full theoretical BW 14.6 GB/s Cross whole chip! 17 GB/s 17 GB/s 17 GB/s 17 GB/s 14.6 GB/s Aggregate Bandwidth 126 GB/s 14.6 GB/s NoC Power Budget 3.5%
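The aggregate figure on this slide is the sum of the individual flows shown. The slide text lists four flows of 17 GB/s and four of 14.6 GB/s, and the flattened text does not show which module each flow belongs to, so the sketch below only checks the total.

```python
# Sketch: summing the per-flow bandwidths listed on the slide.
# Which module (DDR3 or PCIe) each flow belongs to is not recoverable
# from the slide text, so the flows are treated as two anonymous groups.
flows_gbytes_per_s = [17.0] * 4 + [14.6] * 4

aggregate = sum(flows_gbytes_per_s)
print(f"aggregate bandwidth: {aggregate:.1f} GB/s")
```

The total of about 126 GB/s matches the "Aggregate Bandwidth" label on the slide.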
Outline 1 Why NoCs on FPGAs? 2 Embedded NoCs 3 Area & Power Analysis 4 Comparison Against P2P/Buses Point-to-point links Qsys Buses
4. Comparison FPGA Interconnect • Interconnect = Just wires • Interconnect = Wires + Logic • Interconnect = NoC • Compare “wires” interconnect to NoCs. Figure labels: Point-to-point Links (1 to 1); Broadcast (1 to n); Multiple Masters (Mux + Arbiter); Multiple Masters, Multiple Slaves (Mux + Arbiter)
4. Comparison NoC Power vs. FPGA Interconnect High Performance / Packet Switched Length of 1 NoC Link 1 % area overhead on Stratix 5 200 MHz Runs at 730-943 MHz Power on-par with simplest FPGA interconnect Hard and Mixed NoCs Area/Power Efficient
4. Comparison DDR3: Qsys Bus vs. NoC Embedded NoC: 16 Nodes, hard routers & links Qsys bus: Build logical bus from fabric
4. Comparison Design Effort close • Steps to close timing using Qsys FPGA
4. Comparison Design Effort far • Steps to close timing using Qsys FPGA
4. Comparison Design Effort far • Steps to close timing using Qsys FPGA Timing closure can be simplified with an embedded NoC
4. Comparison Area Comparison
4. Comparison Area Comparison Entire NoC smaller than bus for 3 modules!
4. Comparison Area Comparison 1/8 Hard NoC BW used already less area for most systems
4. Comparison Power Comparison Hard NoC saves power for even the simplest systems
1 Why NoCs on FPGAs? Big city needs freeways to handle traffic
2 Embedded NoCs: Mixed & Hard • Power: 9-15X • Area: 20-23X • Speed: 5-6X
3 Area & Power Analysis • Area Budget for 64 nodes: ~1% • Power Budget for 100 GB/s: 3-7%
4 Comparison Against P2P/Buses • Raw efficiency close to simplest P2P links • NoC more efficient & lower design effort
Thank You! eecg.utoronto.ca/~mohamed/noc_designer.html
2. Embedded NoCs Fabric Port • 200 MHz 128-bit module, 900 MHz 32-bit router? • Configurable time-domain mux / demux: match bandwidth • Asynchronous FIFO: cross clock domains Full NoC bandwidth, w/o clock restrictions on modules
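The width and frequency adaptation in the fabric port can be checked with simple bandwidth arithmetic. The sketch below uses the numbers from the slide (a 128-bit module at 200 MHz and a 32-bit router at 900 MHz). The helper name and the idea of a fixed integer mux ratio are illustrative assumptions, not details taken from the talk.

```python
# Sketch: bandwidth matching across a fabric port with time-domain muxing.
# Numbers are from the slide; the helper and the fixed integer mux ratio
# are illustrative assumptions.

def bandwidth_gbits_per_s(width_bits: int, freq_mhz: float) -> float:
    """Raw bandwidth of a port of the given width and clock."""
    return width_bits * freq_mhz / 1000.0

MODULE_WIDTH_BITS, MODULE_MHZ = 128, 200   # soft module in the FPGA fabric
ROUTER_WIDTH_BITS, ROUTER_MHZ = 32, 900    # embedded NoC router port

module_bw = bandwidth_gbits_per_s(MODULE_WIDTH_BITS, MODULE_MHZ)
router_bw = bandwidth_gbits_per_s(ROUTER_WIDTH_BITS, ROUTER_MHZ)

# Each 128-bit module word is serialized into this many 32-bit router words.
tdm_ratio = MODULE_WIDTH_BITS // ROUTER_WIDTH_BITS

# Slowest router clock that still keeps up with the module.
min_router_mhz = MODULE_MHZ * tdm_ratio

print(f"module: {module_bw:.1f} Gb/s, router: {router_bw:.1f} Gb/s")
print(f"mux ratio {tdm_ratio}:1, router needs at least {min_router_mhz} MHz")
```

With a 4:1 mux ratio the router needs at least 800 MHz to absorb the module's traffic, so a 900 MHz router has headroom. The asynchronous FIFO on the slide is what lets the two clocks be unrelated.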
1. Why NoCs on FPGAs? Compute Acceleration GPU CPU • Maxeler • Geoscience (14x, 70x) • Financial analysis (5x, 163x) • Altera OpenCL • Video compression (3x, 114x) • Information filtering (5.5x)
1. Why NoCs on FPGAs? Compute Acceleration