DART: A Programmable Architecture for NoC Simulation on FPGAs Danyao Wang*† Natalie Enright Jerger* J. Gregory Steffan* *Department of Electrical & Computer Engineering University of Toronto †Google Inc. International Symposium on Network-on-Chip
Why yet another NoC simulator? • Software simulators • Stand-alone or integrated • Parallel NoC simulator (DARSIM) • FPGA-based Models • Direct map NoC emulators (Genko et al., NoCem) • Dynamic reconfiguration (DRNoC) • Decoupled timing and functional model (RAMPGold, ProtoFlex, A-Ports) • Analytical models: FIST
Why yet another NoC simulator? • At 100 KIPS, 1 s of execution at 1 GHz = 10K s = 2.8 hrs • Benefits of thread-based parallelization are limited due to high synchronization overhead
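The 2.8-hour figure follows from simple arithmetic. A minimal sketch, assuming the target retires one instruction per cycle at 1 GHz (the `ipc` parameter is an assumption needed to equate cycles with instructions; the slide does not state it):

```python
def simulation_wall_time(target_seconds, target_hz, simulator_ips, ipc=1.0):
    """Wall-clock seconds needed to simulate `target_seconds` of target execution.

    ipc is an assumed instructions-per-cycle for the target (hypothetical default).
    """
    instructions = target_seconds * target_hz * ipc
    return instructions / simulator_ips

# 1 s of execution at 1 GHz on a 100 KIPS software simulator.
seconds = simulation_wall_time(1.0, 1e9, 100e3)
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.1f} h")  # prints "10000 s = 2.8 h"
```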
Why yet another NoC simulator? • FPGA-based models: orders of magnitude faster! • But hardware changes cost hours of synthesis-place-route time
DART: Hybrid Approach [Figure: a PC sends configuration and commands over UART to a control FSM and the DART simulator on the FPGA; simulation results return to the PC] • Generic NoC simulation engine • Fixed function nodes for basic NoC building blocks • Router, traffic generator, link • Software configurable parameters in each node • Simulate different NoCs without changing hardware
Why yet another NoC simulator?
DART Simulator Architecture
Generic NoC Model • Topology • Routing algorithm • Flow control • Router microarchitecture • Simulated traffic • Link properties [Figure labels: Global interconnect, Router, Traffic Generator, Flit Queue]
DART Architecture • Global Timer: synchronizes all network transfers to a global time counter
DART Nodes • Parameters implemented using a shift register • Configuration byte stream generated on the PC and sent to the FPGA
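To illustrate how a PC-side tool might build such a configuration byte stream, here is a minimal sketch. The field names, widths, and MSB-first ordering are hypothetical; the slides do not give the real DART parameter layout.

```python
def pack_config(fields):
    """Pack (value, bit_width) pairs MSB-first into a byte string.

    The field list is hypothetical; the actual DART shift-register layout
    is not described in the slides.
    """
    bits = ""
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        bits += format(value, f"0{width}b")
    # Pad to a whole number of bytes so the stream suits a byte-oriented link.
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

# Hypothetical router parameters: 4-bit node id, 3-bit routing delay, 5-bit buffer depth.
stream = pack_config([(5, 4), (4, 3), (16, 5)])
print(stream.hex())  # prints "5900"
```

On the FPGA side, each received bit would be shifted into the parameter register chain, so the order in which fields are packed has to match the order of the registers in the chain.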
Simulating a NoC • Map simulated NoC to DART nodes • Program the routing tables to implement the simulated topology • Record timing of flit transfers
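Programming routing tables to realize a topology can be sketched as follows for a 2D mesh. Dimension-order (XY) routing and the port names are assumptions chosen for illustration; the slides do not say which routing algorithm or table format DART uses.

```python
def xy_routing_table(node, width, height):
    """Return {destination: output_port} for `node` in a width x height mesh.

    Nodes are numbered row by row. XY routing and the port names are
    illustrative assumptions, not the DART table format.
    """
    x, y = node % width, node // width
    table = {}
    for dest in range(width * height):
        dx, dy = dest % width, dest // width
        if dx > x:
            table[dest] = "east"
        elif dx < x:
            table[dest] = "west"
        elif dy > y:
            table[dest] = "south"
        elif dy < y:
            table[dest] = "north"
        else:
            table[dest] = "local"
    return table

# One table per router of a 9-node (3x3) mesh.
tables = {n: xy_routing_table(n, 3, 3) for n in range(9)}
```

Changing the simulated topology then only requires regenerating and reloading the tables, which matches the goal of simulating different NoCs without changing hardware.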
Example Walk-Through [Animation frames: nodes labeled 0 to 7, each with a traffic generator, router, and flit queues, attached to the global interconnect; a counter steps from 0 through 7 across frames alongside the global timer]
Example Walk-Through (final frame) [Figure: nodes 0 to 7 on the global interconnect with the global timer] • Counters displayed in the figure (each appears twice): # injected: 1 • # received: 1 • Σlatency: 6
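The counters in the walk-through suggest how per-node statistics turn into an average latency. A minimal sketch, assuming each flit carries its injection timestamp from the global timer and that the flit in the example was injected at time 0 (both assumptions; the class and method names are hypothetical):

```python
class TrafficStats:
    """Per-node counters like those shown in the walk-through."""

    def __init__(self):
        self.injected = 0
        self.received = 0
        self.latency_sum = 0

    def inject(self, now):
        """Count an injected flit and return the timestamp it carries."""
        self.injected += 1
        return now

    def receive(self, inject_time, now):
        """Count a received flit and accumulate its latency in simulated cycles."""
        self.received += 1
        self.latency_sum += now - inject_time

    def average_latency(self):
        return self.latency_sum / self.received if self.received else None

stats = TrafficStats()
t0 = stats.inject(now=0)      # assumed injection time
stats.receive(t0, now=6)      # arrival 6 simulated cycles later
print(stats.injected, stats.received, stats.latency_sum)  # prints "1 1 6"
```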
DART Router [Figure labels: two router diagrams, each with input ports 0 to 4; routing logic, allocator, routing table, arbiter] • Virtualizes the ports: replaces the crossbar with a MUX • No large switch allocators and crossbars • Routes 1 flit per DART cycle • N cycles for N ports • Input ports selected based on timestamp • Multiplexing in time saves area
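The time-multiplexed router can be sketched in software as one selection per DART cycle. This is an illustrative model only: the flit representation, the oldest-timestamp-first policy, and tie-breaking by lowest port index are assumptions, since the slide only says that input ports are selected based on timestamp.

```python
from collections import deque

def route_one_flit(input_ports, routing_table):
    """Serve at most one flit (one DART cycle) from a list of input FIFOs.

    Each FIFO holds (timestamp, destination) tuples. The port whose head
    flit has the smallest timestamp wins; ties go to the lowest port index
    (an assumed policy). Returns (port, timestamp, output_port) or None.
    """
    candidates = [(q[0][0], i) for i, q in enumerate(input_ports) if q]
    if not candidates:
        return None
    _, port = min(candidates)
    timestamp, dest = input_ports[port].popleft()
    return port, timestamp, routing_table[dest]

# Hypothetical 3-port example: port 1 is empty.
ports = [deque([(3, 1)]), deque(), deque([(2, 0), (5, 1)])]
table = {0: "local", 1: "east"}
served = [route_one_flit(ports, table) for _ in range(4)]
print(served)
```

Because only one flit moves per call, draining one flit from each of N busy ports takes N DART cycles, which is the area-for-speed trade the slide describes.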
DART Summary • Configurable functional model of a NoC • Easy to modify and reuse • Fast by exploiting fine-grained parallelism • Decouple simulated cycles from FPGA cycles • Trade simulation speed for area and programmability • Software configurable parameters • Familiar simulation flow and fast turn-around time
Evaluation & Results • Overhead • Architecture Scalability • Implementation & Performance
Methodology • C++ Cycle-accurate architecture simulator • Explore various DART architectures • Evaluate performance trade-offs • 9-node implementation on a Virtex-II Pro FPGA • Baseline: Booksim 2.0 • Cycle-based software simulator (C++) • Metrics • Overhead: DART cycles/simulated cycle (CPS) • Performance: Thousands of simulated cycles per second
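The two metrics are related through the DART clock frequency: if the simulator runs continuously, simulated cycles per second equal the clock rate divided by CPS. A minimal sketch; the CPS value of 10 is a made-up input for illustration, not a measured result:

```python
def simulated_kcycles_per_second(dart_clock_hz, cps):
    """Thousands of simulated cycles per second for a given overhead.

    cps is DART cycles per simulated cycle; assumes the DART clock never stalls.
    """
    return dart_clock_hz / cps / 1e3

# 50 MHz DART clock with a hypothetical overhead of 10 DART cycles per simulated cycle.
rate = simulated_kcycles_per_second(50e6, 10)
print(rate)  # prints 5000.0
```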
Programmability Overhead • Measure performance overhead of global interconnect and simplified router model • Four combinations of two options • Interconnect: dedicated vs. global • Router: 5-port vs. 1-port • Baseline: dedicated+5-port • Benchmarks: 9-node mesh and 64-node mesh
Overhead: 9-node DART [Figure: overhead of a simulated 9-node DART for four configurations: dedicated links + true 5-ported router, dedicated links + 1-ported router, global interconnect + 5-ported router, global interconnect + 1-ported router; axis annotated "Lower Overhead"] • Overhead (2-6x) due to 1-port router • Overhead (2-3x) due to global interconnect • Router overhead dominates
Overhead: 64-node DART [Figure: overhead of a simulated 64-node DART for four configurations: dedicated links + true 5-ported router, dedicated links + 1-ported router, global interconnect + 5-ported router, global interconnect + 1-ported router; axis annotated "Lower Overhead"] • Simulated NoC saturates • Global interconnect is the bottleneck
Scalability • Compare DART’s performance scaling to Booksim beyond 9 nodes • 64-node DART with 8-partition global interconnect • Benchmarks: mesh sizes from 9 to 64 • DART performance extrapolated from architecture simulator assuming 50 MHz clock
Scalability: Mesh Benchmarks [Figure: simulation speed of 64-node DART vs. Booksim; axis annotated "Faster"] • DART simulation speed depends on network load only • Higher speedups over Booksim for large NoCs
An Implementation of DART • 9 Nodes (max. that fit) • 8-partition interconnect • 50 MHz • XUPV2P Development Board, Virtex-II Pro XC2VP30
Real Speed-up vs. Booksim [Figure: DART speedup over Booksim; axis annotated "Faster"] • 70x to 160x speedup • Slower with more traffic • Large NoC simulations can become more interactive
Future Work • Virtualize DART nodes using multithreading • Further trade performance for area • Off-chip traffic generation • Integrate with full-system evaluation framework • Better coverage of the router design space • Adaptive routing, speculative routing, etc. • Investigate specialized soft processors
Summary • Software configurable FPGA-based NoC simulator is feasible • Area overhead vs. existing emulators is negligible • Over 100x speedup over software NoC simulator (Booksim) • Hardware and software tools available at http://www.eecg.toronto.edu/DART
Q & A Thank you!
Backup Slides • Classic Router Microarchitecture • Global Interconnect • DART Software Flow • Correctness Analysis • Interconnect Performance vs. Resource Utilization • DART vs. Booksim Speedup
Classic Router Microarchitecture
Global Interconnect
DART Software [Figure labels: FPGA, UART, Control FSM, DART Simulator] • DARTgen • Placement of simulated nodes in DART partitions • Evenly distribute nodes across partitions to balance load • Generate configuration bytes • DARTportal • Communicates with the DART simulator on FPGA through serial port • Interactive
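The placement step can be pictured with a simple round-robin assignment. This is only a sketch of one way to spread nodes evenly by count; the slides do not describe DARTgen's actual placement algorithm, which may also account for traffic.

```python
def place_nodes(num_nodes, num_partitions):
    """Assign simulated node ids to partitions round-robin.

    Partition sizes differ by at most one. This is a hypothetical stand-in
    for DARTgen's placement, balancing node count only.
    """
    partitions = [[] for _ in range(num_partitions)]
    for node in range(num_nodes):
        partitions[node % num_partitions].append(node)
    return partitions

# 9 simulated nodes over an 8-partition interconnect.
placement = place_nodes(9, 8)
print(placement)  # prints [[0, 8], [1], [2], [3], [4], [5], [6], [7]]
```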
Correctness (1/2) • booksim: 5-cycle routing delay • booksim2: 4-cycle routing delay + 1-cycle switch allocation delay
Correctness (2/2) [Figure: curves labeled 0-hop packets, 1 hop, 2 hops, 3 hops, 4 hops] • Booksim has longer tail
Interconnect Scalability (1/2) [Figure panels: flit injection rate = 0.1; flit injection rate = 0.5]
Interconnect Scalability (2/2)