DART: A Programmable Architecture for NoC Simulation on FPGAs

Presentation Transcript


  1. DART: A Programmable Architecture for NoC Simulation on FPGAs • Danyao Wang*†, Natalie Enright Jerger*, J. Gregory Steffan* • *Department of Electrical & Computer Engineering, University of Toronto • †Google Inc. • International Symposium on Network-on-Chip

  2. Why yet another NoC simulator? • Software simulators • Stand-alone or integrated • Parallel NoC simulator (DARSIM) • FPGA-based models • Direct-mapped NoC emulators (Genko et al., NoCem) • Dynamic reconfiguration (DRNoC) • Decoupled timing and functional models (RAMPGold, ProtoFlex, A-Ports) • Analytical models: FIST

  3. Why yet another NoC simulator? • At 100 KIPS, 1 s of execution at 1 GHz = 10K s ≈ 2.8 hrs • The benefit of thread-based parallelization is limited by high synchronization overhead
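The arithmetic on this slide can be checked with a short sketch. It assumes one simulated cycle per simulated instruction, so 100 KIPS is treated as 100,000 simulated cycles per second; the function and variable names are illustrative and do not come from the talk.

```python
def wall_clock_seconds(simulated_seconds, target_hz, sim_rate_per_sec):
    """Wall-clock time needed to simulate `simulated_seconds` of target execution."""
    # Assumption: the simulator advances one target cycle per simulated instruction.
    target_cycles = simulated_seconds * target_hz
    return target_cycles / sim_rate_per_sec


seconds = wall_clock_seconds(1.0, 1e9, 100e3)  # 1 s at 1 GHz, simulated at 100 KIPS
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.1f} h")  # prints "10000 s = 2.8 h"
```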

  4. Why yet another NoC simulator? • FPGA-based models: orders of magnitude faster! • But hardware changes cost hours of synthesis-place-route time

  5. DART: Hybrid Approach • Generic NoC simulation engine • Fixed-function nodes for basic NoC building blocks: router, traffic generator, link • Software-configurable parameters in each node • Simulate different NoCs without changing hardware [Figure: the PC sends configuration and commands over UART to the control FSM and DART simulator on the FPGA, and receives simulation results]

  6. Why yet another NoC simulator?

  7. DART Simulator Architecture

  8. Generic NoC Model • Topology • Routing algorithm • Flow control • Router microarchitecture • Simulated traffic • Link properties [Figure labels: global interconnect, router, traffic generator, flit queue]

  9. DART Architecture • Global timer: synchronizes all network transfers to a global time counter

  10. DART Nodes • Parameters implemented using a shift register • Configuration byte stream generated on the PC and sent to the FPGA
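A minimal sketch of how a PC-side tool might serialize node parameters into a configuration byte stream for a shift-register chain. The field values, bit widths, and MSB-first ordering are assumptions made for illustration; the slides do not specify DART's actual configuration format.

```python
def pack_config(fields):
    """Pack (value, bit_width) pairs MSB-first into bytes, zero-padded to a byte boundary."""
    bits = []
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        # Emit the field one bit at a time, most significant bit first.
        bits.extend((value >> i) & 1 for i in reversed(range(width)))
    bits.extend([0] * (-len(bits) % 8))  # pad the tail with zeros
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


# Hypothetical parameters for one node: (value, width in bits).
node_fields = [(3, 4), (2, 4), (5, 8)]
stream = pack_config(node_fields)
print(stream.hex())  # prints "3205"
```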

  11. Simulating a NoC • Map simulated NoC to DART nodes • Program the routing tables to implement the simulated topology • Record timing of flit transfers
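As an illustration of programming routing tables to realize a topology, the sketch below builds per-node tables for a 2D mesh. Dimension-order (XY) routing, the row-major node numbering, and the port names are assumptions chosen for the example; the slide does not say which routing algorithm or table layout DART uses.

```python
def xy_routing_table(node, width, height):
    """Return {destination: output_port} for `node` in a width x height mesh.

    Nodes are numbered row-major, with row 0 at the top (north).
    """
    x, y = node % width, node // width
    table = {}
    for dest in range(width * height):
        dx, dy = dest % width, dest // width
        if dx > x:
            port = "E"      # correct the X dimension first
        elif dx < x:
            port = "W"
        elif dy > y:
            port = "S"      # then the Y dimension
        elif dy < y:
            port = "N"
        else:
            port = "LOCAL"  # the flit has arrived
        table[dest] = port
    return table


# One table per simulated router in a 3x3 (9-node) mesh.
tables = {n: xy_routing_table(n, 3, 3) for n in range(9)}
print(tables[4])
```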

  12. Example Walk-Through (slides 12 to 24) [Animated figure; labels: Global Timer, Global Interconnect, Traffic Generator, Router, Flit Queues, and counters running from 0 to 7]

  25. Example Walk-Through • # injected: 1 • # received: 1 • Σlatency: 6 [Figure: final frame of the walk-through; labels: Global Timer, Global Interconnect]
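The counters on this slide can be mimicked with a small sketch. The class name and the injection and arrival timestamps (0 and 6) are assumptions picked to reproduce Σlatency = 6; the slide shows only the final counter values.

```python
from dataclasses import dataclass


@dataclass
class TrafficStats:
    injected: int = 0
    received: int = 0
    latency_sum: int = 0

    def record_injection(self):
        self.injected += 1

    def record_arrival(self, injected_at, arrived_at):
        # Latency is measured in simulated cycles of the global timer.
        self.received += 1
        self.latency_sum += arrived_at - injected_at

    def average_latency(self):
        return self.latency_sum / self.received if self.received else None


source, sink = TrafficStats(), TrafficStats()
source.record_injection()
sink.record_arrival(injected_at=0, arrived_at=6)
print(source.injected, sink.received, sink.latency_sum, sink.average_latency())
```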

  26. DART Router • Virtualizes the ports → replace crossbar with MUX • No large switch allocators and crossbars • Routes 1 flit per DART cycle • N cycles for N ports • Input ports selected based on timestamp • Multiplexing in time saves area [Figure labels: Router, DART Router, input ports 0 to 4, routing logic, allocator, routing table, arbiter]
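A behavioral sketch of the time-multiplexed router described above: one flit is routed per DART cycle, taken from the input port whose head flit carries the oldest timestamp. The class, the tuple layout, and the tie-break by lowest port index are assumptions for illustration, not the DART hardware design.

```python
from collections import deque


class TimeMultiplexedRouter:
    def __init__(self, num_ports, routing_table):
        self.inputs = [deque() for _ in range(num_ports)]
        self.routing_table = routing_table  # destination -> output port

    def enqueue(self, port, timestamp, dest):
        self.inputs[port].append((timestamp, dest))

    def step(self):
        """One DART cycle: route at most one flit, oldest head-of-queue first."""
        candidates = [(q[0][0], p) for p, q in enumerate(self.inputs) if q]
        if not candidates:
            return None
        # min() picks the smallest timestamp; ties go to the lowest port index.
        _, port = min(candidates)
        timestamp, dest = self.inputs[port].popleft()
        return timestamp, port, self.routing_table[dest]


router = TimeMultiplexedRouter(5, {7: "E", 1: "W"})
router.enqueue(port=2, timestamp=3, dest=7)
router.enqueue(port=0, timestamp=5, dest=1)
router.enqueue(port=4, timestamp=3, dest=1)
routed = [router.step() for _ in range(4)]
print(routed)  # [(3, 2, 'E'), (3, 4, 'W'), (5, 0, 'W'), None]
```

Serving three occupied ports takes three calls to `step()`, which mirrors the "N cycles for N ports" cost of replacing the crossbar with a single multiplexer.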

  27. DART Summary • Configurable functional model of a NoC • Easy to modify and reuse • Fast by exploiting fine-grained parallelism • Decouples simulated cycles from FPGA cycles • Trades simulation speed for area and programmability • Software-configurable parameters • Familiar simulation flow and fast turn-around time

  28. Evaluation & Results Overhead Architecture Scalability Implementation & Performance

  29. Methodology • C++ cycle-accurate architecture simulator • Explore various DART architectures • Evaluate performance trade-offs • 9-node implementation on a Virtex-II Pro FPGA • Baseline: Booksim 2.0 • Cycle-based software simulator (C++) • Metrics • Overhead: DART cycles/simulated cycle (CPS) • Performance: Thousands of simulated cycles per second
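The two metrics are related through the FPGA clock: performance is the clock rate divided by the CPS overhead. The sketch below uses the 50 MHz clock quoted later in the talk; the CPS values are made-up inputs for illustration, not measured results.

```python
def simulated_kcycles_per_sec(fpga_clock_hz, dart_cycles_per_sim_cycle):
    """Thousands of simulated cycles per second for a given CPS overhead."""
    return fpga_clock_hz / dart_cycles_per_sim_cycle / 1e3


# Hypothetical CPS values; higher CPS means more DART cycles per simulated cycle.
speeds = {cps: simulated_kcycles_per_sec(50e6, cps) for cps in (10, 50, 200)}
for cps, kcycles in speeds.items():
    print(f"CPS={cps}: {kcycles:.0f} K simulated cycles/s")
```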

  33. Programmability Overhead • Measure performance overhead of global interconnect and simplified router model • Four combinations of two options • Interconnect: dedicated vs. global • Router: 5-port vs. 1-port • Baseline: dedicated + 5-port • Benchmarks: 9-node mesh and 64-node mesh

  34. Overhead: 9-node DART • Overhead (2-3x) due to global interconnect • Overhead (2-6x) due to 1-port router • Router overhead dominates [Plot: overhead of the simulated 9-node DART, lower is better, for four configurations: dedicated links + true 5-ported router, dedicated links + 1-ported router, global interconnect + 5-ported router, global interconnect + 1-ported router]

  35. Overhead: 64-node DART • Simulated NoC saturates • Global interconnect is the bottleneck [Plot: overhead of the simulated 64-node DART, lower is better, for four configurations: dedicated links + true 5-ported router, dedicated links + 1-ported router, global interconnect + 5-ported router, global interconnect + 1-ported router]

  36. Scalability • Compare DART’s performance scaling to Booksim beyond 9 nodes • 64-node DART with 8-partition global interconnect • Benchmarks: mesh sizes from 9 to 64 • DART performance extrapolated from architecture simulator assuming 50 MHz clock

  37. Scalability: Mesh Benchmarks • DART simulation speed depends on network load only • Higher speedups over Booksim for large NoCs [Plot: 64-node DART vs. Booksim, with an arrow marking the faster direction]

  38. An Implementation of DART • 9 nodes (max. that fit) • 8-partition interconnect • 50 MHz • XUPV2P Development Board, Virtex-II Pro XC2VP30

  39. Real Speed-up vs. Booksim • 70x to 160x speedup • Slower with more traffic • Large NoC simulations can become more interactive [Plot: DART speedup over Booksim, with an arrow marking the faster direction]

  40. Future Work • Virtualize DART nodes using multithreading • Further trade performance for area • Off-chip traffic generation • Integrate with full-system evaluation framework • Better coverage of the router design space • Adaptive routing, speculative routing, etc. • Investigate specialized soft processors

  41. Summary • Software configurable FPGA-based NoC simulator is feasible • Area overhead vs. existing emulators is negligible • Over 100x speedup over software NoC simulator (Booksim) • Hardware and software tools available at http://www.eecg.toronto.edu/DART

  42. Q & A Thank you!

  43. Backup Slides • Classic Router Microarchitecture • Global Interconnect • DART Software Flow • Correctness Analysis • Interconnect Performance vs. Resource Utilization • DART vs. Booksim Speedup

  44. Classic Router Microarchitecture

  45. Global Interconnect

  46. DART Software • DARTgen • Placement of simulated nodes in DART partitions • Evenly distribute nodes across partitions to balance load • Generate configuration bytes • DARTportal • Communicates with the DART simulator on FPGA through serial port • Interactive [Figure labels: FPGA, UART, Control FSM, DART Simulator]
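One simple way to spread simulated nodes evenly across partitions is round-robin assignment, sketched below. This is an assumed stand-in for illustration: the slide states the goal of DARTgen's placement (balanced load) but not the algorithm it uses.

```python
def place_nodes(num_nodes, num_partitions):
    """Assign simulated node IDs to partitions round-robin."""
    partitions = [[] for _ in range(num_partitions)]
    for node in range(num_nodes):
        partitions[node % num_partitions].append(node)
    return partitions


# 9 simulated nodes on an 8-partition interconnect, as in the implementation slide.
placement = place_nodes(9, 8)
print(placement)  # [[0, 8], [1], [2], [3], [4], [5], [6], [7]]
```

Round-robin balances node counts only, so partition sizes differ by at most one; balancing actual traffic would need per-node load estimates.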

  47. Correctness (1/2) • booksim: 5-cycle routing delay • booksim2: 4-cycle routing delay + 1-cycle switch allocation delay

  48. Correctness (2/2) • Booksim has longer tail [Plot: series for 0-hop, 1-hop, 2-hop, 3-hop, and 4-hop packets]

  49. Interconnect Scalability (1/2) [Plots: flit injection rate = 0.1 and flit injection rate = 0.5]

  50. Interconnect Scalability (2/2)
