1 / 27

ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation. Overview. Background Rent’s Rule Overcoming pin limitations through scheduling “Virtual wires” implementation Veloce 2 DEEP. The Challenge. Making a large multi-FPGA system is easy. Making it programmable is hard.

ashanti
Download Presentation

ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ECE 636Reconfigurable ComputingLecture 13Logic Emulation

  2. Overview • Background • Rent’s Rule • Overcoming pin limitations through scheduling • “Virtual wires” implementation • Veloce 2 • DEEP

  3. The Challenge • Making a large multi-FPGA system is easy. Making it programmable is hard. • New approach is a software technology that facilitates hardware implementation. • Effectively make a large number of discrete devices look like one large one. • Leads to low-cost, scalable multi-FPGA substrate.

  4. Logic Emulation • Emulation takes a sizable amount of resources • Compilation time can be large due to FPGA compiles • One application: also direct ties to other FPGA computing applications.

  5. Are Meshes Realistic? • The number of wires leaving a partition grows with Rent’s Rule P = KGB • Perimeter grows as G0.5 but unfortunately most circuits grow at GB where B > 0.5 • Effectively devices highly pin limited • What does this mean for meshes?

  6. Possible Device Scenarios • Rent’s Rule indicates that pin limited situation is getting worse. • Frequently some logic must be left unused leading to limited utilization • Perhaps this logic can be “reclaimed”

  7. Partition vs FPGA Pin Count • FPGAs don’t have enough pins • Problem may or may not get worse depending on “structured” design.

  8. Virtual Wires • Overcome pin limitations by multiplexing pins and signals • Schedule when communication will take place.

  9. Virtual Wires Software Flow • Global router enhanced to include scheduling and embedding. • Multiplexing logic synthesized from FPGA logic.

  10. FPGA 1 FPGA 2 FPGA 4 FPGA 3 A Simple Example

  11. Clocking Strategy • Evaluation and communication split into phases • Longest dependency path determines number of phases • Overall emulation performance

  12. Example Scheduling • Initial phase requires one uClk for computation, one for communication. • Second phase requires 2 communication uClks due to through hop. • Note this example assumed needed bandwidth was available.

  13. Routing Algorithm • For each phase, only some internal signals are ready for routing. • Routing resources between FPGAs may be considered channels. • Solution: Route signals use maze route for each phase. • If available bandwidth not present, delay signals until later phases.

  14. Worst Case Microcycle Count V >= max ( L*D, PC/Pf ) • Most designs dominated by latency bound. • If original design has been pipelined this is less of an issue L = critical path length D = network diameter PC = max circuit partition pin count Pf = FPGA pin count

  15. uCLK Improved Scheduling • Overlap computation and communication. • Effectively create a “data flow” of information • Schedule communication to happen as soon as possible • No need for phases.

  16. Physical Implementation • Small finite state machine encoded and placed in each FPGA • Current implementation is one-hot encoding.

  17. System Implementation • Low cost hardware • So simple a graduate student can build it

  18. Benchmark Designs • Sparcle – modified Sparc processor • 17K gates • 4,352 bits of memory • Emulated in circuit. • CMMU – cache controller for scalable multiprocessor system • 85K gates • Designed as gate array and optimized with SIS • Palindrome • 14K gates • systolic

  19. Emulation Results • At least 31 FPGAs needed for HW full connectivity (>100 for torus) • Some degradation in overall system performance.

  20. Device Utilization • Approximately 45% of CLBs used for design logic. • ~10% virtual wires overhead

  21. Utilizations • As devices scale projected utilization increases • Hardwired approach doesn’t scale • Equation ->

  22. Veloce 2 • Emulates up to 2 billion logic gates using 128 boards of FPGAs • VirtuaLAB environment • Allows emulator sharing across a company (128 users) • Allows for emulation of popular interfaces (Ethernet, USB, PCIe) • Significant company investment • 40 MG per hour compile • 11 KW power consumption Veloce 2 Quattro Source: Mentor Graphics Veloce2 brochure

  23. Veloce 2 Execution Environment Source: Mentor Graphics Veloce2 brochure • Emulation environment contains three parts • Generation of test vectors and other stimulus • Design under test (DUT) and other support tools • Design checking resources (response checker)

  24. Veloce 2 Source: Mentor Graphics Veloce2 brochure • Emulator must interact with a number of interfaces

  25. A Contemporary Example: DEEP • Developed at U. Delaware in 2011 • Ten FPGA processor boards • Backplane used for communication in conjunction with switch boards • Ethernet interface • Somewhat limited scalability

  26. A Contemporary Example: DEEP • Used to emulate a multi-threaded processor • Subblocks assigned to specific FPGAs • The same logic is repetitively used for different threads • Hand-partitioning of subblocks • Bottom line: 80K cycles per second

  27. Summary • Virtual wires overcome pin limitations by intelligently multiplexing I/O signals • Key CAD step is scheduling. Simplifies routing and partitioning. • Latest push is towards incremental compilation • Commercialized by Mentor Graphics – Veloce2 • Contemporary multi-FPGA systems often contain many boards • Some require multiplexing of subblocks

More Related