
A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs

A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs. † MIT Computer Science and AI Lab Computation Structures Group {pellauer, vmurali, arvind} @csail.mit.edu. ‡ Intel VSSAD Group {michael.adler, joel.emer} @intel.com. Michael Pellauer †


Presentation Transcript


  1. A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs †MIT Computer Science and AI Lab Computation Structures Group {pellauer, vmurali, arvind} @csail.mit.edu ‡Intel VSSAD Group {michael.adler, joel.emer} @intel.com Michael Pellauer† Muralidaran Vijayaraghavan† Michael Adler‡ Arvind† Joel Emer†‡

  2. Introduction
     • The modern circuit design flow: Specify System Requirements, Explore Architecture Alternatives, Write Circuit RTL, Verify, Physical Manufacture
     • FPGAs used here: alternative to ASIC tapeout
     • FPGAs used here: create prototype for verification
     • Interest in using FPGAs for performance models:
       • HAsim: re-implement the Intel Asim simulator on an FPGA (this talk)
       • Other projects: Liberty, UT-FAST
       • Performance modeling efforts in RAMP

  3. Performance Models
     [Figure labels: Simulation speed, Accuracy, Development time]
     • Created early in the design process
     • Drive architectural exploration and feasibility analysis
     • Parameterization, ease of change
     • Show how many clock cycles an operation takes
     • Do not show what the clock cycle time is
     • Homebrewed C or SystemC: synchronous simulation
     The software performance modeling crisis: FPGAs can help!
     • Performance models have a high degree of parallelism within a model clock cycle
     • It’s all modeling gates
     • But many interesting circuits have no good implementations on LUT-based FPGAs
       • CAMs, many-ported register files, nested MUXes, etc.
     • The solution:
       • Configure the FPGA into a circuit simulator
       • Virtualize the FPGA clock

  4. Performance Model on FPGA
     • Implemented using Bluespec SystemVerilog
     • Xilinx Virtex-II Pro 70 using Xilinx ISE 8.1i
     • Clock speed != simulation speed
       • Uses 1 execution unit to simulate 4 parallel execution units
       • Sequentially searches BlockRAM to simulate a parallel CAM
       • Ends up taking about 15.6 FPGA cycles per model cycle
     • Result: 95 MHz / 15.6 ≈ 6 MHz simulation rate
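
The arithmetic on this slide can be written out as a small sketch. It assumes that 95 is the FPGA clock in MHz (the slide gives only the number), and the function and parameter names are illustrative, not from the talk.

```python
def simulation_rate_mhz(fpga_clock_mhz: float, fpga_cycles_per_model_cycle: float) -> float:
    """Model cycles simulated per microsecond, i.e. the simulation rate in MHz.

    Each model cycle costs several FPGA cycles, so the simulation rate is the
    FPGA clock divided by the FPGA-to-model cycle ratio.
    """
    return fpga_clock_mhz / fpga_cycles_per_model_cycle


# Numbers from the slide; 95 is assumed to be the FPGA clock in MHz.
rate = simulation_rate_mhz(95.0, 15.6)
print(f"{rate:.2f} MHz")
```

The result is a little over 6 MHz, which the slide rounds to 6 MHz.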

  5. Example Target
     • Register File with 2 Read Ports, 2 Write Ports
     • Reads take zero clock cycles
     • Direct configuration onto FPGA: 9242 slices, 104 MHz

  6. Example as Performance Model
     • Simulate the circuit using synchronous BlockRAM
     • First do reads, then serialize writes
     • Only update model time when all requests are serviced
     • Results: 94 slices, 1 BlockRAM, 224 MHz
     • Simulation rate is 224 MHz / 3 ≈ 75 MHz (FPGA-to-Model Ratio of 3)
     Separated model clock from FPGA clock.
     How do we compose these modules into a correct, efficient system? Let’s examine how a software performance model does it.
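
A minimal Python sketch of the idea on this slide: reads are serviced first, writes are serialized, and model time advances only once every request for that model cycle is done. The schedule used here (one FPGA cycle for the reads, then one per write slot) is an assumption chosen to reproduce the ratio of 3; the slide does not spell out the actual schedule, and the class and method names are hypothetical.

```python
class RegFileModel:
    """Performance model of a 2-read/2-write register file on one memory."""

    def __init__(self, num_regs=32):
        self.mem = [0] * num_regs   # stands in for the BlockRAM
        self.model_cycle = 0        # model time
        self.fpga_cycles = 0        # FPGA time spent so far

    def step(self, read_addrs, writes):
        """Simulate one model cycle: up to 2 reads and up to 2 (addr, value) writes."""
        assert len(read_addrs) <= 2 and len(writes) <= 2
        # Assumed FPGA cycle 1: service the reads first, so they observe the
        # state from before this model cycle's writes.
        values = [self.mem[a] for a in read_addrs]
        self.fpga_cycles += 1
        # Assumed FPGA cycles 2 and 3: one write slot each, used or not,
        # which keeps the FPGA-to-model ratio fixed at 3.
        for slot in range(2):
            if slot < len(writes):
                addr, value = writes[slot]
                self.mem[addr] = value
            self.fpga_cycles += 1
        # Only now, with all requests serviced, does model time advance.
        self.model_cycle += 1
        return values


rf = RegFileModel()
rf.step([], [(1, 10), (2, 20)])
vals = rf.step([1, 2], [(1, 99)])
```

After the two calls, `vals` holds the values written in the first model cycle, and `rf.fpga_cycles / rf.model_cycle` is the FPGA-to-model ratio.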

  7. Time in Software Asim
     [Figure: Model with modules FET, DEC, EXE, MEM, WB connected by Ports labeled 1, 1, 1, 1, 1, 2]
     • Software has no inherent clock
     • Model time is tracked via Asim “Ports”
       • All communication goes through Ports
       • Ports have a model time latency for messages
     • Execution model: for each module in the system
       • Simulate a model cycle for that module
       • Reads all input Ports, writes all output Ports
       • Can write a special “NoMessage” value to indicate no activity
     • FPGA: Can simulate in parallel instead of sequentially
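
The execution model above can be sketched in a few lines of Python. This is an illustrative toy, not Asim code: the `Port` class, the three-stage pipeline, and the choice to realize latency by pre-filling a queue with NoMessage values are all assumptions made for the example.

```python
from collections import deque

NO_MESSAGE = None  # stands in for the special "NoMessage" value


class Port:
    """A port whose messages arrive `latency` model cycles after being written."""

    def __init__(self, latency):
        # Assumed realization: pre-fill with `latency` NoMessage entries so a
        # module that reads once and writes once per model cycle sees the delay.
        self.queue = deque([NO_MESSAGE] * latency)

    def write(self, msg):
        self.queue.append(msg)

    def read(self):
        return self.queue.popleft()


def run(num_cycles):
    fet_to_dec = Port(latency=1)
    dec_to_exe = Port(latency=1)
    seen_by_exe = []
    for cycle in range(num_cycles):
        # Sequentially simulate one model cycle for each module in turn.
        fet_to_dec.write(f"inst{cycle}")      # FET: emit one instruction
        dec_to_exe.write(fet_to_dec.read())   # DEC: forward what it received
        seen_by_exe.append((cycle, dec_to_exe.read()))  # EXE: record input
    return seen_by_exe


trace = run(4)
```

In this toy, an instruction written by FET in model cycle 0 reaches EXE in model cycle 2, after crossing two latency-1 ports; EXE sees NoMessage in the first two cycles.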

  8. Barrier Synchronization
     [Figure: Controller holding curCC, connected to modules FET, DEC, EXE, MEM, WB]
     • Controller tracks the current model clock cycle (curCC)
     • Tells all modules “begin”
     • Modules copy input, compute, write output, say “done”
     • When all are done, increment cycle count and repeat
     • More fine-grained parallelism than parallel software models
     • FPGA-to-Model Ratio: dynamic worst case
     • But what about clock rate?
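
A small Python sketch of the begin/done protocol described above, stepping in FPGA cycles. The per-module work figures are made up for illustration, and the `Module` class and `run_barrier` function are hypothetical names; the point is only that each model cycle costs as many FPGA cycles as its slowest module.

```python
class Module:
    def __init__(self, work):
        self.work = work        # FPGA cycles needed for each model cycle
        self.remaining = 0

    def begin(self, model_cycle):
        self.remaining = self.work[model_cycle]

    def tick(self):
        # One FPGA cycle of computation, if any is left.
        if self.remaining > 0:
            self.remaining -= 1

    def done(self):
        return self.remaining == 0


def run_barrier(modules, num_model_cycles):
    cur_cc = 0          # controller's current model clock cycle
    fpga_cycles = 0
    while cur_cc < num_model_cycles:
        for m in modules:           # controller tells all modules "begin"
            m.begin(cur_cc)
        while not all(m.done() for m in modules):
            for m in modules:       # modules compute in parallel
                m.tick()
            fpga_cycles += 1
        cur_cc += 1                 # all said "done": advance model time
    return fpga_cycles


# Made-up workloads: two modules, four model cycles.
modules = [Module([1, 4, 1, 1]), Module([3, 1, 1, 2])]
total = run_barrier(modules, 4)
```

With these numbers the four model cycles cost 3, 4, 1, and 2 FPGA cycles respectively, so the slowest module in each cycle sets the pace for everyone.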

  9. The problem with Barrier Sync…
     • Becomes critical path with large number of modules
     • Massive fan-in/fan-out from controller
     • Quickly becomes critical path
     • Experiment: linear topology of modules
     Could pipeline or cluster, but we can do better.

  10. A-Ports: Asim Ports on an FPGA
      [Figure: modules FET, DEC, EXE, MEM, WB connected by Ports labeled 1, 1, 1, 1, 1, 2]
      • Scalable: distributed control, no combinational paths, no counters
      • A Port with latency n starts with n NoMessage values in it
      • Each module may proceed in a “dataflow” manner
        • Start a cycle whenever all inputs are available
        • Compute for any number of FPGA cycles
        • Stall if output ports are full
      • A module can derive the current model cycle by counting
      • Adjacent modules may not be on the same model cycle…
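
The firing rule on this slide can be illustrated with a toy Python sketch. It is not the authors' hardware design: the `APort` and `Module` classes, the capacity parameter, and the requirement that capacity exceed latency are assumptions for this example, and multi-cycle computation is abstracted into a single call.

```python
from collections import deque

NO_MESSAGE = None  # stands in for the "NoMessage" value


class APort:
    def __init__(self, latency, capacity):
        assert capacity > latency   # assumed: leave room beyond the pre-fill
        # A port with latency n starts with n NoMessage values in it.
        self.fifo = deque([NO_MESSAGE] * latency)
        self.capacity = capacity

    def can_read(self):
        return len(self.fifo) > 0

    def can_write(self):
        return len(self.fifo) < self.capacity

    def read(self):
        return self.fifo.popleft()

    def write(self, msg):
        self.fifo.append(msg)


class Module:
    def __init__(self, inputs, outputs, compute):
        self.inputs, self.outputs, self.compute = inputs, outputs, compute
        self.model_cycle = 0    # derived locally by counting completed cycles

    def try_step(self):
        """Run one model cycle if all inputs are available and no output is full."""
        if not all(p.can_read() for p in self.inputs):
            return False
        if not all(p.can_write() for p in self.outputs):
            return False        # stall: an output port is full
        msgs = [p.read() for p in self.inputs]
        result = self.compute(self.model_cycle, msgs)
        for p in self.outputs:
            p.write(result)
        self.model_cycle += 1
        return True


port = APort(latency=1, capacity=4)
received = []
producer = Module([], [port], lambda cc, msgs: cc)
consumer = Module([port], [], lambda cc, msgs: received.append((cc, msgs[0])))

producer_steps = [producer.try_step() for _ in range(4)]  # runs ahead, then stalls
consumer_steps = [consumer.try_step() for _ in range(5)]  # drains, then waits
```

Here the producer advances three model cycles before the consumer has started any, then stalls on the full port. The consumer still receives each value exactly one model cycle after it was produced, with no central controller involved.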

  11. Modules can “slip” in model time
      [Figure: modules FET, DEC, EXE, MEM, WB connected by Ports labeled 1, 1, 1, 1, 1, 2]
      • Observation: when a port with latency n has n messages in it
        • Producer and Consumer are on the same model cycle
      • Producers can run ahead, prebuffering data
      • Consumers can run ahead, draining data
      • Still works with backwards paths
      • With proper buffering we can get the average number of FPGA cycles per model cycle
        • Much better than worst case a la Barrier
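
To make the average-versus-worst-case point concrete, here is a hedged Python sketch comparing the two schemes on a made-up workload. It assumes a two-module feed-forward pipeline, a latency-1 port with unbounded buffering, and zero-cost communication; the function names and the numbers are illustrative only.

```python
def barrier_fpga_cycles(work_p, work_c):
    """Barrier sync: every model cycle costs as much as its slowest module."""
    return sum(max(p, c) for p, c in zip(work_p, work_c))


def dataflow_fpga_cycles(work_p, work_c):
    """Dataflow with slip: each module starts as soon as its input is ready.

    Assumes one producer feeding one consumer through a latency-1 port with
    unbounded buffering, so the producer never stalls.
    """
    finish_p = 0    # FPGA time at which the producer finished its last cycle
    finish_c = 0    # same for the consumer
    msg_ready = 0   # the pre-filled NoMessage is available at time 0
    for p, c in zip(work_p, work_c):
        # Consumer cycle k consumes the message from producer cycle k-1.
        start_c = max(finish_c, msg_ready)
        finish_c = start_c + c
        finish_p += p
        msg_ready = finish_p
    return max(finish_p, finish_c)


# Made-up FPGA cycles per model cycle for a producer and a consumer.
work_p = [1, 4, 1, 1]
work_c = [3, 1, 1, 2]
barrier_total = barrier_fpga_cycles(work_p, work_c)
dataflow_total = dataflow_fpga_cycles(work_p, work_c)
```

On this workload the dataflow schedule finishes in fewer FPGA cycles than the barrier schedule, because a slow cycle in one module overlaps with work the other module can do while slipped in model time.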

  12. Example: MIPS R10K-like Processor • 4-Way Superscalar, Out-of-order Issue

  13. Results: OOO Simulator

  14. Takeaways
      • Performance modeling on FPGAs shows great potential
        • Cycle-accurate simulation in MHz vs. kHz
      • A-Ports:
        • Distributed, efficient tracking of time that scales
        • Manages dynamic “slip” in model time
        • Dynamic average case instead of worst case
      • In the paper: a technique to resynchronize modules to the same model clock cycle
      • Underway: effort to model realistic multicore systems
      • Future work: combine with Chung-style virtualization [FPGA 2008] to eliminate pipeline stalls
