A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs
Michael Pellauer†, Muralidaran Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡
† MIT Computer Science and AI Lab, Computation Structures Group, {pellauer, vmurali, arvind}@csail.mit.edu
‡ Intel VSSAD Group, {michael.adler, joel.emer}@intel.com
Introduction
• The modern circuit design flow: Specify System Requirements → Explore Architecture Alternatives → Write Circuit RTL → Verify → Physical Manufacture
• FPGAs are used in this flow:
  • As an alternative to ASIC tapeout
  • To create a prototype for verification
• There is interest in using FPGAs for performance models:
  • HAsim: re-implement the Intel Asim simulator on an FPGA (this talk)
  • Other projects: Liberty, UT-FAST
  • Performance modeling efforts in RAMP
Performance Models
• Created early in the design process
• Drive architectural exploration, feasibility analysis
  • Parameterization, ease of change
• Show how many clock cycles an operation takes
• Do not show what the clock cycle time is
• Homebrewed C or SystemC: synchronous simulation
• The software performance modeling crisis (figure labels: Simulation speed, Accuracy, Development Time)

FPGAs can help!
• Performance models have a high degree of parallelism within a model clock cycle
• It’s all modeling gates
• But many interesting circuits have no good implementation on LUT-based FPGAs
  • CAMs, many-ported register files, nested MUXes, etc.
• The solution:
  • Configure the FPGA into a circuit simulator
  • Virtualize the FPGA clock
Performance Model on FPGA
• Implemented using Bluespec SystemVerilog
• Xilinx Virtex-II Pro 70 using Xilinx ISE 8.1i
• Clock speed != simulation speed
  • Uses 1 execution unit to simulate 4 parallel execution units
  • Sequentially searches BlockRAM to simulate a parallel CAM
  • Ends up taking about 15.6 FPGA cycles per model cycle
• Result: 95 MHz / 15.6 ≈ 6 MHz simulation rate
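As a rough illustration of why clock speed and simulation speed differ, the sketch below searches a list one entry per FPGA cycle to stand in for a parallel CAM lookup. The function names and the one-read-per-cycle cost are assumptions made for this example, not details taken from the HAsim implementation.

```python
def cam_lookup_sequential(entries, key):
    """Search a list (stand-in for BlockRAM) one entry per FPGA cycle.

    Returns (matching_index_or_None, fpga_cycles_spent).
    A real CAM would compare all entries in parallel in one cycle.
    """
    fpga_cycles = 0
    for index, tag in enumerate(entries):
        fpga_cycles += 1  # assumed cost: one BlockRAM read per FPGA cycle
        if tag == key:
            return index, fpga_cycles
    return None, fpga_cycles


def simulation_rate_mhz(fpga_clock_mhz, fpga_cycles_per_model_cycle):
    """Model cycles simulated per microsecond of wall-clock time."""
    return fpga_clock_mhz / fpga_cycles_per_model_cycle


if __name__ == "__main__":
    print(cam_lookup_sequential([10, 20, 30, 40], 30))
    print(simulation_rate_mhz(95, 15.6))
```

The lookup still counts as a single model-cycle operation; only the number of FPGA cycles spent on it changes.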
Example Target
• Register file with 2 read ports, 2 write ports
• Reads take zero clock cycles
• Direct configuration onto FPGA: 9242 slices, 104 MHz
Example as Performance Model
• Simulate the circuit using synchronous BlockRAM
• First do reads, then serialize writes
• Only update model time when all requests are serviced
• Results: 94 slices, 1 BlockRAM, 224 MHz
• Simulation rate is 224 MHz / 3 ≈ 75 MHz (FPGA-to-Model Ratio of 3)

Separated model clock from FPGA clock.
How do we compose these modules into a correct, efficient system?
Let’s examine how a software performance model does it.
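The register-file example can be sketched in software as follows. This is a minimal sketch: the class name is hypothetical, and the split of the three FPGA cycles into one cycle for both reads plus one cycle per write port is an assumption chosen to match the FPGA-to-Model Ratio of 3 on the slide.

```python
class RegisterFileModel:
    """Models a 2-read, 2-write register file on top of one memory array."""

    def __init__(self, num_regs):
        self.mem = [0] * num_regs  # stand-in for the BlockRAM
        self.fpga_cycles = 0
        self.model_cycles = 0

    def model_cycle(self, read_addrs, writes):
        """read_addrs: two addresses; writes: up to two (addr, value) pairs."""
        # Assumed FPGA cycle 1: service both reads. They observe the state
        # from before this model cycle's writes.
        results = [self.mem[addr] for addr in read_addrs]
        self.fpga_cycles += 1
        # Assumed FPGA cycles 2 and 3: serialize the two write ports.
        for slot in range(2):
            if slot < len(writes):
                addr, value = writes[slot]
                self.mem[addr] = value
            self.fpga_cycles += 1
        # Model time advances only once every request has been serviced.
        self.model_cycles += 1
        return results


if __name__ == "__main__":
    rf = RegisterFileModel(8)
    print(rf.model_cycle([1, 2], [(1, 5), (2, 7)]))
    print(rf.model_cycle([1, 2], []))
    print(rf.fpga_cycles / rf.model_cycles)
```

The point of the design is that model time is a counter updated by the model itself, so it is free to advance slower than the FPGA clock.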
Time in Software Asim
[Figure: model pipeline FET, DEC, EXE, MEM, WB; port labels 1, 1, 1, 1, 1, 2]
• Software has no inherent clock
• Model time is tracked via Asim “Ports”
  • All communication goes through Ports
  • Ports have a model time latency for messages
• Execution model: for each module in the system
  • Simulate a model cycle for that module
  • Read all input Ports, write all output Ports
  • Can write a special “NoMessage” value to indicate no activity
• FPGA: can simulate in parallel instead of sequentially
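The execution model above can be sketched as a sequential loop over modules. This is an illustrative sketch rather than Asim's actual API: `Port`, `run_pipeline`, and the token-passing stages are hypothetical, every port is assumed to have latency 1, and `None` stands in for the NoMessage value.

```python
from collections import deque

NO_MESSAGE = None  # stand-in for the special "NoMessage" value


class Port:
    """A message written in model cycle t is read in model cycle t + latency."""

    def __init__(self, latency):
        assert latency >= 1  # this sketch assumes at least one cycle of latency
        self.queue = deque([NO_MESSAGE] * latency)

    def write(self, message):
        self.queue.append(message)

    def read(self):
        return self.queue.popleft()


def run_pipeline(num_model_cycles):
    stages = ["FET", "DEC", "EXE", "MEM", "WB"]
    ports = [Port(latency=1) for _ in range(len(stages) - 1)]
    arrivals_at_wb = []
    for cycle in range(num_model_cycles):
        # Software simulates the modules one after another within a model cycle.
        for i in range(len(stages)):
            # Read the input port (FET has none and creates a token instead).
            message = cycle if i == 0 else ports[i - 1].read()
            if i < len(stages) - 1:
                ports[i].write(message)  # write the output port
            else:
                arrivals_at_wb.append((cycle, message))
    return arrivals_at_wb


if __name__ == "__main__":
    for cycle, message in run_pipeline(6):
        print(cycle, message)
```

Because every module reads and writes each of its ports exactly once per model cycle, the port contents alone carry the notion of model time; no global clock variable is consulted by the stages.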
Barrier Synchronization
[Figure: Controller holding curCC, connected to FET, DEC, EXE, MEM, WB]
• Controller tracks the current model clock cycle (curCC)
• Tells all modules “begin”
• Modules copy input, compute, write output, say “done”
• When all are done, increment the cycle count and repeat
• More fine-grained parallelism than parallel software models
• FPGA-to-Model Ratio: dynamic worst case
• But what about clock rate?
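The "dynamic worst case" ratio can be made concrete with a small cost model. In this sketch the per-module cost table is a hypothetical input, and the begin/done handshake itself is assumed to cost nothing, which is optimistic for real hardware.

```python
def barrier_total_fpga_cycles(costs_per_module):
    """costs_per_module[m][c]: FPGA cycles module m needs for model cycle c."""
    num_model_cycles = len(costs_per_module[0])
    total = 0
    for c in range(num_model_cycles):
        # The controller cannot increment curCC until the slowest
        # module in this model cycle has said "done".
        total += max(costs[c] for costs in costs_per_module)
    return total


def fpga_to_model_ratio(costs_per_module):
    """Average FPGA cycles spent per model cycle under barrier sync."""
    return barrier_total_fpga_cycles(costs_per_module) / len(costs_per_module[0])


if __name__ == "__main__":
    costs = [[1, 3, 1, 3],   # hypothetical module A
             [3, 1, 3, 1]]   # hypothetical module B
    print(barrier_total_fpga_cycles(costs))
    print(fpga_to_model_ratio(costs))
```

Every model cycle pays for its slowest module, even when each module is fast on average.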
The problem with Barrier Sync …
• Massive fan-in/fan-out from the controller
• Quickly becomes the critical path with a large number of modules
• Experiment: linear topology of modules
Could pipeline or cluster, but we can do better
A-Ports: Asim Ports on an FPGA
[Figure: model pipeline FET, DEC, EXE, MEM, WB; port labels 1, 1, 1, 1, 1, 2]
• Scalable: distributed control, no combinational paths, no counters
• A port with latency n starts with n NoMessages in it
• Each module may proceed in a “dataflow” manner
  • Start a cycle whenever all inputs are available
  • Compute for any number of FPGA cycles
  • Stall if output ports are full
• A module can derive the current model cycle by counting
• Adjacent modules may not be on the same model cycle…
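The firing rule above can be sketched as a small FPGA-cycle-level simulation. This is a software illustration only, not the Bluespec implementation: `APort`, `Module`, `run`, the `capacity` parameter, and the per-cycle cost lists are all hypothetical, and modules are stepped in list order within an FPGA cycle, so a message enqueued by an earlier module is visible to a later one in the same cycle.

```python
from collections import deque

NO_MESSAGE = None  # stand-in for the "NoMessage" value


class APort:
    """FIFO that starts with `latency` NoMessage entries."""

    def __init__(self, latency, capacity):
        self.fifo = deque([NO_MESSAGE] * latency)
        self.capacity = capacity  # finite buffering (assumed parameter)

    def not_empty(self):
        return len(self.fifo) > 0

    def not_full(self):
        return len(self.fifo) < self.capacity


class Module:
    def __init__(self, name, inputs, outputs, costs):
        self.name, self.inputs, self.outputs = name, inputs, outputs
        self.costs = costs      # FPGA cycles needed for each model cycle
        self.model_cycle = 0    # local model time, derived by counting
        self.busy = 0           # FPGA cycles left in the current model cycle
        self.holding = False    # True while a model cycle is in flight

    def finished(self):
        return self.model_cycle == len(self.costs)

    def step(self):
        """Advance this module by one FPGA cycle."""
        if not self.holding:
            if self.finished() or not all(p.not_empty() for p in self.inputs):
                return  # nothing left to do, or waiting for inputs
            for p in self.inputs:
                p.fifo.popleft()  # consume one message per input port
            self.busy = self.costs[self.model_cycle]
            self.holding = True
        if self.busy > 0:
            self.busy -= 1
        if self.busy == 0 and all(p.not_full() for p in self.outputs):
            for p in self.outputs:
                p.fifo.append((self.name, self.model_cycle))
            self.holding = False
            self.model_cycle += 1
        # Otherwise: still computing, or stalled on a full output port.


def run(modules, max_fpga_cycles=10_000):
    """Step all modules until each has finished; return FPGA cycles used."""
    fpga_cycles = 0
    while not all(m.finished() for m in modules):
        if fpga_cycles >= max_fpga_cycles:
            raise RuntimeError("no progress: possible deadlock")
        for m in modules:
            m.step()
        fpga_cycles += 1
    return fpga_cycles


if __name__ == "__main__":
    port = APort(latency=1, capacity=4)
    a = Module("A", [], [port], [1, 3, 1, 3])
    b = Module("B", [port], [], [3, 1, 3, 1])
    print(run([a, b]))
```

With these two hypothetical cost lists the run completes four model cycles in 8 FPGA cycles. Summing the slower module's cost in each model cycle, as a barrier would, gives 3 + 3 + 3 + 3 = 12. There is no central controller in the loop: each module decides locally from the state of its own ports.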
Modules can “slip” in model time
[Figure: model pipeline FET, DEC, EXE, MEM, WB; port labels 1, 1, 1, 1, 1, 2]
• Observation: when a port of latency n has n messages in it, the producer and consumer are on the same model cycle
• Producers can run ahead, prebuffering data
• Consumers can run ahead, draining data
• Still works with backwards paths
• With proper buffering we get the average number of FPGA cycles per model cycle
  • Much better than the worst case à la Barrier
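The observation about port occupancy can be checked with a small counting sketch. It assumes one producer and one consumer per port, each writing or reading exactly one message per model cycle; `CountingPort` and its methods are hypothetical names used only for illustration.

```python
from collections import deque


class CountingPort:
    """Port of a given latency that also counts each side's model cycles."""

    def __init__(self, latency):
        self.latency = latency
        self.fifo = deque([None] * latency)  # starts with n NoMessages
        self.producer_cycle = 0
        self.consumer_cycle = 0

    def produce(self, message):
        self.fifo.append(message)
        self.producer_cycle += 1  # producer finished one model cycle

    def consume(self):
        message = self.fifo.popleft()
        self.consumer_cycle += 1  # consumer finished one model cycle
        return message

    def slip(self):
        """Producer's model cycle minus consumer's, from occupancy alone."""
        return len(self.fifo) - self.latency


if __name__ == "__main__":
    port = CountingPort(latency=1)
    print(port.slip())            # same model cycle at start
    for i in range(3):
        port.produce(i)           # producer runs ahead, prebuffering
    print(port.slip())
    for _ in range(4):
        port.consume()            # consumer drains and runs ahead
    print(port.slip())
```

Occupancy equals the latency plus writes minus reads, so the difference between occupancy and latency is the slip. It is zero exactly when the port holds n messages.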
Example: MIPS R10K-like Processor
• 4-way superscalar, out-of-order issue
Takeaways
• Performance modeling on FPGAs shows great potential
  • Cycle-accurate simulation in MHz vs. kHz
• A-Ports:
  • Distributed, efficient tracking of time that scales
  • Manages dynamic “slip” in model time
  • Dynamic average case instead of worst case
• In paper: a technique to resynchronize modules to the same model clock cycle
• Underway: effort to model realistic multicore systems
• Future work: combine with Chung-style virtualization [FPGA 2008] to eliminate pipeline stalls