A-Ports: A Distributed, Efficient Technique for Performance Models on FPGAs
Michael Pellauer†, Muralidaran Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡
†MIT Computer Science and AI Lab, Computation Structures Group: {pellauer, vmurali, arvind}@csail.mit.edu
‡Intel VSSAD Group: {michael.adler, joel.emer}@intel.com
Introduction
• Growing interest in using FPGAs for performance models
  • HAsim: re-implementation of the Intel Asim simulator on an FPGA (this talk)
  • Other projects: Liberty, UT-FAST, and the performance modeling efforts in RAMP
• The modern circuit design flow: Specify System Requirements → Explore Architecture Alternatives → Write Circuit RTL → Verify → Physical → Manufacture
  • FPGAs are used here for architecture exploration, as performance models
  • FPGAs are also used here for verification, as prototypes and an alternative to ASIC tapeout
Performance Models
• Created early in the design process to drive architectural exploration and feasibility analysis
  • Emphasize parameterization and ease of change
  • Show how many clock cycles an operation takes, not what the clock cycle time is
• Traditionally homebrewed C or SystemC: synchronous software simulation
• The software performance modeling crisis: a three-way tradeoff between simulation speed, accuracy, and development time
• FPGAs can help!
  • Performance models have a high degree of parallelism within a model clock cycle; it's all modeling gates
  • But many interesting circuits have no good implementation on LUT-based FPGAs: CAMs, many-ported register files, nested MUXes, etc.
• The solution:
  • Configure the FPGA into a circuit simulator
  • Virtualize the FPGA clock
Performance Model on FPGA
• Implemented in Bluespec SystemVerilog
• Xilinx Virtex-II Pro 70, synthesized with Xilinx ISE 8.1i
• Clock speed != simulation speed
  • Uses 1 physical execution unit to simulate 4 parallel execution units
  • Sequentially searches a BlockRAM to simulate a parallel CAM
  • Ends up taking about 15.6 FPGA cycles per model cycle on average
• Result: 95 MHz / 15.6 ≈ 6 MHz simulation rate
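The sequential-search trick can be sketched in a few lines. Below is an illustrative Python behavioral model (the actual design is Bluespec hardware); SimulatedCAM and its methods are hypothetical names, not code from HAsim. It shows why one model cycle costs a variable number of FPGA cycles.

```python
# Behavioral sketch: a parallel CAM lookup is simulated by scanning a
# BlockRAM-like array one entry per FPGA cycle, so each model cycle
# costs a data-dependent number of FPGA cycles.

class SimulatedCAM:
    """Simulates an N-entry content-addressable memory with a plain array
    ("BlockRAM") searched sequentially, one entry per FPGA cycle."""

    def __init__(self, entries):
        self.mem = list(entries)      # tag stored at each index
        self.fpga_cycles = 0          # FPGA cycles spent so far

    def lookup(self, tag):
        """One model-cycle CAM lookup; returns the matching index or None."""
        for idx, stored in enumerate(self.mem):
            self.fpga_cycles += 1     # one FPGA cycle per entry scanned
            if stored == tag:
                return idx
        return None

cam = SimulatedCAM(["a", "b", "c", "d"])
cam.lookup("c")
print(cam.fpga_cycles)                # 3 FPGA cycles for this model cycle
```

Because the scan can exit early, the cost per model cycle varies, which is where a dynamic average such as 15.6 FPGA cycles per model cycle comes from.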
Example Target • Register File with 2 Read Ports, 2 Write Ports • Reads take zero clock cycles • Direct configuration onto FPGA: 9242 slices, 104 MHz
Example as Performance Model
• Simulate the circuit using a synchronous BlockRAM
  • First do the reads, then serialize the writes
  • Only update model time when all requests have been serviced
• Results: 94 slices, 1 BlockRAM, 224 MHz
  • FPGA-to-Model Ratio (FMR) of 3, so the simulation rate is 224 / 3 ≈ 75 MHz
• We have separated the model clock from the FPGA clock
• How do we compose such modules into a correct, efficient system?
  • Let's examine how a software performance model does it
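A behavioral sketch of this simulation strategy, in Python rather than the paper's Bluespec, with hypothetical names (RegFileModel, simulate_cycle). It shows why the FPGA-to-Model Ratio comes out to 3 for two reads and two writes.

```python
# Sketch of simulating a 2-read/2-write register file with one BlockRAM-like
# memory: reads are serviced first, writes are serialized one per FPGA cycle,
# and model time advances only after every request has been handled.

class RegFileModel:
    def __init__(self, size=32):
        self.mem = [0] * size            # stands in for the BlockRAM
        self.model_cycle = 0
        self.fpga_cycles = 0

    def simulate_cycle(self, read_addrs, writes):
        """Simulate one model cycle: two reads, then up to two writes."""
        results = [self.mem[a] for a in read_addrs]   # reads done first
        self.fpga_cycles += 1                         # FPGA cycle for the reads
        for addr, val in writes:                      # writes serialized,
            self.mem[addr] = val                      # one per FPGA cycle
            self.fpga_cycles += 1
        self.model_cycle += 1            # advance model time only when done
        return results

rf = RegFileModel()
rf.simulate_cycle(read_addrs=[1, 2], writes=[(1, 10), (2, 20)])
print(rf.fpga_cycles / rf.model_cycle)   # FPGA-to-Model Ratio: 3.0
```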
Time in a Software Asim Model
[Figure: FET → DEC → EXE → MEM → WB pipeline connected by Ports, each labeled with its model-cycle latency (latencies of 1 and 2)]
• Software has no inherent clock: model time is tracked via Asim "Ports"
• All communication between modules goes through Ports
• Each Port has a model-time latency for its messages
• Execution model: for each module in the system, simulate one model cycle for that module
  • Read all input Ports, write all output Ports
  • A special "NoMessage" value can be written to indicate no activity
• On an FPGA we can simulate the modules in parallel instead of sequentially
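A minimal Python sketch of this software execution model, assuming hypothetical names (SWPort, simulate_one_cycle) rather than Asim's real API:

```python
# Sketch of software port-based timing: a Port carries messages with a
# model-cycle latency, and each module is simulated once per model cycle by
# reading all of its input Ports and writing all of its output Ports.

from collections import deque

NO_MESSAGE = None                     # the special "no activity" value

class SWPort:
    def __init__(self, latency):
        # Pre-fill with `latency` NoMessage values, so a read on model
        # cycle t returns what was written on cycle t - latency.
        self.buf = deque([NO_MESSAGE] * latency)

    def write(self, msg):
        self.buf.append(msg)

    def read(self):
        return self.buf.popleft()

def simulate(modules, num_cycles):
    """Sequential software execution: one pass over all modules per model
    cycle; each module reads its input Ports and writes its output Ports."""
    for cycle in range(num_cycles):
        for m in modules:
            m.simulate_one_cycle(cycle)   # module reads/writes its own Ports
```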
Barrier Synchronization
[Figure: a central Controller holding curCC, connected to FET, DEC, EXE, MEM, WB]
• A central controller tracks the current model clock cycle (CC)
  • It tells all modules to "begin"
  • Each module copies its inputs, computes, writes its outputs, and signals "done"
  • When all modules are done, the controller increments the cycle count and repeats
• More fine-grained parallelism than parallel software models
• FPGA-to-Model Ratio: dynamic worst case, since every model cycle waits for the slowest module
• But what about clock rate?
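A sketch of barrier-synchronized execution in Python (illustrative only; on the FPGA the modules compute concurrently and the controller is a hardware circuit):

```python
# Sketch: the controller broadcasts "begin", waits for every module to report
# "done", then advances the model clock cycle. The FPGA cycles spent per model
# cycle are therefore the worst case over all modules on that cycle.

def barrier_simulate(modules, num_model_cycles):
    cur_cc = 0                              # current model clock cycle
    fpga_cycles = 0
    for _ in range(num_model_cycles):
        # "begin": each module copies inputs, computes, writes outputs, and
        # returns how many FPGA cycles its computation took this model cycle.
        durations = [m.simulate_one_cycle(cur_cc) for m in modules]
        # The controller advances only once *all* modules say "done", so the
        # cost of this model cycle is the slowest module's cost.
        fpga_cycles += max(durations)
        cur_cc += 1
    return fpga_cycles / num_model_cycles   # dynamic worst-case FMR
```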
The Problem with Barrier Sync
• Massive fan-in/fan-out from the controller
  • Quickly becomes the critical path as the number of modules grows
• Experiment: linear topology of modules
• Could pipeline or cluster the controller, but we can do better
A-Ports: Asim Ports on an FPGA
[Figure: FET → DEC → EXE → MEM → WB connected by A-Ports labeled with their model-cycle latencies (latencies of 1 and 2)]
• Scalable: distributed control, no global combinational paths, no counters
• A Port with latency n starts with n NoMessage tokens in it
• Each module proceeds in a "dataflow" manner
  • Starts a model cycle whenever all of its inputs are available
  • Computes for any number of FPGA cycles
  • Stalls if its output ports are full
• A module can derive the current model cycle simply by counting the cycles it has simulated
• Adjacent modules may not be on the same model cycle…
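A behavioral sketch of the A-Port discipline in Python (the real implementation is Bluespec hardware); APort, Module, and try_fire are hypothetical names used only for illustration:

```python
# Sketch: a latency-n A-Port is a bounded FIFO pre-filled with n NoMessage
# tokens. A module fires one model cycle whenever all of its input ports are
# non-empty and all of its output ports have space.

from collections import deque

NO_MESSAGE = None

class APort:
    def __init__(self, latency, buffering=2):
        self.buf = deque([NO_MESSAGE] * latency)
        self.capacity = latency + buffering   # extra slots allow "slip"

    def can_send(self):
        return len(self.buf) < self.capacity

    def can_receive(self):
        return len(self.buf) > 0

    def send(self, msg):
        self.buf.append(msg)

    def receive(self):
        return self.buf.popleft()

class Module:
    def __init__(self, inputs, outputs, compute):
        # compute: user-supplied function mapping input messages to outputs
        self.inputs, self.outputs, self.compute = inputs, outputs, compute
        self.model_cycle = 0                  # derived locally by counting

    def try_fire(self):
        """Fire one model cycle if all inputs are ready and no output is full."""
        if all(p.can_receive() for p in self.inputs) and \
           all(p.can_send() for p in self.outputs):
            msgs = [p.receive() for p in self.inputs]
            for port, out in zip(self.outputs, self.compute(msgs)):
                port.send(out)
            self.model_cycle += 1
            return True
        return False
```

Note that no global controller appears anywhere: each module fires whenever its local port conditions hold, which is what makes the scheme distributed and scalable.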
Modules Can "Slip" in Model Time
[Figure: the same FET → DEC → EXE → MEM → WB pipeline with A-Port latencies]
• Observation: when a latency-n port holds exactly n messages, its producer and consumer are on the same model cycle
  • Producers can run ahead, prebuffering data
  • Consumers can run ahead, draining buffered data
• Still works with backwards (feedback) paths
• With proper buffering we pay the average number of FPGA cycles per model cycle
• Much better than the worst case à la barrier synchronization
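The relationship between port occupancy and slip can be written down directly; this is an inference from the observation above, not code from the paper:

```python
# For a latency-n A-Port, the producer's lead over the consumer in model time
# is the port's occupancy minus n (positive: producer ahead, negative:
# consumer ahead, zero: same model cycle).

def slip(port_occupancy, port_latency):
    """Model cycles by which the producer leads the consumer on this port."""
    return port_occupancy - port_latency

print(slip(port_occupancy=1, port_latency=1))   # 0: same model cycle
print(slip(port_occupancy=3, port_latency=1))   # 2: producer ran ahead
print(slip(port_occupancy=0, port_latency=1))   # -1: consumer ran ahead
```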
Example: MIPS R10K-like Processor • 4-Way Superscalar, Out-of-order Issue
Takeaways
• Performance modeling on FPGAs shows great potential
  • Cycle-accurate simulation in MHz instead of KHz
• A-Ports:
  • Distributed, efficient tracking of model time that scales
  • Manages dynamic "slip" in model time
  • Dynamic average case instead of worst case
• In the paper: a technique to resynchronize modules to the same model clock cycle
• Underway: an effort to model realistic multicore systems
• Future work: combine with Chung-style virtualization [FPGA 2008] to eliminate pipeline stalls