The Microarchitecture of FPGA-Based Soft Processors

Peter Yiannacouras, Jonathan Rose, Greg Steffan
University of Toronto, Electrical and Computer Engineering

Processors and FPGAs
• Processors are present in many digital systems
• Soft processors are implemented in the FPGA fabric
[Figure: an FPGA containing a processor alongside custom logic]
Our goal is to study the architecture of soft processors

Motivation for understanding soft processor architecture
• Soft processors are popular
  • 16% of FPGA designs use a soft processor (FPGA Journal, November 2003)
  • This number has increased and will continue to increase
• Soft processors are end-user customizable
  • Application-specific architectural tradeoffs
  • Can be tuned by designers

Don’t we already understand processor architecture?
• Not accurately/completely
  • Accurate cycle-to-cycle behaviour
  • Estimated area/power
  • No clock frequency impact
• Not in FPGA domain
  • Lookup tables vs transistors
  • Dedicated RAMs and multipliers are fast
Must revisit processor architecture in FPGA context

Research Goals
• Generate soft processor implementations
  • System for generating RTL
• Develop measurement methodology
  • Metrics for comparing soft processors
• Develop understanding of architectural tradeoffs
  • Analyze area/performance/power space
Explore soft processor architecture experimentally

Soft Processor Rapid Exploration Environment (SPREE)
[Figure: SPREE takes an ISA description and a datapath description as inputs and produces RTL]

Input: Instruction Set Architecture (ISA) Description
• Graph of Generic Operations (GENOPs)
• Edges indicate flow of data
• Example: MIPS ADD (add rd, rs, rt) uses the GENOPs FETCH, RFREAD (twice), ADD, RFWRITE
ISA currently fixed (subset of MIPS I)
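The GENOP graph for the ADD example might be encoded along these lines. This is only a sketch: the node names come from the slide, but the dictionary layout, the exact edges, and the `topological_order` helper are assumptions made for illustration, not SPREE's actual input format (which is not shown here).

```python
from collections import defaultdict

# Hypothetical encoding: node id -> GENOP type. Two RFREAD nodes read rs and rt.
ADD_NODES = {
    "fetch": "FETCH",
    "read_rs": "RFREAD",
    "read_rt": "RFREAD",
    "add": "ADD",
    "write_rd": "RFWRITE",
}

# Directed edges (producer, consumer) indicate the flow of data (assumed wiring).
ADD_EDGES = [
    ("fetch", "read_rs"),
    ("fetch", "read_rt"),
    ("read_rs", "add"),
    ("read_rt", "add"),
    ("add", "write_rd"),
]

def topological_order(nodes, edges):
    """Return node ids so every producer precedes its consumers (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = sorted(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order

order = topological_order(ADD_NODES, ADD_EDGES)
print([ADD_NODES[n] for n in order])
# prints ['FETCH', 'RFREAD', 'RFREAD', 'ADD', 'RFWRITE']
```

Ordering the nodes by data dependence is one simple way to see which operations must complete before others can start.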

Input: Datapath Description
• Interconnection of hand-coded components
  • Allows efficient synthesis
• Described using C++
• SPREE Component Library: Ifetch, Reg File, Mul, Data Mem, Shifter, ALU, Write Back
Limited to simple in-order issue pipelines
[Figure: example datapath connecting Ifetch, Reg File, Mul, ALU, Data Mem, and Write Back]

Step 1. ISA vs Datapath Verification
• Components described using GENOPs
• Verify the ISA's GENOP graphs against the datapath
[Figure: the FETCH, RFREAD, RFREAD, ADD, RFWRITE graph checked against a datapath of Ifetch, Reg File, Mul, ALU, Data Mem, and Write Back]
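One way to picture this verification step is sketched below: every data-flow edge in an instruction's GENOP graph must be carried by some pair of connected components that implement those GENOPs. The component names, the link set, and the function names are hypothetical, and the real SPREE check may work quite differently.

```python
# Hypothetical component library: which GENOPs each datapath component implements.
COMPONENT_GENOPS = {
    "ifetch": {"FETCH"},
    "reg_file": {"RFREAD", "RFWRITE"},
    "alu": {"ADD"},
    "mul": {"MUL"},
}

# Datapath as directed component-to-component connections (assumed layout).
DATAPATH_LINKS = {
    ("ifetch", "reg_file"),
    ("reg_file", "alu"),
    ("reg_file", "mul"),
    ("alu", "reg_file"),
    ("mul", "reg_file"),
}

# ISA: each instruction is a list of GENOP-level data-flow edges.
ISA = {
    "add": [("FETCH", "RFREAD"), ("RFREAD", "ADD"), ("ADD", "RFWRITE")],
    "mul": [("FETCH", "RFREAD"), ("RFREAD", "MUL"), ("MUL", "RFWRITE")],
}

def providers(genop, components):
    """Names of the components that implement the given GENOP."""
    return {name for name, ops in components.items() if genop in ops}

def unsupported_edges(isa, components, links):
    """List (instruction, edge) pairs that no pair of connected components can carry."""
    missing = []
    for instr, edges in isa.items():
        for src_op, dst_op in edges:
            ok = any(
                (s, d) in links
                for s in providers(src_op, components)
                for d in providers(dst_op, components)
            )
            if not ok:
                missing.append((instr, (src_op, dst_op)))
    return missing

# Full datapath: nothing is missing.
print(unsupported_edges(ISA, COMPONENT_GENOPS, DATAPATH_LINKS))

# Same datapath without a multiplier: the mul instruction is no longer supported.
no_mul = {k: v for k, v in COMPONENT_GENOPS.items() if k != "mul"}
print(unsupported_edges(ISA, no_mul, DATAPATH_LINKS))
```

With the multiplier removed, the two edges of the `mul` instruction that touch the MUL GENOP are reported, which is the kind of mismatch a verification pass would be expected to flag.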

Step 2. Datapath Instantiation
• Multiplexer insertion
• Unused connection/component removal
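The two bullets above can be illustrated with a small sketch: connections to components that no instruction uses are dropped, and any input port left with more than one driver gets a multiplexer. The connection tuples, port names, and the `instantiate` function are assumptions for illustration only.

```python
def instantiate(connections, used_components):
    """connections: list of (source_component, dest_component, dest_port).
    used_components: components that some instruction actually exercises.
    Returns (kept connections, {(dest, port): [sources]} needing a multiplexer)."""
    # 1. Drop connections touching a component that no instruction uses.
    kept = [c for c in connections
            if c[0] in used_components and c[1] in used_components]
    # 2. Group the remaining drivers by destination port.
    drivers = {}
    for src, dst, port in kept:
        drivers.setdefault((dst, port), []).append(src)
    # 3. Any port with more than one driver needs a multiplexer in front of it.
    muxes = {port: srcs for port, srcs in drivers.items() if len(srcs) > 1}
    return kept, muxes

connections = [
    ("reg_file", "alu", "a"),
    ("reg_file", "alu", "b"),
    ("ifetch", "alu", "b"),            # immediate operand
    ("alu", "reg_file", "wdata"),
    ("data_mem", "reg_file", "wdata"),
    ("mul", "reg_file", "wdata"),
    ("reg_file", "mul", "a"),
]
used = {"ifetch", "reg_file", "alu", "data_mem"}   # assume no multiply instructions

kept, muxes = instantiate(connections, used)
print(len(kept))   # prints 5
print(muxes)
```

In this example the multiplier and its two connections disappear, and multiplexers are needed on the ALU's `b` input and the register file's write-data port.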

Step 3. Control Generation
• Laborious step performed automatically
[Figure: generated control blocks attached to the datapath]

Output: Verilog RTL Description
[Figure: SPREE emits Verilog RTL for the datapath and its generated control]

Back-end Infrastructure
• Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
• Modelsim RTL Simulator
• Quartus II 4.2 CAD Software
• Stratix 1S40 device
• Measured: 1. Cycle Count 2. Resource Usage 3. Clock Frequency 4. Power
In this work we can measure each accurately!

Metrics for Measurement
• Area: Equivalent Stratix Logic Elements (LEs)
  • Relative silicon areas used for RAMs/Multipliers
• Performance: Wall clock time
  • Cycle count ÷ clock frequency
  • Arithmetic mean across benchmark set
• Energy: Dynamic Energy (e.g. nJ/instr)
  • Excluding I/O
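The performance metric can be written out as a short calculation. The function names and the benchmark numbers below are made up purely to show the arithmetic; they are not measurements from this work.

```python
def wall_clock_seconds(cycle_count, clock_mhz):
    """Wall clock time = cycle count / clock frequency."""
    return cycle_count / (clock_mhz * 1e6)

def mean_wall_clock(cycle_counts, clock_mhz):
    """Arithmetic mean of wall clock time across a benchmark set."""
    times = [wall_clock_seconds(c, clock_mhz) for c in cycle_counts]
    return sum(times) / len(times)

# Illustrative (invented) cycle counts for two benchmarks on an 80 MHz processor.
cycles = {"bench_a": 2_000_000, "bench_b": 6_000_000}
print(mean_wall_clock(cycles.values(), 80.0))
```

Because clock frequency is in the denominator, a design with a lower cycle count can still lose on wall clock time if it also lowers the achievable clock frequency, which is why both are measured.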

Trace-Based Verification
• Ensure SPREE generates functional processors
• Benchmark applications run on both Modelsim (RTL simulator) and MINT (instruction-set simulator), each producing a trace
• The two traces are compared
All generated soft processors are verified this way
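The comparison step could look something like the sketch below, which reports the first point at which two traces diverge. The trace format (a list of bit-strings) and the `first_mismatch` helper are assumptions; the actual trace contents used with Modelsim and MINT are not specified on the slide.

```python
from itertools import zip_longest

def first_mismatch(rtl_trace, iss_trace):
    """Return (index, rtl_entry, iss_entry) of the first difference, or None if equal.
    A trace that ends early shows up as a None entry on that side."""
    for i, (rtl, iss) in enumerate(zip_longest(rtl_trace, iss_trace)):
        if rtl != iss:
            return i, rtl, iss
    return None

# Illustrative traces, reusing the bit patterns drawn on the slide.
iss = ["110100", "101011", "111101"]
good_rtl = ["110100", "101011", "111101"]
bad_rtl = ["110100", "101111", "111101"]

print(first_mismatch(good_rtl, iss))   # prints None
print(first_mismatch(bad_rtl, iss))    # prints (1, '101111', '101011')
```

Reporting the index of the first divergence, rather than a plain pass/fail, makes it easier to locate the instruction at which a generated processor misbehaves.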

Architectural Exploration Results

Architectural Features Explored
• Hardware vs software multiplication
• Shifter implementation
• Pipelining
  • Depth
  • Organization
  • Forwarding

Validation of SPREE Through Comparison to Altera’s Nios II
• Has three variations:
  • Nios II/e – unpipelined, no HW multiplier
  • Nios II/s – 5-stage, with HW multiplier
  • Nios II/f – 6-stage, dynamic branch prediction
• Caveats – not completely fair comparison
  • Very similar but tweaked ISA
  • Nios II supports exceptions, OS, and caches
  • We do not, and so save on the hardware costs
We believe the comparison is meaningful

SPREE vs Nios II
• Competitive and can dominate (9% smaller, 11% faster)
• Annotated SPREE design point: 3-stage pipe, HW multiply, multiply-based shifter
[Figure: performance vs area plot, with axes marked "faster" and "smaller"]

Hardware vs Software Multiplication
• Hardware multiply is fast but not always needed
• Wastes area (220 LEs) and can waste energy
Total energy is wasted if there are few multiply instructions, saved if there are many

Shifter Implementation
• Shifters are expensive in FPGAs
• We explore three implementations:
  • Serial shifter (shift register)
  • Multiplier-based barrel shifter (hard multiplier)
  • LUT-based barrel shifter (multiplexer tree)
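A behavioural sketch of the three shifter styles may help show why they trade off differently. It models 32-bit left shifts only and is not the hardware from this work; the function names and the cycle-count model for the serial shifter are assumptions.

```python
MASK32 = 0xFFFFFFFF

def serial_shift_left(value, amount):
    """Shift register: moves one bit position per cycle. Returns (result, cycles)."""
    cycles = 0
    for _ in range(amount):
        value = (value << 1) & MASK32
        cycles += 1
    return value, cycles

def multiplier_shift_left(value, amount):
    """Multiplier-based: a left shift by n is a multiplication by 2**n."""
    return (value * (1 << amount)) & MASK32

def lut_barrel_shift_left(value, amount):
    """Multiplexer tree: five stages, where stage k either passes or shifts by 2**k."""
    for k in range(5):
        if (amount >> k) & 1:
            value = (value << (1 << k)) & MASK32
    return value

x = 0x0000F00F
print(hex(multiplier_shift_left(x, 4)))   # prints 0xf00f0
print(serial_shift_left(x, 4))            # prints (983280, 4)
print(hex(lut_barrel_shift_left(x, 4)))   # prints 0xf00f0
```

All three compute the same result; the serial version spends one cycle per bit position, the multiplier version reuses a hard multiplier, and the multiplexer tree costs logic for each of its stages.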

Performance-Area of Different Shifter Implementations
• Multiplier-based shifter is a good compromise
[Figure: performance vs area plot, with axes marked "faster" and "smaller"]

Pipeline Depth
• Explored between 2 and 7 stages
• 1-stage and 6-stage pipelines not interesting
2-stage: F/D/R/EX/M | WB
3-stage: F/D | R/EX/M | WB
4-stage: F | D | R/EX/M | WB
5-stage: F | D | R/EX | EX/M | WB
7-stage: F | D | R | EX | EX | EX/M | WB (new)

Pipeline Depth and Performance
• The 2-stage and 7-stage pipelines suffer from nuances
• The 3-, 4-, and 5-stage pipelines perform the same

Pipeline Organization Tradeoff
4-stage (A): F | D | R/EX/M | WB
4-stage (B): F/D | R/EX | EX/M | WB
4-stage (B) is 15% faster but requires up to 70 more LEs

Pipeline Forwarding
• Prevent stalls when data hazards occur
• MIPS has two source operands (rs & rt)
• Four forwarding configurations are possible:
  • No forwarding
  • Forward rs
  • Forward rt
  • Forward both rs and rt
Pipeline shown: F | D/R | EX | M | WB
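The four configurations can be compared with a toy stall counter. This is a deliberately simplified model, not the pipeline measured in this work: it assumes a hazard exists only between back-to-back instructions, that an unforwarded hazard costs a fixed `penalty` of one cycle, and that instructions are given as hypothetical `(dest, rs, rt)` register tuples.

```python
def count_stalls(program, forward_rs, forward_rt, penalty=1):
    """Count stall cycles for a list of (dest, rs, rt) register tuples.
    A stall occurs when an instruction reads the register written by the
    instruction immediately before it and that operand is not forwarded."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_dest = prev[0]
        _, rs, rt = cur
        if prev_dest is None:          # previous instruction writes no register
            continue
        rs_hazard = rs == prev_dest and not forward_rs
        rt_hazard = rt == prev_dest and not forward_rt
        if rs_hazard or rt_hazard:
            stalls += penalty
    return stalls

# Illustrative instruction sequence (register numbers are arbitrary).
program = [
    (1, 2, 3),   # r1 = r2 op r3
    (4, 1, 5),   # reads r1 through rs
    (6, 7, 4),   # reads r4 through rt
    (8, 6, 6),   # reads r6 through both rs and rt
]

for name, frs, frt in [("none", False, False), ("rs", True, False),
                       ("rt", False, True), ("both", True, True)]:
    print(name, count_stalls(program, frs, frt))
# prints: none 3, rs 2, rt 2, both 0 (one per line)
```

On real benchmarks the two operands are not used symmetrically, so the benefit of forwarding one operand over the other has to be measured rather than assumed.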

Pipeline Forwarding
• Up to 20% speed improvement for both operands
• The rs operand benefits more than rt (9% faster)

Summary of Presented Architectural Conclusions
• Hardware multiplication can be wasteful
• Multiplier-based shifter is a sweet spot
• 3-stage pipelines are attractive
• Tradeoffs exist within pipeline organization
• Forwarding
  • Improves performance by up to 20%
  • Favours the rs operand

Future Work
• Explore other exciting architectural axes
  • Branch prediction, aggressive forwarding
  • ISA changes
  • VLIW datapaths
  • Caches and memory hierarchy
• Compiler optimizations
• Port to other devices
• Explore aggressive customization
• Add exceptions and OS support
