The Microarchitecture of FPGA-Based Soft Processors
Peter Yiannacouras, Jonathan Rose, Greg Steffan
University of Toronto, Electrical and Computer Engineering

Processors and FPGAs
• Processors are present in many digital systems
• Soft processors are implemented in the FPGA fabric
[Figure: an FPGA containing a processor alongside custom logic]
Our goal is to study the architecture of soft processors.

Motivation for understanding soft processor architecture
• Soft processors are popular
  • 16% of FPGA designs use a soft processor (FPGA Journal, November 2003)
  • This number has increased and will continue to increase
• Soft processors are end-user customizable
  • Application-specific architectural tradeoffs
  • Can be tuned by designers

Don't we already understand processor architecture?
• Not accurately/completely
  • Accurate cycle-to-cycle behaviour
  • Estimated area/power
  • No clock frequency impact
• Not in FPGA domain
  • Lookup tables vs transistors
  • Dedicated RAMs and multipliers are fast
We must revisit processor architecture in the FPGA context.

Research Goals
• Generate soft processor implementations
  • System for generating RTL
• Develop measurement methodology
  • Metrics for comparing soft processors
• Develop understanding of architectural tradeoffs
  • Analyze area/performance/power space
Explore soft processor architecture experimentally.

Soft Processor Rapid Exploration Environment (SPREE)
[Figure: SPREE takes an ISA description and a datapath description as inputs and produces RTL]

Input: Instruction Set Architecture (ISA) Description
• Graph of Generic Operations (GENOPs)
• Edges indicate flow of data
• Example: MIPS ADD (add rd, rs, rt)
[Figure: GENOP graph for ADD with nodes FETCH, RFREAD, RFREAD, ADD, RFWRITE]
The ISA is currently fixed (subset of MIPS I).
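
As a rough illustration of the idea on this slide, the Python sketch below encodes the ADD example as a graph of GENOPs whose edges carry data. The dictionary layout, the `_rs`/`_rt` node suffixes, and the `topological_order` helper are assumptions made for illustration and are not SPREE's actual input format. The edges from FETCH to the two RFREAD nodes are also an assumption about the slide's figure.

```python
from graphlib import TopologicalSorter

# Hypothetical encoding: each GENOP node maps to the nodes it sends data to.
# Node names follow the slide's ADD example; the exact edges are assumed.
ADD_GRAPH = {
    "FETCH": ["RFREAD_rs", "RFREAD_rt"],
    "RFREAD_rs": ["ADD"],
    "RFREAD_rt": ["ADD"],
    "ADD": ["RFWRITE"],
    "RFWRITE": [],
}

def topological_order(graph):
    """Return the nodes so that every producer precedes its consumers."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    preds = {node: set() for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            preds[dst].add(src)
    return list(TopologicalSorter(preds).static_order())

order = topological_order(ADD_GRAPH)
print(order)
```

Any valid ordering starts at FETCH and ends at RFWRITE, which matches the data flow the slide describes for `add rd, rs, rt`.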

Input: Datapath Description
• Interconnection of hand-coded components
• Allows efficient synthesis
• Described using C++
[Figure: a datapath assembled from the SPREE Component Library (Ifetch, Reg File, Mul, Shifter, ALU, Data Mem, Write Back)]
Limited to simple in-order issue pipelines.

Step 1. ISA vs Datapath Verification
• Components described using GENOPs
[Figure: SPREE verifies the GENOP graph (FETCH, RFREAD, RFREAD, ADD, RFWRITE) against the datapath components (Ifetch, Reg File, ALU, Mul, Data Mem, Write Back)]

Step 2. Datapath Instantiation
• Multiplexer insertion
• Unused connection/component removal
[Figure: the instantiated datapath (Ifetch, Reg File, ALU, Mul, Data Mem, Write Back)]
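
The two bullets of this step can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not SPREE's implementation: connections are hypothetical `(source, destination_port)` pairs, a multiplexer is needed wherever one port has several sources, and a component is dropped when no instruction uses it. All port and component names are made up for the example.

```python
from collections import defaultdict

def insert_muxes(connections):
    """Map each destination port with more than one source to its mux inputs."""
    sources = defaultdict(list)
    for src, dst_port in connections:
        sources[dst_port].append(src)
    return {port: srcs for port, srcs in sources.items() if len(srcs) > 1}

def prune_unused(components, used_by_isa):
    """Keep only the components that at least one instruction exercises."""
    return [c for c in components if c in used_by_isa]

# Hypothetical datapath wiring (names are illustrative only).
connections = [
    ("alu.result", "regfile.write_data"),
    ("data_mem.out", "regfile.write_data"),
    ("mul.result", "regfile.write_data"),
    ("regfile.rs", "alu.a"),
]
muxes = insert_muxes(connections)
kept = prune_unused(
    ["ifetch", "regfile", "alu", "mul", "shifter", "data_mem"],
    used_by_isa={"ifetch", "regfile", "alu", "mul", "data_mem"},
)
print(muxes)
print(kept)
```

In this example the register file write port receives a three-input multiplexer, and the shifter is removed because no instruction in the assumed ISA subset uses it.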

Step 3. Control Generation
[Figure: the datapath (Ifetch, Reg File, ALU, Mul, Data Mem, Write Back) with generated control blocks attached]
This laborious step is performed automatically.

Output: Verilog RTL Description
[Figure: SPREE emits Verilog RTL for the complete processor, datapath plus control]

Back-end Infrastructure
[Figure: the generated RTL and the benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc) feed the Modelsim RTL simulator and the Quartus II 4.2 CAD software targeting a Stratix 1S40]
Measurements:
1. Cycle Count
2. Resource Usage
3. Clock Frequency
4. Power
In this work we can measure each accurately!

Metrics for Measurement
• Area: Equivalent Stratix Logic Elements (LEs)
  • Relative silicon areas used for RAMs/Multipliers
• Performance: Wall clock time
  • Cycle count ÷ clock frequency
  • Arithmetic mean across benchmark set
• Energy: Dynamic Energy (e.g. nJ/instr)
  • Excluding I/O
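
The performance metric above is simple arithmetic, and a short Python sketch makes the order of operations explicit: each benchmark's cycle count is divided by the clock frequency, and the resulting times are averaged. The function names, benchmark names, and all numbers are invented for illustration and are not measurements from this work.

```python
def wall_clock_time(cycle_count, clock_hz):
    """Execution time in seconds: cycle count divided by clock frequency."""
    return cycle_count / clock_hz

def mean_wall_clock_time(cycle_counts, clock_hz):
    """Arithmetic mean of the per-benchmark wall clock times."""
    times = [wall_clock_time(c, clock_hz) for c in cycle_counts.values()]
    return sum(times) / len(times)

# Made-up cycle counts for two hypothetical benchmarks on a 100 MHz processor.
cycles = {"bench_a": 1_000_000, "bench_b": 3_000_000}
mean_s = mean_wall_clock_time(cycles, clock_hz=100e6)
print(f"{mean_s:.3f} s")
```

Because time depends on both cycle count and clock frequency, a design that needs more cycles can still come out ahead if it clocks faster.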

Trace-Based Verification
• Ensure SPREE generates functional processors
[Figure: the benchmark applications run on both Modelsim (RTL simulator) and MINT (instruction-set simulator); the two traces are compared]
All generated soft processors are verified this way.
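
The comparison step can be pictured with the minimal Python sketch below. It assumes each simulator produces a list of trace entries (shown as bit strings, as in the slide's figure) and reports where the two first diverge. The exact trace contents and the `first_mismatch` helper are assumptions, since the slide does not specify them.

```python
from itertools import zip_longest

def first_mismatch(rtl_trace, iss_trace):
    """Return the index of the first differing entry, or None if identical.

    A length difference counts as a mismatch at the end of the shorter trace.
    """
    for i, (a, b) in enumerate(zip_longest(rtl_trace, iss_trace)):
        if a != b:
            return i
    return None

# Illustrative traces only; real entries would come from the two simulators.
rtl = ["110100", "101011", "111101"]
iss = ["110100", "101011", "111101"]
print(first_mismatch(rtl, iss))
```

Reporting the first divergence rather than a plain pass/fail points directly at the instruction where a generated processor goes wrong.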

Architectural Features Explored
• Hardware vs software multiplication
• Shifter implementation
• Pipelining
  • Depth
  • Organization
  • Forwarding

Validation of SPREE Through Comparison to Altera's Nios II
• Nios II has three variations:
  • Nios II/e: unpipelined, no HW multiplier
  • Nios II/s: 5-stage, with HW multiplier
  • Nios II/f: 6-stage, dynamic branch prediction
• Caveats: not a completely fair comparison
  • Very similar but tweaked ISA
  • Nios II supports exceptions, OS, and caches
  • SPREE processors do not, and so save on the hardware costs
We believe the comparison is meaningful.

SPREE vs Nios II
[Figure: performance vs area plot, axes marked "faster" and "smaller"; annotated design point: 3-stage pipe, HW multiply, multiply-based shifter]
SPREE processors are competitive and can dominate (9% smaller, 11% faster).

Hardware vs Software Multiplication
• Hardware multiply is fast but not always needed
• Wastes area (220 LEs) and can waste energy
Total energy is wasted if there are few multiply instructions and saved if there are many.
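
To make the tradeoff concrete, the Python sketch below shows a generic shift-and-add multiply of the kind a software routine falls back on without a hardware multiplier. It is a textbook algorithm on 32-bit values, not necessarily the routine used in this work. The returned iteration count gives a feel for why each software multiply costs many instructions.

```python
MASK32 = 0xFFFFFFFF

def soft_multiply(a, b):
    """Return (low 32 bits of a*b, number of loop iterations used)."""
    a &= MASK32
    b &= MASK32
    product = 0
    iterations = 0
    while b:
        if b & 1:                      # add the shifted multiplicand
            product = (product + a) & MASK32
        a = (a << 1) & MASK32          # next partial product
        b >>= 1                        # consume one multiplier bit
        iterations += 1
    return product, iterations

print(soft_multiply(6, 7))
```

A program with few multiplies pays little for this loop, while the hardware multiplier occupies area all the time, which is the tension the slide describes.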

Shifter Implementation
• Shifters are expensive in FPGAs
• We explore three implementations:
  • Serial shifter (shift register)
  • Multiplier-based barrel shifter (hard multiplier)
  • LUT-based barrel shifter (multiplexer tree)
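
The three options compute the same result in different ways, which the Python sketch below models for 32-bit left shifts. It is a behavioural illustration only: the function names are invented, right shifts are not covered, and the cycle count for the serial shifter assumes one bit position per cycle.

```python
MASK32 = 0xFFFFFFFF

def serial_shift_left(value, amount):
    """Shift one bit per step, as a shift register would; returns (result, cycles)."""
    cycles = 0
    for _ in range(amount):
        value = (value << 1) & MASK32
        cycles += 1
    return value, cycles

def multiplier_shift_left(value, amount):
    """Shift by multiplying by 2**amount, as a hard multiplier could."""
    return (value * (1 << amount)) & MASK32

def barrel_shift_left(value, amount):
    """Shift through five mux stages that shift by 1, 2, 4, 8, and 16 bits."""
    for stage in range(5):
        if (amount >> stage) & 1:
            value = (value << (1 << stage)) & MASK32
    return value

print(serial_shift_left(1, 5), multiplier_shift_left(1, 5), barrel_shift_left(1, 5))
```

The serial version needs time proportional to the shift amount, the multiplexer tree spends logic on every stage, and the multiplier version reuses a block the FPGA already provides.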

Performance-Area of Different Shifter Implementations
[Figure: performance vs area plot, axes marked "faster" and "smaller"]
The multiplier-based shifter is a good compromise.

Pipeline Depth
• Explored between 2 and 7 stages
• 1-stage and 6-stage pipelines not interesting
Stage organization:
2-stage: F/D/R/EX/M | WB
3-stage: F/D | R/EX/M | WB
4-stage: F | D | R/EX/M | WB
5-stage: F | D | R/EX | EX/M | WB
7-stage: F | D | R | EX | EX | EX/M | WB (new)

Pipeline Depth and Performance
[Figure: performance for each pipeline depth]
• The 2-stage and 7-stage pipelines suffer from nuances
• The 3-, 4-, and 5-stage pipelines perform the same

Pipeline Organization Tradeoff
4-stage (A): F | D | R/EX/M | WB
4-stage (B): F/D | R/EX | EX/M | WB
4-stage (B) is 15% faster but requires up to 70 more LEs.

Pipeline Forwarding
• Prevent stalls when data hazards occur
• MIPS has two source operands (rs & rt)
• Four forwarding configurations are possible:
  • No forwarding
  • Forward rs
  • Forward rt
  • Forward both rs and rt
Pipeline: F | D/R | EX | M | WB
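
The four configurations can be compared with a toy stall counter like the Python sketch below. It rests on loud simplifying assumptions that are not taken from the slides: a hazard exists only between back-to-back instructions, each unforwarded hazard costs a fixed one-cycle penalty, and instructions are hypothetical `(rd, rs, rt)` register-number tuples.

```python
def count_stalls(program, forward_rs=False, forward_rt=False, penalty=1):
    """Count stall cycles for a list of (rd, rs, rt) register numbers."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_rd = prev[0]
        if prev_rd is None or prev_rd == 0:   # register 0 is hardwired in MIPS
            continue
        _, rs, rt = cur
        rs_hazard = rs == prev_rd and not forward_rs
        rt_hazard = rt == prev_rd and not forward_rt
        if rs_hazard or rt_hazard:
            stalls += penalty
    return stalls

# Made-up instruction sequence: (destination, source rs, source rt).
program = [(1, 2, 3), (4, 1, 5), (6, 7, 4), (8, 6, 6)]
for rs_on in (False, True):
    for rt_on in (False, True):
        print(rs_on, rt_on, count_stalls(program, rs_on, rt_on))
```

How much each operand gains from forwarding depends on how often real programs consume a fresh result through rs versus rt, which is why it has to be measured on benchmarks.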

Pipeline Forwarding
[Figure: speed improvement per forwarding configuration, annotated 9% and 20%]
• Up to 20% speed improvement for both operands
• The rs operand benefits more than rt (9% faster)

Summary of Presented Architectural Conclusions
• Hardware multiplication can be wasteful
• Multiplier-based shifter is a sweet spot
• 3-stage pipelines are attractive
• Tradeoffs exist within pipeline organization
• Forwarding
  • Improves performance by up to 20%
  • Favours the rs operand

Future Work
• Explore other exciting architectural axes
  • Branch prediction, aggressive forwarding
  • ISA changes
  • VLIW datapaths
  • Caches and memory hierarchy
• Compiler optimizations
• Port to other devices
• Explore aggressive customization
• Add exceptions and OS support