The Microarchitecture of FPGA-Based Soft Processors

Peter Yiannacouras
CARG, June 14, 2005

FPGA vs ASIC Flows
• Reduced cost for low-volume designs
• Reduced time-to-market
• Programmability affords customization
• Designers use FPGAs!
[Figure: ASIC flow and FPGA flow, each starting from circuit design]

Processors and FPGAs
• Option 1: Off-chip processor
  • Increased board area, cost, and latency
• Option 2: On-chip “hard” processor
  • Specialized part, lack of flexibility
• Option 3: On-chip “soft” processor
  • Can implement any number of processors
  • Tune each one to meet design constraints
[Figure: FPGA custom logic and processor placement for each option]

Tuning Processors
• Inputs: application, design constraints
• $3, 4 MHz, 800 mW, 2-stage pipeline
• $300, 3.8 GHz, 80 W, 31-stage pipeline

Tuning Soft Processors / Automatically Tuning Soft Processors
• Inputs: application, design constraints
• 1700 LEs, 160 MHz, 6-stage pipeline
• 500 LEs, 40 MHz, 2-stage pipeline
• Your area, speed, power tradeoff

Understanding Soft Processors
• Tuning requires understanding of the soft processor design space
• We implement many processors and study the design space
[Figure: architecture description to synthesized processor, measured for area, performance, and power]

Don’t we already understand architecture?
• Not completely
  • We can evaluate area, power, performance
• Not accurately (rules of thumb)
  • FPGA CAD tools are very accurate
• Not in the FPGA domain
  • LUTs vs transistors
  • Relative speed of RAM & multipliers

Goals
• Develop measurement methodology
• Populate the design space
• Compare against industrial soft processor(s)

Measurement Methodology
• Require a set of metrics: area, performance, power
• The FPGA flow, run on the circuit design (RTL), reports:
  • Resource usage
  • Clock frequency
  • Power estimate

Area
• Measure physical area in equivalent LEs (e.g., a 9-bit multiplier is equivalent to 23 LEs in area)
• Resources counted: logic elements (LEs: LUT & flip-flop), multipliers, little RAM, medium RAM, big RAM
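The equivalent-LE idea can be sketched as a weighted sum over resource counts. This is only an illustration: the function name, the resource keys, and the usage numbers are hypothetical, and the only weight taken from the slide is 23 LEs per 9-bit multiplier. Weights for the RAM blocks are not given in the slides, so they are deliberately left out and must be supplied by the caller.

```python
# Sketch of an "equivalent LEs" area metric, assuming area is a weighted
# sum of resource counts. Only the mult9 weight (23) comes from the slides.
EQUIV_LE_WEIGHTS = {
    "le": 1,       # a logic element counts as itself
    "mult9": 23,   # 9-bit multiplier, per the slide
}


def equivalent_les(usage, weights=EQUIV_LE_WEIGHTS):
    """Return total area in equivalent LEs for a resource-usage dict.

    Raises KeyError for a resource with no known weight (e.g. RAM blocks),
    rather than silently guessing one.
    """
    return sum(weights[resource] * count for resource, count in usage.items())


# Hypothetical resource report: 1000 LEs and four 9-bit multipliers.
area = equivalent_les({"le": 1000, "mult9": 4})
print(area)
```

Raising on unknown resources keeps the sketch honest: a design that uses RAM blocks cannot be sized until their weights are known.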

Performance
• Wall clock time = #cycles × clock period
  • Clock period: from the CAD tool
  • #Cycles: from RTL simulation, averaged over 20 benchmarks
• Benchmarks by source:
  • MiBench: bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia
  • Freescale: Dhrystone 2.1
  • Xirisc: bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc
  • RATEs: dct, gol
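The wall-clock formula above can be written as a small helper. The function name and the example numbers are made up for illustration; the only thing taken from the slide is that wall clock time is the cycle count multiplied by the clock period.

```python
def wall_clock_time_us(cycles, clock_mhz):
    """Wall clock time in microseconds.

    The clock period in microseconds is 1 / clock_mhz, so
    cycles * period reduces to cycles / clock_mhz.
    """
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_mhz


# Hypothetical run: 1.6 million cycles on a 160 MHz design.
print(wall_clock_time_us(1_600_000, 160.0))
```

Because both factors matter, a design with a higher clock frequency can still lose on wall clock time if it needs proportionally more cycles.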

Power
• CAD tool can estimate power from an assumed toggle ratio (derived experimentally)
• Dynamic energy excluding I/O per cycle (nJ/cycle) = total dynamic power (mW) ÷ clock frequency (MHz)
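A short unit check shows why dividing milliwatts by megahertz yields nanojoules per cycle. The symbols E_cycle, P_dyn, and f_clk are introduced here for illustration, and the 160 mW figure in the worked line is a made-up value, not a measurement from the talk.

```latex
E_{\text{cycle}} = \frac{P_{\text{dyn}}}{f_{\text{clk}}},
\qquad
\frac{1\ \text{mW}}{1\ \text{MHz}}
  = \frac{10^{-3}\ \text{J/s}}{10^{6}\ \text{cycles/s}}
  = 10^{-9}\ \text{J/cycle}
  = 1\ \text{nJ/cycle}

% Worked example with hypothetical numbers:
\frac{160\ \text{mW}}{160\ \text{MHz}} = 1\ \text{nJ/cycle}
```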

Metrics summary
• Require the following information:
  • Resource usage (area; from the CAD tool)
  • Clock frequency (wall clock time; from the CAD tool)
  • Power estimate (energy/cycle; from the CAD tool)
  • Cycle count (wall clock time; from the RTL simulator)

RTL-based Design Space Exploration
• Complete and accurate understanding of the design space
• Circuit design (RTL) and benchmarks feed the RTL simulator and the CAD tool
• RTL simulator: 1. Correctness 2. Cycle count
• CAD tool: 3. Area 4. Clock frequency 5. Power

Goals
• Develop measurement methodology
• Populate the design space
• Compare against industrial soft processor(s)

Microarchitectural Design Space Exploration
• Need a fast route to RTL from an architectural idea
• Circuit design (RTL) and benchmarks feed the RTL simulator and the CAD tool
• RTL simulator: 1. Correctness 2. Cycle count
• CAD tool: 3. Area 4. Clock frequency 5. Power

SPREE (Soft Processor Rapid Exploration Environment)
• The SPREE RTL generator and benchmarks feed the RTL simulator and the CAD tool
• RTL simulator: 1. Correctness 2. Cycle count
• CAD tool: 3. Area 4. Clock frequency 5. Power

Goals
• Develop measurement methodology
• Populate the design space (SPREE)
  • Rapidly
  • With interesting designs
  • Accurately (minimize overhead)
• Compare against industrial soft processor(s)

Related Work
• Parametrized cores
  • Narrow design space, laborious changes to control
• Architecture Description Languages (ADLs)
  • Too robust, inaccurate (simulator based, or behavioural RTL)
• PEAS-III/ASIPMeister [Itoh2000]
  • Not FPGA-specific, ISA design focus

SPREE RTL Generator Overview
• Rapidly: simple descriptions
• Interesting: allows for interesting architectures
• Accurately: efficient component implementations
[Figure: ISA description, datapath description, and component library feed the SPREE RTL generator, which emits efficiently synthesizable RTL]

Some current limitations
• No caches (use fast on-chip RAM)
• Simple in-order issue pipelines
• No dynamic branch prediction
• No OS or exceptions support
• No ISA changes!
  • Need compiler generation to support
  • Use subset of MIPS-I

Architecture Input: Component Library
[Figure: library components: Ifetch, Reg File, ALU, Mul, Data Mem, Write Back]

Architecture Input: Datapath Description
[Figure: datapath assembled from component library parts: Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]

Architecture Input: ISA Description and Datapath Description
• Control generation saves time and is non-critical
[Figure: ISA description, datapath description, and component library feed the SPREE RTL generator; the generated datapath (IF, Reg file, ALU, Mul, Data Mem, Write Back) gains Decode logic]

Architecture Input: ISA Description
• Generic Operations (GENOPs), e.g. FETCH, RFREAD, ADD, RFWRITE
• MIPS instructions are made of GENOPs
• Example: MIPS ADD (add rd, rs, rt) = FETCH, RFREAD, RFREAD, ADD, RFWRITE
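One way to picture an ISA description built from GENOPs is a table mapping each instruction to its GENOP sequence. The sketch below is not SPREE's actual input format, which the slides do not show; the dictionary layout and both helper functions are hypothetical. Only the decomposition of `add` comes from the slide.

```python
# Hypothetical ISA description: instruction name -> ordered GENOPs.
# The "add" entry follows the slide; the data layout is an assumption.
ISA_DESCRIPTION = {
    "add": ["FETCH", "RFREAD", "RFREAD", "ADD", "RFWRITE"],
}


def required_genops(isa):
    """Collect every distinct GENOP that the ISA description uses."""
    return set().union(*isa.values())


def missing_genops(isa, supported):
    """GENOPs the ISA needs that a given datapath does not provide."""
    return required_genops(isa) - set(supported)


# A datapath lacking a register-file write port could not implement "add".
print(missing_genops(ISA_DESCRIPTION, ["FETCH", "RFREAD", "ADD"]))
```

A check of this kind is the sort of validation a generator could run before emitting RTL: every GENOP an instruction needs must be provided by some component in the datapath.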

Complete Experimental Framework Using SPREE
• Inputs: ISA description (fixed), datapath description, component library
• The SPREE RTL generator and benchmarks feed the RTL simulator and the CAD tool
• RTL simulator: 1. Correctness 2. Cycle count
• CAD tool: 3. Area 4. Clock frequency 5. Power

Goals
• Develop measurement methodology
• Populate the design space
• Compare against industrial soft processor(s)
[Figure: SPREE designs in the area, performance, power space]

Altera’s NiosII
• Second generation soft processor
• Has three variations:
  • NiosIIe: unpipelined, no hardware multiply
  • NiosIIs: 5 stages, no branch prediction
  • NiosIIf: 6 stages, dynamic branch prediction
• Caveats
  • Supports exceptions, OS, and caches
  • Very similar but tweaked ISA

Design Space vs NiosII Variations

Summary
• We span the design space
• Remain competitive
  • Achieved 9% faster and 11% smaller than NiosIIs
  • => don’t suffer from prohibitive overhead
• Let’s explore some architecture!

Architectural Axes
• Hardware vs software multiplication
• Shifter implementation
• Pipeline
  • Depth
  • Organization
  • Forwarding

Hardware vs Software Multiplication
• Hardware multiplication
  • Increases area & power consumption
  • Speeds up execution
• BUT …
  • Not all applications care about speed
  • Not all applications use multiplication (significantly)

Cycle Count Speedup of Hardware Multiplication
• Must understand its cost/benefit to decide when to use it

Cost of Hardware Multiply
• ~250 LEs (20%)
• 35% more energy/cycle

Shifter Implementations
• Shifters (multiplexers) are big in FPGAs
• Consider 3 implementations:
  • Serial shifter
  • LUT-based barrel shifter
  • Multiplier-based barrel shifter
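The serial and multiplier-based options can be illustrated with a behavioural sketch. This models only the arithmetic, not the SPREE components themselves: the 32-bit width, the function names, and the one-bit-per-cycle cost model for the serial shifter are assumptions. The multiplier-based version relies on the identity that a left shift by n equals multiplication by 2^n, and that a logical right shift by n is the upper word of the product with 2^(32-n).

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def serial_shift_left(x, n):
    """Serial shifter model: shift one bit per cycle; returns (result, cycles)."""
    cycles = 0
    for _ in range(n):
        x = (x << 1) & MASK
        cycles += 1
    return x, cycles


def mul_shift_left(x, n):
    """Multiplier-based left shift: lower word of x * 2**n."""
    return (x * (1 << n)) & MASK


def mul_shift_right(x, n):
    """Multiplier-based logical right shift: upper word of x * 2**(WIDTH - n).

    The '>> WIDTH' below stands for selecting the upper product word, not a
    shifter. For n == 0 the factor 2**WIDTH needs WIDTH + 1 bits, so real
    hardware would special-case it; Python integers hide that detail.
    """
    return ((x * (1 << (WIDTH - n))) >> WIDTH) & MASK


print(hex(mul_shift_left(0x0000FFFF, 8)), hex(mul_shift_right(0xFFFF0000, 8)))
```

The serial model makes the tradeoff visible: it needs no multiplexer tree or multiplier, but its latency grows with the shift amount, whereas the multiplier-based version completes in a single multiply.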

Impact of Shifter Implementation
• Consistent across different pipe depths

Shifter Implementation Tradeoffs
• Averaged over all pipeline depths
• Smallest: serial
• Fastest: LUT-based barrel
• Most energy efficient: serial
• The multiplier-based shifter is a very nice sweet spot

Pipelines: Depth
• Study different pipeline depths
• Over 3 shifters
• Arrows = possible forwarding lines (not used)
• All use predict not-taken branches

Pipelining & clock frequency

Impact of Pipelining
• Adds area, can increase speed (2 to 3 stage?)

2-stage Pipeline
• FPGA nuance: synchronous RAMs
• Stall on all loads, and any operand fetches
[Figure: 2-stage datapath: Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]

3-stage Pipeline
• Fewer stalls, increased frequency => big speedup (1.7x)
[Figure: 3-stage datapath: Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]

3, 4 and 5 stage pipelines
• Increased area, small change in performance
• => Deeper pipelines have potential for better speedups

The 7-stage Pipeline: Where Branch Delay Slots Break Down
• The ideal case: … OR JR ADD BEQ
[Figure: pipeline contents with squashed stages marked X and one stage annotated "Never squash this stage"]

Problem: Separation of Branch and Branch Delay Slot
• Stalls on RAW hazard
• Pipeline contents: … JR ADD BEQ

Problem: Separation of Branch and Branch Delay Slot
• Pipeline contents: … JR ADD NOP BEQ (X marks a squashed stage)
• Must track and protect delay slots

Multiple Delay Slots
• Must detect separation of branch from delay slot
• OR prevent multiple delay slots
  • Stall branch if a delay slot exists in the pipe
  • We did this one (+30 LEs, -15% clock frequency)
  • Can’t guard all delay slots
• Better off eliminating delay slots (currently researching)
• Pipeline contents: … OR JR ADD BEQ

Pipeline organization
• Where stages are placed is important
• Pipe stage placement can
  • Result in an all-around “win/loss”
  • Present a tradeoff

Forwarding
• SPREE supports stage to stage forwarding
[Figure: datapath (Ifetch, Reg File, ALU, Mul, Data Mem, Write Back) with forward lines for rs and rt]

Effect of Forwarding
• 20% speed increase

An Aside: ISA Subsetting
• Applications don’t generally use all instructions
