The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005
FPGA vs ASIC Flows • Reduced cost for low-volume production • Reduced time-to-market • Programmability affords customization • Designers use FPGAs! [Figure: ASIC flow and FPGA flow, each starting from Circuit Design]
Processors and FPGAs • Option 1: Off-chip processor (increased board area, cost, and latency) • Option 2: On-chip “hard” processor (specialized part, lack of flexibility) • Option 3: On-chip “soft” processor (can implement any number of processors; tune each one to meet design constraints) [Figure: for each option, an FPGA holding Custom Logic, with the Processor off-chip, hard on-chip, or built from the FPGA fabric]
Tuning Processors • Application, Design constraints • $3, 4 MHz, 800 mW, 2-stage pipeline • $300, 3.8 GHz, 80 W, 31-stage pipeline
Automatically Tuning Soft Processors • Application, Design constraints • 500 LEs, 40 MHz, 2-stage pipeline • 1700 LEs, 160 MHz, 6-stage pipeline • your area, speed, power tradeoff
Understanding Soft Processors • Tuning requires understanding of the soft processor design space • We implement many processors and study the design space [Figure: Architecture Description -> Synthesized Processor -> Area, Performance, Power]
Don’t we already understand architecture? • Not completely: we can evaluate area, power, and performance • Not accurately (rules of thumb): FPGA CAD tools are very accurate • Not in the FPGA domain: LUTs vs transistors, relative speed of RAM & multipliers
Goals • Develop measurement methodology • Populate the design space • Compare against industrial soft processor(s)
Measurement Methodology • Require a set of metrics: Area, Performance, Power [Figure: Circuit Design (RTL) -> FPGA Flow -> Resource Usage, Clock Frequency, Power estimate]
Area • Measure physical area in equivalent LEs (e.g., a 9-bit multiplier is equivalent to 23 LEs in area) [Figure: FPGA resources: Logic Elements (LEs: LUT & flip-flop), Multipliers, Little RAM, Medium RAM, Big RAM]
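The equivalent-LE idea can be sketched in a few lines. This is a minimal illustration, not the authors' tool: only the 23-LE factor for a 9-bit multiplier comes from the slide. The function name, the RAM type names, and the requirement that the caller supply RAM factors are assumptions made here, because the slides give no RAM conversion factors.

```python
# Equivalent-LE area: a minimal sketch under stated assumptions.
# Only the 9-bit multiplier factor (23 LEs) is taken from the slides.
LES_PER_MULT9 = 23


def equivalent_les(les, mult9=0, ram_blocks=None, les_per_ram=None):
    """Return total area in equivalent LEs.

    les         -- number of logic elements used
    mult9       -- number of 9-bit multipliers used
    ram_blocks  -- dict: RAM type name -> blocks used (names are hypothetical)
    les_per_ram -- dict: RAM type name -> equivalent LEs per block
                   (not given in the slides, so the caller must supply it)
    """
    total = les + mult9 * LES_PER_MULT9
    for kind, count in (ram_blocks or {}).items():
        if count == 0:
            continue
        if les_per_ram is None or kind not in les_per_ram:
            raise ValueError(f"no equivalent-LE factor supplied for {kind!r}")
        total += count * les_per_ram[kind]
    return total


# 1000 LEs plus four 9-bit multipliers.
print(equivalent_les(1000, mult9=4))
```

Refusing to guess a RAM factor keeps the sketch honest: a design that uses RAM blocks cannot be sized without the device-specific numbers.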
Performance • Wall Clock Time = #Cycles * Clock Period • #Cycles: from RTL simulation, averaged over 20 benchmarks • Clock Period: from the CAD tool • Benchmarks by source: MiBench (bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia), Freescale (Dhrystone 2.1), Xirisc (bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc), RATEs (dct, gol)
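As a small worked example of the wall clock formula, the sketch below combines a cycle count (from RTL simulation) with a clock frequency (from the CAD tool). The function name and the sample numbers are hypothetical and are not results from the talk.

```python
def wall_clock_time_us(cycles, fmax_mhz):
    """Wall clock time in microseconds: #cycles * clock period.

    The clock period in microseconds is 1 / fmax_mhz.
    """
    if fmax_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / fmax_mhz


# Hypothetical benchmark: 1,000,000 cycles on a 100 MHz design.
print(wall_clock_time_us(1_000_000, 100.0), "us")
```

Because both factors matter, a design with a higher clock frequency can still lose on wall clock time if it needs proportionally more cycles.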
Power • CAD tool can estimate power from an assumed toggle ratio (derived experimentally) • Dynamic Energy excluding I/O per cycle (nJ/cycle) = Total Dynamic Power (mW) ÷ Clock Frequency (MHz)
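The units in the energy formula work out directly: 1 mW ÷ 1 MHz = 10⁻³ J/s ÷ 10⁶ cycles/s = 10⁻⁹ J per cycle, which is 1 nJ/cycle. A minimal sketch follows; the function name and the sample power figure are hypothetical.

```python
def energy_per_cycle_nj(dynamic_power_mw, fmax_mhz):
    """Dynamic energy per cycle in nJ.

    mW / MHz = (1e-3 J/s) / (1e6 cycles/s) = 1e-9 J/cycle = nJ/cycle,
    so no extra scaling factor is needed.
    """
    if fmax_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return dynamic_power_mw / fmax_mhz


# Hypothetical design: 80 mW of dynamic power at 160 MHz.
print(energy_per_cycle_nj(80.0, 160.0), "nJ/cycle")
```

Dividing by frequency makes the metric independent of how fast the design happens to be clocked, so designs with different clock rates can be compared per unit of work.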
Metrics summary • Require the following information • Resource Usage (area – CAD Tool) • Clock Frequency (wall clock time – CAD Tool) • Power Estimate (energy/cycle – CAD Tool) • Cycle Count (wall clock time – RTL Simulator)
RTL-based Design Space Exploration • Complete and accurate understanding of the design space [Figure: Circuit Design (RTL) and Benchmarks -> RTL Simulator (1. Correctness, 2. Cycle Count); Circuit Design (RTL) -> CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]
Goals • Develop measurement methodology • Populate the design space • Compare against industrial soft processor(s)
Microarchitectural Design Space Exploration • Need a fast route to RTL from an architectural idea [Figure: Circuit Design (RTL) and Benchmarks -> RTL Simulator (1. Correctness, 2. Cycle Count); Circuit Design (RTL) -> CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]
SPREE (Soft Processor Rapid Exploration Environment) [Figure: SPREE RTL Generator -> RTL; RTL and Benchmarks -> RTL Simulator (1. Correctness, 2. Cycle Count); RTL -> CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]
Goals • Develop measurement methodology • Populate the design space (SPREE): rapidly, with interesting designs, accurately (minimize overhead) • Compare against industrial soft processor(s)
Related Work • Parameterized cores: narrow design space, laborious changes to control • Architecture Description Languages (ADLs): too robust, inaccurate (simulator-based or behavioural RTL) • PEAS-III/ASIPMeister [Itoh2000]: not FPGA-specific, ISA design focus
SPREE RTL Generator Overview • Rapidly: simple descriptions • Interesting: allows for interesting architectures • Accurately: efficient component implementations [Figure: ISA Description, Datapath Description, and Component Library -> SPREE RTL Generator -> Efficiently Synthesizable RTL]
Some current limitations • No caches (use fast on-chip RAM) • Simple in-order issue pipelines • No dynamic branch prediction • No OS or exceptions support • No ISA changes! (need compiler generation to support; use subset of MIPS-I)
Architecture Input [Figure: a Component Library (Ifetch, Reg File, Mul, ALU, Data Mem, Write Back) and a Datapath Description that connects those components]
Architecture Input • Control generation saves time and is non-critical [Figure: ISA Description, Datapath Description, and Component Library -> SPREE RTL Generator; generated processor: IF, Decode, Reg file, Mul, ALU, Data Mem, Write Back]
Architecture Input: ISA Description • Generic Operations (GENOPs), e.g. FETCH, RFREAD, ADD, RFWRITE • MIPS instructions are made of GENOPs [Figure: MIPS ADD (add rd, rs, rt) as a graph of GENOPs: FETCH, two RFREADs, ADD, RFWRITE]
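One way to picture an instruction built from GENOPs is as a small dataflow graph. The sketch below is illustrative only: the dictionary layout, the node names, the component table, and both helper functions are assumptions made here and do not reproduce SPREE's actual input format. Only the GENOP names and the ADD example come from the slide.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical encoding of "add rd, rs, rt" as a graph of GENOPs.
ADD = {
    "nodes": {"fetch": "FETCH", "rs": "RFREAD", "rt": "RFREAD",
              "add": "ADD", "wb": "RFWRITE"},
    "edges": [("fetch", "rs"), ("fetch", "rt"),
              ("rs", "add"), ("rt", "add"), ("add", "wb")],
}

# Hypothetical component table: which GENOPs each datapath component provides.
COMPONENTS = {
    "ifetch": {"FETCH"},
    "regfile": {"RFREAD", "RFWRITE"},
    "alu": {"ADD"},
}


def schedule(instr):
    """Return the instruction's GENOPs in a valid dependency order."""
    preds = {name: set() for name in instr["nodes"]}
    for src, dst in instr["edges"]:
        preds[dst].add(src)
    order = TopologicalSorter(preds).static_order()
    return [instr["nodes"][name] for name in order]


def missing_genops(instr, components):
    """GENOPs the instruction needs that no component provides."""
    needed = set(instr["nodes"].values())
    provided = set().union(*components.values())
    return needed - provided


print(schedule(ADD))
print(missing_genops(ADD, COMPONENTS))
```

A check like `missing_genops` hints at why a fixed ISA description can be reused across datapaths: any datapath whose components cover the required GENOPs is a candidate implementation.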
Complete Experimental Framework Using SPREE [Figure: ISA Description (FIXED), Datapath Description, and Component Library -> SPREE RTL Generator -> RTL; RTL and Benchmarks -> RTL Simulator (1. Correctness, 2. Cycle Count); RTL -> CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]
Goals • Develop measurement methodology • Populate the design space • Compare against industrial soft processor(s) [Figure: SPREE design space over Performance, Power, and Area]
Altera’s NiosII • Second generation soft processor • Has three variations: NiosIIe (unpipelined, no hardware multiply), NiosIIs (5 stages, no branch prediction), NiosIIf (6 stages, dynamic branch prediction) • Caveats: supports exceptions, OS, and caches; very similar but tweaked ISA
Summary • We span the design space • We remain competitive: achieved 9% faster and 11% smaller than NiosIIs => we don’t suffer from prohibitive overhead • Let’s explore some architecture!
Architectural Axes • Hardware vs Software Multiplication • Shifter implementation • Pipeline: depth, organization, forwarding
Hardware vs Software Multiplication • Hardware multiplication • Increases area & power consumption • Speeds up execution • BUT … • Not all applications care about speed • Not all applications use multiplication (significantly)
Cycle Count Speedup of Hardware Multiplication • Must understand its cost/benefit to decide when to use it [Figure: cycle count speedup of hardware multiplication]
Cost of Hardware Multiply • ~250 LEs (20%) • 35% more Energy/cycle
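The 35% energy/cycle figure implies a simple break-even point. Assuming total dynamic energy = energy/cycle × cycle count, and assuming the 35% increase applies uniformly to every cycle (both are simplifications made here, not claims from the talk), hardware multiply lowers total energy only when it cuts the cycle count by a factor of more than 1.35. The function name is hypothetical.

```python
ENERGY_PER_CYCLE_RATIO = 1.35  # from the slide: 35% more energy/cycle


def relative_energy(cycle_speedup, energy_ratio=ENERGY_PER_CYCLE_RATIO):
    """Total energy with hardware multiply divided by total energy without.

    cycle_speedup -- cycles without hardware multiply / cycles with it
    Assumes total energy = energy/cycle * cycles (a simplification).
    """
    if cycle_speedup <= 0:
        raise ValueError("cycle_speedup must be positive")
    return energy_ratio / cycle_speedup


for speedup in (1.0, 1.35, 2.0):
    print(f"cycle speedup {speedup}: relative energy {relative_energy(speedup):.3f}")
```

Under these assumptions, an application that barely multiplies (speedup near 1.0) pays the full 35% energy penalty for no benefit, which matches the point that not every application wants the hardware multiplier.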
Shifter Implementations • Shifters (multiplexers) are big in FPGAs • Consider 3 implementations: • Serial shifter • LUT-based barrel shifter • Multiplier-based barrel shifter
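The three shifter styles can be modeled behaviorally on 32-bit values. The slides do not describe the implementations in detail, so this sketch shows common textbook forms and should be read as an assumption: a serial shifter moves one bit per cycle, a barrel shifter uses log2(32) = 5 multiplexer stages, and a multiplier-based shifter multiplies by a power of two. All function names are hypothetical.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath


def serial_shift_left(x, n):
    """Shift one bit per cycle; returns (result, cycles used)."""
    cycles = 0
    for _ in range(n):
        x = (x << 1) & MASK
        cycles += 1
    return x, cycles


def barrel_shift_left(x, n):
    """Five mux stages; stage i shifts by 2**i when bit i of n is set."""
    for stage in range(5):
        if (n >> stage) & 1:
            x = (x << (1 << stage)) & MASK
    return x


def mul_shift_left(x, n):
    """Left shift as a multiply by 2**n, keeping the low 32 bits."""
    return (x * (1 << n)) & MASK


def mul_shift_right(x, n):
    """Logical right shift as the high 32 bits of x * 2**(32 - n).

    n == 0 is passed through, since 2**32 would not fit a 32-bit operand.
    """
    if n == 0:
        return x
    return ((x * (1 << (32 - n))) >> 32) & MASK


print(hex(barrel_shift_left(0xDEADBEEF, 8)), hex(mul_shift_right(0xDEADBEEF, 8)))
```

The model makes the tradeoff visible: the serial version costs up to 31 cycles but almost no logic, the barrel version costs five levels of wide multiplexers, and the multiplier version reuses a multiplier block instead of LUTs.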
Impact of Shifter Implementation • Consistent across different pipe depths
Shifter Implementation Tradeoffs • Averaged over all pipeline depths • Smallest: Serial • Fastest: LUT-based barrel • Energy efficient: Serial • Multiplier-based is a very nice sweet spot
Pipelines - Depth • Study different pipeline depths • Over 3 shifters • Arrows = possible forwarding lines (not used) • All use predict not-taken branches
Impact of Pipelining • Adds area, can increase speed (2 to 3 stage?)
2-stage Pipeline • FPGA nuance: synchronous RAMs • Stall on all loads, and any operand fetches [Figure: 2-stage pipeline with Ifetch, Regfile, Mul, ALU, Data Mem, Write Back]
3-stage Pipeline • Fewer stalls, increased frequency => big speedup (1.7x) [Figure: 3-stage pipeline with Ifetch, Regfile, Mul, ALU, Data Mem, Write Back]
3, 4 and 5 stage pipelines • Increased area, small change in performance => Deeper pipelines have potential for better speedups
The 7-stage Pipeline: Where Branch Delay Slots Break Down • The ideal case [Figure: pipeline holding … OR, JR, ADD, BEQ; two stages squashed (X), one stage marked “Never squash this stage”]
Problem: Separation of Branch and Branch Delay Slot [Figure: pipeline holding … JR, ADD, BEQ, with a stall on a RAW hazard]
Problem: Separation of Branch and Branch Delay Slot • Must track and protect delay slots [Figure: pipeline holding … JR, ADD, NOP, BEQ, with one stage squashed (X)]
Multiple Delay Slots • Must detect separation of branch from delay slot • OR prevent multiple delay slots: stall the branch if a delay slot exists in the pipe (we did this one: +30 LEs, -15% clock frequency) • Can’t guard all delay slots • Better off eliminating delay slots (currently researching) [Figure: pipeline holding … OR, JR, ADD, BEQ]
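The rule "stall the branch if a delay slot exists in the pipe" can be sketched as a one-line predicate. The record layout (a dict with an `is_delay_slot` flag) and the function name are hypothetical; the slides do not show how the actual control logic tracks delay slots.

```python
def must_stall_branch(in_flight):
    """Return True if a new branch must wait.

    in_flight -- instructions already in the pipe, each a dict with a
                 boolean 'is_delay_slot' flag (hypothetical bookkeeping).
    A branch is held back while any earlier delay slot is still in flight,
    so at most one delay slot ever needs protecting.
    """
    return any(instr["is_delay_slot"] for instr in in_flight)


# Sequence from the slides: BEQ, ADD (delay slot of BEQ), then JR.
pipe = [
    {"op": "BEQ", "is_delay_slot": False},
    {"op": "ADD", "is_delay_slot": True},
]
print(must_stall_branch(pipe))  # JR waits while ADD is still in the pipe
```

The predicate is cheap in logic but it serializes back-to-back branches, which is one plausible reason for the cost the slide reports.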
Pipeline organization • Where stages are placed is important • Pipe stage placement can • Result in all around “win/loss” • Present a tradeoff
Forwarding • SPREE supports stage-to-stage forwarding [Figure: datapath (Ifetch, Reg File, Mul, ALU, Data Mem, Write Back) with forward lines for rs and rt]
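A forward line is essentially a multiplexer in front of an operand: if the instruction one stage ahead is about to write the register being read, its result is used in place of the stale register file value. The sketch below is a generic textbook model, not SPREE's generated logic, and the function and parameter names are hypothetical. It relies on the MIPS convention that register 0 always reads as zero.

```python
def select_operand(src_reg, regfile_value, fwd_dest, fwd_value, fwd_valid=True):
    """Choose between the register file value and a forwarded result.

    src_reg       -- register number this operand reads (rs or rt)
    regfile_value -- value read from the register file (possibly stale)
    fwd_dest      -- destination register of the instruction one stage ahead
    fwd_value     -- result that instruction is about to write back
    fwd_valid     -- False if that instruction writes no register
    Register 0 is hard-wired to zero in MIPS, so it is never forwarded.
    """
    if fwd_valid and src_reg != 0 and fwd_dest == src_reg:
        return fwd_value
    return regfile_value


# add r3, r1, r2 followed by sub r5, r3, r4: the rs line forwards r3.
rs_value = select_operand(3, regfile_value=111, fwd_dest=3, fwd_value=42)
rt_value = select_operand(4, regfile_value=7, fwd_dest=3, fwd_value=42)
print(rs_value, rt_value)
```

Separate rs and rt lines mean each operand gets its own multiplexer, so a design can include one, both, or neither and pay only for what it uses.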
Effect of Forwarding • 20% speed increase
An Aside: ISA Subsetting • Applications don’t generally use all instructions
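A first step toward subsetting is finding which supported instructions an application never executes. The sketch below assumes an instruction trace is available as a list of mnemonics. The supported set shown is a small hypothetical sample, not the actual MIPS-I subset used in the talk, and the function name is also hypothetical.

```python
# Hypothetical sample of supported instructions (not the talk's actual subset).
SUPPORTED = {"add", "sub", "and", "or", "sll", "srl", "mult", "lw", "sw", "beq"}


def unused_instructions(supported, trace):
    """Supported instructions that never appear in the executed trace."""
    return supported - set(trace)


# Hypothetical trace from an application that never multiplies or shifts.
trace = ["lw", "add", "add", "sw", "beq", "lw", "sub", "and", "or", "beq"]
print(sorted(unused_instructions(SUPPORTED, trace)))
```

Instructions in the unused set are candidates for removal from that application's processor, along with any hardware that only they need.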