
The Microarchitecture of FPGA-Based Soft Processors

Peter Yiannacouras, CARG, June 14, 2005


  1. The Microarchitecture of FPGA-Based Soft Processors Peter Yiannacouras CARG - June 14, 2005

  2. FPGA vs ASIC Flows • Reduced cost for low volume • Reduced time-to-market • Programmability affords customization • Designers use FPGAs! [Diagram: ASIC flow and FPGA flow, each starting from Circuit Design]

  3. Processors and FPGAs • Option 1: Off-chip processor: increased board area, cost, and latency • Option 2: On-chip “hard” processor: specialized part, lack of flexibility • Option 3: On-chip “soft” processor: can implement any number of processors; tune each one to meet design constraints [Diagrams: FPGA custom logic paired with a processor in each of the three arrangements]

  4. Tuning Processors • Input: application, design constraints • Tuning processors: $3, 4 MHz, 800 mW, 2-stage pipeline, or $300, 3.8 GHz, 80 W, 31-stage pipeline • Tuning soft processors: 500 LEs, 40 MHz, 2-stage pipeline, or 1700 LEs, 160 MHz, 6-stage pipeline, or your area, speed, power tradeoff • Automatically tuning soft processors

  5. Understanding Soft Processors • Tuning requires understanding of the soft processor design space • We implement many processors and study the design space [Diagram: Architecture Description → Synthesized Processor → Area, Performance, Power]

  6. Don’t we already understand architecture? • Not completely: we can evaluate area, power, performance • Not accurately (rules of thumb): FPGA CAD tools are very accurate • Not in the FPGA domain: LUTs vs transistors; relative speed of RAM and multipliers

  7. Goals • Develop measurement methodology • Populate the design space • Compare against industrial soft processor(s)

  8. Measurement Methodology • Require a set of metrics: Area, Performance, Power [Diagram: Circuit Design (RTL) → FPGA Flow → Resource Usage, Clock Frequency, Power Estimate]

  9. Area • Measure physical area in equivalent LEs (e.g. a 9-bit multiplier is equivalent to 23 LEs in area) • Resources counted: Logic Elements (LEs: LUT and flip-flop), Little RAM, Medium RAM, Big RAM, Multipliers
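A minimal sketch of the equivalent-LE metric from slide 9, assuming area is a weighted sum of resource counts. Only the 23-LE weight for a 9-bit multiplier comes from the slide; the function name, the resource keys, and the usage counts are hypothetical, and RAM weights must be supplied by the caller because the slides do not state them.

```python
def equivalent_les(usage, weights):
    """Return total area in equivalent LEs.

    usage:   resource name -> number of instances used
    weights: resource name -> equivalent LEs per instance
    """
    missing = set(usage) - set(weights)
    if missing:
        # Refuse to guess a weight the caller did not provide.
        raise KeyError(f"no equivalent-LE weight for: {sorted(missing)}")
    return sum(weights[name] * count for name, count in usage.items())


# From the slide: one 9-bit multiplier is equivalent to 23 LEs. An LE counts as 1.
WEIGHTS = {"le": 1, "mult9": 23}

# Hypothetical resource report (made-up counts, for illustration only).
usage = {"le": 1000, "mult9": 4}
print(equivalent_les(usage, WEIGHTS))  # 1000 + 4 * 23 = 1092
```

Raising on a missing weight, rather than defaulting to zero, keeps a RAM block from silently dropping out of the area total.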

  10. Performance • Wall Clock Time = #Cycles * Clock Period • #Cycles: from RTL simulation, averaged over 20 benchmarks • Clock Period: from CAD tool • Benchmarks by source: • MiBench: bitcnts, CRC32, sha, stringsearch, FFT, dijkstra, patricia • Freescale: Dhrystone 2.1 • Xirisc: bubble_sort, crc, fft, fir, des, quant, iquant, turbo, vlc • RATEs: dct, gol
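The performance formula on slide 10 can be sketched as follows. The function name and the two example processors are hypothetical; the slides do not give cycle counts, so the numbers below are made up to show why cycle count and clock frequency must be combined.

```python
def wall_clock_time_us(cycles, clock_mhz):
    """Wall clock time in microseconds = #cycles * clock period.

    With the clock in MHz, the period is 1 / clock_mhz microseconds.
    """
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_mhz


# Hypothetical designs: the deeper pipeline needs more cycles (stalls)
# but runs at a higher clock frequency.
shallow = wall_clock_time_us(cycles=1_000_000, clock_mhz=40)
deep = wall_clock_time_us(cycles=1_300_000, clock_mhz=160)
print(shallow, deep)
```

Comparing cycle counts alone would favour the shallow design here, while wall clock time favours the deep one.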

  11. Power • Dynamic Energy per cycle, excluding I/O (nJ/cycle) = Total Dynamic Power (mW) ÷ Clock Frequency (MHz) • CAD tool can estimate power from assumed toggle ratio (derived experimentally)
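A short sketch of the energy-per-cycle metric from slide 11, with the unit check spelled out. The function name and the example figures are hypothetical; only the formula itself comes from the slide.

```python
def energy_per_cycle_nj(dynamic_power_mw, clock_mhz):
    """Dynamic energy per cycle in nJ.

    Units: mW / MHz = (1e-3 J/s) / (1e6 cycles/s) = 1e-9 J/cycle = nJ/cycle,
    so no extra scale factor is needed.
    """
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return dynamic_power_mw / clock_mhz


# Made-up example: 80 mW of dynamic power at 40 MHz.
print(energy_per_cycle_nj(80.0, 40.0))  # 2.0
```

Dividing by clock frequency removes the dependence on clock speed, so designs running at different frequencies can be compared per cycle of work.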

  12. Metrics summary • Require the following information • Resource Usage (area – CAD Tool) • Clock Frequency (wall clock time – CAD Tool) • Power Estimate (energy/cycle – CAD Tool) • Cycle Count (wall clock time – RTL Simulator)

  13. RTL-based Design Space Exploration • Complete and accurate understanding of design space [Diagram: Circuit Design (RTL) and Benchmarks feed the RTL Simulator (1. Correctness, 2. Cycle Count); Circuit Design (RTL) feeds the CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]

  14. Goals • Develop measurement methodology • Populate the design space • Compare against industrial soft processor(s)

  15. Microarchitectural Design Space Exploration • Need fast route to RTL from architectural idea [Diagram: Circuit Design (RTL) and Benchmarks feed the RTL Simulator (1. Correctness, 2. Cycle Count); Circuit Design (RTL) feeds the CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]

  16. SPREE (Soft Processor Rapid Exploration Environment) [Diagram: SPREE RTL Generator output and Benchmarks feed the RTL Simulator (1. Correctness, 2. Cycle Count); SPREE RTL Generator output feeds the CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]

  17. Goals • Develop measurement methodology • Populate the design space (SPREE): rapidly; with interesting designs; accurately (minimize overhead) • Compare against industrial soft processor(s)

  18. Related Work • Parametrized Cores: narrow design space, laborious changes to control • Architecture Description Languages (ADLs): too robust, inaccurate (simulator based, or behavioural RTL) • PEAS-III/ASIPMeister [Itoh2000]: not FPGA-specific, ISA design focus

  19. SPREE RTL Generator Overview • Rapidly: simple descriptions • Interesting: allows for interesting architectures • Accurately: efficient component implementations [Diagram: ISA Description, Datapath Description, and Component Library → SPREE RTL Generator → Efficiently Synthesizable RTL]

  20. Some current limitations • No caches (use fast on-chip RAM) • Simple in-order issue pipelines • No dynamic branch prediction • No OS or exceptions support • No ISA changes! • Need compiler generation to support • Use subset of MIPS-I

  21. Architecture Input: Component Library [Diagram: library of components: Ifetch, Reg File, ALU, Mul, Data Mem, Write Back]

  22. Architecture Input: Datapath Description [Diagram: components from the Component Library (Ifetch, Regfile, ALU, Mul, Data Mem, Write Back) connected into a datapath]

  23. Architecture Input • Control generation saves time and is non-critical [Diagram: ISA Description, Datapath Description, and Component Library feed the SPREE RTL Generator; the resulting datapath shows IF, Decode, Reg file, ALU, Mul, Data Mem, Write Back]

  24. Architecture Input: ISA Description • Generic Operations (GENOPs) • MIPS instructions made of GENOPs • Example: MIPS ADD (add rd, rs, rt) [Diagram: GENOP graph FETCH → RFREAD, RFREAD → ADD → RFWRITE]
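One way to picture the GENOP decomposition on slide 24 is as a small dependency graph. The encoding below is a hypothetical illustration, not SPREE's actual ISA description format: the node names with `_rs`, `_rt`, and `_rd` suffixes and the `schedule` helper are assumptions, and the graph assumes the two register reads both follow the fetch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical encoding of "add rd, rs, rt": each GENOP maps to the
# set of GENOPs whose results it depends on.
MIPS_ADD = {
    "FETCH": set(),
    "RFREAD_rs": {"FETCH"},
    "RFREAD_rt": {"FETCH"},
    "ADD": {"RFREAD_rs", "RFREAD_rt"},
    "RFWRITE_rd": {"ADD"},
}


def schedule(genops):
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(genops).static_order())


order = schedule(MIPS_ADD)
print(order)  # FETCH comes first, RFWRITE_rd last; the two reads may swap
```

Describing instructions as a graph rather than a fixed sequence leaves the generator free to place each operation in whatever pipeline stage the datapath provides.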

  25. Complete Experimental Framework Using SPREE [Diagram: ISA Description (FIXED), Datapath Description, and Component Library feed the SPREE RTL Generator; its RTL and the Benchmarks feed the RTL Simulator (1. Correctness, 2. Cycle Count); its RTL feeds the CAD Tool (3. Area, 4. Clock Frequency, 5. Power)]

  26. Goals • Develop measurement methodology • Populate the design space • Compare against industrial soft processor(s) [Diagram: SPREE, with Area, Performance, Power]

  27. Altera’s NiosII • Second generation soft processor • Has three variations: • NiosIIe – unpipelined, no hardware multiply • NiosIIs – 5-stages, no branch prediction • NiosIIf – 6-stages, dynamic branch prediction • Caveats • Supports exceptions, OS, and caches • Very similar but tweaked ISA

  28. Design Space vs NiosII Variations

  29. Summary • We span the design space • Remain competitive: achieved 9% faster and 11% smaller than NiosIIs => don’t suffer from prohibitive overhead • Let’s explore some architecture!

  30. Architectural Axes • Hardware vs Software Multiplication • Shifter implementation • Pipeline: depth, organization, forwarding

  31. Hardware vs Software Multiplication • Hardware multiplication • Increases area & power consumption • Speeds up execution • BUT … • Not all applications care about speed • Not all applications use multiplication (significantly)

  32. Cycle Count Speedup of Hardware Multiplication • Must understand its cost/benefit to decide when to use it

  33. Cost of Hardware Multiply • ~250 LEs (20%) • 35% more Energy/cycle

  34. Shifter Implementations • Shifters (multiplexers) are big in FPGAs • Consider 3 implementations: • Serial shifter • LUT-based barrel shifter • Multiplier-based barrel shifter
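The difference between a serial shifter and a multiplier-based shifter can be sketched in software. This is a behavioural illustration under assumptions, not SPREE's RTL: it assumes a 32-bit datapath, models the serial shifter as one bit position per cycle, and uses the standard identity that a left shift by n is a multiplication by 2^n (with a logical right shift taken from the upper half of the 64-bit product). All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def serial_shift_left(x, n):
    """Shift one bit per iteration; returns (result, cycles used)."""
    cycles = 0
    for _ in range(n):
        x = (x << 1) & MASK32
        cycles += 1
    return x, cycles


def mul_shift_left(x, n):
    """Left shift as a multiply by 2**n, keeping the low 32 bits."""
    return (x * (1 << n)) & MASK32


def mul_shift_right(x, n):
    """Logical right shift: multiply by 2**(32 - n), keep the high 32 bits."""
    return ((x * (1 << (32 - n))) >> 32) & MASK32


x = 0x80000001
print(serial_shift_left(x, 4))     # result plus a cycle count that grows with n
print(hex(mul_shift_left(x, 4)))   # single multiply, independent of n
print(hex(mul_shift_right(x, 4)))
```

The serial version needs almost no logic but takes time proportional to the shift amount, while the multiplier version reuses a hard multiplier block to finish in one operation.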

  35. Impact of Shifter Implementation • Consistent across different pipe depths

  36. Shifter Implementation Tradeoffs • Averaged over all pipeline depths • Smallest: Serial • Fastest: LUT-based barrel • Most energy efficient: Serial • The multiplier-based shifter is a very nice sweet spot

  37. Pipelines - Depth • Study different pipeline depths • Over 3 shifters • Arrows = possible forwarding lines (not used) • All use predict not-taken branches

  38. Pipelining & clock frequency

  39. Impact of Pipelining • Adds area, can increase speed (2 to 3 stage?)

  40. 2-stage Pipeline • FPGA nuance: synchronous RAMs • Stall on all loads, and any operand fetches [Diagram: Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]

  41. 3-stage Pipeline • Fewer stalls, increased frequency => big speedup (1.7x) [Diagram: Ifetch, Regfile, ALU, Mul, Data Mem, Write Back]

  42. 3, 4 and 5 stage pipelines • Increased area, small change in performance => Deeper pipelines have potential for better speedups

  43. The 7-stage Pipeline: Where Branch Delay Slots Break Down • The ideal case: … OR JR ADD BEQ • Never squash this stage [Diagram: two pipeline stages marked X]

  44. Problem: Separation of Branch and Branch Delay Slot • Stalls on RAW hazard [Pipeline contents: … JR ADD BEQ]

  45. Problem: Separation of Branch and Branch Delay Slot • Must track and protect delay slots [Pipeline contents: … JR ADD NOP BEQ, with one stage marked X]

  46. Multiple Delay Slots • Must detect separation of branch from delay slot • OR prevent multiple delay slots: stall branch if a delay slot exists in the pipe (we did this one: +30 LEs, -15% clock frequency) • Can’t guard all delay slots • Better off eliminating delay slots (currently researching) [Pipeline contents: … OR JR ADD BEQ]

  47. Pipeline organization • Where stages are placed is important • Pipe stage placement can • Result in all around “win/loss” • Present a tradeoff

  48. Forwarding • SPREE supports stage-to-stage forwarding [Diagram: Ifetch, Reg File, ALU, Mul, Data Mem, Write Back, with forward lines for rs and rt]
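Why forwarding reduces stalls can be shown with a toy in-order model. This is an illustrative sketch only, not a model of any SPREE pipeline: the `total_stalls` function, the three-instruction program, and the result delays (1 cycle with forwarding, 3 without) are all assumptions chosen for the example.

```python
def total_stalls(program, result_delay):
    """Count stall cycles for an in-order, single-issue pipeline.

    program:      list of (dest_register or None, tuple of source registers)
    result_delay: cycles after issue before a result can be read by a
                  later instruction (smaller when forwarding is present)
    """
    ready = {}   # register -> first cycle its new value can be read
    cycle = -1   # issue cycle of the previous instruction
    for dest, srcs in program:
        earliest = cycle + 1
        operands_ready = max((ready.get(s, 0) for s in srcs), default=0)
        cycle = max(earliest, operands_ready)  # wait for RAW hazards
        if dest is not None:
            ready[dest] = cycle + result_delay
    # Without hazards the last instruction issues at cycle len(program) - 1.
    return cycle - (len(program) - 1) if program else 0


# add r1, r2, r3 ; add r4, r1, r5 (depends on r1) ; add r6, r7, r8
program = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")), ("r6", ("r7", "r8"))]
print(total_stalls(program, result_delay=1))  # with forwarding (assumed delay)
print(total_stalls(program, result_delay=3))  # without forwarding (assumed delay)
```

The forward lines for rs and rt shorten the effective result delay for exactly the two source operands that an instruction reads.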

  49. Effect of Forwarding • 20% speed increase

  50. An Aside: ISA Subsetting • Applications don’t generally use all instructions
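The observation behind ISA subsetting can be sketched as a simple profile of which instructions an application uses. The function name, the supported set (a small sample of MIPS-I mnemonics, not SPREE's actual subset), and the trace are hypothetical.

```python
from collections import Counter


def profile_isa_usage(trace, supported):
    """Split the supported instruction set into used and unused parts.

    trace:     iterable of instruction mnemonics executed by an application
    supported: set of mnemonics the processor implements
    """
    counts = Counter(trace)
    unknown = set(counts) - set(supported)
    if unknown:
        raise ValueError(f"trace uses unsupported instructions: {sorted(unknown)}")
    used = set(counts)
    unused = set(supported) - used
    return used, unused, counts


# Illustrative sample of MIPS-I mnemonics and a made-up trace.
SUPPORTED = {"add", "addu", "sub", "lw", "sw", "beq", "mult", "sll"}
trace = ["lw", "addu", "addu", "sw", "beq"]

used, unused, counts = profile_isa_usage(trace, SUPPORTED)
print(sorted(unused))  # candidates for removal from the hardware
```

Instructions in the unused set are the ones whose hardware could be dropped for that application.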
