
Custom Code Generation for Soft Processors

Martin Labrecque, Peter Yiannacouras, Gregory Steffan. ECE Dept., University of Toronto. Presented at RAAW 2006, Orlando, FL.



  1. Custom Code Generation for Soft Processors. Martin Labrecque, Peter Yiannacouras, Gregory Steffan. ECE Dept., University of Toronto. Presented at RAAW 2006, Orlando, FL

  2. Soft Processor: Processor in FPGA
     • Compelling solution: software programmable
     • Soft processors are end-user customizable
     • Different application realm than hard ASIC processors
     • Can add more features: trade area for performance
     • Well-known approach: add custom instructions (e.g., A*B+C)
     • Techniques orthogonal to custom instructions
     [Diagram labels: FPGA, Processor, Programmable Logic]

  3. Application-Specific Code Generation
     • Use default gcc, ISA?
     • Interested in app-specific optimizations
     • Customized for: Area, Power, Wallclock time, Freq. requirements
     [Diagram labels: Application, Compiler, Processor]

  4. Infrastructure

  5. SPREE System (Soft Processor Rapid Exploration Environment)
     • Input: Processor description (ISA, Datapath)
         • Made of hand-coded components
     • SPREE:
         • Verify ISA against datapath
         • Datapath instantiation
         • Control generation: multi-cycle/variable-cycle FUs, multiplexer select signals, interlocking, branch handling
     • Output: Synthesizable Verilog (RTL) [CASES 05, FPGA 06]

  6. Back-End Infrastructure
     • Inputs: RTL; 20 benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
     • Tools: Modelsim RTL Simulator; Quartus II 5.0 CAD software
     • Target device: Stratix 1S40C5
     • Measured: 1. Cycle count 2. Area 3. Clock frequency 4. Power
     • We can measure area/performance/energy accurately

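  7. Area efficiency
     • A combined metric: MIPS per 1000 LEs
     $$\text{Area efficiency} = \frac{\text{MIPS}}{1000\ \text{LEs}} = \frac{\#\text{Million Instr.}}{\text{WallclockTime} \times \text{Area}} = \frac{\#\text{Million Instr.} \times \text{Frequency}}{\#\text{Cycles} \times \text{Area}}$$
     • 4 criteria trading-off (power not included)
     • Want app-specific (average) improvement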
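The area-efficiency metric on this slide can be sketched in a few lines. This is only an illustration: the function name, parameter names, and units (frequency in MHz, area in LEs) are assumptions, not part of the presentation.

```python
def area_efficiency(million_instr, cycles, freq_mhz, area_les):
    """Return MIPS per 1000 LEs (hypothetical helper, not from the talk).

    million_instr: dynamic instruction count, in millions
    cycles:        total cycle count of the run
    freq_mhz:      clock frequency in MHz
    area_les:      processor area in logic elements (LEs)
    """
    wallclock_s = cycles / (freq_mhz * 1e6)   # wall-clock time in seconds
    mips = million_instr / wallclock_s        # million instructions per second
    return mips / (area_les / 1000.0)         # normalize to 1000 LEs


if __name__ == "__main__":
    # Made-up numbers purely to exercise the formula.
    print(area_efficiency(million_instr=10, cycles=20_000_000,
                          freq_mhz=50, area_les=1000))
```

Because the metric divides by both wall-clock time and area, a change that slows the clock can still win if it removes enough LEs.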

  8. Representative Processors
     • Stages: F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
     • Serial: F/D/R/EX/WB
     • Pipe3: F/D | R/EX/M | WB
     • Pipe5: F | D | R/EX1 | EX2/M | WB
     • Pipe7: F | D | R | EX1 | EX2/M | EX3/WB1 | WB2
     • Range: <900 LEs, <70 MHz to >1500 LEs, >100 MHz

  9. SPREE vs Nios II [Chart: Serial, Pipe3, Pipe5, and Pipe7 plotted against Nios II; axis directions labeled "faster" and "smaller"]

  10. Code Generation Options Studied (Outline)
      • Low-level hardware-software tradeoffs
          • Reducing hardware shift support
          • Removing hazard detection logic
      • Impact of unique ISA features
          • Removing delay slots
          • Hi/Lo registers vs 3-operand multiplies
          • Using unaligned memory loads and stores
      • Application-specific register management
          • Operand scheduling and forwarding lines
          • Limiting the use of architected registers
      • Combining these into app-specific optimizations

  11. Reducing Hardware Shift Support
      • Best performance per area: using hard multiplier for shifting
      • Multiplications and shifts: both in software?
      • Software shifting using additions & subtractions
      • Impact of removing the dedicated LUT-based shifter?
          • Costs ~250 LEs, 30% of smallest soft processor
      • Can we have partial hardware support for shifting?

  12. Area for Various Shift Strategies (Pipe3) [Chart; labeled values: 343 LEs, 48 LEs]
      • 2 fixed-amount shifters are cheap!

  13. Dynamic Instructions Containing Shifts [Chart; y-axis: Percentage]
      • Less than 2% of shift amounts are variable
      • Some benchmarks have very few shifts

  14. How to get rid of the shifter
      • Software-only shifts require an order of magnitude more cycles to compute
      • Measure the cost in cycles for each shift operation
      • Replace shifts by hard shifts and/or software shifts
      • Evaluate cost in cycles for all combinations of shifters available
      [Diagram: Srl 16 replaced by Srl 8, Srl 8; or Srl 4, Srl 4, Srl 4, Srl 4; or Srl 3, Srl 3, Srl 3, Srl 3, Srl 3, Shift_left(1) ...]
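One way to picture the cost evaluation on this slide is a small dynamic program that finds the cheapest way to build a shift amount out of the available fixed-amount hardware shifters plus a 1-bit software shift. This is a sketch under stated assumptions: the function name and the cycle costs (1 cycle per hardware shift, 10 cycles per software bit) are placeholders, every step shifts in the same direction, and it is not the authors' actual tool.

```python
import math


def min_shift_cycles(amount, fixed_shifters, hw_cycles=1, sw_cycles_per_bit=10):
    """Cheapest cycle count to shift by `amount` bits (hypothetical cost model).

    fixed_shifters:    shift amounts available as fixed hardware shifters
    hw_cycles:         assumed cost of one fixed hardware shift
    sw_cycles_per_bit: assumed cost of a 1-bit software shift
    """
    # Cost of each primitive step; a 1-bit software shift is always available.
    steps = {1: sw_cycles_per_bit}
    for s in fixed_shifters:
        steps[s] = min(steps.get(s, hw_cycles), hw_cycles)

    # best[n] = minimum cycles to shift by exactly n bits.
    best = [0] + [math.inf] * amount
    for n in range(1, amount + 1):
        for size, cost in steps.items():
            if size <= n:
                best[n] = min(best[n], best[n - size] + cost)
    return best[amount]


if __name__ == "__main__":
    for shifters in ([], [8], [4], [3]):
        print(shifters, min_shift_cycles(16, shifters))
```

Running the same function over every candidate set of shifters, weighted by how often each shift amount occurs in an application, would give a per-application cycle cost to weigh against the area of each shifter.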

  15. Impact of up to 2 Fixed-Amount Shifters (Pipe3) [Chart; y-axis: Area efficiency (MIPS/1000 LEs)]
      • Can improve area efficiency by up to 65%
      • Beneficial for certain applications only

  16. Removing Delay Slots
      • Default MIPS has branch and load delay slots
      • Under what conditions are they worth it?
      • Load delay slots need no additional hardware support
          • Because of hazard detection in the processor
      • Branch delay slots require hardware support
          • We only have predict-not-taken so far
          • Are working on better branch prediction
      [Diagrams: pipeline timing (F/D, R/EX/M, WB over time). Branch/jump followed by an instruction in the delay slot: PC hazard avoided. Load followed by an instruction in the delay slot: load hazard avoided.]

  17. Removing Load Delay Slots (Serial) [Chart; y-axis: Normalized Wall-Clock Time]
      • 3% better performance for Serial, 2% for Pipe3

  18. Removing Branch Delay Slots
      • Pipe3: 7% performance hit
      • Pipe7 improvements: 13% freq, 8% performance

  19. 3-Operand Multiplies vs Hi/Lo Registers
      • Default MIPS has Hi/Lo registers
          • Motivated by multi-cycle multiplies
          • Hold multiplication results (Hi and Lo each 32 bits)
          • Two special instructions to access Hi/Lo
      • Which to choose?
      • 3-operand multiplies (NIOS2 and Microblaze)
          • Two instructions compute high and low parts
          • Result is stored in register file
      [Diagram labels: Hi/Lo, Register file, Multiplier, MUX]
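The two multiply styles compared on this slide can be contrasted with a small behavioral model. It is only a sketch, restricted to unsigned 32-bit operands: `multu`, `mfhi`, and `mflo` follow the MIPS instruction names, while the class name and the `mul_lo`/`mul_hi` helpers are hypothetical stand-ins for the 3-operand form.

```python
MASK32 = 0xFFFFFFFF


class HiLoMultiplier:
    """Hi/Lo style: one multiply writes two special registers."""

    def __init__(self):
        self.hi = 0
        self.lo = 0

    def multu(self, a, b):
        # The 64-bit product is split across Hi (upper 32) and Lo (lower 32).
        product = (a & MASK32) * (b & MASK32)
        self.hi = product >> 32
        self.lo = product & MASK32

    def mfhi(self):
        return self.hi  # separate instruction to read Hi

    def mflo(self):
        return self.lo  # separate instruction to read Lo


def mul_lo(a, b):
    """3-operand style: low 32 bits go straight to a destination register."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mul_hi(a, b):
    """3-operand style: a second instruction produces the high 32 bits."""
    return ((a & MASK32) * (b & MASK32)) >> 32


if __name__ == "__main__":
    m = HiLoMultiplier()
    m.multu(0xFFFFFFFF, 2)
    print(hex(m.mfhi()), hex(m.mflo()))
    print(hex(mul_hi(0xFFFFFFFF, 2)), hex(mul_lo(0xFFFFFFFF, 2)))
```

In this model, code that needs only the low word takes two steps in the Hi/Lo style (`multu`, then `mflo`) and one in the 3-operand style.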

  20. Impact of 3-Operand Multiplies [Chart; y-axis: Normalized Value]
      • 8% slower clock
      • Saves area, reduces frequency, increases power

  21. Impact of 3-Operand Multiplies Only pipe3 benefits from cycle savings

  22. Forwarding Lines and Code Generation
      • Necessary to forward both operands (A and B)?
      • Compiler can reorder commutative operands of instrs
      • Can compiler compensate when only one forwarding line?
      • Save ~30 LEs for fwding line and incur more stall cycles?
      • Problem cases:
          • Simultaneous dependences: r3 = r1 + r2; r4 = r3 + r3
          • Non-commutative operations: r3 = r1 + r2; r4 = r5 - r3 versus r3 = r1 + r2; r4 = r3 - r5
      • Results:
          • Added 1-2% cycle improvement with 1 fwding line
          • 3-4% short of 2 fwding lines' performance
          • For 30% of apps, 1 fwding line is more area efficient
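The operand reordering discussed on this slide can be sketched as a tiny compiler pass. The assumptions are loud ones: the single forwarding line is taken to feed operand A only, a hazard is modeled only against the immediately preceding instruction, and the tuple encoding, the `COMMUTATIVE` set, and the function name are all hypothetical.

```python
# Operations whose operands may be swapped without changing the result.
COMMUTATIVE = {"add", "and", "or", "xor", "mul"}


def schedule_operands(instrs):
    """Swap commutative operands so a forwarded value lands on operand A.

    instrs: list of (dest, op, src_a, src_b) tuples.
    Returns (rewritten instruction list, number of stalls still needed).
    """
    out = []
    stalls = 0
    prev_dest = None
    for dest, op, a, b in instrs:
        if prev_dest is not None and b == prev_dest:
            if a != prev_dest and op in COMMUTATIVE:
                a, b = b, a      # move the dependent value onto the forwarded port
            else:
                stalls += 1      # simultaneous dependence or non-commutative op
        out.append((dest, op, a, b))
        prev_dest = dest
    return out, stalls


if __name__ == "__main__":
    cases = {
        "commutative, fixable": [("r3", "add", "r1", "r2"), ("r4", "add", "r5", "r3")],
        "simultaneous":         [("r3", "add", "r1", "r2"), ("r4", "add", "r3", "r3")],
        "non-commutative":      [("r3", "add", "r1", "r2"), ("r4", "sub", "r5", "r3")],
    }
    for name, prog in cases.items():
        print(name, schedule_operands(prog))
```

The two cases the pass cannot fix are exactly the slide's problem cases: both operands depending on the previous result, and a dependent second operand of a non-commutative operation.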

  23. Soft Processor Customization Techniques
      • Best overall (general purpose) processor
      • Best per application (application-tuned)
      • Reduce processor by reducing the ISA (Subset)
          • SPREE automatically removes unused connections, unused components, and unused parts of the ISA
      • Apply optimization techniques (Opt)

  24. Average Combined Improvements (Pipe3) [Chart; y-axis: Efficiency (MIPS/1000 LEs)]
      • Chart labels: App-Specific +11%; Subsetting +8%; Opts +12%; Subsetting & Opts +25%; 36%
      • Opt: 2 fixed shifts, no dly slots, 3-op mult, op sched
      • Overall 36% improvement in efficiency!

  25. Summary
      • Software-only and custom shifters: App-specific
      • Load delay slots: Useless with hazard detection
      • Branch delay slots: Useful with poor branch prediction
      • 3-operand multiply: Processor-specific
      • Operand scheduling to save a forwarding line: App-specific
      Conclusion
      • 12% area efficiency over app-specific processor
      • 17% area eff. over subsetted app-specific proc.
      • Without adding complexity!

  26. Future Research
      • Integrating branch prediction in SPREE
      • Research on memory hierarchy
      • Automatic selection of app-specific SP features

  27. Thank you

  28. Architectural Parameters Used in SPREE
      • Multiplication support: hardware FU or software routine
      • Shifter implementation: flipflops, multiplier, or LUTs
      • Pipelining: depth (2-7 stages); forwarding lines
      • We focus on core microarchitecture (for now)

  29. Related work
      • Custard [Dimond, Mencer, Luk]: customizable forwarding lines; optional delay slots
      • NIOS II [Altera]: 3-operand multiply; no delay slots
      • Microblaze [Xilinx]: 3-operand multiply; branches with and without delay slots
      • No specific evaluation of studied features in SP

  30. Removing some/all hazard detection logic
      • Can the compiler compensate with scheduling?
          • E.g., worst case use no-ops to ensure correctness
          • Challenge: variable, multi-cycle instructions
      • What is the cost/benefit of doing so?
      [Diagram: Pipe3 timing (F/D, R/EX/M, WB over time) showing a potential hazard and the resulting F/D stall]
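The worst-case fallback mentioned on this slide, padding with no-ops so that correctness does not depend on hazard detection hardware, might look roughly like the pass below. It assumes a fixed, known producer-to-consumer gap for every instruction, so it does not address the variable, multi-cycle instructions the slide names as the challenge. The instruction encoding, `NOP` marker, and function name are hypothetical.

```python
NOP = (None, ())  # hypothetical encoding of a no-op: no destination, no sources


def insert_nops(instrs, gap=1):
    """Pad with no-ops so each consumer is at least `gap` instructions
    after the producer of any register it reads.

    instrs: list of (dest, srcs) pairs, where srcs is a tuple of register names.
    """
    out = []
    last_write = {}  # register name -> index in `out` of its latest producer
    for dest, srcs in instrs:
        # Earliest slot at which this instruction may safely issue.
        earliest = max(
            (last_write[s] + gap + 1 for s in srcs if s in last_write),
            default=0,
        )
        while len(out) < earliest:
            out.append(NOP)
        if dest is not None:
            last_write[dest] = len(out)
        out.append((dest, tuple(srcs)))
    return out


if __name__ == "__main__":
    prog = [("r3", ("r1", "r2")), ("r4", ("r3", "r5"))]
    for instr in insert_nops(prog, gap=1):
        print(instr)
```

A real scheduler would first try to fill those slots with independent instructions and fall back to no-ops only when none are available.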

  31. Impact of Removing Hazard Detection Logic
      • Up to 10% area and 6% frequency gains
