Improving Pipelined Soft Processors with Multithreading

Martin Labrecque, Gregory Steffan
ECE Dept., University of Toronto
Presented at RAAW 2006, Orlando, FL

Processors and FPGAs
• FPGAs increasingly implement SoCs, with CPUs
• Soft processors: processors in the FPGA fabric
• Soft processors are:
  • Easier to program than HDL
  • Customizable
Figure labels: FPGA, Processor, Custom Logic.

Soft processors in Embedded Systems
• What do designers care about?
  • Minimizing area?
  • Matching frequency?
  • Hitting performance target?
• Area efficiency: a combined metric, in MIPS per 1000 LEs
Area efficiency = Performance / Area = (Instr. Count × Frequency) / (Cycle Count × Area)
We trade off 4 criteria: instruction count, frequency, cycle count and area (soft proc. power is related to area)
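The combined metric on this slide can be computed directly from the four quantities it names. The sketch below is illustrative only: the function names and the sample workload numbers are hypothetical, not measurements from the talk.

```python
def mips(instr_count: int, cycle_count: int, freq_mhz: float) -> float:
    """Million instructions per second: IPC times the clock in MHz."""
    return instr_count / cycle_count * freq_mhz


def area_efficiency(instr_count: int, cycle_count: int,
                    freq_mhz: float, area_les: int) -> float:
    """Area efficiency in MIPS per 1000 logic elements (LEs)."""
    return mips(instr_count, cycle_count, freq_mhz) / (area_les / 1000.0)


# Hypothetical processor: 1M instructions in 2M cycles at 80 MHz, 1000 LEs.
print(area_efficiency(1_000_000, 2_000_000, 80.0, 1000))
```

Fewer cycles, a higher clock, or a smaller area each raise the metric, which is why the four criteria have to be traded off together.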

Multithreading
• Replace processor stalls
• Fill them with instructions from other threads
• When to switch thread?
  • Every instruction (e.g. Sun’s Niagara)
  • Convenient technique for in-order processors
Fine-grained multithreading: 1 instr. per thread in round-robin
Metric: (Million Instr. × Frequency) / (# Cycles × Area)

Avoiding processor stall cycles
• Data and control hazards create stall cycles
• Multithreading: execute streams of independent instructions
Figure: timing diagrams of a 3-stage pipeline (F, E, W) over time. BEFORE: traditional execution. AFTER: instructions from Thread1, Thread2 and Thread3 interleaved; ideally, this eliminates all stalls.
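A toy model can make the before/after picture concrete. The sketch below is an assumption-laden simplification, not the processors from the talk: one issue slot per cycle, strict round-robin slots, and each instruction described only by a hypothetical "penalty" (the stall cycles it would need if issued right behind the previous instruction of the same thread).

```python
def run(threads):
    """Cycle count for fine-grained round-robin issue over `threads`.

    Each thread is a list of hazard penalties: entry i is the number of
    stall cycles instruction i needs if issued right after instruction
    i-1 of the same thread (hypothetical model, one slot per cycle).
    """
    n = len(threads)
    pc = [0] * n             # next instruction index per thread
    last_issue = [None] * n  # cycle at which each thread last issued
    cycle = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        t = cycle % n        # round-robin: thread t owns this slot
        if pc[t] < len(threads[t]):
            penalty = threads[t][pc[t]]
            if last_issue[t] is None:
                earliest = 0
            else:
                earliest = last_issue[t] + 1 + penalty
            if cycle >= earliest:
                last_issue[t] = cycle
                pc[t] += 1
        # If nothing issued, this slot is a bubble (stall cycle).
        cycle += 1
    return cycle


thread = [0, 2, 0, 2]          # two instructions with 2-cycle hazards
print(run([thread]))           # one thread alone
print(run([thread] * 3))       # three identical threads interleaved
```

In this model a single thread needs 8 cycles for its 4 instructions, while three interleaved threads issue 12 instructions in 12 cycles: the two-cycle hazards are hidden because consecutive instructions of one thread are already three cycles apart.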

How useful is multithreading?
• Commercial SPs: single-threaded (NIOS-II, Microblaze)
• Fort et al. [FCCM’06] have shown:
  • multithreaded SP smaller than multiple SPs
  • with some performance degradation
• We go further by showing that the area efficiency of a multithreaded SP is greater than the area efficiency of a single-threaded SP
Not straightforward; here is how we did it

Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Single-Threaded Processor (simplified)
Figure labels: PC, +4, Instr. Mem, Reg. Array, ALU, Data Mem, Forwarding lines, Hazard Detection Logic.

2-Threaded Processor (simplified)
• Replicate state for each thread
• Simplify control logic
Figure labels: PC (two copies), Ctrl., +4, Instr. Mem, Reg. Array, ALU, Data Mem, Hazard Detection Logic.

Additional storage for multiple threads
• N x: program counters, registers, data mem.
• More efficiently done in FPGA than in ASIC
• Increase memory size while preserving frequency
Multithreading builds on the strengths of FPGAs

Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading

Measurement Infrastructure
• RTL: Single-Thread Processors, SPREE System [FPGA’06]
• Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
• Tools: Modelsim RTL Simulator; Quartus II 5.0 CAD Software
• Device: Stratix 1S40C5
• Measured: 1. Cycle Count 2. Resource Usage 3. Clock Frequency 4. Power
We can measure area/performance/energy accurately

Evaluation methodology
• Same benchmark running on all threads
• Some mixed-benchmark results are in the paper
• Run until completion of the last thread
• Same instruction space
• We present results with fixed-latency on-chip RAM
• We are implementing a solution for off-chip RAM

Processors: 3, 5 and 7 stages
Pipe3: F/D | R/EX/M | WB (1174 LEs, 78.3 MHz)
Pipe5: F | D | R/EX1 | EX2/M | WB (1283 LEs, 86.79 MHz)
Pipe7: F | D | R | EX1 | EX2/M | EX3/WB1 | WB2 (1557 LEs, 100.59 MHz)
F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
Best of each pipeline depth generated by SPREE
By default: thread count = number of pipeline stages

Area efficiency results
Figure: area efficiency for the 3-stage, 5-stage and 7-stage processors (data labels: 77%, 33%, 106%).
• Area efficiency is most improved with deeper pipelines
• 3- and 7-stages have similar area efficiency

IPC results for 3, 5 and 7 stages
Figure: IPC versus single-threaded proc.; ideal IPC = 1.
24%, 45% and 104% more instructions per cycle, respectively

Improvements to the Baseline Multithreaded Soft Processors
• Optimize away unpipelined multi-cycle paths
• Selection of architectural features
  • Multiplier implementation
  • Number of registers
  • Number of threads
Combination of techniques optimizing area efficiency

1- Changing multiplication support
• Default MIPS has Hi/Lo registers
• 3-operand multiplies (NIOS2 and Microblaze)
  • Two instructions compute high and low parts
  • Avoids replicating Hi and Lo register support
Figure labels: Hi/Lo, Register file, Multiplier, MUX.
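The "two instructions compute high and low parts" idea can be shown in a few lines. This is only a sketch of the arithmetic for unsigned 32-bit operands; the names `mul_lo` and `mul_hi` are hypothetical and do not correspond to specific opcodes of any ISA mentioned in the talk.

```python
MASK32 = 0xFFFFFFFF  # 32-bit register width assumed for illustration


def mul_lo(a: int, b: int) -> int:
    """Low 32 bits of the 64-bit product, written to an ordinary register."""
    return ((a & MASK32) * (b & MASK32)) & MASK32


def mul_hi(a: int, b: int) -> int:
    """High 32 bits of the unsigned 64-bit product."""
    return ((a & MASK32) * (b & MASK32)) >> 32


a, b = 0xFFFFFFFF, 2
full = (mul_hi(a, b) << 32) | mul_lo(a, b)
print(hex(mul_hi(a, b)), hex(mul_lo(a, b)), full == a * b)
```

Because each half lands in a general-purpose destination register, no dedicated Hi/Lo state has to be replicated per thread.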

2- Reducing the register file
• Not all registers are utilized [RAAW’06]
• Many threads can combine the savings
  • Two register files of 1..N (2N registers in total) become two of 1..N-k (2N-2k in total)
• Results in saved memory blocks
• Applicable to the 5-stage processor
• Slightly increases cycle count due to increased register pressure
• Allows area and frequency improvements
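To see how per-thread savings can add up to whole memory blocks, here is a deliberately simplified capacity model. Everything in it is an assumption for illustration: the 4096-bit block size, the 32-bit register width, the example register counts, and the idea that blocks are filled purely by bit capacity (real FPGA block RAMs also constrain width, depth and port count).

```python
def blocks_needed(n_threads: int, regs_per_thread: int,
                  reg_width_bits: int = 32, block_bits: int = 4096) -> int:
    """Memory blocks required to hold all threads' registers (capacity only)."""
    total_bits = n_threads * regs_per_thread * reg_width_bits
    return -(-total_bits // block_bits)  # ceiling division


# Hypothetical 5-thread processor: full register set vs. a trimmed one.
print(blocks_needed(5, 32))  # N = 32 registers per thread
print(blocks_needed(5, 25))  # N - k = 25 registers per thread
```

In this model one thread's savings alone would not free a block, but five threads' savings together drop the requirement from two blocks to one.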

Reducing the Number of Threads
• Usually: # threads = # pipeline stages
• Last stage: writeback to non-conflicting register
• Positive effect on the 5 and 7-stage processors
• Helps meet processing latency deadline (shorter round-robin)
• Gives designers more flexibility
Figure: timing diagrams of a 3-stage pipeline (F, E, W) over time, with legend Thread1, Thread2, Thread3.
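The "shorter round-robin" point can be quantified with a back-of-the-envelope calculation. The sketch assumes strict round-robin with one slot per thread per rotation and no stalls; the function name and the sample numbers are hypothetical.

```python
def per_thread_latency_us(instr_per_thread: int, n_threads: int,
                          freq_mhz: float) -> float:
    """Microseconds for one thread to issue its instructions.

    Under strict round-robin, a thread gets one issue slot every
    `n_threads` cycles, so it needs instr_per_thread * n_threads cycles.
    """
    cycles = instr_per_thread * n_threads
    return cycles / freq_mhz  # cycles / (cycles per microsecond)


# Hypothetical 100-instruction handler on a 100 MHz processor.
print(per_thread_latency_us(100, 7, 100.0))  # 7 threads
print(per_thread_latency_us(100, 4, 100.0))  # 4 threads
```

With fewer threads in the rotation, each thread's handler finishes sooner at the same clock, which is what helps a latency deadline.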

Conclusions
• Multithreaded SPs outperform single-threaded SPs
  • Assumes independent threads
  • Assumes use of on-chip memory
• 33%, 77% and 106% increase in area efficiency
• Demonstrated that benefits increase with pipeline depth
• Techniques to optimize away unpipelined multi-cycle paths
• Selection and combination of architectural features
  • Multiplier support
  • Number of threads
  • Number of registers
Commercial FPGA makers should have a Multi-Threaded SP

Long term goals
• Multiple multithreaded soft processors
• Research using off-chip memory hierarchy
• Study of synchronization mechanisms
• Make it easy to target and scale up for non-HW people
• Experimental Testbed: NetFPGA
  • Virtex-II Pro
  • 4 x 1 Gbps Ethernet
  • PCI board
  • 64 MB DDR2 DRAM
  • Stanford/Xilinx platform
  • Collaboration with network researchers
Perform real high bandwidth experiments

Thank you ECE Dept. University of Toronto Martin Labrecque (martinl@eecg.utoronto.ca) Gregory Steffan

Where do threads come from?
• Event processing, e.g. multiple sources of interrupts
• Packet processing, e.g. CAN, RS-485, Ethernet, etc.
• Systems handling requests, e.g. bus controllers
For now, we consider independent threads

SPREE vs Nios II [IEEE TCAD’07]
Figure: comparison plot annotated with the directions "faster" and "smaller".

Architectural Parameters Used in SPREE
• Multiplication support: hardware FU or software routine
• Shifter implementation: flipflops, multiplier, or LUTs
• Pipelining
  • Depth (2-7 stages)
  • Forwarding lines
We focus on core microarchitecture (for now)

Contributions on Multithreaded Soft Processors
• Multithreaded SPs dominate single-threaded processors in area and IPC
• Demonstrated that these benefits increase with the # of pipeline stages
• Explained techniques to optimize away unpipelined multi-cycle paths
• Selection of architectural features
  • Number of threads
  • Number of registers
  • Multiplier support
Combination of techniques that optimize area efficiency

Unpipelined Multicycle Paths
Figure: example of a 3-stage pipeline with multicycle paths on load, store, shift and multiplies, shown for ST and MT (stage labels: F/D, R/EX, EX, M, WB).
• Not practical in ST because of hazard detection
• Important source of IPC improvement

Changing multiplication support
Figure: results for the 3-stage, 5-stage and 7-stage processors.
For multithreaded SPs, 3-operand multiplies always win

Reducing the Number of Threads Positive effect on the 5 and 7-stage processors

SPREE System (Soft Processor Rapid Exploration Environment)
• Input: Processor description (ISA, Datapath)
  • Made of hand-coded components
• SPREE System
  • Verify ISA against datapath
  • Datapath Instantiation
  • Control Generation
• Output: Synthesizable Verilog (RTL)

Multithreading
• Replace processor stalls
• Fill them with instructions from other threads
• When to switch thread?
  • Multiple techniques
  • Most common: every instruction (e.g. Sun’s Niagara)
Fine-grained multithreading: 1 instr. per thread in round-robin
Figure: interleaved instructions in pipeline over time (T1 T2 T3 T1 T2 T3).
Metric: (Million Instr. × Frequency) / (# Cycles × Area)


Removed load and branch delay slots in the code
