Fine-Grain Performance Scaling of Soft Vector Processors Peter Yiannacouras Jonathan Rose Gregory J. Steffan ESWEEK – CASES 2009, Grenoble, France Oct 13, 2009
FPGA Systems and Soft Processors
• Soft processors simplify FPGA design: software + compiler (weeks) instead of HDL + CAD (months); used in 25% of designs [source: Altera, 2009]
• Custom HW competes: faster, smaller, less power — but harder to design
• Hard processors: specialized device, increased cost, board space, latency, power
• Target: data-level parallelism → vector processors, by customizing the soft processor architecture
Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
    c[i] = a[i] + b[i];

// Vectorized code
set    vl, 16
vload  vr0, a
vload  vr1, b
vadd   vr2, vr0, vr1
vstore vr2, c

[Figure: with 1 vector lane, the element operations vr2[0]=vr0[0]+vr1[0] … vr2[15]=vr0[15]+vr1[15] execute one at a time]
• Each vector instruction holds many independent element operations
Vector Processing Primer (16 lanes)

[Figure: same code as above; with 16 vector lanes, all 16 element operations vr2[0]=vr0[0]+vr1[0] … vr2[15]=vr0[15]+vr1[15] execute at once → 16x speedup]
• Each vector instruction holds many independent element operations
• Previous work on soft vector processors [CASES’08]: scalability, flexibility, portability
VESPA Architecture Design (Vector Extended Soft Processor Architecture)

[Figure: block diagram — legend: pipe stage, logic, storage]
• Scalar pipeline (3-stage): decode, register file, ALU, writeback; instruction cache and shared data cache
• Vector control pipeline (3-stage): decode, VC/VS register files and writeback
• Vector pipeline (6-stage): decode, replicate, hazard check, VR register file, lanes, VR writeback
• 32-bit lanes; Lane 1: ALU, memory unit; Lane 2: ALU, memory unit, multiplier
• Supports integer and fixed-point operations [VIRAM]
In This Work • Evaluate for real using modern hardware • Scale to 32 lanes (previous work did 16 lanes) • Add more fine-grain architectural parameters • Scale more finely • Augment with parameterized vector chaining support • Customize to functional unit demand • Augment with heterogeneous lanes • Explore a large design space
Evaluation Infrastructure

Evaluate soft vector processors with high accuracy:
• Software flow: EEMBC benchmarks and vectorized assembly subroutines → GCC compiler / GNU as → ld → binary → instruction set simulation (for verification)
• Hardware flow: Verilog (full hardware design of the VESPA soft vector processor) → RTL simulation (cycle counts, verified against the instruction set simulation) → FPGA CAD software → Stratix III 340 with DDR2 (area, power, clock frequency)
VESPA Scalability

[Plot: cycle speedup vs. number of lanes (1, 2, 4, 8, 16, 32), with areas 1, 1.3, 1.9, 3.2, 6.3, and 12.3 respectively]
• Up to 19x speedup, average of 11x for 32 lanes → good scaling
• Lane count is a powerful parameter … but it is coarse-grained
Vector Lane Design Space

[Plot: area in equivalent ALMs vs. number of lanes, up to 8% of the largest FPGA]
• Doubling lanes is too coarse-grained; reprogrammability allows a more exact fit
Vector Chaining
• Simultaneous execution of independent element operations from dependent instructions

[Figure: vadd vr10, vr1, vr2 followed by vmul vr20, vr10, vr11 — the vmul depends on the vadd through vr10, but the element operations 0–7 of each instruction are independent across elements]
Vector Chaining in VESPA (4 lanes)

[Figure: with a unified vector register file (B=1), instructions execute one at a time — the vadd fully drains before the dependent vmul starts (no chaining). With the register file divided into banks (B=2), the vadd and vmul execute simultaneously from different banks (chaining)]
• Performance increases if instructions are scheduled to exploit the banks
ALU Replication (4 lanes, B=2)

[Figure: with chaining but a single shared set of ALUs (APB=false), two ALU instructions (vadd, vsub) still execute one at a time; replicating the ALUs per bank (APB=true) lets the vadd and vsub execute simultaneously]
Vector Chaining Speedup (on an 8-lane VESPA)

[Plot: cycle speedup vs. no chaining, per benchmark, across chaining configurations — some benchmarks favor more banks, some more ALUs, some don’t care]
• Significant speed improvement over no chaining: 22–35% on average
• Performance is application-dependent: 5–76%
• Chaining can be quite costly in area: 27–92%
• Finer-grained than doubling lanes: 19–89% of the speed at 86% of the area
Heterogeneous Lanes

[Figure: 4 lanes (L=4) with 2 multiplier lanes (X=2) — every lane has an ALU, but only lanes 1 and 2 have multipliers, so on a vmul lanes 3 and 4 stall]
• Saves area, but reduces speed depending on the demand on the multiplier
Impact of Heterogeneous Lanes (on a 32-lane VESPA)

[Plot: slowdown per benchmark — free for some, moderate or expensive for others]
• Performance penalty is application-dependent: 0–85%
• Modest area savings (6–13%), since the multipliers are dedicated (hard) blocks
Design Space Exploration using VESPA Architectural Parameters

[Table: parameters grouped into compute architecture, instruction set architecture, and memory architecture]
VESPA Design Space (768 architectural configurations)

[Plot: normalized wall-clock time vs. normalized coprocessor area (1–64); 28x range in area, 18x range in performance]
• The fine-grained design space allows a better-fit architecture
• Evidence of efficiency: performance trades off against area
Summary — use software (soft vector processors) for non-critical data-parallel computation
• Evaluated VESPA on modern FPGA hardware: scales up to 32 lanes with 11x average speedup
• Augmented VESPA with fine-tunable parameters
• Vector chaining (by banking the register file): 22–35% better average performance than without; the impact of the chaining configuration is very application-dependent
• Heterogeneous lanes (lanes without multipliers): saves multipliers at some performance cost (sometimes free)
• Explored a vast architectural design space: 18x range in performance, 28x range in area
Thank You! • VESPA release: http://www.eecg.utoronto.ca/VESPA
VESPA Parameters

[Table: full parameter list, grouped into compute architecture, instruction set architecture, and memory architecture]
VESPA Scalability

[Plot: cycle speedup vs. number of lanes (1, 2, 4, 8, 16, 32), with areas 1, 1.3, 1.9, 3.2, 6.3, and 12.3 respectively]
• Up to 27x speedup, average of 15x for 32 lanes → good scaling
• Lane count is a powerful parameter … but too coarse-grained
Proposed Soft Vector Processor System Design Flow
• We propose adding vector extensions to existing soft processors
• We want to evaluate soft vector processors for real

[Figure: user code plus portable vectorized software routines (e.g., downloaded from www.fpgavendor.com) run on a portable, flexible, scalable soft processor with vector lanes 1–4, a memory interface, and peripherals; if the soft processor is the bottleneck, increase the number of lanes — the alternative is custom HW]
Vector Memory Unit

[Figure: the memory request queue generates one address per lane — base + stride*i for i = 0 … L (where L = #lanes − 1), or index0 … indexL for indexed accesses, selected by muxes; read and write crossbars connect the lanes to the dcache, with write data wrdata0 … wrdataL buffered in the memory write queue]
Overall Memory System Performance (16 lanes)

[Plot: memory unit stall cycles and miss cycles for 4KB and 16KB caches — 67%, 48%, 31%, 4%]
• A wider cache line plus prefetching reduces memory unit stall cycles significantly
• A wider cache line plus prefetching eliminates all but 4% of miss cycles