Fine-Grain Performance Scaling of Soft Vector Processors Peter Yiannacouras Jonathan Rose Gregory J. Steffan ESWEEK – CASES 2009, Grenoble, France Oct 13, 2009
FPGA Systems and Soft Processors
• Soft processors simplify FPGA design: software + compiler (weeks) instead of HDL + CAD (months); used in 25% of designs [source: Altera, 2009]
• Custom HW competes: faster, smaller, less power — but harder to design
• Hard processors: specialized device, increased cost, board space, latency, power
• Target: data-level parallelism → vector processors, by customizing the soft processor architecture
Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
    c[i] = a[i] + b[i];

// Vectorized code
set    vl, 16
vload  vr0, a
vload  vr1, b
vadd   vr2, vr0, vr1
vstore vr2, c

[Figure: with 1 vector lane, the element operations vr2[0]=vr0[0]+vr1[0] … vr2[15]=vr0[15]+vr1[15] execute one at a time]
• Each vector instruction holds many independent element operations
Vector Processing Primer (16 lanes)

[Figure: same code as above; with 16 vector lanes, all 16 element operations vr2[0]=vr0[0]+vr1[0] … vr2[15]=vr0[15]+vr1[15] execute at once → 16x speedup]
• Each vector instruction holds many independent element operations
• Previous work on soft vector processors [CASES’08]: scalability, flexibility, portability
VESPA Architecture Design (Vector Extended Soft Processor Architecture)

[Figure: block diagram — legend: pipe stage, logic, storage]
• Scalar pipeline (3-stage): decode, register file, ALU, writeback; instruction cache and shared data cache
• Vector control pipeline (3-stage): decode, VC/VS register files and writeback
• Vector pipeline (6-stage): decode, replicate, hazard check, VR register file, lanes, VR writeback
• 32-bit lanes; Lane 1: ALU, memory unit; Lane 2: ALU, memory unit, multiplier
• Supports integer and fixed-point operations [VIRAM]
In This Work • Evaluate for real using modern hardware • Scale to 32 lanes (previous work did 16 lanes) • Add more fine-grain architectural parameters • Scale more finely • Augment with parameterized vector chaining support • Customize to functional unit demand • Augment with heterogeneous lanes • Explore a large design space
Evaluation Infrastructure

Evaluate soft vector processors with high accuracy:
• Software flow: EEMBC benchmarks and vectorized assembly subroutines → GCC compiler / GNU as → ld → binary → instruction set simulation (for verification)
• Hardware flow: Verilog (full hardware design of the VESPA soft vector processor) → RTL simulation (cycle counts, verified against the instruction set simulation) → FPGA CAD software → Stratix III 340 with DDR2 (area, power, clock frequency)
VESPA Scalability

[Plot: cycle speedup vs. number of lanes (1, 2, 4, 8, 16, 32), with areas 1, 1.3, 1.9, 3.2, 6.3, and 12.3 respectively]
• Up to 19x speedup, average of 11x for 32 lanes → good scaling
• Lane count is a powerful parameter … but it is coarse-grained
Vector Lane Design Space

[Plot: area in equivalent ALMs vs. number of lanes, up to 8% of the largest FPGA]
• Doubling lanes is too coarse-grained; reprogrammability allows a more exact fit
Vector Chaining
• Simultaneous execution of independent element operations from dependent instructions

[Figure: vadd vr10, vr1, vr2 followed by vmul vr20, vr10, vr11 — the vmul depends on the vadd through vr10, but the element operations 0–7 of each instruction are independent across elements]
Vector Chaining in VESPA (4 lanes)

[Figure: with a unified vector register file (B=1), instructions execute one at a time — the vadd fully drains before the dependent vmul starts (no chaining). With the register file divided into banks (B=2), the vadd and vmul execute simultaneously from different banks (chaining)]
• Performance increases if instructions are scheduled to exploit the banks
ALU Replication (4 lanes, B=2)

[Figure: with chaining but a single shared set of ALUs (APB=false), two ALU instructions (vadd, vsub) still execute one at a time; replicating the ALUs per bank (APB=true) lets the vadd and vsub execute simultaneously]
Vector Chaining Speedup (on an 8-lane VESPA)

[Plot: cycle speedup vs. no chaining, per benchmark, across chaining configurations — some benchmarks favor more banks, some more ALUs, some don’t care]
• Significant speed improvement over no chaining: 22–35% on average
• Performance is application-dependent: 5–76%
• Chaining can be quite costly in area: 27–92%
• Finer-grained than doubling lanes: 19–89% of the speed at 86% of the area
Heterogeneous Lanes

[Figure: 4 lanes (L=4) with 2 multiplier lanes (X=2) — every lane has an ALU, but only lanes 1 and 2 have multipliers, so on a vmul lanes 3 and 4 stall]
• Saves area, but reduces speed depending on the demand on the multiplier
Impact of Heterogeneous Lanes (on a 32-lane VESPA)

[Plot: slowdown per benchmark — free for some, moderate or expensive for others]
• Performance penalty is application-dependent: 0–85%
• Modest area savings (6–13%), since the multipliers are dedicated (hard) blocks
Design Space Exploration using VESPA Architectural Parameters

[Table: parameters grouped into compute architecture, instruction set architecture, and memory architecture]
VESPA Design Space (768 architectural configurations)

[Plot: normalized wall-clock time vs. normalized coprocessor area (1–64); 28x range in area, 18x range in performance]
• The fine-grained design space allows a better-fit architecture
• Evidence of efficiency: performance trades off against area
Summary — use software (soft vector processors) for non-critical data-parallel computation
• Evaluated VESPA on modern FPGA hardware: scales up to 32 lanes with 11x average speedup
• Augmented VESPA with fine-tunable parameters
• Vector chaining (by banking the register file): 22–35% better average performance than without; the impact of the chaining configuration is very application-dependent
• Heterogeneous lanes (lanes without multipliers): saves multipliers at some performance cost (sometimes free)
• Explored a vast architectural design space: 18x range in performance, 28x range in area
Thank You! • VESPA release: http://www.eecg.utoronto.ca/VESPA
VESPA Parameters

[Table: full parameter list, grouped into compute architecture, instruction set architecture, and memory architecture]
VESPA Scalability

[Plot: cycle speedup vs. number of lanes (1, 2, 4, 8, 16, 32), with areas 1, 1.3, 1.9, 3.2, 6.3, and 12.3 respectively]
• Up to 27x speedup, average of 15x for 32 lanes → good scaling
• Lane count is a powerful parameter … but too coarse-grained
Proposed Soft Vector Processor System Design Flow
• We propose adding vector extensions to existing soft processors
• We want to evaluate soft vector processors for real

[Figure: user code plus portable vectorized software routines (e.g., downloaded from www.fpgavendor.com) run on a portable, flexible, scalable soft processor with vector lanes 1–4, a memory interface, and peripherals; if the soft processor is the bottleneck, increase the number of lanes — the alternative is custom HW]
Vector Memory Unit

[Figure: the memory request queue generates one address per lane — base + stride*i for i = 0 … L (where L = #lanes − 1), or index0 … indexL for indexed accesses, selected by muxes; read and write crossbars connect the lanes to the dcache, with write data wrdata0 … wrdataL buffered in the memory write queue]
Overall Memory System Performance (16 lanes)

[Plot: memory unit stall cycles and miss cycles for 4KB and 16KB caches — 67%, 48%, 31%, 4%]
• A wider cache line plus prefetching reduces memory unit stall cycles significantly
• A wider cache line plus prefetching eliminates all but 4% of miss cycles