Improving Memory System Performance for Soft Vector Processors Peter Yiannacouras J. Gregory Steffan Jonathan Rose WoSPS – Oct 26, 2008
Soft Processors in FPGA Systems
Data-level parallelism → soft vector processors
Soft processors (C + compiler) are easier to use; custom logic (HDL + CAD) is faster, smaller, and uses less power.
Soft processors are configurable – how can we make use of this?
Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
  b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b

Each vector instruction holds many units of independent operations (b[0]+=a[0] through b[15]+=a[15]). With 1 vector lane, these element operations execute one at a time.
Vector Processing Primer – 16 Vector Lanes
With 16 vector lanes, the same vectorized code executes all 16 independent element operations (b[0]+=a[0] through b[15]+=a[15]) in parallel: a 16x speedup.
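To make the lane/speedup relationship concrete, here is a minimal C sketch (not from the VESPA sources; names such as vadd_cycles are invented) that models the execution cycles of one vector instruction with elements striped across lanes: 16 elements take 16 cycles on 1 lane and 1 cycle on 16 lanes.

/* Minimal sketch (assumption, not VESPA RTL): execution cycles for one
 * vector instruction of length vl on num_lanes lanes, ignoring memory
 * stalls. Elements are striped across lanes, one element per lane per
 * cycle. */
#include <stdio.h>

static unsigned vadd_cycles(unsigned vl, unsigned num_lanes)
{
    return (vl + num_lanes - 1) / num_lanes;   /* ceiling(vl / num_lanes) */
}

int main(void)
{
    printf("1 lane:   %u cycles\n", vadd_cycles(16, 1));   /* 16 */
    printf("16 lanes: %u cycles\n", vadd_cycles(16, 16));  /* 1  */
    return 0;
}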
Sub-Linear Scalability
Speedup grows more slowly than the lane count – the vector lanes are not being fully utilized.
Where Are The Cycles Spent?
With 16 lanes, two thirds (67%) of cycles are spent waiting on the memory unit, often due to cache misses.
Our Goals
• Improve the memory system:
  • Better cache design
  • Hardware prefetching
• Evaluate improvements for real:
  • Using a complete hardware design (in Verilog)
  • On real FPGA hardware (Stratix 1S80C6)
  • Running full benchmarks (EEMBC)
  • From off-chip memory (DDR-133MHz)
Current Infrastructure
[Figure: VESPA processor pipeline and evaluation toolflow]
Software flow: EEMBC C benchmarks are compiled with GCC for the scalar μP and linked (ld) with vectorized assembly subroutines assembled by GNU as with vector (VPU) support; the resulting ELF binary runs on the MINT instruction set simulator for verification.
Hardware flow: the Verilog design is simulated in Modelsim (RTL simulator) for cycle counts and verification, and synthesized with Altera Quartus II v8.0 for area and frequency.
VESPA Architecture Design
[Figure: VESPA pipelines]
A 3-stage scalar pipeline, a 3-stage vector control pipeline, and a 6-stage vector pipeline share the Dcache. Datapaths are 32 bits wide, and the design supports integer and fixed-point operations, and predication.
Memory System Design
[Figure: baseline memory system]
A vld.w loads 16 contiguous 32-bit words. The 16 vector lanes and the scalar processor share a 4KB Dcache with 16B lines through the vector memory crossbar, backed by DDR with a 9-cycle access latency.
Memory System Design – improved
[Figure: improved memory system]
With a 4x larger Dcache (16KB) and 4x wider lines (64B), the same vld.w of 16 contiguous 32-bit words is satisfied by fewer, wider cache accesses – reduced cache accesses plus some implicit prefetching.
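To see why the wider line helps, a small C sketch (the helper lines_touched is an invented name, not part of the design) counts how many cache lines a unit-stride vld.w of 16 contiguous 32-bit words touches: 4 lines at 16B per line versus a single line at 64B, so the wider line both cuts cache accesses and implicitly prefetches the rest of the vector.

/* Sketch (assumption): cache lines touched by a unit-stride vector load
 * of n_elems 32-bit words starting at byte address base. */
#include <stdio.h>

static unsigned lines_touched(unsigned base, unsigned n_elems,
                              unsigned line_bytes)
{
    unsigned first = base / line_bytes;
    unsigned last  = (base + n_elems * 4 - 1) / line_bytes;
    return last - first + 1;
}

int main(void)
{
    /* 16 contiguous 32-bit words = 64 bytes, aligned at address 0 */
    printf("16B lines: %u accesses\n", lines_touched(0, 16, 16));  /* 4 */
    printf("64B lines: %u accesses\n", lines_touched(0, 16, 64));  /* 1 */
    return 0;
}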
Improving Cache Design
• Vary the cache depth & cache line size:
  • Using a parameterized design
  • Cache line size: 16, 32, 64, 128 bytes
  • Cache depth: 4, 8, 16, 32, 64 KB
• Measure performance on 9 benchmarks:
  • 6 from EEMBC, all executed in hardware
• Measure area cost:
  • Equate the silicon area of all resources used
  • Report in units of Equivalent LEs
Cache Design Space – Performance (Wall Clock Time)
[Chart: wall-clock performance across cache configurations; clock frequencies range from 122MHz to 129MHz]
The best cache design almost doubles the performance of the original VESPA. Cache line size matters more than cache depth (lots of streaming). More pipelining/retiming could reduce the clock frequency penalty.
Cache Design Space – Area
[Figure: cache storage built from M4K block RAMs (16 bits wide, 4096 bits each) vs. MRAM]
A 64B line is 512 bits wide, requiring 32 M4Ks, which together provide 16KB of storage. In the worst case, system area almost doubled.
Cache Design Space – Area
a) Choose the cache depth to fill the block RAMs (M4Ks) already needed for the line size.
b) Don't use MRAMs: they are big, few, and overkill.
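The depth-selection rule in (a) can be illustrated with a short C sketch, assuming each M4K supplies 4096 bits configured 16 bits wide (the constants M4K_BITS and M4K_WIDTH are assumptions): the line width fixes the M4K count, and filling those M4Ks gives the cache depth that comes for free – for a 64B line that is 32 M4Ks, or 16KB.

/* Sketch (assumption): M4K block RAMs hold 4096 bits, used 16 bits wide.
 * The cache data array must be one line wide, so the line size fixes the
 * number of M4Ks; the depth that fills them costs no extra block RAM. */
#include <stdio.h>

#define M4K_BITS  4096u
#define M4K_WIDTH 16u

int main(void)
{
    unsigned line_bytes = 64;
    unsigned line_bits  = line_bytes * 8;           /* 512 bits */
    unsigned num_m4k    = line_bits / M4K_WIDTH;    /* 32 M4Ks  */
    unsigned free_bytes = num_m4k * M4K_BITS / 8;   /* 16384 B  */

    printf("%uB line -> %u M4Ks -> fill them with a %uKB cache\n",
           line_bytes, num_m4k, free_bytes / 1024);
    return 0;
}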
Hardware Prefetching Example
[Figure: no prefetching vs. prefetching 3 blocks]
Without prefetching, each successive vld.w misses in the Dcache and pays the 9-cycle DDR penalty. Prefetching 3 blocks on the first miss turns the subsequent vld.w accesses into hits.
Hardware Data Prefetching
We measure performance/area using a 64B-line, 16KB dcache.
• Advantages:
  • Little area overhead
  • Parallelizes memory fetching with computation
  • Uses the full memory bandwidth
• Disadvantages:
  • Cache pollution
• We use sequential prefetching (sketched below) triggered on:
  • a) any miss, or
  • b) a sequential vector instruction miss
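Below is a minimal software model of policy (a), sequential prefetching triggered on any miss – a C sketch with invented names (access_cache, PREFETCH_K), not the Verilog implementation: on a miss, the demand line is filled and the next K sequential lines are installed as well.

/* Sketch (assumption): direct-mapped cache model with sequential
 * prefetching of K blocks triggered on any miss. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_SETS   256u
#define LINE_BYTES 64u
#define PREFETCH_K 3u

static unsigned tags[NUM_SETS];
static bool     valid[NUM_SETS];

static void install(unsigned line_addr)
{
    unsigned set = line_addr % NUM_SETS;
    tags[set]  = line_addr / NUM_SETS;
    valid[set] = true;
}

/* Returns true on a hit; on a miss, fills the demand line and prefetches
 * the next PREFETCH_K sequential lines. */
static bool access_cache(unsigned byte_addr)
{
    unsigned line_addr = byte_addr / LINE_BYTES;
    unsigned set = line_addr % NUM_SETS;
    if (valid[set] && tags[set] == line_addr / NUM_SETS)
        return true;

    install(line_addr);                  /* demand fetch        */
    for (unsigned k = 1; k <= PREFETCH_K; k++)
        install(line_addr + k);          /* sequential prefetch */
    return false;
}

int main(void)
{
    printf("%d\n", access_cache(0));     /* 0: miss, prefetches lines 1..3 */
    printf("%d\n", access_cache(64));    /* 1: hit thanks to the prefetch  */
    return 0;
}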
Prefetching K Blocks – Any Miss
[Chart: speedup vs. K for each benchmark]
Only half the benchmarks are significantly sped up; some are not receptive at all. Maximum speedup is 2.2x, with a peak average speedup of 28%.
Prefetching Area Cost: Writeback Buffer
[Figure: prefetching 3 blocks evicts dirty lines into a writeback buffer in front of the Dcache]
When a prefetch would evict dirty lines, there are two options: deny the prefetch, or buffer all dirty lines. The area cost of the writeback buffer is small – 1.6% of system area, mostly block RAMs with little logic – and there is no clock frequency impact.
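A rough C sketch of the "buffer all dirty lines" option (evict_for_prefetch and WB_DEPTH are invented names, and the real logic is in Verilog); falling back to denying the prefetch when the buffer is full is my assumption, since the slide presents denying and buffering as the two alternatives.

/* Sketch (assumption): the "buffer all dirty lines" option; the other
 * option on the slide is simply to deny any prefetch whose victim is dirty. */
#include <stdbool.h>

#define WB_DEPTH   8u
#define LINE_BYTES 64u

struct cache_line { unsigned addr; unsigned char data[LINE_BYTES]; };

static struct cache_line wb_buf[WB_DEPTH];  /* writeback buffer     */
static unsigned wb_count;                   /* dirty lines buffered */

/* Returns false if the prefetch must be denied. */
static bool evict_for_prefetch(const struct cache_line *victim, bool dirty)
{
    if (!dirty)
        return true;                 /* clean victim: nothing to write back  */
    if (wb_count == WB_DEPTH)
        return false;                /* buffer full: deny this prefetch      */
    wb_buf[wb_count++] = *victim;    /* queue the dirty line for writeback   */
    return true;
}

int main(void)
{
    struct cache_line victim = { 0x1000, {0} };
    return evict_for_prefetch(&victim, true) ? 0 : 1;
}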
Any Miss vs Sequential Vector Miss
The two triggers perform identically (their curves are collinear) – nearly all misses in our benchmarks come from sequential vector memory instructions.
Vector Length Prefetching
• Previously: a constant number of cache lines prefetched
• Now: prefetch a multiple of the vector length
  • Only for sequential vector memory instructions
  • E.g. a vector load of 32 elements (see the sketch below)
• Guarantees <= 1 miss per vector memory instruction
[Figure: vld.w of elements 0–31, showing the demand fetch plus the prefetched region scaled by k]
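A small C sketch of the prefetch amount (prefetch_lines and its parameters are invented names): on a miss by a sequential vector memory instruction, prefetch k times the vector length's worth of data, so for k >= 1 the rest of the vector is resident after the first miss – at most one miss per vector memory instruction.

/* Sketch (assumption): cache lines prefetched on a miss by a sequential
 * vector memory instruction, as a multiple k of the vector length. */
#include <stdio.h>

static unsigned prefetch_lines(unsigned vl, unsigned elem_bytes,
                               unsigned line_bytes, unsigned k)
{
    unsigned bytes = k * vl * elem_bytes;
    return (bytes + line_bytes - 1) / line_bytes;   /* ceiling division */
}

int main(void)
{
    /* Vector load of 32 32-bit elements with 64B cache lines */
    printf("1*VL: %u lines\n", prefetch_lines(32, 4, 64, 1));  /* 2  */
    printf("8*VL: %u lines\n", prefetch_lines(32, 4, 64, 8));  /* 16 */
    return 0;
}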
Vector Length Prefetching – Performance
[Chart: speedup vs. prefetch multiple of VL]
1*VL prefetching provides good speedup (21% average) without tuning and with no cache pollution; 8*VL is best, with a peak average speedup of 29% and a maximum of 2.2x. Some benchmarks are not receptive.
Overall Memory System Performance
[Chart: memory cycle breakdown for the 4KB and 16KB configurations: 67%, 48%, 31%, 4%]
A wider cache line plus prefetching reduces memory unit stall cycles significantly (to 31%) and eliminates all but 4% of miss cycles.
Improved Scalability
Previously: 3–8x speedup range, average of 5x for 16 lanes. Now: 6–13x range, average of 10x for 16 lanes.
Summary
• Explored the cache design space:
  • ~2x performance for ~2x system area (area growth due largely to the memory crossbar)
  • Widened the cache line size to 64B and the depth to 16KB
• Enhanced VESPA with hardware data prefetching:
  • Up to 2.2x performance, average of 28% for K=15
  • Vector length prefetcher gains 21% on average for 1*VL – good for mixed workloads, no tuning, no cache pollution
  • Peak at 8*VL, with an average speedup of 29%
• Overall, improved the VESPA memory system & scalability:
  • Decreased miss cycles to 4%
  • Decreased memory unit stall cycles to 31%
Vector Memory Unit
[Figure: vector memory unit datapath – memory request queue, per-lane address generation, read/write crossbars, Dcache, and memory write queue]
With L = #lanes − 1 and Memory Lanes = 4, each lane i computes its address as base + stride*i for strided accesses or base + index_i for indexed accesses; read data (rddata0..rddataL) returns through the read crossbar and write data (wrdata0..wrdataL) goes through the write crossbar to the Dcache.
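A behavioural C sketch of the per-lane address generation in the diagram (gen_addresses and NUM_LANES are invented names; the real unit is Verilog): with Memory Lanes = 4, each lane i computes base + stride*i for strided accesses or base + index[i] for indexed ones.

/* Sketch (assumption): per-cycle address generation for the vector
 * memory unit, for either strided or indexed accesses. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_LANES 4u   /* "Memory Lanes = 4" in the diagram */

static void gen_addresses(unsigned base, int stride,
                          const unsigned *index, bool indexed,
                          unsigned addr[NUM_LANES])
{
    for (unsigned i = 0; i < NUM_LANES; i++)         /* i = 0 .. L */
        addr[i] = indexed ? base + index[i]
                          : base + (unsigned)(stride * (int)i);
}

int main(void)
{
    unsigned addr[NUM_LANES];
    gen_addresses(0x1000, 4, NULL, false, addr);     /* unit-stride words */
    for (unsigned i = 0; i < NUM_LANES; i++)
        printf("lane %u -> 0x%x\n", i, addr[i]);
    return 0;
}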