Improving Memory System Performance for Soft Vector Processors Peter Yiannacouras J. Gregory Steffan Jonathan Rose WoSPS – Oct 26, 2008
Soft Processors in FPGA Systems
Data-level parallelism → soft vector processors
Soft processors (C + compiler) are easier to use; custom logic (HDL + CAD) is faster, smaller, and uses less power.
Soft processors are configurable – how can we make use of this?
Vector Processing Primer

// C code
for (i = 0; i < 16; i++)
  b[i] += a[i];

// Vectorized code
set    vl, 16
vload  vr0, b
vload  vr1, a
vadd   vr0, vr0, vr1
vstore vr0, b

Each vector instruction holds many units of independent operations (b[0]+=a[0] through b[15]+=a[15]). With 1 vector lane, these element operations execute one at a time.
Vector Processing Primer – 16 Vector Lanes
With 16 vector lanes, the same vectorized code executes all 16 independent element operations (b[0]+=a[0] through b[15]+=a[15]) in parallel: a 16x speedup.
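To make the lane/speedup relationship concrete, here is a minimal C sketch (not from the VESPA sources; names such as vadd_cycles are invented) that models the execution cycles of one vector instruction with elements striped across lanes: 16 elements take 16 cycles on 1 lane and 1 cycle on 16 lanes.

/* Minimal sketch (assumption, not VESPA RTL): execution cycles for one
 * vector instruction of length vl on num_lanes lanes, ignoring memory
 * stalls. Elements are striped across lanes, one element per lane per
 * cycle. */
#include <stdio.h>

static unsigned vadd_cycles(unsigned vl, unsigned num_lanes)
{
    return (vl + num_lanes - 1) / num_lanes;   /* ceiling(vl / num_lanes) */
}

int main(void)
{
    printf("1 lane:   %u cycles\n", vadd_cycles(16, 1));   /* 16 */
    printf("16 lanes: %u cycles\n", vadd_cycles(16, 16));  /* 1  */
    return 0;
}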
Sub-Linear Scalability
Speedup grows more slowly than the lane count – the vector lanes are not being fully utilized.
Where Are The Cycles Spent?
With 16 lanes, two thirds (67%) of cycles are spent waiting on the memory unit, often due to cache misses.
Our Goals
• Improve the memory system:
  • Better cache design
  • Hardware prefetching
• Evaluate improvements for real:
  • Using a complete hardware design (in Verilog)
  • On real FPGA hardware (Stratix 1S80C6)
  • Running full benchmarks (EEMBC)
  • From off-chip memory (DDR-133MHz)
Current Infrastructure
[Figure: VESPA processor pipeline and evaluation toolflow]
Software flow: EEMBC C benchmarks are compiled with GCC for the scalar μP and linked (ld) with vectorized assembly subroutines assembled by GNU as with vector (VPU) support; the resulting ELF binary runs on the MINT instruction set simulator for verification.
Hardware flow: the Verilog design is simulated in Modelsim (RTL simulator) for cycle counts and verification, and synthesized with Altera Quartus II v8.0 for area and frequency.
VESPA Architecture Design
[Figure: VESPA pipelines]
A 3-stage scalar pipeline, a 3-stage vector control pipeline, and a 6-stage vector pipeline share the Dcache. Datapaths are 32 bits wide, and the design supports integer and fixed-point operations, and predication.
Memory System Design
[Figure: baseline memory system]
A vld.w loads 16 contiguous 32-bit words. The 16 vector lanes and the scalar processor share a 4KB Dcache with 16B lines through the vector memory crossbar, backed by DDR with a 9-cycle access latency.
Memory System Design – improved
[Figure: improved memory system]
With a 4x larger Dcache (16KB) and 4x wider lines (64B), the same vld.w of 16 contiguous 32-bit words is satisfied by fewer, wider cache accesses – reduced cache accesses plus some implicit prefetching.
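To see why the wider line helps, a small C sketch (the helper lines_touched is an invented name, not part of the design) counts how many cache lines a unit-stride vld.w of 16 contiguous 32-bit words touches: 4 lines at 16B per line versus a single line at 64B, so the wider line both cuts cache accesses and implicitly prefetches the rest of the vector.

/* Sketch (assumption): cache lines touched by a unit-stride vector load
 * of n_elems 32-bit words starting at byte address base. */
#include <stdio.h>

static unsigned lines_touched(unsigned base, unsigned n_elems,
                              unsigned line_bytes)
{
    unsigned first = base / line_bytes;
    unsigned last  = (base + n_elems * 4 - 1) / line_bytes;
    return last - first + 1;
}

int main(void)
{
    /* 16 contiguous 32-bit words = 64 bytes, aligned at address 0 */
    printf("16B lines: %u accesses\n", lines_touched(0, 16, 16));  /* 4 */
    printf("64B lines: %u accesses\n", lines_touched(0, 16, 64));  /* 1 */
    return 0;
}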
Improving Cache Design
• Vary the cache depth & cache line size:
  • Using a parameterized design
  • Cache line size: 16, 32, 64, 128 bytes
  • Cache depth: 4, 8, 16, 32, 64 KB
• Measure performance on 9 benchmarks:
  • 6 from EEMBC, all executed in hardware
• Measure area cost:
  • Equate the silicon area of all resources used
  • Report in units of Equivalent LEs
Cache Design Space – Performance (Wall Clock Time)
[Chart: wall-clock performance across cache configurations; clock frequencies range from 122MHz to 129MHz]
The best cache design almost doubles the performance of the original VESPA. Cache line size matters more than cache depth (lots of streaming). More pipelining/retiming could reduce the clock frequency penalty.
Cache Design Space – Area
[Figure: cache storage built from M4K block RAMs (16 bits wide, 4096 bits each) vs. MRAM]
A 64B line is 512 bits wide, requiring 32 M4Ks, which together provide 16KB of storage. In the worst case, system area almost doubled.
Cache Design Space – Area
a) Choose the cache depth to fill the block RAMs (M4Ks) already needed for the line size.
b) Don't use MRAMs: they are big, few, and overkill.
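The depth-selection rule in (a) can be illustrated with a short C sketch, assuming each M4K supplies 4096 bits configured 16 bits wide (the constants M4K_BITS and M4K_WIDTH are assumptions): the line width fixes the M4K count, and filling those M4Ks gives the cache depth that comes for free – for a 64B line that is 32 M4Ks, or 16KB.

/* Sketch (assumption): M4K block RAMs hold 4096 bits, used 16 bits wide.
 * The cache data array must be one line wide, so the line size fixes the
 * number of M4Ks; the depth that fills them costs no extra block RAM. */
#include <stdio.h>

#define M4K_BITS  4096u
#define M4K_WIDTH 16u

int main(void)
{
    unsigned line_bytes = 64;
    unsigned line_bits  = line_bytes * 8;           /* 512 bits */
    unsigned num_m4k    = line_bits / M4K_WIDTH;    /* 32 M4Ks  */
    unsigned free_bytes = num_m4k * M4K_BITS / 8;   /* 16384 B  */

    printf("%uB line -> %u M4Ks -> fill them with a %uKB cache\n",
           line_bytes, num_m4k, free_bytes / 1024);
    return 0;
}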
Hardware Prefetching Example
[Figure: no prefetching vs. prefetching 3 blocks]
Without prefetching, each successive vld.w misses in the Dcache and pays the 9-cycle DDR penalty. Prefetching 3 blocks on the first miss turns the subsequent vld.w accesses into hits.
Hardware Data Prefetching
We measure performance/area using a 64B-line, 16KB dcache.
• Advantages:
  • Little area overhead
  • Parallelizes memory fetching with computation
  • Uses the full memory bandwidth
• Disadvantages:
  • Cache pollution
• We use sequential prefetching (sketched below) triggered on:
  • a) any miss, or
  • b) a sequential vector instruction miss
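Below is a minimal software model of policy (a), sequential prefetching triggered on any miss – a C sketch with invented names (access_cache, PREFETCH_K), not the Verilog implementation: on a miss, the demand line is filled and the next K sequential lines are installed as well.

/* Sketch (assumption): direct-mapped cache model with sequential
 * prefetching of K blocks triggered on any miss. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_SETS   256u
#define LINE_BYTES 64u
#define PREFETCH_K 3u

static unsigned tags[NUM_SETS];
static bool     valid[NUM_SETS];

static void install(unsigned line_addr)
{
    unsigned set = line_addr % NUM_SETS;
    tags[set]  = line_addr / NUM_SETS;
    valid[set] = true;
}

/* Returns true on a hit; on a miss, fills the demand line and prefetches
 * the next PREFETCH_K sequential lines. */
static bool access_cache(unsigned byte_addr)
{
    unsigned line_addr = byte_addr / LINE_BYTES;
    unsigned set = line_addr % NUM_SETS;
    if (valid[set] && tags[set] == line_addr / NUM_SETS)
        return true;

    install(line_addr);                  /* demand fetch        */
    for (unsigned k = 1; k <= PREFETCH_K; k++)
        install(line_addr + k);          /* sequential prefetch */
    return false;
}

int main(void)
{
    printf("%d\n", access_cache(0));     /* 0: miss, prefetches lines 1..3 */
    printf("%d\n", access_cache(64));    /* 1: hit thanks to the prefetch  */
    return 0;
}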
Prefetching K Blocks – Any Miss
[Chart: speedup vs. K for each benchmark]
Only half the benchmarks are significantly sped up; some are not receptive at all. Maximum speedup is 2.2x, with a peak average speedup of 28%.
Prefetching Area Cost: Writeback Buffer
[Figure: prefetching 3 blocks evicts dirty lines into a writeback buffer in front of the Dcache]
When a prefetch would evict dirty lines, there are two options: deny the prefetch, or buffer all dirty lines. The area cost of the writeback buffer is small – 1.6% of system area, mostly block RAMs with little logic – and there is no clock frequency impact.
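A rough C sketch of the "buffer all dirty lines" option (evict_for_prefetch and WB_DEPTH are invented names, and the real logic is in Verilog); falling back to denying the prefetch when the buffer is full is my assumption, since the slide presents denying and buffering as the two alternatives.

/* Sketch (assumption): the "buffer all dirty lines" option; the other
 * option on the slide is simply to deny any prefetch whose victim is dirty. */
#include <stdbool.h>

#define WB_DEPTH   8u
#define LINE_BYTES 64u

struct cache_line { unsigned addr; unsigned char data[LINE_BYTES]; };

static struct cache_line wb_buf[WB_DEPTH];  /* writeback buffer     */
static unsigned wb_count;                   /* dirty lines buffered */

/* Returns false if the prefetch must be denied. */
static bool evict_for_prefetch(const struct cache_line *victim, bool dirty)
{
    if (!dirty)
        return true;                 /* clean victim: nothing to write back  */
    if (wb_count == WB_DEPTH)
        return false;                /* buffer full: deny this prefetch      */
    wb_buf[wb_count++] = *victim;    /* queue the dirty line for writeback   */
    return true;
}

int main(void)
{
    struct cache_line victim = { 0x1000, {0} };
    return evict_for_prefetch(&victim, true) ? 0 : 1;
}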
Any Miss vs Sequential Vector Miss
The two triggers perform identically (their curves are collinear) – nearly all misses in our benchmarks come from sequential vector memory instructions.
Vector Length Prefetching
• Previously: a constant number of cache lines prefetched
• Now: prefetch a multiple of the vector length
  • Only for sequential vector memory instructions
  • E.g. a vector load of 32 elements (see the sketch below)
• Guarantees <= 1 miss per vector memory instruction
[Figure: vld.w of elements 0–31, showing the demand fetch plus the prefetched region scaled by k]
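A small C sketch of the prefetch amount (prefetch_lines and its parameters are invented names): on a miss by a sequential vector memory instruction, prefetch k times the vector length's worth of data, so for k >= 1 the rest of the vector is resident after the first miss – at most one miss per vector memory instruction.

/* Sketch (assumption): cache lines prefetched on a miss by a sequential
 * vector memory instruction, as a multiple k of the vector length. */
#include <stdio.h>

static unsigned prefetch_lines(unsigned vl, unsigned elem_bytes,
                               unsigned line_bytes, unsigned k)
{
    unsigned bytes = k * vl * elem_bytes;
    return (bytes + line_bytes - 1) / line_bytes;   /* ceiling division */
}

int main(void)
{
    /* Vector load of 32 32-bit elements with 64B cache lines */
    printf("1*VL: %u lines\n", prefetch_lines(32, 4, 64, 1));  /* 2  */
    printf("8*VL: %u lines\n", prefetch_lines(32, 4, 64, 8));  /* 16 */
    return 0;
}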
Vector Length Prefetching – Performance
[Chart: speedup vs. prefetch multiple of VL]
1*VL prefetching provides good speedup (21% average) without tuning and with no cache pollution; 8*VL is best, with a peak average speedup of 29% and a maximum of 2.2x. Some benchmarks are not receptive.
Overall Memory System Performance
[Chart: memory cycle breakdown for the 4KB and 16KB configurations: 67%, 48%, 31%, 4%]
A wider cache line plus prefetching reduces memory unit stall cycles significantly (to 31%) and eliminates all but 4% of miss cycles.
Improved Scalability
Previously: 3–8x speedup range, average of 5x for 16 lanes. Now: 6–13x range, average of 10x for 16 lanes.
Summary
• Explored the cache design space:
  • ~2x performance for ~2x system area (area growth due largely to the memory crossbar)
  • Widened the cache line size to 64B and the depth to 16KB
• Enhanced VESPA with hardware data prefetching:
  • Up to 2.2x performance, average of 28% for K=15
  • Vector length prefetcher gains 21% on average for 1*VL – good for mixed workloads, no tuning, no cache pollution
  • Peak at 8*VL, with an average speedup of 29%
• Overall, improved the VESPA memory system & scalability:
  • Decreased miss cycles to 4%
  • Decreased memory unit stall cycles to 31%
Vector Memory Unit
[Figure: vector memory unit datapath – memory request queue, per-lane address generation, read/write crossbars, Dcache, and memory write queue]
With L = #lanes − 1 and Memory Lanes = 4, each lane i computes its address as base + stride*i for strided accesses or base + index_i for indexed accesses; read data (rddata0..rddataL) returns through the read crossbar and write data (wrdata0..wrdataL) goes through the write crossbar to the Dcache.
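A behavioural C sketch of the per-lane address generation in the diagram (gen_addresses and NUM_LANES are invented names; the real unit is Verilog): with Memory Lanes = 4, each lane i computes base + stride*i for strided accesses or base + index[i] for indexed ones.

/* Sketch (assumption): per-cycle address generation for the vector
 * memory unit, for either strided or indexed accesses. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_LANES 4u   /* "Memory Lanes = 4" in the diagram */

static void gen_addresses(unsigned base, int stride,
                          const unsigned *index, bool indexed,
                          unsigned addr[NUM_LANES])
{
    for (unsigned i = 0; i < NUM_LANES; i++)         /* i = 0 .. L */
        addr[i] = indexed ? base + index[i]
                          : base + (unsigned)(stride * (int)i);
}

int main(void)
{
    unsigned addr[NUM_LANES];
    gen_addresses(0x1000, 4, NULL, false, addr);     /* unit-stride words */
    for (unsigned i = 0; i < NUM_LANES; i++)
        printf("lane %u -> 0x%x\n", i, addr[i]);
    return 0;
}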