VEGAS: Soft Vector Processor with Scratchpad Memory
Christopher Han-Yu Chou, Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux
University of British Columbia
Motivation
• Embedded processing on FPGAs
  • High performance, computationally intensive
  • Soft processors, e.g. Nios/MicroBlaze, too slow
• How to deliver high performance?
  • Multiprocessor on FPGA
  • Custom hardware accelerators (Verilog RTL)
  • Synthesized accelerators (C to FPGA)
Motivation
• Soft vector processor to the rescue
• Previous work has demonstrated the soft vector processor as a viable option that provides:
  • Scalable performance and area
  • A purely software-based flow
  • Decoupled hardware/software development
• Key performance bottlenecks
  • Memory access latency
  • On-chip data storage efficiency
Contribution
• Key features of the VEGAS architecture
  • Cacheless scratchpad memory
  • Fracturable ALUs
  • Concurrent memory access via DMA
• Advantages
  • Eliminates on-chip data replication
  • Supports a huge number of vectors and long vector lengths
  • More parallel ALUs
  • Fewer memory loads/stores
VEGAS Architecture
• Vector core: VEGAS @ 120 MHz
• Scalar core: Nios II/f @ 200 MHz
• Concurrent execution, synchronized through a FIFO instruction queue (programming-flow sketch below)
• VEGAS DMA engine and external DDR2 memory
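To make the division of labour concrete, the sketch below shows how a scalar-core program might drive the vector core and DMA engine. All vegas_* names are hypothetical placeholders for whatever macros/intrinsics the real VEGAS toolchain provides (the slides do not show the API); this is a sketch of the flow, not the actual interface.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical prototypes, assumed for illustration only. */
void vegas_dma_to_scratchpad(size_t scratch_off, const void *ddr_src, size_t bytes);
void vegas_dma_from_scratchpad(void *ddr_dst, size_t scratch_off, size_t bytes);
void vegas_set_vl(int vl);                              /* set vector length            */
void vegas_vadd(size_t dst, size_t srcA, size_t srcB);  /* operands = scratchpad offsets */
void vegas_sync(void);                                  /* wait for vector core + DMA   */

void add_arrays(const int32_t *a, const int32_t *b, int32_t *c, int n)
{
    size_t bytes = (size_t)n * sizeof(int32_t);

    /* 1. DMA both operands from DDR2 into the scratchpad. */
    vegas_dma_to_scratchpad(0,     a, bytes);
    vegas_dma_to_scratchpad(bytes, b, bytes);

    /* 2. Queue the vector work; the Nios II/f keeps running while the
     *    vector core drains the FIFO instruction queue.               */
    vegas_set_vl(n);
    vegas_vadd(2 * bytes, 0, bytes);          /* c = a + b, entirely in scratchpad */

    /* 3. Synchronize, then DMA the result back out to DDR2. */
    vegas_sync();
    vegas_dma_from_scratchpad(c, 2 * bytes, bytes);
}
```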
Scratchpad Memory in Action
[Diagram: the vector scratchpad memory supplies srcA, srcB, and Dest operands directly to Vector Lanes 0–3]
Scratchpad Advantage
• Performance
  • Huge working set (256 kB and up)
  • Explicitly managed by software
  • Asynchronous load/store via concurrent DMA (double-buffering sketch below)
• Efficient data storage
  • Double-clocked memory (a traditional vector register file needs 2x copies)
  • 8b data stays 8b (a traditional register file needs 4x copies)
  • No cache (a traditional register file adds +1 copy)
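Because the scratchpad is software-managed and the DMA engine runs concurrently with the vector core, data movement can be overlapped with computation by double buffering. A minimal sketch follows, reusing the hypothetical vegas_* prototypes declared in the architecture sketch above; it further assumes vegas_sync() waits for both the vector core and all outstanding DMA transfers.

```c
/* Double-buffering sketch: prefetch the next tile while computing the
 * current one.  The vegas_vadd call stands in for an arbitrary kernel. */
void process_tiles(const int32_t *src, int32_t *dst, int tiles, int tile_elems)
{
    size_t tile_bytes = (size_t)tile_elems * sizeof(int32_t);
    size_t buf[2] = { 0, tile_bytes };                 /* two scratchpad regions */

    vegas_dma_to_scratchpad(buf[0], src, tile_bytes);  /* prefetch tile 0 */
    for (int t = 0; t < tiles; t++) {
        int cur = t & 1, nxt = cur ^ 1;
        vegas_sync();                  /* tile t resident, previous write-back done */
        if (t + 1 < tiles)             /* fetch tile t+1 while the vector core works */
            vegas_dma_to_scratchpad(buf[nxt],
                                    src + (size_t)(t + 1) * tile_elems, tile_bytes);
        vegas_set_vl(tile_elems);
        vegas_vadd(buf[cur], buf[cur], buf[cur]);    /* stand-in kernel: x = x + x */
        vegas_sync();                  /* kernel finished before copying results out */
        vegas_dma_from_scratchpad(dst + (size_t)t * tile_elems, buf[cur], tile_bytes);
    }
    vegas_sync();                      /* drain the final write-back */
}
```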
Scratchpad Advantage
• Vectors are accessed through address registers
  • A huge number of vectors can live in the scratchpad
  • VEGAS uses only 8 vector address registers (V0..V7)
  • Modify a register's contents to access a different vector
  • Auto-increment lessens the need to change V0..V7 (sketch below)
• Long vector lengths
  • A single vector can fill the entire scratchpad
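The sketch below illustrates how address registers plus auto-increment let a loop walk through many scratchpad vectors without unrolling. The names vegas_set_vaddr and vegas_vadd_autoinc are assumed for illustration and are not the real VEGAS instruction names; the other prototypes are as in the architecture sketch above.

```c
/* Hypothetical illustration of vector address registers V0..V7. */
enum vreg { V0, V1, V2, V3, V4, V5, V6, V7 };

void vegas_set_vaddr(enum vreg r, size_t scratch_off);            /* assumed name */
void vegas_vadd_autoinc(enum vreg dst, enum vreg a, enum vreg b); /* assumed name */

void sum_rows(size_t rows_base, size_t acc_base, int rows, int row_elems)
{
    vegas_set_vl(row_elems);
    vegas_set_vaddr(V1, acc_base);      /* accumulator vector in the scratchpad */
    vegas_set_vaddr(V2, rows_base);     /* first of many row vectors            */

    for (int r = 0; r < rows; r++) {
        /* V1 += V2; with post-increment enabled, V2 advances to the next row
         * by itself, so only 8 address registers are ever needed even though
         * the scratchpad holds far more than 8 vectors.                       */
        vegas_vadd_autoinc(V1, V1, V2);
    }
    vegas_sync();
}
```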
Scratchpad Advantage: Median Filter
• Vector address registers are easier than loop unrolling
• Traditional vector median filter:

    for j = 0..12
      for i = j..24
        V1 = vector[i]          ; vector load
        V2 = vector[j]          ; vector load
        CompareAndSwap(V1, V2)
        vector[j] = V2          ; vector store
        vector[i] = V1          ; vector store

• Optimize away one vector load and one vector store using a temporary
• Total of 222 loads and 222 stores
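For reference, the same partial bubble sort written per pixel in plain, scalar C is shown below (my reconstruction from the loop bounds on the slide, not code from the paper). On VEGAS, each v[i] would instead be an entire vector of pixels held in the scratchpad, and CompareAndSwap would operate on whole vectors at once.

```c
#include <stdint.h>

/* Scalar reference for a 5x5 median: partially bubble-sort the 25 window
 * values until the first 13 are in place; v[12] is then the median.     */
static void compare_and_swap(int32_t *a, int32_t *b)
{
    if (*a > *b) { int32_t t = *a; *a = *b; *b = t; }
}

int32_t median25(int32_t v[25])
{
    for (int j = 0; j <= 12; j++)
        for (int i = j + 1; i < 25; i++)
            compare_and_swap(&v[j], &v[i]);   /* keep the smaller value in v[j] */
    return v[12];                             /* 13th smallest = median of 25   */
}
```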
Fracturable ALUs
• Multiplier: built from 4 x 16b multipliers
• Multiplier also performs shifts and rotates
• Adder: built from 4 x 8b adders
Fracturable ALUs Advantage
• Increased processing power (4-lane VEGAS)
  • 4 x 32b operations / cycle
  • 8 x 16b operations / cycle
  • 16 x 8b operations / cycle
• Median filter example
  • 32b data: 184 cycles / pixel
  • 16b data: 93 cycles / pixel
  • 8b data: 47 cycles / pixel
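A back-of-the-envelope relation (my wording, not from the slides) explains the scaling: each 32b lane fractures into narrower sub-ALUs, so

$$\text{ops per cycle} = N_{\text{lanes}} \times \frac{32\,\text{b}}{\text{element width}}$$

giving 4, 8, and 16 operations per cycle for a 4-lane VEGAS at 32b, 16b, and 8b, which matches the roughly 2x drop in median-filter cycles per pixel (184 → 93 → 47) each time the element width is halved.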
Area-Delay Product
• Area x Delay measures "throughput per mm²"
• Compared to earlier soft vector processors, VEGAS offers 2-3x better throughput per unit area
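As a worked definition (assumed phrasing; the slide gives only the metric's name): for a design with area $A$ and execution time $T$,

$$\text{area-delay product} = A \times T, \qquad \text{throughput per unit area} \propto \frac{1}{A \times T},$$

so a 2-3x lower area-delay product corresponds to 2-3x more work per second from the same amount of FPGA area.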
Integer Matrix Multiply
• 4096 x 4096 integers (64 MB data set)
• Intel Core 2 (65nm), 2.5 GHz, 16 GB DDR2 (ijk vs kij loop orders; sketch below)
  • Vanilla IJK: 474 s
  • Vanilla KIJ: 134 s
  • Tiled IJK: 93 s
  • Tiled KIJ: 68 s
• VEGAS (65nm Altera Stratix III)
  • Vector: 44 s (Nios only: 5407 s)
  • 256 kB scratchpad, 32 lanes (about 50% of the chip)
  • 200 MHz Nios, 100 MHz vector core, 1 GB DDR2 SODIMM
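The CPU variants differ only in loop order and tiling: ijk strides down a column of B in the inner loop, while kij streams rows of B and C with unit stride, which is also the access pattern that maps naturally onto long vectors and the scratchpad. Below is a minimal plain-C sketch of the two untiled orderings (illustrative only; not the benchmark code behind the numbers above).

```c
#include <stdint.h>
#define N 4096   /* matrices are N x N, stored row-major */

/* ijk order: the inner loop walks down a column of B, touching a new
 * cache line on nearly every iteration — poor locality.              */
void matmul_ijk(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int32_t sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

/* kij order: the inner loop reads a row of B and updates a row of C with
 * unit stride — much better cache behaviour, and the same row-streaming
 * pattern a vector engine wants.  C must be zeroed by the caller.       */
void matmul_kij(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            int32_t a = A[i * N + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
}
```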
Conclusions
• Key features
  • Scratchpad memory
    • Enhances performance with fewer loads/stores
    • No on-chip data replication; efficient storage
    • Double-clocked to hide memory latency
  • Fracturable ALUs
    • Operate efficiently on 8b, 16b, and 32b data
  • A single vector core accelerates many applications
• Results
  • 2-3x better area-delay product than VIPERS/VESPA
  • Outperforms an Intel Core 2 at integer matrix multiply
Issues / Future Work
• No floating-point yet
  • Adding "complex function" support, to include floating-point or similar operations
• Algorithms with only short vectors
  • Split the vector processor into 2, 4, or 8 pieces
  • Run multiple instances of the algorithm
• Multiple vector processors
  • Connect them to work cooperatively
  • Goals: increase throughput, exploit task-level parallelism (i.e., chaining or pipelining)