VEGAS: Soft Vector Processor with Scratchpad Memory
Christopher Han-Yu Chou, Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy Lemieux
University of British Columbia
Motivation
• Embedded processing on FPGAs
  • High performance, computationally intensive
  • Soft processors, e.g. Nios/MicroBlaze, too slow
• How to deliver high performance?
  • Multiprocessor on FPGA
  • Custom hardware accelerators (Verilog RTL)
  • Synthesized accelerators (C to FPGA)
Motivation
• Soft vector processor to the rescue
• Previous work has demonstrated the soft vector processor as a viable option that provides:
  • Scalable performance and area
  • A purely software-based flow
  • Decoupled hardware/software development
• Key performance bottlenecks
  • Memory access latency
  • On-chip data storage efficiency
Contribution
• Key features of the VEGAS architecture
  • Cacheless scratchpad memory
  • Fracturable ALUs
  • Concurrent memory access via DMA
• Advantages
  • Eliminates on-chip data replication
  • Supports a huge number of vectors and long vector lengths
  • More parallel ALUs
  • Fewer memory loads/stores
VEGAS Architecture
• Vector core: VEGAS @ 120 MHz
• Scalar core: Nios II/f @ 200 MHz
• Concurrent execution, synchronized through a FIFO instruction queue (programming-flow sketch below)
• VEGAS DMA engine and external DDR2 memory
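To make the division of labour concrete, the sketch below shows how a scalar-core program might drive the vector core and DMA engine. All vegas_* names are hypothetical placeholders for whatever macros/intrinsics the real VEGAS toolchain provides (the slides do not show the API); this is a sketch of the flow, not the actual interface.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical prototypes, assumed for illustration only. */
void vegas_dma_to_scratchpad(size_t scratch_off, const void *ddr_src, size_t bytes);
void vegas_dma_from_scratchpad(void *ddr_dst, size_t scratch_off, size_t bytes);
void vegas_set_vl(int vl);                              /* set vector length            */
void vegas_vadd(size_t dst, size_t srcA, size_t srcB);  /* operands = scratchpad offsets */
void vegas_sync(void);                                  /* wait for vector core + DMA   */

void add_arrays(const int32_t *a, const int32_t *b, int32_t *c, int n)
{
    size_t bytes = (size_t)n * sizeof(int32_t);

    /* 1. DMA both operands from DDR2 into the scratchpad. */
    vegas_dma_to_scratchpad(0,     a, bytes);
    vegas_dma_to_scratchpad(bytes, b, bytes);

    /* 2. Queue the vector work; the Nios II/f keeps running while the
     *    vector core drains the FIFO instruction queue.               */
    vegas_set_vl(n);
    vegas_vadd(2 * bytes, 0, bytes);          /* c = a + b, entirely in scratchpad */

    /* 3. Synchronize, then DMA the result back out to DDR2. */
    vegas_sync();
    vegas_dma_from_scratchpad(c, 2 * bytes, bytes);
}
```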
Scratchpad Memory in Action
[Diagram: the vector scratchpad memory supplies srcA, srcB, and Dest operands directly to Vector Lanes 0–3]
Scratchpad Advantage
• Performance
  • Huge working set (256 kB and up)
  • Explicitly managed by software
  • Asynchronous load/store via concurrent DMA (double-buffering sketch below)
• Efficient data storage
  • Double-clocked memory (a traditional vector register file needs 2x copies)
  • 8b data stays 8b (a traditional register file needs 4x copies)
  • No cache (a traditional register file adds +1 copy)
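Because the scratchpad is software-managed and the DMA engine runs concurrently with the vector core, data movement can be overlapped with computation by double buffering. A minimal sketch follows, reusing the hypothetical vegas_* prototypes declared in the architecture sketch above; it further assumes vegas_sync() waits for both the vector core and all outstanding DMA transfers.

```c
/* Double-buffering sketch: prefetch the next tile while computing the
 * current one.  The vegas_vadd call stands in for an arbitrary kernel. */
void process_tiles(const int32_t *src, int32_t *dst, int tiles, int tile_elems)
{
    size_t tile_bytes = (size_t)tile_elems * sizeof(int32_t);
    size_t buf[2] = { 0, tile_bytes };                 /* two scratchpad regions */

    vegas_dma_to_scratchpad(buf[0], src, tile_bytes);  /* prefetch tile 0 */
    for (int t = 0; t < tiles; t++) {
        int cur = t & 1, nxt = cur ^ 1;
        vegas_sync();                  /* tile t resident, previous write-back done */
        if (t + 1 < tiles)             /* fetch tile t+1 while the vector core works */
            vegas_dma_to_scratchpad(buf[nxt],
                                    src + (size_t)(t + 1) * tile_elems, tile_bytes);
        vegas_set_vl(tile_elems);
        vegas_vadd(buf[cur], buf[cur], buf[cur]);    /* stand-in kernel: x = x + x */
        vegas_sync();                  /* kernel finished before copying results out */
        vegas_dma_from_scratchpad(dst + (size_t)t * tile_elems, buf[cur], tile_bytes);
    }
    vegas_sync();                      /* drain the final write-back */
}
```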
Scratchpad Advantage
• Vectors are accessed through address registers
  • A huge number of vectors can live in the scratchpad
  • VEGAS uses only 8 vector address registers (V0..V7)
  • Modify a register's contents to access a different vector
  • Auto-increment lessens the need to change V0..V7 (sketch below)
• Long vector lengths
  • A single vector can fill the entire scratchpad
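The sketch below illustrates how address registers plus auto-increment let a loop walk through many scratchpad vectors without unrolling. The names vegas_set_vaddr and vegas_vadd_autoinc are assumed for illustration and are not the real VEGAS instruction names; the other prototypes are as in the architecture sketch above.

```c
/* Hypothetical illustration of vector address registers V0..V7. */
enum vreg { V0, V1, V2, V3, V4, V5, V6, V7 };

void vegas_set_vaddr(enum vreg r, size_t scratch_off);            /* assumed name */
void vegas_vadd_autoinc(enum vreg dst, enum vreg a, enum vreg b); /* assumed name */

void sum_rows(size_t rows_base, size_t acc_base, int rows, int row_elems)
{
    vegas_set_vl(row_elems);
    vegas_set_vaddr(V1, acc_base);      /* accumulator vector in the scratchpad */
    vegas_set_vaddr(V2, rows_base);     /* first of many row vectors            */

    for (int r = 0; r < rows; r++) {
        /* V1 += V2; with post-increment enabled, V2 advances to the next row
         * by itself, so only 8 address registers are ever needed even though
         * the scratchpad holds far more than 8 vectors.                       */
        vegas_vadd_autoinc(V1, V1, V2);
    }
    vegas_sync();
}
```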
Scratchpad Advantage: Median Filter
• Vector address registers are easier than loop unrolling
• Traditional vector median filter:

    for j = 0..12
      for i = j..24
        V1 = vector[i]          ; vector load
        V2 = vector[j]          ; vector load
        CompareAndSwap(V1, V2)
        vector[j] = V2          ; vector store
        vector[i] = V1          ; vector store

• Optimize away one vector load and one vector store using a temporary
• Total of 222 loads and 222 stores
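For reference, the same partial bubble sort written per pixel in plain, scalar C is shown below (my reconstruction from the loop bounds on the slide, not code from the paper). On VEGAS, each v[i] would instead be an entire vector of pixels held in the scratchpad, and CompareAndSwap would operate on whole vectors at once.

```c
#include <stdint.h>

/* Scalar reference for a 5x5 median: partially bubble-sort the 25 window
 * values until the first 13 are in place; v[12] is then the median.     */
static void compare_and_swap(int32_t *a, int32_t *b)
{
    if (*a > *b) { int32_t t = *a; *a = *b; *b = t; }
}

int32_t median25(int32_t v[25])
{
    for (int j = 0; j <= 12; j++)
        for (int i = j + 1; i < 25; i++)
            compare_and_swap(&v[j], &v[i]);   /* keep the smaller value in v[j] */
    return v[12];                             /* 13th smallest = median of 25   */
}
```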
Fracturable ALUs
• Multiplier: built from 4 x 16b multipliers
• Multiplier also performs shifts and rotates
• Adder: built from 4 x 8b adders
Fracturable ALUs Advantage
• Increased processing power (4-lane VEGAS)
  • 4 x 32b operations / cycle
  • 8 x 16b operations / cycle
  • 16 x 8b operations / cycle
• Median filter example
  • 32b data: 184 cycles / pixel
  • 16b data: 93 cycles / pixel
  • 8b data: 47 cycles / pixel
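A back-of-the-envelope relation (my wording, not from the slides) explains the scaling: each 32b lane fractures into narrower sub-ALUs, so

$$\text{ops per cycle} = N_{\text{lanes}} \times \frac{32\,\text{b}}{\text{element width}}$$

giving 4, 8, and 16 operations per cycle for a 4-lane VEGAS at 32b, 16b, and 8b, which matches the roughly 2x drop in median-filter cycles per pixel (184 → 93 → 47) each time the element width is halved.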
Area-Delay Product
• Area x Delay measures "throughput per mm²"
• Compared to earlier soft vector processors, VEGAS offers 2-3x better throughput per unit area
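As a worked definition (assumed phrasing; the slide gives only the metric's name): for a design with area $A$ and execution time $T$,

$$\text{area-delay product} = A \times T, \qquad \text{throughput per unit area} \propto \frac{1}{A \times T},$$

so a 2-3x lower area-delay product corresponds to 2-3x more work per second from the same amount of FPGA area.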
Integer Matrix Multiply
• 4096 x 4096 integers (64 MB data set)
• Intel Core 2 (65nm), 2.5 GHz, 16 GB DDR2 (ijk vs kij loop orders; sketch below)
  • Vanilla IJK: 474 s
  • Vanilla KIJ: 134 s
  • Tiled IJK: 93 s
  • Tiled KIJ: 68 s
• VEGAS (65nm Altera Stratix III)
  • Vector: 44 s (Nios only: 5407 s)
  • 256 kB scratchpad, 32 lanes (about 50% of the chip)
  • 200 MHz Nios, 100 MHz vector core, 1 GB DDR2 SODIMM
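The CPU variants differ only in loop order and tiling: ijk strides down a column of B in the inner loop, while kij streams rows of B and C with unit stride, which is also the access pattern that maps naturally onto long vectors and the scratchpad. Below is a minimal plain-C sketch of the two untiled orderings (illustrative only; not the benchmark code behind the numbers above).

```c
#include <stdint.h>
#define N 4096   /* matrices are N x N, stored row-major */

/* ijk order: the inner loop walks down a column of B, touching a new
 * cache line on nearly every iteration — poor locality.              */
void matmul_ijk(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int32_t sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

/* kij order: the inner loop reads a row of B and updates a row of C with
 * unit stride — much better cache behaviour, and the same row-streaming
 * pattern a vector engine wants.  C must be zeroed by the caller.       */
void matmul_kij(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            int32_t a = A[i * N + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
}
```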
Conclusions
• Key features
  • Scratchpad memory
    • Enhances performance with fewer loads/stores
    • No on-chip data replication; efficient storage
    • Double-clocked to hide memory latency
  • Fracturable ALUs
    • Operate efficiently on 8b, 16b, and 32b data
  • A single vector core accelerates many applications
• Results
  • 2-3x better area-delay product than VIPERS/VESPA
  • Outperforms an Intel Core 2 at integer matrix multiply
Issues / Future Work
• No floating-point yet
  • Adding "complex function" support, to include floating-point or similar operations
• Algorithms with only short vectors
  • Split the vector processor into 2, 4, or 8 pieces
  • Run multiple instances of the algorithm
• Multiple vector processors
  • Connect them to work cooperatively
  • Goals: increase throughput, exploit task-level parallelism (i.e., chaining or pipelining)