Accelerator Compiler for the VENICE Vector Processor Zhiduo Liu Supervisor: Guy Lemieux Sep. 28th, 2012
Outline: Motivation Background Implementation Results Conclusion
Motivation
Platforms: FPGA, multi-core, many-core, GPU, vector processor, computer clusters
Languages, APIs and extensions: VHDL, Verilog, System Verilog, Bluespec, ParC, Cilk, Erlang, OpenMP, OpenCL, aJava, SSE, MPI, OpenGL, Pthread, X10, CUDA, StreamIt, Sh, OpenHMPP, Fortress, Sponge, Chapel, …
Motivation: Simplification
Motivation Single Description …
Contributions
• The compiler serves as a new back-end of a single-description multiple-device language.
• The compiler makes VENICE easier to program and debug.
• The compiler provides auto-parallelization and optimization.
[1] Z. Liu, A. Severance, S. Singh and G. Lemieux, “Accelerator Compiler for the VENICE Vector Processor,” in FPGA 2012.
[2] C. Chou, A. Severance, A. Brant, Z. Liu, S. Sant and G. Lemieux, “VEGAS: Soft Vector Processor with Scratchpad Memory,” in FPGA 2011.
Outline: Motivation Background Implementation Results Conclusion
Complicated
[Figure: VENICE pipeline, with blocks labeled ALIGN, WR, RD, ALIGN, EX1, EX2, ACCUM]
Program in VENICE assembly
• Allocate vectors in scratchpad
• Move data from main memory to scratchpad
• Wait for DMA transaction to be completed
• Setup for vector instructions
• Perform vector computations
• Wait for vector operations to be completed
• Move data from scratchpad to main memory
• Wait for DMA transaction to be completed
• Deallocate memory from scratchpad

    #include "vector.h"

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};
        const int data_len = sizeof( A );                 // array size in bytes
        int *va = ( int * ) vector_malloc( data_len );    // allocate a vector in scratchpad
        vector_dma_to_vector( va, A, data_len );          // main memory -> scratchpad
        vector_wait_for_dma();                            // wait for the DMA transaction
        vector_set_vl( data_len / sizeof(int) );          // setup: vector length in elements
        vector( SVW, VADD, va, 42, va );                  // vector computation: va = va + 42
        vector_instr_sync();                              // wait for vector operations
        vector_dma_to_host( A, va, data_len );            // scratchpad -> main memory
        vector_wait_for_dma();                            // wait for the DMA transaction
        vector_free();                                    // deallocate scratchpad memory
    }
Program in Accelerator
• Create a Target
• Create Parallel Array objects
• Write expressions
• Call ToArray to evaluate expressions
• Delete Target object

    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        int A[] = {1,2,3,4,5,6,7,8};
        Target *tgt = CreateVectorTarget();
        IPA b = IPA( A, sizeof(A)/sizeof(int) );
        IPA c = b + 42;
        tgt->ToArray( c, A, sizeof(A)/sizeof(int) );
        tgt->Delete();
    }

Other targets:

    Target *tgt = CreateMulticoreTarget();
    Target *tgt = CreateDX9Target();
Assembly Programming:
Write Assembly → Compile with Gcc → (Doesn’t compile? Go back to writing) → Download to board → Get Result → (Result incorrect? Go back to writing)
Accelerator Programming:
Write in Accelerator → Compile with Microsoft Visual Studio → (Doesn’t compile? Or result incorrect? Go back to writing) → Compile with Gcc → Download to board → Get Result
Assembly Programming:
• Hard to program
• Long debug cycle
• Not portable
• Manual: not always optimal or correct (wysiwyg)
Accelerator Programming:
• Easy to program
• Easy to debug
• Can also target other devices
• Automated compiler optimizations
Outline: Motivation Background Implementation Results Conclusion
    #include "Accelerator.h"
    using namespace ParallelArrays;
    using namespace MicrosoftTargets;

    int main()
    {
        Target *tgtVector = CreateVectorTarget();
        const int length = 8192;
        int a[] = {1,2,3,4, … , 8192};
        int d[length];
        IPA A = IPA( a, length );
        IPA B = Evaluate( Rotate(A, [1]) + 1 );
        IPA C = Evaluate( Abs( A + 2 ) );
        IPA D = ( A + B ) * C;
        tgtVector->ToArray( D, d, length * sizeof(int) );
        tgtVector->Delete();
    }

[Figure: expression graph for D] D = (A + (Rot A + 1)) × Abs(A + 2)
[Figure: expression graph for D] The Rot node on A is replaced by a leaf, A (rot): D = (A + (A (rot) + 1)) × Abs(A + 2)
[Figure: the graph is split at each Evaluate into sub-graphs B, C and D]
B = A (rot) + 1
C = Abs(A + 2)
D = (A + B) × C
Combine Operations
[Figure: in sub-graph C, the Abs and + nodes are merged into a single node, |+|]
C = |+|(A, 2)
B = A (rot) + 1
D = (A + B) × C
Scratchpad Memory: “Virtual Vector Register File”
• Number of vector registers = ?
• Vector register size = ?
Evaluation Order
[Figure: the sub-graphs B, C and D, annotated with numeric labels]
Count number of virtual vector registers
[Figure: animation stepping through the sub-graphs B, C and D]