MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson, Pengyuan Yu, Sumit Ahuja, Sandeep Shukla, Patrick Schaumont Electrical and Computer Engineering Department, Virginia Tech
Table of Contents • Section 1 Performance Evaluation and Analysis • Section 2 Matrix Multiplication Algorithm Optimization • Section 3 HW/SW System Implementation • Section 4 Co-design Flow and Methodology • Section 5 Conclusion
Performance Results Section 1 Performance Evaluation and Analysis
Performance Calculation • F_CPU-Speed = 1 (we used the 300 MHz PPC) • F_FPGA-Capacity = 1 (we used the XUP's XC2VP30) • F_FPGA-Speed = 1 (we used a 100 MHz clock for the bus and coprocessor) • Time_Effective = (T_meas,N=1024 + 64 * T_meas,N=256) * F_CPU-Speed * F_FPGA-Capacity * F_FPGA-Speed = (11.882 + 64 * 0.217) * 1 * 1 * 1 = 25.77 seconds Section 1 Performance Evaluation and Analysis
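As a cross-check of the arithmetic above, here is a minimal C sketch of the contest's effective-time formula using the measured values from this slide (the variable names are ours, not from the submission):

```c
#include <stdio.h>

int main(void)
{
    /* Scaling factors: all 1 for a 300 MHz PPC, an XC2VP30, and a 100 MHz FPGA clock */
    const double f_cpu_speed     = 1.0;
    const double f_fpga_capacity = 1.0;
    const double f_fpga_speed    = 1.0;

    /* Measured wall-clock times in seconds */
    const double t_meas_1024 = 11.882;  /* one N = 1024 multiplication          */
    const double t_meas_256  = 0.217;   /* one N = 256 multiplication, x64 runs */

    double t_effective = (t_meas_1024 + 64.0 * t_meas_256)
                         * f_cpu_speed * f_fpga_capacity * f_fpga_speed;

    printf("Effective time: %.2f s\n", t_effective);  /* prints 25.77 s */
    return 0;
}
```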
Algorithm Optimization • The algorithm is optimized for the target platform (Virtex-II Pro VP30) • Optimization goals: • Best utilize the slow DDR memory interface • Optimal transfers are 128 bits/cycle => 4 complex numbers • Linear accesses result in better throughput • Utilize as many of the fast discrete FPGA resources as possible • 136 18x18 hardware multipliers • 136 18-kbit Block RAMs Section 2 Matrix Multiplication Algorithm Optimization
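For illustration only, a hedged C sketch of how four complex numbers could fill one 128-bit DDR transfer, assuming 16-bit fixed-point real and imaginary parts (the element width is an assumption on our part; the slide only fixes 4 complex numbers per 128-bit beat):

```c
#include <stdint.h>

/* Assumed element format: 16-bit real + 16-bit imaginary = 32 bits per complex number */
typedef struct {
    int16_t re;
    int16_t im;
} cplx16_t;

/* One optimal DDR transfer: 128 bits = 4 packed complex numbers */
typedef struct {
    cplx16_t elem[4];
} ddr_beat_t;
```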
Optimized Algorithm (animation legend): • [A] currently in coprocessor • [A] currently used for calculation • [B] currently used for calculation • [C] stored and accumulated in BRAM • [C] being multiplied and accumulated Section 2 Matrix Multiplication Algorithm Optimization
Optimized Algorithm • Bring in 4 complex numbers from "A" Section 2 Matrix Multiplication Algorithm Optimization
Optimized Algorithm • Bring in four numbers from "B" and perform the following calculations: • C[0][0] = C[0][0] + A[0][0]*B[0][0] • C[0][1] = C[0][1] + A[0][0]*B[0][1] • C[0][2] = C[0][2] + A[0][0]*B[0][2] • C[0][3] = C[0][3] + A[0][0]*B[0][3] • … • C[7][0] = C[7][0] + A[7][0]*B[0][0] • C[7][1] = C[7][1] + A[7][0]*B[0][1] • C[7][2] = C[7][2] + A[7][0]*B[0][2] • C[7][3] = C[7][3] + A[7][0]*B[0][3] • where "A*B" is a complex multiplication • 32 complex multiplications in parallel = 128 multiplies, 64 additions/subtractions, and 64 accumulates per cycle Section 2 Matrix Multiplication Algorithm Optimization
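A behavioral C sketch of the work listed above for one coprocessor cycle, assuming the 8x4 A-block of this section (the hardware performs all 32 complex MACs in parallel; the loops are only for readability, and plain integer element types stand in for the actual fixed-point format):

```c
typedef struct { int re, im; } cplx;

/* One cycle: one column of the 8x4 A-block (a[0..7]) times four values from
 * one row of B (b[0..3]), accumulated into an 8x4 window of the C-slice held
 * in BRAM: 32 complex MACs = 128 multiplies, 64 add/subs, 64 accumulates. */
static void mac_cycle(const cplx a[8], const cplx b[4], cplx c[8][4])
{
    for (int i = 0; i < 8; i++) {        /* rows of the A-block / C-slice */
        for (int j = 0; j < 4; j++) {    /* the 4 B values just brought in */
            c[i][j].re += a[i].re * b[j].re - a[i].im * b[j].im;
            c[i][j].im += a[i].re * b[j].im + a[i].im * b[j].re;
        }
    }
}
```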
Optimized Algorithm • At this point we have completed calculating the first 8xN rows of C in our coprocessor and we write the results back to RAM Section 2 Matrix Multiplication Algorithm Optimization
Optimized Algorithm • Next, we repeat the previous algorithm to calculate the next 8xN C-slice Section 2 Matrix Multiplication Algorithm Optimization
Optimized Algorithm • Performs 128 MACs per cycle (utilizing 128 out of 136 hard multipliers) • Linear scan through B matrix (optimizing interface to DDR storage) Section 2 Matrix Multiplication Algorithm Optimization
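Putting the slices together, a behavioral C sketch of the blocked schedule described in this section (square N x N matrices and integer elements are assumed for brevity; the coprocessor executes the two innermost loops in a single cycle, and C is assumed to be zeroed by the caller):

```c
#define N 1024
typedef struct { int re, im; } cplx;

/* Blocked complex matrix multiply C += A * B:
 * 8-row C-slices accumulated in BRAM, 8x4 A-blocks, linear scan of B. */
void blocked_matmul(const cplx A[N][N], const cplx B[N][N], cplx C[N][N])
{
    for (int i0 = 0; i0 < N; i0 += 8) {           /* one 8xN C-slice at a time  */
        for (int k0 = 0; k0 < N; k0 += 4) {       /* one 8x4 A-block at a time  */
            for (int k = k0; k < k0 + 4; k++) {   /* B rows, scanned linearly   */
                for (int j = 0; j < N; j += 4) {  /* 4 B values per transfer    */
                    /* the coprocessor does these 32 complex MACs in one cycle */
                    for (int i = i0; i < i0 + 8; i++)
                        for (int jj = j; jj < j + 4; jj++) {
                            C[i][jj].re += A[i][k].re * B[k][jj].re
                                         - A[i][k].im * B[k][jj].im;
                            C[i][jj].im += A[i][k].re * B[k][jj].im
                                         + A[i][k].im * B[k][jj].re;
                        }
                }
            }
        }
        /* the finished 8xN C-slice is then written back to DDR */
    }
}
```

With this loop order, each 8xN C-slice reads the B matrix exactly once, row by row, which is the linear DDR access pattern noted above.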
System Architecture (block diagram of the system built around the Processor Local Bus) Section 3 HW/SW System Implementation
Coprocessor Architecture vs. Optimized Algorithm • Minor deviation from the proposed algorithm • Coprocessor I/O size: B elements are loaded 2 at a time instead of 4 • The PLB DMA failed to function, leaving a much slower DDR -> PPC -> coprocessor-FIFO datapath • The 64-bit FIFO width means the PPC sends 2 complex numbers per transfer to the coprocessor FIFO • To maintain the same calculation capacity, the A-block dimension was doubled from 8x4 to 16x4 and the C-slice from 8xN to 16xN • Still utilizes 128 hardware multipliers (16 rows x 2 B values per cycle = 32 complex multiplications, at 4 multipliers each) Section 3 HW/SW System Implementation
Coprocessor Architecture • The coprocessor is scalable! • Reduce the depth of the A-matrix subblock to reduce the number of MAC units needed Section 3 HW/SW System Implementation
Coprocessor Architecture Section 3 HW/SW System Implementation
MAC Unit Architecture Section 3 HW/SW System Implementation
MAC Unit Architecture (datapath: the input "B" value and stored "A" values feed a complex multiply-accumulate; BlockRAM storage holds the current "C" value) Section 3 HW/SW System Implementation
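A hedged C sketch of one MAC unit's datapath as labeled in the figure: the incoming "B" value and a stored "A" value feed a complex multiply, and the running "C" value is read from, and written back to, BlockRAM (modeled here as a plain array; names are ours):

```c
typedef struct { int re, im; } cplx;

/* One MAC unit: read the current C value from its BlockRAM, add the complex
 * product of the stored A value and the incoming B value, write it back. */
static void mac_unit(cplx c_bram[], int c_addr, cplx a_reg, cplx b_in)
{
    cplx c = c_bram[c_addr];                           /* BlockRAM read       */
    c.re += a_reg.re * b_in.re - a_reg.im * b_in.im;   /* complex multiply-   */
    c.im += a_reg.re * b_in.im + a_reg.im * b_in.re;   /* accumulate          */
    c_bram[c_addr] = c;                                /* BlockRAM write-back */
}
```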
Design Flow Reference C Algorithm -> (Rectangular-Block Transformation) -> Optimized C Algorithm -> (Manual Partitioning) -> Driver C Algorithm + GEZEL Coprocessor -> (Cosimulation) -> PPC Binary + VHDL -> (Synthesis) -> XUP Board -> Performance Analysis Section 4 Co-design Flow and Methodology
Simulation • Workstation: Reference C Algorithm and Optimized C Algorithm • Cycle-based instruction-set cosimulator: Driver C Algorithm + GEZEL Coprocessor • XUP Board FPGA: PPC Binary + VHDL Section 4 Co-design Flow and Methodology
Simulation • Simulation-based verification on three levels: • Workstation (behavioral) • Cycle-based ISS (functional model of the coprocessor) • FPGA board (skipping VHDL simulation, since synthesis is swift and easy) • Drawback: simulations capture only behavior, not the architecture • Example: hard to estimate post-synthesis timing • Example: hard to reflect memory-bus behavior (DMA, DDR, ...) in a C simulation model Section 4 Co-design Flow and Methodology