Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers?
Stephen Craven, Cameron Patterson, Peter Athanas
Configurable Computing Lab, Virginia Tech
Outline
• Background
• Large Integer Multiplication
• GIMPS
• Algorithm Comparison
• Floating-point FFT
• All-integer FFT
• Fast Galois Transform
• Accelerator Design
• System Design
• Operation
• Performance
• Improvements & Future Work
Large Integer Multiplication
• Complexity:
• Grade school: O(N^2)
• Fourier transform: ~O(N log N)
• Efficient FFT-based multiplication:
• Divide the integers into sequences of smaller digits: 867530924601 → 86, 75, 30, 92, 46, 01
• Convolution of the two digit sequences is equivalent to multiplication.
• Element-wise multiplication in the frequency domain corresponds to convolution in the time domain.
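As an illustration of the convolution idea (not the paper's implementation; function names are mine), the digit-split, convolve, and carry-propagate steps can be sketched in plain Python. An FFT would replace the O(N^2) double loop with an O(N log N) transform:

```python
def to_digits(x, base=100):
    # split an integer into little-endian base-100 digits
    # (867530924601 -> [1, 46, 92, 30, 75, 86])
    digits = []
    while x:
        digits.append(x % base)
        x //= base
    return digits or [0]

def convolve(a, b):
    # acyclic (linear) convolution of two digit sequences --
    # exactly the coefficient products of grade-school multiplication
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def from_digits(digits, base=100):
    # carry propagation: evaluate the digit polynomial at x = base
    total, scale = 0, 1
    for d in digits:
        total += d * scale
        scale *= base
    return total

x, y = 867530924601, 123456789
assert from_digits(convolve(to_digits(x), to_digits(y))) == x * y
```

The FFT-based multiplier computes the same convolution in the frequency domain: transform both digit sequences, multiply element-wise, and inverse-transform before the carry step.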
GIMPS
• Why multiply big numbers?
• The Great Internet Mersenne Prime Search (GIMPS):
• Its primality-testing algorithm for Mersenne numbers (2^q − 1) requires squaring multi-million-digit numbers.
• Mersenne primes are the largest known primes and are used in cryptography.
• Large integer convolution permits a performance comparison of Pentiums and FPGAs in a traditionally floating-point domain.
• Lucas-Lehmer primality test:
  M_q = 2^q − 1; v = 4;
  for i = 1 to q−2: v = v^2 − 2 (mod M_q)
  if v == 0, M_q is prime; else M_q is composite
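The Lucas-Lehmer pseudocode above translates directly into a few lines of Python (a sketch for small exponents; GIMPS spends essentially all of its time in the modular squaring on the marked line):

```python
def lucas_lehmer(q):
    # Lucas-Lehmer test for M_q = 2^q - 1, with q an odd prime
    M = (1 << q) - 1
    v = 4
    for _ in range(q - 2):
        v = (v * v - 2) % M   # the big-number squaring GIMPS accelerates
    return v == 0             # True iff M_q is prime

assert lucas_lehmer(7)        # 2^7  - 1 = 127 is prime
assert not lucas_lehmer(11)   # 2^11 - 1 = 2047 = 23 * 89
assert lucas_lehmer(13)       # 2^13 - 1 = 8191 is prime
```

For the multi-million-digit M_q that GIMPS targets, `v * v` is exactly the FFT-based squaring discussed in this paper.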
Discrete Weighted Transform
• Discrete Weighted Transform (DWT):
• Variable base: each sequence digit can contain a differing number of bits.
• Creates the power-of-two-length sequence needed by the FFT.
• Eliminates the zero-padding needed to convert the cyclic, FFT-based convolution into the acyclic convolution required for squaring.
• Steps:
• The number to be multiplied is divided into variable-length digits.
• The sequence is multiplied by a weight sequence.
• The FFT is performed on the new, power-of-two-length weighted sequence.
• Example for M_q = 2^37 − 1 with an FFT length of 4:
• Bits per digit = { 10, 9, 9, 9 }
• To square 78,314,567,209 (mod M_q), the sequence is { 553, 93, 381, 291 }:
  553 + 93·2^10 + 381·2^19 + 291·2^28 = 78,314,567,209
• Multiply the sequence by the weights, then FFT.
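The variable-base digit split in the example can be reproduced with the standard DWT bit-position rule (digit i starts at bit ceil(q·i/n); the function name is mine):

```python
import math

def dwt_split(x, q, n):
    # variable-base digit split for M_q = 2^q - 1 with FFT length n:
    # digit i occupies bits ceil(q*i/n) .. ceil(q*(i+1)/n) - 1
    pos = [math.ceil(q * i / n) for i in range(n + 1)]
    widths = [pos[i + 1] - pos[i] for i in range(n)]
    digits = [(x >> pos[i]) & ((1 << widths[i]) - 1) for i in range(n)]
    return digits, widths

# the slide's example: M_q = 2^37 - 1, FFT length 4
digits, widths = dwt_split(78_314_567_209, 37, 4)
assert widths == [10, 9, 9, 9]
assert digits == [553, 93, 381, 291]
```

Because 37 is not divisible by 4, the digit widths necessarily differ, which is exactly what "variable base" means here.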
Objective
• Compare the performance of Pentium processors to FPGAs.
• GIMPS chosen because highly optimized code exists.
• GIMPS exploits the fast floating-point performance of Pentiums.
• Xilinx Virtex-II Pro 100 (2VP100) chosen as the target device:
• Largest available 2VP device.
• Contains 444 17×17 unsigned multipliers.
• 888 KB of embedded Block RAM.
• Target: 12-million-digit numbers.
• An award is offered for the first prime of more than 10 million digits.
Floating-point FFT
• The GIMPS implementation uses floating-point arithmetic, which requires round-off error checks.
• Using near-double-precision floating-point (51-bit mantissa):
• 49 real multipliers can be placed on a 2VP100 → 12 complex multipliers.
• A 12-million-digit number → a 2-million-point FFT.
• 44 million complex multiplies → 3.7 million cycles.
All-integer FFT
• Perform the FFT modulo a special prime:
• The prime must have convenient roots of one and two.
• Reductions modulo the prime should be simple.
• Primes of the form 2^k − 2^m + 1 meet both requirements.
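To see why 2^k − 2^m + 1 makes reduction cheap, take one concrete prime of that form, 2^64 − 2^32 + 1 (my example, not necessarily the modulus used in the paper). Since 2^64 ≡ 2^32 − 1 and 2^96 ≡ −1 (mod P), a full 128-bit product folds down with shifts and adds:

```python
P = (1 << 64) - (1 << 32) + 1   # prime of the form 2^k - 2^m + 1 (k=64, m=32)

def reduce_128(x):
    # reduce x < 2^128 using 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod P);
    # the final % P stands in for the few conditional adds hardware would use
    lo  = x & ((1 << 64) - 1)       # bits  0..63
    mid = (x >> 64) & 0xFFFFFFFF    # bits 64..95
    hi  = x >> 96                   # bits 96..127
    return (lo + mid * ((1 << 32) - 1) - hi) % P

import random
for _ in range(1000):
    a, b = random.randrange(P), random.randrange(P)
    assert reduce_128(a * b) == (a * b) % P
```

No general division is ever needed, which is the property that makes these primes attractive for an FPGA datapath.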
Fast Galois Transform
• An all-integer transform using complex numbers modulo a Mersenne prime: a + b·i (mod M_p).
• The real input sequence is folded into a complex input of half the length.
• Modular reductions by Mersenne primes require only shifts and additions.
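A minimal sketch of both ideas, using the Mersenne prime 2^31 − 1 as an illustrative modulus (function names are mine): reduction is shift-and-add because 2^31 ≡ 1 (mod M), and the transform's base operation is an ordinary complex multiply with both parts reduced mod M.

```python
E = 31
M = (1 << E) - 1   # Mersenne prime 2^31 - 1

def mred(x):
    # 2^31 ≡ 1 (mod M), so reduction is a shift plus an add -- no division
    while x > M:
        x = (x >> E) + (x & M)
    return 0 if x == M else x

def fgt_mul(a, b):
    # (a0 + a1*i)(b0 + b1*i) mod M: the transform's base operation
    (a0, a1), (b0, b1) = a, b
    re = mred(a0 * b0 + M - mred(a1 * b1))   # a0*b0 - a1*b1 (kept non-negative)
    im = mred(a0 * b1 + a1 * b0)
    return re, im

assert mred(1 << 31) == 1          # one fold of the high bit
assert fgt_mul((3, 4), (5, 6)) == (M - 9, 38)   # (15 - 24, 18 + 20) mod M
```

Folding the real input into a half-length complex sequence is the standard real-FFT trick; the Galois arithmetic above is what replaces floating-point complex arithmetic.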
Algorithm Selection
• Considered algorithms:
• Floating-point FFT: 3.7M cycles/iteration
• All-integer FFT: 1.7M cycles/iteration
• Galois transform: 3.3M cycles/iteration
• Winograd transform: no acceptable run lengths
• Chinese Remainder Theorem: added complexity
FFT Design
• Multipliers and adders generated by CoreGen.
• 10-cycle butterfly latency.
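Functionally, the pipelined unit computes a standard radix-2 decimation-in-time butterfly over the prime field; a behavioral sketch (the 10-cycle latency comes from pipelining the multiplier and adders, which this one-shot function does not model):

```python
def butterfly(a, b, w, p):
    # radix-2 DIT butterfly: t = w*b, outputs (a + t, a - t) mod p
    t = (w * b) % p
    return (a + t) % p, (a - t) % p

# tiny example over GF(17): a=3, b=5, twiddle w=2
assert butterfly(3, 5, 2, 17) == (13, 10)
```

In hardware, the `%` operations become the shift-and-add reductions enabled by the special-form prime, and one such butterfly is issued per cycle once the pipeline fills.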
Complete Design
• 8-point FFTs lower the required cache throughput.
• Multiple caches allow computation to overlap with memory reads and writes.
Performance Estimates
• Device: XC2VP100-6ff1696; ISE version 6.2i
• Iteration time: 34 milliseconds
• FFT engine frequency: 80 MHz
• 2VP100 utilization: 70% slices, 24% BRAMs, 86% multipliers*
* Not implemented; figures are estimates.
Performance Comparison
• Pentium 4 performance:
• Non-SIMD (64-bit multiplies)
• 6.4 GFLOPS
• The all-integer transform leverages FPGA strengths:
• 1.9 billion integer multiplies per second
• Transform performance exceeds the P4's.
• FPGA vs. Pentium 4:
• 34 ms vs. 60 ms → 1.76× speed-up
• $10,000 vs. $500 → 20× more costly (FPGAs would likely cost less at P4 volumes)
• 600 sq mm* vs. 146 sq mm → 4.1× more die area†
* 2VP100 die area extrapolated from 2VP20 data supplied by Semiconductor Insights (www.semiconductor.com).
† The P4 area estimate does not include the area required by its support chips.
Improvements & Future Work
• The Pentium assembly code is highly optimized, while the hardware accelerator is a first draft.
• Algorithm exploration:
• Nussbaumer's method using 17-bit primes.
• Exploit the "nice" form of the prime to implement shift-only multiplies for the first two FFT stages.
• Cluster implementation:
• The Configurable Computing Lab is constructing a 16-node 2VP cluster with gigabit transceivers as the interconnect.
• Alternative reduced-multiplier butterfly structures.
• Floorplanning.
Conclusions
• All-integer FFTs are attractive for hardware implementations of filters and convolutions.
• GIMPS accelerator designed:
• Operates at 80 MHz.
• 1.76× faster than a 3.2 GHz Pentium 4.
• The cost of the accelerator outweighs its benefit in this application.