The Design of a Reconfigurable Continuous-Flow Mixed-Radix FFT Processor. Anthony T. Jacobson, Dean N. Truong, Bevan M. Baas. VLSI Computation Lab University of California, Davis. Outline. Introduction Architectural Overview Address Generation Twiddle Factor ROM Implementation Results.
The Design of a Reconfigurable Continuous-Flow Mixed-Radix FFT Processor Anthony T. Jacobson, Dean N. Truong, Bevan M. Baas VLSI Computation Lab University of California, Davis
Outline Introduction Architectural Overview Address Generation Twiddle Factor ROM Implementation Results
Design Goals
- The Fast Fourier Transform (FFT) is a ubiquitous DSP algorithm
- Applications which use FFTs typically require their FFTs to have:
  - High computational throughput
  - Runtime reconfigurability (e.g. cognitive radio)
  - High Signal to Quantization Noise Ratio (SQNR)
Main Features
- 32-bit complex FFTs (16-bit real, 16-bit imag.)
- Reconfigurable from 16- to 4k-point IFFTs/FFTs
- Mixed-Radix
  - Radix-4 computation with final Radix-2 stage, if necessary (for odd n, 2^n-point FFTs)
- Decimation in Time (DIT) addressing
- Memory-based architecture
  - Lower area compared to pipelined designs
- Continuous flow for maximum throughput
- Area efficient twiddle-factor ROM design
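The mixed-radix rule above can be sketched in a few lines of Python. This is only an illustration: the function name `stage_plan` is hypothetical and does not come from the design, and the only rule taken from the slide is "radix-4 stages, plus one final radix-2 stage when n is odd".

```python
def stage_plan(n_points):
    """Return the radix used at each stage of a 2^n-point mixed-radix FFT.

    Hypothetical helper for illustration; not part of the presented design.
    """
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    n = n_points.bit_length() - 1      # n_points == 2**n
    plan = [4] * (n // 2)              # as many radix-4 stages as possible
    if n % 2:                          # odd n: one final radix-2 stage
        plan.append(2)
    return plan
```

For example, `stage_plan(32)` yields two radix-4 stages followed by one radix-2 stage, while `stage_plan(4096)` yields six radix-4 stages.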
Continuous Flow Architecture
- 16-bit data words are passed between I/O and memory (1 word real, 1 word imag.)
- Four 32-bit complex data are read/written by the processing element (FFT butterfly)
- The FFT's internal memory consists of two 4k-word banks (1 word = 32 bits)
  - 4k-word banks allow support for 4096-point FFTs
- Each bank is partitioned into four "subbanks" for multi-read/writes from/to the processing element
  - Four 1024-word x 32-bit SRAMs
- One bank is used to read/write from I/O while the other is used to read/write from the processing element
- Each bank consists of dual-port SRAMs (one read and one write per cycle)
[Figure: SRAM with wrt_addr, wrt_data, rd_addr and rd_data ports]
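The two-bank arrangement can be illustrated with a small behavioral model in Python. This is a sketch under assumptions, not the hardware: the function `continuous_flow`, the per-frame granularity, and the idea that the I/O side unloads old results and loads new inputs in the same frame period are all illustrative choices; the slide only states that one bank serves I/O while the other serves the processing element.

```python
def continuous_flow(frames, transform):
    """Behavioral model of two banks that swap roles every frame period.

    frames    : iterable of input frames (lists of samples)
    transform : stand-in for the in-place FFT done by the processing element
    Both names are hypothetical and exist only for this illustration.
    """
    banks = [None, None]   # each bank holds None or a (state, data) pair
    io = 0                 # index of the bank currently attached to I/O
    outputs = []
    # Two extra empty periods flush the last frames out of the banks.
    for frame in list(frames) + [None, None]:
        pe = 1 - io        # the other bank is attached to the processing element
        # I/O side: unload finished results, then load the next input frame.
        if banks[io] is not None:
            outputs.append(banks[io][1])
        banks[io] = ("raw", list(frame)) if frame is not None else None
        # Processing-element side: transform the frame loaded one period earlier.
        if banks[pe] is not None and banks[pe][0] == "raw":
            banks[pe] = ("done", transform(banks[pe][1]))
        io = pe            # swap bank roles for the next frame period
    return outputs
```

In this model each frame is loaded in one period, transformed in the next, and unloaded in the one after that, so a new frame can be accepted every period.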
Block Diagram
[Figure: block diagram showing the I/O Interface, Memory, and Processing Element]
Radix-4 DIT Butterfly
- The computational heart of the FFT is its butterfly, consisting of:
  - Three complex multiplications
  - Twelve complex additions
- Execution is broken into two pipeline stages:
  - MULT: Three 16 x 16-bit multipliers
  - ADD: Twelve 4-input 32-bit adders (34-bit sum)
    - ½ LSB rounding and truncation
    - 16-bit MSB final result
- A Radix-2 DIT butterfly can be achieved by utilizing only A and C as inputs and setting the B and D inputs to zero:
  X = A + C·W
  Y = A - C·W
[Figure: radix-2 butterfly with inputs A and C, twiddle factor W, and outputs X and Y]
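A floating-point sketch of the butterfly arithmetic may help make the radix-2 reuse concrete. It uses the textbook radix-4 DIT equations (a 4-point DFT of the twiddled inputs); the function name, the argument order, and the use of Python complex numbers instead of 16-bit fixed point with rounding are assumptions made for illustration only.

```python
def radix4_dit_butterfly(a, b, c, d, wb=1, wc=1, wd=1):
    """Textbook radix-4 DIT butterfly on complex inputs A, B, C, D.

    Illustrative only: floating point, no rounding or truncation.
    """
    # MULT stage: three complex multiplications by the twiddle factors.
    bw, cw, dw = b * wb, c * wc, d * wd
    # ADD stage: four outputs, three complex additions each (twelve in total).
    x0 = a + bw + cw + dw
    x1 = a - 1j * bw - cw + 1j * dw
    x2 = a - bw + cw - dw
    x3 = a + 1j * bw - cw - 1j * dw
    return x0, x1, x2, x3


def radix2_via_radix4(a, c, w):
    """Radix-2 butterfly obtained by zeroing the B and D inputs."""
    x, y, _, _ = radix4_dit_butterfly(a, 0, c, 0, wc=w)
    return x, y   # X = A + C*W, Y = A - C*W
```

With B and D set to zero, the first two outputs reduce to A + C·W and A - C·W, which is the radix-2 relationship stated on the slide.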
Quantization Considerations
- From stage to stage, intermediate butterfly results are right shifted by 2 to avoid saturation
  - Twiddle factor constants lie within the unit circle (magnitude ≤ 1), but inputs are not restricted by this
  - Additional configuration option to shift initial input by 1 is provided
- Block floating point helps increase efficiency by finding the minimum sign bits (redundant bits) over all butterfly results per FFT stage
  - For a worst case sinusoidal input, SQNR is improved over 200%
  - # sign bits = min(prev. # sign bits, current # sign bits)
- Twiddle factors are in 1.15 16-bit fixed-point format
  - Multiply by ±1 and ±i situations are handled through bypassing
[Figure: complex plane (Re and Im axes, points ±1 and ±i) showing the possible location of inputs (assume 1.15) and the typical location of W]
[Figure: block floating-point datapath between "From Mem" and "To Mem", with blocks labeled "<<" and "L S D" driven by # sign bits]
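The sign-bit bookkeeping behind block floating point can be sketched as follows. This is a minimal model, assuming 16-bit two's-complement values held as Python integers; the function names are hypothetical, and applying the left shift immediately (rather than on the next read from memory) is a simplification of whatever the hardware actually does.

```python
def redundant_sign_bits(value, width=16):
    """Count leading bits that merely repeat the sign bit of a two's-complement value."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    magnitude = ~value if value < 0 else value   # fold negatives onto non-negatives
    return width - 1 - magnitude.bit_length()


def block_sign_bits(results, width=16):
    """# sign bits = min(prev. # sign bits, current # sign bits) over one stage."""
    count = width - 1
    for re, im in results:
        count = min(count,
                    redundant_sign_bits(re, width),
                    redundant_sign_bits(im, width))
    return count


def normalize_block(results, width=16):
    """Shift every value left by the shared redundant-bit count (assumed usage)."""
    shift = block_sign_bits(results, width)
    return shift, [(re << shift, im << shift) for re, im in results]
```

Because the shift is the minimum over the whole block, the largest value in the block is brought up to full scale without overflowing, and every other value keeps the same relative scaling.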
I/O Address Generation
- Recall that each radix-4 FFT butterfly computation has four inputs and four outputs, which necessitate multiple reads and writes
- Standard SRAMs only have a single read and a single write port, so we break up one 4k-word bank into four 1k-word "subbanks"
- To avoid memory conflicts we must ensure that each butterfly input/output accesses a different subbank
  - This requires developing an addressing scheme based on the memory location pattern of a DIT FFT
- Data Index = {0, 1, …, 2^n - 1}
- Data indices in the sample are after bit reversal, thus they do not represent the actual input order index of N-point (2^n) data
[Figure: sample memory address map]
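The slides do not spell out the subbank mapping, so the following Python sketch substitutes a well-known conflict-free scheme purely as an illustration: the subbank is the sum of the base-4 digits of the data index, modulo 4. The function names and the butterfly index layout are assumptions, and the sketch works on plain data indices rather than the bit-reversed addresses the design uses.

```python
def subbank(index, digits):
    """Subbank = (sum of the base-4 digits of index) mod 4. Assumed scheme."""
    total = 0
    for _ in range(digits):
        total += index & 3     # lowest base-4 digit
        index >>= 2
    return total & 3


def butterfly_indices(stage, k, digits):
    """Indices of the four points of butterfly k at a radix-4 stage.

    The four indices differ only in base-4 digit number `stage`.
    """
    if not 0 <= stage < digits:
        raise ValueError("stage out of range")
    stride = 4 ** stage
    low, high = k % stride, k // stride
    base = high * stride * 4 + low          # digit `stage` of base is zero
    return [base + j * stride for j in range(4)]
```

Since the four points of a butterfly differ in exactly one base-4 digit, their digit sums differ by 0, 1, 2 and 3, so under this scheme they always land in four different subbanks.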
Butterfly Address Generation
- FFT/IFFT is controlled by a primary address counter, which is then broken up into "group" and "butterfly" counters (gr…g0 and bs…b0, respectively)
  - For radix-4, s and r are equal to log4(N) for an N-point data set
- The final butterfly addresses are determined by the two counters
  - The twiddle factor ROM address is determined solely by the butterfly counter
- base = {(0 – grg0 – … – b1b0), bs…b2}
- offset = function of memory subbank # butterfly number
Twiddle Factor ROM
- Radix-4 twiddle factor: y = index, θ_N = 2π/N, θ_y ≡ y·θ_N
- W-ROM contains 512 32-bit complex values for θ_y = [0, π/4]
  - All other factors can be obtained from symmetry and special relationships
  - The upper three bits of the address decode the ROM outputs to their correct octants
- Wc Index = 2(Wb Index)
- Wd Index = 3(Wb Index) = Wc Index + Wb Index
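The octant decoding can be sketched in Python as follows. Several details are assumptions rather than facts from the slides: the forward-transform convention W = exp(-jθ), floating-point storage instead of 1.15 fixed point, a ROM holding the 512 entries for y = 1…512 (θ in (0, π/4]), and treating index 0 as a bypass case, in the spirit of the ±1 and ±i bypassing mentioned under quantization. All names are hypothetical.

```python
import math

N = 4096                 # largest supported transform size
ROM_SIZE = N // 8        # 512 entries cover one octant

# Assumed ROM contents: (cos, sin) of y * 2*pi/N for y = 1 .. 512.
ROM = [(math.cos(2 * math.pi * y / N), math.sin(2 * math.pi * y / N))
       for y in range(1, ROM_SIZE + 1)]

# Per-octant mapping from the stored (c, s) to (cos(theta), sin(theta)).
OCTANT_MAP = [
    lambda c, s: (c, s),   lambda c, s: (s, c),
    lambda c, s: (-s, c),  lambda c, s: (-c, s),
    lambda c, s: (-c, -s), lambda c, s: (-s, -c),
    lambda c, s: (s, -c),  lambda c, s: (c, -s),
]


def twiddle(y):
    """Return exp(-2j*pi*y/N) using only the one-octant ROM."""
    y %= N
    octant = y >> 9                  # upper three bits of the 12-bit address
    low = y & (ROM_SIZE - 1)         # lower nine bits
    # Odd octants read the ROM mirrored about pi/4.
    idx = low if octant % 2 == 0 else ROM_SIZE - low
    c, s = (1.0, 0.0) if idx == 0 else ROM[idx - 1]   # idx 0: bypass, no ROM read
    cos_t, sin_t = OCTANT_MAP[octant](c, s)
    return complex(cos_t, -sin_t)


def radix4_twiddles(wb_index):
    """Wc Index = 2(Wb Index), Wd Index = 3(Wb Index)."""
    return twiddle(wb_index), twiddle(2 * wb_index), twiddle(3 * wb_index)
```

The lookup reproduces every twiddle factor on the unit circle from one eighth of the values, which is where the area saving of the ROM comes from.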
Implementation Results
- Fabricated within a 167-processor array
- ST Microelectronics LP 65 nm CMOS
- Area: 1 mm²
- Initial results:
  - Fully functional up to 866 MHz at 1.3 V
  - Average power at this operating point: 35 mW
Conclusion
- Runtime configurable 4- to 4096-point FFT/IFFTs
- 32-bit fixed-point complex data
- High SQNR across all modes:
  - ~80 dB for 64-point
  - ~74 dB for 1024-point
- High throughput at 866 MHz:
  - 67 ns to compute a 64-point FFT
    - Over 950 Msamples per second
  - 1.5 μs to compute a 1024-point FFT
    - Over 680 Msamples per second
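The throughput figures follow directly from the quoted transform times, as the short Python check below shows. The second function is a rough estimate under an assumption that is not stated on the slides: one radix-4 butterfly issued per clock cycle, with pipeline fill and any other overhead ignored. Both function names are hypothetical.

```python
CLOCK_HZ = 866e6   # operating frequency quoted in the results


def msamples_per_second(n_points, seconds):
    """Sample throughput, in Msamples/s, of an n_points FFT finished in `seconds`."""
    return n_points / seconds / 1e6


def radix4_butterfly_cycles(n_points):
    """Estimated cycles for a power-of-4 FFT, assuming one butterfly per cycle."""
    stages, size = 0, n_points
    while size > 1:
        if size % 4:
            raise ValueError("n_points must be a power of four")
        size //= 4
        stages += 1
    return stages * (n_points // 4)   # N/4 butterflies in each stage
```

Here `msamples_per_second(64, 67e-9)` is about 955 and `msamples_per_second(1024, 1.5e-6)` is about 683, consistent with the "over 950" and "over 680" figures. Under the one-butterfly-per-cycle assumption, a 1024-point FFT needs 1280 cycles, or roughly 1.48 μs at 866 MHz, close to the quoted 1.5 μs; the same estimate gives about 55 ns for 64 points, and the slides do not explain the difference from the quoted 67 ns.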
Acknowledgements
- ST Microelectronics
- NSF Grant 430090 and CAREER award 546907
- Intel
- SRC GRC Grant 1598 and CSR Grant 1659
- Intellasys
- UC Micro
- SEM
- J.-P. Schoellkopf, K. Torki, S. Dumont, Y.-P. Cheng, R. Krishnamurthy and M. Anders