
The Design of a Reconfigurable Continuous-Flow Mixed-Radix FFT Processor

Anthony T. Jacobson, Dean N. Truong, Bevan M. Baas. VLSI Computation Lab, University of California, Davis. Outline: Introduction; Architectural Overview; Address Generation; Twiddle Factor ROM; Implementation Results.


Presentation Transcript


  1. The Design of a Reconfigurable Continuous-Flow Mixed-Radix FFT Processor. Anthony T. Jacobson, Dean N. Truong, Bevan M. Baas. VLSI Computation Lab, University of California, Davis.

  2. Outline: Introduction; Architectural Overview; Address Generation; Twiddle Factor ROM; Implementation Results.

  3. Design Goals: The Fast Fourier Transform (FFT) is a ubiquitous DSP algorithm. Applications that use FFTs typically require high computational throughput, runtime reconfigurability (e.g. cognitive radio), and a high Signal to Quantization Noise Ratio (SQNR).

  4. Main Features: 32-bit complex FFTs (16-bit real, 16-bit imag.). Reconfigurable from 16- to 4k-point IFFTs/FFTs. Mixed-radix: radix-4 computation with a final radix-2 stage if necessary (for 2^n-point FFTs with odd n). Decimation in Time (DIT) addressing. Memory-based architecture, with lower area than pipelined designs. Continuous flow for maximum throughput. Area-efficient twiddle-factor ROM design.
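
The mixed-radix rule on this slide can be sketched in a few lines of Python. This is only an illustration of the stage count implied by "radix-4 with a final radix-2 stage for odd n"; the function name `stage_plan` is hypothetical and does not come from the presentation.

```python
def stage_plan(n_points):
    """Return the radix of each FFT stage for a power-of-two length."""
    n = n_points.bit_length() - 1          # n such that n_points == 2**n
    if n_points < 2 or n_points != 1 << n:
        raise ValueError("FFT length must be a power of two")
    radices = [4] * (n // 2)               # each radix-4 stage consumes two index bits
    if n % 2:                              # odd n: one index bit is left over
        radices.append(2)                  # final radix-2 stage
    return radices


if __name__ == "__main__":
    for length in (16, 32, 1024, 2048, 4096):
        print(length, stage_plan(length))
```

For example, a 4096-point transform (n = 12) needs six radix-4 stages, while a 32-point transform (n = 5) needs two radix-4 stages plus one radix-2 stage.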

  5. Outline: Introduction; Architectural Overview; Address Generation; Twiddle Factor ROM; Implementation Results.

  6. Continuous Flow Architecture: 16-bit data words are passed between I/O and memory (1 word real, 1 word imag.). Four 32-bit complex data words are read/written by the processing element (FFT butterfly). The FFT’s internal memory consists of two 4k-word banks (1 word = 32 bits); 4k-word banks allow support for 4096-point FFTs. Each bank is partitioned into four “subbanks” (four 1024-word x 32-bit SRAMs) for multiple reads/writes from/to the processing element. One bank is used to read/write from I/O while the other is used to read/write from the processing element. Each bank consists of dual-port SRAMs (one read and one write per cycle) with wrt_addr, wrt_data, rd_addr, and rd_data ports.

  7. Block Diagram (figure): Processing Element, Memory, I/O Interface.

  8. Radix-4 DIT Butterfly: The computational heart of the FFT is its butterfly, consisting of three complex multiplications and twelve complex additions. Execution is broken into two pipeline stages. MULT: three 16 x 16-bit multipliers. ADD: twelve 4-input 32-bit adders (34-bit sum), followed by ½ LSB rounding and truncation to a 16-bit MSB final result. A radix-2 DIT butterfly can be achieved by using only A and C as inputs and setting the B and D inputs to zero: X = A + C·W, Y = A − C·W.
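
The butterfly on this slide can be sketched in floating-point Python to show how the radix-2 case falls out of the radix-4 datapath. This is a mathematical sketch only: it ignores the 16-bit fixed-point arithmetic, rounding, and pipelining described above, and the function names and the output ordering are assumptions rather than details taken from the presentation.

```python
def radix4_dit_butterfly(a, b, c, d, wb, wc, wd):
    """Radix-4 DIT butterfly: three complex multiplies, then twelve adds."""
    bw, cw, dw = b * wb, c * wc, d * wd    # the three twiddle multiplications
    # Multiplying by +/-1j only swaps real/imag parts, so each output
    # costs three complex additions (twelve in total).
    x0 = a + bw + cw + dw
    x1 = a - 1j * bw - cw + 1j * dw
    x2 = a - bw + cw - dw
    x3 = a + 1j * bw - cw - 1j * dw
    return x0, x1, x2, x3


def radix2_dit_butterfly(a, c, w):
    """Radix-2 butterfly obtained by zeroing the B and D inputs."""
    x, y, _, _ = radix4_dit_butterfly(a, 0, c, 0, 1, w, 1)
    return x, y                            # X = A + C*W, Y = A - C*W


if __name__ == "__main__":
    # With unit twiddles the radix-4 butterfly is a 4-point DFT.
    print(radix4_dit_butterfly(1, 2, 3, 4, 1, 1, 1))
    print(radix2_dit_butterfly(1 + 1j, 2, -1))
```

With B and D set to zero, the first two outputs reduce to A + C·W and A − C·W, which matches the radix-2 equations on the slide.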

  9. Quantization Considerations: From stage to stage, intermediate butterfly results are right shifted by 2 to avoid saturation. Twiddle factor constants lie within the unit circle (magnitude ≤ 1), but inputs are not restricted by this (figure: complex plane with Re and Im axes marked at ±1 and ±i, showing the possible location of inputs, assuming 1.15 format, and the typical location of W). An additional configuration option to shift the initial input by 1 is provided. Block floating point helps increase efficiency by finding the minimum number of sign bits (redundant bits) over all butterfly results per FFT stage: # sign bits = min(prev. # sign bits, current # sign bits) (figure: sign-bit detection and left shift, <<, on the path between “From Mem” and “To Mem”). For a worst-case sinusoidal input, SQNR is improved by over 200%. Twiddle factors are in 1.15 16-bit fixed-point format. Multiplications by ±1 and ±i are handled through bypassing.
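
The running-minimum rule for sign bits can be sketched as follows. This is a software illustration under stated assumptions: values are 16-bit two's-complement integers, the helper names are hypothetical, and the way the hardware applies the shift and tracks the block exponent is not described in the slides.

```python
def redundant_sign_bits(value, width=16):
    """Count the leading bits that merely repeat the sign bit."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError("value does not fit in the given width")
    magnitude = value if value >= 0 else ~value   # fold negatives onto non-negatives
    return (width - 1) - magnitude.bit_length()


def stage_sign_bits(results, width=16):
    """Running minimum over all (real, imag) butterfly results of one stage."""
    sign_bits = width - 1                          # maximum possible headroom
    for re, im in results:
        current = min(redundant_sign_bits(re, width),
                      redundant_sign_bits(im, width))
        sign_bits = min(sign_bits, current)        # min(prev, current)
    return sign_bits


def normalize(results, width=16):
    """Left-shift a whole stage by its common headroom (assumed usage)."""
    shift = stage_sign_bits(results, width)
    return shift, [(re << shift, im << shift) for re, im in results]


if __name__ == "__main__":
    print(normalize([(100, -3), (2000, 5)]))
```

Because the shift is the minimum headroom over the whole stage, the largest value in the block is scaled up as far as possible without overflowing, and every other value keeps the same scale.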

  10. Outline: Introduction; Architectural Overview; Address Generation; Twiddle Factor ROM; Implementation Results.

  11. I/O Address Generation: Recall that each radix-4 FFT butterfly computation has four inputs and four outputs, which necessitate multiple reads and writes. Standard SRAMs have only a single read port and a single write port, so one 4k-word bank is broken up into four 1k-word “subbanks”. To avoid memory conflicts, each butterfly input/output must access a different subbank. This requires an addressing scheme based on the memory location pattern of a DIT FFT. (Figure: memory address map, Data Index = {0, 1, …, 2^n − 1}.) The data indices in the figure are after bit reversal, so they do not represent the actual input order of the N-point (N = 2^n) data.
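
The slides do not spell out the subbank mapping, so the sketch below substitutes one well-known conflict-free assignment purely for illustration: the subbank number is the sum of the base-4 digits of the data index, modulo 4. This is an assumed scheme, not necessarily the one used in this processor, and all names are hypothetical. The check confirms that the four operands of every radix-4 butterfly, which differ in exactly one base-4 digit, land in four different subbanks.

```python
def subbank(index, digits):
    """Assumed mapping: sum of the base-4 digits of the index, modulo 4."""
    total = 0
    for _ in range(digits):
        total += index & 3      # lowest base-4 digit
        index >>= 2
    return total % 4


def conflict_free(digits):
    """Check every radix-4 butterfly of a 4**digits-point FFT."""
    n_points = 4 ** digits
    for stage in range(digits):
        stride = 4 ** stage                     # operands differ in digit `stage`
        for base in range(n_points):
            if (base // stride) % 4 != 0:
                continue                        # visit each butterfly once
            banks = {subbank(base + k * stride, digits) for k in range(4)}
            if len(banks) != 4:
                return False
    return True


if __name__ == "__main__":
    print(conflict_free(6))     # 4096-point case
```

The property holds because changing a single base-4 digit through its four values changes the digit sum through four distinct residues modulo 4.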

  12. Butterfly Address Generation: The FFT/IFFT is controlled by a primary address counter, which is broken up into “group” and “butterfly” counters (g_r…g_0 and b_s…b_0, respectively). For radix-4, s and r are equal to log_4(N) for an N-point data set. The final butterfly addresses are determined by the two counters. The twiddle factor ROM address is determined solely by the butterfly counter. base = {(0 – grg0 – … – b1b0), bs…b2}; offset = function of memory subbank # butterfly number.

  13. Outline: Introduction; Architectural Overview; Address Generation; Twiddle Factor ROM; Implementation Results.

  14. Twiddle Factor ROM: Radix-4 twiddle factor angle: θ_y ≡ y·θ_N, where y = index and θ_N = 2π/N. The W-ROM contains 512 32-bit complex values for θ_y in [0, π/4]. All other factors can be obtained from symmetry and special relationships. The upper three bits of the address decode the ROM outputs to their correct octants. Wc Index = 2(Wb Index); Wd Index = 3(Wb Index) = Wc Index + Wb Index.
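
The octant decoding described on this slide can be sketched in floating-point Python. Several details here are assumptions rather than facts from the presentation: the forward-FFT sign convention W = cos θ − i·sin θ, a 12-bit index for N = 4096, ROM entries at indices 0 to 511 with the θ = π/4 point handled as a separate constant, and all function and variable names. The 1.15 fixed-point quantization is not modeled.

```python
import cmath
import math

N = 4096                         # largest supported FFT length
OFFSET_BITS = 9                  # lower 9 bits address the ROM
OCTANT = 1 << OFFSET_BITS        # 512 entries cover one octant

# (cos, sin) for angles in the first octant only.
ROM = [(math.cos(2 * math.pi * k / N), math.sin(2 * math.pi * k / N))
       for k in range(OCTANT)]
DIAGONAL = (math.sqrt(0.5), math.sqrt(0.5))   # theta = pi/4 (assumed special case)


def twiddle(index):
    """Rebuild W = exp(-i * 2*pi*index/N) from the first-octant ROM."""
    index %= N
    octant = index >> OFFSET_BITS             # upper three bits
    offset = index & (OCTANT - 1)
    if octant & 1:
        # Odd octant: mirror about pi/4, which swaps cos and sin.
        mirrored = OCTANT - offset
        s, c = DIAGONAL if mirrored == OCTANT else ROM[mirrored]
    else:
        c, s = ROM[offset]
    for _ in range(octant >> 1):              # rotate by pi/2 per quadrant
        c, s = -s, c
    return complex(c, -s)


def butterfly_twiddles(wb_index):
    """Wc index = 2 * Wb index; Wd index = Wc index + Wb index."""
    wc_index = 2 * wb_index
    wd_index = wc_index + wb_index
    return twiddle(wb_index), twiddle(wc_index), twiddle(wd_index)


if __name__ == "__main__":
    worst = max(abs(twiddle(k) - cmath.exp(-2j * math.pi * k / N))
                for k in range(N))
    print("largest reconstruction error:", worst)
```

Storing one octant instead of the full circle cuts the table to one eighth of its size, at the cost of a small decode step on the upper address bits.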

  15. Outline: Introduction; Architectural Overview; Address Generation; Twiddle Factor ROM; Implementation Results.

  16. Implementation Results: Fabricated within a 167-processor array in ST Microelectronics LP 65 nm CMOS. Area: 1 mm². Initial results: fully functional up to 866 MHz at 1.3 V. Average power at this operating point: 35 mW.

  17. Conclusion: Runtime-configurable 4- to 4096-point FFT/IFFTs. 32-bit fixed-point complex data. High SQNR across all modes: ~80 dB for 64-point, ~74 dB for 1024-point. High throughput at 866 MHz: 67 ns to compute a 64-point FFT (over 950 Msamples per second); 1.5 μs to compute a 1024-point FFT (over 680 Msamples per second).
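
The throughput figures follow directly from the transform length divided by the compute time. The short check below reproduces that arithmetic from the numbers on the slide; the helper names are hypothetical, and the cycle counts are only what the stated times imply at 866 MHz.

```python
CLOCK_HZ = 866e6                 # operating frequency quoted on the slide


def msamples_per_second(n_points, seconds):
    """Sustained throughput in millions of complex samples per second."""
    return n_points / seconds / 1e6


def implied_cycles(seconds, clock_hz=CLOCK_HZ):
    """Clock cycles that fit in the quoted compute time."""
    return seconds * clock_hz


if __name__ == "__main__":
    for n_points, seconds in ((64, 67e-9), (1024, 1.5e-6)):
        print(n_points,
              round(msamples_per_second(n_points, seconds), 1),
              round(implied_cycles(seconds)))
```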

  18. Acknowledgements: ST Microelectronics; NSF Grant 430090 and CAREER award 546907; Intel; SRC GRC Grant 1598 and CSR Grant 1659; Intellasys; UC Micro; SEM; J.-P. Schoellkopf, K. Torki, S. Dumont, Y.-P. Cheng, R. Krishnamurthy, and M. Anders.
