680 likes | 1.27k Views
Outline. Discrete Fourier TransformFast Fourier TransformRadix-2 Decimation in TimeRadix-2 Decimation in FrequencyRadix-4 AlgorithmsFFT Hardware Implementations for FPGAs and ASICsParallel FFTSerial FFT (In-Place FFT)Semi-Parallel FFTPipeline FFTFFT Implementation IssuesComplex MultiplicationWordlength Growth and ScalingError Analysis.
E N D
1. ECE 699—Digital Signal Processing Hardware ImplementationsLecture 6
Fast Fourier Transforms
2/25/09
2. Outline Discrete Fourier Transform
Fast Fourier Transform
Radix-2 Decimation in Time
Radix-2 Decimation in Frequency
Radix-4 Algorithms
FFT Hardware Implementations for FPGAs and ASICs
Parallel FFT
Serial FFT (In-Place FFT)
Semi-Parallel FFT
Pipeline FFT
FFT Implementation Issues
Complex Multiplication
Wordlength Growth and Scaling
Error Analysis
3. Reading Fast Fourier Transforms
Proakis, Digital Signal Processing
Chapter 8 (4th Edition) = Chapter 6 (3rd Edition)
Pipeline FFT
S. He, M. Torkelson, "A New Approach to Pipeline FFT Processor," Proc. 10th IEEE International Parallel Processing Symposium (IPPS), pp. 766-770, 1996.
4. Discrete Fourier Transform
5. Analog to Digital Conversion Consider an analog signal
xa(t) = Acos(O t) = Acos(2p F t)
A = amplitude
O = 2p F = analog radian frequency (radians/s)
F = analog frequency (cycles/s = Hz)
P = 1/F = period of signal
Perform sampling of analog signal at a rate of Fs (samples/sec) = sampling frequency or sampling rate
Sampling period = T = 1 / Fs
6. Analog to Digital Conversion
7. Analog to Digital Conversion xa(nT) = x(n) = Acos(2p F n T) = A cos(2p n F/Fs)
Define digital frequency f = F/Fs
Define digital radian frequency ? = 2 p f
X(n) = A cos (2p fn) = A cos (?n)
A/D conversion also performs amplitude quantization converting voltage levels to bits (not discussed here)
To prevent aliasing Fs >= 2Fmax
Sampling frequency must be higher than twice the largest analog frequency component Fmax (which is F in this example)
2Fmax is called the Nyquist Rate. The sampling rate must be greater than or equal to Nyquist Rate to avoid aliasing.
The range of f is - ˝ < f < ˝ and the range of w is –p < ? < p
8. Fourier Analysis of Signals Fourier analysis allows us to decompose a signal in the time domain and analyze/process the signal in the frequency domain
9. Fourier Transform Consider an aperiodic signal x(n), the Discrete-Time Fourier Transform (DTFT) of that signal is
Note that X(?) is a continuous-valued signal (not discrete-valued)
Due to this, it is computationally challenging to compute the Discrete-Time Fourier Transform (DTFT)
The DTFT is periodic over ? with a period of 2p
10. Example of DTFT x(n) = A for 0 = n = L-1, 0 otherwise
11. Discrete Fourier Transform Due to computational difficulty in computing continuous-valued Discrete-Time Fourier Transform (DTFT), can use Discrete Fourier Transform (DFT) instead
DFT is a "sampled" version of the DTFT
Take N samples from 0 to 2p such that ? = 2 p k/N for k = 0 to N-1
Spectrum of an aperiodic discrete-time signal with finite duration L can be exactly recovered from its samples at frequency ? = 2 p k/N if N>= L [source: Proakis]
N-point DFT
Also written as
12. Inverse Discrete Fourier Transform Inverse DFT of X(0) to X(N-1) can perfectly reconstruct original time-domain signal x(0) to x(N-1)
13. DFT Properties N-point DFT maps
N inputs in time domain: x(0) through x(N-1)
to N output output in frequency domain: X(0) to X(N-1)
Inherent block processing
If x(n) is real then X(N-k) = X*(k) for k=1 to N/2-1 where * denotes complex conjugate
|X(k)| is symmtric about k=N/2
Even if x(n) is real, X(k) is usually complex (except for X(0) and X(N/2))
X(0) denotes DC frequency, X(N/2) denotes Fs/2 frequency
14. Discrete Time Fourier Transform
15. Discrete Fourier Transform (N=50)
16. Discrete Fourier Transform (N=100)
17. Matlab Example: f=0 (DC), N=32-point DFT n=0:31;
f=0.0;
x = cos(2*pi*n*f);
figure(1); stem(x);
figure(2); stem(abs(fft(x)));
18. Matlab Example: f=0.25, N=32-point DFT n=0:31;
f=0.25;
x = cos(2*pi*n*f);
figure(1); stem(x);
figure(2); stem(abs(fft(x)));
19. Matlab Example: f=0.5 (analog F = Fs/2), N=32-point DFT n=0:31;
f=0.5;
x = cos(2*pi*n*f);
figure(1); stem(x);
figure(2); stem(abs(fft(x)));
20. Matlab Example: random input, , N=32-point DFT n=0:31;
x = -1 + (1 - -1).*rand(length(n),1);
figure(1); stem(x);
figure(2); stem(abs(fft(x)));
21. Example DFT Applications Spectral Estimation
Frequency-Domain Filtering
Interpolation
Implementing MCM (multi-carrier modulation) and OFDM (orthogonal frequency division multiplexing) communication systems
Correlation Analysis
22. Fast Fourier Transform
23. Discrete Fourier Transform Computations The DFT equation:
WN's are also called "twiddle factors" which are complex values around the unit circle in the complex plane
Multiplication by twiddle factor serve to "rotate" a value around the unit circle in the complex plane
Computations:
To compute X(0), require N complex multiplications
To compute X(1), require N complex multiplications
..To compute X(N-1), require N complex multiplications
Total: N2 complex multiplications and N2 – N complex additions to compute N-point DFT
24. Fast Fourier Transform Can exploit shared twiddle factor properties (i.e. sub-expression sharing) to reduce the number of multiplications in DFT
These class of algorithms are called Fast Fourier Transforms
An FFT is simply an efficient implementation of the DFT
Mathematically FFT = DFT
FFT exploits two properties in the twiddle factors:
Symmetry Property:
Periodicity Property:
FFTs use a divide and conquer approach, breaking an N-point DFT into several smaller DFTs
N can be factored as N=r1r2r2…rv where the {ri} are prime
Particular focus on r1=r2=..=rv=r, where r is called the radix of the FFT algorithm
In this case N=rv and the FFT has a regular pattern
We will study radix-2 (r=2) and radix-4 (r=4) FFTs in this class
25. Decimation-in-time Radix-2 FFT Split x(n) into even and odd samples and perform smaller FFTs
f1(n) = x(2n)
f2(n) = x(2n+1)
n=0, 1, … N/2-1
Derivation performed in class
Radix-2 Decimation-in-time (DIT) algorithm
In radix-2, the "butterfly" element takes in 2 inputs and produces 2 outputs
Butterfly implements 2-point FFT
Computations:
(N/2)log2N complex multiplications
Nlog2N complex additions
26. Decimation-in-time Radix-2 FFT (N=8)
27. Decimation-in-time Radix-2 FFT (N=8)
28. Radix-2 DIT Basic Butterfly
29. Decimation-in-frequency Radix-2 FFT Decompose X(k) such that it is split into FFT of points 0 to N/2-1 and points N/2 to N-1
Then decimate X(k) into even and odd numbered samples
Derivation performed in class
Radix-2 Decimation-in-frequency (DIF) algorithm
In radix-2, the "butterfly" element takes in 2 inputs and produces 2 outputs
Butterfly implements 2-point FFT
Computations:
(N/2)log2N complex multiplications
Nlog2N complex additions
30. Decimation-in-frequency Radix-2 FFT (N=8)
31. Decimation-in-frequency Radix-2 FFT (N=8)
32. Radix-2 DIF Basic Butterfly
33. Note on bit-reversed order For radix-2 decimation in frequency FFT
If x(n) is in natural order (n=0,1,2,3,4,5,6,7), then output X(k) is in bit-reversed order (k=0,4,2,6,1,5,3,7)
Vice-versa also holds true
For radix-2 decimation in time FFT
If x(n) is in bit-reversed order (n=0,4,2,6,1,5,3,7), then output X(k) is in natural order (k=0,1,2,3,4,5,6,7)
Vice-versa also holds true
Bit-reversing is an inherent property of FFTs
Radix-4 bit-reversing can be slightly different depending on the algorithm
34. Bit-Reversing
35. Radix-4 FFT In radix-2 you have log2N stages
Can also implement radix-4 and now have log4N stages
Radix-4 Decimation-in-time: split x(n) into four time sequences instead of two
Derivation performed in class
Split x(n) into four decimated sample streams
f1(n) = x(4n)
f2(n) = x(4n+1)
f3(n) = x(4n+2)
f4(n) = x(4n+3)
n=0, 1, .. N/4-1
Radix-4 Decimation-in-time (DIT) algorithm
In radix-4, the "butterfly" element takes in 4 inputs and produces 4 outputs
Butterfly implements 4-point FFT
Computations:
(3N/4)log4N = (3N/8)log2N complex multiplications ? decrease from radix-2 algorithms
(3N/2)log2N complex additions ? increase from radix-2 algorithms
Downside: can only deal with FFTs of a factor of 4, such as N=4, 16, 64, 256, 1024, etc.
36. Radix-4 Basic Butterfly
37. Radix-4 Decimation-in-time FFT (N=16)
38. Radix-4 Decimation-in-frequency FFT (N=16)
39. Higher Radix FFTs Typically do not go to radix-8 or radix-16.
Why?
There do exist algorithms called split-radix algorithms which break FFT into radix-2 and radix-4 FFTs at the same time
Fewer number of multiplications than radix-2 or radix-4
However, the butterfly is irregular and so can be difficult to implement in VLSI
There are many other advanced methods of performing FFTs
40. Split Radix FFT Butterfly
41. Fast Fourier Transform Hardware Implementations
42. FFT Hardware Implementations Implementations of FFTs in hardware (FPGAs and ASICs) can generally be categorized in four categories:
Parallel
Serial/In-Place
Semi-Parallel
Pipeline
Other implementations can be roughly placed in these categories or are a hybrid of these categories
43. Parallel Implementation Implement entire FFT structure in a parallel fashion
Advantages: Control is easy (i.e. no controller), low latency (i.e. 0 cycles in this example), customize each twiddle factor as a multiplication by a constant
Disadvantages: Huge Area, Routing congestion
44. Parallel Implementation with Pipelining This example breaks critical path into a single butterfly (i.e. complex multiply and a complex add)
45. Serial/In-Place FFT Implementation Implement a single butterfly. Use that butterfly and some memory to compute entire FFT
Advantages: Small area
Disadvantages: Large latency, complex controller
46. Serial/In-Place FFT Butterfly
47. Serial/In-Place FFT Operation:
Put input data into data memory
Compute butterfly operation, after which OVERWRITE previous data words (since they are no longer needed)
Move to next butterfly
Example for N=8
Do x(0) and x(4) butterfly, store results in x(0) and x(4) registers
Do x(2) and x(6) butterfly, store results in x(2) and x(6) registers
To complete stage 1, requires (N/2) butterfly operations
Do this for each stage (i.e. stage 1, stage 2, stage 3 for N=8), total of log2N butterfly operations per stage
Total latency: (N/2) log2N cycles to complete an N-point FFT (assuming one butterfly per cycle)
There are many ways to "compress" the twiddle factor memory to take advantage of symmetries and subexpression sharing
48. Semi-Parallel FFT Implementation Implement an entire stage and use memory to store intermediate results (using same override technique as in serial/in-place FFTs)
Advantages: Less area than full parallel
Disadvantages: More complex controller than full parallel, cannot take advantage of fixed multipliers, longer latency than full parallel
49. Pipeline FFT Pipeline FFT is very common for communication systems (OFDM, DMT)
Implements an entire "slice" of the FFT and reuses-hardware to perform other slices
Advantages: Particularly good for systems in which x(n) comes in serially (i.e. no block assembly required), very fast, more area efficient than parallel, can be pipelined
Disadvantages: Controller can become complicated, large intermediate memories may be required between stages, latency of N cycles (more if pipelining introduced)
50. Pipeline FFT Classification Pipeline FFTs can be classified in two groups:
Feedforward structures (Commutator)
Feedback structures
Can also be categorized according to radix:
Radix-2 structures
Radix-4 structures
Etc.
The key to pipeline FFTs is the utilization factor (i.e. want all computations elements active as much as possible)
High utilization factor means higher pipeline FFT efficiency
51. N-Point Pipeline FFT Architectures
52. N-Point Pipeline FFT Architectures
53. Case Study: R22SDF Architecture (He and Torkelson)
54. R22SDF Architecture (He and Torkelson)
55. Hardware Comparison
56. FFT Implementation Issues
57. Complex Multiplier Complex multiplication: z = a x b
58. Complex Multiplier Method 1: 4 real multipliers, 2 real adders
(ar + jai) x (br + jbi)
= [(ar x br) - (ai x bi)] + j[(ai x br) + (ar x bi)]
= zr + jzi
59. Optimized Complex Multiplier Method 2: 3 real multipliers, 5 real adders
(ar + jai) x (br + jbi)
= [ar x (br + bi) - (ar + ai) x bi] + j[ar x (br + bi) - (ar - ai) x br]
= zr + jzi
60. Optimized Complex Multiplier for FFTs Method 3: 3 real multipliers, 3 real adders
If assume a is the twiddle factor, then pre-compute and store (ar + ai) and (ar - ai) in addition to ar
(ar + jai) x (br + jbi)
= [ar x (br + bi) - (ar + ai) x bi] + j[ar x (br + bi) - (ar - ai) x br]
= zr + jzi
61. Xilinx SRL16 Optimization Xilinx SRL16 inside a LUT
62. Fixed point operation will bring significant SQNR (signal to quantization noise ratio) when dealing with a large number of sample
The SQNR decreases as N increases
SQNR ~ 1/4N and SQNR ~ 22B
N-Point FFT; B-Data Width
Based on Welch (1969) [referred to by chapter 9 of Oppenheim] SQNR Considerations
63. FFT Scaling The output of an N-point FFT with a B-bit complex input requires additional log2(N) bits for the output
Input = B bits
Output = B + log2(N) bits
May also require an extra bit for rounding/quantization effects
Solution: divide by two after every radix-2 stage to prevent wordlength growth
Divide by 2 using traditional two's complement rounding (truncation, round half up) has bias ? introduces DC spur
Need more advanced rounding techniques
64. Two's Complement Quantization Bias For division by 2, Truncation eliminates everything to the right of the new LSB
For division by 2, Round (half up) adds '1' to the right of the new LSB, then truncates everything to the right of the new LSB
65. Unbiased Rounding Scenarios Sign Bit Based Rounding
MSB 0 ?round half-up ? positive bias
MSB 1 ? truncate ? negative bias
When the numbers are uniformly distributed, no bias
Requires round hardware at every node
Randomized
Requires LFSR or source of randomization
Balanced Stages Rounding
Good choice for pipeline FFTs
66. SQNR of R22SDF (using 16-bit datapath)
67. Twiddle Factor Storage and Compression Various advanced techniques exist to reduce the number of element stored in a twiddle factor ROM
Exploit symmetry and other properties of the twiddle factor
68. FFT Error Analysis To be covered in class