ECE 699 Digital Signal Processing Hardware Implementations Lecture 6

1. ECE 699�Digital Signal Processing Hardware ImplementationsLecture 6 Fast Fourier Transforms 2/25/09

2. Outline Discrete Fourier Transform Fast Fourier Transform Radix-2 Decimation in Time Radix-2 Decimation in Frequency Radix-4 Algorithms FFT Hardware Implementations for FPGAs and ASICs Parallel FFT Serial FFT (In-Place FFT) Semi-Parallel FFT Pipeline FFT FFT Implementation Issues Complex Multiplication Wordlength Growth and Scaling Error Analysis

3. Reading Fast Fourier Transforms Proakis, Digital Signal Processing Chapter 8 (4th Edition) = Chapter 6 (3rd Edition) Pipeline FFT S. He, M. Torkelson, "A New Approach to Pipeline FFT Processor," Proc. 10th IEEE International Parallel Processing Symposium (IPPS), pp. 766-770, 1996.

4. Discrete Fourier Transform

5. Analog to Digital Conversion Consider an analog signal xa(t) = Acos(O t) = Acos(2p F t) A = amplitude O = 2p F = analog radian frequency (radians/s) F = analog frequency (cycles/s = Hz) P = 1/F = period of signal Perform sampling of analog signal at a rate of Fs (samples/sec) = sampling frequency or sampling rate Sampling period = T = 1 / Fs

6. Analog to Digital Conversion

7. Analog to Digital Conversion xa(nT) = x(n) = Acos(2p F n T) = A cos(2p n F/Fs) Define digital frequency f = F/Fs Define digital radian frequency ? = 2 p f X(n) = A cos (2p fn) = A cos (?n) A/D conversion also performs amplitude quantization converting voltage levels to bits (not discussed here) To prevent aliasing Fs >= 2Fmax Sampling frequency must be higher than twice the largest analog frequency component Fmax (which is F in this example) 2Fmax is called the Nyquist Rate. The sampling rate must be greater than or equal to Nyquist Rate to avoid aliasing. The range of f is - � < f < � and the range of w is �p < ? < p

8. Fourier Analysis of Signals Fourier analysis allows us to decompose a signal in the time domain and analyze/process the signal in the frequency domain

9. Fourier Transform Consider an aperiodic signal x(n), the Discrete-Time Fourier Transform (DTFT) of that signal is Note that X(?) is a continuous-valued signal (not discrete-valued) Due to this, it is computationally challenging to compute the Discrete-Time Fourier Transform (DTFT) The DTFT is periodic over ? with a period of 2p

10. Example of DTFT x(n) = A for 0 = n = L-1, 0 otherwise

11. Discrete Fourier Transform Due to computational difficulty in computing continuous-valued Discrete-Time Fourier Transform (DTFT), can use Discrete Fourier Transform (DFT) instead DFT is a "sampled" version of the DTFT Take N samples from 0 to 2p such that ? = 2 p k/N for k = 0 to N-1 Spectrum of an aperiodic discrete-time signal with finite duration L can be exactly recovered from its samples at frequency ? = 2 p k/N if N>= L [source: Proakis] N-point DFT Also written as

12. Inverse Discrete Fourier Transform Inverse DFT of X(0) to X(N-1) can perfectly reconstruct original time-domain signal x(0) to x(N-1)

13. DFT Properties N-point DFT maps N inputs in time domain: x(0) through x(N-1) to N output output in frequency domain: X(0) to X(N-1) Inherent block processing If x(n) is real then X(N-k) = X*(k) for k=1 to N/2-1 where * denotes complex conjugate |X(k)| is symmtric about k=N/2 Even if x(n) is real, X(k) is usually complex (except for X(0) and X(N/2)) X(0) denotes DC frequency, X(N/2) denotes Fs/2 frequency

14. Discrete Time Fourier Transform

15. Discrete Fourier Transform (N=50)

16. Discrete Fourier Transform (N=100)

17. Matlab Example: f=0 (DC), N=32-point DFT n=0:31; f=0.0; x = cos(2*pi*n*f); figure(1); stem(x); figure(2); stem(abs(fft(x)));

18. Matlab Example: f=0.25, N=32-point DFT n=0:31; f=0.25; x = cos(2*pi*n*f); figure(1); stem(x); figure(2); stem(abs(fft(x)));

19. Matlab Example: f=0.5 (analog F = Fs/2), N=32-point DFT n=0:31; f=0.5; x = cos(2*pi*n*f); figure(1); stem(x); figure(2); stem(abs(fft(x)));

20. Matlab Example: random input, , N=32-point DFT n=0:31; x = -1 + (1 - -1).*rand(length(n),1); figure(1); stem(x); figure(2); stem(abs(fft(x)));

21. Example DFT Applications Spectral Estimation Frequency-Domain Filtering Interpolation Implementing MCM (multi-carrier modulation) and OFDM (orthogonal frequency division multiplexing) communication systems Correlation Analysis

22. Fast Fourier Transform

23. Discrete Fourier Transform Computations The DFT equation: WN's are also called "twiddle factors" which are complex values around the unit circle in the complex plane Multiplication by twiddle factor serve to "rotate" a value around the unit circle in the complex plane Computations: To compute X(0), require N complex multiplications To compute X(1), require N complex multiplications ..To compute X(N-1), require N complex multiplications Total: N2 complex multiplications and N2 � N complex additions to compute N-point DFT

24. Fast Fourier Transform Can exploit shared twiddle factor properties (i.e. sub-expression sharing) to reduce the number of multiplications in DFT These class of algorithms are called Fast Fourier Transforms An FFT is simply an efficient implementation of the DFT Mathematically FFT = DFT FFT exploits two properties in the twiddle factors: Symmetry Property: Periodicity Property: FFTs use a divide and conquer approach, breaking an N-point DFT into several smaller DFTs N can be factored as N=r1r2r2�rv where the {ri} are prime Particular focus on r1=r2=..=rv=r, where r is called the radix of the FFT algorithm In this case N=rv and the FFT has a regular pattern We will study radix-2 (r=2) and radix-4 (r=4) FFTs in this class

25. Decimation-in-time Radix-2 FFT Split x(n) into even and odd samples and perform smaller FFTs f1(n) = x(2n) f2(n) = x(2n+1) n=0, 1, � N/2-1 Derivation performed in class Radix-2 Decimation-in-time (DIT) algorithm In radix-2, the "butterfly" element takes in 2 inputs and produces 2 outputs Butterfly implements 2-point FFT Computations: (N/2)log2N complex multiplications Nlog2N complex additions

26. Decimation-in-time Radix-2 FFT (N=8)

27. Decimation-in-time Radix-2 FFT (N=8)

28. Radix-2 DIT Basic Butterfly

29. Decimation-in-frequency Radix-2 FFT Decompose X(k) such that it is split into FFT of points 0 to N/2-1 and points N/2 to N-1 Then decimate X(k) into even and odd numbered samples Derivation performed in class Radix-2 Decimation-in-frequency (DIF) algorithm In radix-2, the "butterfly" element takes in 2 inputs and produces 2 outputs Butterfly implements 2-point FFT Computations: (N/2)log2N complex multiplications Nlog2N complex additions

30. Decimation-in-frequency Radix-2 FFT (N=8)

31. Decimation-in-frequency Radix-2 FFT (N=8)

32. Radix-2 DIF Basic Butterfly

33. Note on bit-reversed order For radix-2 decimation in frequency FFT If x(n) is in natural order (n=0,1,2,3,4,5,6,7), then output X(k) is in bit-reversed order (k=0,4,2,6,1,5,3,7) Vice-versa also holds true For radix-2 decimation in time FFT If x(n) is in bit-reversed order (n=0,4,2,6,1,5,3,7), then output X(k) is in natural order (k=0,1,2,3,4,5,6,7) Vice-versa also holds true Bit-reversing is an inherent property of FFTs Radix-4 bit-reversing can be slightly different depending on the algorithm

34. Bit-Reversing

35. Radix-4 FFT In radix-2 you have log2N stages Can also implement radix-4 and now have log4N stages Radix-4 Decimation-in-time: split x(n) into four time sequences instead of two Derivation performed in class Split x(n) into four decimated sample streams f1(n) = x(4n) f2(n) = x(4n+1) f3(n) = x(4n+2) f4(n) = x(4n+3) n=0, 1, .. N/4-1 Radix-4 Decimation-in-time (DIT) algorithm In radix-4, the "butterfly" element takes in 4 inputs and produces 4 outputs Butterfly implements 4-point FFT Computations: (3N/4)log4N = (3N/8)log2N complex multiplications ? decrease from radix-2 algorithms (3N/2)log2N complex additions ? increase from radix-2 algorithms Downside: can only deal with FFTs of a factor of 4, such as N=4, 16, 64, 256, 1024, etc.

36. Radix-4 Basic Butterfly

37. Radix-4 Decimation-in-time FFT (N=16)

38. Radix-4 Decimation-in-frequency FFT (N=16)

39. Higher Radix FFTs Typically do not go to radix-8 or radix-16. Why? There do exist algorithms called split-radix algorithms which break FFT into radix-2 and radix-4 FFTs at the same time Fewer number of multiplications than radix-2 or radix-4 However, the butterfly is irregular and so can be difficult to implement in VLSI There are many other advanced methods of performing FFTs

40. Split Radix FFT Butterfly

41. Fast Fourier Transform Hardware Implementations

42. FFT Hardware Implementations Implementations of FFTs in hardware (FPGAs and ASICs) can generally be categorized in four categories: Parallel Serial/In-Place Semi-Parallel Pipeline Other implementations can be roughly placed in these categories or are a hybrid of these categories

43. Parallel Implementation Implement entire FFT structure in a parallel fashion Advantages: Control is easy (i.e. no controller), low latency (i.e. 0 cycles in this example), customize each twiddle factor as a multiplication by a constant Disadvantages: Huge Area, Routing congestion

44. Parallel Implementation with Pipelining This example breaks critical path into a single butterfly (i.e. complex multiply and a complex add)

45. Serial/In-Place FFT Implementation Implement a single butterfly. Use that butterfly and some memory to compute entire FFT Advantages: Small area Disadvantages: Large latency, complex controller

46. Serial/In-Place FFT Butterfly

47. Serial/In-Place FFT Operation: Put input data into data memory Compute butterfly operation, after which OVERWRITE previous data words (since they are no longer needed) Move to next butterfly Example for N=8 Do x(0) and x(4) butterfly, store results in x(0) and x(4) registers Do x(2) and x(6) butterfly, store results in x(2) and x(6) registers To complete stage 1, requires (N/2) butterfly operations Do this for each stage (i.e. stage 1, stage 2, stage 3 for N=8), total of log2N butterfly operations per stage Total latency: (N/2) log2N cycles to complete an N-point FFT (assuming one butterfly per cycle) There are many ways to "compress" the twiddle factor memory to take advantage of symmetries and subexpression sharing

48. Semi-Parallel FFT Implementation Implement an entire stage and use memory to store intermediate results (using same override technique as in serial/in-place FFTs) Advantages: Less area than full parallel Disadvantages: More complex controller than full parallel, cannot take advantage of fixed multipliers, longer latency than full parallel

49. Pipeline FFT Pipeline FFT is very common for communication systems (OFDM, DMT) Implements an entire "slice" of the FFT and reuses-hardware to perform other slices Advantages: Particularly good for systems in which x(n) comes in serially (i.e. no block assembly required), very fast, more area efficient than parallel, can be pipelined Disadvantages: Controller can become complicated, large intermediate memories may be required between stages, latency of N cycles (more if pipelining introduced)

50. Pipeline FFT Classification Pipeline FFTs can be classified in two groups: Feedforward structures (Commutator) Feedback structures Can also be categorized according to radix: Radix-2 structures Radix-4 structures Etc. The key to pipeline FFTs is the utilization factor (i.e. want all computations elements active as much as possible) High utilization factor means higher pipeline FFT efficiency

51. N-Point Pipeline FFT Architectures

52. N-Point Pipeline FFT Architectures

53. Case Study: R22SDF Architecture (He and Torkelson)

54. R22SDF Architecture (He and Torkelson)

55. Hardware Comparison

56. FFT Implementation Issues

57. Complex Multiplier Complex multiplication: z = a x b

58. Complex Multiplier Method 1: 4 real multipliers, 2 real adders (ar + jai) x (br + jbi) = [(ar x br) - (ai x bi)] + j[(ai x br) + (ar x bi)] = zr + jzi

59. Optimized Complex Multiplier Method 2: 3 real multipliers, 5 real adders (ar + jai) x (br + jbi) = [ar x (br + bi) - (ar + ai) x bi] + j[ar x (br + bi) - (ar - ai) x br] = zr + jzi

60. Optimized Complex Multiplier for FFTs Method 3: 3 real multipliers, 3 real adders If assume a is the twiddle factor, then pre-compute and store (ar + ai) and (ar - ai) in addition to ar (ar + jai) x (br + jbi) = [ar x (br + bi) - (ar + ai) x bi] + j[ar x (br + bi) - (ar - ai) x br] = zr + jzi

61. Xilinx SRL16 Optimization Xilinx SRL16 inside a LUT

62. Fixed point operation will bring significant SQNR (signal to quantization noise ratio) when dealing with a large number of sample The SQNR decreases as N increases SQNR ~ 1/4N and SQNR ~ 22B N-Point FFT; B-Data Width Based on Welch (1969) [referred to by chapter 9 of Oppenheim] SQNR Considerations

63. FFT Scaling The output of an N-point FFT with a B-bit complex input requires additional log2(N) bits for the output Input = B bits Output = B + log2(N) bits May also require an extra bit for rounding/quantization effects Solution: divide by two after every radix-2 stage to prevent wordlength growth Divide by 2 using traditional two's complement rounding (truncation, round half up) has bias ? introduces DC spur Need more advanced rounding techniques

64. Two's Complement Quantization Bias For division by 2, Truncation eliminates everything to the right of the new LSB For division by 2, Round (half up) adds '1' to the right of the new LSB, then truncates everything to the right of the new LSB

65. Unbiased Rounding Scenarios Sign Bit Based Rounding MSB 0 ?round half-up ? positive bias MSB 1 ? truncate ? negative bias When the numbers are uniformly distributed, no bias Requires round hardware at every node Randomized Requires LFSR or source of randomization Balanced Stages Rounding Good choice for pipeline FFTs

66. SQNR of R22SDF (using 16-bit datapath)

67. Twiddle Factor Storage and Compression Various advanced techniques exist to reduce the number of element stored in a twiddle factor ROM Exploit symmetry and other properties of the twiddle factor

68. FFT Error Analysis To be covered in class

ECE 699 Digital Signal Processing Hardware Implementations Lecture 6

ECE 699 Digital Signal Processing Hardware Implementations Lecture 6

Presentation Transcript

EcE 5013 Digital Signal Processing

ECE 699 Digital Signal Processing Hardware Implementations Lecture 5

Digital Signal Processing II Lecture 6: Filter Banks - Preliminaries

Digital Signal Processing

ECE 699 Digital Signal Processing Hardware Implementations Lecture 7

DIGITAL SIGNAL PROCESSING

EcE 5013 Digital Signal Processing

Lecture 6: Signal Processing III

Digital Audio Signal Processing Lecture 6 : Reverberation & Dereverberation

Digital Signal Processing

Digital Signal Processing

Digital Signal Processing

Digital Signal Processing

EE311 - Digital Signal Processing (Lecture # 04)

DIGITAL SIGNAL PROCESSING

EE311 - Digital Signal Processing (Lecture # 18)

Digital Signal Processing

EE311 - Digital Signal Processing (Lecture # 16)

Digital Signal Processing

Digital Signal Processing

Digital signal processing

ECE 699 Digital Signal Processing Hardware Implementations Lecture 6