
Efficient FFTs On VIRAM



Presentation Transcript


  1. Efficient FFTs On VIRAM Randi Thomas and Katherine Yelick Computer Science Division University of California, Berkeley 11-15-99 {randit, yelick}@cs.berkeley.edu

  2. Outline • Introduction • Why Study the FFT? • VIRAM architecture and implementation • About the FFT • The “Naïve” Algorithm • 3 Optimizations to the “Naïve” Algorithm • Performance Results • Conclusions

  3. Introduction • What is IRAM? • IRAM is a project at Berkeley exploring • an unconventional microprocessor design • combines logic & embedded DRAM: “Intelligent RAM” • single-chip system = low power and high performance • suitable for multimedia applications • What is VIRAM? • VIRAM is a Vector architecture for IRAM • combines a vector processor with embedded DRAM • suitable for portable devices • What is the FFT? • The Fast Fourier Transform converts: • a time-domain function into a frequency spectrum

  4. Outline • Introduction • Why Study the FFT? • VIRAM architecture and implementation • About the FFT • The “Naïve” Algorithm • 3 Optimizations to the “Naïve” Algorithm • Performance Results • Conclusions

  5. Why Study The FFT? • 1D Fast Fourier Transforms (FFTs) are: • Critical for many signal processing problems • Used widely for filtering in Multimedia Applications • Image Processing • Speech Recognition • Audio & video • Graphics • Important in many Scientific Applications • The building block for 2D/3D FFTs All of these are VIRAM target applications!

  6. Outline • Introduction • Why Study the FFT? • VIRAM architecture and implementation • About the FFT • The “Naïve” Algorithm • 3 Optimizations to the “Naïve” Algorithm • Performance Results • Conclusions

  7. VIRAM Implementation • “System on a chip” • Scalar processor: 200 MHz “vanilla” MIPS core • Vector processor: 200 MHz, 4 vector pipes/lanes • Embedded DRAM: 32 MB in 16 banks (two 128-Mbit / 16-MByte blocks) • Memory crossbar: 25.6 GB/s • I/O: 4 x 100 MB/sec • 17 mm x 17 mm die; 1.2 volts; 2-watt power target • Power/area/cost/bandwidth advantages over multi-chip systems [Floorplan figure: two 16-MByte DRAM blocks flanking the CPU + cache, the 4 vector pipes/lanes, the memory crossbar, and I/O]

  8. Why Vectors For IRAM? • Low-complexity architecture • means lower power and area • Takes advantage of on-chip memory bandwidth • 100x the bandwidth of workstation memory hierarchies • High performance for applications with fine-grained parallelism • Delayed pipeline hides memory latency • Therefore no cache is necessary • further conserves power and area • Greater code density than VLIW designs like: • TI’s TMS320C6000 • Motorola/Lucent StarCore • Analog Devices’ TigerSHARC • Siemens (Infineon) Carmel

  9. Scalable VIRAM Design [Figure: the four 64-bit lanes, each subdivided into two 32-bit or four 16-bit virtual lanes, with vector elements VL 1-16 striped across the lanes] • The vector processor has four 64-bit pipelines (lanes) • Each lane has: • 2 integer functional units • 1 floating-point functional unit • All functional units have a multiply-add operation that completes in 1 cycle • Each lane can be subdivided into: • two 32-bit virtual lanes • four 16-bit virtual lanes • Nothing about the architecture fixes the # of lanes: • Easily scales down to a lower-power version • Or up to a higher-performance version

  10. Peak Performance of VIRAM Implementation • Peak performance of VIRAM (4 lanes at 200 MHz):

  Data type                 Operation mix        Ops per cycle        Peak performance
  16-bit integer            all multiply-adds    64 integer           12.8 GOP/s
  16-bit integer            no multiply-adds     32 integer           6.4 GOP/s
  32-bit integer            all multiply-adds    32 integer           6.4 GOP/s
  32-bit integer            no multiply-adds     16 integer           3.2 GOP/s
  32-bit single precision   all multiply-adds    16 floating point    3.2 GFLOP/s
  32-bit single precision   no multiply-adds     8 floating point     1.6 GFLOP/s

  • 64-bit double precision is also supported, but it is beyond the scope of this talk
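Each peak follows directly from ops/cycle times the 200 MHz clock: for example, 64 integer operations per cycle x 200 MHz = 12.8 GOP/s, and each halving of the ops/cycle figure halves the corresponding peak.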

  11. Outline • Introduction • Why Study the FFT? • VIRAM architecture and implementation • About the FFT • The “Naïve” Algorithm • 3 Optimizations to the “Naïve” Algorithm • Performance Results • Conclusions

  12. Computing the DFT (Discrete FT) • Given the N-element vector x, its 1D DFT is another N-element vector y, given by the formula: y_j = Σ_{k=0}^{N-1} ω^{jk} x_k • where ω = e^{-2πi/N} is a primitive N-th root of unity, so ω^{jk} is its jk-th power • N is referred to as the number of points • The FFT (Fast FT) • Uses algebraic identities to compute the DFT in O(N log N) steps • The computation is organized into log2(N) stages • for the radix-2 FFT
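For reference, here is a minimal C sketch of the O(N^2) DFT evaluated directly from the formula above; the function name `dft` and the use of C99 complex types are illustrative, not from the paper. The FFT computes the same result in O(N log N).

```c
#include <complex.h>
#include <math.h>

/* Direct O(N^2) evaluation of y_j = sum_k x_k * w^(jk), w = e^(-2*pi*i/N). */
void dft(int n, const float complex *x, float complex *y)
{
    const float two_pi = 6.283185307f;
    for (int j = 0; j < n; j++) {
        float complex sum = 0.0f;
        for (int k = 0; k < n; k++) {
            float ang = -two_pi * (float)(j * k) / (float)n;
            sum += x[k] * (cosf(ang) + I * sinf(ang));  /* x_k * w^(jk) */
        }
        y[j] = sum;
    }
}
```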

  13. Computing A Complex FFT • Basic computation for a radix-2 FFT (the butterfly): X_0' = X_0 + w*X_{N/2} and X_{N/2}' = X_0 - w*X_{N/2} • The X_i are the data points • w is a “root of unity” • The basic computation on VIRAM: • 2 multiply-adds + 2 multiplies + 4 adds = 8 operations • 2 GFLOP/s is the VIRAM peak performance for this mix of instructions
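A minimal scalar-C sketch of this butterfly, assuming C99 complex types; on VIRAM, each arithmetic line below maps onto vector instructions that operate on a whole butterfly group at once.

```c
#include <complex.h>

/* One radix-2 butterfly: a' = a + w*b, b' = a - w*b.
 * In real arithmetic the complex multiply costs 2 multiplies + 2
 * multiply-adds, and the complex add/subtract pair costs 4 adds:
 * the 8-operation mix cited on the slide. */
static inline void butterfly(float complex *a, float complex *b, float complex w)
{
    float complex t = w * (*b);
    *b = *a - t;
    *a = *a + t;
}
```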

  14. VIRAM Implementation Terms • The Maximum Vector Length (MVL): • Is the maximum number of elements that one vector register can hold • Is set by the architecture, based on the data width the algorithm uses. For: • 64-bit data, MVL = 32 elements/vector register • 32-bit data of any kind, MVL = 64 elements/vector register • 16-bit data, MVL = 128 elements/vector register • The Vector Length (VL): • Is the total number of elements to be computed • Is set by the algorithm: the inner for-loop (see the strip-mining sketch below) • A butterfly group is the set of elements that can be operated on in 1 FFT stage using the same basic computation & the same root of unity
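To illustrate how VL and MVL interact, here is a generic strip-mining sketch (not code from the paper): a loop over VL elements is broken into chunks of at most MVL, each chunk becoming one set of vector instructions.

```c
#define MVL 64  /* elements per vector register for 32-bit data */

/* Process vl elements in strips of at most MVL elements each. */
void strip_mine(float *data, int vl)
{
    for (int i = 0; i < vl; i += MVL) {
        int len = (vl - i < MVL) ? (vl - i) : MVL;  /* hardware vector length for this strip */
        for (int e = 0; e < len; e++)   /* scalar stand-in for one vector operation */
            data[i + e] *= 2.0f;
    }
}
```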

  15. Outline • Introduction • Why Study the FFT? • VIRAM architecture and implementation • About the FFT • The “Naïve” Algorithm • 3 Optimizations to the “Naïve” Algorithm • Performance Results • Conclusions

  16. Cooley-Tukey FFT Algorithm [Figure: dataflow of a 16-point Cooley-Tukey FFT through vector registers vr1 and vr2 over time: Stage 1 VL = 8, Stage 2 VL = 4, Stage 3 VL = 2, Stage 4 VL = 1] • vr1 + vr2 = 1 butterfly group; VL = vector length

  17. Vectorizing the FFT • The diagram (previous slide) illustrates the “naïve” vectorization • A stage vectorizes well when VL ≥ MVL • Poor HW utilization when VL is small (< MVL) • Later stages of the FFT have shorter vector lengths: the # of elements in one butterfly group is smaller in the later stages

  18. Naïve Algorithm: What Happens When Vector Lengths Get Short? • Performance peaks (1.4-1.8 GFLOP/s) while vector lengths are ≥ MVL • For all FFT sizes, 94% to 99% of the total time is spent doing the last 6 stages, when VL < MVL (= 64) • For a 1024-point FFT, only 60% of the work is done in the last 6 stages • Performance drops significantly when vector lengths < # lanes (= 8)

  19. Outline • Introduction • Why Study the FFT? • VIRAM architecture and implementation • About the FFT • The “Naïve” Algorithm • 3 Optimizations to the “Naïve” Algorithm • Performance Results • Conclusions

  20. Optimization #1: Add auto-increment • Automatically adds an increment to the current address in order to obtain the next address (sketched below) • Auto-increment helps to: • Reduce the scalar code overhead • Useful: • To jump to the next butterfly group in an FFT stage • For processing a sub-image of a larger image, to jump to the appropriate pixel in the next row
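A hedged sketch of the overhead in question (names hypothetical, `vector_load` a stand-in for a VIRAM vector memory instruction): without auto-increment, the scalar core performs the address bookkeeping between vector operations; the auto-increment mode folds the `addr += stride` update into the vector instruction itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VIRAM vector load of vl elements. */
static void vector_load(const float *addr, int vl) { (void)addr; (void)vl; }

void walk_butterfly_groups(const float *base, size_t stride, int ngroups, int vl)
{
    const float *addr = base;
    for (int g = 0; g < ngroups; g++) {
        vector_load(addr, vl);  /* load one butterfly group */
        addr += stride;         /* scalar update that auto-increment absorbs */
    }
}
```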

  21. Optimization #1: Add auto-increment • Small gain from auto-increment • For a 1024-point FFT: • 202 MFLOP/s without AI • 225 MFLOP/s with AI • Still, 94-99% of the time is spent in the last 6 stages, where VL < 64 • Conclusion: auto-increment helps, but scalar overhead is not the main source of the inefficiency

  22. Optimization #2: Memory Transposes • Reorganize the data layout in memory to maximize the vector length in later FFT stages • View the 1D vector as a 2D matrix • The reorganization is equivalent to a matrix transpose • Transposing the data in memory only works for N ≥ (2 * MVL) • Transposing in memory adds significant overhead • Increased memory traffic • cost is too high to make it worthwhile • Multiple transposes exacerbate the situation:

  FFT sizes      Number of transposes needed
  128            5
  256            3
  512 - 2048     2
  > 2048         1
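A minimal sketch of the “view the 1D vector as a 2D matrix” idea (illustrative, out-of-place for clarity): after transposing an N = rows x cols layout, elements that were a column apart become contiguous, so later stages see long vectors again.

```c
/* Transpose an N = rows*cols array viewed as a rows x cols matrix. */
void transpose(int rows, int cols, const float *in, float *out)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}
```

The slide's cost argument is visible here: every transpose touches all N elements, so each one adds a full extra pass of memory traffic.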

  23. Optimization #3: Register Transposes • Rearrange the elements in the vector registers • Provides a way to swap elements between 2 registers • What we want to swap: [Figure: Stage 1 swaps vr1’s high half with vr2’s low half, giving vr1 = {0 1 2 3 8 9 10 11} and vr2 = {4 5 6 7 12 13 14 15}; Stage 2 swaps quarter-register blocks, giving vr1 = {0 1 4 5 8 9 12 13} and vr2 = {2 3 6 7 10 11 14 15}; Stage 3 swaps alternating element pairs] • This behavior is hard to implement with one instruction in hardware

  24. Optimization #3: Register Transposes • Two instructions were added to the VIRAM Instruction Set Architecture (ISA): • vhalfup and vhalfdn: both move elements one-way between vector registers • vhalfup/dn: • Are extensions of already-existing ISA support for fast in-register reductions • Required minimal additional hardware support • mostly control lines • Much simpler and less costly than a general element-permutation instruction • which was rejected in the early VIRAM design phase • An elegant, inexpensive, powerful solution to the short-vector-length problem of the later stages of the FFT

  25. Optimization #3: Register Transposes • Stage 1 SWAP, starting from vr1 = {0 1 2 3 4 5 6 7} and vr2 = {8 9 10 11 12 13 14 15} • Three steps to swap elements: • Copy vr1 into vr3 (move) • Move vr2’s low half to vr1’s high half (vhalfup) • vr1 now done: {0 1 2 3 8 9 10 11} • Move vr3’s high half to vr2’s low half (vhalfdn) • vr2 now done: {4 5 6 7 12 13 14 15}
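A scalar C model of the three steps, assuming MVL = 8 for readability; `vhalfup`/`vhalfdn` here are simplified stand-ins for the VIRAM instructions (each moves only one half of a register and leaves the rest of the destination unchanged).

```c
#include <string.h>

#define MVL  8
#define HALF (MVL / 2)

/* dst's high half <- src's low half */
static void vhalfup(int *dst, const int *src) { memcpy(dst + HALF, src, HALF * sizeof *dst); }
/* dst's low half <- src's high half */
static void vhalfdn(int *dst, const int *src) { memcpy(dst, src + HALF, HALF * sizeof *dst); }

/* Stage 1 SWAP: {0..7},{8..15} -> {0 1 2 3 8 9 10 11},{4 5 6 7 12 13 14 15} */
void stage1_swap(int vr1[MVL], int vr2[MVL])
{
    int vr3[MVL];
    memcpy(vr3, vr1, sizeof vr3);  /* step 1: copy vr1 into vr3    */
    vhalfup(vr1, vr2);             /* step 2: vr2 low -> vr1 high  */
    vhalfdn(vr2, vr3);             /* step 3: vr3 high -> vr2 low  */
}
```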

  26. Optimization #3: Final Algorithm • The optimized algorithm has two phases: • The naïve algorithm is used for stages whose VL ≥ MVL • The vhalfup/dn code is used on stages whose VL < MVL (= the last log2(MVL) stages) • vhalfup/dn: • Eliminates the short-vector-length problem • Allows all vector computations to have VL equal to MVL • Multiple butterfly groups are done with 1 basic operation • Eliminates all loads/stores between stages • The optimized vhalf algorithm also does: • Auto-increment, software pipelining, code scheduling • The bit-reversal rearrangement of the results • Single-precision, floating-point, complex, radix-2 FFTs
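A structural sketch of the two-phase split described above; the helper names are illustrative stubs, not code from the paper.

```c
static int log2i(int n) { int s = 0; while (n > 1) { n >>= 1; s++; } return s; }

/* Stubs standing in for the real stage bodies. */
static void naive_stage(float *re, float *im, int n, int stage) { (void)re; (void)im; (void)n; (void)stage; }
static void vhalf_phase(float *re, float *im, int n) { (void)re; (void)im; (void)n; }

void fft_viram(float *re, float *im, int n, int mvl)
{
    int total = log2i(n);    /* log2(N) radix-2 stages overall          */
    int tail  = log2i(mvl);  /* the last log2(MVL) stages have VL < MVL */
    for (int s = 0; s < total - tail; s++)
        naive_stage(re, im, n, s);   /* phase 1: VL >= MVL, naive code   */
    vhalf_phase(re, im, n);          /* phase 2: in-register vhalf swaps */
}
```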

  27. Optimization #3: Register Transposes • Every vector instruction operates with VL=MVL • For all stages • Keeps the vector pipeline fully utilized • Time spent in the last 6 stages • drops to 60% to 80% of the total time

  28. Outline • Introduction • Why Study the FFT? • VIRAM architecture and implementation • About the FFT • The “Naïve” Algorithm • 3 Optimizations to the “Naïve” Algorithm • Performance Results • Conclusions

  29. Performance Results • Both naïve versions utilize the auto-increment feature • 1 does bit reversal, the other does not • Vhalfup/dn with and without bit reversal are identical • Bit reversing the results slows the naïve algorithm, but not vhalfup/dn

  30. Performance Results • The performance gap testifies to: • The effectiveness of the vhalfup/dn algorithm in fully utilizing VIRAM’s vector unit • The importance of the new vhalfup/dn instructions

  31. Performance Results • VIRAM is competitive with high-end specialized DSPs • Could match or exceed the performance of these DSPs if the VIRAM architecture were implemented commercially • These simulations: • Are an academic proof-of-concept implementation • Do not demonstrate the full potential of the architecture

  32. Outline • Introduction • Why Study the FFT? • VIRAM architecture and implementation • About the FFT • The “Naïve” Algorithm • 3 Optimizations to the “Naïve” Algorithm • Performance Results • Conclusions

  33. Conclusions • Optimizations to eliminate short vector lengths are necessary for doing the FFT • VIRAM is capable of performing FFTs at performance levels comparable to or exceeding those of high-end floating-point DSPs. It achieves this performance via: • A highly tuned algorithm designed specifically for VIRAM • A set of simple, powerful ISA extensions that underlie it • The efficient parallelism of vector processing embedded in a high-bandwidth on-chip DRAM memory • Performance of FFTs on VIRAM has the potential to improve significantly over the results presented here: • 32-bit fixed-point FFTs could run up to 2 times faster than the floating-point versions • There are twice as many integer functional units as floating-point functional units • Simulations are based on the current proof-of-concept VIRAM implementation, which has made compromises: • It trades off potential performance for ease of implementation in an academic setting

  34. Conclusions (2) • Since VIRAM includes both general-purpose CPU capability and DSP muscle, it shares the same space in the emerging market of hybrid CPU/DSPs as: • Infineon TriCore • Hitachi SuperH-DSP • Motorola/Lucent StarCore • Motorola PowerPC G4 (7400) • VIRAM’s vector processor plus embedded DRAM design may have further advantages over more traditional processors in: • Power • Area • Performance • To download a copy of my paper, or to see the upcoming fixed point performance graphs on VIRAM: http://www.cs.berkeley.edu/~randit/papers/viram-fft.ps

  35. Backup Slides

  36. Optimization #3: Register Transposes Using 2 New ISA Instructions • Stage 1: 1 BF (ButterFly), 4 #s/BF; Stage 2: 2 BFs, 2 #s/BF; Stage 3: 4 BFs, 1 #/BF • Transposes between vr1 & vr2 use the instruction sequence: • vhalfdn vr2, vr3, 2 • vhalfup vr1, vr2, 2 • move vr3, vr1, 2 • Then use vr1 & vr2 to compute: • vr1 <= vr1 + w*vr2 • vr2 <= vr1 - w*vr2 [Figure: per-stage contents of vr1, vr2, and vr3, showing the SWAPs between registers for each stage’s butterfly grouping]
