
Acceleration of Cooley-Tukey algorithm using Maxeler machine

Acceleration of Cooley-Tukey algorithm using Maxeler machine. Author: Nemanja Trifunović. Mentor: Professor dr. Veljko Milutinović. Introduction: Cooley-Tukey algorithm, Fast Fourier Transform, divide and conquer.


Presentation Transcript


  1. Acceleration of Cooley-Tukey algorithm using Maxeler machine. Author: Nemanja Trifunović. Mentor: Professor dr. Veljko Milutinović.

  2. Introduction
     • Cooley-Tukey algorithm
     • Fast Fourier Transform
     • Divide and conquer
     • Uses: digital signal processing, telecommunications, the analysis of sound signals, …
     • Maxeler platform
     • Data flow (vs. control flow)
     • FPGA
     Figure: Example of a Fourier transformation. (Source: https://en.wikipedia.org/wiki/File:Rectangular_function.svg; https://en.wikipedia.org/wiki/File:Sinc_function_(normalized).svg. The illustration is published under a Creative Commons licence.)

  3. Problem statement
     Design and implementation of:
     • The fastest possible system for calculating the Fast Fourier Transform using a Maxeler machine.
     • A system that will outperform currently existing solutions to this problem.

  4. Benefits of calculating Fast Fourier Transform with Maxeler machines
     Benefits
     • Higher speed of calculation.
     • Lower power consumption.
     • Lower space consumption.
     Conditions
     • Huge amounts of data.

  5. Conditions and assumptions
     • A Maxeler machine was used.
     • Two Maxeler cards of type MAX3424A.
     • In experiments with multiprocessor systems only one processor core was used.

  6. Overview of existing solutions
     • FFT algorithms: Prime-factor, Bruun’s, Rader’s, Winograd, Bluestein’s, …
     • The time complexity: O(N log N).
     • Performance comparison of publicly available implementations.
     • Matteo Frigo and Steven G. Johnson (from MIT)

  7. Illustration of Matteo Frigo’s and Steven G. Johnson’s experiments. (Source: http://www.fftw.org/speed/Pentium4-3.60GHz-icc)

  8. The proposed solution
     • Parallelized radix-2 algorithm.
     • Pipeline of depth O(log N), where N is the length of the input sequence.
     • Latency is proportional to the depth of the pipeline.
     • After the initial delay (latency), one result is produced in every cycle.
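The timing behaviour described on this slide can be sketched as a small cost model. This is an illustrative Python model, not the authors' MaxJ design: the function names `pipeline_depth` and `total_cycles` and the parameter `cycles_per_stage` are hypothetical, and the model assumes one butterfly stage per pipeline level and one new input sequence entering the pipeline per cycle.

```python
def pipeline_depth(n: int) -> int:
    """Number of butterfly stages (log2 n) for a radix-2 FFT of length n."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    return n.bit_length() - 1


def total_cycles(n: int, transforms: int, cycles_per_stage: int = 1) -> int:
    """Cycles needed to push `transforms` sequences through the pipeline.

    Assumption: the first result appears after the pipeline latency,
    and every following cycle delivers one more result.
    """
    if transforms <= 0:
        return 0
    latency = pipeline_depth(n) * cycles_per_stage
    return latency + (transforms - 1)


if __name__ == "__main__":
    for count in (1, 100, 1_000_000):
        print(count, total_cycles(64, count))
```

Under these assumptions the latency term is fixed, so its share of the total shrinks as the number of consecutive transforms grows, which is consistent with the deck's emphasis on consecutive calculations.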

  9. Formal analysis
     The radix-2 Cooley-Tukey algorithm operates as follows:
     • The input sequence is divided into two equal subsequences: the even-indexed elements form the first and the odd-indexed elements form the second.
     • Then, using the calculated DFTs of the subsequences, the DFT of the whole sequence is calculated.
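The two steps above can be sketched as a textbook recursive radix-2 FFT. This is a plain Python reference version for illustration only; the name `fft_recursive` is hypothetical, and the code is not the pipelined Maxeler kernel from the presentation.

```python
import cmath


def fft_recursive(x):
    """Radix-2 Cooley-Tukey FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]

    # Step 1: split into even-indexed and odd-indexed subsequences.
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])

    # Step 2: combine the two half-size DFTs into the full DFT.
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


if __name__ == "__main__":
    print(fft_recursive([1, 2, 3, 4]))
```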

  10. Formal analysis
      The DFT of the even subsequence is denoted by $E_k$, the DFT of the odd subsequence is denoted by $O_k$, and $e^{-2\pi i k/N}$ is denoted by $W_N^k$.
      A detailed derivation of the formula that combines them is given in the paper.
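For reference, the standard radix-2 butterfly relation written in this notation is given below. The slide's own formula did not survive in the transcript, so this is the textbook form, assumed to match what the slide showed; $X_k$ denotes the DFT of the whole sequence of length $N$.

```latex
X_k = E_k + W_N^k \, O_k, \qquad
X_{k+N/2} = E_k - W_N^k \, O_k, \qquad
k = 0, 1, \dots, \tfrac{N}{2} - 1
```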

  11. Illustration of pipelined execution of the radix-2 algorithm.

  12. Measurement and analysis of the performance of the proposed implementation
      Types of performed experiments
      • Calculation of the Fourier transform of 100, 1,000, 10,000, 1,000,000 and 10,000,000 consecutive input sequences of length 8, 16, 32 and 64 points.
      • Maxeler implementation vs. reference CPU implementation
      • Maxeler implementation vs. best publicly available implementations

  13. Generated graphs:
      • Maxeler vs. best publicly available implementations of the FFT algorithm.
      • Run-times, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).
      • Acceleration obtained using the Maxeler machine, compared to CPU execution, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).

  14. The average execution time, in seconds, of publicly available algorithms for calculating the FFT on different architectures, for an input sequence of 8 elements.

  15. Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of elements in the input sequence.

  16. Computation time of consecutive fast Fourier transforms, in seconds, depending on the number of consecutive calculations.

  17. Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of consecutive calculations.

  18. Analysis of scalability and bottlenecks of the proposed solution
      • Transfer of data to and from the Maxeler card
      • Limited number of hardware resources on a single Maxeler card
      • Limited number of Maxeler cards

  19. Analysis of implementation
      The Maxeler implementation of the Cooley-Tukey algorithm consists of:
      • Rearrangement of the input sequence into bit-reversed order, and
      • The radix-2 algorithm.
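The two parts listed on this slide can be sketched in software as a bit-reversal permutation followed by an iterative radix-2 pass. This is a minimal Python reference model with hypothetical function names (`bit_reverse_permute`, `fft_iterative`); the actual kernel is written in MaxJ and is not reproduced in the transcript.

```python
import cmath


def bit_reverse_permute(x):
    """Return a copy of x with element i moved to the bit-reversed index of i."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        rev, value = 0, i
        for _ in range(bits):
            rev = (rev << 1) | (value & 1)
            value >>= 1
        out[rev] = x[i]
    return out


def fft_iterative(x):
    """Iterative radix-2 FFT: bit reversal, then log2(n) butterfly levels."""
    a = [complex(v) for v in bit_reverse_permute(x)]
    n = len(a)
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                t = w * a[start + k + half]
                u = a[start + k]
                a[start + k] = u + t
                a[start + k + half] = u - t
        size *= 2
    return a


if __name__ == "__main__":
    print(bit_reverse_permute(list(range(8))))
    print(fft_iterative([1, 2, 3, 4]))
```

Reordering the input first means every butterfly level reads and writes neighbouring blocks in place, which is what makes a level-by-level pipeline natural.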

  20. Illustration of the kernel

  21. Implementation details
      • Two input and two output streams
      • These streams are of type arrayType:
          DFEType floatType = dfeFloat(8,24);
          DFEArrayType<DFEVar> arrayType = new DFEArrayType<DFEVar>(floatType, n);
      • The coefficients $W_N^k$ (twiddle factors) aren’t calculated on the Maxeler machine
      • Parameters:
        • N
        • first_level
        • last_level
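One way to read these details is sketched below in Python. The slide does not define its parameters, so the semantics here are assumptions: the twiddle factors are taken to be precomputed on the host and passed in as a table, and `first_level` and `last_level` are taken to select an inclusive, 0-based range of butterfly levels applied to data that is already in bit-reversed order. The function names `twiddle_table` and `apply_levels` are hypothetical.

```python
import cmath


def twiddle_table(n):
    """Host-side table of W_n^k = exp(-2*pi*i*k/n) for k = 0 .. n/2 - 1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def apply_levels(data, twiddles, first_level, last_level):
    """Apply butterfly levels first_level..last_level (assumed inclusive).

    `data` must already be in bit-reversed order; level 0 combines pairs,
    level 1 combines blocks of four, and so on.
    """
    n = len(data)
    a = [complex(v) for v in data]
    for level in range(first_level, last_level + 1):
        half = 1 << level
        size = 2 * half
        stride = n // size  # W_size^k equals W_n^(k * stride)
        for start in range(0, n, size):
            for k in range(half):
                t = twiddles[k * stride] * a[start + k + half]
                u = a[start + k]
                a[start + k] = u + t
                a[start + k + half] = u - t
    return a


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7, 8]
    reordered = [samples[i] for i in (0, 4, 2, 6, 1, 5, 3, 7)]
    table = twiddle_table(8)
    # Levels 0..1 and level 2 chained give the same result as levels 0..2.
    partial = apply_levels(reordered, table, 0, 1)
    print(apply_levels(partial, table, 2, 2))
```

Under this reading, a range of levels could be assigned to each kernel or card and the results chained, but the transcript does not confirm that this is how the authors use the parameters.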

  22. Conclusion
      • It is shown that the proposed solution has the expected performance and that it works correctly.
      • The performance of the proposed solution is better than that of any publicly available implementation of the Fast Fourier Transform.
      • Achieving these speedups requires consecutive calculations of the Fast Fourier Transform.

  23. Q/A. Thank you for your attention.
