Acceleration of Cooley-Tukey algorithm using Maxeler machine
Author: Nemanja Trifunović
Mentor: Professor dr. Veljko Milutinović

Introduction
• Cooley-Tukey algorithm
• Fast Fourier Transform
• Divide and conquer
• Uses: digital signal processing, telecommunications, analysis of sound signals, …
• Maxeler platform
• Data flow (vs. control flow)
• FPGA
Example of a Fourier transform. (Source: https://en.wikipedia.org/wiki/File:Rectangular_function.svg; https://en.wikipedia.org/wiki/File:Sinc_function_(normalized).svg. The illustration is published under a Creative Commons license.)

Problem statement
Design and implementation of:
• The fastest possible system for calculating the Fast Fourier Transform on a Maxeler machine.
• A system that outperforms the existing solutions to this problem.

Benefits of calculating Fast Fourier Transform with Maxeler machines
Benefits
• Higher speed of calculation.
• Lower power consumption.
• Lower space consumption.
Conditions
• Huge amounts of data.

Conditions and assumptions
• Maxeler machine used:
• Two Maxeler cards of type MAX3424A.
• In experiments with multiprocessor systems, only one processor core was used.

Overview of existing solutions
• FFT algorithms: Prime-factor, Bruun’s, Rader’s, Winograd, Bluestein’s, …
• Time complexity: O(N log N).
• Performance comparison of publicly available implementations:
• Matteo Frigo and Steven G. Johnson (from MIT)

Illustration of Matteo Frigo’s and Steven G. Johnson’s experiments. (Source: http://www.fftw.org/speed/Pentium4-3.60GHz-icc)

The proposed solution
• Parallelized radix-2 algorithm.
• Pipeline of depth O(log N), where N is the length of the input sequence.
• Latency is proportional to the depth of the pipeline.
• After the initial delay (latency), one result is produced in every cycle.

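The latency and throughput claims above can be turned into a rough cycle count. The sketch below is only an illustrative model, not a description of the actual hardware: it assumes exactly one pipeline stage per radix-2 level (depth log2 N) and one cycle per stage, and the class and method names are hypothetical.

```java
public class PipelineModel {
    // Assumed pipeline depth: one stage per radix-2 level, i.e. log2(n) stages.
    static int depth(int n) {
        if (n < 2 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("n must be a power of two >= 2");
        }
        return Integer.numberOfTrailingZeros(n);
    }

    // Cycles until the last of `count` results leaves the pipeline:
    // `depth` cycles of latency for the first result, then one result per cycle.
    static long totalCycles(int n, long count) {
        if (count <= 0) {
            return 0;
        }
        return depth(n) + (count - 1);
    }

    public static void main(String[] args) {
        for (long count : new long[] {1, 100, 1_000_000}) {
            System.out.println(count + " results, N = 64: "
                    + totalCycles(64, count) + " cycles");
        }
    }
}
```

Under this model the latency term stays fixed while the total grows by one cycle per additional result, which is why long runs of consecutive transforms amortize the initial delay.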
Formal analysis
The radix-2 Cooley-Tukey algorithm operates as follows:
• The input sequence is divided into two subsequences of equal length: the even-indexed elements form the first, the odd-indexed elements form the second.
• The DFT of the whole sequence is then calculated from the DFTs of the two subsequences:
$$X_k = E_k + W_N^k O_k, \qquad X_{k+N/2} = E_k - W_N^k O_k, \qquad k = 0, \dots, N/2 - 1$$
Here $E_k$ denotes the DFT of the even subsequence, $O_k$ the DFT of the odd subsequence, and $W_N^k = e^{-2\pi i k/N}$. A detailed derivation of this formula is given in the paper.

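The even/odd split described in the formal analysis can be sketched as a plain recursive CPU routine. This is a minimal reference sketch, not the author's Maxeler kernel or CPU baseline; the class name, the method name, and the use of separate arrays for real and imaginary parts are choices made for this example.

```java
public class Radix2Fft {
    // Returns {re, im} of the DFT of the sequence re[j] + i*im[j].
    // The length must be a power of two.
    static double[][] fft(double[] re, double[] im) {
        int n = re.length;
        if (n == 0 || im.length != n || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        if (n == 1) {
            return new double[][] {{re[0]}, {im[0]}};
        }
        int half = n / 2;
        double[] evenRe = new double[half], evenIm = new double[half];
        double[] oddRe = new double[half], oddIm = new double[half];
        for (int j = 0; j < half; j++) {
            evenRe[j] = re[2 * j];
            evenIm[j] = im[2 * j];
            oddRe[j] = re[2 * j + 1];
            oddIm[j] = im[2 * j + 1];
        }
        double[][] e = fft(evenRe, evenIm); // E_k
        double[][] o = fft(oddRe, oddIm);   // O_k
        double[] outRe = new double[n], outIm = new double[n];
        for (int k = 0; k < half; k++) {
            // W = e^{-2*pi*i*k/n}
            double angle = -2.0 * Math.PI * k / n;
            double wRe = Math.cos(angle), wIm = Math.sin(angle);
            // t = W * O_k (complex multiplication)
            double tRe = wRe * o[0][k] - wIm * o[1][k];
            double tIm = wRe * o[1][k] + wIm * o[0][k];
            outRe[k] = e[0][k] + tRe;        // X_k = E_k + W * O_k
            outIm[k] = e[1][k] + tIm;
            outRe[k + half] = e[0][k] - tRe; // X_{k+n/2} = E_k - W * O_k
            outIm[k + half] = e[1][k] - tIm;
        }
        return new double[][] {outRe, outIm};
    }
}
```

For example, the input sequence 0, 1, 0, 0 yields 1, -i, -1, i, the four values of e^{-2πik/4}.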
Illustration of pipelined execution of the radix-2 algorithm.

Measurement and analysis of the performance of the proposed implementation
Types of performed experiments
• Calculation of the Fourier transform of 100, 1,000, 10,000, 1,000,000 and 10,000,000 consecutive input sequences of length 8, 16, 32 and 64 points.
• Maxeler implementation vs. reference CPU implementation
• Maxeler implementation vs. best publicly available implementations

Generated graphs:
• Maxeler vs. best publicly available implementations of the FFT algorithm.
• Run-times, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).
• Acceleration obtained using the Maxeler machine, compared to CPU execution, depending on the number of consecutive FFT calculations (for input sequences of length 8, 16, 32 and 64).

The average execution time in seconds of publicly available algorithms for calculating FFT on different architectures for an input sequence of 8 elements.

Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of elements in the input sequence.

Computation time of consecutive fast Fourier transforms, expressed in seconds, depending on the number of consecutive calculations.

Acceleration of the Maxeler implementation compared to the CPU implementation, depending on the number of consecutive calculations.

Analysis of scalability and bottlenecks of the proposed solution
• Transfer of data to and from the Maxeler card
• Limited number of hardware resources on a single Maxeler card
• Limited number of Maxeler cards

Analysis of implementation
The Maxeler implementation of the Cooley-Tukey algorithm consists of:
• Rearrangement of the input sequence in bit-reversed order and
• Radix-2 algorithm.

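The bit-reversal rearrangement named above can be illustrated on the CPU. The sketch below is a hypothetical helper written for this example, not code taken from the Maxeler kernel: it moves the element at index i to the index obtained by reversing the log2(N) bits of i.

```java
public class BitReversal {
    // Reverses the lowest `bits` bits of `index`.
    static int reverseBits(int index, int bits) {
        int reversed = 0;
        for (int b = 0; b < bits; b++) {
            reversed = (reversed << 1) | ((index >> b) & 1);
        }
        return reversed;
    }

    // Returns a copy of `input` rearranged in bit-reversed order.
    // The length must be a power of two.
    static double[] bitReverseOrder(double[] input) {
        int n = input.length;
        if (n == 0 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        int bits = Integer.numberOfTrailingZeros(n); // log2(n)
        double[] output = new double[n];
        for (int i = 0; i < n; i++) {
            output[reverseBits(i, bits)] = input[i];
        }
        return output;
    }
}
```

For N = 8, the input 0, 1, 2, 3, 4, 5, 6, 7 is rearranged to 0, 4, 2, 6, 1, 5, 3, 7. After this step, each radix-2 level combines neighbouring blocks of the array, which suits a level-by-level pipeline.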
Implementation details
• Two input and two output streams
• These streams are of type arrayType:
    DFEType floatType = dfeFloat(8, 24);
    DFEArrayType<DFEVar> arrayType = new DFEArrayType<DFEVar>(floatType, n);
• The coefficients $W_N^k$ aren’t calculated on the Maxeler machine
• Parameters:
• N
• first_level
• last_level

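Since the coefficients W_N^k are not calculated on the Maxeler machine, they have to be supplied from elsewhere. One possible approach, shown here purely as an assumption about how such a table could be prepared on the host, is to precompute them once in single precision (matching dfeFloat(8, 24)). The class and method names are hypothetical.

```java
public class TwiddleTable {
    // Returns {re, im} where re[k] + i*im[k] = e^{-2*pi*i*k/n}
    // for k = 0 .. n/2 - 1. Values are computed in double precision
    // and rounded to float.
    static float[][] twiddles(int n) {
        if (n < 2 || (n & (n - 1)) != 0) {
            throw new IllegalArgumentException("n must be a power of two >= 2");
        }
        int half = n / 2;
        float[] re = new float[half];
        float[] im = new float[half];
        for (int k = 0; k < half; k++) {
            double angle = -2.0 * Math.PI * k / n;
            re[k] = (float) Math.cos(angle);
            im[k] = (float) Math.sin(angle);
        }
        return new float[][] {re, im};
    }
}
```

Only N/2 values are needed, because the second half of the outputs reuses the same coefficients with the opposite sign.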
Conclusion
• It is shown that the proposed solution performs as expected and works correctly.
• The performance of the proposed solution is better than that of any publicly available implementation of the Fast Fourier Transform.
• To achieve these speedups, consecutive Fast Fourier Transform calculations are needed.

Q/A
Thank you for your attention.