320 likes | 464 Views
Optimizing Radix-2 FFT Algorithm for PowerPC Architecture. 9/24/2007 PEREZ-SEVA Jean-Paul, THALES Computers. The authors. Jean-Paul Perez-Seva Thales Computers - www.thalescomputers.com Serge Tissot Thales Computers - www.thalescomputers.com Michel Cosnard INRIA - www.inria.fr
E N D
Optimizing Radix-2 FFT Algorithm for PowerPC Architecture 9/24/2007 PEREZ-SEVA Jean-Paul, THALES Computers
The authors • Jean-Paul Perez-Seva • Thales Computers - www.thalescomputers.com • Serge Tissot • Thales Computers - www.thalescomputers.com • Michel Cosnard • INRIA - www.inria.fr • Ghislain Oudinet • ISEN - www.isen.fr • Joseph Eicher • Thales Computers - www.thalescomputers.com
Objectives • Highlight the advantages of PowerPC architectures • Smart and complete ISA • Powerful SIMD engine • Optimize a Fast Fourier Transform algorithm • Single-precision complex FFT algorithm • Show the efficiency of the Radix-2 algorithm • Its strengths • Its weaknesses • Find best fit FFT algorithm to architecture
PowerPC Architecture: 970 case www.arstechnica.com
PowerPC Architecture: 970 case • Superscalar architecture • Up to 5 instructions dispatched per cycle • Out of order execution • PowerPC SIMD instructions close to DSP • SIMD instructions for intensive tasks • 4 way single-precision floating-point vector • Efficient instructions for signal processing • Fused Multiplication Addition instruction (FMA) • Permutation instructions • Up to 16 Gflops at 2 GHz
FFT Algorithm • The choice of the FFT algorithm • Radix-2 algorithm as best candidate • Efficient • Simple • Regular • Radix-2 complexity of calculation • 5NlogN operations • logN steps • N/2 butterflies per step • 10 floating-point operations per butterfly
Our implementation • The optimization comes in three steps • The butterfly optimization • Reduce the number of operations • Or reduce the number of instructions • The SIMDization • Single-precision floating point calculations • 4 way parallelism • The pipeline • Smartly order instructions • Latency • Throughput • Dispatch process
The radix-2 butterfly • A Radix-2 butterfly: • 10 floating point operations • 4 multiplications • 6 additions • 8 instructions • 4 multiplications additions, 2 additions & 2 subtractions • 8 multiplications additions
The radix-2 butterfly • Our optimization: • New formula expression: • Only 6 fused multiplication addition instructions
SIMDization • How to use the SIMD engine from PowerPC • Find data parallelism for at least 4 data points • All the butterflies from the same step are independent each other • Choice of complex data representation • Split complex data arrays • Real and Imaginary values stored in separate arrays • Single-precision floating-point
SIMDization After bit reversing Example of 16-point FFT processing
SIMDization step (n-1) step n Independent blocks of calculation before the last 2 steps
SIMDization Each point of the 4 blocks corresponds to 4 aligned data points
SIMDization • Benefits of this solution are : • Less permutations • Data flows in right order for direct SIMD processing • No permutation until the last two steps (step n-1 and n) • Bit reversing less problematic • No bit reverse to do in a vector • Only bit reverse with vector granularity • 4 times less indexes to calculate
Pipelining • Some rules to follow in order to produce good code • Do not reuse data before the latency of previous generating instruction expires • Preserve the register resources • Superscalar behavior • Examples of possible parallel dispatch and execution • SIMD Load / SIMD FMA / Integer add • Load / SIMD permutation / SIMD FMA / Integer add • Out of order execution is not a concern
Pipelining • Tips: Group steps 2 by 2 • Butterflies are composed of: • 6 fused multiplication addition instructions • 4 load and 4 store instructions • Memory accesses can’t be hidden behind calculations • Keep results as long as possible in registers • Eliminate unnecessary memory accesses • By grouping steps by 2 • 24 fused multiplication addition instructions • 8 load and 8 store instructions
Performances and results • Theoretical performance: • Reduce more than 8 times the number of instructions • Could achieve up to 13,3GFlops at 2GHz if SIMD Floating-point pipeline kept saturated • After implementation: • 10Gflops at 2GHz for 256 and 1k points • Better performances than other optimized PowerPC FFT libraries
Performances and results • Comparison with vDSP Library on 2GHz PPC970 FX GFlops Complex Data points
Conclusions • Radix-2 FFT algorithm is a good fit for SIMD architecture with fused multiplication addition instructions • Improvements • Lack of automatic instruction ordering tool during design • Efficiency limited by the lack of registers • Solved in the Cell SPE architecture • Stockham FFT algorithm similar to radix-2 without bit reverse
Thank you • Any question ?
Back up slide 1 • Why do we choose the Radix-2 FFT Algorithm • 5 main FFT algorithm selection criteria • The butterfly expressed in multiplication addition • The parallelism capacity • The register consumption • The unrolling freedom • The pipeline behavior
Back up slide 3 • By dealing with independent butterflies • After grouping steps by 2
Back up slide 4 • Standard Bit Reversing • After SIMD implementation