280 likes | 408 Views
Generating High-Performance General Size Linear Transform Libraries Using Spiral. Yevgen Voronenko Franz Franchetti Frédéric de Mesmay Markus Püschel Carnegie Mellon University. HPEC, September 2008, Lexington, MA, USA.
E N D
Generating High-Performance General Size Linear Transform Libraries Using Spiral Yevgen Voronenko Franz Franchetti Frédéric de Mesmay Markus PüschelCarnegie Mellon University HPEC, September 2008, Lexington, MA, USA This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAA
The Problem: Example DFT • Standard desktop computer • Same operations count ≈4nlog2(n) • Similar plots can be shown for all numerical problems Best code 30x 12x Numerical recipes 2
DFT Plot: Analysis Multiple threads: 2x Vector instructions: 3x Memory hierarchy: 5x • High performance library development= nightmare • Automation? 3
Idea: Textbook to Adaptive Library Textbook FFT ? “FFTW” 4
Goal: Teach Computers to Write Libraries Key technologies: • Layered domain specific language • Algorithm manipulation via rewriting • Feedback-driven search • Result: • Full automation Spiral Output: • FFTW equivalent library • For general input size • Vectorized and multithreaded • Performance competitive Input: Transform: Algorithm: Hardware: 2-way SIMD + multithreaded 5
Contribution: General Size Library Transform T Spiral or DFT of size 1024 library for DFT of any size Env_1 dft(1024); dft.compute(X, Y); dft_1024(X, Y); Fundamentally different problems 6
Beyond Fourier Transform and FFTW Cooley-Tukey FFT “Cooley-Tukey” DCT Overlap-save/add FIR Spiral Spiral Spiral “FFTW” “FCTW” “FIRW” Fast Walsh Transform Fast Wavelet Transform Fast Hartley Transform Spiral Spiral Spiral “WHTW” “FWTW” “FHTW” 7
Examples of Generated Libraries RDFT DHT DCT2 DCT3 DCT4 DFT • 2-way vectorized, 2-threaded • Most are faster than hand-written libs • Code size: 8–120 KLOC or 0.5–5 MB • Generation time: 1–3 hours Filter Wavelet Total: 300 KLOC / 13.3 MB of code generated in < 20 hours from a few simple algorithm specs 8 Intel IPP library 6.0 will include Spiral generated code
9 • Background • Library Generation • Experimental Results • Conclusions and Future Work
Linear Transforms Output vector Transform matrix Input vector 10 Mathematically: matrix-vector product Examples:
Fast Algorithms, Example: 4-point FFT • Fast algorithms = matrix factorizations • SPL = mathematical, declarative specification • Space of algorithms generated using breakdown rules 12 adds 4 mults 4 adds 1 mult 4 adds (when multiplied with input vector x) Permutation Kronecker product Fourier transform Identity 11
Examples of Breakdown Rules DFT Cooley-Tukey DCT “Cooley-Tukey” • “Teach” Spiral domain knowledge of algorithms. Never obsolete. • Each rule leads to a library 12
13 • Background • Library Generation • Experimental Results • Conclusions and Future Work
Parallelization / Vectorization Recursion Step Closure How Library Generation Works Transforms + Breakdown rules Library Target (FFTW, VSIPL, IPP FFT, ...) Library Structure recursion step closure as Σ-SPL formulas Library Implementation Build library plan Hot/cold partition Generate target code 14 High-performance library
Breakdown Rules to Library Code • Cooley-Tukey Fast Fourier Transform (FFT) k=4 DFT • Naive implementation voiddft(intn, cplxX[], cplxY[]){ k = choose_factor(n); m = n/k; Z = permute(X) fori=0 to k-1 dft_subvec(m, Z, Y, …) fori=0 to n-1 Y[i] = Y[i]*T[i]; fori=0 to m-1 dft_strided(k, Y, Y, …) } 2 extra functions needed 15
Breakdown Rules to Library Code • Cooley-Tukey Fast Fourier Transform (FFT) DFT • Optimized implementation • Naive implementation voiddft(intn, cplx X[], cplxY[]) { k = choose_factor(n); m = n/k; fori=0 to k-1 dft_strided2(m, X, Y, …) fori=0 to m-1 dft_strided3_scaled(k, Y, Y, T, …) } voiddft(intn, cplxX[], cplxY[]){ k = choose_factor(n); m = n/k; Z = permute(X) fori=0 to k-1 dft_subvec(m, Z, Y, …) fori=0 to n-1 Y[i] = Y[i]*T[i]; fori=0 to m-1 dft_strided(k, Y, Y, …) } 2 extra functions needed 2 extra functions needed 16 How to discover these specialized variants automatically?
Parallelization / Vectorization Recursion Step Closure Library Structure • Input: • Breakdown rules • Output: • Recursion step closure • Σ-SPL Implementation of each recursion step • Parallelization/Vectorization • Adds additional breakdown rules • Orthogonal to the closure generation Library Structure 17
Computing Recursion Step Closure • Input: transform T and a breakdown rule • Output: spawned recursion steps + Σ-SPL implementation • Algorithm: • Apply the breakdown rule • Convert to -SPL • Apply loop merging + index simplification rules. • Extract recursion steps • Repeat until closure is reached Parametrization (not shown) derives the independent parameter set for each recursion step 18
Recursion Step Closure Examples DFT (scalar) 4 mutually recursive functions - computed automatically - described using Σ-SPL formulas DCT4 (vectorized) 19 17 mutually recursive functions
Base Cases 20 . . . • Base cases are called “codelets” in FFTW • Why needed: • Closure is converted into mutually recursive functions • Recursion must be terminated • Larger base cases eliminate overhead from recursion • How many: • In FFTW 3.2: 183 codelets for complex DFT (21 types) 147 codelets for real DFT (18 types) • In our generator: # codelet types · # recursion steps • Obtained by using standard Spiral to generate fixed size code
Library Implementation • Input: • Recursion step closure • Σ-SPL implementation of each recursion step (base cases + recursions) • Output: • High-performance library • Target language: C++, Java, etc. • Process: • Build library plan • Perform hot/cold partitioning • Generate target language code Library Implementation Build library plan Hot/cold partition Generate target code 21 High-performance library
22 • Background • Library Generation • Experimental Results • Conclusions and Future Work
Double Precision Performance: Intel Xeon 51602-way vectorization, up to 2 threads Generated library Generated library FFTW Intel IPP Real DFT Complex DFT Generated library Generated library DCT-2 WHT 23
FIR Filter Performance2- and 4-way vectorization, up to 2 threads Generated library Generated library Intel IPP 8-tap wavelet 8-tap filter Generated library Generated library 32-tap filter 32-tap wavelet 24
2-D Transforms Performance2- or 4-way vectorization, up to 2 threads Generated library FFTW Generated library Intel IPP 2-D DFT double 2-D DFT single Generated library Generated library 2-D DCT-2 double 2-D DCT-2 single 25
Customization: Code Size 13 KLOC 3 KLOC FFTW: 150 KLOC 2 KLOC 1.3 KLOC 1 KLOC 26
Backend Customization:Java Generated library Generated library JTransforms Complex DFT Real DFT Generated library Generated library DCT-2 FIR Filter 27 Portable, but only 50% of scalar C performance
Summary FFT Spiral “FFTW” FIR Spiral “FIRW” 28 • Full automation: Textbook to adaptive library • Performance • SIMD • Multicore • Customization • Industry collaboration • Intel IPP 6.0 will include Spiral generated code