Generating High-Performance General Size Linear Transform Libraries Using Spiral

Generating High-Performance General Size Linear Transform Libraries Using Spiral Yevgen Voronenko Franz Franchetti Frédéric de Mesmay Markus PüschelCarnegie Mellon University HPEC, September 2008, Lexington, MA, USA This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, and Intel TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAA

The Problem: Example DFT • Standard desktop computer • Same operations count ≈4nlog2(n) • Similar plots can be shown for all numerical problems Best code 30x 12x Numerical recipes 2

DFT Plot: Analysis Multiple threads: 2x Vector instructions: 3x Memory hierarchy: 5x • High performance library development= nightmare • Automation? 3

Idea: Textbook to Adaptive Library Textbook FFT ? “FFTW” 4

Goal: Teach Computers to Write Libraries Key technologies: • Layered domain specific language • Algorithm manipulation via rewriting • Feedback-driven search • Result: • Full automation Spiral Output: • FFTW equivalent library • For general input size • Vectorized and multithreaded • Performance competitive Input: Transform: Algorithm: Hardware: 2-way SIMD + multithreaded 5

Contribution: General Size Library Transform T Spiral or DFT of size 1024 library for DFT of any size Env_1 dft(1024); dft.compute(X, Y); dft_1024(X, Y); Fundamentally different problems 6

Beyond Fourier Transform and FFTW Cooley-Tukey FFT “Cooley-Tukey” DCT Overlap-save/add FIR Spiral Spiral Spiral “FFTW” “FCTW” “FIRW” Fast Walsh Transform Fast Wavelet Transform Fast Hartley Transform Spiral Spiral Spiral “WHTW” “FWTW” “FHTW” 7

Examples of Generated Libraries RDFT DHT DCT2 DCT3 DCT4 DFT • 2-way vectorized, 2-threaded • Most are faster than hand-written libs • Code size: 8–120 KLOC or 0.5–5 MB • Generation time: 1–3 hours Filter Wavelet Total: 300 KLOC / 13.3 MB of code generated in < 20 hours from a few simple algorithm specs 8 Intel IPP library 6.0 will include Spiral generated code

9 • Background • Library Generation • Experimental Results • Conclusions and Future Work

Linear Transforms Output vector Transform matrix Input vector 10 Mathematically: matrix-vector product Examples:

Fast Algorithms, Example: 4-point FFT • Fast algorithms = matrix factorizations • SPL = mathematical, declarative specification • Space of algorithms generated using breakdown rules 12 adds 4 mults 4 adds 1 mult 4 adds (when multiplied with input vector x) Permutation Kronecker product Fourier transform Identity 11

Examples of Breakdown Rules DFT Cooley-Tukey DCT “Cooley-Tukey” • “Teach” Spiral domain knowledge of algorithms. Never obsolete. • Each rule leads to a library 12

Parallelization / Vectorization Recursion Step Closure How Library Generation Works Transforms + Breakdown rules Library Target (FFTW, VSIPL, IPP FFT, ...) Library Structure recursion step closure as Σ-SPL formulas Library Implementation Build library plan Hot/cold partition Generate target code 14 High-performance library

Breakdown Rules to Library Code • Cooley-Tukey Fast Fourier Transform (FFT) k=4 DFT • Naive implementation voiddft(intn, cplxX[], cplxY[]){ k = choose_factor(n); m = n/k; Z = permute(X) fori=0 to k-1 dft_subvec(m, Z, Y, …) fori=0 to n-1 Y[i] = Y[i]*T[i]; fori=0 to m-1 dft_strided(k, Y, Y, …) } 2 extra functions needed 15

Breakdown Rules to Library Code • Cooley-Tukey Fast Fourier Transform (FFT) DFT • Optimized implementation • Naive implementation voiddft(intn, cplx X[], cplxY[]) { k = choose_factor(n); m = n/k; fori=0 to k-1 dft_strided2(m, X, Y, …) fori=0 to m-1 dft_strided3_scaled(k, Y, Y, T, …) } voiddft(intn, cplxX[], cplxY[]){ k = choose_factor(n); m = n/k; Z = permute(X) fori=0 to k-1 dft_subvec(m, Z, Y, …) fori=0 to n-1 Y[i] = Y[i]*T[i]; fori=0 to m-1 dft_strided(k, Y, Y, …) } 2 extra functions needed 2 extra functions needed 16 How to discover these specialized variants automatically?

Parallelization / Vectorization Recursion Step Closure Library Structure • Input: • Breakdown rules • Output: • Recursion step closure • Σ-SPL Implementation of each recursion step • Parallelization/Vectorization • Adds additional breakdown rules • Orthogonal to the closure generation Library Structure 17

Computing Recursion Step Closure • Input: transform T and a breakdown rule • Output: spawned recursion steps + Σ-SPL implementation • Algorithm: • Apply the breakdown rule • Convert to -SPL • Apply loop merging + index simplification rules. • Extract recursion steps • Repeat until closure is reached Parametrization (not shown) derives the independent parameter set for each recursion step 18

Recursion Step Closure Examples DFT (scalar) 4 mutually recursive functions - computed automatically - described using Σ-SPL formulas DCT4 (vectorized) 19 17 mutually recursive functions

Base Cases 20 . . . • Base cases are called “codelets” in FFTW • Why needed: • Closure is converted into mutually recursive functions • Recursion must be terminated • Larger base cases eliminate overhead from recursion • How many: • In FFTW 3.2: 183 codelets for complex DFT (21 types) 147 codelets for real DFT (18 types) • In our generator: # codelet types · # recursion steps • Obtained by using standard Spiral to generate fixed size code

Library Implementation • Input: • Recursion step closure • Σ-SPL implementation of each recursion step (base cases + recursions) • Output: • High-performance library • Target language: C++, Java, etc. • Process: • Build library plan • Perform hot/cold partitioning • Generate target language code Library Implementation Build library plan Hot/cold partition Generate target code 21 High-performance library

Double Precision Performance: Intel Xeon 51602-way vectorization, up to 2 threads Generated library Generated library FFTW Intel IPP Real DFT Complex DFT Generated library Generated library DCT-2 WHT 23

FIR Filter Performance2- and 4-way vectorization, up to 2 threads Generated library Generated library Intel IPP 8-tap wavelet 8-tap filter Generated library Generated library 32-tap filter 32-tap wavelet 24

2-D Transforms Performance2- or 4-way vectorization, up to 2 threads Generated library FFTW Generated library Intel IPP 2-D DFT double 2-D DFT single Generated library Generated library 2-D DCT-2 double 2-D DCT-2 single 25

Customization: Code Size 13 KLOC 3 KLOC FFTW: 150 KLOC 2 KLOC 1.3 KLOC 1 KLOC 26

Backend Customization:Java Generated library Generated library JTransforms Complex DFT Real DFT Generated library Generated library DCT-2 FIR Filter 27 Portable, but only 50% of scalar C performance

Summary FFT Spiral “FFTW” FIR Spiral “FIRW” 28 • Full automation: Textbook to adaptive library • Performance • SIMD • Multicore • Customization • Industry collaboration • Intel IPP 6.0 will include Spiral generated code

Generating High-Performance General Size Linear Transform Libraries Using Spiral

Generating High-Performance General Size Linear Transform Libraries Using Spiral

Presentation Transcript

General Linear Models

Performance of Generating Plant

General Linear Model

General Linear Model

Numerical Libraries in High Performance Computing

Build a High Performance, High Velocity, Lead-Generating B2B Machine

Linear Algebra Libraries for High-Performance Computing: Scientific Computing with Multicore and Accelerators

Linear Algebra Libraries for High-Performance Computing: Scientific Computing with Multicore and Accelerators

Accelerated Linear Algebra Libraries

GENERAL LINEAR MODELS

Generating FPGA-Accelerated DFT Libraries

High Performance Linear Transform Program Generation for the Cell BE Vas Chellappa

Automatically Generating High-Performance GPU Code with CUDA-CHiLL

High-Performance Fourier Transform Infrared Spectrometer:

Improving High Performance Sparse Libraries using Compiler-Assisted Specialization

PUBLIC LIBRARIES AND HIGH PERFORMANCE BROADBAND ...NOW WHAT?

High Performance Triplex Adder using CNTFET

Spiral: Automatic Generation of Industry Strength Performance Libraries

Optimizing the Use of High Performance Software Libraries

High-Performance Fourier Transform Infrared Spectrometer:

Forecasting Academic Performance using Multiple Linear Regression