Semi-Automatic Composition of Data Layout Transformations for Loop Vectorization
Shixiong Xu, David Gregg
University of Dublin, Trinity College
Lero@TCD
Outline
• Motivation
• Language support for data layout transformations
  • Data layout transformation pragmas
  • Composition of data layout transformations
• Data layout aware loop transformations
• Implementation and Experimental Evaluation
• Conclusion
Motivation (1/5)
• Interleaved access to data organized as an array of structures (AoS) prevents loop vectorization from exploiting the full power of SIMD
  • the performance of gather and scatter instructions is still not good enough on modern commodity processors (e.g. Intel AVX2)
  • state-of-the-art data permutation optimization handles only power-of-two strides, e.g. the data permutation optimization in GCC
  • aggressive data permutation optimization may degrade performance because of the overhead of the data permutation instructions
    • even if there were some general data permutation optimization for arbitrary strides
Motivation (2/5)
• For many scientific computing applications with data in AoS, different loops in the program often repeat the same pattern of data permutation.
  • one easy way to get rid of these repeated data permutations is to transform the layout of the data throughout the program
  • compilers face great challenges when applying automatic data layout transformations
    • Safety: automatic data layout transformations need very sophisticated whole-program data dependence and pointer aliasing analysis
    • Profitability: compilers are guided by imprecise cost models, so it is hard for them to choose the best data layout transformations
Motivation (3/5)
• For many scientific computing applications with data in AoS, different loops in the program often repeat the same pattern of data permutation.
  • It is tedious and error-prone for programmers to change their code by hand.
    • Programmers need to change both the type declarations and any code that operates on the array to be transformed.
  • To the best of our knowledge, there is no suitable way for users to specify their own data layout transformations.
    • Prior work mainly focuses on how to annotate loop transformations rather than data layout transformations, e.g. POET (CGO 11)
• Inspired by the work "Semi-automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies" (IJPP, 2006)
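To make the cost of a hand conversion concrete, here is a minimal sketch in C of the same reduction written against an AoS layout and against a structure-of-arrays (SoA) layout. The type names, the array length, and the choice of five fields per element are assumptions made for illustration; they are not taken from the benchmark code.

```c
#include <assert.h>
#include <stddef.h>

#define N 8        /* hypothetical array length */
#define NFIELDS 5  /* hypothetical number of fields per element */

/* Original layout: array of structures (AoS). */
struct cell { float f[NFIELDS]; };

/* Transformed layout: structure of arrays (SoA). */
struct cells_soa { float f[NFIELDS][N]; };

/* AoS: reading one field walks memory with a stride of NFIELDS floats. */
float sum_field_aos(const struct cell *a, size_t n, int field)
{
    assert(field >= 0 && field < NFIELDS);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i].f[field];
    return s;
}

/* SoA: the same reduction reads memory with unit stride, but both the
   parameter type and the subscript order had to change. */
float sum_field_soa(const struct cells_soa *a, size_t n, int field)
{
    assert(field >= 0 && field < NFIELDS && n <= N);
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a->f[field][i];
    return s;
}

/* Copy AoS data into the SoA layout (needed only at the program boundary
   if the layout is changed throughout the program). */
void aos_to_soa(const struct cell *src, struct cells_soa *dst, size_t n)
{
    assert(n <= N);
    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < NFIELDS; k++)
            dst->f[k][i] = src[i].f[k];
}
```

Every function that touches the array needs the same two changes (type and subscripts), which is the manual effort a layout pragma is meant to remove.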
Motivation (4/5)
• Motivating example
  • tzetar() in SP (Scalar Penta-diagonal), one of the benchmarks in the NAS Parallel Benchmarks (NPB)
* In this paper, we do not consider other cache optimizations such as array padding.
(Figure annotations: only one field is used; data access of stride 5)
Motivation (5/5)
• Possible data layout transformations and corresponding vectorization strategies
(Figure annotation: simplify vectorization)
Data layout transformation pragmas (1/3)
• array transform: a C language pragma for expressing data layout transformations on static arrays
• the array transform pragma consists of two parts:
  • array descriptor: gives a name to each array dimension
  • transform actions: represent basic data layout transformations
Data layout transformation pragmas (2/3)
• four basic data layout transformations
  • strip-mining
  • interchange
  • pad
  • peel
• the terms are borrowed from classic loop transformations
  • see the paper for the detailed semantics of these transformations
• classified into two kinds:
  • pre-actions
  • post-actions:
    • array peel: splits an array dimension for the purpose of alignment, or to make the array dimension size a power of two
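The four actions can be read as changes to how a logical index maps to a storage location. The sketch below shows one plausible reading in C, by analogy with the loop transformations of the same names. The function and type names are hypothetical, and the exact semantics defined in the paper may differ from these assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset of element (i, j) in an array with `cols` columns. */
size_t offset2d(size_t i, size_t j, size_t cols)
{
    assert(j < cols);
    return i * cols + j;
}

/* Strip-mining (assumed semantics): one dimension is split into an outer
   tile index and an inner offset within the tile. */
typedef struct { size_t outer, inner; } split_index;

split_index strip_mine(size_t i, size_t tile)
{
    assert(tile > 0);
    return (split_index){ i / tile, i % tile };
}

/* Interchange (assumed semantics): A[i][j] of shape rows x cols is stored
   as A'[j][i] of shape cols x rows. */
size_t offset_interchanged(size_t i, size_t j, size_t rows)
{
    assert(i < rows);
    return j * rows + i;
}

/* Pad (assumed semantics): the row length grows by `pad` unused elements,
   which changes the offset but not the logical index. */
size_t offset_padded(size_t i, size_t j, size_t cols, size_t pad)
{
    assert(j < cols);
    return i * (cols + pad) + j;
}

/* Peel (assumed semantics): a dimension is split at `cut` into two parts,
   each indexed from zero. */
typedef struct { int part; size_t index; } peeled_index;

peeled_index peel(size_t i, size_t cut)
{
    if (i < cut)
        return (peeled_index){ 0, i };
    return (peeled_index){ 1, i - cut };
}
```

Under this reading, only strip-mining and peel change the number of subscripts or arrays; interchange and pad keep the subscripts and change the address computation.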
Data layout transformation pragmas (3/3)
• Syntax of the array transform pragma
Composition of Data Layout Transformations (1/2)
• Array permutation: a sequence of array interchanges.
• Rectangular array tiling: a sequence of array strip-mining and array interchange steps.
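As a sketch of how such a composition might work, the C function below computes the storage offset of a two-dimensional array element after rectangular tiling, built from two strip-mining steps followed by one interchange. The function name and parameters are hypothetical, and the code assumes the column count is a multiple of the tile width.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of logical element A[i][j] after rectangular tiling with tiles of
   tr rows by tc columns.
   Step 1, strip-mine both dimensions:  A[i][j] -> A1[io][ii][jo][ji]
   Step 2, interchange ii and jo:       A1      -> A2[io][jo][ii][ji]
   Each tile is then stored contiguously in row-major order. */
size_t tiled_offset(size_t i, size_t j, size_t cols, size_t tr, size_t tc)
{
    assert(tr > 0 && tc > 0);
    assert(cols % tc == 0);  /* assumption: no partial tiles per row */
    assert(j < cols);

    size_t io = i / tr, ii = i % tr;  /* strip-mine the row dimension */
    size_t jo = j / tc, ji = j % tc;  /* strip-mine the column dimension */
    size_t tiles_per_row = cols / tc;

    /* Row-major linearization of the interchanged index [io][jo][ii][ji]. */
    return ((io * tiles_per_row + jo) * tr + ii) * tc + ji;
}
```

If the row count is also a multiple of `tr`, the mapping is a permutation of the original offsets, so the tiled array needs no extra storage.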
Composition of Data Layout Transformations (2/2)
• Motivating example:
Data layout aware loop transformations (1/3)
• Data layout transformations may change the code into a form that is not amenable to loop vectorization.
  • array strip-mining introduces modulus operations to compute offsets within the resulting tiles
  • these operations hinder the loop vectorizer from detecting possibly contiguous memory accesses
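The problem can be seen in a small C sketch. It assumes a hypothetical layout in which the element dimension of a five-field AoS array is strip-mined by a tile size of 4 and then interchanged with the field dimension, so that A[i][k] becomes A'[i / TILE][k][i % TILE]. The constants and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

#define NFIELDS 5  /* hypothetical number of fields per element */
#define TILE 4     /* hypothetical tile size, e.g. one vector of floats */

/* Offset of field k of element i in the tiled layout
   A'[i / TILE][k][i % TILE]. */
size_t tiled_field_offset(size_t i, size_t k)
{
    assert(k < NFIELDS);
    return ((i / TILE) * NFIELDS + k) * TILE + (i % TILE);
}

/* Straightforward rewrite of a loop over one field: every iteration
   carries a division and a modulus in its address computation, so the
   access no longer looks like a simple unit-stride pattern. */
float sum_field_naive(const float *a, size_t n, size_t k)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[tiled_field_offset(i, k)];
    return s;
}
```

Within one tile the accesses are in fact contiguous, but that is hidden behind the `i / TILE` and `i % TILE` terms unless the loop itself is restructured.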
Data layout aware loop transformations (2/3)
• Solution: data layout aware loop strip-mining
  • core idea: apply loop peeling and loop strip-mining according to the boundaries of the data tiles produced by array strip-mining
  • kills two birds with one stone:
    • eliminates part of the modulus operations
    • improves data alignment
      • data accesses starting at tile boundaries are possibly aligned
Data layout aware loop transformations (3/3)
• Solution: data layout aware loop strip-mining
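A minimal C sketch of the idea follows, assuming a hypothetical tiled layout A'[i / TILE][k][i % TILE] with five fields and a tile size of 4. The loop is peeled up to the first tile boundary and then strip-mined by the tile size, so the inner loop needs no modulus. This is an illustration under those assumptions, not the code generated by the authors' compiler pass.

```c
#include <assert.h>
#include <stddef.h>

#define NFIELDS 5  /* hypothetical number of fields per element */
#define TILE 4     /* hypothetical tile size used by array strip-mining */

/* Offset of field k of element i in the layout A'[i / TILE][k][i % TILE]. */
static size_t tiled_field_offset(size_t i, size_t k)
{
    return ((i / TILE) * NFIELDS + k) * TILE + (i % TILE);
}

/* Reference version: modulus and division in every iteration. */
float sum_field_reference(const float *a, size_t lo, size_t hi, size_t k)
{
    float s = 0.0f;
    for (size_t i = lo; i < hi; i++)
        s += a[tiled_field_offset(i, k)];
    return s;
}

/* Layout-aware version over the element range [lo, hi). */
float sum_field_tiled(const float *a, size_t lo, size_t hi, size_t k)
{
    assert(lo <= hi && k < NFIELDS);
    float s = 0.0f;
    size_t i = lo;

    /* Peeled prologue: scalar iterations up to the first tile boundary. */
    for (; i < hi && i % TILE != 0; i++)
        s += a[tiled_field_offset(i, k)];

    /* Strip-mined body: i is a multiple of TILE here, so each inner loop
       reads TILE contiguous floats starting at a tile boundary. */
    for (; i + TILE <= hi; i += TILE) {
        const float *tile = a + ((i / TILE) * NFIELDS + k) * TILE;
        for (size_t ii = 0; ii < TILE; ii++)
            s += tile[ii];
    }

    /* Epilogue: leftover elements in a partial last tile. */
    for (; i < hi; i++)
        s += a[tiled_field_offset(i, k)];
    return s;
}
```

The inner loop over `ii` is a plain unit-stride access, and if the base array is suitably aligned and a tile spans a whole vector, each `tile` pointer is aligned as well.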
Implementation and Experimental Evaluation (1/5)
• Implementation
  • The approach is implemented in the Cetus source-to-source compiler.
  • Array pragmas are collected and processed in the pragma parsing phase of the Cetus compiler.
  • * An optional pre-processing pass applies loop unrolling and constant propagation.
    • It may be required by array peeling.
  • Array transformations are performed as a transformation pass in the Cetus compiler.
    • The high-level internal representation in Cetus simplifies the processing of array transformations.
Implementation and Experimental Evaluation (2/5)
• Experimental Evaluation
  • A case study of data layout tuning for loop vectorization
  • uses SP from the NAS Parallel Benchmarks with the Class A data set, which has a size of 64 × 64 × 64 with 400 iterations
  • Intel C compiler 13.1.3 as the native C compiler
Note that:
• as in the motivating example, we do not consider other cache optimizations, e.g. array padding.
• we focus only on vectorization performance on a single core.
Implementation and Experimental Evaluation (3/5)
• Performance of the Motivating Example
  • the performance improvement from data layout transformation on tzetar() is much more significant with single precision than with double precision.
Implementation and Experimental Evaluation (4/5)
• Performance of SP
(Figure annotation: 1.8X)
Implementation and Experimental Evaluation (5/5)
• Performance breakdown of SP
Conclusion
• We put forward a new C language pragma to allow programmers to specify a sequence of data layout transformations. This language annotation serves as a script to control data layout transformations and thus can be integrated into a performance auto-tuning framework as an extra tuning dimension.
• We implemented our proposed data layout transformation pragma in the Cetus source-to-source compiler. To reduce the overhead of address computation and help vectorization, we introduce data layout aware loop transformations along with the data layout transformations.
• Manual tuning of data layout transformations on the SP in the NAS Parallel Benchmarks shows that with proper data layout transformations, significant performance improvements are possible from better vectorization.
Q&A