Polyhedral Code Generation In The Real World

Nicolas VASILACHE, Cédric BASTOUL, Albert COHEN

Outline
• Introduction
• Affine schedules
• Formal General Form
• Contributions
• Focus on Modulo Conditional Removal (speed & quality)
• Experimental Results

Introduction – Polyhedral Model
• Powerful expressiveness for high-level transformations (parallelism, locality)
• Can express any composition of the usual loop transformations [Pugh91]
• Compact representation of all legal transformations [Feautrier90]
• Code generation was the weakest link [Griebl et al. 98]
• Until the recent algorithm of [Quilleré00] → without transformations
• However, still problematic on long, parametric sequences on SPEC benchmarks

Introduction – Transformations
WHY TRANSFORM ???
• Swim (SPECFP2000) [ICS05]: ~30 polyhedral loop transformations, 40% speedup w.r.t. best peak performance on AMD64
• Cholesky factorization: 6 statements, optimal allocation functions [McKin92]
• Huge code generation times (e.g. full Swim: ~421 → 2267 lines, 20 min / 300 MB)
• In the context of complex transformations
• Goal: generation time comparable to the back end (BE) of a real compiler (EKOPath)

Introduction – Context & Notations
• Code generation: syntactic loops from matrix representation

Affine Schedules

Affine Schedule – Trivial Example
Schedule: (i,j) → (t1=i, t2=j), i.e. (t1, t2)^T = [1 0; 0 1] · (i, j)^T
Original loops:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++)
        S(i,j);
Generated loops:
    for(t1=1;t1<=3;t1++)
      for(t2=1;t2<=3;t2++)
        S(i=t1,j=t2);
[Figure: domain grid (i,j) and time grid (t1,t2), both [1,3]x[1,3], with instances S(1,1) to S(3,3)]
• Bijection between domain and time iterations
• Time iterations determine the generated loops (nesting, bounds)
• Execution follows the lexicographic order on time dimensions
• Domain values touched by the statement: i=t1, j=t2

Affine Schedule – Loop Interchange
Schedule: (i,j) → (t1=j, t2=i), i.e. (t1, t2)^T = [0 1; 1 0] · (i, j)^T
Original loops:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++)
        S(i,j);
Generated loops:
    for(t1=1;t1<=3;t1++)
      for(t2=1;t2<=3;t2++)
        S(i=t2,j=t1);
[Figure: domain grid (i,j) and time grid (t1,t2), both [1,3]x[1,3], with instances S(1,1) to S(3,3)]
• Another bijection between domain and time iterations
• New bounds computation
• Lexicographic order on time dimensions
• Domain values touched by the statement: i=t2, j=t1
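The interchange example can be checked mechanically. The sketch below scans the time space in lexicographic order on (t1, t2) and maps each time point back to the domain through i = t2, j = t1. The function name scan_interchanged and the output array layout are illustrative assumptions, not part of any tool named in the slides.

```c
#include <assert.h>

/* Scan the time space of the interchange schedule (i,j) -> (t1=j, t2=i)
   over the domain [1,n] x [1,n], in lexicographic order on (t1, t2).
   Each executed instance S(i,j) is appended to out as the pair {i, j}.
   out must have room for n*n pairs. Returns the number of instances. */
int scan_interchanged(int n, int out[][2])
{
    assert(n >= 1);
    int count = 0;
    for (int t1 = 1; t1 <= n; t1++) {
        for (int t2 = 1; t2 <= n; t2++) {
            out[count][0] = t2;  /* i = t2 */
            out[count][1] = t1;  /* j = t1 */
            count++;
        }
    }
    return count;
}
```

With n = 3 the instances come out column by column, S(1,1), S(2,1), S(3,1), S(1,2), and so on, which is the order of the original nest with its two loops swapped.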

Affine Schedule – Parallel Wavefronts
Schedule: (i,j) → (t1=i+j), i.e. t1 = [1 1] · (i, j)^T
Original loops:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++)
        S(i,j);
Generated loops:
    for(t1=2;t1<=6;t1++)
      DOALL{(i,j)|i+j==t1}
        S(i,j);
[Figure: domain grid (i,j) over [1,3]x[1,3] cut into diagonal wavefronts; time axis t1 from 2 to 6]
• NOT a bijection (just a surjection)
• New bounds computation (t1: [2,6])
• Domain values touched by the statement: {(i,j)|i+j==t1}
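As a sketch of what the DOALL set looks like once it is turned into a loop, the code below enumerates each wavefront sequentially along i. The bounds on i follow from 1 <= i <= n and 1 <= j = t1 - i <= n. The name scan_wavefronts and the sizes array are assumptions made for this illustration.

```c
#include <assert.h>

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Wavefront schedule (i,j) -> t1 = i + j on the domain [1,n] x [1,n].
   All instances with the same t1 are independent (the DOALL of the slide);
   here they are simply enumerated along i.
   sizes[t1] receives the number of instances on wavefront t1, for
   t1 in [2, 2n], so sizes needs at least 2n+1 entries.
   Returns the total number of instances. */
int scan_wavefronts(int n, int sizes[])
{
    assert(n >= 1);
    int total = 0;
    for (int t1 = 2; t1 <= 2 * n; t1++) {
        sizes[t1] = 0;
        for (int i = imax(1, t1 - n); i <= imin(n, t1 - 1); i++) {
            int j = t1 - i;            /* domain point recovered from time */
            assert(j >= 1 && j <= n);  /* guaranteed by the bounds on i */
            sizes[t1]++;
            total++;
        }
    }
    return total;
}
```

For n = 3 the wavefront sizes are 1, 2, 3, 2, 1 for t1 = 2 to 6, which sums to the 9 points of the domain.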

Affine Schedule – Statement Shifting
Schedules:
    S1: (i,j) → (t1=i, t2=j), i.e. (t1, t2)^T = [1 0; 0 1] · (i, j)^T + (0, 0)^T
    S2: (i,j) → (t1=i+1, t2=j), i.e. (t1, t2)^T = [1 0; 0 1] · (i, j)^T + (1, 0)^T
Original loops:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++) {
        S1(i,j);
        S2(i,j);
      }
Generated loops:
    for(t2=1;t2<=3;t2++)            (prologue P)
      S1(i=1,j=t2);
    for(t1=2;t1<=3;t1++)            (kernel K)
      for(t2=1;t2<=3;t2++) {
        S1(i=t1,j=t2);
        S2(i=t1-1,j=t2);
      }
    for(t2=1;t2<=3;t2++)            (epilogue E)
      S2(i=4-1,j=t2);
[Figure: domain grid [1,3]x[1,3] and time grid with t1 from 1 to 4, split into P, K and E]
• New bounds computation: the time domains (S1: [1,3]x[1,3], S2: [2,4]x[1,3]) have disjoint parts
• Separation phase needed on each time dimension (3^nb_stmt worst-case complexity)
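The prologue / kernel / epilogue structure of the shifted code can be replayed for an n x n domain. The sketch below logs every executed instance so the order can be inspected. The Instance type and the name run_shifted are hypothetical, chosen only for this example.

```c
#include <assert.h>

typedef struct { int stmt, i, j; } Instance;  /* stmt is 1 for S1, 2 for S2 */

/* Replay the separated code for S1: (t1=i, t2=j) and S2: (t1=i+1, t2=j)
   on the domain [1,n] x [1,n]. Every executed instance is appended to log,
   which must have room for 2*n*n entries. Returns the number of entries. */
int run_shifted(int n, Instance *log)
{
    assert(n >= 1);
    int k = 0;
    /* Prologue (P): t1 = 1, only S1 is live. */
    for (int t2 = 1; t2 <= n; t2++)
        log[k++] = (Instance){1, 1, t2};
    /* Kernel (K): t1 in [2, n], both statements are live. */
    for (int t1 = 2; t1 <= n; t1++) {
        for (int t2 = 1; t2 <= n; t2++) {
            log[k++] = (Instance){1, t1, t2};      /* S1(i=t1,   j=t2) */
            log[k++] = (Instance){2, t1 - 1, t2};  /* S2(i=t1-1, j=t2) */
        }
    }
    /* Epilogue (E): t1 = n + 1, only S2 is live. */
    for (int t2 = 1; t2 <= n; t2++)
        log[k++] = (Instance){2, n, t2};
    return k;
}
```

For n = 3 this executes 18 instances: 3 in the prologue, 12 in the kernel and 3 in the epilogue.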

General Case
• Schedules: Z^mi → Z^ni for each statement Si
• Schedules associate a logical time to each iteration domain point
• Time value sets need to be separated → scattering functions
[Figure: scattering matrix split into a time part (time iterators) and a domain part (domain iterators)]
• Time part used for separation and ordering (PolyLib computations, 2^dim [Wilde93])
• Domain part determines the values spanned by the time dimensions
• Quilleré separation phase [Quilleré00, Bastoul04]

Quilleré separation phase

Separation Principles
• Consider statements with domain and schedule functions such that:
  • S1 has time dimensions (t1, t2, t3) spanning [2,5]x[2,7]x[5,8]
  • S2 has time dimensions (t1, t2) spanning [0,3]x[5,7]
  • S3 has time dimensions (t1, t2) spanning [-2,6]x[5,9]
[Figure: considering t1, the worklist interval [2,5] (S1) is combined with [0,3] (S2) by polyhedral intersection / difference (2^dim), giving [0,1], [2,3] and [4,5]]

Separation Principles
• Same statements S1, S2 and S3 as on the previous slide
[Figure: considering t1, the worklist pieces [0,1], [2,3] and [4,5] are combined with [-2,6] (S3) by polyhedral intersection / difference (2^dim each; the kernel has 3^nb_stmt worst-case complexity), giving among others [-2,-1], [0,1] and [2,3]]

Separation Principles
• Same statements S1, S2 and S3 as on the previous slides
Considering t1:
    for(t1=-2;t1<=-1;t1++)
      for(t2=5;t2<=9;t2++)
        S3(…);
    for(t1=0;t1<=1;t1++) {
      for(t2=5;t2<=7;t2++) {
        S2(…);
        S3(…);
      }
      for(t2=8;t2<=9;t2++)
        S3(…);
    }
    ...
• That was for the first time dimension
• Recursively for all time dimensions
• Result is a syntax tree of the generated loops
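The t1 step of this example can be reproduced with a one-dimensional stand-in for the polyhedral operations. The sketch below is a simplification: it sweeps over sorted interval endpoints instead of performing the pairwise polyhedral intersection / difference of the Quilleré algorithm, and it handles a single dimension only. The Interval and Piece types and the name separate_1d are assumptions introduced for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int lo, hi; } Interval;              /* closed interval [lo, hi] */
typedef struct { int lo, hi; unsigned stmts; } Piece; /* bit s set: statement s is live */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Split the union of n closed intervals (one per statement, n <= 16) into
   disjoint pieces, sorted by lower bound, on which the set of live
   statements is constant. Returns the number of pieces written to out. */
int separate_1d(const Interval *dom, int n, Piece *out, int max_out)
{
    assert(n > 0 && n <= 16);
    int cuts[32];
    int ncuts = 0;
    for (int s = 0; s < n; s++) {
        assert(dom[s].lo <= dom[s].hi);
        cuts[ncuts++] = dom[s].lo;      /* statement s becomes live here */
        cuts[ncuts++] = dom[s].hi + 1;  /* statement s is dead from here on */
    }
    qsort(cuts, (size_t)ncuts, sizeof cuts[0], cmp_int);

    int npieces = 0;
    for (int k = 0; k + 1 < ncuts; k++) {
        if (cuts[k] == cuts[k + 1])
            continue;                   /* duplicate cut point */
        unsigned mask = 0;
        for (int s = 0; s < n; s++)
            if (dom[s].lo <= cuts[k] && cuts[k] <= dom[s].hi)
                mask |= 1u << s;
        if (mask == 0)
            continue;                   /* gap where no statement is live */
        assert(npieces < max_out);
        out[npieces++] = (Piece){ cuts[k], cuts[k + 1] - 1, mask };
    }
    return npieces;
}
```

Fed with the t1 ranges [2,5], [0,3] and [-2,6] of S1, S2 and S3, it yields [-2,-1] for S3 alone, [0,1] for S2 and S3, [2,3] for all three, [4,5] for S1 and S3, and [6,6] for S3 alone; the first two pieces match the first two loops on t1 shown on the slide.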

Contributions
State-of-the-art polyhedral code generator → CLooG [Bastoul04]
ALL PERFORMANCE COMPARISONS WILL BE CLooG vs URGenT
Real World Issues
• Problems provided by different sources (academia, industry, SPECFP2000)
• Exhibit different challenging issues
Code Generation Speed
• Node fusion (exploiting transformations’ “locality”)
• Exploiting scalar dimensions (replacing exponential computations with trivial ones)
• Domain iterator mapping improvement (replacing exponential computations by matrix inversions)
Code Quality
• Faster If-Hoisting yielding much smaller code (conditional factorization)
• Modulo Conditional removal by strip-mining (stride issue) (detailed)

Generation Speed Improvements

Generation Speed – Node Fusion
• Multidimensional schedules allow the expression of non-affine (polynomial) quantities as affine ones with more dimensions → improved flexibility
• Drawback → pressure on code generation (height of the tree)
• Add parameters, add dimensions (polyhedral operation complexity)
HOWEVER
• Loop-level transformations affect blocks of statements (tiling, interchange…)
• Polyhedron inclusion check is NOT exponential
Before each separation phase, fuse consecutive nodes with equal scattering polyhedra.

Generation Speed – Scalar Dimensions
• Some multidimensional schedules have scalar dimensions (UTF, URUK [ICS05])
• Scalar dimensions express strict statement interleaving
• Comparison of integers, no need for polyhedral separation
• Syntactic tree height reduction (potentially half the height)
• Marginal overhead for detection and computation
• Combines well with Node Fusion

Generation Speed – Domain Iterator Regen.
• Generation of sequential loops for non-invertible schedules (wavefronts)
• CLooG [Bas04] handles it with polyhedral projection on domain iterators
• Drawback → adds dimensions (polyhedral operation complexity)
• Drawback → additional polyhedral computations on each leaf of the syntax tree (ST) after the Quilleré separation phase (3^nb_stmts)
Use transformation invertibility (ideally, given the rank, a mix of projections and invertibility)

Code Quality Improvements

Code Quality – If Hoisting
• Quilleré separation phase leaves conditionals on triangular loops
• Need for the so-called backtracking phase → too aggressive (code bloat)
• Potentially tremendous amount of useless work
[Figure: backtracking illustration (code bloat, useless work) vs. if-hoisting illustration (smaller code, no useless work; simplification IS needed), on a loop nest for t1 / for t2 / for t3 with conditionals t1 <= 4, 5 <= t1 <= 10 and 11 <= t1]
• Explains the generation speedup on dreamupT3

Code Quality – If Hoisting
• Previous example doesn’t take place in real life (just an illustration)
• Matrix Mult. with URUK:
  • strip-mine by factor 4 (x3)
  • interchange loops (x2)
  • unroll
• Backtrack + 50%

Removing Modulos – Domain Iterator Regen.
• Let T be the transformation function for a statement
• Suppose T is invertible, and let D be the matrix of denominators of T^-1
• Let ISM = D · T^-1 be the Inverse Scatter Matrix
[Figure: scattering matrix with its time iterators part and its domain iterators part]
• The Inverse Scatter Matrix expresses domain iterators from time iterators
• D ensures all coefficients are integral
• Replaces leaf polyhedral projections by matrix inversions
• Substitute for the usual Hermite Normal Form in stride computations
• Problem since 91: [Irigoin91], [Pingali92], [Ramanujam95], [Xue96], [Griebl98] and others …

Removing Modulos – Inverse Scatter Matrix
• Consider S with 2 domain iterators and schedule (t1, t2) = (2i+3j, i), i.e. T = [2 3; 1 0]
• We have T^-1 = [0 1; 1/3 -2/3], D = [1 0; 0 3] and ISM = D · T^-1 = [0 1; 1 -2] (INTEGRAL)
• Meaning: i = t2, 3*j = t1-2*t2
Original loops:
    for(i=1;i<=M;i++)
      for(j=1;j<=N;j++)
        S(i,j);
Generated loops with a modulo guard in the kernel (OUCH !!!), where ceild and floord denote ceiling and floor division:
    for(t1=5;t1<=2*M+3*N;t1++)
      for(t2=max(1,ceild(t1-3*N,2));t2<=min(M,floord(t1-3,2));t2++)
        if((t1-2*t2)%3 == 0) S(i=t2,j=(t1-2*t2)/3);
SM & unroll t2 by (3 / gcd(2,3)):
    for(t1=5;t1<=2*M+3*N;t1++)
      for(t2=?;t2<=?;t2++)
        if(t1%3 == 0) S(i=t2,j=t1/3-2*k);         (t2 = 3k)
        if(t1%3 == 2) S(i=t2,j=(t1-2)/3-2*k);     (t2 = 3k+1)
        if(t1%3 == 1) S(i=t2,j=(t1-1)/3-2*k-1);   (t2 = 3k+2)
SM & unroll t1 by (3 / gcd(1,3)):
    for(t1=?;t1<=?;t1++)
      for(t2=?;t2<=?;t2++)
        S(i=t2,j=l-2*k);     (t1 = 3l,   t2 = 3k)
        S(i=t2,j=l-2*k-1);   (t1 = 3l+1, t2 = 3k+2)
        S(i=t2,j=l-2*k);     (t1 = 3l+2, t2 = 3k+1)
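To check that removing the modulo guard preserves the set of executed instances, the sketch below compares two scans of the schedule t1 = 2i+3j, t2 = i. The first keeps the guard inside the inner loop. The second is a simplified variant of the strip-mine and unroll idea: rather than emitting three unrolled bodies, it computes once per t1 the only residue class of t2 modulo 3 that is live, then walks that class with t2 = 3k + r. The t2 bounds are derived here from 1 <= j <= N; all helper names (floord, ceild, scan_guarded, scan_modulo_free, both_scans_cover_domain_once) are assumptions of this example and not the code produced by CLooG or URGenT.

```c
#include <assert.h>
#include <string.h>

#define MAXD 32  /* M and N must stay below this bound */

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }
/* Floor and ceiling division, valid for b > 0 and any sign of a. */
static int floord(int a, int b) { return a >= 0 ? a / b : -((-a + b - 1) / b); }
static int ceild(int a, int b)  { return a >= 0 ? (a + b - 1) / b : -((-a) / b); }

/* Scan with the modulo guard in the innermost loop. */
int scan_guarded(int M, int N, int visits[MAXD][MAXD])
{
    int count = 0;
    for (int t1 = 5; t1 <= 2 * M + 3 * N; t1++) {
        int lb = imax(1, ceild(t1 - 3 * N, 2));   /* from j <= N */
        int ub = imin(M, floord(t1 - 3, 2));      /* from j >= 1 */
        for (int t2 = lb; t2 <= ub; t2++)
            if ((t1 - 2 * t2) % 3 == 0) {
                visits[t2][(t1 - 2 * t2) / 3]++;  /* S(i=t2, j=(t1-2*t2)/3) */
                count++;
            }
    }
    return count;
}

/* Same scan with no modulo test in the innermost loop. */
int scan_modulo_free(int M, int N, int visits[MAXD][MAXD])
{
    int count = 0;
    for (int t1 = 5; t1 <= 2 * M + 3 * N; t1++) {
        int lb = imax(1, ceild(t1 - 3 * N, 2));
        int ub = imin(M, floord(t1 - 3, 2));
        int r = (2 * t1) % 3;  /* t1 - 2*t2 is a multiple of 3 iff t2 % 3 == r */
        for (int k = ceild(lb - r, 3); 3 * k + r <= ub; k++) {
            int t2 = 3 * k + r;
            visits[t2][(t1 - 2 * t2) / 3]++;
            count++;
        }
    }
    return count;
}

/* Returns 1 when both scans execute every (i,j) of [1,M] x [1,N] exactly once. */
int both_scans_cover_domain_once(int M, int N)
{
    assert(M >= 1 && N >= 1 && M < MAXD && N < MAXD);
    int a[MAXD][MAXD], b[MAXD][MAXD];
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    if (scan_guarded(M, N, a) != M * N || scan_modulo_free(M, N, b) != M * N)
        return 0;
    for (int i = 1; i <= M; i++)
        for (int j = 1; j <= N; j++)
            if (a[i][j] != 1 || b[i][j] != 1)
                return 0;
    return 1;
}
```

The residue r = (2*t1) % 3 reproduces the pairing shown on the slide: t1 % 3 == 0 goes with t2 = 3k, t1 % 3 == 2 with t2 = 3k+1, and t1 % 3 == 1 with t2 = 3k+2.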

Removing Modulos – There is a CATCH
• Previous example flowed nicely → what about the loops’ bounds ???
• “Issue” (feature) with our SM + unroll transformation (strip-mine NOT strided)
• Modulos are indeed removed from the kernels only
    for(i=M;i<=N;i++)
      S(i,j);
  V.S.
    for(i=M;i<=N;i+=2)
      for(ii=i;ii<=min(i+1,N);ii++)
        S(ii,j);
[Figure: code size split into prologue (P), kernel (K) and epilogue (E)]
• HOWEVER: P and E have marginal execution time when the SM factor is “decent”
• PROLOGUE gives us ALIGNMENT on %2 (strip-mine factor) !!!
• Transformation quality issue
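The prologue / kernel / epilogue split for a strip-mine factor of 2 can be sketched as follows. The prologue peels iterations until i is aligned on the factor, the kernel runs full strips with neither min() nor a modulo test, and the epilogue handles the leftover. The function name scan_strip_mined_by_2 and its output parameters are hypothetical.

```c
#include <assert.h>

/* Visit i = M..N in order, strip-mined by 2 with an aligned kernel.
   Visited values are appended to out (room for N-M+1 entries is needed).
   *kernel_iters receives the number of kernel iterations (full strips).
   Returns the number of visited values. */
int scan_strip_mined_by_2(int M, int N, int *out, int *kernel_iters)
{
    int n = 0;
    int i = M;
    /* Prologue (P): peel until i is a multiple of the strip-mine factor. */
    for (; i <= N && i % 2 != 0; i++)
        out[n++] = i;
    /* Kernel (K): full strips of 2, unrolled, no min() and no modulo inside. */
    *kernel_iters = 0;
    for (; i + 1 <= N; i += 2) {
        out[n++] = i;
        out[n++] = i + 1;
        (*kernel_iters)++;
    }
    /* Epilogue (E): at most one leftover iteration. */
    for (; i <= N; i++)
        out[n++] = i;
    return n;
}
```

For M = 3 and N = 10 the prologue executes i = 3, the kernel runs three strips starting at i = 4, 6 and 8, and the epilogue executes i = 10. Only the prologue and epilogue carry the bound and alignment tests, which is why they add code size but little execution time.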

Removing Modulos – Hermite Normal Form
• Our solution unrolls modulo guards from kernels after strip-mining
• Hermite Normal Form: mathematical decomposition of the transformation matrix as U.H
  • where U is unimodular (skewing matrix)
  • H is triangular (strides in the transformed space → diagonal coefficients)
• Suppresses the need for internal modulo guards
BUT
• If U is not the same, the skewings are different
• Deal with non-parallel lattices … how ?
• In practice, used for 1 statement or “simple” examples
All statements need to have the same transformation: TOO RESTRICTIVE

Experimental Results

Putting it all Together – Code Size Experiments
State-of-the-art polyhedral code generator → CLooG [Bas04]
PERFORMANCE COMPARISONS: CLooG04 vs CLooG06
[Table: code size results, columns CL04, CL06 and Improv.]

Generation Speed – Experiments
We compare the original CLooG (CL04) from the [Bastoul04] PACT paper with our optimized CLooG (CL06)
[Figure: generation cost as % of CL04, panels Node Fusion, Scalar Dimensions and Domain Iterators]
• Swim
  • 36% time reduction
  • 58% memory reduction

Putting it all Together – Code Generation Speed Experiments
State-of-the-art polyhedral code generator → CLooG [Bas04]
PERFORMANCE COMPARISONS: CLooG vs URGenT
[Figure: generation speed charts, CL vs UR]
• Affine schedule: 412 → 2267 lines (40% execution speedup w.r.t. best peak)
• PathScale -Ofast needs ~22s to process the AST (LNO OFF)

Conclusion / Future Work
• Implemented as the code generation phase of the URUK framework [ICS05]
• Generation speed goal achieved (up to 56x, stands comparison with PathScale)
• Greatly improved code size with the improved if-hoisting technique (up to 5.8x)
• Modulo conditionals are removed (from the kernel) → mix with HNF
• Still room for speeding up generation (caches, memory pools, parallelization)
• Focus on code-generation-friendly transformations

Thank you !!!
www.cloog.org for full presentation & more
