350 likes | 507 Views
Polyhedral Code Generation In The Real World. Nicolas VASILACHE Cédric BASTOUL Albert COHEN. Outline. Introduction Affine schedules Formal General Form Contributions Focus on Modulo Conditional Removal (speed & quality) Experimental Results. 2.
E N D
Polyhedral Code GenerationIn The Real World Nicolas VASILACHE Cédric BASTOUL Albert COHEN
Outline • Introduction • Affine schedules • Formal General Form • Contributions • Focus on Modulo Conditional Removal (speed & quality) • Experimental Results 2
Introduction – Polyhedral Model • Powerful expressiveness for high level transformations (parallelism, locality) • Can express any composition of usual loop transformations [Pugh91] • Compact representation of all legal transformations [Feautrier90] • Code Generation was the weakest link [Griebl & al. 98] • Until recent algorithm [Quilleré00] without transformations • However, still problematic on long, parametric sequences on SPECs 3
SwimFP2000 [ICS05] ~ 30 polyhedral loop transformations 40% speedup wrt best peak perf. on AMD64 Goal : Generation time comparable to BE of a real compiler (EKOPath) Introduction – Transformations WHY TRANSFORM ??? Cholesky factorization, 6 statements, Optimal allocation functions [McKin92] • Huge code generation times (ex: full Swim ~ 421 2267 lines, 20 mn / 300 MB) • In the context of complex transformations 4
Code Generation : syntactic loops from matrix representation Introduction – Context & Notations 5
1 0 0 1 = Affine Schedule – Trivial Example j t2 time 3 3 (i,j) (t1=i, t2=j) domain j 3 1 1 1 1 3 i 1 3 t1 1 3 i for(i=1;i<=3;i++) for(j=1;j<=3;j++) S(i,j) for(t1=1;t1<=3;t1++) for(t2=1;t2<=3;t2++) S(i=t1,j=t2) time S(2,1) S(1,1) S(3,2) S(1,2) S(1,3) S(2,3) S(2,2) S(3,1) S(3,3) domain • Bijection between domain and time iterations • Time iterations determine the generated loops (nesting, bounds) • Execution follows lexicographic order on time dimensions • Domain values touched by the statement : i=t1,j=t2 7
0 1 1 0 = Affine Schedule – Loop Interchange j t2 time 3 3 (i,j) (t1=j, t2=i) domain j 3 1 1 1 1 3 i 1 3 t1 1 3 i for(i=1;i<=3;i++) for(j=1;j<=3;j++) S(i,j) for(t1=1;t1<=3;t1++) for(t2=1;t2<=3;t2++) S(i=t2,j=t1) time S(2,1) S(1,1) S(3,2) S(1,2) S(1,3) S(2,3) S(2,2) S(3,1) S(3,3) domain • Another bijection between domain and time iterations • New bounds computation • Lexicographic order on time dimensions • Domain values touched by the statement : i=t2,j=t1 8
Affine Schedule – Parallel Wavefronts j 3 (i,j) (t1= i+j) j domain time 3 = 1 1 1 2 6 1 t1 1 3 i 1 3 i for(t1=2;t1<=6;t1++) DOALL{(i,j)|i+j==t1} S(i,j) for(i=1;i<=3;i++) for(j=1;j<=3;j++) S(i,j) time S(2,1) S(1,1) S(3,3) S(1,3) S(2,3) domain S(1,2) S(3,2) S(2,2) S(3,1) • NOT a bijection (just a surjection) • New bounds computation (t1:[2, 6]) • Domain values touched by the statement: {(i,j)|i+j==t1} 9
1 0 1 0 0 1 0 1 Affine Schedule – Statement Shifting j t2 time (i,j)1 (t1=i, t2=j) (i,j)2 (t1=i+1, t2=j) 3 3 domain j 0 = 3 1 1 0 1 1 1 = 1 2 2 0 1 3 i 1 4 t1 1 3 i for(t2=1;t2<=3;t2++) S1(i=1,j=t2) for(t1=2;t1<=3;t1++) for(t2=1;t2<=3;t2++) S1(i=t1,j=t2) S2(i=t1-1,j=t2) for(t2=1;t2<=3;t2++) S2(i=4-1,j=t2) P for(i=1;i<=3;i++) for(j=1;j<=3;j++) S1(i,j) S2(i,j) time K domain E • New bounds computation (S1: [1,3]x[1,3] S2: [2,4]x[1,3]) have disjoint parts • Separation phase needed on each time dimension (3nb_stmtw.c. complexity) 10
General Case • Schedules: Zmi Zni for each statement Si • Schedules associate logical time to each iteration domain point • Time value sets need to be separated scattering functions Time iterators Domain iterators Time Domain • Time part used for separation and ordering (Polylib computations 2dim[Wilde93]) • Domain part determines the values spanned by time dimensions • Quilleré separation phase [Quilleré00, Bastoul04] 11
Separation Principles • Consider statements with domain and schedule functions such that: • S1 has time dimensions (t1, t2, t3) spanning ([2,5]x[2,7]x[5,8]) • S2 has time dimensions (t1, t2) spanning ([0,3]x[5,7]) • S3 has time dimensions (t1, t2) spanning ([-2,6]x[5,9]) Considering t1 worklist remaining [2,5] [0,3] Polyhedral inter / diff (2dim) [0,1] [4,5] [2,3] 13
Separation Principles • Consider statements with domain and schedule functions such that: • S1 has time dimensions (t1, t2, t3) spanning ([2,5]x[2,7]x[5,8]) • S2 has time dimensions (t1, t2) spanning ([0,3]x[5,7]) • S3 has time dimensions (t1, t2) spanning ([-2,6]x[5,9]) Considering t1 worklist remaining [0,1] [4,5] [-2,6] [2,3] Polyhedral inter / diff (2dim) 3nb_stmt w.c. compl. kernel [2,3] [-2,-1] [0,1] [0,1] [-2,-1] 14
Separation Principles • Consider statements with domain and schedule functions such that: • S1 has time dimensions (t1, t2, t3) spanning ([2,5]x[2,7]x[5,8]) • S2 has time dimensions (t1, t2) spanning ([0,3]x[5,7]) • S3 has time dimensions (t1, t2) spanning ([-2,6]x[5,9]) for(t1=-2;t1<=-1;t1++) for(t2=5;t2<=9;t2++) S3(…) for(t1=0;t1<=1;t1++) for(t2=5;t2<=7;t2++) S2(…) S3(…) for(t2=8;t2<=9;t2++) S3(…) ... Considering t1 That was for the first time dimension Recursively for all time dimensions Result is a syntax tree of the generated loops 15
Contributions State of the art polyhedral code generator CLooG [Bastoul04] ALL PERFORMANCE COMPARISONS WILL BE CLooG vs URGenT Real World Issues • Problems provided by different sources (academia, industry, SPECFP2000) • Exhibit different challenging issues Code Generation Speed • Node fusion (exploiting transformations’ “locality”) • Exploiting scalar dimensions (replacing exponential computations with trivial ones) • Domain iterator mapping improvement (replacing exponential by matrix inversions) Code Quality • Faster If-Hoisting yielding much smaller code (conditional factorization) • Modulo Conditional removal by strip-mining (stride issue) (detailed) 16
Generation Speed – Node Fusion • Multidimensional schedules allow expression of non affine (polynomial) quantities as affine ones with more dimensions improved flexibility • Drawback Pressure on code generation (height of the tree) • Add parameters Add dimensions (polyhedral operation complexity) HOWEVER • Loop level transformations affect blocks of statements (tiling, interchange…) • Polyhedron inclusion check is NOT exponential Before each separation phase, fuse consecutive nodes with equal scattering polyhedra. 18
Generation Speed – Scalar Dimensions • Some multidimensional schedules have scalar dimensions (UTF, URUK[ICS05]) • Scalar dimensions express strict statement interleaving • Comparison of integers, no need for polyhedral separation • Syntactic tree height reduction (potentially half the height) • Marginal overhead for detection and computation • Combines well with Node Fusion 19
Generation Speed – Domain Iterator Regen. • Generation of sequential loops for non invertible schedules (wavefronts) • CLooG [Bas04] handles it with polyhedral projection on domain iterators • Drawback Adds dimensions (polyhedral operation complexity) • Drawback Additional polyhedral computations on each leaf ST after Qui. separation Phase (3nb_stmts) Use transformation invertibility (ideally, given the rank, mix of projections and invertibility) 20
t2 for t1 t1 … … for t2 cond: t1<= 4 for t3 cond: 11 <= t1 cond: 5 <= t1<=10 t1 for t1 for t2 for t3 Code Quality – If Hoisting • Quilleré separation phase leaves conditionals on triangular loops • Need of the so-called backtracking phase too aggressive (code bloat) • Potentially tremendous amount of useless work Smaller Code No useless work (simplification IS needed) Explains the generation speedup on dreamupT3 Code Bloat If-Hoisting illustration Useless Work Backtracking illustration 22
Code Quality – If Hoisting • Previous example doesn’t take place in real life (just an illustration) Backtrack + 50% • Matrix Mult. with URUK : • strip-mine by factor 4 (x3) • interchange loops (x2) • unroll 23
Let be the transformation function for a statement • Suppose is invertible, and let the matrix of denominators of • Let and Removing Modulos – Domain Iterator Regen. Time iterators Domain iterators • Inverse Scatter Matrix expresses domain iterators from time iterators • ensures all coefficients are integral • Replaces leaf polyhedral projections by matrix inversions Substitute for usual Hermite Normal Form in stride computations Problem since 91: [Irigoin91], [Pingali92], [Ramanujam95], [Xue96], [Griebl98] and others … 24
0 1 -1 0 0 1 2 3 1 0 2 3 1 0 1/3 -2/3 0 3 1 0 1 -2 0 -3 SM & unroll t2 by (3 / gcd(2,3)) SM & unroll t1 by (3 / gcd(1,3)) Removing Modulos - Inverse Scatter Matrix • Consider S with 2 domain iterators, = and = • We have = and ISM = INTEGRAL Meaning: i = t2 , 3*j = t1-2*t2 Time iterators Domain iterators = for(t1=5;t1<=2*M+3*N;t1++) for(t2=?;t2<=?;t2++) if(t1%3 == 0) S(i=t2,j=t1/3-k) (t2 = 3k) if(t1%3 == 2) S(i=t2,j=(t1-2)/3-k) (t2 = 3k+1) if(t1%3 == 1) S(i=t2,j=(t1-1)/3-k-1) (t2 = 3k+2) for(t1=5;t1<=2*M+3*N;t1++) for(t2=1;t2<=min(M,t1/2);t2++) if((t1 – 2t2)%3 == 0) S(i=t2,j=(t1-2t2)/3) for(i=1;i<=M;i++) for(j=1;j<=N;j++) S(i,j) for(t1=?;t1<=?;t1++) for(t2=?;t2<=?;t2++) S(i=t2,j=l-k) (t1 = 3l) S(i=t2,j=l-k-1) (t1 = 3l+1) S(i=t2,j=l-k-2) (t1 = 3l+2) OUCH !!! 25
Removing Modulos – There is a CATCH • Previous example flowed nicely What about the loops’ bounds ??? • “Issue” (feature) with our SM + unroll transformation (strip-mine NOT strided) • Modulos are indeed removed from the kernels only for(i=M;i<N;i+=2) for(ii=i; ii<=min(i+1,N); i++) S(i,j) for(i=M;i<=N;i++) S(i,j) V.S. P K Code Size E HOWEVER: P and E have marginal execution time when SM factor is “decent” PROLOGUE gives us ALIGNMENT on %2 (strip-mine factor) !!!!!!!!!!!!!!! Transformation quality issue 26
Removing Modulos – Hermite Normal Form • Our solution unrolls modulo guards from kernels after strip-mining • Hermite Normal Form: Mathematical decomposition of = U.H • Where U is unimodular (skewing matrix) • H is diagonal (stride in transformed space diagonal coefficients) • Suppresses the need for internal modulo guards BUT • If U is not the same, skewing are different • Deal with non parallel lattices … how ? • In practice, used for 1 statement or “simple” examples All statements need to have the same transformation TOO RESTRICTIVE 27
Putting it all Together – Code Size Experiments State of the art polyhedral code generator CLooG [Bas04] PERFORMANCE COMPARISONS: CLooG04 vs CLooG06 CL04 CL06 Improv. 29
% of CL04 Node Fusion % of CL04 Scalar Dimensions Generation Speed – Experiments We compare original CLooG (CL04) from [Bastiul04] PACT paper with our optimized CLooG (CL06) • Swim • 36% Time reduction • 58% Memory reduction Domain Iterators 30
Putting it all Together – Code Generation Speed Experiments State of the art polyhedral code generator CLooG [Bas04] PERFORMANCE COMPARISONS: CLooG vs URGenT CL UR CL UR Affine Schedule: 412 2267 lines (40% execution speedup wrt best peak) Pathscale –Ofast needs ~22s to process the AST (LNO OFF) 31
Conclusion / Future Works • Implemented as the Code Generation phase of the URUK framework [ICS05] • Generation Speed Goal achieved (up to 56x, stands PathScale comparison) • Greatly improved code size with improved if-hoisting technique (up to 5.8x) • Modulo Conditionals are removed (from kernel) Mix with HNF • Still room for speeding up generation (caches, memory pools, parallelization) • Focus on Code Generation Friendly transformations 32