410 likes | 1.08k Views
Symbolic Program Transformation for Numerical Codes. Vijay Menon Cornell University. Background. Compiler/Runtime Support for Numerical Programs (under Keshav Pingali) Sparse matrix code generation Memory hierarchy optimizations Compiler/runtime techniques for MATLAB
E N D
Symbolic Program Transformation for Numerical Codes Vijay Menon Cornell University
Background • Compiler/Runtime Support for Numerical Programs (under Keshav Pingali) • Sparse matrix code generation • Memory hierarchy optimizations • Compiler/runtime techniques for MATLAB • MAJIC (with University of Illinois) • MultiMATLAB
Motivation • Using matrix properties to generate efficient code • e.g., Associativity, Commutativity of matrix operations • Problem: loop notation • Matrix operations hidden within loop nests
Example: LU Factorization with Partial Pivoting • Also called: Gaussian Elimination • Key algorithm for solving systems of linear equations: • To solve A x = b for x: • => Factor A into L U • => Solve L y = b for y • => Solve U x = y for x
LU Factorization do j = 1,N p(j) = j; do i = j+1,N if (A(i,j)>A(p(j),j)) p(j) = i; do k = 1,N tmp = A(j,k); A(j,k) = A(p(j),k); A(p(j),k) = tmp; do i = j+1,N A(i,j) = A(i,j)/A(j,j); do k = j+1,N do i = j+1,N A(i,k) = A(i,k) - A(i,j)*A(j,k); Select pivot row: Swap pivot row with current: Scale column (to store L): Update(to compute partial U): • Problem: Poor cache behavior x x x x x x 0 x x x x x 0 0 5 x x x 0 0 3 x x x 0 0 7x x x 0 0 2 x x x
Automatic Blocking do jB = 1,N,B do j = jB,jB+B-1 p(j) = j; do i = j+1,N if (A(i,j)>A(p(j),j)) p(j) = i; do k = 1,N tmp = A(j,k); A(j,k) = A(p(j),k); A(p(j),k) = tmp; do i = j+1,N A(i,j) = A(i,j)/A(j,j); do k = j+1,jB+B-1 do i = j+1,N A(i,k) = A(i,k) - A(i,j)*A(j,k); do j = jB,jB+B-1 do k = jB+B,N do i = j+1,N A(i,k) = A(i,k) - A(i,j)*A(j,k); • Compiler transformations: • Stripmine • Index-Set Split • Loop Distribute • Tile/Map to BLAS • Problems: • Establishing legality • Mapping to BLAS
Key Points • Conventional techniques (dependence analysis) are insufficient to establish legality • Detection of matrix operations buys extra performance
Overview of this Talk • Fractal Symbolic Analysis • Framework for Symbolically determining Legality of Program Transformations • Matrixization • Generalization of Vectorization to Matrix Operations
Fractal Symbolic Analysis • Symbolic test to establish legality of program transformations for i = 1 : n S1(i); for i = 1 : n S2(i); for i = 1 : n S1(i); S2(i);
Dependence Analysis • Independent operations may be reordered • Legality: No data dependence from S2(m) to S1(l) where l > m for i = 1 : n S1(i); for i = 1 : n S2(i); for i = 1 : n S1(i); S2(i);
Symbolic Comparison • Dependence Analysis is conservative: • Symbolic execution shows equality: • aout = 2*(ain + bin) • bout = 2*bin • But, intractable for recurrent loops s1: a = 2*a s2: b = 2*b s3: a = a+b s3: a = a+b s1: a = 2*a s2: b = 2*b ?
Distillation of LU • Given p(j) j, prove: • Dependence analysis: too conservative • Symbolic comparison: intractable for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp for j = 1:n for i = j+1:n a(i) = a(i)/a(j) for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp for i = j+1:n a(i) = a(i)/a(j) B1(j): B1(j): B2(j): B2(j):
Idea • Simplify • Prove equality of complex programs via equality of simpler programs • Repeat if necessary • Sufficient, but not necessary • equality of simpler programs -> equality of original programs
Commutativity S1, S2, S3, …,Sn = Sf(1), Sf(2), Sf(3), …,Sf(n) i,j 1..n | i < j f(i) > f(j). Si; Sj = Sj; Si • Permutation Sequence of Adjacent Transpositions
Commutative Legality • Loop Distribution • Legality: 1 m < l N. S2(m); S1(l) = S1(l); S2(m) for i = 1 : n S1(i); S2(i); for i = 1 : n S1(i); for i = 1 : n S2(i);
Commutative Legality • Similar conditions: • Statement reordering • Loop interchange • Loop reversal • Loop tiling • Compiler establishes legality of transformation by testing commute condition
Back to Example • Given p(j) j, prove: • Simplified comparison => for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp for j = 1:n for i = j+1:n a(i) = a(i)/a(j) for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp for i = j+1:n a(i) = a(i)/a(j) B1(j): B1(j): B2(j): B2(j):
Simplified Comparison • Given p(j) j l > m prove: • Further simplification? for i = m+1:n a(i) = a(i)/a(m) tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp for i = m+1:n a(i) = a(i)/a(m) B2(m): B1(l): B1(l): B2(m):
Too Simple! • Given p(j) j l > m i > m prove: • Too conservative - no longer legal! • Hypothesis: • Repeated application -> dependence analysis tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp a(i) = a(i)/a(m) a(i) = a(i)/a(m) tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp
Symbolic Comparison • Can we symbolically prove?: • Effect can be summarized: • non-recurrent loops • affine indices/loop bounds tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp for i = m+1:n a(i) = a(i)/a(m) for i = m+1:n a(i) = a(i)/a(m) tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp B2(m): B1(l): B1(l): B2(m):
Guarded Expressions • Compare symbolic values of live output variable in terms of input variables: • aout (k)= • Each • guard : affine expression defining part of aout • expr : symbolic expression on input data guard1 -> expr1 guard2 -> expr2 …… guardn -> exprn
Guarded Expressions • For both program blocks: • aout(k) = • Omega Library (integer programming tool) to manipulate/compare affine guards k m => ain(k) k = l => ain(p(l))/ain(m) k = p(l) => ain(l)/ain(m) else => ain(k)/ain(m)
Summary of Fractal Symbolic Analysis • Powerful legality test • Explores tradeoffs between • tractability (dependence analysis) • accuracy (symbolic comparison) • Similar application for LU w/ pivoting • 2 recursive simplification steps • 6 guarded regions/expressions • Prototype implemented in OCAML
Overview of this Talk • Fractal Symbolic Analysis • Framework for Symbolically determining Legality of Program Transformations • Matrixization • Generalization of Vectorization to Matrix Operations
Matrixization • Detect Matrix Operations • Map to • hand-optimized libraries (BLAS) • special-purpose hardware • Eliminate loop overhead • MATLAB: type/bounds checks • Exploit Matrix Semantics • Ring Properties of Matrix (+,*)
Galerkin Example • In Galerkin, 98% of time spent in: • for i = 1:N • for j = 1:N • phi(k) += A(i,j)*x(i)*y(i); • end • end • A - Matrix • x, y - Vector
Vectorized Code • In Optimized Galerkin: phi(k) += x*A*y’; • Fragment Speedup: 260 • Program Speedup: 110
Conventional Approaches • Syntactic Pattern Replacement (KAPF,VAST,FALCON) • Can encode high-level properties • Limited use on loops • Vector Code Generation • Can detect array operations in loops • Cannot detect/exploit matrix products
Our Approach • Map code to: Abstract Matrix Form (AMF) • Convert to Symmetric AMF formulation • Optimize AMF expressions • factorization • invariant removal • Detect Matrix Products • Map AMF to MATLAB/BLAS
Expansion • Array expressions: • Forall loops: a(1:m,1:n) -> i j ai,j for i = 1:n x(i) = x(i) + y(i); -> ixi = i(xi+yi) end
Reduction • Sum Reduction Loops for i = 1:n k = k + x(i); -> k = k + î i xi end
Product • Matrix Product C = A*B -> ijCi,j= P(ijAi,h,ijBh,j) • Product-Reduction Equivalence: • e1 and e2 are scalar • e1 is constant w.r.t. ik+1…in • e2 is constant w.r.t. i1…ik-1 îk (i1…ine1 * i1…ine2) = Pîk(i1…ike1, ik…ine2)
Other AMF Properties • Distributive Properties • e1·î e2 = î((ie1) ·e2) • when e1 is constant w.r.t. i • Interchange Properties • if(e1, e2,…, en) = f(ie1, ie2,…, ien) • i e = i e • where î= • i j e = j i e • î e = î e
Back to Galerkin • Original: • k = k + îi j (ai,j * xi * yj) • Convert to Symmetric Form: • k = k + î(i j ai,j * i j xi * i j yj) • Optimize: • k = k + îi xi * (i j ai,j * i j yj) • Map to Matrix Operations: • k = k + Pî (i xi , P(i j ai,j,j yj)) • => k = k + x’*A*y;
Other Vectorization Examples • Taken from USENET: for i = 1:n for j = 1:n C(i,j) = A(i,j)*x(i); C(1:n,1:n) = A(1:n,1:n) .* repmat(x(1,n),1,n) for i = 1:n x(i) = y(:,i)*A*y(:,i)’; x(1:n) = sum(y.*(y*A’),2);
Summary/Status of Matrixization • General technique to detect matrix/vector operations • Implemented in MAJIC • MATLAB interpreter/compiler • Future work: • Extend to more ops: recurrences • JIT Matrixization
Related Work • Commutativity Analysis • Rinard & Diniz • Parallelization of OO programs • High-Level Pattern Replacement • DeRose, Marsolf, … • Exploit high-level matrix properties • Structure • Symmetry • Orthogonality
Conclusion • High-level symbolic techniques: • Fractal Symbolic Analysis • Matrixization • New techniques to analyze & exploit underlying computation and achieve substantial performance gains