Affine Partitioning for Parallelism & Locality Amy Lim Stanford University http://suif.stanford.edu/
Useful Transforms for Parallelism & Locality

INTERCHANGE
  FOR i                       FOR j
    FOR j                       FOR i
      A[i,j] =                    A[i,j] =

REVERSAL
  FOR i = 1 to n              FOR i = n downto 1
    A[i] =                      A[i] =

SKEWING
  FOR i = 1 TO n              FOR i = 1 TO n
    FOR j = 1 TO n              FOR k = i+1 to i+n
      A[i,j] =                    A[i,k-i] =

FUSION/FISSION
  FOR i = 1 TO n              FOR i = 1 TO n
    A[i] =                      A[i] =
  FOR i = 1 TO n                B[i] =
    B[i] =

REINDEXING
  FOR i = 1 to n              A[1] = B[0]
    A[i] = B[i-1]             FOR i = 1 to n-1
    C[i] = A[i+1]               A[i+1] = B[i]
                                C[i] = A[i+1]
                              C[n] = A[n+1]

Traditional approach: is it legal & desirable to apply one transform?
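To make the SKEWING and INTERCHANGE rewrites on this slide concrete, here is a minimal Python sketch that lists the array elements each loop nest touches, in execution order. The function names are hypothetical and are not part of the slides; the loop bounds are copied from the slide.

```python
def original_nest(n):
    """Elements A[i, j] touched by FOR i / FOR j, in execution order."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]


def skewed_nest(n):
    """Elements A[i, k-i] touched by FOR i / FOR k = i+1 to i+n."""
    return [(i, k - i) for i in range(1, n + 1) for k in range(i + 1, i + n + 1)]


def interchanged_nest(n):
    """Elements A[i, j] touched by FOR j / FOR i, in execution order."""
    return [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)]


if __name__ == "__main__":
    n = 4
    # Skewing renames the inner index (k = i + j) without reordering iterations.
    print(original_nest(n) == skewed_nest(n))
    # Interchange visits the same elements but in a different order.
    print(sorted(interchanged_nest(n)) == original_nest(n))
```

Skewing by itself leaves the execution order unchanged; it only reshapes the iteration space, which is why it is usually combined with another transform such as interchange.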
Question: How to combine the transformations?
Affine mappings [Lim & Lam, POPL 97, ICS 99]
• Domain: arbitrary loop nesting, affine loop indices; each instruction optimized separately
• Unifies
  • Permutation
  • Skewing
  • Reversal
  • Fusion
  • Fission
  • Statement reordering
• Supports blocking across all (non-perfectly nested) loops
• Optimal: maximum degree of parallelism & minimum degree of synchronization
• Minimizes communication by aligning the computation and pipelining
Loop Transforms: Cholesky factorization example

      DO 1 J = 0, N
        I0 = MAX ( -M, -J )
        DO 2 I = I0, -1
          DO 3 JJ = I0 - I, -1
          DO 3 L = 0, NMAT
 3          A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
          DO 2 L = 0, NMAT
 2          A(L,I,J) = A(L,I,J) * A(L,0,I+J)
        DO 4 L = 0, NMAT
 4        EPSS(L) = EPS * A(L,0,J)
        DO 5 JJ = I0, -1
        DO 5 L = 0, NMAT
 5        A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
        DO 1 L = 0, NMAT
 1        A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )
      DO 6 I = 0, NRHS
        DO 7 K = 0, N
          DO 8 L = 0, NMAT
 8          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 7 JJ = 1, MIN (M, N-K)
          DO 7 L = 0, NMAT
 7          B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
        DO 6 K = N, 0, -1
          DO 9 L = 0, NMAT
 9          B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 6 JJ = 1, MIN (M, K)
          DO 6 L = 0, NMAT
 6          B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)
Results for Optimizing Perfect Nests
Speedup on a Digital Turbolaser with eight 300 MHz 21164 processors
Optimizing Arbitrary Loop Nesting Using Affine Partitions
(Same code as on the slide "Loop Transforms: Cholesky factorization example".)
[Figure: arrays A, B, and EPSS, each labeled with dimension L]
A Simple Example

  FOR i = 1 TO n DO
    FOR j = 1 TO n DO
      A[i,j] = A[i,j]+B[i-1,j];    (S1)
      B[i,j] = A[i,j-1]*B[i,j];    (S2)

[Figure: iteration space with axes i and j showing the instances of S1 and S2]
Best Parallelization Scheme

SPMD code: Let p be the processor’s ID number

  if (1-n <= p <= n) then
    if (1 <= p) then
      B[p,1] = A[p,0] * B[p,1];                     (S2)
    for i1 = max(1,1+p) to min(n,n-1+p) do
      A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];       (S1)
      B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];     (S2)
    if (p <= 0) then
      A[n+p,n] = A[n+p,n] + B[n+p-1,n];             (S1)

Solution can be expressed as affine partitions:
  S1: Execute iteration (i, j) on processor i-j.
  S2: Execute iteration (i, j) on processor i-j+1.
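The SPMD code above can be checked against the sequential loop nest with a small simulation. The sketch below is an illustration only: it runs the "processors" one after another in an arbitrary order, which stands in for true parallel execution under the assumption that the partitions are communication-free. The function names and the (n+1) x (n+1) list-of-lists layout (row and column 0 hold boundary values) are assumptions made for this example.

```python
def sequential(A, B, n):
    """Run the original loop nest from 'A Simple Example' on copies of A and B."""
    A = [row[:] for row in A]
    B = [row[:] for row in B]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            A[i][j] = A[i][j] + B[i - 1][j]      # S1
            B[i][j] = A[i][j - 1] * B[i][j]      # S2
    return A, B


def spmd(A, B, n, processor_order):
    """Run the SPMD body once per processor ID, in the given order."""
    A = [row[:] for row in A]
    B = [row[:] for row in B]
    for p in processor_order:
        if 1 - n <= p <= n:
            if 1 <= p:
                B[p][1] = A[p][0] * B[p][1]                              # S2
            for i1 in range(max(1, 1 + p), min(n, n - 1 + p) + 1):
                A[i1][i1 - p] = A[i1][i1 - p] + B[i1 - 1][i1 - p]        # S1
                B[i1][i1 - p + 1] = A[i1][i1 - p] * B[i1][i1 - p + 1]    # S2
            if p <= 0:
                A[n + p][n] = A[n + p][n] + B[n + p - 1][n]              # S1
    return A, B


def make_inputs(n):
    """Deterministic small integer inputs, so results compare exactly."""
    A = [[(7 * i + 3 * j) % 5 - 2 for j in range(n + 1)] for i in range(n + 1)]
    B = [[(2 * i + 5 * j) % 7 - 3 for j in range(n + 1)] for i in range(n + 1)]
    return A, B


if __name__ == "__main__":
    n = 5
    A0, B0 = make_inputs(n)
    order = list(range(n, -n - 1, -1))   # any order of processor IDs should work
    print(spmd(A0, B0, n, order) == sequential(A0, B0, n))
```

Because every processor touches a disjoint set of array elements, the order in which the processor IDs are visited should not affect the result, which is what the comparison exercises.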
Maximum Parallelism & No Communication

Let $F_{xj}$ be an access to array $x$ in statement $j$,
$i_j$ be an iteration index for statement $j$,
$B_j i_j \ge 0$ represent loop bound constraints for statement $j$.
Find $C_j$, which maps an instance of statement $j$ to a processor:

$$\forall\, i_j, i_k:\quad B_j i_j \ge 0,\ B_k i_k \ge 0,\ F_{xj}(i_j) = F_{xk}(i_k) \;\Rightarrow\; C_j(i_j) = C_k(i_k)$$

with the objective of maximizing the rank of $C_j$.

[Figure: loop iterations $i_1$, $i_2$ map to array elements via $F_1(i_1)$, $F_2(i_2)$ and to processor IDs via $C_1(i_1)$, $C_2(i_2)$]
Algorithm

$$\forall\, i_j, i_k:\quad B_j i_j \ge 0,\ B_k i_k \ge 0,\ F_{xj}(i_j) = F_{xk}(i_k) \;\Rightarrow\; C_j(i_j) = C_k(i_k)$$

• Rewrite partition constraints as systems of linear equations
  • Use the affine form of Farkas' Lemma to rewrite the constraints as systems of linear inequalities in $C$ and $\lambda$
  • Use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers $\lambda$ and get systems of linear equations $AC = 0$
• Find solutions using linear algebra techniques
  • A basis of the null space of matrix $A$ gives a solution for $C$ with maximum rank
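The final step, solving $AC = 0$, can be illustrated on the two-statement example from "A Simple Example". The sketch below is not the full algorithm: it skips Farkas' Lemma and Fourier-Motzkin elimination and instead assumes the equations can be read off by matching coefficients of the two dependences (S1(i,j) writes A[i,j], read by S2(i,j+1); S2(i,j) writes B[i,j], read by S1(i+1,j)). The unknown layout (a1, b1, c1, a2, b2, c2), with C1(i,j) = a1*i + b1*j + c1 and C2(i,j) = a2*i + b2*j + c2, and the helper `null_space` are assumptions made for this illustration.

```python
from fractions import Fraction


def null_space(rows):
    """Basis of {x : rows * x = 0}, via exact Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    ncols = len(m[0])
    pivots = []
    r = 0
    for c in range(ncols):
        if r == len(m):
            break
        # Find a row at or below r with a nonzero entry in column c.
        p = next((k for k in range(r, len(m)) if m[k][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        pivot = m[r][c]
        m[r] = [v / pivot for v in m[r]]
        for k in range(len(m)):
            if k != r and m[k][c] != 0:
                factor = m[k][c]
                m[k] = [a - factor * b for a, b in zip(m[k], m[r])]
        pivots.append(c)
        r += 1
    basis = []
    for free in (c for c in range(ncols) if c not in pivots):
        x = [Fraction(0)] * ncols
        x[free] = Fraction(1)
        for row, pc in zip(m, pivots):
            x[pc] = -row[free]
        basis.append(x)
    return basis


# Unknowns: (a1, b1, c1, a2, b2, c2).
A_MATRIX = [
    [1, 0, 0, -1, 0, 0],    # C1(i,j) = C2(i,j+1): a1 - a2 = 0
    [0, 1, 0, 0, -1, 0],    #                      b1 - b2 = 0
    [0, 0, 1, 0, -1, -1],   #                      c1 - b2 - c2 = 0
    [-1, 0, -1, 0, 0, 1],   # C2(i,j) = C1(i+1,j): c2 - a1 - c1 = 0
]

if __name__ == "__main__":
    for vector in null_space(A_MATRIX):
        print([str(v) for v in vector])
```

The mapping given on the slide (S1 on processor i-j, S2 on processor i-j+1) corresponds to the vector (1, -1, 0, 1, -1, 1), which satisfies all four equations. One basis vector of the null space only shifts both mappings by the same constant, so the linear part of the solution has rank 1, matching the single degree of parallelism in the SPMD code.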
Pipelining: Alternating Direction Integration Example

Requires transposing data
  DO J = 1 to N (parallel)
    DO I = 1 to N
      A(I,J) = f(A(I,J),A(I-1,J))
  DO J = 1 to N
    DO I = 1 to N (parallel)
      A(I,J) = g(A(I,J),A(I,J-1))

Moves only boundary data
  DO J = 1 to N (parallel)
    DO I = 1 to N
      A(I,J) = f(A(I,J),A(I-1,J))
  DO J = 1 to N (pipelined)
    DO I = 1 to N
      A(I,J) = g(A(I,J),A(I,J-1))
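One way to see that the pipelined version is legal is to execute the second loop nest in stages and compare against the sequential order. In the sketch below, stage t = I + J is an assumed time mapping chosen for illustration (the processor owning column J handles row I at stage I + J); the concrete integer functions `f` and `g` are arbitrary stand-ins, since the slide leaves them unspecified.

```python
def f(a, up):
    """Arbitrary stand-in for the slide's f."""
    return a + 2 * up


def g(a, left):
    """Arbitrary stand-in for the slide's g."""
    return 3 * a - left


def first_sweep(a, n):
    """Columns are independent here, so J could run in parallel."""
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            a[i][j] = f(a[i][j], a[i - 1][j])


def sequential(a, n):
    a = [row[:] for row in a]
    first_sweep(a, n)
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            a[i][j] = g(a[i][j], a[i][j - 1])
    return a


def pipelined(a, n):
    a = [row[:] for row in a]
    first_sweep(a, n)
    # Second sweep in stages: iteration (i, j) runs at stage t = i + j.
    # Its only input from this sweep, a[i][j-1], was produced at stage t - 1.
    for t in range(2, 2 * n + 1):
        # Iterations within one stage are independent; reverse order on purpose.
        for j in range(min(n, t - 1), max(1, t - n) - 1, -1):
            i = t - j
            a[i][j] = g(a[i][j], a[i][j - 1])
    return a


def make_input(n):
    """(n+1) x (n+1) grid; row 0 and column 0 hold boundary values."""
    return [[(3 * i + 5 * j) % 7 - 3 for j in range(n + 1)] for i in range(n + 1)]


if __name__ == "__main__":
    n = 6
    grid = make_input(n)
    print(pipelined(grid, n) == sequential(grid, n))
```

At each stage, the owner of column J needs only the single value A(I,J-1) from its neighbor, which is the "moves only boundary data" property named on the slide.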
Finding the Maximum Degree of Pipelining

Let $F_{xj}$ be an access to array $x$ in statement $j$,
$i_j$ be an iteration index for statement $j$,
$B_j i_j \ge 0$ represent loop bound constraints for statement $j$.
Find $T_j$, which maps an instance of statement $j$ to a time stage:

$$\forall\, i_j, i_k:\quad B_j i_j \ge 0,\ B_k i_k \ge 0,\ (i_j \prec i_k) \wedge (F_{xj}(i_j) = F_{xk}(i_k)) \;\Rightarrow\; T_j(i_j) \preceq T_k(i_k) \text{ (lexicographically)}$$

with the objective of maximizing the rank of $T_j$.

[Figure: loop iterations $i_1$, $i_2$ map to array elements via $F_1(i_1)$, $F_2(i_2)$ and to time stages via $T_1(i_1)$, $T_2(i_2)$]
Key Insight
• Choice in time mapping => (pipelined) parallelism
• Degrees of parallelism = rank(T) - 1
Putting it All Together
• Find maximum outer-loop parallelism with minimum synchronization
  • Divide into strongly connected components
  • Apply processor mapping algorithm (no communication) to program
  • If no parallelism found,
    • Apply time mapping algorithm to find pipelining
    • If no pipelining found (found outer sequential loop)
      • Repeat process on inner loops
• Minimize communication
  • Use a greedy method to order communicating pairs
  • Try to find communication-free, or neighborhood-only, communication by solving similar equations
• Aggregate computations of consecutive data to improve spatial locality
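The "divide into strongly connected components" step can be sketched with Tarjan's algorithm on a statement dependence graph. The slides do not say which SCC algorithm is used, so this choice is an assumption. The example graph is also hypothetical: S1 and S2 depend on each other as in "A Simple Example", and S3 is an invented extra statement that only consumes values from S2.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps each node to a list of successors.

    Components are returned in reverse topological order of the
    component graph (consumers before producers).
    """
    index = {}
    low = {}
    stack = []
    on_stack = set()
    components = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(sorted(component))

    for v in graph:
        if v not in index:
            visit(v)
    return components


# Hypothetical dependence graph: an edge u -> v means v depends on u.
DEPENDENCES = {"S1": ["S2"], "S2": ["S1", "S3"], "S3": []}

if __name__ == "__main__":
    print(strongly_connected_components(DEPENDENCES))
```

Statements in the same component (here S1 and S2) must be partitioned together, while a component such as S3 can be handled on its own once its producers are placed.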
Use of Affine Partitioning in Locality Opt.
• Promotes array contraction
  • Finds independent threads and shortens the live ranges of variables
• Supports blocking of imperfectly nested loops
  • Finds largest fully permutable loop nest via affine partitioning
  • Fully permutable loop nest -> blockable
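As a minimal illustration of what array contraction means, the sketch below shows a temporary array whose live range shrinks to a single iteration once the producing and consuming loops are fused, so it can be replaced by a scalar. The loops and names are invented for this example and do not come from the slides.

```python
def with_temporary_array(b):
    """Two separate loops: the temporary t must hold all n values."""
    n = len(b)
    t = [0] * n
    for i in range(n):
        t[i] = 2 * b[i]
    c = [0] * n
    for i in range(n):
        c[i] = t[i] + 1
    return c


def contracted(b):
    """After fusion, each t value is consumed right away: t becomes a scalar."""
    n = len(b)
    c = [0] * n
    for i in range(n):
        t = 2 * b[i]
        c[i] = t + 1
    return c


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5]
    print(with_temporary_array(data) == contracted(data))
```

The contracted version needs O(1) temporary storage instead of O(n), and the value is reused while it is still in a register or in cache.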