1 / 16

Affine Partitioning for Parallelism & Locality

Affine Partitioning for Parallelism & Locality. Amy Lim Stanford University http://suif.stanford.edu/. INTERCHANGE FOR i FOR j FOR j FOR i A[i,j]= A[i,j] =

stan
Download Presentation

Affine Partitioning for Parallelism & Locality

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Affine Partitioning for Parallelism & Locality Amy Lim Stanford University http://suif.stanford.edu/

  2. INTERCHANGE FOR i FOR j FOR j FOR i A[i,j]= A[i,j] = REVERSAL FOR i=1 to n FOR i= n downto 1 A[i]= A[i] = SKEWING FOR i=1 TO n FOR i=1 TO n FOR j=1 TO n FOR k=i+1 to i+n A[i,j] = A[i,k-i] = FUSION/FISSION FOR i = 1 TO n FOR i = 1 TO n A[i] = A[i] = FOR i = 1 TO n B[i] = B[i] = REINDEXING FOR i = 1 to n A[1] = B[0] A[i] = B[i-1] FOR i = 1 to n-1 C[i] = A[i+1] A[i+1] = B[i] C[i] = A[i+1] C[n] = A[n+1] Traditional approach: is it legal & desirable to apply one transform? Useful Transforms for Parallelism&Locality

  3. Affine mappings [Lim & Lam, POPL 97, ICS 99] Domain: arbitrary loop nesting, affine loop indices; instruction optimized separately Unifies Permutation Skewing Reversal Fusion Fission Statement reordering Supports blocking across all (non-perfectly nested) loops Optimal:Max. deg. of parallelism & min. deg. of synchronization Minimize communication by aligning the computation and pipelining Question: How to combine the transformations?

  4. Loop Transforms: Cholesky factorization example DO 1 J = 0, N I0 = MAX ( -M, -J ) DO 2 I = I0, -1 DO 3 JJ = I0 - I, -1 DO 3 L = 0, NMAT 3 A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J) DO 2 L = 0, NMAT 2 A(L,I,J) = A(L,I,J) * A(L,0,I+J) DO 4 L = 0, NMAT 4 EPSS(L) = EPS * A(L,0,J) DO 5 JJ = I0, -1 DO 5 L = 0, NMAT 5 A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2 DO 1 L = 0, NMAT 1 A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) ) DO 6 I = 0, NRHS DO 7 K = 0, N DO 8 L = 0, NMAT 8 B(I,L,K) = B(I,L,K) * A(L,0,K) DO 7 JJ = 1, MIN (M, N-K) DO 7 L = 0, NMAT 7 B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K) DO 6 K = N, 0, -1 DO 9 L = 0, NMAT 9 B(I,L,K) = B(I,L,K) * A(L,0,K) DO 6 JJ = 1, MIN (M, K) DO 6 L = 0, NMAT 6 B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)

  5. Results for Optimizing Perfect Nests Speedup on a Digital Turbolaser with 8 300Mhz 21164 processors

  6. Optimizing Arbitrary Loop Nesting Using Affine Partitions DO 1 J = 0, N I0 = MAX ( -M, -J ) DO 2 I = I0, -1 DO 3 JJ = I0 - I, -1 DO 3 L = 0, NMAT 3 A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J) DO 2 L = 0, NMAT 2 A(L,I,J) = A(L,I,J) * A(L,0,I+J) DO 4 L = 0, NMAT 4 EPSS(L) = EPS * A(L,0,J) DO 5 JJ = I0, -1 DO 5 L = 0, NMAT 5 A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2 DO 1 L = 0, NMAT 1 A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) ) DO 6 I = 0, NRHS DO 7 K = 0, N DO 8 L = 0, NMAT 8 B(I,L,K) = B(I,L,K) * A(L,0,K) DO 7 JJ = 1, MIN (M, N-K) DO 7 L = 0, NMAT 7 B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K) DO 6 K = N, 0, -1 DO 9 L = 0, NMAT 9 B(I,L,K) = B(I,L,K) * A(L,0,K) DO 6 JJ = 1, MIN (M, K) DO 6 L = 0, NMAT 6 B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K) A L B L EPSS L

  7. Results with Affine Partitioning + Blocking

  8. A Simple Example FOR i = 1 TO n DO FOR j = 1 TO n DO A[i,j] = A[i,j]+B[i-1,j]; (S1) B[i,j] = A[i,j-1]*B[i,j]; (S2) S1 i S2 j

  9. Best Parallelization Scheme SPMD code: Let p be the processor’s ID number if (1-n <= p <= n) then if (1 <= p) then B[p,1] = A[p,0] * B[p,1]; (S2) for i1 = max(1,1+p) to min(n,n-1+p) do A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p]; (S1) B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1]; (S2) if (p <= 0) then A[n+p,n] = A[n+p,N] + B[n+p-1,n]; (S1) Solution can be expressed as affine partitions: S1: Execute iteration (i, j) on processor i-j. S2: Execute iteration (i, j) on processor i-j+1.

  10. Let Fxj be an access to array x in statement j, ijbe an iteration index for statementj, Bjij 0 represent loop bound constraints for statementj, Find Cjwhich maps an instance of statement jto a processor:  ij, ik Bjij 0, Bkik 0 Fxj (ij) = Fxk (ik)  Cj (ij) = Ck (ik) with the objective of maximizing the rank of Cj F1(i1) Array Loops F2(i2) C1(i1) C2(i2) Processor ID Maximum Parallelism & No Communication

  11.  ij, ik Bjij 0, Bkik 0 Fxj (ij) = Fxk (ik)  Cj (ij) = Ck (ik) Rewrite partition constraints as systems of linear equations use affine form of Farkas Lemma to rewrite constraints assystems of linear inequalities in C and l use Fourier-Motzkin algorithm to eliminate Farkas multipliers l and get systems of linear equations AC =0 Find solutions using linear algebra techniques the null space for matrix A is a solution of C with maximum rank. Algorithm

  12. PipeliningAlternating Direction Integration Example Requires transposing data DO J = 1 to N (parallel) DO I = 1 to N A(I,J) = f(A(I,J),A(I-1,J) DO J = 1 to N DO I = 1 to N (parallel) A(I,J) = g(A(I,J),A(I,J-1)) Moves only boundary data DO J = 1 to N (parallel) DO I = 1 to N A(I,J) = f(A(I,J),A(I-1,J) DO J = 1 to N(pipelined) DO I = 1 to N A(I,J) = g(A(I,J),A(I,J-1))

  13. Let Fxj be an access to array x in statement j, ijbe an iteration index for statementj, Bjij 0 represent loop bound constraints for statementj, Find Tjwhich maps an instance of statement jto a time stage:  ij, ik Bjij 0, Bkik 0 ( ij ik)  (Fxj ( ij) = Fxk ( ik))  Tj (ij) Tk (ik) lexicographically with the objective of maximizing the rank of Tj Finding the Maximum Degree of Pipelining F1(i1) Array Loops F2(i2) T1(i1) T2(i2) Time Stage

  14. Key Insight • Choice in time mapping => (pipelined) parallelism • Degrees of parallelism = rank(T) - 1

  15. Putting it All Together • Find maximum outer-loop parallelism with minimum synchronization • Divide into strongly connected components • Apply processor mapping algorithm (no communication) to program • If no parallelism found, • Apply time mapping algorithm to find pipelining • If no pipelining found (found outer sequential loop) • Repeat process on inner loops • Minimize communication • Use a greedy method to order communicating pairs • Try to find communication-free, or neighborhood only communication by solving similar equations • Aggregate computations of consecutive data to improve spatial locality

  16. Use of Affine Partitioning in Locality Opt. • Promotes array contraction • Finds independent threads and shortens the live ranges of variables • Supports blocking of imperfectly nested loops • Finds largest fully permutable loop nest via affine partitioning • Fully permutable loop nest -> blockable

More Related