1 / 44

Enhancing Fine-Grained Parallelism

Enhancing Fine-Grained Parallelism. Chapter 5 of Allen and Kennedy Mirit & Haim. Overview. Motivation Loop Interchange Scalar Expansion Scalar and Array Renaming. Motivational Example. Matrix Multiplication: DO J = 1, M DO I = 1, N T = 0.0 DO K = 1,L

guiliaine
Download Presentation

Enhancing Fine-Grained Parallelism

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Enhancing Fine-Grained Parallelism Chapter 5 of Allen and Kennedy Mirit & Haim

  2. Overview • Motivation • Loop Interchange • Scalar Expansion • Scalar and Array Renaming

  3. Motivational Example • Matrix Multiplication: DO J = 1, M DO I = 1, N T = 0.0 DO K = 1,L T = T + A(I,K) * B(K,J) ENDDO C(I,J) = T ENDDO ENDDO • Codegen: tries to find parallelism using transformations of loop distribution and statement reordering • codegen will not uncover any vector operations because of the use of the temporary scalar T which creates several dependences carried by each of the loops.

  4. Motivational Example I Replacing the scalar T with the vector T$ will eliminate most of the dependences: DO J = 1, M DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO ENDDO

  5. T$(1:N) = 0.0 C(1:N,J) = T$(1:N) Motivational Example II Distribution of the I-loop gives us: DO J = 1, M DO I = 1, N T$(I) = 0.0 ENDDO DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO DO I = 1, N C(I,J) = T$(I) ENDDO ENDDO

  6. Motivational Example III Interchanging I and K loops, we get: DO J = 1, M T$(1:N) = 0.0 DO I = 1,N DO K = 1,L DO K = 1,L DO I = 1,N T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO ENDDO C(1:N,J) = T$(1:N) ENDDO

  7. Motivational Example IV Finally, Vectorizing the I-Loop: DO J = 1, M T$(1:N) = 0.0 DO K = 1,L DO I = 1,N T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J) ENDDO C(1:N,J) = T$(1:N) ENDDO

  8. Motivational Example - Summary We used two new transformations: • Loop interchange (replacing I,K loops) • Scalar Expansion (T => T$)

  9. Loop Interchange Loop Interchange - Switching the nesting order of two loops DO I = 1, N DO J = 1, M S A(I,J+1) = A(I,J) + B ENDDO ENDDO Applying loop interchange: DO J = 1, M DO I = 1, N S A(I,J+1) = A(I,J) + B ENDDO ENDDO leads to: DO J = 1, M S A(1:N,J+1) = A(1:N,J) + B ENDDO Impossible to vectorize since there is a cyclic dependence from S to itself, carriedby the J-Loop.

  10. S(1,4) S(2,4) S(3,4) S4,4) J = 4 S(1,3) S(2,3) S(3,3) S(4,3) J = 3 S(1,2) S(2,2) S(3,2) S(4,2) J = 2 S(1,1) S(2,1) S(3,1) S(4,1) J = 1 I = 1 I = 2 I = 3 I = 4 Loop Interchange: Safety • Safety: not all loop interchanges are safe. • S(I,J) – the instance of statement S with the parameters (iteration numbers) I and J DO J = 1, M DO I = 1, N S: A(I,J+1) = A(I+1,J) ENDDO ENDDO A(2,2) is assigned at S(2,1)and used at S(1,2).

  11. Loop Interchange: Safety • A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.

  12. Distance Vector - Reminder • Consider a dependence in a loop nest of n loops • Statement S1 on iteration i is the source of the dependence • Statement S2 on iteration j is the sink of the dependence • The distance vector is a vector of length n d(i,j) such that: d(i,j)k = jk – ik

  13. “<” ifd(i,j)k > 0 • “=” if d(i,j)k = 0 • “>” if d(i,j)k < 0 D(i,j)k = Direction Vector - Reminder Suppose that there is a dependence from statement S1 on iterationi of a loop nest of n loops andstatement S2 on iteration j, then the dependence direction vector is D(i,j) is defined as a vector of length n such that

  14. Direction Vector - Reminder DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO ENDDO ENDDO • (I+1,J+1,K) - (I,J,K) = (1,1,0) => (<,<,=) • (I+1,J+1,K) - (I,J+1,K+1) =(1,0,-1) => (<,=,>)

  15. Direction Vector - Reminder • We say that the I’th loop carries a dependence with direction-vector d, if the leftmost non-”=“ in d is in the I’th place. (<,<,=) (<,=,>) • It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.

  16. Loop Interchange: Safety • Theorem 5.1 Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops is determined by applying the same permutation to the elements of D(i,j). DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO ENDDO ENDDO • (I+1,J+1,K) - (I,J,K) = (1,1,0) => (<,<,=) => (<,=,<)

  17. < < = < = > Loop Interchange: Safety • The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest and every such direction vector is represented by a row. DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO ENDDO ENDDO • (1,1,0) => (<,<,=) • (1,0,-1) => (<,=,>)

  18. i j k < < = < = > j i k < < = = < > j k i < = < = > < Loop Interchange: Safety • Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non- "=" direction in any row.

  19. I J = < < < J I < = < < Loop Interchange : profitability In general, our goal is to enhance vectorization by increasing the number of innermost loops that do not carry any dependence: DO I = 1, N DO J = 1, M S A(I,J+1) = A(I,J) + A(I-1,J) + B ENDDO ENDDO Applying loop interchange: DO J = 1, M DO I = 1, N S A(I,J+1) = A(I,J) + A(I-1,J) + B ENDDO ENDDO leads to: DO J = 1, M S A(1:N,J+1) = A(1:N,J) + A(0:N-1,J) + B ENDDO

  20. i j k < < = Loop Interchange: profitability • Profitability depends on architecture DO I = 1, N DO J = 1, M DO K = 1, L S A(I+1,J+1,K) = A(I,J,K) + B ENDDO ENDDO ENDDO • For SIMD machines with large number of FU’s: DO I = 1, N S A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B ENDDO

  21. A(I,1:M,1:L) j k i < = < Loop Interchange: profitability • Since Fortran stores in column-major order we would like to vectorize theI-loop • Thus, after Interchanging the I-loop to be the inner loop, we can transform to: DO J = 1, M DO K = 1, L S A(2:N+1,J+1,K) = A(1:N,J,K) + B ENDDO ENDDO

  22. k j i = < < Loop Interchange: profitability • MIMD machines with vector execution units want to cut down synchronization costs, therefore, shifting the K-loop to outermost level will get: PARALLEL DO K = 1, L DO J = 1, M A(2:N+1,J+1,K) = A(1:N,J,K) + B ENDDO END PARALLEL DO

  23. < = < = = = = < > < < = = = = = < < = = = = = < < = = < = = = < < > < = = = < = = < = = = = = < Loop Shifting • Motivation: Identify loops which can be moved and move them to “optimal” nesting levels. • Trying all the permutations is not practical. • Theorem 5.3 In a perfect loop nest, if loops at level i, i+1,...,i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.

  24. I J K = = < K I J < = = A(1:N,1:N,k) = A(1:N,1:N,k+1) Loop Shifting • Heuristic – move the loops that carry no dependence to the innermost position: DO I = 1, N DO J = 1, N DO K = 1, N A(I,J,k) = A(I,J,k+1) ENDDO ENDDO ENDDO DO K= 1, N DO I = 1, N DO J = 1, N A(I,J,k) = A(I,J,k+1) ENDDO ENDDO ENDDO

  25. I J < < = < J I < < < = Loop Shifting • The heuristic fail in: DO I = 1, N DO J = 1, M A(I+1,J+1) = A(I,J) + A(I+1,J) ENDDO ENDDO • But we still want to get: DO J = 1, M A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J) ENDDO

  26. Loop Shifting • A recursive heuristec: • If the outermost loop carries no dependence, then let the outermost loop that carries a dependence to be the outermost loop. (the former heuristic.) • If the outermost loop carries a dependence, find the outermost loop L that: • L can be safely shifted to the outermost position • L carries a dependence d whose direction vector contains an "=" in every position but the Lth. And move L to the outermost position.

  27. < < = = = = < < = < = = = = = < < < = = = = < < < = = = = = = < < = < = = < = < < = = = = = = < < = = < = < < = < = = = = = < = Loop Shifting

  28. Scalar Expansion DO I = 1, N T = A(I) A(I) = B(I) B(I) = T ENDDO • Scalar Expansion : DO I = 1, N T$(I) = A(I) A(I) = B(I) B(I) = T$(I) ENDDO T = T$(N) • leads to: T$(1:N) = A(1:N) A(1:N) = B(1:N) B(1:N) = T$(1:N) T = T$(N)

  29. Scalar Expansion • However, not always profitable. Consider: DO I = 1, N T = T + A(I) + A(I-1) A(I) = T ENDDO • Scalar expansion gives us: T$(0) = T DO I = 1, N T$(I) = T$(I-1) + A(I) + A(I-1) A(I) = T$(I) ENDDO T = T$(N)

  30. Scalar Expansion – How to • Naïve approach: Expand all scalars, vectorize, shrink all unnecessary expansions. • Dependences due to reuse of memory location vs. reuse of values: • Dependences due to reuse of values must be preserved • Dependences due to reuse of memory location can be deleted by expansion • We will try to find the removable dependence

  31. X= X= X= =X =X =X Reminder - SSA

  32. T=X(I) Y(I)=T Covering definition • A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X. DO I = 1, 100 S1 T = X(I) S2 Y(I) = T ENDDO DO I = 1, 100 IF (A(I) .GT. 0) THEN S1 T = X(I) S2 Y(I) = T ENDIF ENDDO

  33. T=X(I) Y(I)=T Covering definition • A covering definition does not always exist: DO I = 1, 100 IF (A(I) .GT. 0) THEN S2 T = X(I) ENDIF S3 Y(I) = T ENDDO • In SSA terms: There does not exist a covering definition for a variable T if the edge out of the first assignment to T goes to a -function later in the loop which merges its values with those for another control flow path through the loop

  34. T=-X(I) T=X(I) Y(I)=T Covering definition • Expanded definition of covering definition: A collection C of covering definitions for T in a loop in which every control path has a covering definition. DO I = 1, 100 IF (A(I) .GT. 0) THEN S1 T = X(I) ELSE S2 T = -X(I) ENDIF S3 Y(I) = T ENDDO

  35. Find Covering definitions DO I = 1, 100 IF (A(I) .GT. 0) THEN S1 T = X(I) ENDIF S2 Y(I) = T ENDDO • To form a collection of covering definitions, we can insert dummy assignments: DO I = 1, 100 IF (A(I) .GT. 0) THEN S1 T = X(I) ELSE S2 T = T ENDIF S3 Y(I) = T ENDDO

  36. T=X(I) Y(I)=T T=X(I) T=T Y(I)=T Find Covering definition • Algorithm to insert dummy assignments and compute the collection, C, of covering definitions: • Central idea: Look for parallel paths to a -function following the first assignment, and add dummy assignments. DO I = 1, 100 IF (A(I) .GT. 0) THEN T = X(I) ELSE T = T ENDIF Y(I) = T ENDDO Detailed algorithm: • Let S0 be the -function for T at the beginning of the loop, if there is one, and null otherwise. Make C empty and initialize an empty stack. • Let S1 be the first definition of T in the loop. Add S1 to C. • If the SSA successor of S1 is a -function S2 that is not equal to S0, then push S2 onto the stack and mark it; • While the stack is non-empty, • pop the -function S from the stack; • add all SSA predecessors that are not -functions to C; • if there is an SSA edge from S0 into S, then insert the assignment T = T as the last statement along that edge and add it to C; • for each unmarked -function S3 (other than S0) that is an SSA predecessor of S, mark S3 and push it onto the stack; • for each unmarked -function S4 that can be reached from S by a single SSA edge and which is not predominated by S in the control flow graph mark S4 and push it onto the stack.

  37. Scalar Expansion Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop: • Create an array T$ of appropriate length • For each S in the covering definition collection C, replace the T on the left-hand side by T$(I). • For every other definition of T and every use of T in the loop body reachable by SSA edges that do not pass through S0, the -function at the beginning of the loop, replace T by T$(I). • For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1). • If S0 is not null, then insert T$(0) = T before the loop. • If there is an SSA edge from any definition in the loop to a use outside the loop, insert T = T$(U) after the loop, were U is the loop upper bound. T$(0) = T DO I = 1, 100 IF (A(I) .GT. 0) THEN S1T$(I) = X(I) ELSE T$(I) = T$(I-1) ENDIF S2 Y(I) = T$(I) ENDDO

  38. Deletable dependence: • Given the covering Definitions, we can identify deletable dependence: • Backward carried antidependences: • Writing to T[i] after reading from T[i-1] at the previous iteration • Backward carried output dependences • Writing to T[i] after writing to T[i-1] at the previous iteration • Forward carried output dependences • Writing to T[i] and writing to T[i+1] at the next iteration • Loop-independent antidependences into the covering definition • Writing to T[i] after reading from T[i-1] at the same iteration • Loop-carried true dependences from a covering definition • reading from T[i] after writing to T[i-1] at the previous iteration

  39. DO I = 1, N,64 • DO J = 0,63 • T = A(I+J) +A(I+J+1) • A(I) = T + B(I+J) • ENDDO • ENDDO • DO I = 1, N • T = A(I) + A(I+1) • A(I) = T + B(I) • ENDDO • DO I = 1, N • T = A(I) + A(I+1) • A(I) = T + B(I) • ENDDO • DO I = 1, N • A(I) = A(I) + A(I+1) + B(I) • ENDDO Scalar Expansion: Drawbacks • Expansion increases memory requirements • Solutions: • Expand in a single loop • Strip mine loop before expansion: • Forward substitution:

  40. Scalar Renaming DO I = 1, 100 S1 T = A(I) + B(I) S2 C(I) = T + T S3 T = D(I) - B(I) S4 A(I+1) = T * T ENDDO DO I = 1, 100 S1 T1 = A(I) + B(I) S2 C(I) = T1 + T1 S3 T2 = D(I) - B(I) S4 A(I+1) = T2 * T2 ENDDO S3 T2$(1:100) = D(1:100) - B(1:100) S4 A(2:101) = T2$(1:100) * T2$(1:100) S1 T1$(1:100) = A(1:100) + B(1:100) S2 C(1:100) = T1$(1:100) + T1$(1:100) T = T2$(100)

  41. Scalar Renaming • Renaming algorithm partitions all definitions and uses into equivalent classes using the definition-use graph: • Pick definition • Add all uses that the definition reaches to the equivalence class • Add all definitions that reach any of the uses… • ..until fixed point is reached

  42. Scalar Renaming: Profitability • Scalar renaming will remove a loop-independent output dependence or antidependence. • (Writing to the same scalar at the same iteration or writing to the scalar after reading from it.) • Relatively cheap to use scalar renaming. • Usually done by compilers when calculating live ranges for register allocation.

  43. Array Renaming DO I = 1, N S1 A(I) = A(I-1) + X S2 Y(I) = A(I) + Z S3 A(I) = B(I) + C ENDDO • Rename A(I) to A$(I): DO I = 1, N S1 A$(I) = A(I-1) + X S2 Y(I) = A$(I) + Z S3 A(I) = B(I) + C ENDDO A(1:N) =B(1:N) +C A$(1:N)=A(0:N-1)+X Y(1:N) =A$(1:N) +Z

  44. Array Renaming: Profitability • determine edges that are removed by array renaming and analyze effects on dependence graph • procedure array_partition: • Assumes no control flow in loop body • identifies collections of references to arrays which refer to the same value • identifies deletable output dependences and antidependences • Use this procedure to generate code • Minimize amount of copying back to the “original” array at the beginning and the end

More Related