
Enhancing Fine-Grained Parallelism

An overview of compiler transformations that enhance fine-grained parallelism, including loop interchange, scalar expansion, scalar renaming, and array renaming, so that compilers can generate efficient vector code for modern architectures.


Presentation Transcript

  1. Enhancing Fine-Grained Parallelism. Chapter 5 of Allen and Kennedy, Optimizing Compilers for Modern Architectures

  2. Fine-Grained Parallelism Techniques to enhance fine-grained parallelism: • Loop Interchange • Scalar Expansion • Scalar Renaming • Array Renaming

  3. Prelude: A Long Time Ago...
     procedure codegen(R, k, D);
        // R is the region for which we must generate code.
        // k is the minimum nesting level of possible parallel loops.
        // D is the dependence graph among statements in R.
        find the set {S1, S2, ... , Sm} of maximal strongly-connected regions in the dependence graph D restricted to R;
        construct Rp from R by reducing each Si to a single node and compute Dp, the dependence graph naturally induced on Rp by D;
        let {p1, p2, ... , pm} be the m nodes of Rp numbered in an order consistent with Dp (use topological sort to do the numbering);
        for i = 1 to m do begin
           if pi is cyclic then begin      // slide annotation: "We fail here"
              generate a level-k DO statement;
              let Di be the dependence graph consisting of all dependence edges in D that are at level k+1 or greater and are internal to pi;
              codegen(pi, k+1, Di);
              generate the level-k ENDDO statement;
           end
           else
              generate a vector statement for pi in r(pi)-k+1 dimensions, where r(pi) is the number of loops containing pi;
        end
     end

  4. Prelude: A Long Time Ago... • Codegen: tries to find parallelism using transformations of loop distribution and statement reordering • If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops • Goal in Chapter 5: To explore other transformations to exploit parallelism
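The codegen procedure recalled above can be sketched in Python. This is a minimal illustration under stated assumptions, not the book's implementation: dependences are modelled as hypothetical `(source, sink, level)` triples, `depth` gives the number of loops around each statement, and the "generated code" is just a list of descriptive strings. The guard on `depth` is an addition that keeps the sketch from recursing past the innermost loop.

```python
from collections import defaultdict

INDEPENDENT = float("inf")  # level used for loop-independent dependences


def sccs_topological(nodes, edges):
    """Tarjan's algorithm. Returns the strongly connected components ordered
    so that every dependence source precedes its sink."""
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
    index, low, stack, on_stack, comps = {}, {}, [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            comps.append(sorted(comp))

    for v in nodes:
        if v not in index:
            visit(v)
    return comps[::-1]  # Tarjan emits sinks first; reverse for topological order


def codegen(region, k, deps, depth, out, indent=0):
    """Append a textual description of the generated code to `out`."""
    pad = "  " * indent
    edges = [(s, d) for s, d, _ in deps if s in region and d in region]
    for comp in sccs_topological(region, edges):
        internal = [(s, d, lvl) for s, d, lvl in deps if s in comp and d in comp]
        cyclic = len(comp) > 1 or any(s == d for s, d, _ in internal)
        if cyclic and k <= min(depth[s] for s in comp):
            out.append(f"{pad}DO level {k}")
            deeper = [e for e in internal if e[2] >= k + 1]
            codegen(comp, k + 1, deeper, depth, out, indent + 1)
            out.append(f"{pad}ENDDO")
        else:
            for s in comp:
                dims = depth[s] - k + 1  # 0 dimensions means a scalar statement
                out.append(f"{pad}{s}: vector statement in {dims} dimension(s)")
    return out


# One statement in a triple nest whose recurrence is carried by the innermost loop:
serial = codegen(["S"], 1, [("S", "S", 3)], {"S": 3}, [])
# The same statement when the carrying loop is outermost:
shifted = codegen(["S"], 1, [("S", "S", 1)], {"S": 3}, [])
```

In the first call every loop is run sequentially and the statement ends up with zero vector dimensions; in the second, only the outermost loop stays sequential and the statement vectorizes in two dimensions. That difference is what the transformations in the rest of the chapter try to bring about.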

  5. Motivational Example
         DO J = 1, M
            DO I = 1, N
               T = 0.0
               DO K = 1, L
                  T = T + A(I,K) * B(K,J)
               ENDDO
               C(I,J) = T
            ENDDO
         ENDDO
     codegen will not uncover any vector operations. However, by scalar expansion, we can get:
         DO J = 1, M
            DO I = 1, N
               T$(I) = 0.0
               DO K = 1, L
                  T$(I) = T$(I) + A(I,K) * B(K,J)
               ENDDO
               C(I,J) = T$(I)
            ENDDO
         ENDDO

  6. Motivational Example
         DO J = 1, M
            DO I = 1, N
               T$(I) = 0.0
               DO K = 1, L
                  T$(I) = T$(I) + A(I,K) * B(K,J)
               ENDDO
               C(I,J) = T$(I)
            ENDDO
         ENDDO

  7. Motivational Example II
     • Loop Distribution gives us:
         DO J = 1, M
            DO I = 1, N
               T$(I) = 0.0
            ENDDO
            DO I = 1, N
               DO K = 1, L
                  T$(I) = T$(I) + A(I,K) * B(K,J)
               ENDDO
            ENDDO
            DO I = 1, N
               C(I,J) = T$(I)
            ENDDO
         ENDDO

  8. Motivational Example III
     Finally, interchanging I and K loops, we get:
         DO J = 1, M
            T$(1:N) = 0.0
            DO K = 1, L
               T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
            ENDDO
            C(1:N,J) = T$(1:N)
         ENDDO
     • A couple of new transformations used:
        • Loop interchange
        • Scalar Expansion
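As a sanity check on the motivational example, the following Python sketch runs the original scalar loop nest and the final expanded, distributed, and interchanged form side by side. The function names are illustrative, and Python list comprehensions stand in for the Fortran 90 vector statements.

```python
def matmul_scalar(A, B):
    """Original nest: scalar T accumulates one element of C at a time."""
    N, L, M = len(A), len(B), len(B[0])
    C = [[0.0] * M for _ in range(N)]
    for j in range(M):
        for i in range(N):
            t = 0.0
            for k in range(L):
                t = t + A[i][k] * B[k][j]
            C[i][j] = t
    return C


def matmul_expanded(A, B):
    """After scalar expansion, loop distribution, and I/K interchange."""
    N, L, M = len(A), len(B), len(B[0])
    C = [[0.0] * M for _ in range(N)]
    for j in range(M):
        t = [0.0] * N                                  # T$(1:N) = 0.0
        for k in range(L):
            # T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
            t = [t[i] + A[i][k] * B[k][j] for i in range(N)]
        for i in range(N):                             # C(1:N,J) = T$(1:N)
            C[i][j] = t[i]
    return C


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 3.0, 1.0]]
same = matmul_scalar(A, B) == matmul_expanded(A, B)
```

Both versions add the products for a given element in the same K order, so the results agree exactly here, not just up to rounding.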

  9. Loop Interchange
         DO I = 1, N
            DO J = 1, M
     S         A(I,J+1) = A(I,J) + B
            ENDDO
         ENDDO
     • DV: (=, <)
     • Applying loop interchange:
         DO J = 1, M
            DO I = 1, N
     S         A(I,J+1) = A(I,J) + B
            ENDDO
         ENDDO
     • DV: (<, =)
     • leads to:
         DO J = 1, M
     S      A(1:N,J+1) = A(1:N,J) + B
         ENDDO

  10. Loop Interchange
      • Loop interchange is a reordering transformation
      • Why?
         • Think of statements being parameterized with the corresponding iteration vector
         • Loop interchange merely changes the execution order of these statements.
         • It does not create new instances, or delete existing instances
          DO J = 1, M
             DO I = 1, N
      S         <some statement>
             ENDDO
          ENDDO
      • If interchanged, S(2, 1) will execute before S(1, 2)

  11. Loop Interchange: Safety
      • Safety: not all loop interchanges are safe
          DO J = 1, M
             DO I = 1, N
                A(I,J+1) = A(I+1,J) + B
             ENDDO
          ENDDO
      • Direction vector (<, >)
      • If we interchange loops, we violate the dependence
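The effect of an unsafe interchange can be observed directly by running a small instance of the nest in both loop orders. The sketch below is illustrative only: `run_nest` and its parameters are made-up names, and the array is filled with arbitrary distinct values so that a reordered read becomes visible.

```python
def run_nest(outer, body, n=3, m=3, b=1):
    """Execute the I/J nest with the given outer loop and return array A."""
    # A is indexed 1..n+1 by 1..m+1; row 0 and column 0 are unused padding.
    A = [[10 * i + j for j in range(m + 2)] for i in range(n + 2)]
    if outer == "J":
        iterations = [(i, j) for j in range(1, m + 1) for i in range(1, n + 1)]
    else:
        iterations = [(i, j) for i in range(1, n + 1) for j in range(1, m + 1)]
    for i, j in iterations:
        if body == "safe":
            A[i][j + 1] = A[i][j] + b        # A(I,J+1) = A(I,J) + B
        else:
            A[i][j + 1] = A[i + 1][j] + b    # A(I,J+1) = A(I+1,J) + B
    return A


safe_matches = run_nest("J", "safe") == run_nest("I", "safe")
unsafe_matches = run_nest("J", "unsafe") == run_nest("I", "unsafe")
```

With the first body the two orders give identical arrays; with the second, whose direction vector is (<, >), the interchanged order reads A(2,2) before it has been updated, and the results differ.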

  12. Loop Interchange: Safety • A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.

  13. Loop Interchange: Safety • A dependence is interchange-sensitive if it is carried by the same loop after interchange. That is, an interchange-sensitive dependence moves with its original carrier loop to the new level.

  14. Loop Interchange: Safety • Theorem 5.1 Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i,j). • The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest and every such direction vector is represented by a row.

  15. Loop Interchange: Safety
          DO I = 1, N
             DO J = 1, M
                DO K = 1, L
                   A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
                ENDDO
             ENDDO
          ENDDO
      • The direction matrix for the loop nest is:
          <  <  =
          <  =  >
      • Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
      • Follows from Theorem 5.1 and Theorem 2.3
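Theorem 5.2 translates almost directly into a legality test. The following is a minimal sketch, assuming direction vectors are tuples of the strings "<", "=", and ">" and that a permutation is given as a tuple `order` in which `order[new_position]` is the old 0-based loop index. Both conventions are choices made for this example.

```python
from itertools import permutations


def permute(row, order):
    """Apply a loop permutation to one direction vector (Theorem 5.1)."""
    return tuple(row[old] for old in order)


def is_legal_permutation(direction_matrix, order):
    """Theorem 5.2: no row may have ">" as its leftmost non-"=" entry."""
    for row in direction_matrix:
        for direction in permute(row, order):
            if direction == "<":
                break             # leftmost non-"=" entry is "<": row is fine
            if direction == ">":
                return False      # dependence would be reversed
    return True


# Direction matrix of the I, J, K nest on this slide.
matrix = [("<", "<", "="), ("<", "=", ">")]
loops = ("I", "J", "K")
legal_orders = [
    "".join(loops[i] for i in order)
    for order in permutations(range(3))
    if is_legal_permutation(matrix, order)
]
```

For this nest, any order that places K ahead of I exposes the ">" in the second row, so only the orders IJK, IKJ, and JIK pass the test.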

  16. Loop Interchange: Profitability
      • Profitability depends on architecture
          DO I = 1, N
             DO J = 1, M
                DO K = 1, L
      S            A(I+1,J+1,K) = A(I,J,K) + B
                ENDDO
             ENDDO
          ENDDO
      • For SIMD machines with a large number of functional units (FUs):
          DO I = 1, N
      S      A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
          ENDDO
      • Not suitable for vector register machines

  17. Loop Interchange: Profitability
      • For Vector machines, we want to vectorize loops with stride-one memory access
      • Since Fortran stores in column-major order:
         • useful to vectorize the I-loop
      • Thus, transform to:
          DO J = 1, M
             DO K = 1, L
      S         A(2:N+1,J+1,K) = A(1:N,J,K) + B
             ENDDO
          ENDDO

  18. Loop Interchange: Profitability
      • MIMD machines with vector execution units: want to cut down synchronization costs
      • Hence, shift K-loop to outermost level:
          PARALLEL DO K = 1, L
             DO J = 1, M
                A(2:N+1,J+1,K) = A(1:N,J,K) + B
             ENDDO
          END PARALLEL DO

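  19. Loop Shifting
      • Motivation: Identify loops which can be moved and move them to “optimal” nesting levels
      • Theorem 5.3 In a perfect loop nest, if loops at level i, i+1,...,i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.
      • Proof (sketch): Every dependence that is not carried by a loop outside level i has "=" in the columns for levels i through i+n of its direction vector, because none of those loops carries it. Moving "=" columns to the right leaves the leftmost non-"=" entry of each row unchanged, so by Theorem 5.2 the permutation is legal, and the shifted loops still carry no dependence.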

  20. Loop Shifting
          DO I = 1, N
             DO J = 1, N
                DO K = 1, N
      S            A(I,J) = A(I,J) + B(I,K)*C(K,J)
                ENDDO
             ENDDO
          ENDDO
      • S has true, anti and output dependences on itself, hence codegen will fail as recurrence exists at innermost level
      • Use loop shifting to move K-loop to the outermost level:
          DO K = 1, N
             DO I = 1, N
                DO J = 1, N
      S            A(I,J) = A(I,J) + B(I,K)*C(K,J)
                ENDDO
             ENDDO
          ENDDO

  21. Loop Shifting
          DO K = 1, N
             DO I = 1, N
                DO J = 1, N
      S            A(I,J) = A(I,J) + B(I,K)*C(K,J)
                ENDDO
             ENDDO
          ENDDO
      codegen vectorizes to:
          DO K = 1, N
             FORALL J=1,N
                A(1:N,J) = A(1:N,J) + B(1:N,K)*C(K,J)
             END FORALL
          ENDDO

  22. Loop Shifting
      • Change body of codegen:
          if pi is cyclic then
             if k is the deepest loop in pi then
                try_recurrence_breaking(pi, D, k)
             else begin
                select_loop_and_interchange(pi, D, k);
                generate a level-k DO statement;
                let Di be the dependence graph consisting of all dependence edges in D that are at level k+1 or greater and are internal to pi;
                codegen(pi, k+1, Di);
                generate the level-k ENDDO statement
             end
          end

  23. Loop Shifting
          procedure select_loop_and_interchange(pi, D, k)
             if the outermost carried dependence in pi is at level p>k then
                shift loops at level k,k+1,...,p-1 inside the level-p loop, making it into the level-k loop;
             return;
          end select_loop_and_interchange
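The shifting step can be sketched on a direction matrix. The representation below is an assumption made for illustration: `loops` is a list of loop names, `matrix` holds one direction-vector tuple per dependence still present at level k, and levels are 1-based as on the slides.

```python
def carrier_level(row):
    """1-based level of the loop that carries a dependence, or None if the
    direction vector is all "=" (loop-independent)."""
    for level, direction in enumerate(row, start=1):
        if direction != "=":
            return level
    return None


def select_loop_and_interchange(loops, matrix, k):
    """Move the outermost dependence-carrying loop (level p > k) to level k,
    shifting loops k..p-1 inside it. Returns the new loop order and matrix."""
    carried = [lvl for lvl in map(carrier_level, matrix)
               if lvl is not None and lvl >= k]
    if not carried or min(carried) == k:
        return loops, matrix          # nothing to shift
    p = min(carried)

    def rotate(seq):
        seq = list(seq)
        return seq[:k - 1] + [seq[p - 1]] + seq[k - 1:p - 1] + seq[p:]

    return rotate(loops), [tuple(rotate(row)) for row in matrix]


# The I, J, K nest for A(I,J) = A(I,J) + B(I,K)*C(K,J): only K carries dependences.
new_loops, new_matrix = select_loop_and_interchange(
    ["I", "J", "K"], [("=", "=", "<")], 1)
```

Here the K loop moves to the outermost position, and the I and J loops end up with "=" in every row, so they carry no dependence in their new positions, as Theorem 5.3 predicts.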

  24. Loop Selection
      • Consider:
          DO I = 1, N
             DO J = 1, M
      S         A(I+1,J+1) = A(I,J) + A(I+1,J)
             ENDDO
          ENDDO
      • Direction matrix:
          <  <
          =  <
      • Loop shifting algorithm will fail to uncover vector loops; however, interchanging the loops can lead to:
          DO J = 1, M
             A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J)
          ENDDO
      • Need a more general algorithm

  25. Loop Selection • Loop selection: • Select a loop at nesting level p ≥ k that can be safely moved outward to level k and shift the loops at level k, k+1, …, p-1 inside it

  26. Loop Selection
      • Heuristics for selecting loop level
         • If the level-k loop carries no dependence, then let p be the smallest integer such that the level-p loop carries a dependence. (loop-shifting heuristic.)
         • If the level-k loop carries a dependence, let p be the outermost loop that can be safely shifted outward to position k and that carries a dependence d whose direction vector contains an "=" in every position but the pth. If no such loop exists, let p = k.
      [Figure: direction vectors, with a column labelled "Loop p"]
          =  =  <  >  =  . . .
          =  =  =  <  <  . . .
          =  =  <  =  =  . . .
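The two heuristics can be combined into one selection function. The sketch below is a hedged reading of the slide, not the book's code: it assumes the direction matrix contains only dependences still present at level k, uses 1-based levels, and the helper names are invented for the example.

```python
def is_legal(matrix):
    """No row may have ">" as its leftmost non-"=" entry (Theorem 5.2)."""
    for row in matrix:
        for direction in row:
            if direction == "<":
                break
            if direction == ">":
                return False
    return True


def shift_to(row, k, p):
    """Move column p to position k, shifting columns k..p-1 one place inward."""
    row = list(row)
    return tuple(row[:k - 1] + [row[p - 1]] + row[k - 1:p - 1] + row[p:])


def carrier_level(row, k):
    for level in range(k, len(row) + 1):
        if row[level - 1] != "=":
            return level
    return None


def select_loop(matrix, k):
    """Return the level p of the loop to move to level k."""
    carriers = [c for c in (carrier_level(r, k) for r in matrix) if c is not None]
    if not carriers:
        return k
    if k not in carriers:
        return min(carriers)                       # loop-shifting heuristic
    for p in range(k, len(matrix[0]) + 1):
        # Does loop p carry a dependence that is "=" in every other position?
        must_be_serial = any(
            row[p - 1] != "=" and
            all(d == "=" for i, d in enumerate(row, 1) if i != p)
            for row in matrix)
        if must_be_serial and is_legal([shift_to(r, k, p) for r in matrix]):
            return p
    return k


# Slide 24's nest: A(I+1,J+1) = A(I,J) + A(I+1,J)
p = select_loop([("<", "<"), ("=", "<")], 1)
```

For slide 24's direction matrix the function picks the J loop (p = 2): the dependence (=, <) forces J to run sequentially wherever it sits, so moving it outward costs nothing and leaves the I loop free of carried dependences.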

  27. Scalar Expansion
          DO I = 1, N
      S1     T = A(I)
      S2     A(I) = B(I)
      S3     B(I) = T
          ENDDO
      • Scalar Expansion:
          DO I = 1, N
      S1     T$(I) = A(I)
      S2     A(I) = B(I)
      S3     B(I) = T$(I)
          ENDDO
          T = T$(N)
      • leads to:
      S1  T$(1:N) = A(1:N)
      S2  A(1:N) = B(1:N)
      S3  B(1:N) = T$(1:N)
          T = T$(N)
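A small Python check, with whole-list copies standing in for the vector statements, suggests why the expanded form is equivalent to the original swap loop. The function names are illustrative.

```python
def swap_scalar(A, B):
    """Original loop: one scalar T reused in every iteration."""
    A, B = A[:], B[:]
    T = None
    for i in range(len(A)):
        T = A[i]        # S1
        A[i] = B[i]     # S2
        B[i] = T        # S3
    return A, B, T


def swap_expanded(A, B):
    """After scalar expansion, each statement becomes one vector operation."""
    T_exp = A[:]        # S1: T$(1:N) = A(1:N)
    A_new = B[:]        # S2: A(1:N) = B(1:N)
    B_new = T_exp[:]    # S3: B(1:N) = T$(1:N)
    T = T_exp[-1]       # T = T$(N), the value live after the loop
    return A_new, B_new, T


result_scalar = swap_scalar([1, 2, 3], [4, 5, 6])
result_expanded = swap_expanded([1, 2, 3], [4, 5, 6])
```

The final copy `T = T$(N)` matters only if T is used after the loop; it is kept here so both versions return the same triple.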

  28. Scalar Expansion
      • However, not always profitable. Consider:
          DO I = 1, N
             T = T + A(I) + A(I+1)
             A(I) = T
          ENDDO
      • Scalar expansion gives us:
          T$(0) = T
          DO I = 1, N
      S1     T$(I) = T$(I-1) + A(I) + A(I+1)
      S2     A(I) = T$(I)
          ENDDO
          T = T$(N)

  29. Scalar Expansion: Safety • Scalar expansion is always safe • When is it profitable? • Naïve approach: Expand all scalars, vectorize, shrink all unnecessary expansions. • However, we want to predict when expansion is profitable • Dependences due to reuse of memory location vs. reuse of values • Dependences due to reuse of values must be preserved • Dependences due to reuse of memory location can be deleted by expansion

  30. Scalar Expansion: Covering Definitions
      • A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X.
          DO I = 1, 100
      S1     T = X(I)            (covering)
      S2     Y(I) = T
          ENDDO
          DO I = 1, 100
             IF (A(I) .GT. 0) THEN
      S1        T = X(I)         (covering)
      S2        Y(I) = T
             ENDIF
          ENDDO

  31. Scalar Expansion: Covering Definitions
      • A covering definition does not always exist:
          DO I = 1, 100
             IF (A(I) .GT. 0) THEN
      S1        T = X(I)
             ENDIF
      S2     Y(I) = T
          ENDDO
      • In SSA terms: There does not exist a covering definition for a variable T if the edge out of the first assignment to T goes to a φ-function later in the loop which merges its values with those for another control flow path through the loop

  32. Scalar Expansion: Covering Definitions
      • We will consider a collection of covering definitions
      • There is a collection C of covering definitions for T in a loop if either:
         • There exists no φ-function at the beginning of the loop that merges versions of T from outside the loop with versions defined in the loop, or,
         • The φ-function within the loop has no SSA edge to any φ-function including itself

  33. Scalar Expansion: Covering Definitions
      • Remember the loop which had no covering definition:
          DO I = 1, 100
             IF (A(I) .GT. 0) THEN
      S1        T = X(I)
             ENDIF
      S2     Y(I) = T
          ENDDO
      • To form a collection of covering definitions, we can insert dummy assignments:
          DO I = 1, 100
             IF (A(I) .GT. 0) THEN
      S1        T = X(I)
             ELSE
      S2        T = T
             ENDIF
      S3     Y(I) = T
          ENDDO

  34. Scalar Expansion: Covering Definitions • Algorithm to insert dummy assignments and compute the collection, C, of covering definitions: • Central idea: Look for parallel paths to a φ-function following the first assignment, until no more exist

  35. Scalar Expansion: Covering Definitions
      Detailed algorithm:
      • Let S0 be the φ-function for T at the beginning of the loop, if there is one, and null otherwise. Make C empty and initialize an empty stack.
      • Let S1 be the first definition of T in the loop. Add S1 to C.
      • If the SSA successor of S1 is a φ-function S2 that is not equal to S0, then push S2 onto the stack and mark it.
      • While the stack is non-empty:
         • pop the φ-function S from the stack;
         • add all SSA predecessors that are not φ-functions to C;
         • if there is an SSA edge from S0 into S, then insert the assignment T = T as the last statement along that edge and add it to C;
         • for each unmarked φ-function S3 (other than S0) that is an SSA predecessor of S, mark S3 and push it onto the stack;
         • for each unmarked φ-function S4 that can be reached from S by a single SSA edge and which is not predominated by S in the control flow graph, mark S4 and push it onto the stack.

  36. Scalar Expansion: Covering Definitions
      Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop:
      • Create an array T$ of appropriate length
      • For each S in the covering definition collection C, replace the T on the left-hand side by T$(I).
      • For every other definition of T and every use of T in the loop body reachable by SSA edges that do not pass through S0, the φ-function at the beginning of the loop, replace T by T$(I).
      • For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1).
      • If S0 is not null, then insert T$(0) = T before the loop.
      • If there is an SSA edge from any definition in the loop to a use outside the loop, insert T = T$(U) after the loop, where U is the loop upper bound.

  37. Scalar Expansion: Covering Definitions
      Loop with no covering definition:
          DO I = 1, 100
             IF (A(I) .GT. 0) THEN
      S1        T = X(I)
             ENDIF
      S2     Y(I) = T
          ENDDO
      After inserting covering definitions:
          DO I = 1, 100
             IF (A(I) .GT. 0) THEN
      S1        T = X(I)
             ELSE
      S2        T = T
             ENDIF
      S3     Y(I) = T
          ENDDO
      After scalar expansion:
          T$(0) = T
          DO I = 1, 100
             IF (A(I) .GT. 0) THEN
      S1        T$(I) = X(I)
             ELSE
                T$(I) = T$(I-1)
             ENDIF
      S2     Y(I) = T$(I)
          ENDDO
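The conditional example can be checked the same way. In this sketch the expanded version follows the slide's last loop, with the dummy assignment `T = T` becoming `T$(I) = T$(I-1)`. Returning the final value of T models the optional `T = T$(U)` copy from the expansion steps and is an addition for the sake of comparison.

```python
def conditional_scalar(A, X, T):
    """Original loop: T is redefined only when A(I) > 0."""
    Y = []
    for a, x in zip(A, X):
        if a > 0:
            T = x
        Y.append(T)
    return Y, T


def conditional_expanded(A, X, T):
    """Loop after inserting the dummy definition and expanding T to T$."""
    n = len(A)
    T_exp = [None] * (n + 1)
    T_exp[0] = T                         # T$(0) = T
    Y = [None] * n
    for i in range(1, n + 1):
        if A[i - 1] > 0:
            T_exp[i] = X[i - 1]          # S1: T$(I) = X(I)
        else:
            T_exp[i] = T_exp[i - 1]      # expanded dummy assignment
        Y[i - 1] = T_exp[i]              # S2: Y(I) = T$(I)
    return Y, T_exp[n]


args = ([-1, 2, -3, 4], [10, 20, 30, 40], 7)
scalar_result = conditional_scalar(*args)
expanded_result = conditional_expanded(*args)
```

Note that the expanded loop still contains a genuine recurrence through `T$(I-1)` on the ELSE path: expansion removes the dependences due to reuse of the memory location, not the one due to reuse of the value.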

  38. Deletable Dependences
      • If uses of T before covering definitions are expanded as T$(I - 1) and all other uses are expanded as T$(I), then the deletable dependences are:
         • Backward carried antidependences
         • Backward carried output dependences
         • Forward carried output dependences
         • Loop-independent antidependences into the covering definition
         • Loop-carried true dependences from a covering definition

  39. Scalar Expansion
          procedure try_recurrence_breaking(pi, D, k)
             if k is the deepest loop in pi then begin
                remove deletable edges in pi;
                find the set {SC1, SC2, ..., SCn} of maximal strongly-connected regions in D restricted to pi;
                if there are vector statements among SCi then
                   expand scalars indicated by deletable edges;
                codegen(pi, k, D restricted to pi);
             end
          end try_recurrence_breaking

  40. Scalar Expansion: Drawbacks
      • Expansion increases memory requirements
      • Solutions:
         • Expand in a single loop
         • Strip mine loop before expansion
         • Forward substitution:
          DO I = 1, N
             T = A(I) + A(I+1)
             A(I) = T + B(I)
          ENDDO
        becomes
          DO I = 1, N
             A(I) = A(I) + A(I+1) + B(I)
          ENDDO

  41. Scalar Renaming
          DO I = 1, 100
      S1     T = A(I) + B(I)
      S2     C(I) = T + T
      S3     T = D(I) - B(I)
      S4     A(I+1) = T * T
          ENDDO
      • Renaming scalar T:
          DO I = 1, 100
      S1     T1 = A(I) + B(I)
      S2     C(I) = T1 + T1
      S3     T2 = D(I) - B(I)
      S4     A(I+1) = T2 * T2
          ENDDO

  42. Scalar Renaming
      • will lead to:
      S3  T2$(1:100) = D(1:100) - B(1:100)
      S4  A(2:101) = T2$(1:100) * T2$(1:100)
      S1  T1$(1:100) = A(1:100) + B(1:100)
      S2  C(1:100) = T1$(1:100) + T1$(1:100)
          T = T2$(100)

  43. Scalar Renaming
      • Renaming algorithm partitions all definitions and uses into equivalence classes, each of which can occupy different memory locations:
      • Use the definition-use graph to:
         • Pick a definition
         • Add all uses that the definition reaches to the equivalence class
         • Add all definitions that reach any of the uses…
         • …until a fixed point is reached
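The partitioning steps above can be sketched as a fixed-point iteration over a definition-use graph. The input format is an assumption made for this example: `reaches` maps each definition (named by its statement label) to the set of uses it reaches.

```python
def partition_for_renaming(reaches):
    """Group definitions and uses into classes that can each get its own name."""
    unassigned = set(reaches)
    classes = []
    while unassigned:
        defs = {min(unassigned)}          # pick a definition
        uses = set()
        while True:
            # Add all uses reached by the definitions collected so far...
            new_uses = set().union(*(reaches[d] for d in defs))
            # ...and all definitions that reach any of those uses.
            new_defs = {d for d in unassigned
                        if d in defs or reaches[d] & new_uses}
            if new_uses == uses and new_defs == defs:
                break                     # fixed point reached
            uses, defs = new_uses, new_defs
        classes.append((sorted(defs), sorted(uses)))
        unassigned -= defs
    return classes


# Slide 41: S1 defines the T used by S2, and S3 defines the T used by S4.
slide_41 = partition_for_renaming({"S1": {"S2"}, "S3": {"S4"}})

# A hypothetical case where two definitions reach a common use and must share a name.
merged = partition_for_renaming({"D1": {"U1"}, "D2": {"U1", "U2"}, "D3": {"U3"}})
```

For slide 41 this yields two classes, matching the names T1 and T2; in the second call, D1 and D2 land in one class because both reach U1.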

  44. Scalar Renaming: Profitability • Scalar renaming will break recurrences in which a loop-independent output dependence or antidependence is a critical element of a cycle • Relatively cheap to use scalar renaming • Usually done by compilers when calculating live ranges for register allocation

  45. Array Renaming
          DO I = 1, N
      S1     A(I) = A(I-1) + X
      S2     Y(I) = A(I) + Z
      S3     A(I) = B(I) + C
          ENDDO
      • Dependences: S1 δ S2 (true), S2 δ^-1 S3 (anti), S3 δ_1 S1 (true, carried at level 1), S1 δ^o S3 (output)
      • Rename A(I) to A$(I):
          DO I = 1, N
      S1     A$(I) = A(I-1) + X
      S2     Y(I) = A$(I) + Z
      S3     A(I) = B(I) + C
          ENDDO
      • Dependences remaining: S1 δ S2 and S3 δ_1 S1
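With only S1 δ S2 and S3 δ_1 S1 left, the statements can be ordered S3, S1, S2 and each run as a vector operation. The following sketch compares that ordering against the original loop. It assumes 1-based data stored in Python lists whose index 0 holds A(0) (for A) or unused padding (for B and Y), and the function names are illustrative.

```python
def original_loop(A, B, X, Z, C):
    """Original loop; A[0] is A(0), B[0] and Y[0] are unused padding."""
    A = A[:]
    n = len(A) - 1
    Y = [None] * (n + 1)
    for i in range(1, n + 1):
        A[i] = A[i - 1] + X      # S1
        Y[i] = A[i] + Z          # S2
        A[i] = B[i] + C          # S3
    return A, Y


def renamed_vectorized(A, B, X, Z, C):
    """Renamed loop, statements reordered S3, S1, S2 and vectorized."""
    n = len(A) - 1
    # S3: A(1:N) = B(1:N) + C
    A_new = [A[0]] + [B[i] + C for i in range(1, n + 1)]
    # S1: A$(1:N) = A(0:N-1) + X, reading the values S3 just produced
    A_ren = [None] + [A_new[i - 1] + X for i in range(1, n + 1)]
    # S2: Y(1:N) = A$(1:N) + Z
    Y = [None] + [A_ren[i] + Z for i in range(1, n + 1)]
    return A_new, Y


inputs = ([5, 1, 2, 3], [0, 10, 20, 30], 1, 2, 3)
loop_result = original_loop(*inputs)
vector_result = renamed_vectorized(*inputs)
```

Running S3 first is valid because S1 at iteration I only needs the value S3 wrote in iteration I-1, and A(0) is never overwritten.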

  46. Array Renaming: Profitability • Examining dependence graph and determining minimum set of critical edges to break a recurrence is NP-complete! • Solution: determine edges that are removed by array renaming and analyze effects on dependence graph • procedure array_partition: • Assumes no control flow in loop body • identifies collections of references to arrays which refer to the same value • identifies deletable output dependences and antidependences • Use this procedure to generate code • Minimize amount of copying back to the “original” array at the beginning and the end
