Enhancing Fine-Grained Parallelism Chapter 5 of Allen and Kennedy Mirit & Haim
Overview • Motivation • Loop Interchange • Scalar Expansion • Scalar and Array Renaming
Motivational Example • Matrix Multiplication:
DO J = 1, M
  DO I = 1, N
    T = 0.0
    DO K = 1,L
      T = T + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = T
  ENDDO
ENDDO
• codegen tries to find parallelism using the transformations of loop distribution and statement reordering.
• Here codegen will not uncover any vector operations, because the temporary scalar T creates dependences carried by each of the loops.
Motivational Example I Replacing the scalar T with the vector T$ eliminates most of the dependences:
DO J = 1, M
  DO I = 1, N
    T$(I) = 0.0
    DO K = 1,L
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = T$(I)
  ENDDO
ENDDO
Motivational Example II Distribution of the I-loop gives us:
DO J = 1, M
  DO I = 1, N
    T$(I) = 0.0
  ENDDO
  DO I = 1, N
    DO K = 1,L
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
  ENDDO
  DO I = 1, N
    C(I,J) = T$(I)
  ENDDO
ENDDO
The first and third I-loops vectorize directly to T$(1:N) = 0.0 and C(1:N,J) = T$(1:N).
Motivational Example III Interchanging the I and K loops, we get:
DO J = 1, M
  T$(1:N) = 0.0
  DO K = 1,L
    DO I = 1,N
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
  ENDDO
  C(1:N,J) = T$(1:N)
ENDDO
Motivational Example IV Finally, vectorizing the I-loop:
DO J = 1, M
  T$(1:N) = 0.0
  DO K = 1,L
    T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
  ENDDO
  C(1:N,J) = T$(1:N)
ENDDO
Motivational Example - Summary We used two new transformations:
• Loop interchange (interchanging the I and K loops)
• Scalar expansion (T => T$)
Loop Interchange Loop interchange - switching the nesting order of two loops:
DO I = 1, N
  DO J = 1, M
S   A(I,J+1) = A(I,J) + B
  ENDDO
ENDDO
This nest is impossible to vectorize as written, since the inner J-loop carries a dependence from S to itself. Applying loop interchange:
DO J = 1, M
  DO I = 1, N
S   A(I,J+1) = A(I,J) + B
  ENDDO
ENDDO
leads to:
DO J = 1, M
S A(1:N,J+1) = A(1:N,J) + B
ENDDO
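The effect of this interchange can be checked empirically. The following is a minimal Python sketch (ours, not from the chapter) that runs the statement A(I,J+1) = A(I,J) + B in both loop orders over the same initial array and confirms the results agree, as Theorem 5.2 predicts for the direction vector (=,<):

```python
# Run the loop nest A(I,J+1) = A(I,J) + B in both nesting orders and
# compare results. Indices are 1-based to mirror the Fortran code;
# the array sizes and initial values are arbitrary choices of ours.
N, M, B = 4, 5, 2.0

def run(order):
    # A is (N+1) x (M+2) so the J+1 accesses stay in bounds.
    A = [[float(i + j) for j in range(M + 2)] for i in range(N + 1)]
    if order == "IJ":               # original order: I outer, J inner
        for i in range(1, N + 1):
            for j in range(1, M + 1):
                A[i][j + 1] = A[i][j] + B
    else:                           # interchanged: J outer, I inner
        for j in range(1, M + 1):
            for i in range(1, N + 1):
                A[i][j + 1] = A[i][j] + B
    return A

assert run("IJ") == run("JI")       # interchange preserves the result
```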
Loop Interchange: Safety
• Safety: not all loop interchanges are safe.
• S(I,J) - the instance of statement S with the parameters (iteration numbers) I and J.
DO J = 1, M
  DO I = 1, N
S:  A(I,J+1) = A(I+1,J)
  ENDDO
ENDDO
A(2,2) is assigned at S(2,1) and used at S(1,2), so interchanging the loops would execute the use before the assignment.
[Iteration-space diagram: instances S(I,J) for I = 1..4, J = 1..4.]
Loop Interchange: Safety • A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.
Distance Vector - Reminder
• Consider a dependence in a loop nest of n loops:
• Statement S1 on iteration i is the source of the dependence
• Statement S2 on iteration j is the sink of the dependence
• The distance vector d(i,j) is a vector of length n such that: d(i,j)_k = j_k - i_k
Direction Vector - Reminder Suppose that there is a dependence from statement S1 on iteration i of a loop nest of n loops and statement S2 on iteration j. Then the dependence direction vector D(i,j) is a vector of length n such that:
• D(i,j)_k = "<" if d(i,j)_k > 0
• D(i,j)_k = "=" if d(i,j)_k = 0
• D(i,j)_k = ">" if d(i,j)_k < 0
Direction Vector - Reminder
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
    ENDDO
  ENDDO
ENDDO
• (I+1,J+1,K) - (I,J,K) = (1,1,0) => (<,<,=)
• (I+1,J+1,K) - (I,J+1,K+1) = (1,0,-1) => (<,=,>)
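The two reminders above can be captured in a few lines of Python. This sketch (our own helper names, not from the book) computes distance vectors from source and sink iteration numbers and converts them to direction vectors, reproducing the two vectors of the example:

```python
# Distance vector: elementwise sink minus source iteration numbers.
def distance(source_iter, sink_iter):
    return tuple(j - i for i, j in zip(source_iter, sink_iter))

# Direction vector: sign of each distance component, per the definition above.
def direction(dist):
    return tuple("<" if d > 0 else "=" if d == 0 else ">" for d in dist)

# The write A(I+1,J+1,K) at iteration (1,1,1) reaches the read A(I,J,K)
# at iteration (2,2,1):
d1 = distance((1, 1, 1), (2, 2, 1))
# The same write reaches the read A(I,J+1,K+1) at iteration (2,1,1):
d2 = distance((1, 1, 2), (2, 1, 1))

assert d1 == (1, 1, 0) and direction(d1) == ("<", "<", "=")
assert d2 == (1, 0, -1) and direction(d2) == ("<", "=", ">")
```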
Direction Vector - Reminder
• We say that the I'th loop carries a dependence with direction vector d if the leftmost non-"=" entry of d is in the I'th position. In both (<,<,=) and (<,=,>) the first loop carries the dependence.
• It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.
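The "leftmost non-'=' " rule is a one-line scan; a small sketch (ours):

```python
# Level (1-based) of the loop carrying a dependence with direction vector dv:
# the position of the leftmost non-'=' entry; None means loop-independent.
def carried_level(dv):
    for level, d in enumerate(dv, start=1):
        if d != "=":
            return level
    return None

assert carried_level(("<", "<", "=")) == 1   # carried by the outermost loop
assert carried_level(("<", "=", ">")) == 1
assert carried_level(("=", "<", ">")) == 2   # carried by the second loop
assert carried_level(("=", "=", "=")) is None
```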
Loop Interchange: Safety
• Theorem 5.1 Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops is determined by applying the same permutation to the elements of D(i,j).
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
    ENDDO
  ENDDO
ENDDO
• (I+1,J+1,K) - (I,J,K) = (1,1,0) => (<,<,=); after interchanging the J and K loops the direction vector becomes (<,=,<).
Loop Interchange: Safety
• The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest, and every such direction vector is represented by a row.
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
    ENDDO
  ENDDO
ENDDO
• (1,1,0) => (<,<,=)
• (1,0,-1) => (<,=,>)
Direction matrix (columns I, J, K):
  < < =
  < = >
Loop Interchange: Safety
• Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
Permuting the columns of the example's direction matrix:
  I J K      J I K      J K I
  < < =      < < =      < = <
  < = >      = < >      = > <
The orders (I,J,K) and (J,I,K) are legal; (J,K,I) is not, since its second row has ">" as the leftmost non-"=" entry.
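The Theorem 5.2 test is mechanical, so it is easy to sketch in Python (our helper, not the book's code): permute the columns of each row and reject the permutation if any row's leftmost non-"=" entry is ">".

```python
# Legality check for a loop permutation per Theorem 5.2.
# direction_matrix: list of direction-vector tuples; perm: column order,
# e.g. (1, 0, 2) means the new outermost loop is original loop 1.
def legal(direction_matrix, perm):
    for row in direction_matrix:
        for d in (row[p] for p in perm):
            if d == "=":
                continue                 # keep scanning left to right
            if d == ">":
                return False             # leftmost non-'=' is '>': illegal
            break                        # leftmost non-'=' is '<': row is fine
    return True

M = [("<", "<", "="), ("<", "=", ">")]   # the example, columns (I, J, K)
assert legal(M, (0, 1, 2))               # (I, J, K): original order, legal
assert legal(M, (1, 0, 2))               # (J, I, K): legal
assert not legal(M, (1, 2, 0))           # (J, K, I): row 2 becomes (=, >, <)
```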
Loop Interchange: Profitability In general, our goal is to enhance vectorization by increasing the number of innermost loops that do not carry any dependence:
DO I = 1, N
  DO J = 1, M
S   A(I,J+1) = A(I,J) + A(I-1,J) + B
  ENDDO
ENDDO
Direction matrix (columns I, J):
  = <
  < <
Applying loop interchange (columns J, I):
  < =
  < <
DO J = 1, M
  DO I = 1, N
S   A(I,J+1) = A(I,J) + A(I-1,J) + B
  ENDDO
ENDDO
leads to:
DO J = 1, M
S A(1:N,J+1) = A(1:N,J) + A(0:N-1,J) + B
ENDDO
Loop Interchange: Profitability
• Profitability depends on architecture.
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
S     A(I+1,J+1,K) = A(I,J,K) + B
    ENDDO
  ENDDO
ENDDO
Direction matrix (columns I, J, K): < < =
• For SIMD machines with a large number of functional units:
DO I = 1, N
S A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
ENDDO
Loop Interchange: Profitability
• Since Fortran stores arrays in column-major order, we would like to vectorize the I-loop.
• Thus, after interchanging the I-loop to be the inner loop (columns J, K, I: < = <), we can transform to:
DO J = 1, M
  DO K = 1, L
S   A(2:N+1,J+1,K) = A(1:N,J,K) + B
  ENDDO
ENDDO
Loop Interchange: Profitability
• MIMD machines with vector execution units want to cut down synchronization costs; therefore, shifting the K-loop to the outermost level (columns K, J, I: = < <) gives:
PARALLEL DO K = 1, L
  DO J = 1, M
    A(2:N+1,J+1,K) = A(1:N,J,K) + B
  ENDDO
END PARALLEL DO
Loop Shifting
• Motivation: identify loops which can be moved, and move them to "optimal" nesting levels.
• Trying all the permutations is not practical.
• Theorem 5.3 In a perfect loop nest, if loops at level i, i+1, ..., i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.
[Slide shows an example direction matrix illustrating the theorem.]
Loop Shifting
• Heuristic - move the loops that carry no dependence to the innermost position:
DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      A(I,J,K) = A(I,J,K+1)
    ENDDO
  ENDDO
ENDDO
Direction matrix (columns I, J, K): = = <
Shifting K outermost (columns K, I, J: < = =):
DO K = 1, N
  DO I = 1, N
    DO J = 1, N
      A(I,J,K) = A(I,J,K+1)
    ENDDO
  ENDDO
ENDDO
which vectorizes to A(1:N,1:N,K) = A(1:N,1:N,K+1) inside the K-loop.
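A conservative version of this heuristic can be sketched in Python (our code, not the book's): a loop whose column in the direction matrix is all "=" carries no dependence in any position, so by Theorem 5.3 it can safely be shifted innermost.

```python
# Conservative shifting heuristic: loops whose direction-matrix column is
# all '=' are dependence-free and are moved innermost; the rest keep their
# relative order outermost. Returns the new loop order (original indices,
# outermost first). This all-'=' test is sufficient but not necessary.
def shift_free_loops_inward(direction_matrix, nloops):
    free = [k for k in range(nloops)
            if all(row[k] == "=" for row in direction_matrix)]
    busy = [k for k in range(nloops) if k not in free]
    return busy + free

M = [("=", "=", "<")]                    # the example: only K carries a dependence
assert shift_free_loops_inward(M, 3) == [2, 0, 1]   # new order (K, I, J)
```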
Loop Shifting
• The heuristic fails on:
DO I = 1, N
  DO J = 1, M
    A(I+1,J+1) = A(I,J) + A(I+1,J)
  ENDDO
ENDDO
Direction matrix (columns I, J):
  < <
  = <
Both loops carry a dependence, so the heuristic moves nothing. But after interchange (columns J, I):
  < <
  < =
we still want to get:
DO J = 1, M
  A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J)
ENDDO
Loop Shifting
• A recursive heuristic:
• If the outermost loop carries no dependence, make the outermost loop that carries a dependence the outermost loop (the former heuristic).
• If the outermost loop carries a dependence, find the outermost loop L such that:
  • L can be safely shifted to the outermost position, and
  • L carries a dependence d whose direction vector contains an "=" in every position but the Lth,
and move L to the outermost position.
Loop Shifting
[Slide shows a worked example of the recursive heuristic on a larger direction matrix.]
Scalar Expansion
DO I = 1, N
  T = A(I)
  A(I) = B(I)
  B(I) = T
ENDDO
• Scalar expansion:
DO I = 1, N
  T$(I) = A(I)
  A(I) = B(I)
  B(I) = T$(I)
ENDDO
T = T$(N)
• leads to:
T$(1:N) = A(1:N)
A(1:N) = B(1:N)
B(1:N) = T$(1:N)
T = T$(N)
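To see that the expansion preserves the loop's semantics, including the final value of T restored by T = T$(N), here is a small Python simulation of both versions (ours; the data values are arbitrary):

```python
# Simulate the swap loop before and after scalar expansion and check that
# the arrays and the final value of T agree. 0-based indexing stands in
# for the Fortran 1-based loop.
N = 6

def original(A, B):
    A, B = A[:], B[:]
    T = None
    for i in range(N):
        T = A[i]
        A[i] = B[i]
        B[i] = T
    return A, B, T

def expanded(A, B):
    A, B = A[:], B[:]
    T_ = [0.0] * N            # the expansion array T$
    for i in range(N):
        T_[i] = A[i]
        A[i] = B[i]
        B[i] = T_[i]
    return A, B, T_[N - 1]    # T = T$(N) after the loop

A0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B0 = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
assert original(A0, B0) == expanded(A0, B0)
```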
Scalar Expansion
• However, it is not always profitable. Consider:
DO I = 1, N
  T = T + A(I) + A(I-1)
  A(I) = T
ENDDO
• Scalar expansion gives us:
T$(0) = T
DO I = 1, N
  T$(I) = T$(I-1) + A(I) + A(I-1)
  A(I) = T$(I)
ENDDO
T = T$(N)
The loop-carried true dependence on T$(I-1) remains, so the loop still cannot be vectorized.
Scalar Expansion – How to
• Naïve approach: expand all scalars, vectorize, shrink all unnecessary expansions.
• Dependences due to reuse of a memory location vs. reuse of values:
  • Dependences due to reuse of values must be preserved
  • Dependences due to reuse of a memory location can be deleted by expansion
• We will try to find the removable dependences.
Reminder - SSA
[Slide shows an SSA def-use graph connecting three definitions of X to three uses of X.]
Covering Definition
• A definition X of a scalar S is a covering definition for loop L if a definition of S placed at the beginning of L reaches no uses of S that occur past X.
DO I = 1, 100
S1  T = X(I)
S2  Y(I) = T
ENDDO
DO I = 1, 100
  IF (A(I) .GT. 0) THEN
S1    T = X(I)
S2    Y(I) = T
  ENDIF
ENDDO
In both loops, S1 is a covering definition of T.
Covering Definition
• A covering definition does not always exist:
DO I = 1, 100
  IF (A(I) .GT. 0) THEN
S2    T = X(I)
  ENDIF
S3  Y(I) = T
ENDDO
• In SSA terms: there is no covering definition for a variable T if the edge out of the first assignment to T goes to a φ-function later in the loop which merges its values with those for another control-flow path through the loop.
Covering Definition
• Expanded definition of covering definition: a collection C of covering definitions for T in a loop in which every control path has a covering definition.
DO I = 1, 100
  IF (A(I) .GT. 0) THEN
S1    T = X(I)
  ELSE
S2    T = -X(I)
  ENDIF
S3  Y(I) = T
ENDDO
Find Covering Definitions
DO I = 1, 100
  IF (A(I) .GT. 0) THEN
S1    T = X(I)
  ENDIF
S2  Y(I) = T
ENDDO
• To form a collection of covering definitions, we can insert dummy assignments:
DO I = 1, 100
  IF (A(I) .GT. 0) THEN
S1    T = X(I)
  ELSE
S2    T = T
  ENDIF
S3  Y(I) = T
ENDDO
Find Covering Definitions
• Algorithm to insert dummy assignments and compute the collection, C, of covering definitions:
• Central idea: look for parallel paths to a φ-function following the first assignment, and add dummy assignments.
DO I = 1, 100
  IF (A(I) .GT. 0) THEN
    T = X(I)
  ELSE
    T = T
  ENDIF
  Y(I) = T
ENDDO
Detailed algorithm:
1. Let S0 be the φ-function for T at the beginning of the loop, if there is one, and null otherwise. Make C empty and initialize an empty stack.
2. Let S1 be the first definition of T in the loop. Add S1 to C.
3. If the SSA successor of S1 is a φ-function S2 that is not equal to S0, then push S2 onto the stack and mark it.
4. While the stack is non-empty:
   • pop the φ-function S from the stack;
   • add all SSA predecessors of S that are not φ-functions to C;
   • if there is an SSA edge from S0 into S, then insert the assignment T = T as the last statement along that edge and add it to C;
   • for each unmarked φ-function S3 (other than S0) that is an SSA predecessor of S, mark S3 and push it onto the stack;
   • for each unmarked φ-function S4 that can be reached from S by a single SSA edge and which is not dominated by S in the control flow graph, mark S4 and push it onto the stack.
Scalar Expansion Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop:
1. Create an array T$ of appropriate length.
2. For each S in the covering definition collection C, replace the T on the left-hand side by T$(I).
3. For every other definition of T and every use of T in the loop body reachable by SSA edges that do not pass through S0, the φ-function at the beginning of the loop, replace T by T$(I).
4. For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1).
5. If S0 is not null, then insert T$(0) = T before the loop.
6. If there is an SSA edge from any definition in the loop to a use outside the loop, insert T = T$(U) after the loop, where U is the loop upper bound.
T$(0) = T
DO I = 1, 100
  IF (A(I) .GT. 0) THEN
S1    T$(I) = X(I)
  ELSE
      T$(I) = T$(I-1)
  ENDIF
S2  Y(I) = T$(I)
ENDDO
Deletable Dependences
• Given the covering definitions, we can identify the deletable dependences:
• Backward carried antidependences
  • Writing to T at iteration i after reading from it at iteration i-1
• Backward carried output dependences
  • Writing to T at iteration i after writing to it at iteration i-1
• Forward carried output dependences
  • Writing to T at iteration i and again at iteration i+1
• Loop-independent antidependences into the covering definition
  • Writing to T after reading from it earlier in the same iteration
• Loop-carried true dependences from a covering definition
  • Reading from T at iteration i a value written at iteration i-1 (the use becomes T$(I-1) after expansion)
Scalar Expansion: Drawbacks
• Expansion increases memory requirements.
• Solutions:
  • Expand in a single loop only
  • Strip mine the loop before expansion
  • Forward substitution
Original loop:
DO I = 1, N
  T = A(I) + A(I+1)
  A(I) = T + B(I)
ENDDO
Strip mined:
DO I = 1, N, 64
  DO J = 0, 63
    T = A(I+J) + A(I+J+1)
    A(I+J) = T + B(I+J)
  ENDDO
ENDDO
Forward substitution:
DO I = 1, N
  A(I) = A(I) + A(I+1) + B(I)
ENDDO
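Strip mining only re-blocks the iteration order, so the result is unchanged; a Python sketch (ours) checks this, using a strip size of 4 instead of 64 and an N that is a multiple of the strip size for simplicity:

```python
# Compare the original loop  T = A(I)+A(I+1); A(I) = T+B(I)  with its
# strip-mined form. 1-based indices mirror the Fortran; the strip size S
# and the data are our own illustrative choices.
N, S = 16, 4
A0 = [float(i) for i in range(N + 2)]   # extra element for the A(I+1) read
B0 = [1.0] * (N + 2)

def original(A, B):
    A = A[:]
    for i in range(1, N + 1):
        T = A[i] + A[i + 1]
        A[i] = T + B[i]
    return A

def strip_mined(A, B):
    A = A[:]
    for i in range(1, N + 1, S):        # strip loop
        for j in range(S):              # iterations within one strip
            T = A[i + j] + A[i + j + 1]
            A[i + j] = T + B[i + j]
    return A

assert original(A0, B0) == strip_mined(A0, B0)
```

After strip mining, expanding T needs only an array of length 64 (the strip size) instead of N.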
Scalar Renaming
DO I = 1, 100
S1  T = A(I) + B(I)
S2  C(I) = T + T
S3  T = D(I) - B(I)
S4  A(I+1) = T * T
ENDDO
Renaming the two uses of T:
DO I = 1, 100
S1  T1 = A(I) + B(I)
S2  C(I) = T1 + T1
S3  T2 = D(I) - B(I)
S4  A(I+1) = T2 * T2
ENDDO
After expansion and statement reordering, this vectorizes to:
S3 T2$(1:100) = D(1:100) - B(1:100)
S4 A(2:101) = T2$(1:100) * T2$(1:100)
S1 T1$(1:100) = A(1:100) + B(1:100)
S2 C(1:100) = T1$(1:100) + T1$(1:100)
   T = T2$(100)
Scalar Renaming
• The renaming algorithm partitions all definitions and uses into equivalence classes using the definition-use graph:
  • Pick a definition
  • Add all uses that the definition reaches to the equivalence class
  • Add all definitions that reach any of those uses...
  • ...until a fixed point is reached
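The fixed-point step is a closure over definition-use edges, which a short worklist sketch (ours) makes concrete; the graph below is the scalar-renaming example, where S1's definition of T reaches only S2 and S3's reaches only S4:

```python
# Partition definitions and uses into equivalence classes by closing over
# the definition-use graph in both directions.
# def_use: {definition: set of uses it reaches}
def partition(def_use):
    use_def = {}                       # reverse edges: use -> definitions
    for d, uses in def_use.items():
        for u in uses:
            use_def.setdefault(u, set()).add(d)
    classes, seen = [], set()
    for d in def_use:
        if d in seen:
            continue
        cls, work = set(), [d]
        while work:                    # worklist closure to a fixed point
            n = work.pop()
            if n in cls:
                continue
            cls.add(n)
            work += def_use.get(n, set()) | use_def.get(n, set())
        classes.append(cls)
        seen |= cls
    return classes

g = {"S1": {"S2"}, "S3": {"S4"}}
# Two classes: {S1, S2} (becomes T1) and {S3, S4} (becomes T2).
assert sorted(sorted(c) for c in partition(g)) == [["S1", "S2"], ["S3", "S4"]]
```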
Scalar Renaming: Profitability
• Scalar renaming will remove a loop-independent output dependence or antidependence
  • (writing to the same scalar at the same iteration, or writing to the scalar after reading from it)
• Scalar renaming is relatively cheap to apply.
• It is usually done by compilers when calculating live ranges for register allocation.
Array Renaming
DO I = 1, N
S1  A(I) = A(I-1) + X
S2  Y(I) = A(I) + Z
S3  A(I) = B(I) + C
ENDDO
• Rename A(I) to A$(I) in S1 and S2:
DO I = 1, N
S1  A$(I) = A(I-1) + X
S2  Y(I) = A$(I) + Z
S3  A(I) = B(I) + C
ENDDO
After distribution and reordering, this vectorizes to:
A(1:N)  = B(1:N) + C
A$(1:N) = A(0:N-1) + X
Y(1:N)  = A$(1:N) + Z
Array Renaming: Profitability
• Determine the edges that are removed by array renaming and analyze their effect on the dependence graph.
• Procedure array_partition:
  • assumes no control flow in the loop body
  • identifies collections of references to arrays which refer to the same value
  • identifies deletable output dependences and antidependences
• Use this procedure to generate code.
• Minimize the amount of copying back to the "original" array at the beginning and the end.