510 likes | 523 Views
Learn about loop mangling techniques including vectorization, instruction-level parallelism, and formalism based on dependence analysis for optimizing program performance. Understand safe transformations, reordering techniques, and theorem of dependence. Explore loop distribution, fine-grained parallelism, and optimization strategies like loop interchange and scalar expansion. Discover array renaming and advanced transformations for parallel program execution.
E N D
Loop Mangling With both thanks and apologies to Allen and Kennedy
Preliminaries • Want to parallelize programs, so • Concentrate on loops • Types of parallelism • Vectorization • Instruction-level parallelism • Courser grain parallelism • Shared memory • Distributed memory • Message passing • Formalism based upon dependence analysis
Transformations • We call a transformation safe if the transformed program has the same "meaning" as the original program • But, what is the "meaning" of a program? For our purposes: • Two computations are equivalent if, on the same inputs: • They produce the same outputs in the same order
Reordering Transformations • A reordering transformation is any program transformation that merely changes the order of execution of the code, without adding or deleting any executions of any statements
Properties of Reordering Transformations • A reordering transformation does not eliminate dependences • However, it can change the ordering of the dependence which will lead to incorrect behavior • A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence.
Fundamental Theorem of Dependence • Fundamental Theorem of Dependence: • Any reordering transformation that preserves every dependence in a program preserves the meaning of that program • Proof by contradiction. Theorem 2.2 in the Allen and Kennedy text
Fundamental Theorem of Dependence • A transformation is said to be valid for the program to which it applies if it preserves all dependences in the program.
Parallelization and Vectorization • It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence. • Want to convert loops like: DO I=1,N X(I) = X(I) + C ENDDO • toX(1:N) = X(1:N) + C(Fortran 77 to Fortran 90) • However: DO I=1,N X(I+1) = X(I) + C ENDDO is not equivalent to X(2:N+1) = X(1:N) + C
Loop Distribution • Can statements in loops which carry dependences be vectorized? D0 I = 1, N S1 A(I+1) = B(I) + C S2 D(I) = A(I) + E ENDDO • Dependence: S1 1S2can be converted to: S1 A(2:N+1) = B(1:N) + C S2 D(1:N) = A(1:N) + E
Loop Distribution DO I = 1, N S1 A(I+1) = B(I) + C S2 D(I) = A(I) + E ENDDO • transformed to: • DO I = 1, N • S1 A(I+1) = B(I) + C • ENDDO • DO I = 1, N • S2 D(I) = A(I) + E • ENDDO • leads to: • S1 A(2:N+1) = B(1:N) + C • S2 D(1:N) = A(1:N) + E
Loop Distribution • Loop distribution fails if there is a cycle of dependences DO I = 1, N S1 A(I+1) = B(I) + C S2 B(I+1) = A(I) + E ENDDO S11 S2 and S21 S1 • What about: DO I = 1, N S1 B(I) = A(I) + E S2 A(I+1) = B(I) + C ENDDO
Fine-Grained Parallelism Techniques to enhance fine-grained parallelism: • Loop Interchange • Scalar Expansion • Scalar Renaming • Array Renaming
Motivational Example DO J = 1, M DO I = 1, N T = 0.0 DO K = 1,L T = T + A(I,K) * B(K,J) ENDDO C(I,J) = T ENDDO ENDDO However, by scalar expansion, we can get: DO J = 1, M DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO ENDDO
Motivational Example DO J = 1, M DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO ENDDO
Motivational Example II • Loop Distribution gives us: DO J = 1, M DO I = 1, N T$(I) = 0.0 ENDDO DO I = 1, N DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO ENDDO DO I = 1, N C(I,J) = T$(I) ENDDO ENDDO
Motivational Example III Finally, interchanging I and K loops, we get: DO J = 1, M T$(1:N) = 0.0 DO K = 1,L T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J) ENDDO C(1:N,J) = T$(1:N) ENDDO • A couple of new transformations used: • Loop interchange • Scalar Expansion
Loop Interchange DO I = 1, N DO J = 1, M S A(I,J+1) = A(I,J) + B •DV: (=, <) ENDDO ENDDO • Applying loop interchange: DO J = 1, M DO I = 1, N S A(I,J+1) = A(I,J) + B •DV: (<, =) ENDDO ENDDO • leads to: DO J = 1, M S A(1:N,J+1) = A(1:N,J) + B ENDDO
Scalar Expansion DO I = 1, N S1 T = A(I) S2 A(I) = B(I) S3 B(I) = T ENDDO • Scalar Expansion: DO I = 1, N S1 T$(I) = A(I) S2 A(I) = B(I) S3 B(I) = T$(I) ENDDO T = T$(N) • leads to: S1 T$(1:N) = A(1:N) S2 A(1:N) = B(1:N) S3 B(1:N) = T$(1:N) T = T$(N)
Scalar Expansion • However, not always profitable. Consider: DO I = 1, N T = T + A(I) + A(I+1) A(I) = T ENDDO • Scalar expansion gives us: T$(0) = T DO I = 1, N S1 T$(I) = T$(I-1) + A(I) + A(I+1) S2 A(I) = T$(I) ENDDO T = T$(N)
Scalar Renaming DO I = 1, 100 S1 T = A(I) + B(I) S2 C(I) = T + T S3 T = D(I) - B(I) S4 A(I+1) = T * T ENDDO • Renaming scalar T: DO I = 1, 100 S1T1 = A(I) + B(I) S2 C(I) = T1 + T1 S3T2 = D(I) - B(I) S4 A(I+1) = T2 * T2 ENDDO
Scalar Renaming • will lead to: S3 T2$(1:100) = D(1:100) - B(1:100) S4 A(2:101) = T2$(1:100) * T2$(1:100) S1 T1$(1:100) = A(1:100) + B(1:100) S2 C(1:100) = T1$(1:100) + T1$(1:100) T = T2$(100)
Array Renaming DO I = 1, N S1 A(I) = A(I-1) + X S2 Y(I) = A(I) + Z S3 A(I) = B(I) + C ENDDO • S1 S2 S2 -1 S3 S3 1 S1 S1 0 S3 • Rename A(I) to A$(I): DO I = 1, N S1 A$(I) = A(I-1) + X S2 Y(I) = A$(I) + Z S3 A(I) = B(I) + C ENDDO • Dependences remaining: S1 S2 andS3 1 S1
Seen So Far... • Uncovering potential vectorization in loops by • Loop Interchange • Scalar Expansion • Scalar and Array Renaming • Safety and Profitability of these transformations
And Now ... • More transformations • Node Splitting • Recognition of Reductions • Index-Set Splitting • Run-time Symbolic Resolution • Unified framework to generate vector code • Left to the implementer (get a copy of Allen and Kennedy book)
Node Splitting • Sometimes Renaming fails DO I = 1, N S1: A(I) = X(I+1) + X(I) S2: X(I+1) = B(I) + 32 ENDDO • Recurrence kept intact by renaming algorithm
DO I = 1, N S1: A(I) = X(I+1) + X(I) S2: X(I+1) = B(I) + 32 ENDDO Break critical antidependence Make copy of node from which antidependence emanates DO I = 1, N S1’:X$(I) = X(I+1) S1: A(I) = X$(I) + X(I) S2: X(I+1) = B(I) + 32 ENDDO Recurrence broken Vectorized to X$(1:N) = X(2:N+1) X(2:N+1) = B(1:N) + 32 A(1:N) = X$(1:N) + X(1:N) Node Splitting
Node Splitting • Determining minimal set of critical antidependences is in NP-C • Perfect job of Node Splitting difficult • Heuristic: • Select antidependences • Delete it to see if acyclic • If acyclic, apply Node Splitting
Recognition of Reductions • Sum Reduction, Min/Max Reduction, Count Reduction • Vector ---> Single Element S = 0.0 DO I = 1, N S = S + A(I) ENDDO • Not directly vectorizable
Assuming commutativity and associativity S = 0.0 DO k = 1, 4 SUM(k) = 0.0 DO I = k, N, 4 SUM(k) = SUM(k) + A(I) ENDDO S = S + SUM(k) ENDDO Distribute k loop S = 0.0 DO k = 1, 4 SUM(k) = 0.0 ENDDO DO k = 1, 4 DO I = k, N, 4 SUM(k) = SUM(k) + A(I) ENDDO ENDDO DO k = 1, 4 S = S + SUM(k) ENDDO Recognition of Reductions
Recognition of Reductions • After Loop Interchange DO I = 1, N, 4 DO k = I, min(I+3,N) SUM(k-I+1) = SUM(k-I+1) + A(I) ENDDO ENDDO • Vectorize DO I = 1, N, 4 SUM(1:4) = SUM(1:4) + A(I:I+3) ENDDO
Recognition of Reductions • Properties of Reductions • Reduce Vector/Array to one element • No use of Intermediate values • Reduction operates on vector and nothing else
Index-set Splitting • Subdivide loop into different iteration ranges to achieve partial parallelization • Threshold Analysis [Strong SIV, Weak Crossing SIV] • Loop Peeling [Weak Zero SIV] • Section Based Splitting [Variation of loop peeling]
Threshold Analysis DO I = 1, 20 A(I+20) = A(I) + B ENDDO Vectorize to.. A(21:40) = A(1:20) + B DO I = 1, 100 A(I+20) = A(I) + B ENDDO Strip mine to.. DO I = 1, 100, 20 DO i = I, I+19 A(i+20) = A(i) + B ENDDO ENDDO Vectorize this
Loop Peeling • Source of dependence is a single iteration DO I = 1, N A(I) = A(I) + A(1) ENDDO Loop peeled to.. A(1) = A(1) + A(1) DO I = 2, N A(I) = A(I) + A(1) ENDDO Vectorize to.. A(1) = A(1) + A(1) A(2:N)= A(2:N) + A(1)
Run-time Symbolic Resolution • “Breaking Conditions” DO I = 1, N A(I+L) = A(I) + B(I) ENDDO Transformed to.. IF(L.LE.0) THEN A(L:N+L) = A(1:N) + B(1:N) ELSE DO I = 1, N A(I+L) = A(I) + B(I) ENDDO ENDIF
Run-time Symbolic Resolution • Identifying minimum number of breaking conditions to break a recurrence is in NP-C • Heuristic: • Identify when a critical dependence can be conditionally eliminated via a breaking condition
Putting It All Together • Good Part • Many transformations imply more choices to exploit parallelism • Bad Part • Choosing the right transformation • How to automate transformation selection process? • Interference between transformations
Putting It All Together • Any algorithm which tries to tie all transformations must • Take a global view of transformed code • Know the architecture of the target machine • Goal of our algorithm • Finding ONE good vector loop [works well for most vector register architectures]
Overview • Improving memory hierarchy performance by compiler transformations • Scalar Replacement • Unroll-and-Jam • Saving memory loads & stores • Make good use of the processor registers
DO I = 1, N DO J = 1, M A(I) = A(I) + B(J) ENDDO ENDDO A(I) can be left in a register throughout the inner loop Coloring based register allocation fails to recognize this DO I = 1, N T = A(I) DO J = 1, M T = T + B(J) ENDDO A(I) = T ENDDO All loads and stores to A in the inner loop have been saved High chance of T being allocated a register by the coloring algorithm Motivating Example
Scalar Replacement • Convert array reference to scalar reference to improve performance of the coloring based allocator • Our approach is to use dependences to achieve these memory hierarchy transformations
DO I = 1, N*2 DO J = 1, M A(I) = A(I) + B(J) ENDDO ENDDO Can we achieve reuse of references to B ? Use transformation called Unroll-and-Jam DO I = 1, N*2, 2 DO J = 1, M A(I) = A(I) + B(J) A(I+1) = A(I+1) + B(J) ENDDO ENDDO Unroll outer loop twice and then fuse the copies of the inner loop Brought two uses of B(J) together Unroll-and-Jam
DO I = 1, N*2, 2 DO J = 1, M A(I) = A(I) + B(J) A(I+1) = A(I+1) + B(J) ENDDO ENDDO Apply scalar replacement on this code DO I = 1, N*2, 2 s0 = A(I) s1 = A(I+1) DO J = 1, M t = B(J) s0 = s0 + t s1 = s1 + t ENDDO A(I) = s0 A(I+1) = s1 ENDDO Half the number of loads as the original program Unroll-and-Jam
Is unroll-and-jam always legal? DO I = 1, N*2 DO J = 1, M A(I+1,J-1) = A(I,J) + B(I,J) ENDDO ENDDO Apply unroll-and-jam DO I = 1, N*2, 2 DO J = 1, M A(I+1,J-1) = A(I,J) + B(I,J) A(I+2,J-1) = A(I+1,J) + B(I+1,J) ENDDO ENDDO This is wrong!!! Legality of Unroll-and-Jam
Legality of Unroll-and-Jam • Direction vector in this example was (<,>) • This makes loop interchange illegal • Unroll-and-Jam is loop interchange followed by unrolling inner loop followed by another loop interchange • But does loop interchange illegal imply unroll-and-jam illegal ? NO
Conditions for legality of unroll-and-jam • Definition: Unroll-and-jam to factor n consists of unrolling the outer loop n-1 times and fusing those copies together. • Theorem: An unroll-and-jam to a factor of n is legal iff there exists no dependence with direction vector (<,>) such that the distance for the outer loop is less than n.
Conclusion • We have learned two memory hierarchy transformations: • scalar replacement • unroll-and-jam • They reduce the number of memory accesses by maximum use of processor registers
Loop Fusion • Consider the following: DO I = 1,N A(I) = C(I) + D(I) ENDDO DO I = 1,N B(I) = C(I) - D(I) ENDDO If we fuse these loops, we can reuse operations in registers: DO I = 1,N A(I) = C(I) + D(I) B(I) = C(I) - D(I) ENDDO
Example DO J = 1,N DO I = 1,M A(I,J) = C(I,J)+D(I,J) ENDDO DO I = 1,M B(I,J) = A(I,J-1)-E(I,J) ENDDO ENDDO Fusion and Interchange DO I = 1,M DO J = 1,N B(I,J) = A(I,J-1)-E(I,J) A(I,J) = C(I,J)+D(I,J) ENDDO ENDDO Scalar Replacement We’ve saved (n-1)m loads DO I = 1,M r = A(I,0) DO J = 1,N B(I,J) = r - E(I,J) r = C(I,J) + D(I,J) A(I,J) = r ENDDO ENDDO