1 / 43

Compiler Improvement of Register Usage

Compiler Improvement of Register Usage. Chapter 8, through Section 8.4. Overview. Improving memory hierarchy performance by compiler transformations Scalar Replacement Unroll-and-Jam Saving memory loads & stores Make good use of the processor registers. DO I = 1, N DO J = 1, M

risa
Download Presentation

Compiler Improvement of Register Usage

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Compiler Improvement of Register Usage Chapter 8, through Section 8.4

  2. Overview • Improving memory hierarchy performance by compiler transformations • Scalar Replacement • Unroll-and-Jam • Saving memory loads & stores • Make good use of the processor registers

  3. DO I = 1, N DO J = 1, M A(I) = A(I) + B(J) ENDDO ENDDO A(I) can be left in a register throughout the inner loop Coloring based register allocation fails to recognize this DO I = 1, N T = A(I) DO J = 1, M T = T + B(J) ENDDO A(I) = T ENDDO All loads and stores to A in the inner loop have been saved High chance of T being allocated a register by the coloring algorithm Motivating Example

  4. Scalar Replacement • Convert array reference to scalar reference to improve performance of the coloring based allocator • Our approach is to use dependences to achieve these memory hierarchy transformations

  5. Dependence and Memory Hierarchy • True or Flow - save loads and cache miss • Anti - save cache miss • Output - save stores • Input - save loads A(I) = ... + B(I) ... = A(I) + k A(I) = ... ... = B(I)

  6. Dependence and Memory Hierarchy • Loop Carried dependences - Consistent dependences most useful for memory management purposes • Consistent dependences - dependences with constant threshold (dependence distance)

  7. Dependence and Memory Hierarchy • Problem of overcounting optimization opportunities. For example S1: A(I) = ... S2: ... = A(I) S3: ... = A(I) • But we can save only two memory references not three • Solution - Prune edges from dependence graph which don’t correspond to savings in memory accesses

  8. In the reduction example DO I = 1, N DO J = 1, M A(I) = A(I) + B(J) ENDDO ENDDO DO I = 1, N T = A(I) DO J = 1, M T = T + B(J) ENDDO A(I) = T ENDDO True dependence - replace the references to A in the inner loop by scalar T Output dependence - store can be moved outside the inner loop Antidependence - load can be moved before the inner loop Using Dependences

  9. Example: Scalar Replacement in case of loop independent dependence DO I = 1, N A(I) = B(I) + C X(I) = A(I)*Q ENDDO DO I = 1, N t = B(I) + C A(I) = t X(I) = t*Q ENDDO One less load for each iteration for reference to A Scalar Replacement

  10. Example: Scalar Replacement in case of loop carried dependence spanning single iteration DO I = 1, N A(I) = B(I-1) B(I) = A(I) + C(I) ENDDO tB = B(0) DO I = 1, N tA = tB A(I) = tA tB = tA + C(I) B(I) = tB ENDDO One less load for each iteration for reference to B which had a loop carried true dependence spanning 1 iteration Also one less load per iteration for reference to A Scalar Replacement

  11. Example: Scalar Replacement in case of loop carried dependence spanning multiple iterations DO I = 1, N A(I) = B(I-1) + B(I+1) ENDDO t1 = B(0) t2 = B(1) DO I = 1, N t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3 ENDDO One less load for each iteration for reference to B which had a loop carried input dependence spanning 2 iterations Invariants maintained were t1=B(I-1);t2=B(I);t3=B(I+1) Scalar Replacement

  12. t1 = B(0) t2 = B(1) DO I = 1, N t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3 ENDDO Unnecessary register-register copies Unroll loop 3 times t1 = B(0) t2 = B(1) mN3 = MOD(N,3) DO I = 1, mN3 t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3 ENDDO DO I = mN3 + 1, N, 3 t3 = B(I+1) A(I) = t1 + t3 t1 = B(I+2) A(I+1) = t2 + t1 t2 = B(I+3) A(I+2) = t3 + t2 ENDDO Eliminate Scalar Copies Preloop Main Loop

  13. DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO Dependence pattern before pruning Not all edges suggest memory access savings Pruning the dependence graph

  14. DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO Dependence pattern before pruning Not all edges suggest memory access savings DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO Dashed edges are pruned Red-colored array references are generators Each reference has at most one predecessor in the pruned graph Pruning the dependence graph

  15. DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO Apply scalar replacement after pruning the dependence graph t0A = A(0); t1A0 = A(1); tB1 = B(0) DO I = 1, N t1A1 = t0A + tB1 tB3 = B(I+1) t0A = t1A0 + tB3 + tB2 A(I) = t0A t1A0 = t1A1 tB1 = tB2 tB2 = tB3 ENDDO A(N+1) = t1A1 Only one load and one store per iteration Pruning the dependence graph

  16. Pruning the dependence graph • Prune flow and input dependence edges that do not represent a potential reuse • Prune redundant input dependence edges • Prune output dependence edges after rest of the pruning is done

  17. Pruning the dependence graph • Phase 1: Eliminate killed dependences • When killed dependence is a flow dependence S1: A(I+1) = ... S2: A(I) = ... S3: ... = A(I) • Store in S2 is a killing store. Flow dependence from S1 to S3 is pruned • When killed dependence is an input dependence S1: ... = A(I+1) S2: A(I) = ... S3: ... = A(I-1) • Store in S2 is a killing store. Input dependence from S1 to S3 is pruned

  18. Pruning the dependence graph • Phase 2: Identify generators DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO • Any assignment reference with at least one flow dependence emanating from it to another statement in the loop • Any use reference with at least one input dependence emanating from it and no input or flow dependence into it

  19. Pruning the dependence graph • Phase 3: Find name partitions and eliminate input dependences • Use Typed Fusion • References as vertices • An edge joins two references • Output and anti- dependences are bad edges • Name of array as type • Eliminate input dependences between two elements of same name partition unless source is a generator

  20. Pruning the dependence graph • Special cases • Reference is in a dependence cycle in the loop DO I = 1, N A(J) = B(I) + C(I,J) C(I,J) = A(J) + D(I) ENDDO • Assign single scalar to the reference in the cycle • Replace A(J) by a scalar tA and insert A(J)=tA before or after the loop depending on upward/downward exposed occurrence

  21. Special cases: Inconsistent dependences DO I = 1, N A(I) = A(I-1) + B(I) A(J) = A(J) + A(I) ENDDO Store to A(J) kills A(I) Only one scalar replacement possible DO I = 1, N tAI = A(I-1) + B(I) A(I) = tAI A(J) = A(J) + tAI ENDDO This code can be improved substantially by index set splitting Pruning the dependence graph

  22. DO I = 1, N tAI = A(I-1) + B(I) A(I) = tAI A(J) = A(J) + tAI ENDDO Split this loop into three separate parts A loop up to J Iteration J A loop after iteration J to N tAI = A(0); tAJ = A(J) JU = MAX(J-1,0) DO I = 1, JU tAI = tAI + B(I); A(I) = tAI tAJ = tAJ + tAI ENDDO IF(J.GT.0.AND.J.LE.N) THEN tAI = tAI + B(I); A(I) = tAI tAJ = tAJ + tAI tAI = tAJ ENDIF DO I = JU+2, N tAI = tAI + B(I); A(I) = tAI tAJ = tAJ + tAI ENDDO A(J) = tAJ Pruning the dependence graph

  23. Moderation of Register Pressure • Scalar replacement of all name partitions will produce a lot of scalar quantities which compete for floating point registers • Have to choose name partitions for scalar replacement to maximize register usage

  24. Moderation of Register Pressure • Attach two parameters to each name partition R • v(R): the value of the name partition R • Number of loads or stores saved by replacing each reference in R by register resident scalars • c(R): the cost of the name partition R • Number of registers needed to hold all scalar temporary values

  25. Moderation of Register Pressure • Choose subset {R1,…,Rm} such that and maximize • 0-1 bin packing problem • Can use dynamic programming: O(nm) • Can use heuristic • Order name partitions according to the ratio v(R)/c(R) and select elements at the beginning of the list till all registers are exhausted

  26. Scalar Replacement: Putting it together 1. Prune dependence graph; Apply typed fusion 2. Select a set of name partitions using register pressure moderation 3. For each selected partition A) If non-cyclic, replace using set of temporaries B) If cyclic replace reference with single temporary C) For each inconsistent dependence Use index set splitting or insert loads and stores 4. Unroll loop to eliminate scalar copies

  27. DO I = 1, N A(I+1) = A(I-1) + B(I-1) A(I) = A(I) + B(I) + B(I+1) ENDDO t0A = A(0); t1A0 = A(1); tB1 = B(0) DO I = 1, N t1A1 = t0A + tB1 tB3 = B(I+1) t0A = t1A0 + tB3 + tB2 A(I) = t0A t1A0 = t1A1 tB1 = tB2 tB2 = tB3 ENDDO A(N+1) = t1A1 Scalar Replacement: Case A

  28. Scalar Replacement: Case B DO I = 1, N A(J) = B(I) + C(I,J) C(I,J) = A(J) + D(I) ENDDO • replace with single temporary... DO I = 1, N tA = B(I) + C(I,J) C(I,J) = tA + D(I) ENDDO A(J) = tA

  29. DO I = 1, N tAI = A(I-1) + B(I) A(I) = tAI A(J) = A(J) + tAI ENDDO Split this loop into three separate parts A loop up to J Iteration J A loop after iteration J to N tAI = A(0); tAJ = A(J) JU = MAX(J-1,0) DO I = 1, JU tAI = tAI + B(I); A(I) = tAI tAJ = tAJ + tAI ENDDO IF(J.GT.0.AND.J.LE.N) THEN tAI = tAI + B(I); A(I) = tAI tAJ = tAJ + tAI tAI = tAJ ENDIF DO I = JU+2, N tAI = tAI + B(I); A(I) = tAI tAJ = tAJ + tAI ENDDO A(J) = tAJ Scalar Replacement: Case C

  30. Experiments on Scalar Replacement

  31. Experiments on Scalar Replacement

  32. DO I = 1, N*2 DO J = 1, M A(I) = A(I) + B(J) ENDDO ENDDO Can we achieve reuse of references to B ? Use transformation called Unroll-and-Jam DO I = 1, N*2, 2 DO J = 1, M A(I) = A(I) + B(J) A(I+1) = A(I+1) + B(J) ENDDO ENDDO Unroll outer loop twice and then fuse the copies of the inner loop Brought two uses of B(J) together Unroll-and-Jam

  33. DO I = 1, N*2, 2 DO J = 1, M A(I) = A(I) + B(J) A(I+1) = A(I+1) + B(J) ENDDO ENDDO Apply scalar replacement on this code DO I = 1, N*2, 2 s0 = A(I) s1 = A(I+1) DO J = 1, M t = B(J) s0 = s0 + t s1 = s1 + t ENDDO A(I) = s0 A(I+1) = s1 ENDDO Half the number of loads as the original program Unroll-and-Jam

  34. Is unroll-and-jam always legal? DO I = 1, N*2 DO J = 1, M A(I+1,J-1) = A(I,J) + B(I,J) ENDDO ENDDO Apply unroll-and-jam DO I = 1, N*2, 2 DO J = 1, M A(I+1,J-1) = A(I,J) + B(I,J) A(I+2,J-1) = A(I+1,J) + B(I+1,J) ENDDO ENDDO This is wrong!!! Legality of Unroll-and-Jam

  35. Legality of Unroll-and-Jam

  36. Legality of Unroll-and-Jam • Direction vector in this example was (<,>) • This makes loop interchange illegal • Unroll-and-Jam is loop interchange followed by unrolling inner loop followed by another loop interchange • But does loop interchange illegal imply unroll-and-jam illegal ? NO

  37. Consider this example DO I = 1, N*2 DO J = 1, M A(I+2,J-1) = A(I,J) + B(I,J) ENDDO ENDDO Direction vector is (<,>); still unroll-and-jam possible Legality of Unroll-and-Jam

  38. Conditions for legality of unroll-and-jam • Definition: Unroll-and-jam to factor n consists of unrolling the outer loop n-1 times and fusing those copies together. • Theorem: An unroll-and-jam to a factor of n is legal iff there exists no dependence with direction vector (<,>) such that the distance for the outer loop is less than n.

  39. Unroll-and-jam Algorithm 1. Create preloop 2. Unroll main loop m(the unroll-and-jam factor) times 3. Apply typed fusion to loops within the body of the unrolled loop 4. Apply unroll-and-jam recursively to the inner nested loop

  40. DO I = 1, N DO K = 1, N A(I) = A(I) + X(I,K) ENDDO DO J = 1, M DO K = 1, N B(J,K) = B(J,K) + A(I) ENDDO ENDDO DO J = 1, M C(J,I) = B(J,N)/A(I) ENDDO ENDDO DO I = mN2+1, N, 2 DO K = 1, N A(I) = A(I) + X(I,K) A(I+1) = A(I+1) + X(I+1,K) ENDDO DO J = 1, M DO K = 1, N B(J,K) = B(J,K) + A(I) B(J,K) = B(J,K) + A(I+1) ENDDO C(J,I) = B(J,N)/A(I) C(J,I+1) = B(J,N)/A(I+1) ENDDO ENDDO Unroll-and-jam example

  41. Unroll-and-jam: Experiments

  42. Unroll-and-jam: Experiments

  43. Conclusion • We have learned two memory hierarchy transformations: • scalar replacement • unroll-and-jam • They reduce the number of memory accesses by maximum use of processor registers

More Related