
Register Pressure Guided Unroll-and-Jam

Authors: Yin Ma, Steven Carr


Presentation Transcript


  1. Register Pressure Guided Unroll-and-Jam
     Authors: Yin Ma, Steven Carr

  2. Motivation
     • In a processor, registers sit at the fastest level of the memory hierarchy, but the number of physical registers is very limited.
     • Unroll-and-jam in the loop model of Open64 increases register pressure by itself, and it also creates opportunities for other loop optimizations to increase register pressure indirectly.
     • If a transformed loop demands too many registers, overall performance may degrade.
     • Given a loop nest, a better register pressure prediction combined with a suitable unroll factor can avoid this degradation and achieve better overall performance.

  3. Research Topic
     • A register pressure prediction algorithm for unroll-and-jam
     • A register pressure guided loop model for unroll-and-jam

  4. Background: Data Dependence Analysis
     • The data dependence graph (DDG) is a directed graph that represents the data dependence relationships among instructions. In the patterns below, L1 and L2 refer to the same memory location, and statement S1 executes before S2.
     • True dependence: L1 stores into a memory location that is read by L2 later.
           S1: L1 = ...
           S2: ... = L2
     • Anti-dependence: L1 reads from a memory location that is written by L2 later.
           S1: ... = L1
           S2: L2 = ...
     • Output dependence: L1 and L2 both store into the same memory location.
           S1: L1 = ...
           S2: L2 = ...
     • Input dependence: L1 and L2 both read the same memory location.
           S1: ... = L1
           S2: ... = L2
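
The four cases above depend only on whether each of the two accesses is a read or a write. A minimal C sketch of that classification follows; the enum and function names are hypothetical and are not taken from Open64.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not part of Open64. */
typedef enum { DEP_TRUE, DEP_ANTI, DEP_OUTPUT, DEP_INPUT } dep_kind;

/* Classify the dependence between an earlier access (S1) and a later
 * access (S2) that touch the same memory location. */
dep_kind classify_dependence(bool s1_writes, bool s2_writes)
{
    if (s1_writes && !s2_writes) return DEP_TRUE;    /* write, then read  */
    if (!s1_writes && s2_writes) return DEP_ANTI;    /* read, then write  */
    if (s1_writes && s2_writes)  return DEP_OUTPUT;  /* write, then write */
    return DEP_INPUT;                                /* read, then read   */
}
```

For example, `classify_dependence(true, false)` yields `DEP_TRUE`, matching the S1: L1 = ... / S2: ... = L2 pattern.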

  5. Background: Scalar Replacement
     • Uses scalars, which are later allocated to registers, to replace array references and so decrease the number of memory references in loops.
     • This directly increases register pressure.
     Original loop:
         for (i = 2; i < n; i++)
             a[i] = a[i-1] + b[i];
     Scalar replaced:
         T = a[1];
         for (i = 2; i < n; i++) {
             T = T + b[i];
             a[i] = T;
         }
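
To check that the scalar-replaced loop computes the same values as the original, both versions can be wrapped in functions and run on the same input. This is a small sketch; the function names, the `double` element type, and the `n < 2` guard are assumptions added for illustration.

```c
#include <stddef.h>

/* Original loop: a[i-1] is read from memory on every iteration. */
void recurrence_original(double *a, const double *b, size_t n)
{
    for (size_t i = 2; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* Scalar-replaced loop: t carries the value of a[i-1] across
 * iterations, so the array is only written, never re-read. */
void recurrence_scalar_replaced(double *a, const double *b, size_t n)
{
    if (n < 2)
        return;               /* a[1] does not exist; nothing to do */
    double t = a[1];
    for (size_t i = 2; i < n; i++) {
        t = t + b[i];
        a[i] = t;
    }
}
```

The scalar `t` stays live across the whole loop, which is the extra register demand the slide refers to.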

  6. Background: Unroll-and-Jam
     • Creates larger loop bodies by flattening multiple iterations.
     • Larger loop bodies let other optimizations create more register pressure.
     Original loop nest:
         for (I = 1; I <= 10; I++) {
             for (J = 1; J < 5; J++) {
                 A[I][J] = B[J] + C[J];
                 D[I][J] = E[I][J] + F[I][J];
             }
         }
     Unroll-and-jammed and later scalar replaced:
         for (I = 1; I <= 10; I = I + 2) {
             for (J = 1; J < 5; J++) {
                 b = B[J];
                 c = C[J];
                 A[I][J] = b + c;
                 D[I][J] = E[I][J] + F[I][J];
                 A[I+1][J] = b + c;
                 D[I+1][J] = E[I+1][J] + F[I+1][J];
             }
         }
     Register pressure increases because b and c hold two registers that originally could be reused for E and F.
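
When the outer trip count is not a multiple of the unroll factor, the unroll-and-jammed nest needs a cleanup loop for the leftover iterations. The sketch below shows this for an unroll factor of 2, keeping only the `A = B + C` statement from the example; the function names, zero-based indexing, and the `COLS` constant are assumptions made for illustration.

```c
#include <stddef.h>

#define COLS 5  /* hypothetical fixed row width for this sketch */

/* Original nest: one row of A is computed per outer iteration. */
void add_rows(double A[][COLS], const double *B, const double *C,
              size_t rows)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < COLS; j++)
            A[i][j] = B[j] + C[j];
}

/* Outer loop unrolled by 2 with the two inner loops jammed together.
 * b and c are loaded once and reused for both rows. */
void add_rows_unroll_jam2(double A[][COLS], const double *B,
                          const double *C, size_t rows)
{
    size_t i = 0;
    for (; i + 1 < rows; i += 2) {
        for (size_t j = 0; j < COLS; j++) {
            double b = B[j];
            double c = C[j];
            A[i][j]     = b + c;
            A[i + 1][j] = b + c;
        }
    }
    /* Cleanup loop: runs once when rows is odd. */
    for (; i < rows; i++)
        for (size_t j = 0; j < COLS; j++)
            A[i][j] = B[j] + C[j];
}
```

Both functions fill `A` identically for any `rows`, including odd values.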

  7. Background: Software Pipelining
     • Software pipelining is an advanced scheduling technique. Usually, more overlapped instructions demand additional registers.
     • The initiation interval (II) of a loop is the number of cycles between the starts of consecutive iterations.
     • The resource II (ResII) gives the minimum II allowed by machine resources such as the number of functional units.
     • The recurrence II (RecII) gives the minimum II allowed by the lengths of the cycles in the data dependence graph.
     Schematic of a loop that originally ran N times, software pipelined according to the dependences among the operations:
         [Prelude]    D(1)
                      B(1)  D(2)
         [Loop body]  do N-2 times (with index i):
                      A(i)  C(i)  B(i+1)  D(i+2)
         [Postlude]   A(N-1)  C(N-1)  B(N)
                      A(N)  C(N)
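
The two lower bounds can be sketched with the usual textbook formulas from modulo scheduling: ResII is the largest ceiling of operations over units across resource classes, and RecII is the largest ceiling of latency over iteration distance across dependence cycles. These formulas, the type, and the function names are assumptions for illustration; the slides compute RecII with their own heuristic, which is not shown here.

```c
/* Ceiling division for positive integers. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* ResII: for each resource class k, ops[k] operations compete for
 * units[k] functional units per cycle. */
int res_ii(const int *ops, const int *units, int num_classes)
{
    int ii = 0;
    for (int k = 0; k < num_classes; k++) {
        int need = ceil_div(ops[k], units[k]);
        if (need > ii)
            ii = need;
    }
    return ii;
}

/* One dependence cycle in the DDG: total latency along the cycle and
 * total iteration distance it spans (distance must be >= 1). */
typedef struct { int latency; int distance; } dep_cycle;

/* RecII: the tightest bound over all dependence cycles; 0 if none. */
int rec_ii(const dep_cycle *cycles, int num_cycles)
{
    int ii = 0;
    for (int c = 0; c < num_cycles; c++) {
        int need = ceil_div(cycles[c].latency, cycles[c].distance);
        if (need > ii)
            ii = need;
    }
    return ii;
}
```

For instance, 4 memory operations on 2 memory units and 3 floating-point operations on 1 unit give a ResII of 3.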

  8. Typical approaches to preventing degradation from register pressure
     • Predictive approach (our approach): predict the effects before applying optimizations and decide the best set of parameters for them. Fastest, and fits all situations.
     • Iterative approach (such as feedback-based methods): apply optimizations with one set of parameters, then redo them with adjusted parameters for better performance.
     • Genetic approach: prepare many sets of parameters and apply optimizations with each set. Use genetic programming to pick the best.

  9. Problems in Previous Work
     • All predictive register pressure methods are designed for software pipelining.
     • They do not support source-code-level loop optimizations at all.
     • There is no systematic research on how to predict register pressure for loop optimizations.
     • There is no register pressure guided loop model.

  10. Key Design Details
     • The prediction algorithms work at the source-code level.
     • The prediction algorithms handle the effects on register pressure from:
         • unroll-and-jam
         • scalar replacement
         • software pipelining
         • general scalar optimizations
     • The register pressure guided loop model uses the predicted register information to pick an unroll vector for the best performance.

  11. Register Prediction for Unroll-and-Jam (Overview)
     • Compute RecII with our heuristic method.
     • Create the list of arrays that will be replaced by scalars by checking the original DDG.
     • Construct the new DDG D1, using the list above, for the original loop only.
     • All copies reuse the DDG D1 as their base DDG.
     • Adjust each copy of the DDG to reflect the future changes.
     • Re-compute ResII to get MinII.
     • Do a pseudo schedule to get the register pressure.

  12. Construct the base DDG
     Traverse the innermost loop and construct the base DDG.
         DO J = 1, N
           DO I = 1, N
             U(I,J) = V(I) + P(J,I)
           ENDDO
         ENDDO

  13. Prepare the DDG after unroll-and-jam
     Duplicate the base DDG according to the given unroll factors. Here the unroll vector is 2.
         DO J = 1, N, 2
           DO I = 1, N
             U(I,J) = V(I) + P(J,I)
             U(I,J+1) = V(I) + P(J+1,I)
           ENDDO
         ENDDO
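
The duplication step can be pictured as copying the base graph's edge list once per unrolled iteration and renumbering the nodes of each copy. The following C sketch makes that concrete under simplifying assumptions: the DDG is a plain edge list, nodes are numbered from 0, and the caller supplies the output buffer. The type and function names are hypothetical and do not reflect Open64's data structures.

```c
#include <stddef.h>

/* A dependence edge between two DDG nodes, identified by index. */
typedef struct { int src; int dst; } ddg_edge;

/* Write `factor` copies of the base edges into `out`, which must hold
 * at least num_edges * factor entries. Node n of copy k is renumbered
 * to k * num_nodes + n. Returns the number of edges written. */
size_t duplicate_ddg(const ddg_edge *base, size_t num_edges,
                     int num_nodes, int factor, ddg_edge *out)
{
    size_t written = 0;
    for (int k = 0; k < factor; k++) {
        for (size_t e = 0; e < num_edges; e++) {
            out[written].src = base[e].src + k * num_nodes;
            out[written].dst = base[e].dst + k * num_nodes;
            written++;
        }
    }
    return written;
}
```

For the statement `U(I,J) = V(I) + P(J,I)`, one possible base graph has four nodes (load V, load P, add, store U) and three edges; with a factor of 2 the copy's edges run between nodes 4 to 7. Edges between copies and the removal of redundant nodes are left to a later step.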

  14. Finalize the DDG
     • Remove unnecessary nodes and edges, and add new edges based on the updated dependences.
     • Reflect the effect of further optimizations.
     • Consider array indexing reuse by analyzing array subscripts.

  15. Register Prediction
     • Schedule the final DDG with a depth-first scan starting from the first node of the first iteration copy.
     • The RecII is the RecII of the original innermost loop.
     • The ResII is computed on the final DDG with the target architecture information.
     • MinII = MAX(RecII, ResII)
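
Once a schedule and an II are fixed, one common way to estimate register pressure is to count, for each cycle slot modulo II, how many value lifetimes overlap it and take the maximum (often called MaxLive). The sketch below illustrates that idea only; it is an assumption about how a pseudo schedule could be turned into a register count, not the authors' algorithm. The `lifetime` type, `MAX_II` limit, and function names are hypothetical.

```c
#include <stddef.h>

#define MAX_II 64  /* hypothetical upper limit for this sketch */

/* A value is live in cycles [def_cycle, last_use_cycle) of the
 * schedule for one iteration; both cycles are non-negative. */
typedef struct { int def_cycle; int last_use_cycle; } lifetime;

/* MinII is the larger of the two lower bounds. */
int min_ii(int rec_ii, int res_ii)
{
    return rec_ii > res_ii ? rec_ii : res_ii;
}

/* Peak number of simultaneously live values when iterations start
 * every ii cycles. Returns -1 if ii is outside 1..MAX_II. */
int max_live(const lifetime *vals, size_t n, int ii)
{
    int live[MAX_II] = {0};
    if (ii <= 0 || ii > MAX_II)
        return -1;
    for (size_t v = 0; v < n; v++)
        for (int t = vals[v].def_cycle; t < vals[v].last_use_cycle; t++)
            live[t % ii]++;   /* overlapping iterations share slots */
    int peak = 0;
    for (int s = 0; s < ii; s++)
        if (live[s] > peak)
            peak = live[s];
    return peak;
}
```

A smaller II overlaps more iterations, so the same lifetimes produce a higher peak; this is the trade-off between throughput and register demand noted in the software pipelining background.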

  16. Register Pressure Guided Unroll-and-Jam
     • unitII is used as the performance indicator of an unroll-and-jammed loop, with the following quantities defined:
     • R is the number of registers predicted
     • P is the number of registers available
     • D is the total outgoing degree in the final DDG
     • E is the total number of cross-iteration edges
     • A is the average memory access penalty
     • N is the number of nodes in the final DDG

  17. Open64 Implementation & Experiment Results
     • For register prediction, a retargetable compiler with an infinite number of available physical registers is used.
     • Loop nests are extracted from SPEC2000.
     • For register pressure guided unroll-and-jam, our model directly replaces the unroll-and-jam analysis used by the Open64 backend.
     • A minor value computed with information from Open64's cache model is added to unitII.
     • The register prediction for unroll-and-jam predicts the floating-point register pressure of a loop to within 3 registers and the integer register pressure to within 4 registers.
     • Our register pressure guided unroll-and-jam improves overall performance by about 2% over the model in the Open64 backend on both x86 and x86-64 architectures on the Polyhedron benchmark.

  18. The End
     Any questions?
