Register Pressure Guided Unroll-and-Jam
Authors: Yin Ma, Steven Carr
Motivation
• In a processor, registers sit at the fastest level of the memory hierarchy, but the number of physical registers is very limited.
• Unroll-and-jam in the loop model of Open64 not only increases register pressure by itself but also creates opportunities for other loop optimizations to increase register pressure indirectly.
• If a transformed loop demands too many registers, overall performance may degrade.
• Given a loop nest, a better register pressure prediction and a well-chosen unroll factor can eliminate this degradation and achieve better overall performance.
Research Topic
• A register pressure prediction algorithm for unroll-and-jam
• A register pressure guided loop model for unroll-and-jam
Background: Data Dependence Analysis
• The data dependence graph (DDG) is a directed graph that represents the data dependence relationships among instructions.
• True dependence: L1 stores into a memory location that is read by L2 later.
    S1: L1 = ...
    S2: ... = L2
• Anti-dependence: L1 reads from a memory location that is written by L2 later.
    S1: ... = L1
    S2: L2 = ...
• Output dependence: L1 and L2 store into the same memory location.
    S1: L1 = ...
    S2: L2 = ...
• Input dependence: the same memory location is read by both L1 and L2.
    S1: ... = L1
    S2: ... = L2
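The four cases above depend only on whether each of the two accesses is a read or a write. A minimal sketch in C, assuming both accesses are already known to touch the same memory location; the type and function names are hypothetical and are not taken from Open64:

```c
/* Kind of memory access performed by a reference (hypothetical type). */
typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind;

/* The four dependence kinds listed on the slide. */
typedef enum { DEP_TRUE, DEP_ANTI, DEP_OUTPUT, DEP_INPUT } dep_kind;

/*
 * Classify the dependence from an earlier access (L1) to a later
 * access (L2). Assumption: the caller has already established that
 * both accesses refer to the same memory location.
 */
dep_kind classify_dependence(access_kind earlier, access_kind later)
{
    if (earlier == ACCESS_WRITE && later == ACCESS_READ)
        return DEP_TRUE;    /* write, then read  */
    if (earlier == ACCESS_READ && later == ACCESS_WRITE)
        return DEP_ANTI;    /* read, then write  */
    if (earlier == ACCESS_WRITE && later == ACCESS_WRITE)
        return DEP_OUTPUT;  /* write, then write */
    return DEP_INPUT;       /* read, then read   */
}
```

For example, `classify_dependence(ACCESS_WRITE, ACCESS_READ)` yields `DEP_TRUE`. Deciding whether two array references can touch the same location at all is the harder part of dependence analysis and is outside this sketch.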
Background: Scalar Replacement
• Uses scalars, later allocated to registers, to replace array references in order to decrease the number of memory references in loops
• This directly increases register pressure
Original:
    for ( i = 2; i < n; i++ )
        a[i] = a[i-1] + b[i];
Scalar replaced:
    T = a[1];
    for ( i = 2; i < n; i++ ) {
        T = T + b[i];
        a[i] = T;
    }
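The two loops on this slide can be wrapped in functions and compared directly. A small sketch, assuming `a` and `b` do not overlap in memory (otherwise keeping `a[i-1]` in a scalar would not be safe); the function names are illustrative only:

```c
#include <assert.h>

/* Original loop: each iteration reloads a[i-1] from memory. */
void recurrence_original(double *a, const double *b, int n)
{
    for (int i = 2; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/*
 * Scalar-replaced loop: t carries the value of a[i-1] from one
 * iteration to the next, so the load of a[i-1] disappears and t
 * occupies a register for the whole loop.
 * Assumption: a and b do not alias.
 */
void recurrence_scalar_replaced(double *a, const double *b, int n)
{
    double t;
    assert(n >= 2);          /* a[1] must exist */
    t = a[1];
    for (int i = 2; i < n; i++) {
        t = t + b[i];
        a[i] = t;
    }
}
```

Both functions leave identical contents in `a`. The second one trades a memory load per iteration for one extra live value, which is the register pressure increase the slide refers to.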
Background: Unroll-and-Jam
• Creates larger loop bodies by flattening multiple iterations
• Larger loop bodies make other optimizations create more register pressure
Original:
    for ( I = 1; I < 10; I++ ) {
        for ( J = 1; J < 5; J++ ) {
            A[I][J] = B[J] + C[J];
            D[I][J] = E[I][J] + F[I][J];
        }
    }
Unroll-and-jammed and later scalar replaced:
    for ( I = 1; I + 1 < 10; I = I+2 ) {
        for ( J = 1; J < 5; J++ ) {
            b = B[J];
            c = C[J];
            A[I][J] = b + c;
            D[I][J] = E[I][J] + F[I][J];
            A[I+1][J] = b + c;
            D[I+1][J] = E[I+1][J] + F[I+1][J];
        }
    }
    /* the leftover iteration I = 9 runs in a cleanup copy of the original loop */
    /* register pressure increased because b, c hold two registers
       that originally could be reused for E and F */
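The transformation on this slide can be checked against the original nest. A sketch in C with an unroll factor of 2 on the outer loop, assuming the arrays do not overlap; because the outer loop runs 9 iterations, a cleanup loop handles the odd one out. The function names and the fixed sizes `ROWS` and `COLS` are illustrative:

```c
#define ROWS 10
#define COLS 5

/* Original loop nest from the slide. */
void nest_original(int A[ROWS][COLS], int D[ROWS][COLS],
                   int B[COLS], int C[COLS],
                   int E[ROWS][COLS], int F[ROWS][COLS])
{
    for (int i = 1; i < ROWS; i++)
        for (int j = 1; j < COLS; j++) {
            A[i][j] = B[j] + C[j];
            D[i][j] = E[i][j] + F[i][j];
        }
}

/* Outer loop unrolled by 2, copies jammed into one inner loop,
   then B[j] and C[j] scalar replaced. Assumes no aliasing. */
void nest_unroll_and_jam(int A[ROWS][COLS], int D[ROWS][COLS],
                         int B[COLS], int C[COLS],
                         int E[ROWS][COLS], int F[ROWS][COLS])
{
    int i;
    for (i = 1; i + 1 < ROWS; i += 2)
        for (int j = 1; j < COLS; j++) {
            int b = B[j];   /* live across both unrolled copies */
            int c = C[j];
            A[i][j] = b + c;
            D[i][j] = E[i][j] + F[i][j];
            A[i + 1][j] = b + c;
            D[i + 1][j] = E[i + 1][j] + F[i + 1][j];
        }
    /* Cleanup: remaining outer iterations when the trip count
       is not a multiple of the unroll factor. */
    for (; i < ROWS; i++)
        for (int j = 1; j < COLS; j++) {
            A[i][j] = B[j] + C[j];
            D[i][j] = E[i][j] + F[i][j];
        }
}
```

Both functions write the same values into `A` and `D`. The jammed body saves two loads per pair of rows, at the cost of keeping `b` and `c` live while the `E` and `F` operands are also in flight.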
Background: Software Pipelining
• Software pipelining is an advanced scheduling technique. Usually, more heavily overlapped instructions demand additional registers.
• The initiation interval (II) of a loop is the number of cycles between the start of one iteration and the start of the next.
• The resource II (ResII) gives the minimum number of cycles needed to execute the loop based upon machine resources such as the number of functional units.
• The recurrence II (RecII) gives the minimum number of cycles needed for a single iteration based upon the length of the cycles in the data dependence graph.
Do N times
• [Prelude]
    D_1
    B_1 D_2
• [Loop Body]
    Do N-2 times (with index i)
        A_i C_i B_{i+1} D_{i+2}
• [Postlude]
    A_{N-1} C_{N-1} B_N
    A_N C_N
Software pipelined due to dependences among the operations
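The two lower bounds can be sketched with the textbook formulas: ResII is the largest value of ceil(operations / units) over all resource classes, and RecII is the largest value of ceil(latency / distance) over all dependence cycles. These are the standard modulo-scheduling definitions, not the heuristic RecII method of this work, and the struct and function names below are hypothetical:

```c
#include <assert.h>

/* Integer ceiling division for positive divisors. */
static int ceil_div(int a, int b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* One resource class: operations per iteration and available units. */
typedef struct { int ops; int units; } resource_use;

/* One dependence cycle: total latency and total iteration distance. */
typedef struct { int latency; int distance; } dep_cycle;

/* ResII: the most heavily used resource class bounds the II. */
int res_ii(const resource_use *res, int count)
{
    int bound = 0;
    for (int r = 0; r < count; r++) {
        int need = ceil_div(res[r].ops, res[r].units);
        if (need > bound)
            bound = need;
    }
    return bound;
}

/* RecII: the slowest recurrence bounds the II; 0 if there is none. */
int rec_ii(const dep_cycle *cycles, int count)
{
    int bound = 0;
    for (int c = 0; c < count; c++) {
        int need = ceil_div(cycles[c].latency, cycles[c].distance);
        if (need > bound)
            bound = need;
    }
    return bound;
}
```

As an illustration, a loop with 4 memory operations on 2 memory units and 3 floating-point operations on 1 unit has ResII = 3, and a recurrence with latency 4 carried over 1 iteration gives RecII = 4.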
Typical approaches to preventing degradation from register pressure
• Predictive approach (our approach): predict the effects before applying optimizations and decide the best set of parameters for them. It is the fastest approach and fits all situations.
• Iterative approach (e.g. feedback based): apply optimizations with one set of parameters, then redo them with adjusted parameters for better performance.
• Genetic approach: prepare many sets of parameters and apply optimizations with each set. Use genetic programming to pick the best.
Problem in Previous Work
• All existing predictive methods for register pressure are designed for software pipelining
• They do not support source-code-level loop optimizations at all
• No systematic research on how to predict register pressure for loop optimizations
• No register pressure guided loop model
Key Design Detail
• The prediction algorithms work at the source-code level
• The prediction algorithms handle the effects on register pressure from:
    • unroll-and-jam
    • scalar replacement
    • software pipelining
    • general scalar optimizations
• The register pressure guided loop model uses the predicted register information to pick an unroll vector for the best performance
Register Prediction for Unroll-and-Jam (Overview)
• Compute RecII with our heuristic method
• Create the list of arrays that will be replaced by scalars by checking the original DDG
• Construct the new DDG D1, using the list above, for the original loop only
• All copies reuse D1 as their base DDG
• Adjust each copy of the DDG to reflect the future changes
• Recompute ResII to get MinII
• Do a pseudo schedule to get the register pressure
Construct the base DDG
• Traverse the innermost loop and construct the base DDG
    DO J = 1, N
      DO I = 1, N
        U(I,J) = V(I) + P(J,I)
      ENDDO
    ENDDO
Prepare the DDG after unroll-and-jam
• Duplicate the base DDG according to the input unroll factors
• Unroll vector is 2
    DO J = 1, N, 2
      DO I = 1, N
        U(I,J) = V(I) + P(J,I)
        U(I,J+1) = V(I) + P(J+1,I)
      ENDDO
    ENDDO
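The duplication step can be pictured as copying the base graph once per unrolled iteration and renumbering the nodes of each copy. A minimal sketch in C, assuming a DDG stored as a plain edge list; the `ddg` layout, the size limits, and the function name are hypothetical and do not reflect Open64's data structures:

```c
#include <assert.h>

#define MAX_NODES 64
#define MAX_EDGES 256

typedef struct { int src, dst; } edge;

/* Hypothetical DDG: nodes are numbered 0..node_count-1. */
typedef struct {
    int node_count;
    int edge_count;
    edge edges[MAX_EDGES];
} ddg;

/*
 * Build a graph holding `factor` copies of the base DDG.
 * Copy k reuses the base edges with every node id shifted by
 * k * base->node_count. Edges between copies are not created here.
 */
ddg duplicate_ddg(const ddg *base, int factor)
{
    ddg out = { .node_count = 0, .edge_count = 0 };
    assert(factor >= 1);
    assert(base->node_count * factor <= MAX_NODES);
    assert(base->edge_count * factor <= MAX_EDGES);

    out.node_count = base->node_count * factor;
    for (int copy = 0; copy < factor; copy++) {
        int offset = copy * base->node_count;
        for (int e = 0; e < base->edge_count; e++) {
            out.edges[out.edge_count].src = base->edges[e].src + offset;
            out.edges[out.edge_count].dst = base->edges[e].dst + offset;
            out.edge_count++;
        }
    }
    return out;
}
```

For the loop body `U(I,J) = V(I) + P(J,I)`, one possible base graph has four nodes (load V, load P, add, store U) and three edges; duplicating it with factor 2 gives eight nodes and six edges. In that graph the second copy's load of `V(I)` repeats the first copy's load, which is the kind of redundancy a later clean-up of the graph would be expected to remove.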
Finalize the DDG
• Remove unnecessary nodes/edges and add new edges based on the updated dependences
• Reflect the effect of further optimizations
• Consider array indexing reuse by analyzing array subscripts
Register Prediction
• Schedule the final DDG with a depth-first scan starting from the first node of the first iteration copy
• The RecII is the RecII of the original innermost loop
• The ResII is computed on the final DDG with the target architecture information
• MinII = MAX(RecII, ResII)
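Once a schedule and MinII are known, one common way to turn them into a register estimate is to count how many values are live in each cycle modulo the II (often called MaxLive). The slides do not spell out how the pseudo schedule is converted into a register count, so the sketch below is an assumption that uses this standard measure; the `lifetime` type and function names are hypothetical:

```c
#include <assert.h>

#define MAX_II 64

/* Lifetime of one value in the schedule of a single iteration. */
typedef struct {
    int def_cycle;       /* cycle in which the value is produced   */
    int last_use_cycle;  /* cycle of its last use (end, exclusive) */
} lifetime;

/* MinII = MAX(RecII, ResII), as on the slide. */
int min_ii(int rec_ii, int res_ii)
{
    return rec_ii > res_ii ? rec_ii : res_ii;
}

/*
 * Estimate register pressure as the peak number of values live in
 * any cycle of the steady state. Iterations start every `ii` cycles,
 * so a value live in cycle c occupies slot c % ii.
 */
int max_live(const lifetime *values, int count, int ii)
{
    int live[MAX_II] = {0};
    int peak = 0;

    assert(ii > 0 && ii <= MAX_II);
    for (int v = 0; v < count; v++) {
        assert(values[v].def_cycle >= 0);
        assert(values[v].def_cycle <= values[v].last_use_cycle);
        for (int c = values[v].def_cycle; c < values[v].last_use_cycle; c++)
            live[c % ii]++;
    }
    for (int slot = 0; slot < ii; slot++)
        if (live[slot] > peak)
            peak = live[slot];
    return peak;
}
```

With three values whose lifetimes are [0,3), [1,2) and [2,5), an II of 2 gives a peak of 4 live values, while an II of 5 gives only 2. This illustrates why a smaller II, and therefore more overlap between iterations, tends to demand more registers.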
Register Pressure Guided Unroll-and-Jam
• Use UnitII as the performance indicator of an unroll-and-jammed loop
• R is the number of registers predicted
• P is the number of registers available
• D is the total outgoing degree in the final DDG
• E is the total number of cross-iteration edges
• A is the average memory access penalty
• N is the number of nodes in the final DDG
Open64 Implementation & Experiment Results
• For register prediction, a retargetable compiler with an unlimited number of available physical registers is used
• Loop nests are extracted from SPEC2000
• For register pressure guided unroll-and-jam, our model directly replaces the unroll-and-jam analysis used by the Open64 backend
• A minor value computed with the information from Open64's cache model is added to UnitII
• Register prediction for unroll-and-jam predicts the floating-point register pressure of a loop within 3 registers and the integer register pressure within 4 registers
• Our register pressure guided unroll-and-jam improves overall performance by about 2% over the model in the Open64 backend on both x86 and x86-64 architectures on the Polyhedron benchmark
The End
Any questions?