Array Dependence Analysis with the Chains of Recurrences Framework for Loop Optimization
Robert van Engelen, Florida State University
Also thanks to J. Birch, Y. Shou, and K. Gallivan
NCSU 2/24/06

Outline
• Motivation
• Restructuring compilers
• Chains of recurrences algebra and associated algorithms for the GCC and Polaris compilers
• Nonlinear array dependence testing for loop restructuring and vectorization
• Experimental results
• Conclusions

Motivation
• Intel CTO: “the increased power requirements of newer chips will lead to CPUs that are hotter than the surface of the sun by 2010”
• Enter multi-core CPUs
  • Increase the overall system speed by adding CPU cores
  • Speed up multi-threaded applications
  • Can effectively lower the power consumption
• Enter (more?) multi-media extensions
  • Vector-like instruction sets: MMX, SSE, AltiVec
  • Speed up multi-media codes, such as JPEG, MPEG

Code Optimization by Hand or Automatic?
• Rewriting applications by hand to exploit parallelism is doable if:
  • Tasks can be identified that run independently, such as a Web browser’s rendering and communications tasks
  • The parallelism is coarse-grain: tasks must have sufficient work
• Rewriting applications by hand to exploit lots of fine-grain parallelism is not doable
  • Thousands of read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) data dependences must be analyzed

Restructuring Compilers
• A restructuring compiler typically applies source-code transformations automatically to meet various performance enhancement criteria:
  • Exploit parallelism in loops by reordering the loop structure to run loop iterations in parallel
  • Find small loops to replace with vector instructions
  • Optimize data locality by reordering code to change the memory access order and improve cache use
• All code changes are safe as long as RAW, WAR, and WAW data dependences are preserved!

Example: Loop Fission
• Loop fission splits a single loop into multiple loops
• Allows vectorization and parallelization of the new loops when the original loop was sequential
• Loop fission must preserve all dependence relations of the original loop

Original loop nest, with dependence S3 (=,<) S4:
  S1 DO I = 1, 10
  S2   DO J = 1, 10
  S3     A(I,J) = B(I,J) + C(I,J)
  S4     D(I,J) = A(I,J-1) * 2.0
  S5   ENDDO
  S6 ENDDO

After fission of the J loop, S3 (=,<) S4 is preserved:
  S1 DO I = 1, 10
  S2   DO J = 1, 10
  S3     A(I,J) = B(I,J) + C(I,J)
  Sx   ENDDO
  Sy   DO J = 1, 10
  S4     D(I,J) = A(I,J-1) * 2.0
  S5   ENDDO
  S6 ENDDO

After vectorization of the J loops and parallelization of the I loop, S3 (=,<) S4 is preserved:
  S1 PARALLEL DO I = 1, 10
  S3   A(I,1:10) = B(I,1:10) + C(I,1:10)
  S4   D(I,1:10) = A(I,0:9) * 2.0
  S6 ENDDO

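The fission example can be checked by running both loop orders on the same data. The sketch below is only an illustration in Python, not code from the talk; the input values, the stale value -1 in A, and the helper names are assumptions made for the test.

```python
N = 10  # loop bound from the slide

def initial_arrays():
    """Hypothetical test data; A also holds a boundary column J = 0."""
    idx = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]
    b = {ij: ij[0] + 2 * ij[1] for ij in idx}
    c = {ij: 3 * ij[0] - ij[1] for ij in idx}
    a = {(i, j): -1 for i in range(1, N + 1) for j in range(0, N + 1)}
    return a, b, c

def original():
    a, b, c = initial_arrays()
    d = {}
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            a[i, j] = b[i, j] + c[i, j]      # S3
            d[i, j] = a[i, j - 1] * 2.0      # S4
    return a, d

def fissioned():
    a, b, c = initial_arrays()
    d = {}
    for i in range(1, N + 1):
        for j in range(1, N + 1):            # first J loop: S3 only
            a[i, j] = b[i, j] + c[i, j]
        for j in range(1, N + 1):            # second J loop: S4 only
            d[i, j] = a[i, j - 1] * 2.0
    return a, d

def fissioned_reversed():
    """Illegal order: S4 runs before S3, so S3 (=,<) S4 is violated."""
    a, b, c = initial_arrays()
    d = {}
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            d[i, j] = a[i, j - 1] * 2.0      # reads stale A values
        for j in range(1, N + 1):
            a[i, j] = b[i, j] + c[i, j]
    return a, d

print(original() == fissioned())
print(original() == fissioned_reversed())
```

Placing the S3 loop before the S4 loop keeps every read of A(I,J-1) after the write that produces it, which is what the direction vector (=,<) requires.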
Loop Fission: Algorithm
• Compute the acyclic condensation of the dependence graph to find a legal order of the loops

Original loop:
  S1 DO I = 1, 10
  S2   A(I) = A(I) + B(I-1)
  S3   B(I) = C(I-1)*X + Z
  S4   C(I) = 1/B(I)
  S5   D(I) = sqrt(C(I))
  S6 ENDDO

Dependences: S3 (<) S2, S4 (<) S3, S3 (=) S4, S4 (=) S5

After fission and vectorization:
  S1 DO I = 1, 10
  S3   B(I) = C(I-1)*X + Z
  S4   C(I) = 1/B(I)
  Sx ENDDO
  S2 A(1:10) = A(1:10) + B(0:9)
  S5 D(1:10) = sqrt(C(1:10))

[Figure: dependence graph over S2, S3, S4, S5 with edges labeled 0 or 1, and its acyclic condensation]

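One way to compute an acyclic condensation is to find the strongly connected components of the dependence graph. The sketch below uses Kosaraju's algorithm, which is a choice made here for illustration and is not stated in the talk; the edge list encodes the four dependences of the example and the function name is hypothetical.

```python
def condensation_order(nodes, edges):
    """Return the strongly connected components of a dependence graph,
    listed in a topological order of the acyclic condensation."""
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)

    # Pass 1: depth-first search on the graph, recording post-order.
    post_order, seen = [], set()

    def visit(n):
        seen.add(n)
        for m in succ[n]:
            if m not in seen:
                visit(m)
        post_order.append(n)

    for n in nodes:
        if n not in seen:
            visit(n)

    # Pass 2: search the reversed graph in decreasing finish time.
    component_of = {}

    def assign(n, members):
        component_of[n] = members
        members.append(n)
        for m in pred[n]:
            if m not in component_of:
                assign(m, members)

    components = []
    for n in reversed(post_order):
        if n not in component_of:
            members = []
            assign(n, members)
            components.append(sorted(members))
    return components

# Edges source -> sink for S3 (<) S2, S4 (<) S3, S3 (=) S4, S4 (=) S5.
EDGES = [("S3", "S2"), ("S4", "S3"), ("S3", "S4"), ("S4", "S5")]
ORDER = condensation_order(["S2", "S3", "S4", "S5"], EDGES)
print(ORDER)
```

The cycle between S3 and S4 ends up in one component, so those two statements stay together in a sequential loop, while S2 and S5 become separate loops that can be vectorized.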
Example: Loop Interchange
• Changes the loop nesting order
• Allows vectorization of an outer loop and more effective parallelization of an inner loop
• Can be used to improve spatial locality
• Loop interchange must preserve all dependence relations of the original loop

Original loop nest, with dependence S3 (=,<) S3:
  S1 DO I = 1, N
  S2   DO J = 1, M
  S3     A(I,J) = A(I,J-1) + B(I,J)
  S4   ENDDO
  S5 ENDDO

After interchange, the dependence becomes S3 (<,=) S3:
  S2 DO J = 1, M
  S1   DO I = 1, N
  S3     A(I,J) = A(I,J-1) + B(I,J)
  S4   ENDDO
  S5 ENDDO

After vectorization of the inner I loop, S3 (<,=) S3 is preserved:
  S2 DO J = 1, M
  S3   A(1:N,J) = A(1:N,J-1) + B(1:N,J)
  S5 ENDDO

Loop Interchange: Algorithm
• Compute the direction matrix and find which columns (and therefore which loops) can be permuted without violating dependence relations in the original loop nest

  S1 DO I = 1, N
  S2   DO J = 1, M
  S3     DO K = 1, L
  S4       A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
  S5     ENDDO
  S6   ENDDO
  S7 ENDDO

Dependences: S4 (<,<,=) S4 and S4 (<,=,>) S4

Direction matrix (columns I, J, K):
  < < =
  < = >

Invalid permutation (columns J, K, I), because the second row gets a leading ">":
  < = <
  = > <

Valid permutation (columns J, I, K):
  < < =
  = < >

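The legality check on the direction matrix can be sketched in a few lines. This is an illustrative Python version, assuming the usual rule that a permuted direction vector is legal when its first non-"=" entry is "<"; the function names are hypothetical.

```python
from itertools import permutations

def permutation_is_legal(direction_matrix, perm):
    """perm[k] is the original column placed at position k."""
    for row in direction_matrix:
        for direction in (row[k] for k in perm):
            if direction == "<":
                break            # dependence is carried forward: row is fine
            if direction == ">":
                return False     # leading ">" reverses the dependence
    return True

def legal_permutations(direction_matrix):
    depth = len(direction_matrix[0])
    return [p for p in permutations(range(depth))
            if permutation_is_legal(direction_matrix, p)]

# Rows are the dependences (<,<,=) and (<,=,>); columns are loops I, J, K.
MATRIX = [("<", "<", "="), ("<", "=", ">")]
print(legal_permutations(MATRIX))
```

With columns numbered I = 0, J = 1, K = 2, the permutation (1, 2, 0) corresponds to the invalid matrix on the slide and (1, 0, 2) to the valid one.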
Complications
• Loop restructuring is complicated by:
  • The presence of several induction variables
  • Nonlinear and symbolic array index expressions
  • The use of pointer arithmetic instead of arrays in C
  • Non-unit loop strides and unstructured loops
  • Control flow
• Need loop normalization and preprocessing
  • Apply induction variable substitution
  • Convert pointer dereferences to array accesses
  • Normalize the loop iteration space

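Induction Variable Substitution
[Figure: a loop before and after induction variable substitution (IVS), followed by the dependence test; the array references after IVS are A[] … A[2*i+1] and A[2*i+2]]
• GCD test to solve the dependence equation 2i_d - 2i_u = -1
• Since 2 does not divide 1, there is no data dependence
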
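The GCD test on this slide can be written as a small function. This is a minimal sketch, assuming a linear dependence equation with integer coefficients; the function name and argument layout are hypothetical, and loop bounds are ignored, as in the plain GCD test.

```python
from functools import reduce
from math import gcd

def gcd_test_may_depend(coefficients, constant):
    """Return False when sum(coefficients[k] * x_k) = constant has no
    integer solution, which proves that there is no dependence."""
    g = reduce(gcd, (abs(c) for c in coefficients), 0)
    if g == 0:
        return constant == 0
    return constant % g == 0

# Equation from the slide: 2*i_d - 2*i_u = -1.
print(gcd_test_may_depend([2, -2], -1))
```

A result of True only means that a dependence cannot be ruled out by divisibility; a bounds-aware test is still needed to decide it.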
IV Recognition on SSA Forms [Cytron91, Wolfe92]

  I1 = 3
  M1 = 0
  do
    I2 = φ(I1,I3)
    J1 = φ(?,J3)
    K1 = φ(?,K2)
    L1 = φ(?,L2)
    M2 = φ(M1,M3)
    J2 = 3
    I3 = I2+1
    L2 = M2+1
    M3 = L2+2
    J3 = I3+J2
    K2 = 2*J3
  while (…)

[Figure: spanning tree of the SSA graph]

Recognized induction variables:
  I2(i) = 3+i
  J1(i) = 7+i
  K1(i) = 14+2i
  L2(i) = 1+3i
  M2(i) = 3i

Symbolic Differencing [Haghighat95]
• Use abstract interpretation to evaluate loop iterations and construct a symbolic difference table of the IV values

  do
    x = x+z
    y = z+1
    z = y+1
  while (…)

Closed forms:
  x(i) = x_0 + z_0·i + (i^2 - i)
  y(i) = z_0 + 2i + 1
  z(i) = z_0 + 2i

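The closed forms can be checked numerically by running the loop and taking differences of the recorded values. This is a sketch with concrete numbers standing in for the symbolic x_0 and z_0, so it illustrates the difference table rather than the abstract interpretation itself; the helper names are assumptions.

```python
def simulate(x0, z0, iterations):
    """Run the loop; record x and z at the start of iteration i and the
    value of y computed in iteration i."""
    x, z = x0, z0
    xs, ys, zs = [], [], []
    for _ in range(iterations):
        xs.append(x)
        zs.append(z)
        x = x + z
        y = z + 1
        z = y + 1
        ys.append(y)
    return xs, ys, zs

def differences(seq):
    """One column of the difference table: seq[i+1] - seq[i]."""
    return [b - a for a, b in zip(seq, seq[1:])]

X0, Z0 = 5, 3  # arbitrary initial values for the check
XS, YS, ZS = simulate(X0, Z0, 8)
print(differences(XS))               # first differences of x follow z
print(differences(differences(XS)))  # second differences are constant
```

A constant second difference is what identifies x as a quadratic induction variable.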
Pointer-to-Array Conversion [vanEngelen01, Franke01]

Lsp_az speech codec segment from ETSI with pointer updates:
  f += 2;
  lsp += 2;
  for (i = 2; i <= 5; i++) {
    *f = f[-2];
    for (j = 1; j < i; j++, f--)
      *f += f[-2] - 2*(*lsp)*f[-1];
    *f -= 2*(*lsp);
    f += i;
    lsp += 2;
  }

Lsp_az speech codec segment after pointer-to-array conversion:
  for (i = 0; i <= 3; i++) {
    f[i+2] = f[i];
    for (j = 0; j <= i; j++)
      f[i-j+2] += f[i-j] - 2*lsp[2*i+2]*f[i-j+1];
    f[1] -= 2*lsp[2*i+2];
  }

Note that all array index expressions are affine.

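The two code segments can be compared by simulating the pointers as integer offsets into Python lists. This is only an equivalence check on made-up integer data, not the conversion algorithm itself; the function names and inputs are assumptions.

```python
def pointer_version(f_init, lsp):
    """Mimic the pointer code; fp and lp are the offsets of f and lsp."""
    f = list(f_init)
    fp, lp = 2, 2                    # f += 2; lsp += 2;
    for i in range(2, 6):            # for (i = 2; i <= 5; i++)
        f[fp] = f[fp - 2]
        j = 1
        while j < i:                 # for (j = 1; j < i; j++, f--)
            f[fp] += f[fp - 2] - 2 * lsp[lp] * f[fp - 1]
            j += 1
            fp -= 1
        f[fp] -= 2 * lsp[lp]
        fp += i
        lp += 2
    return f

def array_version(f_init, lsp):
    """The converted code with affine array indices."""
    f = list(f_init)
    for i in range(0, 4):            # for (i = 0; i <= 3; i++)
        f[i + 2] = f[i]
        for j in range(0, i + 1):
            f[i - j + 2] += f[i - j] - 2 * lsp[2 * i + 2] * f[i - j + 1]
        f[1] -= 2 * lsp[2 * i + 2]
    return f

F_INIT = [1, 2, 3, 4, 5, 6]          # hypothetical data
LSP = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # hypothetical data
print(pointer_version(F_INIT, LSP) == array_version(F_INIT, LSP))
```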
Control-Flow Issues
• Conditional array accesses and conditionally updated induction variables present problems:

Conditional array accesses (assume RAW and WAR dependences):
  for (…) {
    if (…) A[I] = …
    else … = A[J]
  }

Conditional updates that agree (extensive analysis reveals that J := J+3):
  do {
    K = 3;
    K = K+J;
    if (…) J = K;
    else J = J+3;
    A[J] = …
  } while (J<N)

Conditional updates that differ (problem: J has no single recurrence form):
  DO I=1,10
    IF … J = J+2
    ELSE J = I
    ENDIF
    A(J) = …
  ENDDO

Chains of Recurrences for Compiler Optimization
• Chains of recurrences forms and algebra can be used to:
  • Detect (non)linear coupled IVs
  • Analyze pointer arithmetic
  • Effectively handle control flow
  • Implement array dependence testing

Chains of Recurrences
• A chain of recurrences (CR) represents a polynomial or exponential function, or a mix of both, evaluated over a unit-distance grid [Zima92]
• Basic form: {init, ⊙, stride}, where the operator ⊙ is + or *

Chains of Recurrences: General Formulation
• The key idea is to represent a non-constant CR stride in CR form itself, thereby forming a chain of recurrences
• Example: f(i) = i^2 = {0, +, s(i-1)} = {0, +, 1, +, 2}, where s(i-1) = {1, +, 2}

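A CR can be evaluated over the grid by keeping one running value per coefficient and folding each stride into the value to its left. The sketch below is a minimal evaluator written for illustration; the list-based representation and the name cr_values are assumptions, not the data structure used in the compilers mentioned in the talk.

```python
def cr_values(coefficients, operators, count):
    """Evaluate the CR {c0, op1, c1, ..., opk, ck} at i = 0 .. count-1.
    operators[k] is '+' or '*' and combines entry k with entry k+1."""
    state = list(coefficients)
    values = []
    for _ in range(count):
        values.append(state[0])
        # Update left to right so each entry uses the old value of its stride.
        for k in range(len(state) - 1):
            if operators[k] == "+":
                state[k] = state[k] + state[k + 1]
            else:
                state[k] = state[k] * state[k + 1]
    return values

print(cr_values([0, 1, 2], ["+", "+"], 6))   # the CR for i^2
print(cr_values([1, 2], ["*"], 6))           # the CR for 2^i
```

Each grid point costs one addition or multiplication per chain entry, which is why CR forms are also useful for fast function evaluation.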
CRs for Expediting Function Evaluations on Grids
• Suppose f(i) = a + b·i + c·i^2 = {a, +, {b+c, +, 2c}}
• We have two IVs x and y:
  f(i) = x = {x_0, +, y} with x_0 = a
  s(i) = y = {y_0, +, 2c} with y_0 = b+c
• Implement a loop that updates x and y for efficient evaluation of f(i) over a unit-distance grid i = 0, …, n:

  x = a
  y = b+c
  for i = 0 to n
    f[i] = x
    x = x+y
    y = y+2*c
  endfor

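The initial values a and b+c and the constant stride 2c follow from taking forward differences of f. The derivation below is a worked check added for illustration, using Δ for the forward difference operator, which is notation not used on the slide.

```latex
\begin{align*}
f(i) &= a + b\,i + c\,i^2, & f(0) &= a,\\
\Delta f(i) &= f(i+1) - f(i) = b + c\,(2i+1) = (b+c) + 2c\,i, & \Delta f(0) &= b + c,\\
\Delta^2 f(i) &= \Delta f(i+1) - \Delta f(i) = 2c.
\end{align*}
```

Reading off f(0), Δf(0), and the constant Δ²f gives the chain {a, +, b+c, +, 2c}.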
Multi-Dimensional Example
• Let f(i,j) = i^2 + i·j + 1
• Create IV k for f(i,j) in the j-loop:
  f(i,j) = k_j = {p_i, +, r_i}_j with p_i = i^2 + 1 and r_i = i
• Create IVs for p_i and r_i in the i-loop:
  p_i = {p_0, +, q_i}_i with p_0 = 1
  q_i = {q_0, +, 2}_i with q_0 = 1
  r_i = {r_0, +, 1}_i with r_0 = 0
• Implement k, p, q, and r in the i-j loop nest:

  p = 1
  q = 1
  r = 0
  for i = 0 to n
    k = p
    for j = 0 to m
      f[i,j] = k
      k = k+r
    endfor
    p = p+q
    q = q+2
    r = r+1
  endfor

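The loop nest can be run as written and compared with direct evaluation of f(i,j). This is a sketch in Python; the dictionary used for f and the function name are assumptions made for the check.

```python
def evaluate_on_grid(n, m):
    """Fill f[i, j] = i^2 + i*j + 1 using only additions, following the
    induction variables k, p, q, and r of the example."""
    p, q, r = 1, 1, 0
    f = {}
    for i in range(n + 1):
        k = p                      # k starts at p_i = i^2 + 1
        for j in range(m + 1):
            f[i, j] = k
            k = k + r              # stride r_i = i in the j direction
        p = p + q                  # p follows {1, +, q}
        q = q + 2                  # q follows {1, +, 2}
        r = r + 1                  # r follows {0, +, 1}
    return f

GRID = evaluate_on_grid(6, 4)
print(GRID[3, 2])
```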
CR Construction with the CR Algebra
• To construct the CR form of a symbolic function f(i):
  • Replace i with the CR {0, +, 1}
  • Apply the CR algebra rewrite rules
• Example: f(i) = c·(i+a) = c·({0, +, 1} + a) = c·{a, +, 1} = {c·a, +, c}

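The rules used in the example can be sketched for CRs that contain only + operators, represented here as plain coefficient lists. This covers just three linear rules (adding a constant, scaling by a constant, adding two CRs) and is an assumption-laden illustration, not the full rewrite system; all function names are hypothetical.

```python
def cr_add_const(cr, c):
    """{x, +, y, ...} + c  =  {x + c, +, y, ...}"""
    return [cr[0] + c] + list(cr[1:])

def cr_scale(cr, c):
    """c * {x, +, y, ...}  =  {c*x, +, c*y, ...}"""
    return [c * v for v in cr]

def cr_add(p, q):
    """Sum of two pure-sum CRs: add coefficients position by position."""
    size = max(len(p), len(q))
    p = list(p) + [0] * (size - len(p))
    q = list(q) + [0] * (size - len(q))
    return [u + v for u, v in zip(p, q)]

def cr_values(cr, count):
    """Evaluate a pure-sum CR at i = 0 .. count-1."""
    state, values = list(cr), []
    for _ in range(count):
        values.append(state[0])
        for k in range(len(state) - 1):
            state[k] += state[k + 1]
    return values

I = [0, 1]                                    # the CR {0, +, 1} for i
A, C = 5, 3                                   # sample values for a and c
F = cr_scale(cr_add_const(I, A), C)           # c*(i + a)
G = cr_add(cr_scale(I, 3), [0, 1, 2])         # 3*i + i^2
print(F, G)
```

With a = 5 and c = 3 the result F is the list form of {c·a, +, c}.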
Loop Analysis with CR Forms [vanEngelen01]
• The basic idea:
  • Scan the loop to detect IV updates
  • Construct the CR form for each IV using the CR algebra

Algorithm 1: Find Recurrences
• Input: Loop L with live variable information
• Output: Set S of recurrence relations of IVs
• Start with the set S = { ⟨v, v⟩ | v is live at the loop header }
• Search L from bottom to top:
  for each assignment v = x of expression x to scalar variable v
    update the tuples ⟨u, y⟩ in S by replacing v in y with x

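The bottom-to-top substitution can be sketched on a straight-line loop body with expressions held as strings. This is a simplification for illustration: it assumes no control flow inside the loop, uses textual substitution on whole identifiers, and the function name is hypothetical.

```python
import re

def find_recurrences(live_variables, assignments):
    """assignments is the loop body, top to bottom, as (variable, expression)
    pairs. Returns a dict mapping each live variable v to the expression
    that gives its value after one iteration."""
    relations = {v: v for v in live_variables}        # the tuples <v, v>
    for target, expression in reversed(assignments):  # bottom to top
        pattern = re.compile(r"\b%s\b" % re.escape(target))
        replacement = "(%s)" % expression
        relations = {u: pattern.sub(lambda match: replacement, y)
                     for u, y in relations.items()}
    return relations

# Body "x = x+i; i = i+1" and the same statements in the opposite order.
FIRST = find_recurrences(["x", "i"], [("x", "x+i"), ("i", "i+1")])
SECOND = find_recurrences(["x", "i"], [("i", "i+1"), ("x", "x+i")])
print(FIRST)
print(SECOND)
```

In the second ordering the update of x sees the already incremented i, which the substitution captures as x+(i+1).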
Algorithm 2: Compute CR Forms
• Input: Set S with recurrence relations
• Output: CR forms for the IVs in S
• For each relation ⟨v, x⟩ in S do:
  if x is of the form v then v = v_0 (v is loop invariant)
  if x is of the form v + y then v = {v_0, +, y}
  if x is of the form v * y then v = {v_0, *, y}
  if x does not contain v then v = {v_0, #, x} (v is wrap-around)
• Simplify the CR forms with the CR algebra rewrite rules

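The case analysis can be sketched over small expression trees. The tuple encoding ('+', left, right), the naming of the initial value as v followed by "0", and the function names are assumptions for illustration; the simplification step with the CR algebra is left out.

```python
def contains(expression, variable):
    """True if variable occurs anywhere in the expression tree."""
    if isinstance(expression, tuple):
        return any(contains(part, variable) for part in expression[1:])
    return expression == variable

def cr_form(variable, expression):
    """Classify the relation <variable, expression> as in the case list.
    Returns the initial-value name, a (v0, operator, stride) triple,
    or None when no case applies."""
    initial = variable + "0"
    if expression == variable:
        return initial                                 # loop invariant
    if isinstance(expression, tuple) and expression[0] in ("+", "*"):
        operator, left, right = expression
        if left == variable and not contains(right, variable):
            return (initial, operator, right)
        if right == variable and not contains(left, variable):
            return (initial, operator, left)
    if not contains(expression, variable):
        return (initial, "#", expression)              # wrap-around
    return None

print(cr_form("i", ("+", "i", 1)))
print(cr_form("j", "i"))
```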
Algorithm 3: Solve
• Input: CR forms for IVs
• Output: Closed-form solutions for IVs (when possible)
• For each CR form of v, apply the CR inverse algebra, assuming the loop is normalized for i = 0, …, n
• Certain “exotic” mixed non-polynomial and non-exponential CR forms may not have closed forms

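For CRs that contain only + operators with constant coefficients, a closed form can be written with binomial coefficients (Newton's forward-difference formula). The sketch below uses that identity as a stand-in; it is not claimed to be the CR inverse algebra of the talk, and the function names are hypothetical.

```python
from math import comb

def closed_form_value(coefficients, i):
    """Value at i of the pure-sum CR {c0, +, c1, +, ..., +, ck}:
    the sum over k of c_k * binomial(i, k)."""
    return sum(c * comb(i, k) for k, c in enumerate(coefficients))

def iterate(coefficients, count):
    """Reference evaluation by running the recurrences."""
    state, values = list(coefficients), []
    for _ in range(count):
        values.append(state[0])
        for k in range(len(state) - 1):
            state[k] += state[k + 1]
    return values

print([closed_form_value([0, 1, 2], i) for i in range(6)])
```

For {0, +, 1, +, 2} the formula gives i + 2·i(i-1)/2 = i^2, matching the earlier example.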
Example 1

  do i=0,2*N-2
    A(i*i-i+2) = A(2*i)
  end do

Example 2

TRFD code segment from the Perfect Benchmark with IV updates:
  DO I=1,M
    DO J=1,I
      ij = ij+1
      ijkl = ijkl+I-J+1
      DO K=I+1,M
        DO L=1,K
          ijkl = ijkl+1
          xijkl[ijkl]=xkl[L]
        ENDDO
      ENDDO
      ijkl = ijkl+ij+left
    ENDDO
  ENDDO

TRFD after aggressive induction variable substitution (IVS):
  DO I=0,M-1
    DO J=0,I
      DO K=0,M-I-2
        DO L=0,I+K+1
          tmp = ijkl+L+I*(K+(M+M*M+2*left+6)/4)+J*(left+(M+M*M)/2)+((I*I*M*M)+2*(K*K+3*K+I*I*(left+1))+M*I*I)/4+2
          xijkl[tmp] = xkl[L+1]
        ENDDO
      ENDDO
    ENDDO
  ENDDO

Example 3 (SSA)

Source:
  a = 1;
  while (a<10) {
    x = a+2;
    a = a+1;
  }

SSA form:
      a0 = 1
      if (a0>=10) goto L2
  L1: a1 = φ(a0, a2)
      x0 = a1 + 2
      a2 = a1+1
      if (a2<10) goto L1
  L2:

[Figure: SSA graph over a0, a1, a2, and x0]

Result: a1 = {1,+,1}
• GCC 4.x uses our approach applied to SSA form
• Note: GCC developers refer to CRs as “scalar evolutions”

Example 4 (SSA)

Source:
  x = 0;
  i = 1;
  while (i<10) {
    x = x+i;
    i = i+1;
  }

SSA form:
      x0 = 0
      i0 = 1
      if (i0>=10) goto L2
  L1: x1 = φ(x0, x2)
      i1 = φ(i0, i2)
      x2 = x1+i1
      i2 = i1+1
      if (i2<10) goto L1
  L2:

[Figure: SSA graph over i0, i1, i2, x0, x1, and x2]

Result:
  i1 = {1,+,1}
  x1 = {0,+,i1} = {0,+,1,+,1}

Example 5 (SSA)

Source:
  j = 0;
  i = 1;
  while (i<10) {
    if (p) j = j+2;
    else j = j+3;
    i = i+1;
  }

SSA form:
      j0 = 0
      i0 = 1
      if (i0>=10) goto L2
  L1: i1 = φ(i0, i2)
      j1 = φ(j0, j4)
      if (!p) goto L3
      j2 = j1+2
      goto L4
  L3: j3 = j1+3
  L4: j4 = φ(j2, j3)
      i2 = i1+1
      if (i2<10) goto L1
  L2:

[Figure: SSA graph over j0, j1, j2, j3, and j4]

Result: {0,+,2} ≤ j1 ≤ {0,+,3}

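The bounding CRs can be checked by running the loop under every possible sequence of branch outcomes. This brute-force sketch is only a sanity check of the bounds {0,+,2} and {0,+,3}, whose values at iteration t are 2t and 3t; the function name is an assumption.

```python
from itertools import product

def bounds_hold(iterations=9):
    """Try every true/false pattern for the condition p and check that j,
    read at the loop header of iteration t, stays between 2*t and 3*t."""
    for outcomes in product([True, False], repeat=iterations):
        j = 0
        for t, p in enumerate(outcomes):
            if not (2 * t <= j <= 3 * t):
                return False
            j = j + 2 if p else j + 3
    return True

print(bounds_hold())
```

The lower bound is reached when p is always true and the upper bound when p is always false, so the two CRs are tight.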
Recognizing Mixed Functional Forms and Reductions

Pointer Access Descriptions of Pointer and Array References
• A pointer access description (PAD) [vanEngelen01] is a CR form of a pointer or array reference in a loop nest
• PADs are computed with the CR-based IV algorithms

  short a[…], *p;
  int i;
  p = a;
  for (i=0; …; i++) { }

CR-Enhanced Array Dependence Testing
• Basic idea: construct dependence equations in CR form for both pointer and array accesses
• Determine the solution intervals by computing the value ranges of the equations in CR form
• If the solution space is empty, there is no dependence

Example

  float a[…], *p, *q;
  p = a;
  q = a+2*n;
  for (i=0; i<n; i++) {
    t = *p;
  S: *p++ = *q;
    *q-- = t;
  }

CR forms of the pointers:
  p = {a, +, 1}
  q = {a+2n, +, -1}

Dependence equation:
  {a, +, 1}_{i_d} = {a+2n, +, -1}_{i_u}

Constraints:
  0 ≤ i_d ≤ n-1
  0 ≤ i_u ≤ n-1

Rewrite the dependence equation:
  {a, +, 1}_{i_d} = {a+2n, +, -1}_{i_u}
  {a, +, 1}_{i_d} - {a+2n, +, -1}_{i_u} = 0
  {{-2n, +, 1}_{i_u}, +, 1}_{i_d} = 0

Compute the solution interval:
  Low[{{-2n, +, 1}_{i_u}, +, 1}_{i_d}] = Low[{-2n, +, 1}_{i_u}] = -2n
  Up[{{-2n, +, 1}_{i_u}, +, 1}_{i_d}] = Up[{-2n, +, 1}_{i_u} + n-1] = Up[-2n + 2n - 2] = -2

The interval [-2n, -2] does not contain 0: no dependence

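The interval computation can be reproduced numerically for concrete n. In the sketch below the rewritten left-hand side is taken to be h(i_d, i_u) = -2n + i_u + i_d, which is the value of {{-2n,+,1},+,1} at (i_u, i_d); the brute-force search and the function names are additions made for the check.

```python
def solution_interval(n):
    """Lower and upper bound of h = -2n + i_u + i_d over the constraints
    0 <= i_d <= n-1 and 0 <= i_u <= n-1. Both strides are +1, so h is
    increasing in each index and the extremes sit at the corners."""
    low = -2 * n + 0 + 0
    up = -2 * n + (n - 1) + (n - 1)
    return low, up

def dependence_by_enumeration(n):
    """Check directly whether p at iteration i_d can equal q at i_u."""
    return any(i_d == 2 * n - i_u
               for i_d in range(n) for i_u in range(n))

for n in (1, 5, 20):
    print(n, solution_interval(n), dependence_by_enumeration(n))
```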
Determining the Value Range of a CR Form
• Suppose x(i) = {x_0, +, s(i-1)} for i = 0, …, n
• If s(i-1) > 0 then x(i) is monotonically increasing
• If s(i-1) < 0 then x(i) is monotonically decreasing
• If a function is monotonic on its domain, then it is trivial to find its exact value range

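The monotonicity argument can be sketched for pure-sum CRs with numeric coefficients. The sign check used here (all stride coefficients non-negative, or all non-positive) is a simple sufficient condition chosen for illustration, and it treats a zero stride as monotonic too; the function names are hypothetical.

```python
def cr_values(coefficients, count):
    """Evaluate a pure-sum CR at i = 0 .. count-1."""
    state, values = list(coefficients), []
    for _ in range(count):
        values.append(state[0])
        for k in range(len(state) - 1):
            state[k] += state[k + 1]
    return values

def cr_value_range(coefficients, n):
    """Return (low, up) of the CR over i = 0 .. n when monotonicity can be
    established from the signs of the stride coefficients, else None."""
    first = coefficients[0]
    last = cr_values(coefficients, n + 1)[-1]
    stride = coefficients[1:]
    if all(c >= 0 for c in stride):
        return (first, last)       # non-decreasing: range is [x(0), x(n)]
    if all(c <= 0 for c in stride):
        return (last, first)       # non-increasing: range is [x(n), x(0)]
    return None                    # sign of the stride is not decided

print(cr_value_range([0, 1, 2], 5))
print(cr_value_range([0, -3, 2], 5))
```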
Example: Nonlinear and Symbolic Dependence Testing

Nonlinear pointer accesses:
  float a[…], *p, *q;
  p = q = a;
  for (i=0; i<n; i++) {
    for (j=0; j<=i; j++)
      *q += *++p;
    q++;
  }

  p = {{a+1, +, 1, +, 1}_i, +, 1}_j = a[(i^2+i)/2+j+1]
  q = {a, +, 1}_i = a[i]

The CR dependence test disproves flow dependence (<, <)

Symbolic array accesses:
  DO i = 1, M+1
  S1: A[I*N+10] = ...
  S2: ... = A[2*I+K]
      K = 2*K+N
  ENDDO

  S1: A[{N+10, +, N}_i]
  S2: A[{K_0+2N, +, K_0+N+2, *, 2}_i]

The CR range test disproves dependence when K+N > 10 and K > 2

Results
• Implemented a CR-enhanced trapezoidal Banerjee test
  • Relatively simple test
  • Enhanced with support for nonlinear forms
  • Enhanced with support for conditional flow
  • Construct dependence equations in CR form
• Implementation based on the Polaris compiler
  • Pros: can compare to powerful dependence tests such as the Omega test and the Range test
  • Cons: Fortran only

Additional Independences Filtered over Omega Test
[Figure panels: Perf. Benchmark, LAPACK]

Additional Independences Filtered over Range Test

Additional Independences Filtered over Omega+Range

Percentage of Conditional IVs w/o Closed Forms in LAPACK

Timing Comparison: Perf. Bench.

Timing Comparison: LAPACK

Conclusions
• A CR-based compiler framework has advantages:
  • Applicable to CFG, AST, and SSA forms
  • Handles conditional flow
  • Handles nonlinear and symbolic induction variable expressions
  • Allows array and pointer-based dependence testing to be applied directly to the CR forms without induction variable substitution
• Future work:
  • Improve the GCC implementation
  • Enhance other dependence tests with CR forms

Further Reading
• Robert van Engelen, Johnnie Birch, Yixin Shou, Burt Walsh, and Kyle Gallivan, “A Unified Framework for Nonlinear Dependence Testing and Symbolic Analysis”, in Proceedings of the ACM International Conference on Supercomputing (ICS), 2004, pages 106-115.
• Robert van Engelen, Johnnie Birch, and Kyle Gallivan, “Array Dependence Testing with the Chains of Recurrences Algebra”, in Proceedings of the IEEE International Workshop on Innovative Architectures for Future Generation High-Performance Processors and Systems (IWIA), January 2004, pages 70-81.
• Robert van Engelen and Kyle Gallivan, “An Efficient Algorithm for Pointer-to-Array Access Conversion for Compiling and Optimizing DSP Applications”, in Proceedings of the 2001 International Workshop on Innovative Architectures for Future Generation High-Performance Processors and Systems (IWIA), January 2001, pages 80-89.
• Robert van Engelen, “Efficient Symbolic Analysis for Optimizing Compilers”, in Proceedings of the International Conference on Compiler Construction, ETAPS 2001, LNCS 2027, pages 118-132.

The End