2004
Recurrence Chain Partitioning of Non-Uniform Dependences
Yijun Yu, Erik H. D’Hollander
Aug 15-18, Montreal, Canada
Overview
• Dependence and Parallelism
• Non-Uniform Loop Dependences
• Recurrence Chains Partitioning
• Related work
• Implementations
• Experiment Results
• Summary
1. Background: Dependence vs. Parallelism
Program (sequential):
    DO I = 1,3
      A(I) = A(I-1)
    ENDDO
Program (parallel):
    DOALL I = 1,3
      A(I) = A(I-1)
    ENDDO
Execution trace in loop order: A(1) = A(0); A(2) = A(1); A(3) = A(2)
Execution trace in a different order: A(2) = A(1); A(1) = A(0); A(3) = A(2)
[Figure: shared memory contents after each step of the two execution traces]
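The effect of the dependence can be checked with a small simulation. This is only a sketch: the initial memory contents (0, 1, 2, 3) and the helper name `run` are assumptions made for illustration, and a DOALL is modelled simply as "any execution order is allowed".

```python
from itertools import permutations

def run(order, memory=(0, 1, 2, 3)):
    """Execute A(I) = A(I-1) for I taken in the given order.

    The initial memory contents are an assumption for illustration.
    """
    a = list(memory)
    for i in order:
        a[i] = a[i - 1]
    return a

# DO loop: iterations run in index order.
sequential = run([1, 2, 3])

# One order a DOALL might pick (the second trace on the slide).
reordered = run([2, 1, 3])

# All final memory states reachable when the order is unconstrained.
outcomes = {tuple(run(p)) for p in permutations([1, 2, 3])}

print(sequential, reordered, len(outcomes))
```

Because the final memory depends on the order, the loop carries a dependence and cannot be turned into a DOALL as written.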
The CFD application @ WTCM
• Computational Fluid Dynamics (CFD): Navier-Stokes equations
• Successive Over-Relaxation (SOR)
• 3D geometry + 1D time
[Figure: temperature]
Visualized uniform dependences and transformations for the 4D loop
[Figure panels: Before transformation; After transformation]
• A 3-D unimodular transformation was found after visualizing the 4D loop nest, which has 177 array references per iteration at run time.
• Here we use a regular shape.
• The transformation makes it possible to speed up the program by about $N^2/6$ times, where $N$ is the diameter of the geometry (Yu, Parco99).
2. Non-uniform dependences
• Uniform loop dependences
  • Dependent iterations lie at a uniform distance from each other in the iteration space: a set of distance vectors can predict the dependences and indicate the affine index loop transformation that reveals the maximal loop parallelism.
• Non-uniform dependences
  • Irregular; can be caused by complex subscripts, compile-time unknowns, etc.
  • But not rare: in the SPECfp95 benchmarks, 46% of nested loops and 12.8% of the coupled subscripts
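The difference between the two kinds of dependence can be made concrete by enumerating distance vectors by brute force. The sketch below compares the loop A(I) = A(I-1) with the loop from section 5.1, a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1). The function names and the bounds N = N1 = N2 = 10 are assumptions chosen for illustration.

```python
def uniform_distances(n):
    """Distances for A(I) = A(I-1), I = 1..n."""
    writer = {i: i for i in range(1, n + 1)}  # element A(i) is written by iteration i
    return {j - writer[j - 1] for j in range(1, n + 1) if (j - 1) in writer}

def nonuniform_distances(n1, n2):
    """Distance vectors for a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1)."""
    # Map each written array element to the iteration that writes it.
    writer = {(3 * i1 + 1, 2 * i1 + i2 - 1): (i1, i2)
              for i1 in range(1, n1 + 1) for i2 in range(1, n2 + 1)}
    found = set()
    for j1 in range(1, n1 + 1):
        for j2 in range(1, n2 + 1):
            src = writer.get((j1 + 3, j2 + 1))  # who wrote what (j1, j2) reads?
            if src is not None and src != (j1, j2):
                found.add((j1 - src[0], j2 - src[1]))
    return found

print(uniform_distances(10))
print(sorted(nonuniform_distances(10, 10)))
```

The first loop has a single distance, so one distance vector describes it. In the second loop the distance changes with the position in the iteration space, so no fixed set of distance vectors summarizes it independently of the loop bounds.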
Non-uniform dependences: tip of the iceberg
Irregular dependence
• Dependences have non-uniform distance
• Parallelism analysis: 200 iterations over 15 data flow steps
• Speedup: 13.3
Problem: how to exploit it?
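The speedup figure is the iteration count divided by the number of data flow steps (200 / 15 ≈ 13.3). A minimal sketch of that analysis follows, assuming the dependences are given as (source, sink) pairs of iteration numbers in sequential execution order. The function names and the small example graph are hypothetical, not taken from the slides.

```python
def dataflow_steps(n_iters, deps):
    """Number of data flow steps (longest dependence path, counted in iterations).

    deps holds (src, dst) pairs of 0-based iteration numbers with src < dst.
    """
    level = [1] * n_iters
    # Edges are processed by increasing sink, so level[src] is final when used.
    for src, dst in sorted(deps, key=lambda e: e[1]):
        level[dst] = max(level[dst], level[src] + 1)
    return max(level, default=0)

def ideal_speedup(n_iters, steps):
    """Speedup when every data flow step runs fully in parallel."""
    return n_iters / steps

# Hypothetical 6-iteration example with two chains of length 3.
steps = dataflow_steps(6, [(0, 1), (1, 3), (0, 2), (2, 5)])
print(steps, ideal_speedup(6, steps))

# The numbers quoted on the slide.
print(round(ideal_speedup(200, 15), 1))
```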
3. Recurrence Chain Partitioning: Research objectives
If DO loops fail to reveal the optimal parallelism for irregular dependences, can one use WHILE loops?
• WHEN can one apply WHILE loops?
• HOW to construct WHILE loops?
• WHAT to do when one cannot apply WHILE loops?
• HOW MUCH can be achieved (for evaluation purposes)?
3.1 How to generate code?
    DOALL I = INIT(I)
      WHILE !TERMINATE(I) DO
        S(I)
        I = NEXT(I)
      END DO
    ENDDOALL
• INIT(I) = ?
• TERMINATE(I) = ?
• NEXT(I) = ?
3.2 Solving recurrence equations in the unified iteration space
• Dependence equations: $iA + a = jB + b$
• Recurrence equations: $j = iT + t$ or $i = (j - t)T^{-1} = jT^{-1} - tT^{-1}$, where
  • $T = AB^{-1}$
  • $t = (a - b)B^{-1}$
• A recurrence chain is a sequence of dependent iterations such that
  • $i_{k+1} = i_k T + t$ or $i_{k+1} = (i_k - t)T^{-1}$
  • $i_0 = \{\, i \mid \nexists\, j : iA + a = jB + b \ \text{or}\ iB + b = jA + a \,\}$
• The dependence distance $d_k = i_{k+1} - i_k$ is variable:
  • $d_{k+1} = d_k T$ or $d_k = d_{k+1} T^{-1}$
• $d$ is not constant; it grows exponentially with base $\alpha = \max(1/|T|, |T|)$, so the dependence chain length is $O(\log_\alpha L)$, where $L$ is the diameter of the iteration space
• When $|T|$ is negative, one can cut the recurrence chain to 2 iterations by lexicographical ordering
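As a worked instance of these formulas, the sketch below computes T and t for the loop from section 5.1, a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1). It assumes the row-vector convention of the slide, takes (A, a) from the written reference and (B, b) from the read reference (the slides do not say which is which), and handles only 2-D loops. All helper names are hypothetical.

```python
from fractions import Fraction

def inverse_2x2(m):
    """Exact inverse of a 2x2 matrix given as nested lists."""
    (p, q), (r, s) = m
    det = Fraction(p * s - q * r)
    if det == 0:
        raise ValueError("B is singular; the recurrence form does not apply")
    return [[s / det, -q / det], [-r / det, p / det]]

def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[k] * m[k][c] for k in range(2)) for c in range(2)]

def mat_mat(x, y):
    return [vec_mat(row, y) for row in x]

def recurrence(A, a, B, b):
    """Return (T, t) with T = A B^-1 and t = (a - b) B^-1."""
    B_inv = inverse_2x2(B)
    T = mat_mat(A, B_inv)
    t = vec_mat([ai - bi for ai, bi in zip(a, b)], B_inv)
    return T, t

def step(i, T, t):
    """One link of the chain: j = iT + t."""
    return [x + y for x, y in zip(vec_mat(i, T), t)]

# Write a(3*I1+1, 2*I1+I2-1): (I1, I2) A + a
A, a = [[3, 2], [0, 1]], [1, -1]
# Read a(I1+3, I2+1): (I1, I2) B + b
B, b = [[1, 0], [0, 1]], [3, 1]

T, t = recurrence(A, a, B, b)
print(T, t, step([2, 1], T, t))
```

With these assumptions, iteration (2, 1) writes a(7, 4) and the computed successor (4, 3) reads a(7, 4), which matches the dependence equation.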
3.3 Generate code?
    DOALL I = i0
      WHILE (I is in Iteration Space) DO
        S(I)
        I = IT+t  or  I = (I-t)T^-1
      ENDDO
    ENDDOALL
• Problem: how to tell which index update respects the dependence order?
[Figure: iteration space with axes I1 and I2 and regions R1, R2, R3, R4. Labels: initial set, intermediate set, final set, independent, cyclic. Chain points i0, i1, i2, i3, i4 are marked as integer or non-integer.]
3.3 Generate code!
    DOALL I in P1
      IF (IT+t < I)
        T = T^-1; t = -tT
      ENDIF
      WHILE (I is in Iteration Space) DO
        S(I)
        I = IT+t
      ENDDO
    ENDDOALL
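A minimal sketch of this scheme for the loop from section 5.1 is given below. Several things are assumptions made for illustration: the bounds N1 = N2 = 10, the closed forms of `next_iter` and `prev_iter` (derived by hand for that loop with the written reference as source), the definition of a chain head as an iteration with no predecessor inside the iteration space, and the guard against fixed points (iterations that depend only on themselves), which the pseudocode does not show. The chains are executed one after another here; in a DOALL each chain would go to a separate processor.

```python
from fractions import Fraction

N1, N2 = 10, 10  # assumed loop bounds

def next_iter(i):
    """j = iT + t for a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1)."""
    return (3 * i[0] - 2, 2 * i[0] + i[1] - 2)

def prev_iter(j):
    """i = (j - t)T^-1; the result may be non-integer."""
    i1 = Fraction(j[0] + 2, 3)
    return (i1, j[1] + 2 - 2 * i1)

def in_space(i):
    """True for integer points inside the loop bounds."""
    return (all(Fraction(x).denominator == 1 for x in i)
            and 1 <= i[0] <= N1 and 1 <= i[1] <= N2)

def is_head(i):
    """A chain head has no predecessor other than itself in the space."""
    p = prev_iter(i)
    return p == i or not in_space(p)

def chain(head):
    """The WHILE loop: follow I = IT + t until leaving the space."""
    i = head
    while in_space(i):
        yield i              # S(I) would execute here
        nxt = next_iter(i)
        if nxt == i:         # fixed point: stop instead of looping forever
            break
        i = nxt

space = [(i1, i2) for i1 in range(1, N1 + 1) for i2 in range(1, N2 + 1)]
chains = [list(chain(h)) for h in space if is_head(h)]

print(len(chains), max(len(c) for c in chains))
print(list(chain((2, 1))))
```

For this loop the successor is never lexicographically earlier than the current iteration, so the direction test in the pseudocode would never swap T for its inverse. The chains stay short because the first index roughly triples at every step.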
4. Related work: Strength of REC (1): Scalability
• LEN = length of the chain
• In comparison, unique-set oriented methods have to deal with LEN = 2, 3, … differently…
• In REC, the WHILE loops adjust their steps automatically…
4. Related work: Strength of REC (2): Outermost loop parallelism
• Set-oriented:
    DOALL I in P1
      S(I)
    DOALL I in P2
      S(I)
    …
    DOALL I in Pn
      D(I)
• Recurrence chain:
    DOALL I in P1
      IF (I > IT+t)
        T = T^-1; t = -tT
      ENDIF
      WHILE (I in IS) DO
        S(I)
        I = IT+t
      ENDDO
    ENDDOALL
4. Related work: Shortcoming and alternatives
• Restriction on the number of dependence equations
• Fall back to the following algorithms:
  • A recursive 3-sets partitioning (3P), similar to unique-sets partitioning but more accurate: it can reuse the calculations for P1, P2, P3.
  • PDM and other uniformization techniques
• PDM is lightweight and can be applied first, followed by 3P.
Loop Partitioning: goal model
[Figure: goal model]
REC
[Figure: goal model for REC; legend: sat, den, fully, partly]
3Region
[Figure: goal model for 3Region; legend: sat, den, fully, partly]
PDM
[Figure: goal model for PDM; legend: sat, den, fully, partly]
4. Implementations
Front end: source-to-source transformations
• PDM/PL in FPT
• Set-oriented algorithms in FPT <-> XML/XSLT <-> OC
Back end
• Intel Fortran compiler + OpenMP directives
Experiments on an EPICMP 4-CPU server
5. Results
5.1 Yu, ICPP00
    DO I1=1,N1
      DO I2=1,N2
        a(3*I1+1,2*I1+I2-1) = a(I1+3,I2+1)
      ENDDO
    ENDDO
5.1 Non-full-rank PDM
[Figure: axis labels i2, j1, j2]
5.2 Ju, 1997’s example
    DO I=1,N
      DO J=1,N
        a(2*I+3,J+1) = …
        … = a(I+2*J+1,I+J+3)
      ENDDO
    ENDDO
det(PDM) = 2
UNIQUE vs REC partitioning
[Figure: the two partitionings side by side; figure labels: 13, 2]
Ju’s example: Comparison
• We corrected the loop-bounds flaw in Ju’s 97 paper; 5 unique sets were derived for this case when N = 12.
• But theoretically O(2^(log2 N)) = O(N) UNIQUE sets are needed
• In REC partitioning, just one set P1 needs to be calculated for the initial i0
5.3 Chen, 96’s example
    DO I=1,N
      DO J=1,I
        DO K=J,I
          ... = a(I+2*K+5,4*K-J)
        ENDDO
        a(I-J,I+J) = ...
      ENDDO
    ENDDO
Chen’s example: A special case
• It is a non-perfectly nested loop
• First convert it into the unified iteration space
• Then symbolically calculate P1, P2, P3 and find that P2 is empty
• Therefore the recurrence chains are at most 1 iteration long, regardless of the loop bounds
• Both REC and three-region partitioning lead to the same optimal solution
5.4 Cholesky kernel: loop fusion (I,K,J,L)
After fusion:
          DO 1 I = 0,NRHS
          DO 1 K = 0,2*N+1
            IF (K.LE.N) THEN
              I0 = MIN(M,N-K)
            ELSE
              I0 = MIN(M,2*N-K+1)
            ENDIF
          DO 1 J = 0,I0
    C$DOISV
          DO 1 L = 0,NMAT
            IF (K.LE.N) THEN
              IF (J.EQ.0) THEN
     8          B(I,L,K)=B(I,L,K)*A(L,0,K)
              ELSE
     7          B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
              ENDIF
            ELSE
              IF (J.EQ.0) THEN
     9          B(I,L,K)=B(I,L,K)*A(L,0,K)
              ELSE
     6          B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
              ENDIF
            ENDIF
     1    CONTINUE
The original kernel:
          DO 6 I = 0, NRHS
          DO 7 K = 0, N
          DO 8 L = 0, NMAT
     8    B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 7 J = 1, MIN (M, N-K)
          DO 7 L = 0, NMAT
     7    B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
          DO 6 K = N, 0, -1
          DO 9 L = 0, NMAT
     9    B(I,L,K) = B(I,L,K) * A(L,0,K)
          DO 6 J = 1, MIN (M, K)
          DO 6 L = 0, NMAT
     6    B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)
After loop fusion: recursive three-region partitioning
[Figure: partitioned iteration space of the fused loop]
6. Summary
[Figure labels: PDM, 3R, REC]
• Recurrence chain partitioning is scalable to any size of the iteration space
• REC partitioning reveals outermost parallelism, with no synchronization between partitioned regions
• The limitation of REC partitioning and its compensation: we provide fall-back alternatives if REC cannot be applied
  (1) PDM + minimal distance (always applicable)
  (2) Recursive three-region partitioning (applicable for constant loop bounds and, in some cases such as Chen’s example, for any loop bounds)