This presentation introduces the Value Evolution Graph (VEG) and applies it to automatic parallelization, in particular to loops whose recurrences cannot be replaced by a closed form.
The Value Evolution Graph and Its Applications to Automatic Parallelization
Silvius Rus, Dongmin Zhang, and Lawrence Rauchwerger

Motivating Example: Parallelization

Sample Code
    q = 0
    DO i = 1, 100
      q = q+1
      B(q) = 1
    ENDDO

• Classic Solution
  • Induction Variable Substitution: q → f(i) = i
  • Dependence Test: 1 ≤ i1 ≤ 100, 1 ≤ i2 ≤ 100, i1 ≠ i2, f(i1) = f(i2)

Automatically Parallelized
    !$OMP PARALLEL DO
    DO i = 1, 100
      B(i) = 1
    ENDDO
    q = 100

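The dependence test on this slide can be checked by brute force for such a small loop. The following is a minimal sketch, assuming the subscript function f(i) = i obtained from induction variable substitution; `has_output_dependence` is a hypothetical helper written for illustration, not something from the presentation or from any compiler.

```python
from itertools import combinations

def has_output_dependence(f, lo, hi):
    """Return True if two distinct iterations in [lo, hi] write the same element."""
    return any(f(i1) == f(i2) for i1, i2 in combinations(range(lo, hi + 1), 2))

# Subscript of B after q is replaced by its closed form f(i) = i.
independent = not has_output_dependence(lambda i: i, 1, 100)
```

A subscript such as `lambda i: i // 2` would make the same check report a dependence, because two iterations then map to the same element.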
Motivating Example: Parallelization

Sample Code
    1  old = p
    2  q = 0
    3  DO i = 1, old
    4    q = q+1
    5    B(q) = 1
    6    IF (A(q).GT.0)
    7      p = p+1
    8      A(p) = 0
    9    ENDIF
    10 ENDDO

q is substituted with closed form q = i
p cannot be substituted with a closed form

After Induction Variable Recognition/Substitution
    1  old = p
    2
    3  DO i = 1, old
    4
    5    B(i) = 1
    6    IF (A(i).GT.0)
    7      p = p+1
    8      A(p) = 0
    9    ENDIF
    10 ENDDO

Dependence types: Anti, Flow, Output
Array B is independent
Array A is dependent

Motivating Example: Parallelization

After Induction Variable Recognition/Substitution
    1  old = p
    2
    3  DO i = 1, old
    4
    5
    6    IF (A(i).GT.0)
    7      p = p+1
    8      A(p) = 0
    9    ENDIF
    10 ENDDO

Dependences on array A:
  Anti:   p(8) and [1:old]
  Flow:   p(8) and [1:old]
  Output: p(8) non-repeating

Recurrence Properties: STEP
    old = p
    DO i = 1, old
      IF (A(i).GT.0)
        p = p+1
        A(p) = 0
      ENDIF
    ENDDO

Cross-iteration mutually independent if p is strictly increasing, or
step(p|i=k, p|i=k+1) > 0, k ∈ [1:old]

Recurrence Properties: IMAGE
    old = p
    DO i = 1, old
      IF (A(i).GT.0)
        p = p+1
        A(p) = 0
      ENDIF
    ENDDO

Independent if p and i belong to disjoint sets, or
image(p | i ∈ [1:old]) ∩ image(i | i ∈ [1:old]) = ∅

A Simple Value Evolution Graph

Sample Code
    p = 0
    IF (cond)
      p = p+5
    ELSE
      p = p+7
    ENDIF
    IF (p>0)
      …
    ENDIF

Static Single Assignment Form
    p1 = 0
    IF (cond)
      p2 = p1+5
    ELSE
      p3 = p1+7
    ENDIF
    p4 = γ(p2, p3, cond)
    IF (p4>0)
      …
    ENDIF

VEG: p1:0 → p2 (weight 5), p1:0 → p3 (weight 7), p2 → p4 (weight 0), p3 → p4 (weight 0)
Path through p2: p4 = p1 + 5 + 0, so p4 > p1 and p4 > 0
Path through p3: p4 = p1 + 7 + 0, so p4 > p1 and p4 > 0

The Value Evolution Graph
    old = p0
    DO i = 1, old
      p1 = μ(p0, p3)
      B(i) = 1
      IF (A(i).GT.0)
        p2 = p1+1
        A(p2) = 0
      ENDIF
      p3 = γ(p1, p2, A(i).GT.0)
    ENDDO
    p4 = η(p0, p1)

Our Solution: The Value Evolution Graph
    old = p0
    DO i = 1, old
      p1 = μ(p0, p3)
      B(i) = 1
      IF (A(i).GT.0)
        p2 = p1+1
        A(p2) = 0
      ENDIF
      p3 = γ(p1, p2, A(i).GT.0)
    ENDDO
    p4 = η(p0, p1)

VEG for the loop body: p1 → p2 (weight 1), p2 → p3 (weight 0), p1 → p3 (weight 0)
VEG for the outer context (figure): nodes old, p0, p1, p4; edge weights 0, 0, 0 and [0:old]

• VEG:
  • acyclic graph, GSA names as nodes
  • one for each loop body/subprogram
  • hierarchical relations among VEGs

VEG Nodes
    p0 = 0
    DO i = 1, N
      p1 = μ(p0, p4)
      IF (A(i).GT.0)
        p2 = p1+1
      ELSE
        p3 = 0
      ENDIF
      p4 = γ(p2, p3, A(i).GT.0)
    ENDDO

Node types (in the figure: p1 is the μ node, p3:0 is Input, p2 is Regular, p4 is Back):
  Input: result of assignment of loop invariant
  μ: merges value from outside with loop-back
  Back: last value in one iteration
  Regular: all others

VEG Edges
    p1 = …
    IF (A(i).GT.0)
      p2 = p1+1
    ENDIF
    p3 = γ(p1, p2, A(i).GT.0)

Edges are labeled (weight, predicate):
  p1 → p2: (+1, .TRUE.)
  p1 → p3: (+0, A(i).LE.0)
  p2 → p3: (+0, A(i).GT.0)

VEG Distance
    p1 = …
    IF (A(i).GT.0)
      p2 = p1+1
    ENDIF
    p3 = γ(p1, p2, A(i).GT.0)

VEG: p1 → p2 (weight 1), p2 → p3 (weight 0), p1 → p3 (weight 0)

distance(p1,p3) = [ ShortestPath(p1,p3) : LongestPath(p1,p3) ]
distance(p1,p3) = [0:1]

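The distance definition above amounts to a shortest-path and longest-path computation on a weighted acyclic graph. A minimal sketch follows, assuming the VEG is stored as an adjacency dictionary; the representation and the function name `distance` are illustrative choices, not the data structures used by the authors.

```python
def distance(edges, src, dst):
    """Return (shortest, longest) path weight from src to dst in an acyclic graph.

    edges maps a node to a list of (successor, weight) pairs.
    Returns None if dst is unreachable from src.
    """
    if src == dst:
        return (0, 0)
    best = None
    for succ, weight in edges.get(src, []):
        sub = distance(edges, succ, dst)
        if sub is None:
            continue  # this successor never reaches dst
        lo, hi = sub[0] + weight, sub[1] + weight
        best = (lo, hi) if best is None else (min(best[0], lo), max(best[1], hi))
    return best

# Loop-body VEG from the slide: p1 -> p2 (+1), p2 -> p3 (+0), p1 -> p3 (+0).
veg = {"p1": [("p2", 1), ("p3", 0)], "p2": [("p3", 0)]}
print(distance(veg, "p1", "p3"))  # (0, 1)
```

The plain recursion revisits shared subpaths, so a larger graph would call for memoization or a pass in topological order.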
Recurrence Properties
    p0 = 0
    DO i = 1, N
      p1 = μ(p0, p3)
      IF (A(i).GT.0)
        p2 = p1+1
      ENDIF
      p3 = γ(p1, p2)
    ENDDO

VEG: p1 (μ-node) → p2 (weight 1), p2 → p3 (weight 0), p1 → p3 (weight 0); p3 is the back node

step(p2|i=k, p2|i=k+1) = distance(p2, p3) + distance(p1, p2) = 0 + 1 = 1

image(p2)|i∈[1:N] = initial value(p1) + step(p1|i=k, p1|i=k+1) * [0:N-1] + distance(p1, p2)
                  = 0 + [0:1]*[0:N-1] + 1 = [1:N]

last value(p1)|i=N = initial value(p1) + step(p1|i=k, p1|i=k+1) * N
                   = 0 + [0:1]*N = [0:N]

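The step, image and last-value formulas can be reproduced with simple interval arithmetic. This is a sketch under the assumption that every range is a closed integer interval stored as a `(low, high)` tuple; the helper names and the choice N = 10 are hypothetical.

```python
def interval_add(a, b):
    """Sum of two closed intervals."""
    return (a[0] + b[0], a[1] + b[1])

def interval_mul(a, b):
    """Product of two closed intervals (min and max over endpoint products)."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))

def image(initial, step, n, offset):
    """initial + step * [0:n-1] + offset, as on the slide."""
    return interval_add(interval_add(initial, interval_mul(step, (0, n - 1))), offset)

def last_value(initial, step, n):
    """initial + step * n, as on the slide."""
    return interval_add(initial, interval_mul(step, (n, n)))

N = 10
step_p1 = (0, 1)                          # distance(p1, p3) = [0:1]
step_p2 = interval_add((0, 0), (1, 1))    # distance(p2, p3) + distance(p1, p2)
strictly_increasing = step_p2[0] > 0      # lower bound of the step is positive
image_p2 = image((0, 0), step_p1, N, (1, 1))
last_p1 = last_value((0, 0), step_p1, N)
```

With N = 10 this gives `image_p2 == (1, 10)` and `last_p1 == (0, 10)`, matching [1:N] and [0:N].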
Recurrence Properties
    old = p0
    q0 = 0
    DO i = 1, old
      q1 = μ(q0, q2)
      p1 = μ(p0, p3)
      q2 = q1+1
      B(q2) = 1
      IF (A(i).GT.0)
        p2 = p1+1
        A(p2) = 0
      ENDIF
      p3 = γ(p1, p2)
    ENDDO

VEG for q: q1 → q2 (weight 1)
VEG for p: p1 → p2 (weight 1), p2 → p3 (weight 0), p1 → p3 (weight 0)

Closed form (q): step(q2, q2) = 1, so B(q2) is independent
No closed form (p): step(p2, p2) = 1, so A(p2) is independent

Logic Inference on the VEG
    1  f1 = 0
    2  IF (c1)
    3    f2 = 1
    4  ENDIF
    5  f3 = γ(f1,f2,c1)
    6  IF (c2)
    7    value = …
    8  ELSE
    9    f4 = f3+2
    10 ENDIF
    11 f5 = γ(f3,f4,c2)
    12 IF (f5.EQ.1)
    13   PRINT *, value
    14 ENDIF

[Figure: VEG with nodes f1:0, f2:1, f3, f4, f5; edges into f3 have weight 0, f3 → f4 has weight 2, and the edge into f5 taken under c2 has weight 0; each node is annotated with the range traced back from f5 ∈ [1:1]]

Extract range: f5.EQ.1 gives f5 ∈ [1:1]
Trace range backwards: f5 ∈ [1:1]
Propagate value from 7 to 13

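The conclusion of this slide, that `value` is always defined when line 13 executes, can be cross-checked by enumerating the four control paths. The sketch below is a brute-force check written for illustration, assuming c1 and c2 are free boolean inputs; it does not implement the backward range tracing on the VEG itself.

```python
from itertools import product

def value_defined_when_printed():
    """Check that f5 == 1 can only happen on paths where c2 holds (line 7 ran)."""
    for c1, c2 in product((False, True), repeat=2):
        f3 = 1 if c1 else 0            # f3 = γ(f1, f2, c1) with f1 = 0, f2 = 1
        f5 = f3 if c2 else f3 + 2      # f5 = γ(f3, f4, c2) with f4 = f3 + 2
        if f5 == 1 and not c2:
            return False               # PRINT would read an undefined value
    return True

result = value_defined_when_printed()
```

The check passes because the ELSE branch yields f5 = 2 or f5 = 3, never 1.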
VEG Pruning
    A(p1) = …
    f1 = 0
    IF (cond)
      f2 = 1
      p2 = p1+1
    ENDIF
    p3 = γ(p1, p2, cond)
    f3 = γ(f1, f2, cond)
    IF (f3.GT.0)
      p4 = p3-1
    ENDIF
    p5 = γ(p3, p4, f3.GT.0)
    IF (f3.EQ.1)
      … = A(p5)
    ENDIF

[Figure: three copies of the VEG with edges p1 → p2 (1), p1 → p3 (0), p2 → p3 (0), p3 → p4 (-1), p3 → p5 (0), p4 → p5 (0), pruned under the predicates cond, f3.GT.0 and f3.EQ.1]

Is … = A(p5) covered by A(p1) = …?
  VEG before pruning: p5 ∈ [p1-1 : p1+1]
  After GSA-path pruning [Tu, Padua, ICS95]: p5 ∈ [p1-1 : p1]
  After VEG-based GSA-path pruning: p5 = p1

Automatic Parallelization Framework [Rus, Rauchwerger, Hoeflinger 2002]

PARALLELIZATION
  Generation of Parallel Code
  Privatization Analysis
  Dependence Analysis
DATAFLOW
  Memory Classification Analysis

Memory Classification Analysis [Hoeflinger 1998]
• Memory reference set partition
• Provides array dataflow/dependence information
• Relies heavily on closed forms

Example:
    A(3) = A(1) + A(2)
    A(1) = A(3) + A(2)

    ReadOnly (A) = { 2 }
    WriteFirst (A) = { 3 }
    ReadWrite (A) = { 1 }

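The partition in the example can be reproduced by replaying the references in execution order. This is a minimal sketch on an explicit access trace, assuming each statement reads its right-hand side before writing its left-hand side; the real analysis works on symbolic descriptors rather than concrete traces, and `classify` is a hypothetical name.

```python
def classify(accesses):
    """Partition locations into (ReadOnly, WriteFirst, ReadWrite).

    accesses is a list of ("R" or "W", location) pairs in execution order.
    """
    read_only, write_first, read_write = set(), set(), set()
    for kind, loc in accesses:
        if loc in write_first or loc in read_write:
            continue                    # class is already final
        if kind == "R":
            read_only.add(loc)          # read so far, never written
        elif loc in read_only:
            read_only.discard(loc)
            read_write.add(loc)         # read first, written later
        else:
            write_first.add(loc)        # first reference is a write
    return read_only, write_first, read_write

# A(3) = A(1) + A(2); A(1) = A(3) + A(2)
trace = [("R", 1), ("R", 2), ("W", 3), ("R", 3), ("R", 2), ("W", 1)]
ro, wf, rw = classify(trace)
```

For this trace the result is `ro == {2}`, `wf == {3}` and `rw == {1}`, the same sets as on the slide.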
Memory Reference Sequences

Stack push (P3M / PP_do100)
    DO i = 1, N
      p = 0
      DO j = 1, M
        IF (…)
          p = p+1
          A(p) = …
        ENDIF
      ENDDO
      DO j = 1, p
        … = A(j)
      ENDDO
    ENDDO

Descriptors:
  WF: pred_write [p : p+length_write]
  Recurrence: pred_step { p = p + length_step }

Sequence types (relation symbols marked […] did not survive extraction):
  Contiguous: pred_step […] pred_write, length_step […] length_write
  Increasing: pred_step […] pred_write, length_step […] length_write
  Consecutive: pred_step […] pred_write, length_step = length_write

Is A privatizable in the outer loop? Yes, contiguous write in inner loop
Is the inner loop independent? Yes, increasing in inner loop

Pushback Sequences

Conditional Pushback (HYDRO2D WNFLE_do10)
    DO i = 1, N
      IF (C(i).EQ.1)
        A(p) = …
        p = p+1
      ENDIF
    ENDDO

Conditional Pushback, Stack lookup (TRACK FPTRAK_do300)
    old = p
    DO i = 1, N
      next = p+1
      same = 0
      A(next) = …
      DO j = 1, old
        IF (A(j).EQ.A(next))
          same = 1
        ENDIF
      ENDDO
      IF (same.EQ.0)
        p = next
      ENDIF
    ENDDO

Conditional Pushback, Stack lookup & update (TRACK / EXTEND_do400)
    old = p
    DO i = 1, N
      ifdata = p+1
      DO k = 1, M
        A(p+1) = …
        DO j = ifdata, p
          IF (A(1,j).EQ.A(1,p+1))
            A(2,j) = A(2,j)+A(2,p+1)
            same = 1
          ENDIF
        ENDDO
        IF (same.EQ.0)
          p = p+1
        ENDIF
      ENDDO
    ENDDO

Pushback Sequences
• Detection
  • Consecutive WF
• Parallelization
  • Accumulation to private storage
  • Simple copy-out to shared storage

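The two parallelization steps, accumulation to private storage and copy-out to shared storage, can be sketched for the simple conditional pushback loop. The code below is an illustration only, assuming the iteration space is split into contiguous chunks and that the pushed values are known per iteration; `pushback_chunk` and `parallel_pushback` are hypothetical names, and the thread pool stands in for whatever runtime the generated code would use.

```python
from concurrent.futures import ThreadPoolExecutor

def pushback_chunk(c_chunk, values_chunk):
    """Run the conditional pushback on one chunk, appending to private storage."""
    private = []
    for flag, value in zip(c_chunk, values_chunk):
        if flag == 1:
            private.append(value)
    return private

def parallel_pushback(c, values, workers=4):
    """Accumulate per chunk in private lists, then copy out in iteration order."""
    size = max(1, (len(c) + workers - 1) // workers)
    chunks = [(c[k:k + size], values[k:k + size]) for k in range(0, len(c), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in chunk order, which preserves iteration order.
        privates = list(pool.map(lambda ch: pushback_chunk(*ch), chunks))
    shared = []
    for private in privates:
        shared.extend(private)          # simple copy-out to shared storage
    return shared

flags = [1, 0, 1, 1, 0, 1]
vals = [10, 20, 30, 40, 50, 60]
result = parallel_pushback(flags, vals, workers=3)
```

Because each chunk only appends to its own list, no chunk needs the final value of `p` from the chunks before it until the copy-out step.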
Implementation in Polaris

PARALLELIZATION
  Generation of Parallel Code
  Privatization Analysis
  Dependence Analysis
VEG-based Analysis
DATAFLOW
  Memory Classification Analysis

How VEG-based analysis interacts with each stage:
• Partially aggregated descriptors are fed to VEG-based analysis
• Memory Classification Analysis: contiguous sequences lead to more accurate dataflow information
• Privatization Analysis: more storage dependences eliminated by privatization
• Dependence Analysis: closer value ranges and increasing sequences give fewer false dependences
• Generation of Parallel Code: efficient pushback sequence parallelization

Experimental Results
Seq% = Sequential Time (loop) / Sequential Time (whole application)

Pushbacks in PERFECT
More in C and C++ codes!

Conclusions

Value Evolution Graph    Memory Reference Analysis
Comparison               Array Dataflow
Range                    Privatization
Recurrences              Dependence Analysis
Logic Inferences         Pushback Parallelization

Sample VEGs
[Figures: VEGs for EXTEND_do400 and EXTEND_do300]