
Coarse-Grain Parallelism

Transformations for symmetric multiprocessor machines: synchronization, privatization, loop distribution, alignment, and loop fusion.



  1. Coarse-Grain Parallelism Chapter 6 of Allen and Kennedy

  2. Introduction • Previously, our transformations targeted vector and superscalar architectures. • In this lecture, we worry about transformations for symmetric multiprocessor machines. • The difference between these transformations tends to be one of granularity.

  3. Review • SMP machines have multiple processors all accessing a shared central memory. • The processors run independently and can execute separate processes. • Starting processes and synchronization between processes are expensive.

  4. Synchronization • A basic synchronization element is the barrier. • A barrier in a program forces all processes to reach a certain point before execution continues. • Bus contention while processes wait at a barrier can cause slowdowns.
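The slides do not name a particular API; as a concrete sketch, here is a barrier in OpenMP Fortran (one common SMP programming model), runnable with a flag such as gfortran's -fopenmp:

      PROGRAM BARRIER_DEMO
      USE OMP_LIB
      IMPLICIT NONE
      INTEGER :: ID
!$OMP PARALLEL PRIVATE(ID)
      ID = OMP_GET_THREAD_NUM()
      PRINT *, 'Thread', ID, 'reached the barrier'
!$OMP BARRIER
!     No thread continues past this point until all threads arrive.
      PRINT *, 'Thread', ID, 'continuing'
!$OMP END PARALLEL
      END PROGRAM BARRIER_DEMO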

  5. Single Loops • The analog of scalar expansion is privatization. • Temporaries can be given a separate copy for each iteration.

      DO I = 1,N
S1       T = A(I)
S2       A(I) = B(I)
S3       B(I) = T
      ENDDO

      PARALLEL DO I = 1,N
         PRIVATE t
S1       t = A(I)
S2       A(I) = B(I)
S3       B(I) = t
      ENDDO
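In OpenMP Fortran (a sketch of one concrete realization of the PARALLEL DO / PRIVATE pseudocode above, assuming A, B, T, and N are declared as in the slide), the privatized loop can be written as a drop-in replacement:

!$OMP PARALLEL DO PRIVATE(T)
      DO I = 1,N
         T = A(I)          ! each thread works on its own copy of T
         A(I) = B(I)
         B(I) = T
      ENDDO
!$OMP END PARALLEL DO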

  6. Privatization Definition: A scalar variable x in a loop L is said to be privatizable if every path from the loop entry to a use of x inside the loop passes through a definition of x. Privatizability can be stated as a data-flow problem over upwards-exposed uses: x is privatizable exactly when no use of x inside the loop can see a definition from outside the loop. Equivalently, we can declare x private if the SSA graph for the loop body doesn't contain a phi function for x at the loop entry.
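A sketch of one standard formulation (the slide omits the equations; the names up, use, and def are conventional choices, not from the slides): let use(b) be the variables used in block b before being defined there, and def(b) the variables defined in b. Solve the backward data-flow equation

$$\mathrm{up}(b) \;=\; \mathrm{use}(b) \;\cup\; \Big(\big(\bigcup_{s \in \mathrm{succ}(b)} \mathrm{up}(s)\big) \setminus \mathrm{def}(b)\Big)$$

over the loop body; then x is privatizable in L exactly when x does not appear in up at the loop entry.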

  7. Array Privatization • We also need to privatize array variables. • For iteration J, the upwards-exposed variables are those exposed by the loop body minus the variables defined on earlier iterations.

      DO I = 1,100
S0       T(1) = X
L1       DO J = 2,N
S1          T(J) = T(J-1)+B(I,J)
S2          A(I,J) = T(J)
         ENDDO
      ENDDO

For this fragment, T(1) is the only upwards-exposed variable at the entry of L1; since S0 defines it inside the same iteration of the I loop, T can be privatized with respect to the I loop.

  8. Array Privatization • Using this analysis, we get the following code:

      PARALLEL DO I = 1,100
         PRIVATE t
S0       t(1) = X
L1       DO J = 2,N
S1          t(J) = t(J-1)+B(I,J)
S2          A(I,J) = t(J)
         ENDDO
      ENDDO
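A sketch in OpenMP Fortran (assuming T is an ordinary array whose bounds are known at loop entry; OpenMP's PRIVATE clause applies to arrays just as to scalars, giving each thread its own copy of T):

      REAL :: T(N)             ! per-thread copy under PRIVATE(T)
!$OMP PARALLEL DO PRIVATE(T)
      DO I = 1,100
         T(1) = X
         DO J = 2,N
            T(J) = T(J-1)+B(I,J)   ! the recurrence stays inside one thread
            A(I,J) = T(J)
         ENDDO
      ENDDO
!$OMP END PARALLEL DO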

  9. Loop Distribution • Loop distribution eliminates carried dependencies (a sketch follows below). • Consequently, it often creates opportunities for outer-loop parallelism. • We must add extra barriers to keep dependent loops from executing out of order, so the overhead may outweigh the gains from parallelism. • Attempt other transformations before attempting this one.
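A minimal sketch of distribution itself (the loop and arrays here are invented for illustration): the recurrence on A confines S1 to a sequential loop, while S2 becomes an independent parallel loop, with a barrier implied between the two.

      DO I = 1,N
S1       A(I) = A(I-1) + B(I)      ! carried dependence: stays sequential
S2       C(I) = A(I) + D(I)        ! only a loop-independent dependence on S1
      ENDDO

      DO I = 1,N                   ! after distribution: sequential loop
S1       A(I) = A(I-1) + B(I)
      ENDDO
      PARALLEL DO I = 1,N          ! now free of carried dependences
S2       C(I) = A(I) + D(I)
      ENDDO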

  10. Alignment • Many carried dependencies are due to array alignment issues. • If we can align all references, the dependencies go away and parallelism becomes possible.

      DO I = 2,N
         A(I) = B(I)+C(I)
         D(I) = A(I-1)*2.0
      ENDDO

      DO I = 1,N
         IF (I .GT. 1) A(I) = B(I)+C(I)
         IF (I .LT. N) D(I+1) = A(I)*2.0
      ENDDO

  11. Alignment • There are other ways to align the loop:

      DO I = 2,N
         J = MOD(I+N-4,N-1)+2
         A(J) = B(J)+C(J)
         D(I) = A(I-1)*2.0
      ENDDO

      D(2) = A(1)*2.0
      DO I = 2,N-1
         A(I) = B(I)+C(I)
         D(I+1) = A(I)*2.0
      ENDDO
      A(N) = B(N)+C(N)

  12. Alignment • If an array is involved in a recurrence, alignment isn't possible. • If two dependencies between the same statements have different dependence distances, alignment alone doesn't work. • We can fix the second case by replicating code:

      DO I = 1,N
         A(I+1) = B(I)+C
         X(I) = A(I+1)+A(I)
      ENDDO

      DO I = 1,N
         A(I+1) = B(I)+C
         ! Replicated statement
         IF (I .EQ. 1) THEN
            t = A(I)
         ELSE
            t = B(I-1)+C
         END IF
         X(I) = A(I+1)+t
      ENDDO

  13. Alignment Theorem: Alignment, replication, and statement reordering are sufficient to eliminate all carried dependencies in a single loop containing no recurrence, in which the distance of each dependence is a constant independent of the loop index. • We can establish this constructively. • Let G = (V, E, δ) be a weighted graph, where each v ∈ V is a statement and δ(v1,v2) is the dependence distance between v1 and v2. Let o : V → Z give the offset of each vertex. • G is said to be carry-free if o(v1) + δ(v1,v2) = o(v2) for every edge (v1,v2) ∈ E.
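As a worked check against the loop from slide 10 (the offsets here are chosen by us): take V = {S1, S2} with S1 being A(I) = B(I)+C(I) and S2 being D(I) = A(I-1)*2.0, so δ(S1,S2) = 1. Choosing o(S1) = 0 and o(S2) = 1 gives o(S1) + δ(S1,S2) = 0 + 1 = o(S2), so the graph is carry-free, and offsetting S2 by one iteration is exactly the D(I+1) = A(I)*2.0 rewrite shown earlier.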

  14. Alignment Procedure

      procedure Align(V, E, δ, o)
         // W is a worklist of vertices whose offsets have been fixed;
         // restart from a fresh vertex if a disconnected component remains
         pick an arbitrary v0 ∈ V; o(v0) ← 0; W ← {v0}
         while W is not empty
            remove an element v from W
            for each edge (w,v) ∈ E              // edges into v
               if o(w) is not yet assigned
                  o(w) ← o(v) - δ(w,v)
                  W ← W ∪ {w}
               else if o(w) ≠ o(v) - δ(w,v)      // conflicting offsets: replicate
                  create vertex w'
                  replace (w,v) with (w',v)
                  replicate all edges into w onto w'
                  o(w') ← o(v) - δ(w,v)
                  W ← W ∪ {w'}
            for each edge (v,w) ∈ E              // edges out of v
               if o(w) is not yet assigned
                  o(w) ← o(v) + δ(v,w)
                  W ← W ∪ {w}
               else if o(w) ≠ o(v) + δ(v,w)
                  create vertex v'
                  replace (v,w) with (v',w)
                  replicate all edges into v onto v'
                  o(v') ← o(w) - δ(v,w)
                  W ← W ∪ {v'}
      end Align

  15. Loop Fusion • Loop distribution was a method for separating the parallel parts of a loop. • Our solution attempted to find the maximal loop distribution. • The maximal distribution often yields parallelizable components too small for efficient parallelization. • Two obvious solutions: • Strip mine large loops to create larger granularity (see the sketch below). • Perform maximal distribution, and fuse parallelizable loops back together.
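A minimal sketch of the strip-mining option in the book's pseudocode style (NB, the strip size, is a tuning parameter assumed here; larger strips mean fewer, coarser parallel tasks):

      PARALLEL DO IS = 1,N,NB           ! one strip of up to NB iterations per task
         DO I = IS, MIN(IS+NB-1, N)     ! each strip runs sequentially inside
            A(I) = B(I) + C(I)
         ENDDO
      ENDDO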

  16. Fusion Safety Definition: A loop-independent dependence between statements S1 and S2 in loops L1 and L2 respectively is fusion-preventing if fusing L1 and L2 causes the dependence to be carried by the combined loop in the opposite direction.

      DO I = 1,N
S1       A(I) = B(I)+C
      ENDDO
      DO I = 1,N
S2       D(I) = A(I+1)+E
      ENDDO

      DO I = 1,N
S1       A(I) = B(I)+C
S2       D(I) = A(I+1)+E
      ENDDO

In the fused loop, S2 reads A(I+1) one iteration before S1 writes it, so the dependence is now carried in the opposite direction, from S2 to S1.

  17. Fusion Safety • We shouldn't fuse loops if fusing would violate the ordering of the dependence graph. • Ordering Constraint: Two loops can't be validly fused if there exists a path of loop-independent dependencies between them containing a loop or statement not being fused with them. • In the example graph on the slide, fusing L1 with L3 violates the ordering constraint: there is a path L1 → L2 → L3, so the fused node {L1,L3} would have to occur both before and after L2.

  18. Fusion Profitability • Parallel loops should generally not be merged with sequential loops. Definition: An edge between two statements in loops L1 and L2 respectively is said to be parallelism-inhibiting if, after merging L1 and L2, the dependence is carried by the combined loop.

      DO I = 1,N
S1       A(I+1) = B(I) + C
      ENDDO
      DO I = 1,N
S2       D(I) = A(I) + E
      ENDDO

      DO I = 1,N
S1       A(I+1) = B(I) + C
S2       D(I) = A(I) + E
      ENDDO

Both original loops are parallel, but the fused loop carries the dependence from S1 to S2 (A(I+1) written in iteration I is read as A(I) in iteration I+1), so the fused loop must run sequentially.

  19. Typed Fusion • We start off by classifying loops into two types: parallel and sequential. • We next gather all edges that inhibit efficient fusion, and call them bad edges. • Given a loop dependency graph (V,E), we want to obtain a graph (V',E') by merging vertices of V subject to the following constraints: • Bad Edge Constraint: vertices joined by a bad edge aren't fused. • Ordering Constraint: vertices joined by a path containing a non-parallel vertex aren't fused.

  20. Typed Fusion Procedure

      procedure TypedFusion(G, T, type, B, t0)
         initialize all variables to zero
         set count[n] to the in-degree of each node n
         initialize worklist W with all nodes of in-degree zero
         while W is not empty
            remove an element n, of type t, from W
            if t = t0 then                     // n has the type being fused
               if maxBadPrev[n] = 0
                  then p ← fused
                  else p ← next[maxBadPrev[n]]
               if p ≠ 0 then
                  x ← node[p]
                  num[n] ← num[x]
                  update_successors(n, t)
                  fuse x and n, and call the result n
               else
                  create_new_fused_node(n)
                  update_successors(n, t)
            else
               create_new_node(n)
               update_successors(n, t)
      end TypedFusion

  21. Typed Fusion Example [Figures on the slide: the original loop graph; the graph annotated with (maxBadPrev, p) → num; the graph after fusing parallel loops; the graph after fusing sequential loops.]

  22. Cohort Fusion • Given an outer loop containing some number of inner loops, we want to be able to run some of the inner loops in parallel. • We can do this as follows: • Run TypedFusion with B = {fusion-preventing edges, parallelism-inhibiting edges, and edges between a parallel loop and a sequential loop}. • Put a barrier at the end of each identified cohort (see the sketch below). • Run TypedFusion again to fuse the parallel loops in each cohort.
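A sketch of the shape cohort fusion produces (the loops here are invented; the point is the barrier between cohorts):

      PARALLEL DO I = 1,N        ! cohort 1: fused parallel loops
         A(I) = B(I) + C(I)
         E(I) = A(I) * 2.0
      ENDDO
!     barrier: cohort 1 must finish before cohort 2 starts
      DO I = 1,N                 ! cohort 2: a sequential loop
         S = S + E(I)
      ENDDO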
