Optimistic and Pessimistic Parallelization
Dependences • Dependences: ordering constraints on program execution • "At runtime, this instruction/statement must complete before that one can begin" • Dependence analysis: compile-time analysis to determine the dependences in a program • Needed to generate parallel code • Also needed to correctly transform a program by reordering statements • Safe approximation: it is OK for the compiler to assume more ordering constraints than are actually needed for correct execution • At worst, this prevents some transformation or parallelization that would have been correct • In contrast, leaving out dependences would be unsafe
Two kinds of dependences • Data dependences • Arise from reads and writes to memory locations • Classified into flow, anti, and output dependences • Control dependences • Arise from flow of control • (e.g.) statements on the two sides of an if-then-else are control dependent on the predicate • We will not worry too much about control dependences in this course
Data Dependence Example • S1: x = 5; • S2: y = x; • S3: x = 3;
Flow Dependence • S1: x = 5; • S2: y = x; • S3: x = 3; (i) S1 is executed before S2 (ii) S1 must write to x before S2 reads from it Flow dependence S1 → S2
Anti-dependence • S1: x = 5; • S2: y = x; • S3: x = 3; (i) S2 is executed before S3 (ii) S2 must read variable x before S3 overwrites it Anti-dependence S2 → S3
Output Dependence • S1: x = 5; • S2: y = x; • S3: x = 3; (i) S1 is executed before S3 (ii) S1 must write to x before S3 overwrites it Output dependence S1 → S3
Summary • Flow dependence Si → Sj • Si is executed before Sj • Si writes to a location that is read by Sj • Anti-dependence Si → Sj • Si is executed before Sj • Si reads from a location that is overwritten by Sj • Output dependence Si → Sj • Si is executed before Sj • Si writes to a location that is overwritten by Sj
Are output dependences needed? • The goal of computing program dependences in a compiler is to determine a partial order on program statement executions • From the example, it seems that the output dependence constraint is covered by the transitive closure of the flow and anti-dependences • So why do we need output dependences? • Answer: aliases
Aliases • Aliases: two or more program names that may refer to the same location • indirect array references (sparse matrices) • …… • A [X[I]] := …. • A [X[J]] := …. • assignment through pointers (trees, graphs) • …….. • *p1 := … • *p2 := … • call by reference • procedure foo(var x, var y) //call by reference • x : = 2; • y : = 3; • May-aliases: two names that may or may not refer to the same location • Must-aliases: two names that we know for sure must refer to the same location every time the statements are executed
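A minimal Java sketch (hypothetical names, not from the slides) of may- and must-aliases: whether A[X[i]] and A[X[j]] name the same element depends on the run-time contents of X, so a compiler must conservatively treat the two names as may-aliases, while p1 and p2 below are must-aliases.

  public class AliasDemo {
      public static void main(String[] args) {
          double[] A = new double[10];
          int[] X = {3, 3, 7};            // in general, known only at run time

          int i = 0, j = 1;
          A[X[i]] = 1.0;                  // writes A[3]
          A[X[j]] = 2.0;                  // also writes A[3]: the two names alias here

          // Object references can alias in the same way as pointers:
          double[] p1 = A;
          double[] p2 = A;                // p1 and p2 are must-aliases in this snippet
          p2[0] = 5.0;                    // the write is visible through p1 as well
          System.out.println(A[3] + " " + p1[0]);
      }
  }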
Dependence analysis and aliasing • What constraints should we assume between these program statements? • …… • A [X[I]] := …. • A [X[J]] := …. • Answer: if the two names are aliases, we must order the two statement executions by a dependence to be safe • What kind of dependence makes sense? • output dependence
Dependence analysis in the presence of aliasing • Flow dependence Si → Sj • Si is executed before Sj • Si may write to a location that may be read by Sj • Anti-dependence Si → Sj • Si is executed before Sj • Si may read from a location that may be overwritten by Sj • Output dependence Si → Sj • Si is executed before Sj • Si may write to a location that may be overwritten by Sj
Dependences in loops • for (i = 0; i < 5; i++) { • S1: t = i * A[i]; • S2: A[i] = 3 * t; • } • S1 and S2 are executed many times • What does it mean to have a dependence S1 → S2? • Answer: by convention, this means that there is a dependence between some two instances of S1 and S2
Dependence analysis in the presence of loops and aliasing • Flow dependence Si → Sj • An instance of Si is executed before an instance of Sj • That instance of Si may write to a location that may be read by that instance of Sj • Anti-dependence Si → Sj • An instance of Si is executed before an instance of Sj • That instance of Si may read from a location that may be overwritten by that instance of Sj • Output dependence Si → Sj • An instance of Si is executed before an instance of Sj • That instance of Si may write to a location that may be overwritten by that instance of Sj
Dependence Example • for (i = 0; i < 5; i++) { • S1: t = i * A[i]; • S2: A[i] = 3 * t; • } [Dependence graph with output, flow, and anti edges between S1 and S2] If we think of A as a single monolithic location, there would be an output dependence S2 → S2. More refined picture: treat each element of A as a different location; then there is no output dependence S2 → S2.
Parallel execution of loop • for (i = 0; i < 5; i++) { • S1: t = i * A[i]; • S2: A[i] = 3 * t; • } [Dependence graph with output, flow, and anti edges between S1 and S2] Can we execute loop iterations in parallel? A dependence inhibits parallel execution only if the dependent source and destination statement instances are in different loop iterations. In this example, the dependences S1 → S2 do not prevent parallel execution of loop iterations, but the other two dependences do.
Loop-carried vs loop-independent dependence • for (i = 0; i < 5; i++) { • S1: t = i * A[i]; • S2: A[i] = 3 * t; • } [Dependence graph with output, flow, and anti edges between S1 and S2] If the source and destination of a dependence are in different iterations, we say the dependence is loop-carried; otherwise it is loop-independent. In this example, the dependences S1 → S2 are loop-independent. Only loop-carried dependences inhibit inter-iteration parallelism.
Transformations to enhance parallelism • for (i = 0; i < 5; i++) { • S1: t[i] = i * A[i]; • S2: A[i] = 3 * t[i]; • } [Dependence graph: flow S1 → S2] In many programs, we can perform transformations to enhance parallelism. In this example, all dependences are loop-independent, so all iterations can be executed in parallel. This is called a DO-ALL loop. To get this parallel version, we expanded the variable t into an array. Transformation: scalar expansion
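A hedged Java sketch of running the scalar-expanded DO-ALL loop in parallel (illustrative names; a parallel stream is just one way to express a DO-ALL loop): because each iteration now touches only t[i] and A[i], the iterations can run in any order.

  import java.util.stream.IntStream;

  public class ScalarExpansion {
      public static void main(String[] args) {
          double[] A = {1, 2, 3, 4, 5};
          int n = A.length;
          double[] t = new double[n];        // scalar t expanded into an array

          // DO-ALL loop: no loop-carried dependences, so iterations may execute in parallel
          IntStream.range(0, n).parallel().forEach(i -> {
              t[i] = i * A[i];               // S1
              A[i] = 3 * t[i];               // S2: the flow dependence on t[i] stays within iteration i
          });

          System.out.println(java.util.Arrays.toString(A));
      }
  }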
Example • for (i = 1; i <= N; i++) • for (j = 1; j <= N; j++) • for (k = 1; k <= N; k++) • C[i,j] = C[i,j] + A[i,k]*B[k,j] What is the dependence graph? Notice that the two outer loops are parallel. The inner loop is not parallel (unless you allow the additions to be done in any order). The notion of loop-carried/loop-independent dependence must be generalized for the multiple-loop case: dependence vectors
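A sketch in Java (illustrative, 0-based indexing) of exploiting the two parallel outer loops of matrix multiply: every (i, j) pair writes a distinct C[i][j], so the i loop (and, if desired, the j loop) can run in parallel, while the k loop is kept sequential to preserve the order of the additions.

  import java.util.stream.IntStream;

  public class MatMul {
      static void multiply(double[][] A, double[][] B, double[][] C, int N) {
          // Outer i loop run in parallel; the j loop could be parallelized the same way
          IntStream.range(0, N).parallel().forEach(i -> {
              for (int j = 0; j < N; j++) {
                  double sum = C[i][j];
                  for (int k = 0; k < N; k++) {    // loop-carried dependence on the running sum
                      sum += A[i][k] * B[k][j];
                  }
                  C[i][j] = sum;
              }
          });
      }
  }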
Questions • How do we compute the dependence relation of a loop nest? • What transformations should we perform to enhance parallelism in loop nests? • What abstractions of the dependence relation are useful for parallelization and transformation of loop nests? • The answers depend heavily on the data structures used in the code • We know a lot about array programs • The problem is much harder for programs that manipulate irregular data structures like graphs
Delaunay Meshes • Meshes useful for • Finite element method for solving PDEs • Graphics rendering • Delaunay meshes (2-D) • Triangulation of a surface, given vertices • Delaunay property: circumcircle of any triangle does not contain another point in the mesh • Related to Voronoi diagrams
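A minimal sketch of the standard in-circumcircle predicate that the Delaunay property relies on (plain determinant form; it assumes the triangle (a, b, c) is given in counter-clockwise order and uses floating point, whereas robust implementations use exact arithmetic):

  public class Delaunay {
      // Returns true if point p lies strictly inside the circumcircle of triangle (a, b, c).
      static boolean inCircumcircle(double ax, double ay, double bx, double by,
                                    double cx, double cy, double px, double py) {
          double adx = ax - px, ady = ay - py;
          double bdx = bx - px, bdy = by - py;
          double cdx = cx - px, cdy = cy - py;
          double ad = adx * adx + ady * ady;
          double bd = bdx * bdx + bdy * bdy;
          double cd = cdx * cdx + cdy * cdy;
          // 3x3 determinant expanded along the first row
          double det = adx * (bdy * cd - bd * cdy)
                     - ady * (bdx * cd - bd * cdx)
                     + ad  * (bdx * cdy - bdy * cdx);
          return det > 0;
      }
  }

A mesh has the Delaunay property when this test is false for every triangle and every other mesh point.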
Delaunay Mesh Refinement • Want all triangles in mesh to meet quality constraints • (e.g.) no angle < 30° • Mesh refinement: fix bad triangles through iterative refinement
Iterative Refinement • Choose “bad” triangle
Mesh Refinement • Choose “bad” triangle • Add new vertex at center of circumcircle
Mesh Refinement • Choose “bad” triangle • Add new vertex at center of circumcircle • Gather all triangles that no longer satisfy Delaunay property into cavity
Mesh Refinement • Choose “bad” triangle • Add new vertex at center of circumcircle • Gather all triangles that no longer satisfy Delaunay property into cavity • Re-triangulate affected region, including new point
Mesh Refinement • Choose "bad" triangle • Add new vertex at center of circumcircle • Gather all triangles that no longer satisfy Delaunay property into cavity • Re-triangulate affected region, including new point • Add newly created bad triangles to worklist • Iterate until there are no more bad triangles • The final mesh depends on the order in which bad triangles are processed, but any order will give you a good mesh at the end
Refinement Example [figures: Original Mesh, Refined Mesh]
Program

Mesh m = /* read in mesh */
WorkQueue wq;
wq.enqueue(m.badTriangles());
while (!wq.empty()) {
  Triangle t = wq.dequeue();        // choose bad triangle
  Cavity c = new Cavity(t);         // determine new vertex
  c.expand();                       // determine affected triangles
  c.retriangulate();                // re-triangulate region
  m.update(c);                      // update mesh
  wq.enqueue(c.badTriangles());     // add new bad triangles to queue
}
Parallelization Opportunities Bad triangles with non-overlapping cavities can be processed in parallel.
Parallelization Study • Estimated available parallelism for a mesh of 1M triangles • The actual ability to exploit parallelism depends on the scheduling of the processing • C. Antonopoulos, X. Ding, A. Chernikov, F. Blagojevic, D. Nikolopoulos, and N. Chrisochoides, "Multigrain Parallel Delaunay Mesh Generation", ICS 2005
Problem • Identifying dependences at compile time is tractable for scalars and arrays • Second half of course... • What if we cannot determine statically what the dependences are? • Pointer-based data structures are hard to analyze • Dependences may be input-dependent • This is the case for mesh generation
Solution: optimistic parallelization • Like speculative execution in microprocessors • Execute speculatively, correct if a mistake is made • Idea: Execute code in parallel speculatively • Perform dynamic checks for dependences between the parallel pieces of code • If a dependence is detected, the parallel execution was not correct • Roll back execution and try again!
Program Transformation: use the right abstractions • Using a queue introduces dependences that have nothing to do with the problem • Abstractly, all we need is a set of bad triangles • A queue is an over-specification • WorkSet abstraction • getAny operation • Does not make ordering guarantees • Removes dependences between iterations • In the absence of cavity interference, iterations can execute in parallel • Replace WorkQueue with WorkSet • Compare this with scalar expansion in the array case; similar in spirit
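A minimal sketch of a WorkSet class (hypothetical, not taken from any particular system): getAny returns an arbitrary element and makes no ordering promise, and the simple synchronized version below can be shared by multiple threads.

  import java.util.ArrayDeque;
  import java.util.Collection;
  import java.util.Deque;

  // Unordered work-set: getAny makes no promise about which element comes back
  public class WorkSet<T> {
      private final Deque<T> items = new ArrayDeque<>();

      public synchronized void add(T item) { items.push(item); }

      public synchronized void addAll(Collection<? extends T> c) { items.addAll(c); }

      public synchronized boolean empty() { return items.isEmpty(); }

      public synchronized T getAny() {      // removes and returns an arbitrary element, or null if empty
          return items.isEmpty() ? null : items.pop();
      }
  }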
Rewritten Program

Mesh m = /* read in mesh */
WorkSet ws;
ws.add(m.badTriangles());
while (!ws.empty()) {
  Triangle t = ws.getAny();         // choose bad triangle
  Cavity c = new Cavity(t);         // determine new vertex
  c.expand();                       // determine affected triangles
  c.retriangulate();                // re-triangulate region
  m.update(c);                      // update mesh
  ws.add(c.badTriangles());         // add new bad triangles to set
}
Optimistic parallelization • Can now exploit “getAny” to parallelize loop • Can try to expand cavities in parallel • Expansions can still conflict • In practice, most cavities can be expanded in parallel safely • No way to know this a priori • Only guaranteed safe approach is serialization • What if we perform parallelization without prior guarantee of safety?
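A hedged sketch of the optimistic scheme in Java (illustrative only: Mesh, Cavity, and Triangle are the abstractions from the earlier pseudocode, and the contains/tryLockAll/unlockAll operations used for conflict detection are assumptions added here, not real APIs). Each worker speculatively expands a cavity; if it cannot acquire every triangle in the cavity it rolls back by releasing what it holds and retries the triangle later.

  // Worker loop for speculative cavity expansion (sketch)
  void refineWorker(Mesh m, WorkSet<Triangle> ws) {
      while (!ws.empty()) {
          Triangle t = ws.getAny();
          if (t == null || !m.contains(t)) continue;   // triangle may already have been refined away

          Cavity c = new Cavity(t);
          c.expand();                       // speculative: reads neighboring triangles

          if (!c.tryLockAll()) {            // dynamic dependence check: another worker owns part of the cavity
              c.unlockAll();                // roll back: release any partially acquired locks
              ws.add(t);                    // retry this bad triangle later
              continue;
          }
          try {
              c.retriangulate();
              m.update(c);                  // commit: reached only when the whole cavity is owned
              ws.addAll(c.badTriangles());
          } finally {
              c.unlockAll();
          }
      }
  }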
Parallelization Issues • Must ensure that run-time checks are efficient • How do we perform roll-backs? • Would like to minimize conflicts • Scheduling becomes important • The number of available cavities for expansion exceeds the computational resources • Choose cavities to expand so as to minimize conflicts • Empirical testing: ~30% of cavity expansions conflict
Questions • What are the right abstractions for writing irregular programs? • WorkQueue vs. WorkSet • How do we determine where to apply optimistic parallelization? • How do we perform dependence checks dynamically w/o too much overhead? • Can hardware support help? • How do we implement roll-backs? • How do we reduce the probability of roll-backs?