
Using the Iteration Space Visualizer in Loop Parallelization





  1. Using the Iteration Space Visualizer in Loop Parallelization Yijun YU http://winpar.elis.rug.ac.be/ppt/isv

  2. Overview
  ISV – a 3D Iteration Space Visualizer: view the dependences in the iteration space
  (iteration – one instance of the loop body; space – the grid of all index values)
  • Detect the parallelism
  • Estimate the speedup
  • Derive a loop transformation
  • Find statement-level parallelism
  • Future development

  3. 1. Dependence
  Program:
    DO I = 1,3
      A(I) = A(I-1)
    ENDDO
  versus:
    DOALL I = 1,3
      A(I) = A(I-1)
    ENDDO
  Shared-memory execution traces:
    sequential:                 A(1) = A(0)   A(2) = A(1)   A(3) = A(2)
    DOALL (one possible order): A(2) = A(1)   A(1) = A(0)   A(3) = A(2)
  [Figure: the shared-memory contents after each step of the two traces]

  4. 1.1 Example 1 – the loop is marked with the ISV directive 'visualize'

  5. 1.2 Visualize the Dependence
  • A dependence is visualized in an iteration space dependence graph
  • Node: an iteration
  • Edge: a dependence (the execution order) between node iterations
  • Color: the dependence type – FLOW: write then read; ANTI: read then write; OUTPUT: write then write
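
  For illustration, a minimal Fortran loop (hypothetical, not from the deck) in which ISV would draw all three edge colors:

      DO I = 2, N
         A(I)   = B(I)      ! S1: reads B(I), writes A(I)
         C(I)   = A(I-1)    ! S2: FLOW   - A(I-1) was written by S1 one iteration earlier
         B(I-1) = C(I)      ! S3: ANTI   - B(I-1) was read by S1 one iteration earlier
         E(I)   = C(I)      ! S4: writes E(I)
         E(I-1) = 0.0       ! S5: OUTPUT - E(I-1) was written by S4 one iteration earlier
      ENDDO

  Each labeled pair of accesses is one iteration apart, so every edge in the visualized graph has distance 1.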

  6. 1.3 Parallelism?
  • Stepwise view of the sequential execution
  • No parallelism is found here
  • However, many programs do have parallelism…

  7. 2. Potential Parallelism
  • Time(sequential) = number of iterations
  • Dataflow: each iteration executes as soon as its data are ready. Time(dataflow) = number of iterations on the longest (critical) path
  • The potential parallelism is expressed as speedup = Time(sequential) / Time(dataflow)
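
  A worked instance using the loop from slide 3: Time(sequential) = 3, and the flow dependences chain iteration 1 to 2 to 3, so the longest path also contains 3 iterations and Time(dataflow) = 3, giving speedup = 3/3 = 1, which matches the "no parallelism" finding of slide 6. If the body were instead A(I) = B(I) (a hypothetical dependence-free variant), Time(dataflow) would be 1 and the speedup 3.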

  8. 2.1 Example 2

  9. Diophantine Equations + Loop bounds (polytope) = Iteration Space Dependencies
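
  To make the slogan concrete, a standard textbook instance (not from the deck): whether a write A(2I) and a read A(2J+1) can ever touch the same element is the Diophantine equation 2i = 2j + 1; since gcd(2,2) = 2 does not divide 1, it has no integer solution and there is no dependence, regardless of the bounds. When solutions do exist, intersecting them with the loop-bound polytope yields exactly the dependent iteration pairs that ISV draws as edges.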

  10. 2.2 Irregular dependence
  • Dependences have non-uniform distances
  • Parallelism analysis: 200 iterations over 15 data flow steps, speedup = 200/15 ≈ 13.3
  • Problem: how to exploit it?

  11. 3. Visualize parallelism
  Find answers to these questions:
  • What is the dependence pattern?
  • Is there a parallel loop? (How to find it?)
  • What is the maximal parallelism? (How to exploit it?)
  • Is the load of the parallel tasks balanced?

  12. 3.1 Example 3

  13. 3.2 3D Space

  14. 3.3 Loop parallelizable?
  • The I, J, K loops span a 3D iteration space of 32 iterations
  • Simulate the sequential execution
  • Which loop can be parallel?
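
  (Standard dependence theory, stated here because the deck only demonstrates it visually: a loop at nesting level L can run as a DOALL exactly when it carries no dependence, i.e. when no dependence distance vector has its first nonzero component at position L. The interactive checks on the next two slides test precisely this condition.)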

  15. 3.4 Loop parallelization
  • Interactively try the parallelization: check a parallel loop I
  • The blinking dependence edges prevent the parallelization of the chosen loop I

  16. 3.5 Parallel execution
  • Let ISV find the correct parallelization: automatically check the parallel loop
  • Simulate the parallel execution: it takes 16 time steps

  17. 3.6 Dataflow execution
  • Sequential execution takes 32 time steps
  • Simulate the data flow execution: it takes only 4 time steps
  • Potential speedup = 32/4 = 8

  18. 3.7 Graph partitioning
  • Dataflow speedup = 8
  • Iterating through partitions: the connected components of the dependence graph
  • All the partitions are load balanced

  19. 4. Loop Transformation
  Potential parallelism + transformation → real parallelism

  20. 4.1 Example 4

  21. 4.2 The iteration space
  • Sequential execution: 25 iterations

  22. 4.3 Loop parallelizable?
  • Check loop I
  • Check loop J

  23. 4.4 Dataflow execution
  • In total 9 time steps
  • Potential speedup: 25/9 ≈ 2.78
  • Wavefront effect: all iterations in the same dataflow step lie on the same line

  24. 4.5 Zoom-in on the I-space

  25. 4.6 Speedup vs program size
  • Zoom-in previews the parallelism in part of a loop without modifying the program
  • Executing the program at different sizes n estimates a speedup of n²/(2n-1)

  26. 4.7 How to obtain the potential parallelism
  Here we already have these metrics:
  • Sequential time steps = N²
  • Dataflow time steps = 2N-1
  • Potential speedup = N²/(2N-1)
  How to actually obtain this potential speedup of a loop? Transformation.
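
  These counts follow from the wavefront picture, assuming (as the 5 x 5 space and the diagonal waves suggest) a doubly nested loop with dependence distances (1,0) and (0,1): iteration (i,j) can run at dataflow step t(i,j) = i + j - 1, so the steps range from 1 to 2N-1, while sequential execution needs all N² iterations. Hence speedup = N²/(2N-1), about N/2 for large N, and 25/9 ≈ 2.78 for N = 5.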

  27. 4.8 Unimodular transformation (UT)
  • A unimodular matrix is a square integer matrix with determinant ±1. It is obtained from the identity matrix by three kinds of elementary transformations: reversal, interchange, and skewing
  • New loop index = unimodular matrix × old loop index: the new loop execution order is determined by the transformed index, and the iteration space keeps its unit step size
  • Finding a suitable UT reorders the iterations such that the new loop nest has a parallel loop
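
  For a double loop, the three elementary transformations correspond to the following matrices (standard instances; the deck shows them only as pictures):

      reversal       interchange      skewing
      [-1  0]        [ 0  1]          [ 1  0]
      [ 0  1]        [ 1  0]          [ 1  1]

  Their determinants are -1, -1, and 1 respectively, so each is unimodular, and any product of them is unimodular as well.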

  28. 4.9 Hyperplane transformation
  • Interactively define a hyperplane
  • Observe that the plane iteration matches the dataflow simulation: plane = dataflow
  • Based on the plane, ISV calculates a unimodular transformation
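
  In symbols (the classic hyperplane method, which the deck demonstrates only visually): choosing a plane normal π schedules all iterations x on the hyperplane π·x = c into the same parallel step, and the schedule is valid when π·d > 0 for every dependence distance vector d, so that every dependence crosses the planes in increasing order. Assuming the distances (1,0) and (0,1) from the wavefront example above, π = (1,1) works, since π·(1,0) = π·(0,1) = 1 > 0.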

  29. 4.10 The derived UT
  • The transformed iteration space and the generated loop
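
  The generated loop appears in the deck only as a screenshot. A minimal Fortran sketch of what a wavefront UT produces, assuming the same N x N recurrence A(I,J) = A(I-1,J) + A(I,J-1) as above:

      DO T = 2, 2*N                         ! new sequential outer loop: wavefront T = I+J
        DOALL J = MAX(1,T-N), MIN(N,T-1)    ! all iterations on one wavefront are independent
          A(T-J,J) = A(T-J-1,J) + A(T-J,J-1)
        ENDDO
      ENDDO

  The outer loop enumerates the 2N-1 wavefronts sequentially; the inner DOALL executes every iteration of a wavefront in parallel, realizing the N²/(2N-1) potential speedup.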

  30. 4.11 Verify the UT
  • ISV checks whether the transformation is valid
  • Observe that the parallel loop execution in the transformed loop matches the plane execution: parallel = plane
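
  (The check ISV performs can be stated precisely in the standard way: a unimodular transformation is valid iff every transformed dependence distance vector remains lexicographically positive, i.e. its first nonzero component is positive, so each dependence still goes from an earlier to a later iteration in the new execution order.)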

  31. 5. Statement-level parallelism
  • Unimodular transformations work at the iteration level
  • The statement dependences within the loop body are hidden in the iteration space graph
  • How to exploit parallelism at the statement level? Map the statements to iterations.

  32. 5.1 Example 5 SSV: statement space visualization

  33. 5.2 Iteration-level parallelism
  • The iteration space is 2D
  • There are N² = 16 iterations
  • The dataflow execution takes 2N-1 = 7 time steps
  • The potential speedup is 16/7 ≈ 2.29

  34. 5.3 Parallelism in statements
  • The (statement) iteration space is 3D
  • There are 2N² = 32 statement instances
  • The dataflow execution still takes 2N-1 = 7 time steps
  • The potential speedup is 32/7 ≈ 4.57

  35. 5.4 Comparison
  • Statement-level analysis doubles the potential speedup found at the iteration level (4.57 vs. 2.29)

  36. 5.5 Define the partition planes • partitions • hyper-planes

  37. What is validity?
  Show the execution order on top of the dependence arrows (for one plane or all together, depending on the density of the slide).

  38. 5.6 Invalid UT
  • The invalid unimodular transformation derived from the hyperplane is rejected by ISV
  • Alternatively, ISV calculates the unimodular transformation from the dependence distance vectors available in the dependence graph

  39. 6. Pseudo distance method
  [Figure: the base vectors and the resulting unimodular matrix]
  The pseudo distance method:
  • Extract base vectors from the dependent iterations
  • Examine whether the base vectors generate all the observed distances
  • Calculate the unimodular transformation from the base vectors

  40. Another way to find parallelism automatically: the iteration space is a grid, and non-uniform dependences can be viewed as members of a uniform dependence grid with unknown base vectors. Finding these base vectors lets us extend existing parallelization techniques to the non-uniform case.
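
  Restating the slide's idea in symbols: if every observed distance vector d can be written as an integer combination d = λ1·b1 + … + λk·bk of a small set of base vectors b1, …, bk, then a transformation that is valid for the base vectors handles all the observed, non-uniform distances at once. Slide 41 lists such a basis: (1,0,-1) and (0,1,1).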

  41. 6.1 Dependence distance • (1,0,-1) • (0,1,1)

  42. 6.2 The Transformation
  • The transformation matrix discovered by the pseudo distance method:
       1  1  0
      -1  0  1
       1  0  0
  • The distance vectors are transformed: (1,0,-1) → (0,1,0) and (0,1,1) → (0,0,1)
  • The dependent iterations now share the same first index, which implies that the outermost loop is parallel
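
  The slide's numbers can be verified by hand, reading the distance vectors as row vectors (the convention under which the printed matrix reproduces the slide's results):
    (1,0,-1)·T = (1·1 + 0·(-1) + (-1)·1,  1·1 + 0·0 + (-1)·0,  1·0 + 0·1 + (-1)·0) = (0,1,0)
    (0,1,1)·T  = (0·1 + 1·(-1) + 1·1,     0·1 + 1·0 + 1·0,     0·0 + 1·1 + 1·0)    = (0,0,1)
  Both transformed distances have a zero first component, so no dependence is carried by the new outermost loop.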

  43. 6.3 Compare the UT matrices
  • The transformation matrix discovered by the pseudo distance method:
       1  1  0
      -1  0  1
       1  0  0
  • An invalid transformation matrix discovered by the hyper-plane method:
       1  0  0
      -1  1  0
       1  0  1
  • The same first column means the transformed outermost loops have the same index.

  44. 6.4 The transformed space
  • The outermost loop is parallel
  • There are 8 parallel tasks
  • The load of the tasks is not balanced
  • The longest task takes 7 time steps

  45. 7. Non-perfectly nested loops
  • What is a non-perfectly nested loop?
  • The unimodular transformations only work for perfectly nested loops
  • For a non-perfectly nested loop, the iteration space is constructed with extended indices:
    an N-fold non-perfectly nested loop becomes an (N+1)-fold perfectly nested loop

  46. 7.1 Perfectly nested loop?
  Non-perfectly nested loop:
    DO I1 = 1,3
      A(I1) = A(I1-1)
      DO I2 = 1,4
        B(I1,I2) = B(I1-1,I2) + B(I1,I2-1)
      ENDDO
    ENDDO
  Perfectly nested loop:
    DO I1 = 1,3
      DO I2 = 1,5
        DO I3 = 0,1
          IF (I2.EQ.1 .AND. I3.EQ.0) THEN
            A(I1) = A(I1-1)
          ELSE IF (I3.EQ.1) THEN
            B(I1-1,I2) = B(I1-2,I2) + B(I1-1,I2-1)
          ENDIF
        ENDDO
      ENDDO
    ENDDO

  47. 7.2 Exploit parallelism with UT

  48. 8. Applications

  Program               | Category  | Depth | Form        | Pattern     | Transformation
  ----------------------|-----------|-------|-------------|-------------|---------------------
  Example 1             | Tutorial  | 1     | Perfect     | Uniform     | N/A
  Example 2             | Tutorial  | 2     | Perfect     | Non-uniform | N/A
  Example 3             | Tutorial  | 3     | Perfect     | Uniform     | Wavefront UT
  Example 4             | Tutorial  | 2     | Perfect     | Uniform     | Wavefront UT
  Example 5             | Tutorial  | 2+1   | Perfect     | Uniform     | Stmt Partitioning UT
  Example 6             | Tutorial  | 2+1   | Non-perfect | Uniform     | Wavefront UT
  Matrix multiplication | Algorithm | 3     | Perfect     | Uniform     | Parallelization
  Gauss-Jordan          | Algorithm | 3     | Perfect     | Non-uniform | Parallelization
  FFT                   | Algorithm | 3     | Perfect     | Non-uniform | Parallelization
  Cholesky              | Benchmark | 4     | Non-perfect | Non-uniform | Partitioning UT
  TOMCATV               | Benchmark | 3     | Non-perfect | Uniform     | Parallelization
  Flow3D                | CFD App.  | 3     | Perfect     | Uniform     | Wavefront UT

  49. 9. Future considerations
  • Weighted dependence graph
  • More semantics on data locality: data space graph, data communication graph, data reuse in the iteration space graph
  • More loop transformations: affine (statement) iteration space mappings, automatic statement distribution, integration with the Omega library
