
It’s Elemental! ICS’13, Eugene, OR


Presentation Transcript


  1. It’s Elemental! ICS’13, Eugene, OR • Bryan Marker • Jack Poulson • Robert van de Geijn

  2. What will you learn • Fundamentals of collective communication • A systematic approach to parallelizing Dense Linear Algebra (DLA) operations • An instantiation of the ideas in a new distributed memory parallel DLA library: Elemental • How abstraction supports DxT, a new way of developing libraries: • Encode expert knowledge • Transform algorithms into optimized libraries

  3. Introducing the instructors • Bryan Marker • Ph.D. candidate in CS at UT-Austin • Research focuses on automating the expert • Jack Poulson • Recent Ph.D. from CSEM at UT-Austin • Postdoc at Stanford • Soon to join GATech as a faculty member • Robert van de Geijn • Professor of Computer Science, UT-Austin • Member, ICES, UT-Austin

  4. Agenda • 8:30-8:35 Introduction (van de Geijn) • 8:35-8:50 Collective communication (van de Geijn) • 8:50-9:40 Parallelizing DLA algorithms (van de Geijn) • 9:40-10:00 Using Elemental (Poulson) • 10:00-10:30 Break • 10:30-11:05 Using Elemental (cont.) (Poulson) • 11:05-12:00 Design by Transformation (Marker)

  5. Agenda • 8:30-8:35 Introduction (van de Geijn) • 8:35-8:50 Collective communication (van de Geijn) • 8:50-9:40 Parallelizing DLA algorithms (van de Geijn) • 9:40-10:00 Using Elemental (Poulson) • 10:00-10:30 Break • 10:30-11:05 Using Elemental (cont.) (Poulson) • 11:05-12:00 Design by Transformation (Marker)

  6. Collective Communication Robert van de Geijn

  7. Resources Available at http://www.stanford.edu/~poulson/ics13/. • Link to Collective communication: theory, practice, and experience. Ernie Chan, Marcel Heimlich, Avi Purkayastha, Robert van de Geijn. Concurrency and Computation: Practice & Experience, Volume 19 Issue 1, September 2007 • Link to a more extensive PPT presentation (with animations) • Link to these slides

  8. What you will learn • What collective communications are important for DLA • What these operations cost • How collective communications compose 700+ slides condensed into a dozen or so…

  9. Why Should I Care? • Collective communication is used by many distributed memory algorithms/applications • Experts and novices alike often have misconceptions • What is the cost of a broadcast of n items (bytes) among p processing nodes? • If your answer is ⌈log₂ p⌉(α + n β), or something similar, you are at best partially correct and will benefit from this presentation!
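
For concreteness, here is a worked comparison (using the α = latency, β = per-item cost model defined two slides later; the exact formulas on the original slide were in an image) between the familiar minimum-spanning-tree broadcast and the lower bound:

    \[
      T_{\mathrm{MST\,bcast}}(p,n) = \lceil \log_2 p \rceil\,(\alpha + n\beta)
      \qquad\text{vs.}\qquad
      T_{\mathrm{lower\,bound}}(p,n) \approx \lceil \log_2 p \rceil\,\alpha + n\beta .
    \]

The MST algorithm is latency-optimal but sends roughly log₂ p times too much data for long messages, which is why "⌈log₂ p⌉(α + n β)" is only partially correct as an answer.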

  10. Model of Parallel Computation • p (processing) nodes • For simplicity, we will not consider “hybrid” nodes (e.g., one MPI process per node/core) • Indexed 0, … , p-1 • Some connection network • Linear array of nodes suffices for our explanations! • Many clusters are fully connected via some switch

  11. Model of Parallel Computation (continued) • A node can send directly to any other node • A node can simultaneously receive and send • Cost of communication • Sending a message of length n between any two nodes costs α + n β, where α is the latency (startup cost), β is the cost per item, and n is the amount of data
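
As an illustrative worked example (the values of α and β below are made up for illustration, not measurements from the slides), the two terms dominate at opposite ends of the message-size range. With α = 1 μs and β = 1 ns per byte:

    \[
      T(10\ \text{bytes}) = \alpha + 10\,\beta \approx 1\,\mu\text{s} + 0.01\,\mu\text{s} \approx 1\,\mu\text{s} \quad (\text{latency-dominated}),
    \]
    \[
      T(10\ \text{MB}) = \alpha + 10^{7}\,\beta \approx 1\,\mu\text{s} + 10^{4}\,\mu\text{s} \approx 10\ \text{ms} \quad (\text{bandwidth-dominated}).
    \]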

  12. Common Collective Communications • Broadcast • Reduce(-to-one) • Scatter • Gather • Allgather • Reduce-scatter • Allreduce • All-to-all
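
For readers who know MPI, these eight collectives map directly onto MPI calls. A minimal sketch follows; the buffer sizes, root rank 0, and the use of MPI_COMM_WORLD are illustrative choices, not something taken from the slides.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char* argv[]) {
      MPI_Init(&argc, &argv);
      int p, rank;
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int n = 4;  // items contributed per process (illustrative)
      std::vector<double> x(n, rank), y(n), all(n * p), recv(n * p);
      std::vector<int> counts(p, n);

      MPI_Bcast(x.data(), n, MPI_DOUBLE, 0, MPI_COMM_WORLD);                      // Broadcast
      MPI_Reduce(x.data(), y.data(), n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);  // Reduce(-to-one)
      MPI_Scatter(all.data(), n, MPI_DOUBLE, x.data(), n, MPI_DOUBLE, 0,
                  MPI_COMM_WORLD);                                                // Scatter
      MPI_Gather(x.data(), n, MPI_DOUBLE, all.data(), n, MPI_DOUBLE, 0,
                 MPI_COMM_WORLD);                                                 // Gather
      MPI_Allgather(x.data(), n, MPI_DOUBLE, all.data(), n, MPI_DOUBLE,
                    MPI_COMM_WORLD);                                              // Allgather
      MPI_Reduce_scatter(all.data(), x.data(), counts.data(), MPI_DOUBLE, MPI_SUM,
                         MPI_COMM_WORLD);                                         // Reduce-scatter
      MPI_Allreduce(x.data(), y.data(), n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);  // Allreduce
      MPI_Alltoall(all.data(), n, MPI_DOUBLE, recv.data(), n, MPI_DOUBLE,
                   MPI_COMM_WORLD);                                               // All-to-all
      MPI_Finalize();
      return 0;
    }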

  13. Broadcast/Reduce(-to-one) [diagram]

  14. Scatter/Gather [diagram]

  15. Allgather/Reduce-scatter [diagram]

  16. Allreduce [diagram: before/after]

  17. All-to-all [diagram: before/after]

  18. Lower bounds • Why? • Keeps you from looking for a better implementation when none exists • Allows one to judge how good a collective communication algorithm is • Allows one to perform lower bound analyses of entire algorithms/applications

  19. Lower Bounds (startup; bandwidth/computation, with γ the cost per item of the reduction operation) • Broadcast: ⌈log₂ p⌉ α; n β • Reduce(-to-one): ⌈log₂ p⌉ α; n β + ((p−1)/p) n γ • Scatter/Gather: ⌈log₂ p⌉ α; ((p−1)/p) n β • Allgather: ⌈log₂ p⌉ α; ((p−1)/p) n β • Reduce-scatter: ⌈log₂ p⌉ α; ((p−1)/p) n β + ((p−1)/p) n γ • Allreduce: ⌈log₂ p⌉ α; 2((p−1)/p) n β + ((p−1)/p) n γ • All-to-all: ⌈log₂ p⌉ α; ((p−1)/p) n β

  20. Cost estimates • We will use the sum of the lower bounds as an estimate of the cost of communication • Within a constant factor of practical algorithms on most topologies. • Exception: all-to-all. But that communication only contributes a lower order cost in our analyses.
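
For example, under this convention (and taking the lower bounds from slide 19) the cost estimates used in the analyses that follow are of the form

    \[
      T_{\mathrm{allgather}}(p,n) \approx \lceil \log_2 p \rceil\,\alpha + \frac{p-1}{p}\,n\beta,
      \qquad
      T_{\mathrm{allreduce}}(p,n) \approx \lceil \log_2 p \rceil\,\alpha + 2\,\frac{p-1}{p}\,n\beta + \frac{p-1}{p}\,n\gamma .
    \]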

  21. Composing collectives [diagram: a Reduce(-to-one) followed by a Broadcast, or a Reduce-scatter followed by an Allgather, yields an Allreduce; a Reduce followed by a Scatter yields a Reduce-scatter; a Gather followed by a Broadcast yields an Allgather]
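
A minimal MPI sketch of one of these compositions: an Allreduce realized as a Reduce-scatter followed by an Allgather (assuming, for simplicity, that p divides the vector length n; helper name and MPI_SUM are illustrative).

    #include <mpi.h>
    #include <vector>

    // After this call, buf holds the element-wise sum over all processes,
    // exactly as MPI_Allreduce(..., MPI_SUM, ...) would produce.
    void allreduce_composed(std::vector<double>& buf, MPI_Comm comm) {
      int p, rank;
      MPI_Comm_size(comm, &p);
      MPI_Comm_rank(comm, &rank);
      const int n = static_cast<int>(buf.size());
      const int nloc = n / p;                 // assumes p divides n
      std::vector<int> counts(p, nloc);
      std::vector<double> piece(nloc);

      // Step 1: Reduce-scatter leaves each process with its summed piece.
      MPI_Reduce_scatter(buf.data(), piece.data(), counts.data(),
                         MPI_DOUBLE, MPI_SUM, comm);
      // Step 2: Allgather collects all summed pieces on every process.
      MPI_Allgather(piece.data(), nloc, MPI_DOUBLE,
                    buf.data(), nloc, MPI_DOUBLE, comm);
    }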

  22. Allgather [diagram]

  23. Broadcast [diagram]

  24. Summary • Collective communications • Widely used abstractions (building blocks) • Cost can be modeled • Composable • A sequence of collective communications can be transformed into another collective • A single collective can be transformed into a sequence • What we will see next is that • They are important in DLA • Composability is key to exposing system • Composability is key to optimization • Systematic approach is key to automation

  25. Agenda • 8:30-8:35 Introduction (van de Geijn) • 8:35-8:50 Collective communication (van de Geijn) • 8:50-9:40 Parallelizing DLA algorithms (van de Geijn) • 9:40-10:00 Using Elemental (Poulson) • 10:00-10:30 Break • 10:30-11:05 Using Elemental (cont.) (Poulson) • 11:05-12:00 Design by Transformation (Marker)

  26. Parallel Dense Linear Algebra Algorithms Robert van de Geijn

  27. Resources Available at http://www.stanford.edu/~poulson/ics13/. • Link to Parallel Matrix Multiplication: 2D and 3D Martin Schatz, Jack Poulson, and Robert van de Geijn. FLAME Working Note #62. The University of Texas at Austin, Department of Computer Science. Technical Report TR-12-13. June 2012. • Link to these slides

  28. What you will learn • Why collective communications are important for DLA • The elemental cyclic data distribution that underlies the Elemental library • A systematic approach to parallelizing DLA algorithms targeting distributed memory architectures Correct by construction

  29. Why Should I Care? • Illustrates how collectives are used in practice. • DLA is at the bottom of the food chain • Cannon’s algorithm for matrix-matrix multiplication is not used in practice. We show practical algorithms. • Illustrates and explains what happens underneath the Elemental library. • Shows how an expert systematically parallelizes DLA algorithms. • Suggests the approach is systematic to the point where it can be automated. Correct by construction

  30. Overview • Parallelizing matrix-vector operations • 2D parallel algorithms for matrix-matrix multiplication • Summary

  31. Elemental 2D Distribution • View p (MPI) processes as an r x c mesh • Needed for scalability • “Wrap” matrix onto mesh in both directions • For many matrix computations (e.g., Cholesky factorization) the active matrix “shrinks” as the computation proceeds. Wrapping maintains load balance in this case.
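
A concrete (simplified) picture of this distribution: ignoring Elemental's alignment parameters, global entry (i, j) lives on the process at grid position (i mod r, j mod c) and is stored there as local entry (i / r, j / c). The sketch below uses hypothetical helper names; it is not Elemental's API.

    #include <utility>

    struct Grid { int r, c; };  // process grid dimensions (hypothetical helper)

    // Grid (row, column) of the process owning global entry (i, j).
    std::pair<int,int> OwnerOf(int i, int j, const Grid& g) {
      return { i % g.r, j % g.c };
    }

    // Local (row, column) index of entry (i, j) on its owning process.
    std::pair<int,int> LocalIndex(int i, int j, const Grid& g) {
      return { i / g.r, j / g.c };
    }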

  32. Parallel Matrix-Vector Multiplication

  33. Parallel Matrix-Vector Multiplication • Allgather x within columns • Local matrix-vector multiply • Reduce-scatter y within rows • Total:
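
A minimal MPI sketch of these three steps for y := A x on an r x c process grid. The helper name, count/displacement arrays, and column-major local storage are assumptions made for illustration; this is not Elemental's API, and how x and y align with the grid depends on the distribution details.

    #include <mpi.h>
    #include <vector>

    void ParallelGemv(const std::vector<double>& Alocal, int mloc, int nloc,
                      const std::vector<double>& xlocal,
                      std::vector<double>& ylocal,
                      MPI_Comm colComm, MPI_Comm rowComm,
                      const std::vector<int>& xCounts,   // pieces of x within my process column
                      const std::vector<int>& xDispls,
                      const std::vector<int>& yCounts) { // pieces of y within my process row
      // 1. Allgather x within the process column, so every process has the
      //    entries of x that its local block of A needs.
      std::vector<double> xgath(nloc);
      MPI_Allgatherv(xlocal.data(), static_cast<int>(xlocal.size()), MPI_DOUBLE,
                     xgath.data(), xCounts.data(), xDispls.data(), MPI_DOUBLE,
                     colComm);
      // 2. Local matrix-vector multiply with the mloc x nloc block (column-major).
      std::vector<double> ypart(mloc, 0.0);
      for (int j = 0; j < nloc; ++j)
        for (int i = 0; i < mloc; ++i)
          ypart[i] += Alocal[i + j * mloc] * xgath[j];
      // 3. Reduce-scatter the partial results within the process row, leaving
      //    each process with its piece of y.
      MPI_Reduce_scatter(ypart.data(), ylocal.data(), yCounts.data(),
                         MPI_DOUBLE, MPI_SUM, rowComm);
    }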

  34. Some Terminology • Overhead • Speedup • Efficiency • Relative overhead
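
The formulas on the original slide were in an image; the standard definitions, in terms of the sequential time T(1) and the parallel time on p nodes T(p), are:

    \[
      \text{Speedup: } S(p) = \frac{T(1)}{T(p)}, \qquad
      \text{Efficiency: } E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)},
    \]
    \[
      \text{Overhead: } T_o(p) = p\,T(p) - T(1), \qquad
      \text{Relative overhead: } \frac{T_o(p)}{T(1)} = \frac{1}{E(p)} - 1 .
    \]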

  35. Asymptotic Analysis Weakly scalable for practical purposes

  36. Parallel Rank-1 Update

  37. Parallel Rank-1 Update • Allgather x within columns • Allgather y within rows • Local rank-1 update • Total:
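
A minimal MPI sketch of the rank-1 update A := A + x yᵀ, mirroring the three steps above. As with the matvec sketch, the helper name, count/displacement arrays, and column-major local storage are illustrative assumptions, and which communicator each vector is gathered over depends on how x and y are aligned with the grid.

    #include <mpi.h>
    #include <vector>

    void ParallelGer(std::vector<double>& Alocal, int mloc, int nloc,
                     const std::vector<double>& xlocal,
                     const std::vector<double>& ylocal,
                     MPI_Comm colComm, MPI_Comm rowComm,
                     const std::vector<int>& xCounts, const std::vector<int>& xDispls,
                     const std::vector<int>& yCounts, const std::vector<int>& yDispls) {
      // 1. Allgather x within the process column (following the slide's wording).
      std::vector<double> xgath(mloc);
      MPI_Allgatherv(xlocal.data(), static_cast<int>(xlocal.size()), MPI_DOUBLE,
                     xgath.data(), xCounts.data(), xDispls.data(), MPI_DOUBLE, colComm);
      // 2. Allgather y within the process row.
      std::vector<double> ygath(nloc);
      MPI_Allgatherv(ylocal.data(), static_cast<int>(ylocal.size()), MPI_DOUBLE,
                     ygath.data(), yCounts.data(), yDispls.data(), MPI_DOUBLE, rowComm);
      // 3. Local rank-1 update of the mloc x nloc block; no further communication.
      for (int j = 0; j < nloc; ++j)
        for (int i = 0; i < mloc; ++i)
          Alocal[i + j * mloc] += xgath[i] * ygath[j];
    }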

  38. Weak Scalability • Cost of parallel rank-1 update: • Is the rank-1 update algorithm weakly scalable? • Hint: Cost of weakly scalable parallel matrix-vector multiply:
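
A sketch of the standard argument, under the assumptions that r = c = √p and that memory per node is held constant (n²/p fixed); the exact cost expressions on the original slides were in images. Both algorithms perform about 2n²/p γ of local computation, and the rank-1 update's two allgathers move O(n/√p) data per node, just as the matvec's allgather and reduce-scatter do, so

    \[
      \text{relative overhead} \;\approx\;
      \frac{C_1 \log_2(p)\,\alpha + C_2 \tfrac{n}{\sqrt{p}}\,\beta}{\tfrac{2n^2}{p}\,\gamma}
      \;=\;
      \frac{C_1\, p \log_2(p)\,\alpha}{2 n^2 \gamma} \;+\; \frac{C_2 \sqrt{p}\,\beta}{2 n \gamma}.
    \]

With n²/p fixed, the bandwidth term is constant and the latency term grows only as log₂ p, so, like the matrix-vector multiply, the rank-1 update is weakly scalable for practical purposes.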

  39. Of Vectors, Rows, and Columns • Often the vectors in a matvec or rank-1 update show up as rows or columns of a matrix

  40. Of Vectors, Rows, and Columns • Often the vectors in a matvec or rank-1 update show up as rows or columns of a matrix
