It’s Elemental! ICS’13, Eugene, OR. Bryan Marker, Jack Poulson, Robert van de Geijn
What you will learn • Fundamentals of collective communication • A systematic approach to parallelizing Dense Linear Algebra (DLA) operations • An instantiation of the ideas in a new distributed-memory parallel DLA library: Elemental • How abstraction supports DxT, a new way of developing libraries: • Encode expert knowledge • Transform algorithms into optimized libraries
Introducing the instructors • Bryan Marker • Ph.D. candidate in CS at UT-Austin • Research focuses on automating the expert • Jack Poulson • Recent Ph.D. from CSEM at UT-Austin • Postdoc at Stanford • Soon to join GATech as a faculty member • Robert van de Geijn • Professor of Computer Science, UT-Austin • Member, ICES, UT-Austin
Agenda • 8:30-8:35 Introduction (van de Geijn) • 8:35-8:50 Collective communication (van de Geijn) • 8:50-9:40 Parallelizing DLA algorithms (van de Geijn) • 9:40-10:00 Using Elemental (Poulson) • 10:00-10:30 Break • 10:30-11:05 Using Elemental (cont.) (Poulson) • 11:05-12:00 Design by Transformation (Marker)
Collective Communication Robert van de Geijn
Resources Available at http://www.stanford.edu/~poulson/ics13/. • Link to Collective communication: theory, practice, and experience. Ernie Chan, Marcel Heimlich, Avi Purkayastha, Robert van de Geijn. Concurrency and Computation: Practice & Experience, Volume 19, Issue 1, September 2007 • Link to a more extensive PPT presentation (with animations) • Link to these slides
What you will learn • What collective communications are important for DLA • What these operations cost • How collective communications compose 700+ slides condensed into a dozen or so…
Why Should I Care? • Collective communication is used by many distributed memory algorithms/applications • Experts and novices alike often have misconceptions • What is the cost of a broadcast of n items (bytes) among p processing nodes? • If your answer is ⌈log₂ p⌉(α + n β) or (p−1)(α + n β), you are at best partially correct and will benefit from this presentation!
Model of Parallel Computation • p (processing) nodes • For simplicity, we will not consider “hybrid” nodes (e.g., one MPI process per node/core) • Indexed 0, … , p-1 • Some connection network • Linear array of nodes suffices for our explanations! • Many clusters are fully connected via some switch
Model of Parallel Computation (continued) • A node can send directly to any other node • A node can simultaneously receive and send • Cost of communication • Sending a message of length n between any two nodes costs α + n β, where α is the latency (startup cost), β is the cost per item, and n is the amount of data
Common Collective Communications • Broadcast • Reduce(-to-one) • Scatter • Gather • Allgather • Reduce-scatter • Allreduce • All-to-all
Broadcast/Reduce(-to-one) [figure: a broadcast duplicates a vector from one node onto all nodes; a reduce(-to-one) sums the vectors from all nodes onto one node]
Scatter/Gather [figure: a scatter partitions a vector on one node among all nodes; a gather collects the pieces back onto one node]
Allgather/Reduce-scatter [figure: an allgather leaves every node with a copy of the concatenated pieces; a reduce-scatter sums the vectors and partitions the result among the nodes]
Allreduce [figure: before, each node holds a local vector; after, every node holds the element-wise sum]
All-to-all [figure: before, each node holds a piece destined for every other node; after, the pieces have been exchanged among the nodes]
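For concreteness, each of these collectives maps directly onto an MPI call. The following is a minimal sketch (not from the tutorial materials) showing the mapping for a vector of n doubles, assuming p divides n and using MPI_SUM as the reduction operator:

```cpp
// Minimal sketch: the eight collectives and their MPI counterparts.
#include <mpi.h>
#include <vector>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    const int n = 8 * p;                       // total items; p divides n
    std::vector<double> x(n, rank), y(n), piece(n / p);

    MPI_Bcast(x.data(), n, MPI_DOUBLE, 0, comm);                       // broadcast
    MPI_Reduce(x.data(), y.data(), n, MPI_DOUBLE, MPI_SUM, 0, comm);   // reduce(-to-one)
    MPI_Scatter(x.data(), n / p, MPI_DOUBLE,
                piece.data(), n / p, MPI_DOUBLE, 0, comm);             // scatter
    MPI_Gather(piece.data(), n / p, MPI_DOUBLE,
               y.data(), n / p, MPI_DOUBLE, 0, comm);                  // gather
    MPI_Allgather(piece.data(), n / p, MPI_DOUBLE,
                  y.data(), n / p, MPI_DOUBLE, comm);                  // allgather
    MPI_Reduce_scatter_block(x.data(), piece.data(),
                             n / p, MPI_DOUBLE, MPI_SUM, comm);        // reduce-scatter
    MPI_Allreduce(x.data(), y.data(), n, MPI_DOUBLE, MPI_SUM, comm);   // allreduce
    MPI_Alltoall(x.data(), n / p, MPI_DOUBLE,
                 y.data(), n / p, MPI_DOUBLE, comm);                   // all-to-all

    MPI_Finalize();
    return 0;
}
```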
Lower bounds • Why? • Keeps you from looking for a better implementation when none exists • Allows one to judge how good a collective communication algorithm is • Allows one to perform lower bound analyses of entire algorithms/applications
Lower Bounds (startup, bandwidth, computation; γ is the time per item of the local reduction operation) • Broadcast: ⌈log₂ p⌉ α, n β • Reduce(-to-one): ⌈log₂ p⌉ α, n β, ((p−1)/p) n γ • Scatter/Gather: ⌈log₂ p⌉ α, ((p−1)/p) n β • Allgather: ⌈log₂ p⌉ α, ((p−1)/p) n β • Reduce-scatter: ⌈log₂ p⌉ α, ((p−1)/p) n β, ((p−1)/p) n γ • Allreduce: ⌈log₂ p⌉ α, 2((p−1)/p) n β, ((p−1)/p) n γ • All-to-all: ⌈log₂ p⌉ α, ((p−1)/p) n β
Cost estimates • We will use the sum of the lower bounds as an estimate of the cost of communication • These estimates are within a constant factor of the cost of practical algorithms on most topologies • Exception: all-to-all. But that communication contributes only a lower-order cost in our analyses.
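A small sketch of how such cost estimates might be packaged, assuming the lower-bound terms from the Chan et al. paper linked above (α per message startup, β per item transferred, γ per item reduced); the struct and function names are illustrative, not Elemental's:

```cpp
#include <cmath>

// Hedged sketch: each collective's estimated time is taken as the sum of its
// lower-bound terms (the remaining collectives follow the same pattern).
struct Machine { double alpha, beta, gamma; };   // startup, per item, per reduction op

inline double lg(double p) { return std::ceil(std::log2(p)); }

double BcastCost(const Machine& m, double n, double p)
{ return lg(p) * m.alpha + n * m.beta; }

double AllgatherCost(const Machine& m, double n, double p)
{ return lg(p) * m.alpha + ((p - 1) / p) * n * m.beta; }

double ReduceScatterCost(const Machine& m, double n, double p)
{ return lg(p) * m.alpha + ((p - 1) / p) * n * (m.beta + m.gamma); }

double AllreduceCost(const Machine& m, double n, double p)
{ return lg(p) * m.alpha + 2 * ((p - 1) / p) * n * m.beta + ((p - 1) / p) * n * m.gamma; }
```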
Composing collectives [figure: how broadcast, reduce, scatter, gather, allgather, reduce-scatter, and allreduce relate and compose]
Allgather [figure: an allgather expressed as a composition of allgathers]
Broadcast [figure: a broadcast expressed as a composition involving broadcasts and an allgather]
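One classic example of composing collectives, from the Chan et al. paper linked earlier, is a long-vector broadcast built from a scatter followed by an allgather, costing roughly 2⌈log₂ p⌉ α + 2((p−1)/p) n β instead of the ⌈log₂ p⌉(α + n β) of a minimum-spanning-tree broadcast. A minimal MPI sketch, assuming n is divisible by p:

```cpp
// Minimal sketch: broadcast n doubles from root 0 by composing a scatter
// with an allgather (the "long-vector" broadcast of Chan et al.).
#include <mpi.h>
#include <vector>

void BroadcastViaScatterAllgather(std::vector<double>& buf, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    const int n = static_cast<int>(buf.size());
    const int nLocal = n / p;                 // assume n % p == 0 for simplicity
    std::vector<double> piece(nLocal);

    // Scatter: root 0 deals out contiguous pieces of the vector.
    MPI_Scatter(buf.data(), nLocal, MPI_DOUBLE,
                piece.data(), nLocal, MPI_DOUBLE, 0, comm);

    // Allgather: every node collects all pieces, completing the broadcast.
    MPI_Allgather(piece.data(), nLocal, MPI_DOUBLE,
                  buf.data(), nLocal, MPI_DOUBLE, comm);
}
```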
Summary • Collective communications • Widely used abstractions (building blocks) • Cost can be modeled • Composable • A sequence of collective communications can be transformed into another collective • A single collective can be transformed into a sequence • What we will see next is that • They are important in DLA • Composability is key to exposing the system • Composability is key to optimization • Systematic approach is key to automation
Agenda • 8:30-8:35 Introduction (van de Geijn) • 8:35-8:50 Collective communication (van de Geijn) • 8:50-9:40 Parallelizing DLA algorithms (van de Geijn) • 9:40-10:00 Using Elemental (Poulson) • 10:00-10:30 Break • 10:30-11:05 Using Elemental (cont.) (Poulson) • 11:05-12:00 Design by Transformation (Marker)
Parallel Dense Linear Algebra Algorithms Robert van de Geijn
Resources Available at http://www.stanford.edu/~poulson/ics13/. • Link to Parallel Matrix Multiplication: 2D and 3D Martin Schatz, Jack Poulson, and Robert van de Geijn. FLAME Working Note #62. The University of Texas at Austin, Department of Computer Science. Technical Report TR-12-13. June 2012. • Link to these slides
What you will learn • Why collective communications are important for DLA • The elemental cyclic data distribution that underlies the Elemental library • A systematic approach to parallelizing DLA algorithms targeting distributed memory architectures Correct by construction
Why Should I Care? • Illustrates how collectives are used in practice. • DLA is at the bottom of the food chain • Cannon’s algorithm for matrix-matrix multiplication is not used in practice. We show practical algorithms. • Illustrates and explains what happens underneath the Elemental library. • Shows how an expert systematically parallelizes DLA algorithms. • Suggests the approach is systematic to the point where it can be automated. Correct by construction
Overview • Parallelizing matrix-vector operations • 2D parallel algorithms for matrix-matrix multiplication • Summary
Elemental 2D Distribution • View p (MPI) processes as an r × c mesh • Needed for scalability • “Wrap” the matrix onto the mesh in both directions • For many matrix computations (e.g., Cholesky factorization) the active matrix “shrinks” as the computation proceeds. Wrapping maintains load balance in this case.
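As an illustration of the element-cyclic wrap, a small sketch (assuming zero alignments; the helper names are ours, not Elemental's) of which process owns entry (i, j) and where it lives locally:

```cpp
// Element-cyclic ("elemental") 2D wrap of an m x n matrix onto an r x c mesh,
// assuming both alignments are zero: entry (i, j) is stored by the process in
// mesh row i % r and mesh column j % c, at local position (i / r, j / c).
struct MeshLoc { int meshRow, meshCol, iLocal, jLocal; };

MeshLoc Locate(int i, int j, int r, int c)
{
    return MeshLoc{ i % r, j % c, i / r, j / c };
}

// The local matrix on process (s, t) therefore has dimensions roughly m/r by
// n/c (exactly ceil((m - s)/r) by ceil((n - t)/c)), so the "active" part of a
// shrinking matrix stays balanced across the mesh.
```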
Parallel Matrix-Vector Multiplication • Allgather x within columns • Local matrix-vector multiply • Reduce-scatter y within rows • Total:
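A schematic sketch of this algorithm, not Elemental's implementation: it assumes a simple block wrap with m and n divisible by r·c and communicators colComm/rowComm spanning each mesh column and row; the communication pattern is the same under the element-cyclic wrap:

```cpp
// Schematic sketch of y := A x on an r x c process mesh.
#include <mpi.h>
#include <vector>

void ParallelGemv(int m, int n, int r, int c,
                  const std::vector<double>& Alocal,   // (m/r) x (n/c), column-major
                  const std::vector<double>& xlocal,   // n/(r*c) entries of x
                  std::vector<double>& ylocal,         // m/(r*c) entries of y (output)
                  MPI_Comm colComm,                    // the r processes in my mesh column
                  MPI_Comm rowComm)                    // the c processes in my mesh row
{
    const int mLocal = m / r, nLocal = n / c;

    // 1) Allgather x within columns: collect the n/c entries of x that
    //    correspond to my block column of A.
    std::vector<double> xcol(nLocal);
    MPI_Allgather(xlocal.data(), n / (r * c), MPI_DOUBLE,
                  xcol.data(),   n / (r * c), MPI_DOUBLE, colComm);

    // 2) Local matrix-vector multiply: partial contribution to my block row of y.
    std::vector<double> ypartial(mLocal, 0.0);
    for (int j = 0; j < nLocal; ++j)
        for (int i = 0; i < mLocal; ++i)
            ypartial[i] += Alocal[i + j * mLocal] * xcol[j];

    // 3) Reduce-scatter y within rows: sum the c partial vectors and leave
    //    each process with its m/(r*c) entries of y.
    ylocal.resize(m / (r * c));
    MPI_Reduce_scatter_block(ypartial.data(), ylocal.data(),
                             m / (r * c), MPI_DOUBLE, MPI_SUM, rowComm);
}
```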
Some Terminology • Overhead • Speedup • Efficiency • Relative overhead
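In the usual definitions (a sketch; the slides may phrase these slightly differently), with T(1) the best sequential time and T(p) the parallel time on p nodes:

```latex
% Sketch of the standard definitions.
\text{Speedup: } S(p) = \frac{T(1)}{T(p)}, \qquad
\text{Efficiency: } E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}, \qquad
\text{Overhead: } T_o(p) = p\,T(p) - T(1), \qquad
\text{Relative overhead: } R(p) = \frac{T_o(p)}{T(1)},
\quad\text{so that}\quad E(p) = \frac{1}{1 + R(p)}.
```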
Asymptotic Analysis: weakly scalable for practical purposes
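A sketch of the kind of argument meant here, for an n × n matrix-vector multiply on a √p × √p mesh, using the cost estimates above:

```latex
% Sketch only: order-of-magnitude analysis under the cost estimates above.
T(p) \;\approx\; \underbrace{\frac{2n^2}{p}\,\gamma}_{\text{local gemv}}
      \;+\; O(\log_2 p)\,\alpha
      \;+\; O\!\left(\frac{n}{\sqrt{p}}\right)(\beta + \gamma)
      \qquad (\text{allgather } x \text{ and reduce-scatter } y),
\qquad
R(p) \;=\; \frac{p\,T(p) - T(1)}{T(1)}
     \;\approx\; O\!\left(\frac{p \log_2 p}{n^2}\right)\frac{\alpha}{\gamma}
     \;+\; O\!\left(\frac{\sqrt{p}}{n}\right)\frac{\beta + \gamma}{\gamma}.
```

If the memory per node is held constant (n²/p fixed, so n grows like √p), the β term stays bounded and only the α term grows, and only like log₂ p, which is what is meant by "weakly scalable for practical purposes."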
Parallel Rank-1 Update • Allgather x within columns • Allgather y within rows • Local rank-1 update • Total:
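A schematic sketch under the same simplifying assumptions as the matrix-vector example above (block wrap, dimensions divisible by r·c, colComm/rowComm communicators):

```cpp
// A := A + x y^T on an r x c mesh: allgather x within columns, allgather y
// within rows, then a purely local rank-1 update -- no reduction is needed.
#include <mpi.h>
#include <vector>

void ParallelRank1Update(int m, int n, int r, int c,
                         std::vector<double>& Alocal,        // (m/r) x (n/c), column-major
                         const std::vector<double>& xlocal,  // m/(r*c) entries of x
                         const std::vector<double>& ylocal,  // n/(r*c) entries of y
                         MPI_Comm colComm, MPI_Comm rowComm)
{
    const int mLocal = m / r, nLocal = n / c;
    std::vector<double> xcol(mLocal), yrow(nLocal);

    // Allgather x within columns: the m/r entries matching my block row of A.
    MPI_Allgather(xlocal.data(), m / (r * c), MPI_DOUBLE,
                  xcol.data(),   m / (r * c), MPI_DOUBLE, colComm);
    // Allgather y within rows: the n/c entries matching my block column of A.
    MPI_Allgather(ylocal.data(), n / (r * c), MPI_DOUBLE,
                  yrow.data(),   n / (r * c), MPI_DOUBLE, rowComm);

    // Local rank-1 update.
    for (int j = 0; j < nLocal; ++j)
        for (int i = 0; i < mLocal; ++i)
            Alocal[i + j * mLocal] += xcol[i] * yrow[j];
}
```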
Weak Scalability • Cost of parallel rank-1 update: • Is the rank-1 update algorithm weakly scalable? • Hint: Cost of weakly scalable parallel matrix-vector multiply:
Of Vectors, Rows, and Columns • Often the vectors in a matvec or rank-1 update show up as rows or columns of a matrix