It’s Elemental! ICS’13, Eugene, OR. Bryan Marker, Jack Poulson, Robert van de Geijn
What you will learn • Fundamentals of collective communication • A systematic approach to parallelizing Dense Linear Algebra (DLA) operations • An instantiation of the ideas in a new distributed-memory parallel DLA library: Elemental • How abstraction supports DxT, a new way of developing libraries: • Encode expert knowledge • Transform algorithms into optimized libraries
Introducing the instructors • Bryan Marker • Ph.D. candidate in CS at UT-Austin • Research focuses on automating the expert • Jack Poulson • Recent Ph.D. from CSEM at UT-Austin • Postdoc at Stanford • Soon to join GATech as a faculty member • Robert van de Geijn • Professor of Computer Science, UT-Austin • Member, ICES, UT-Austin
Agenda • 8:30-8:35 Introduction (van de Geijn) • 8:35-8:50 Collective communication (van de Geijn) • 8:50-9:40 Parallelizing DLA algorithms (van de Geijn) • 9:40-10:00 Using Elemental (Poulson) • 10:00-10:30 Break • 10:30-11:05 Using Elemental (cont.) (Poulson) • 11:05-12:00 Design by Transformation (Marker)
Collective Communication Robert van de Geijn
Resources Available at http://www.stanford.edu/~poulson/ics13/. • Link to Collective communication: theory, practice, and experience. Ernie Chan, Marcel Heimlich, Avi Purkayastha, Robert van de Geijn. Concurrency and Computation: Practice & Experience, Volume 19, Issue 1, September 2007 • Link to a more extensive PPT presentation (with animations) • Link to these slides
What you will learn • What collective communications are important for DLA • What these operations cost • How collective communications compose 700+ slides condensed into a dozen or so…
Why Should I Care? • Collective communication is used by many distributed memory algorithms/applications • Experts and novices alike often have misconceptions • What is the cost of a broadcast of n items (bytes) among p processing nodes? • If your answer is ⌈log₂ p⌉(α + n β) or (p−1)(α + n β), you are at best partially correct and will benefit from this presentation!
Model of Parallel Computation • p (processing) nodes • For simplicity, we will not consider “hybrid” nodes (e.g., one MPI process per node/core) • Indexed 0, … , p-1 • Some connection network • Linear array of nodes suffices for our explanations! • Many clusters are fully connected via some switch
Model of Parallel Computation (continued) • A node can send directly to any other node • A node can simultaneously receive and send • Cost of communication • Sending a message of length n between any two nodes costs α + n β, where α is the latency (startup cost), β is the cost per item, and n is the amount of data
Common Collective Communications • Broadcast • Reduce(-to-one) • Scatter • Gather • Allgather • Reduce-scatter • Allreduce • All-to-all
Broadcast/Reduce(-to-one) [figure: a broadcast duplicates a vector from one node onto all nodes; a reduce(-to-one) sums the vectors from all nodes onto one node]
Scatter/Gather [figure: a scatter partitions a vector on one node among all nodes; a gather collects the pieces back onto one node]
Allgather/Reduce-scatter [figure: an allgather leaves every node with a copy of the concatenated pieces; a reduce-scatter sums the vectors and partitions the result among the nodes]
Allreduce [figure: before, each node holds a local vector; after, every node holds the element-wise sum]
All-to-all [figure: before, each node holds a piece destined for every other node; after, the pieces have been exchanged among the nodes]
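For concreteness, each of these collectives maps directly onto an MPI call. The following is a minimal sketch (not from the tutorial materials) showing the mapping for a vector of n doubles, assuming p divides n and using MPI_SUM as the reduction operator:

```cpp
// Minimal sketch: the eight collectives and their MPI counterparts.
#include <mpi.h>
#include <vector>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    const int n = 8 * p;                       // total items; p divides n
    std::vector<double> x(n, rank), y(n), piece(n / p);

    MPI_Bcast(x.data(), n, MPI_DOUBLE, 0, comm);                       // broadcast
    MPI_Reduce(x.data(), y.data(), n, MPI_DOUBLE, MPI_SUM, 0, comm);   // reduce(-to-one)
    MPI_Scatter(x.data(), n / p, MPI_DOUBLE,
                piece.data(), n / p, MPI_DOUBLE, 0, comm);             // scatter
    MPI_Gather(piece.data(), n / p, MPI_DOUBLE,
               y.data(), n / p, MPI_DOUBLE, 0, comm);                  // gather
    MPI_Allgather(piece.data(), n / p, MPI_DOUBLE,
                  y.data(), n / p, MPI_DOUBLE, comm);                  // allgather
    MPI_Reduce_scatter_block(x.data(), piece.data(),
                             n / p, MPI_DOUBLE, MPI_SUM, comm);        // reduce-scatter
    MPI_Allreduce(x.data(), y.data(), n, MPI_DOUBLE, MPI_SUM, comm);   // allreduce
    MPI_Alltoall(x.data(), n / p, MPI_DOUBLE,
                 y.data(), n / p, MPI_DOUBLE, comm);                   // all-to-all

    MPI_Finalize();
    return 0;
}
```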
Lower bounds • Why? • Keeps you from looking for a better implementation when none exists • Allows one to judge how good a collective communication algorithm is • Allows one to perform lower bound analyses of entire algorithms/applications
Lower Bounds (startup, bandwidth, computation; γ is the time per item of the local reduction operation) • Broadcast: ⌈log₂ p⌉ α, n β • Reduce(-to-one): ⌈log₂ p⌉ α, n β, ((p−1)/p) n γ • Scatter/Gather: ⌈log₂ p⌉ α, ((p−1)/p) n β • Allgather: ⌈log₂ p⌉ α, ((p−1)/p) n β • Reduce-scatter: ⌈log₂ p⌉ α, ((p−1)/p) n β, ((p−1)/p) n γ • Allreduce: ⌈log₂ p⌉ α, 2((p−1)/p) n β, ((p−1)/p) n γ • All-to-all: ⌈log₂ p⌉ α, ((p−1)/p) n β
Cost estimates • We will use the sum of the lower bounds as an estimate of the cost of communication • These estimates are within a constant factor of the cost of practical algorithms on most topologies • Exception: all-to-all. But that communication contributes only a lower-order cost in our analyses.
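A small sketch of how such cost estimates might be packaged, assuming the lower-bound terms from the Chan et al. paper linked above (α per message startup, β per item transferred, γ per item reduced); the struct and function names are illustrative, not Elemental's:

```cpp
#include <cmath>

// Hedged sketch: each collective's estimated time is taken as the sum of its
// lower-bound terms (the remaining collectives follow the same pattern).
struct Machine { double alpha, beta, gamma; };   // startup, per item, per reduction op

inline double lg(double p) { return std::ceil(std::log2(p)); }

double BcastCost(const Machine& m, double n, double p)
{ return lg(p) * m.alpha + n * m.beta; }

double AllgatherCost(const Machine& m, double n, double p)
{ return lg(p) * m.alpha + ((p - 1) / p) * n * m.beta; }

double ReduceScatterCost(const Machine& m, double n, double p)
{ return lg(p) * m.alpha + ((p - 1) / p) * n * (m.beta + m.gamma); }

double AllreduceCost(const Machine& m, double n, double p)
{ return lg(p) * m.alpha + 2 * ((p - 1) / p) * n * m.beta + ((p - 1) / p) * n * m.gamma; }
```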
Composing collectives [figure: how broadcast, reduce, scatter, gather, allgather, reduce-scatter, and allreduce relate and compose]
Allgather [figure: an allgather expressed as a composition of allgathers]
Broadcast [figure: a broadcast expressed as a composition involving broadcasts and an allgather]
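One classic example of composing collectives, from the Chan et al. paper linked earlier, is a long-vector broadcast built from a scatter followed by an allgather, costing roughly 2⌈log₂ p⌉ α + 2((p−1)/p) n β instead of the ⌈log₂ p⌉(α + n β) of a minimum-spanning-tree broadcast. A minimal MPI sketch, assuming n is divisible by p:

```cpp
// Minimal sketch: broadcast n doubles from root 0 by composing a scatter
// with an allgather (the "long-vector" broadcast of Chan et al.).
#include <mpi.h>
#include <vector>

void BroadcastViaScatterAllgather(std::vector<double>& buf, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    const int n = static_cast<int>(buf.size());
    const int nLocal = n / p;                 // assume n % p == 0 for simplicity
    std::vector<double> piece(nLocal);

    // Scatter: root 0 deals out contiguous pieces of the vector.
    MPI_Scatter(buf.data(), nLocal, MPI_DOUBLE,
                piece.data(), nLocal, MPI_DOUBLE, 0, comm);

    // Allgather: every node collects all pieces, completing the broadcast.
    MPI_Allgather(piece.data(), nLocal, MPI_DOUBLE,
                  buf.data(), nLocal, MPI_DOUBLE, comm);
}
```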
Summary • Collective communications • Widely used abstractions (building blocks) • Cost can be modeled • Composable • A sequence of collective communications can be transformed into another collective • A single collective can be transformed into a sequence • What we will see next is that • They are important in DLA • Composability is key to exposing the system • Composability is key to optimization • Systematic approach is key to automation
Agenda • 8:30-8:35 Introduction (van de Geijn) • 8:35-8:50 Collective communication (van de Geijn) • 8:50-9:40 Parallelizing DLA algorithms (van de Geijn) • 9:40-10:00 Using Elemental (Poulson) • 10:00-10:30 Break • 10:30-11:05 Using Elemental (cont.) (Poulson) • 11:05-12:00 Design by Transformation (Marker)
Parallel Dense Linear Algebra Algorithms Robert van de Geijn
Resources Available at http://www.stanford.edu/~poulson/ics13/. • Link to Parallel Matrix Multiplication: 2D and 3D Martin Schatz, Jack Poulson, and Robert van de Geijn. FLAME Working Note #62. The University of Texas at Austin, Department of Computer Science. Technical Report TR-12-13. June 2012. • Link to these slides
What you will learn • Why collective communications are important for DLA • The elemental cyclic data distribution that underlies the Elemental library • A systematic approach to parallelizing DLA algorithms targeting distributed memory architectures Correct by construction
Why Should I Care? • Illustrates how collectives are used in practice. • DLA is at the bottom of the food chain • Cannon’s algorithm for matrix-matrix multiplication is not used in practice. We show practical algorithms. • Illustrates and explains what happens underneath the Elemental library. • Shows how an expert systematically parallelizes DLA algorithms. • Suggests the approach is systematic to the point where it can be automated. Correct by construction
Overview • Parallelizing matrix-vector operations • 2D parallel algorithms for matrix-matrix multiplication • Summary
Elemental 2D Distribution • View p (MPI) processes as an r × c mesh • Needed for scalability • “Wrap” the matrix onto the mesh in both directions • For many matrix computations (e.g., Cholesky factorization) the active matrix “shrinks” as the computation proceeds. Wrapping maintains load balance in this case.
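As an illustration of the element-cyclic wrap, a small sketch (assuming zero alignments; the helper names are ours, not Elemental's) of which process owns entry (i, j) and where it lives locally:

```cpp
// Element-cyclic ("elemental") 2D wrap of an m x n matrix onto an r x c mesh,
// assuming both alignments are zero: entry (i, j) is stored by the process in
// mesh row i % r and mesh column j % c, at local position (i / r, j / c).
struct MeshLoc { int meshRow, meshCol, iLocal, jLocal; };

MeshLoc Locate(int i, int j, int r, int c)
{
    return MeshLoc{ i % r, j % c, i / r, j / c };
}

// The local matrix on process (s, t) therefore has dimensions roughly m/r by
// n/c (exactly ceil((m - s)/r) by ceil((n - t)/c)), so the "active" part of a
// shrinking matrix stays balanced across the mesh.
```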
Parallel Matrix-Vector Multiplication • Allgather x within columns • Local matrix-vector multiply • Reduce-scatter y within rows • Total:
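A schematic sketch of this algorithm, not Elemental's implementation: it assumes a simple block wrap with m and n divisible by r·c and communicators colComm/rowComm spanning each mesh column and row; the communication pattern is the same under the element-cyclic wrap:

```cpp
// Schematic sketch of y := A x on an r x c process mesh.
#include <mpi.h>
#include <vector>

void ParallelGemv(int m, int n, int r, int c,
                  const std::vector<double>& Alocal,   // (m/r) x (n/c), column-major
                  const std::vector<double>& xlocal,   // n/(r*c) entries of x
                  std::vector<double>& ylocal,         // m/(r*c) entries of y (output)
                  MPI_Comm colComm,                    // the r processes in my mesh column
                  MPI_Comm rowComm)                    // the c processes in my mesh row
{
    const int mLocal = m / r, nLocal = n / c;

    // 1) Allgather x within columns: collect the n/c entries of x that
    //    correspond to my block column of A.
    std::vector<double> xcol(nLocal);
    MPI_Allgather(xlocal.data(), n / (r * c), MPI_DOUBLE,
                  xcol.data(),   n / (r * c), MPI_DOUBLE, colComm);

    // 2) Local matrix-vector multiply: partial contribution to my block row of y.
    std::vector<double> ypartial(mLocal, 0.0);
    for (int j = 0; j < nLocal; ++j)
        for (int i = 0; i < mLocal; ++i)
            ypartial[i] += Alocal[i + j * mLocal] * xcol[j];

    // 3) Reduce-scatter y within rows: sum the c partial vectors and leave
    //    each process with its m/(r*c) entries of y.
    ylocal.resize(m / (r * c));
    MPI_Reduce_scatter_block(ypartial.data(), ylocal.data(),
                             m / (r * c), MPI_DOUBLE, MPI_SUM, rowComm);
}
```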
Some Terminology • Overhead • Speedup • Efficiency • Relative overhead
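In the usual definitions (a sketch; the slides may phrase these slightly differently), with T(1) the best sequential time and T(p) the parallel time on p nodes:

```latex
% Sketch of the standard definitions.
\text{Speedup: } S(p) = \frac{T(1)}{T(p)}, \qquad
\text{Efficiency: } E(p) = \frac{S(p)}{p} = \frac{T(1)}{p\,T(p)}, \qquad
\text{Overhead: } T_o(p) = p\,T(p) - T(1), \qquad
\text{Relative overhead: } R(p) = \frac{T_o(p)}{T(1)},
\quad\text{so that}\quad E(p) = \frac{1}{1 + R(p)}.
```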
Asymptotic Analysis: weakly scalable for practical purposes
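A sketch of the kind of argument meant here, for an n × n matrix-vector multiply on a √p × √p mesh, using the cost estimates above:

```latex
% Sketch only: order-of-magnitude analysis under the cost estimates above.
T(p) \;\approx\; \underbrace{\frac{2n^2}{p}\,\gamma}_{\text{local gemv}}
      \;+\; O(\log_2 p)\,\alpha
      \;+\; O\!\left(\frac{n}{\sqrt{p}}\right)(\beta + \gamma)
      \qquad (\text{allgather } x \text{ and reduce-scatter } y),
\qquad
R(p) \;=\; \frac{p\,T(p) - T(1)}{T(1)}
     \;\approx\; O\!\left(\frac{p \log_2 p}{n^2}\right)\frac{\alpha}{\gamma}
     \;+\; O\!\left(\frac{\sqrt{p}}{n}\right)\frac{\beta + \gamma}{\gamma}.
```

If the memory per node is held constant (n²/p fixed, so n grows like √p), the β term stays bounded and only the α term grows, and only like log₂ p, which is what is meant by "weakly scalable for practical purposes."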
Parallel Rank-1 Update • Allgather x within columns • Allgather y within rows • Local rank-1 update • Total:
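A schematic sketch under the same simplifying assumptions as the matrix-vector example above (block wrap, dimensions divisible by r·c, colComm/rowComm communicators):

```cpp
// A := A + x y^T on an r x c mesh: allgather x within columns, allgather y
// within rows, then a purely local rank-1 update -- no reduction is needed.
#include <mpi.h>
#include <vector>

void ParallelRank1Update(int m, int n, int r, int c,
                         std::vector<double>& Alocal,        // (m/r) x (n/c), column-major
                         const std::vector<double>& xlocal,  // m/(r*c) entries of x
                         const std::vector<double>& ylocal,  // n/(r*c) entries of y
                         MPI_Comm colComm, MPI_Comm rowComm)
{
    const int mLocal = m / r, nLocal = n / c;
    std::vector<double> xcol(mLocal), yrow(nLocal);

    // Allgather x within columns: the m/r entries matching my block row of A.
    MPI_Allgather(xlocal.data(), m / (r * c), MPI_DOUBLE,
                  xcol.data(),   m / (r * c), MPI_DOUBLE, colComm);
    // Allgather y within rows: the n/c entries matching my block column of A.
    MPI_Allgather(ylocal.data(), n / (r * c), MPI_DOUBLE,
                  yrow.data(),   n / (r * c), MPI_DOUBLE, rowComm);

    // Local rank-1 update.
    for (int j = 0; j < nLocal; ++j)
        for (int i = 0; i < mLocal; ++i)
            Alocal[i + j * mLocal] += xcol[i] * yrow[j];
}
```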
Weak Scalability • Cost of parallel rank-1 update: • Is the rank-1 update algorithm weakly scalable? • Hint: Cost of weakly scalable parallel matrix-vector multiply:
Of Vectors, Rows, and Columns • Often the vectors in a matvec or rank-1 update show up as rows or columns of a matrix