On Optimizing Collective Communication
UT/Texas Advanced Computing Center, UT/Computer Science
Avi Purkayastha, Ernie Chan, Marcel Heinrich, Robert van de Geijn
ScicomP 10, August 9-13, Austin, TX
Outline • Model of Parallel Computation • Collective Communications • Algorithms • Performance Results • Conclusions and Future work
Model of Parallel Computation • Target Architectures • distributed memory parallel architectures • Indexing • p nodes • indexed 0 … p – 1 • each node has one computational processor
• often logically viewed as a linear array: 0, 1, 2, …, p – 1
Model of Parallel Computation • Logically Fully Connected • a node can send directly to any other node • Communicating Between Nodes • a node can simultaneously receive and send • Network Conflicts • sending over a path between two nodes that is completely occupied
Model of Parallel Computation • Cost of Communication • sending a message of length n between any two nodes costs α + n β • α is the startup cost (latency) • β is the per-item transmission cost (bandwidth) • Cost of Computation • cost to perform one arithmetic operation is γ • reduction operations • sum • prod • min • max
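The cost model above can be sketched in a few lines of Python. This is an illustrative encoding, not code from the talk; the function names and parameter values are ours.

```python
# Sketch of the alpha-beta-gamma cost model:
# sending n items costs alpha + n*beta; each arithmetic op costs gamma.
def send_cost(n, alpha, beta):
    # startup (latency) term plus per-item (bandwidth) term
    return alpha + n * beta

def reduce_op_cost(n, gamma):
    # combining two length-n vectors element-wise (sum/prod/min/max)
    return n * gamma

print(send_cost(1000, 10.0, 0.01))  # 20.0: 10.0 latency + 10.0 bandwidth
```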
Outline • Model of Parallel Computation • Collective Communications • Algorithms • Performance Results • Conclusions and Future work
Collective Communications • Broadcast • Reduce(-to-one) • Scatter • Gather • Allgather • Reduce-scatter • Allreduce
Lower Bounds (Latency) • Broadcast: ⌈log₂ p⌉ α • Reduce(-to-one): ⌈log₂ p⌉ α • Scatter/Gather: ⌈log₂ p⌉ α • Allgather: ⌈log₂ p⌉ α • Reduce-scatter: ⌈log₂ p⌉ α • Allreduce: ⌈log₂ p⌉ α
Lower Bounds (Bandwidth) • Broadcast: n β • Reduce(-to-one): n β • Scatter/Gather: ((p – 1)/p) n β • Allgather: ((p – 1)/p) n β • Reduce-scatter: ((p – 1)/p) n β • Allreduce: 2 ((p – 1)/p) n β
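These standard lower bounds can be encoded directly in the cost model. A small sketch (our own encoding, with illustrative parameters; n is the total vector length, p the number of nodes):

```python
import math

def latency_lb(p, alpha):
    # every collective needs at least ceil(log2 p) message steps
    return math.ceil(math.log2(p)) * alpha

def bcast_bandwidth_lb(n, beta):
    # the entire vector must leave the root at least once
    return n * beta

def scatter_bandwidth_lb(p, n, beta):
    # the root keeps its own 1/p of the data, so (p-1)/p of it must move
    return (p - 1) / p * n * beta

print(latency_lb(8, 10.0))  # 30.0: three steps at alpha = 10
```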
Outline • Model of Parallel Computation • Collective Communications • Algorithms • Performance Results • Conclusions and Future work
Motivating Example • We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.
A building block approach to library implementation • Short-vector case • Long-vector case • Hybrid algorithms
Short-vector case • Primary concern: • algorithms must have low latency cost • Secondary concerns: • algorithms must work for arbitrary number of nodes • in particular, not just for power-of-two numbers of nodes • algorithms should avoid network conflicts • not absolutely necessary, but nice if possible
Minimum-Spanning Tree based algorithms • We will show how the following building blocks: • broadcast/reduce • scatter/gather • can be implemented using minimum spanning trees embedded in the logical linear array while attaining • minimal latency • implementation for arbitrary numbers of nodes • no network conflicts
General principles • message starts on one processor
General principles • divide logical linear array in half
General principles • send message to the half of the network that does not contain the current node (root) that holds the message
General principles • continue recursively in each of the two halves
General principles • The demonstrated technique directly applies to • broadcast • scatter • The technique can be applied in reverse to • reduce • gather
General principles • This technique can be used to implement the following building blocks: • broadcast/reduce • scatter/gather • Using a minimum spanning tree embedded in the logical linear array while attaining • minimal latency • implementation for arbitrary numbers of nodes • no network conflicts? • Yes, on linear arrays
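The recursive-halving principle above can be sketched in Python. This is a simulation of the communication pattern, not real messaging: `sends` records each (source, destination) pair, and node indices in the logical linear array stand in for processors.

```python
def mst_bcast(left, right, root, sends):
    # Broadcast over nodes left..right of the logical linear array;
    # `root` currently holds the message. Split the array in half,
    # send across the split, then recurse in each half.
    if left == right:
        return
    mid = (left + right) // 2
    if root <= mid:
        dest = right                      # a node in the other half
        sends.append((root, dest))
        mst_bcast(left, mid, root, sends)
        mst_bcast(mid + 1, right, dest, sends)
    else:
        dest = left
        sends.append((root, dest))
        mst_bcast(left, mid, dest, sends)
        mst_bcast(mid + 1, right, root, sends)

sends = []
mst_bcast(0, 7, 0, sends)   # 8 nodes, root 0
print(len(sends))           # 7: every node except the root receives once
```

Reversing each send (and combining instead of copying) yields the reduce and gather variants, as the slides note.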
Reduce-scatter (short vector) • implemented as a Reduce followed by a Scatter
[Figure: Reduce, before and after — each node starts with a full length-n vector; after the reduce, the root holds the element-wise sum of all p vectors]
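The reduce is the broadcast technique run in reverse: partial sums flow up the same minimum spanning tree. A minimal simulation (our own sketch; lists of numbers stand in for the nodes' vectors):

```python
def mst_reduce(vectors, left, right, root):
    # Reduce-to-one over nodes left..right of the logical linear
    # array: returns the element-wise sum, accumulated at `root`
    # by reversing the recursive-halving broadcast.
    if left == right:
        return vectors[left]
    mid = (left + right) // 2
    if root <= mid:
        partner = right                  # partner in the other half
        a = mst_reduce(vectors, left, mid, root)
        b = mst_reduce(vectors, mid + 1, right, partner)
    else:
        partner = left
        a = mst_reduce(vectors, mid + 1, right, root)
        b = mst_reduce(vectors, left, mid, partner)
    return [x + y for x, y in zip(a, b)]  # partner sends to root; root adds

vecs = [[i, 10 * i] for i in range(8)]    # one length-2 vector per node
print(mst_reduce(vecs, 0, 7, 0))          # [28, 280]
```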
Cost of Minimum-Spanning Tree Reduce • number of steps: ⌈log₂ p⌉ • cost per step: α + n β + n γ • total: ⌈log₂ p⌉ (α + n β + n γ) • Notice: attains lower bound for latency component
[Figure: Scatter, before and after — before, the root holds all p pieces of the vector; after, node i holds piece i]
Cost of Minimum-Spanning Tree Scatter • Assumption: power-of-two number of nodes • total: log₂(p) α + ((p – 1)/p) n β • Notice: attains lower bound for latency and bandwidth components
Cost of Reduce/Scatter Reduce-scatter • Assumption: power-of-two number of nodes • reduce: log₂(p) (α + n β + n γ) • scatter: log₂(p) α + ((p – 1)/p) n β • total: 2 log₂(p) α + (log₂(p) + (p – 1)/p) n β + log₂(p) n γ • Notice: does not attain lower bound for latency or bandwidth components
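The gap between the composed algorithm and the lower bound can be checked numerically. A hedged sketch in the cost model, for power-of-two p; the parameter values are illustrative, not measurements from the talk:

```python
import math

def mst_reduce_then_scatter(p, n, alpha, beta, gamma):
    # MST reduce followed by MST scatter (the short-vector reduce-scatter)
    lg = math.log2(p)
    reduce_cost = lg * (alpha + n * beta + n * gamma)
    scatter_cost = lg * alpha + (p - 1) / p * n * beta
    return reduce_cost + scatter_cost

def reduce_scatter_lb(p, n, alpha, beta):
    # latency lower bound plus bandwidth lower bound
    return math.ceil(math.log2(p)) * alpha + (p - 1) / p * n * beta

p, n = 8, 10_000
composed = mst_reduce_then_scatter(p, n, 10.0, 0.01, 0.001)
bound = reduce_scatter_lb(p, n, 10.0, 0.01)
print(composed > bound)  # True: the composition exceeds both bound terms
```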
Recap • Reduce • Reduce-scatter • Scatter • Allreduce • Gather • Allgather • Broadcast
A building block approach to library implementation • Short-vector case • Long-vector case • Hybrid algorithms
Long-vector case • Primary concern: • algorithms must have low cost due to vector length • algorithms must avoid network conflicts • Secondary concerns: • algorithms must work for arbitrary number of nodes • in particular, not just for power-of-two numbers of nodes
Long-vector building blocks • We will show how the following building blocks: • allgather/reduce-scatter • can be implemented using “bucket” algorithms while attaining • minimal cost due to length of vectors • implementation for arbitrary numbers of nodes • no network conflicts
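A bucket algorithm views the nodes as a ring: in each of p – 1 steps, every node forwards to its right neighbor the piece it received in the previous step, so no link ever carries more than one message. A minimal simulation of the allgather variant (our own sketch; dictionaries stand in for each node's receive buffer):

```python
def ring_allgather(pieces):
    # Bucket (ring) allgather: node i starts with pieces[i]; after
    # p-1 steps every node holds all p pieces.
    p = len(pieces)
    bufs = [{i: pieces[i]} for i in range(p)]  # node i's buffer
    cur = list(range(p))    # index of the piece each node sends next
    for _ in range(p - 1):
        # all nodes send simultaneously: node i -> node (i+1) mod p
        sent = [(cur[i], pieces[cur[i]]) for i in range(p)]
        for i in range(p):
            src = (i - 1) % p            # receive from the left neighbor
            idx, val = sent[src]
            bufs[i][idx] = val
            cur[i] = idx                 # forward this piece next step
    return bufs

bufs = ring_allgather(["a", "b", "c", "d"])
print(all(len(b) == 4 for b in bufs))    # True: every node has all pieces
```

Each step moves n/p items per node, so the total is (p – 1) α + ((p – 1)/p) n β: optimal in bandwidth, though the latency term grows with p, which is why this variant targets long vectors.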