On Optimizing Collective Communication
UT/Texas Advanced Computing Center, UT/Computer Science
Avi Purkayastha, Ernie Chan, Marcel Heinrich, Robert van de Geijn
ScicomP 10, August 9-13, Austin, TX
Outline • Model of Parallel Computation • Collective Communications • Algorithms • Performance Results • Conclusions and Future work
Model of Parallel Computation • Target Architectures • distributed memory parallel architectures • Indexing • p nodes • indexed 0 … p – 1 • each node has one computational processor
• often logically viewed as a linear array: 0, 1, 2, …, p – 1
Model of Parallel Computation • Logically Fully Connected • a node can send directly to any other node • Communicating Between Nodes • a node can simultaneously receive and send • Network Conflicts • sending over a path between two nodes that is completely occupied
Model of Parallel Computation • Cost of Communication • sending a message of length n between any two nodes costs α + n β • α is the startup cost (latency) • β is the per-item transmission cost (bandwidth) • Cost of Computation • cost to perform one arithmetic operation is γ • reduction operations • sum • prod • min • max
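The cost model above can be sketched in a few lines of Python. This is an illustrative encoding, not code from the talk; the function names and parameter values are ours.

```python
# Sketch of the alpha-beta-gamma cost model:
# sending n items costs alpha + n*beta; each arithmetic op costs gamma.
def send_cost(n, alpha, beta):
    # startup (latency) term plus per-item (bandwidth) term
    return alpha + n * beta

def reduce_op_cost(n, gamma):
    # combining two length-n vectors element-wise (sum/prod/min/max)
    return n * gamma

print(send_cost(1000, 10.0, 0.01))  # 20.0: 10.0 latency + 10.0 bandwidth
```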
Outline • Model of Parallel Computation • Collective Communications • Algorithms • Performance Results • Conclusions and Future work
Collective Communications • Broadcast • Reduce(-to-one) • Scatter • Gather • Allgather • Reduce-scatter • Allreduce
Lower Bounds (Latency) • Broadcast: ⌈log₂ p⌉ α • Reduce(-to-one): ⌈log₂ p⌉ α • Scatter/Gather: ⌈log₂ p⌉ α • Allgather: ⌈log₂ p⌉ α • Reduce-scatter: ⌈log₂ p⌉ α • Allreduce: ⌈log₂ p⌉ α
Lower Bounds (Bandwidth) • Broadcast: n β • Reduce(-to-one): n β • Scatter/Gather: ((p – 1)/p) n β • Allgather: ((p – 1)/p) n β • Reduce-scatter: ((p – 1)/p) n β • Allreduce: 2 ((p – 1)/p) n β
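These standard lower bounds can be encoded directly in the cost model. A small sketch (our own encoding, with illustrative parameters; n is the total vector length, p the number of nodes):

```python
import math

def latency_lb(p, alpha):
    # every collective needs at least ceil(log2 p) message steps
    return math.ceil(math.log2(p)) * alpha

def bcast_bandwidth_lb(n, beta):
    # the entire vector must leave the root at least once
    return n * beta

def scatter_bandwidth_lb(p, n, beta):
    # the root keeps its own 1/p of the data, so (p-1)/p of it must move
    return (p - 1) / p * n * beta

print(latency_lb(8, 10.0))  # 30.0: three steps at alpha = 10
```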
Outline • Model of Parallel Computation • Collective Communications • Algorithms • Performance Results • Conclusions and Future work
Motivating Example • We will illustrate the different types of algorithms and implementations using the Reduce-scatter operation.
A building block approach to library implementation • Short-vector case • Long-vector case • Hybrid algorithms
Short-vector case • Primary concern: • algorithms must have low latency cost • Secondary concerns: • algorithms must work for arbitrary number of nodes • in particular, not just for power-of-two numbers of nodes • algorithms should avoid network conflicts • not absolutely necessary, but nice if possible
Minimum-Spanning Tree based algorithms • We will show how the following building blocks: • broadcast/reduce • scatter/gather • can be implemented using minimum spanning trees embedded in the logical linear array while attaining • minimal latency • implementation for arbitrary numbers of nodes • no network conflicts
General principles • message starts on one processor
General principles • divide logical linear array in half
General principles • send message to the half of the network that does not contain the current node (root) that holds the message
General principles • continue recursively in each of the two halves
General principles • The demonstrated technique directly applies to • broadcast • scatter • The technique can be applied in reverse to • reduce • gather
General principles • This technique can be used to implement the following building blocks: • broadcast/reduce • scatter/gather • Using a minimum spanning tree embedded in the logical linear array while attaining • minimal latency • implementation for arbitrary numbers of nodes • no network conflicts? • Yes, on linear arrays
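The recursive-halving principle above can be sketched in Python. This is a simulation of the communication pattern, not real messaging: `sends` records each (source, destination) pair, and node indices in the logical linear array stand in for processors.

```python
def mst_bcast(left, right, root, sends):
    # Broadcast over nodes left..right of the logical linear array;
    # `root` currently holds the message. Split the array in half,
    # send across the split, then recurse in each half.
    if left == right:
        return
    mid = (left + right) // 2
    if root <= mid:
        dest = right                      # a node in the other half
        sends.append((root, dest))
        mst_bcast(left, mid, root, sends)
        mst_bcast(mid + 1, right, dest, sends)
    else:
        dest = left
        sends.append((root, dest))
        mst_bcast(left, mid, dest, sends)
        mst_bcast(mid + 1, right, root, sends)

sends = []
mst_bcast(0, 7, 0, sends)   # 8 nodes, root 0
print(len(sends))           # 7: every node except the root receives once
```

Reversing each send (and combining instead of copying) yields the reduce and gather variants, as the slides note.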
Reduce-scatter (short vector) • implemented as a Reduce followed by a Scatter
[Figure: Reduce, before and after — each node starts with a full length-n vector; after the reduce, the root holds the element-wise sum of all p vectors]
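The reduce is the broadcast technique run in reverse: partial sums flow up the same minimum spanning tree. A minimal simulation (our own sketch; lists of numbers stand in for the nodes' vectors):

```python
def mst_reduce(vectors, left, right, root):
    # Reduce-to-one over nodes left..right of the logical linear
    # array: returns the element-wise sum, accumulated at `root`
    # by reversing the recursive-halving broadcast.
    if left == right:
        return vectors[left]
    mid = (left + right) // 2
    if root <= mid:
        partner = right                  # partner in the other half
        a = mst_reduce(vectors, left, mid, root)
        b = mst_reduce(vectors, mid + 1, right, partner)
    else:
        partner = left
        a = mst_reduce(vectors, mid + 1, right, root)
        b = mst_reduce(vectors, left, mid, partner)
    return [x + y for x, y in zip(a, b)]  # partner sends to root; root adds

vecs = [[i, 10 * i] for i in range(8)]    # one length-2 vector per node
print(mst_reduce(vecs, 0, 7, 0))          # [28, 280]
```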
Cost of Minimum-Spanning Tree Reduce • number of steps: ⌈log₂ p⌉ • cost per step: α + n β + n γ • total: ⌈log₂ p⌉ (α + n β + n γ) • Notice: attains lower bound for latency component
[Figure: Scatter, before and after — before, the root holds all p pieces of the vector; after, node i holds piece i]
Cost of Minimum-Spanning Tree Scatter • Assumption: power-of-two number of nodes • total: log₂(p) α + ((p – 1)/p) n β • Notice: attains lower bound for latency and bandwidth components
Cost of Reduce/Scatter Reduce-scatter • Assumption: power-of-two number of nodes • reduce: log₂(p) (α + n β + n γ) • scatter: log₂(p) α + ((p – 1)/p) n β • total: 2 log₂(p) α + (log₂(p) + (p – 1)/p) n β + log₂(p) n γ • Notice: does not attain lower bound for latency or bandwidth components
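The gap between the composed algorithm and the lower bound can be checked numerically. A hedged sketch in the cost model, for power-of-two p; the parameter values are illustrative, not measurements from the talk:

```python
import math

def mst_reduce_then_scatter(p, n, alpha, beta, gamma):
    # MST reduce followed by MST scatter (the short-vector reduce-scatter)
    lg = math.log2(p)
    reduce_cost = lg * (alpha + n * beta + n * gamma)
    scatter_cost = lg * alpha + (p - 1) / p * n * beta
    return reduce_cost + scatter_cost

def reduce_scatter_lb(p, n, alpha, beta):
    # latency lower bound plus bandwidth lower bound
    return math.ceil(math.log2(p)) * alpha + (p - 1) / p * n * beta

p, n = 8, 10_000
composed = mst_reduce_then_scatter(p, n, 10.0, 0.01, 0.001)
bound = reduce_scatter_lb(p, n, 10.0, 0.01)
print(composed > bound)  # True: the composition exceeds both bound terms
```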
Recap • Reduce • Reduce-scatter • Scatter • Allreduce • Gather • Allgather • Broadcast
A building block approach to library implementation • Short-vector case • Long-vector case • Hybrid algorithms
Long-vector case • Primary concern: • algorithms must have low cost due to vector length • algorithms must avoid network conflicts • Secondary concerns: • algorithms must work for arbitrary number of nodes • in particular, not just for power-of-two numbers of nodes
Long-vector building blocks • We will show how the following building blocks: • allgather/reduce-scatter • can be implemented using “bucket” algorithms while attaining • minimal cost due to length of vectors • implementation for arbitrary numbers of nodes • no network conflicts
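A bucket algorithm views the nodes as a ring: in each of p – 1 steps, every node forwards to its right neighbor the piece it received in the previous step, so no link ever carries more than one message. A minimal simulation of the allgather variant (our own sketch; dictionaries stand in for each node's receive buffer):

```python
def ring_allgather(pieces):
    # Bucket (ring) allgather: node i starts with pieces[i]; after
    # p-1 steps every node holds all p pieces.
    p = len(pieces)
    bufs = [{i: pieces[i]} for i in range(p)]  # node i's buffer
    cur = list(range(p))    # index of the piece each node sends next
    for _ in range(p - 1):
        # all nodes send simultaneously: node i -> node (i+1) mod p
        sent = [(cur[i], pieces[cur[i]]) for i in range(p)]
        for i in range(p):
            src = (i - 1) % p            # receive from the left neighbor
            idx, val = sent[src]
            bufs[i][idx] = val
            cur[i] = idx                 # forward this piece next step
    return bufs

bufs = ring_allgather(["a", "b", "c", "d"])
print(all(len(b) == 4 for b in bufs))    # True: every node has all pieces
```

Each step moves n/p items per node, so the total is (p – 1) α + ((p – 1)/p) n β: optimal in bandwidth, though the latency term grows with p, which is why this variant targets long vectors.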