Optimizing Collective Communication for Multicore By Rajesh Nishtala
What Are Collectives
• An operation called by all threads together to perform globally coordinated communication
• May involve a modest amount of computation, e.g. to combine values as they are communicated
• Can be extended to teams (or communicators), in which case they operate on a predefined subset of the threads
• Focus here is on collectives in Single Program Multiple Data (SPMD) programming models
Some Collectives
• Barrier (MPI_Barrier())
  • A thread cannot exit a call to a barrier until all other threads have called the barrier
• Broadcast (MPI_Bcast())
  • A root thread sends a copy of an array to all the other threads
• Reduce-To-All (MPI_Allreduce())
  • Each thread contributes an operand to an arithmetic operation across all the threads
  • The result is then broadcast to all the threads
• Exchange (MPI_Alltoall())
  • For all i, j < N, thread i copies the jth piece of its input array into the ith slot of the output array on thread j
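As a concrete illustration of these calls (not part of the original slides), a minimal MPI sketch; the counts, datatypes, and the sum operator are arbitrary choices for the example:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Barrier: no rank leaves until every rank has entered the call */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Broadcast: rank 0 sends one double to all other ranks */
    double value = (rank == 0) ? 3.14 : 0.0;
    MPI_Bcast(&value, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Reduce-To-All: every rank contributes an operand; the sum comes back to all */
    double contrib = (double)rank, sum = 0.0;
    MPI_Allreduce(&contrib, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Exchange: piece j of rank i's input lands in slot i of rank j's output */
    double *in  = malloc(nprocs * sizeof(double));
    double *out = malloc(nprocs * sizeof(double));
    for (int j = 0; j < nprocs; j++) in[j] = rank * nprocs + j;
    MPI_Alltoall(in, 1, MPI_DOUBLE, out, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    printf("rank %d: value=%g sum=%g\n", rank, value, sum);
    free(in); free(out);
    MPI_Finalize();
    return 0;
}
```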
Why Are They Important?
• Basic communication building blocks
• Found in many parallel programming languages and libraries
• Abstraction: if an application is written with collectives, it passes the responsibility of tuning to the runtime
[Chart: percentage of runtime spent in collectives]
Experimental Setup
• Platforms
  • Sun Niagara2
    • 1 socket of 8 multi-threaded cores
    • Each core supports 8 hardware thread contexts, for 64 threads total
  • Intel Clovertown
    • 2 "traditional" quad-core sockets
  • BlueGene/P
    • 1 quad-core socket
• MPI for inter-process communication
  • Shared-memory MPICH2 1.0.7
Threads v. Processes (Niagara2)
• Barrier performance
  • Perform a barrier across all 64 threads
  • Threads arranged into processes in different ways: one extreme has one thread per process, the other has 1 process with 64 threads
  • MPI_Barrier() called between processes; a flat barrier is used amongst the threads within a process (see the sketch below)
• 2 orders of magnitude difference in performance!
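A minimal sketch of the hybrid scheme being compared, assuming MPI is initialized with MPI_THREAD_FUNNELED and a POSIX barrier sized to the number of threads per process; this illustrates the idea, it is not the slides' implementation:

```c
#include <mpi.h>
#include <pthread.h>

/* One flat barrier shared by the threads of this process, initialized elsewhere
 * with pthread_barrier_init(&intra, NULL, threads_per_process). */
static pthread_barrier_t intra;

void hybrid_barrier(int my_thread_id) {
    /* Phase 1: wait until every thread in this process has arrived */
    pthread_barrier_wait(&intra);

    /* Phase 2: one representative thread per process synchronizes across processes */
    if (my_thread_id == 0)
        MPI_Barrier(MPI_COMM_WORLD);

    /* Phase 3: release the local threads once the inter-process barrier completes */
    pthread_barrier_wait(&intra);
}
```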
Threads v. Processes (Niagara2), cont.
• Other collectives see similar scaling issues when using processes
• MPI collectives are called between processes, while shared memory is leveraged within a process
Intel Clovertown and BlueGene/P
• Fewer threads per node
• Differences are not as drastic, but they are non-trivial
[Charts: Intel Clovertown, BlueGene/P]
Optimizing Barrier w/ Trees
• Leveraging shared memory is a critical optimization
• Flat trees don't scale; use trees to aid parallelism
• Requires two passes of a tree (a sketch follows below)
  • First (UP) pass indicates that all threads have arrived
    • Signal the parent when all of your children have arrived
    • Once the root gets the signal from all of its children, all threads have reported in
  • Second (DOWN) pass signals that all threads may exit the barrier
    • Wait for the parent to send me a clear signal
    • Propagate the clear signal down to my children
[Diagram: radix-2 tree over threads 0-15]
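A minimal sketch of such a two-pass tree barrier, assuming the parent/children arrays for the tree have been computed (e.g., as in the k-nomial construction on the next slide) and using a sense-reversal trick in place of the two alternating sets of flags mentioned in the backup slides; this is illustrative, not the tuned implementation:

```c
#include <stdatomic.h>

#define MAX_THREADS  64
#define MAX_CHILDREN 8

/* One flag pair per thread, padded so each thread's flags sit on their own cache line */
typedef struct {
    atomic_int arrived;                      /* written only by this thread (UP pass)  */
    atomic_int release;                      /* written only by its parent (DOWN pass) */
    char pad[64 - 2 * sizeof(atomic_int)];
} flags_t;

static flags_t flags[MAX_THREADS];
static int children[MAX_THREADS][MAX_CHILDREN];
static int num_children[MAX_THREADS];        /* tree shape filled in elsewhere */

void tree_barrier(int me) {
    static _Thread_local int sense = 1;      /* flips each barrier; replaces explicit flag resets */

    /* UP pass: wait for all my children, then tell my parent I have arrived */
    for (int c = 0; c < num_children[me]; c++)
        while (atomic_load(&flags[children[me][c]].arrived) != sense) /* spin */;
    atomic_store(&flags[me].arrived, sense);

    /* DOWN pass: wait for my parent's clear signal, then pass it on to my children */
    if (me != 0)
        while (atomic_load(&flags[me].release) != sense) /* spin */;
    for (int c = 0; c < num_children[me]; c++)
        atomic_store(&flags[children[me][c]].release, sense);

    sense = !sense;
}
```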
Example Tree Topologies
• Radix-2 k-nomial tree (binomial)
• Radix-4 k-nomial tree (quadnomial)
• Radix-8 k-nomial tree (octnomial)
[Diagrams: each topology drawn over 16 threads, rooted at thread 0]
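One common way to enumerate parents and children in a radix-r k-nomial tree rooted at thread 0 is sketched below; the labeling convention (clear the least significant nonzero base-r digit to find the parent) is an assumption that happens to reproduce the binomial tree pictured on the previous slide, but the slides may construct their trees differently:

```c
/* Children of thread `me` in a radix-`radix` k-nomial tree over `nthreads` threads.
 * Returns the number of children written into children[]. */
int knomial_children(int me, int nthreads, int radix, int *children) {
    int count = 0;
    for (long stride = 1; stride < nthreads; stride *= radix) {
        if (me % (stride * radix) != 0)
            break;                              /* higher digits belong to an ancestor */
        for (int d = 1; d < radix; d++) {
            long child = me + d * stride;
            if (child < nthreads)
                children[count++] = (int)child;
        }
    }
    return count;
}

/* Parent of thread `me`: zero out its least significant nonzero base-`radix` digit. */
int knomial_parent(int me, int radix) {
    if (me == 0) return -1;                     /* the root has no parent */
    long stride = 1;
    while (me % (stride * radix) == 0)
        stride *= radix;
    return me - (int)((me / stride) % radix) * (int)stride;
}
```

With radix 2 and 16 threads this gives thread 0 the children {1, 2, 4, 8} and thread 8 the children {9, 10, 12}, matching the binomial tree in the barrier diagram.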
Barrier Performance Results
• Time many back-to-back barriers
• Flat tree is just one level, with all threads reporting to thread 0
  • Leverages shared memory, but non-scalable
• Architecture-independent tree (radix = 2)
  • Pick a generic "good" radix that is suitable for many platforms
  • Mismatched to the architecture
• Architecture-dependent tree
  • Search over all radices to pick the tree that best matches the architecture
Broadcast Performance Results
• Time a latency-sensitive Broadcast (8 bytes)
• Time a Broadcast followed by a Barrier and subtract the time for the Barrier (see the measurement sketch below)
  • Yields an approximation of how long it takes for the last thread to get the data
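A minimal sketch of that measurement with plain MPI calls; the iteration count is an arbitrary choice, and a tuned harness would also discard warm-up iterations:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { ITERS = 10000, NBYTES = 8 };
    char buf[NBYTES];
    for (int i = 0; i < NBYTES; i++) buf[i] = (char)i;

    /* Broadcast followed by barrier, averaged over many back-to-back iterations */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        MPI_Bcast(buf, NBYTES, MPI_BYTE, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    double bcast_plus_barrier = (MPI_Wtime() - t0) / ITERS;

    /* Barrier alone, timed the same way */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_only = (MPI_Wtime() - t0) / ITERS;

    if (rank == 0)
        printf("approximate broadcast completion latency: %g us\n",
               (bcast_plus_barrier - barrier_only) * 1e6);
    MPI_Finalize();
    return 0;
}
```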
Reduce-To-All Performance Results
• 4 kBytes (512 doubles) Reduce-To-All
• In addition to the data movement, we also want to parallelize the computation (see the sketch below)
  • In the flat approach, the computation gets serialized at the root
  • Tree-based approaches allow us to parallelize the computation amongst all the floating-point units
  • On Niagara2, 8 threads share one FPU, thus radices 2, 4, and 8 serialize the computation in about the same way
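A minimal sketch, not the tuned code, of how a combining tree parallelizes the arithmetic: at each level, half of the remaining threads add in a partner's partial vector, so the floating-point work is spread across cores instead of being serialized at the root. A POSIX barrier separates levels here for simplicity; the tuned version would use the point-to-point flag signaling described in the backup slides.

```c
#include <pthread.h>

#define NTHREADS 64
#define NDOUBLES 512                         /* 4 kB of doubles, as in the experiment */

static double partial[NTHREADS][NDOUBLES];   /* each thread's local contribution */
static pthread_barrier_t level_barrier;      /* initialized elsewhere for NTHREADS threads */

void tree_allreduce(int me) {
    /* UP pass: pairwise combining up a radix-2 tree; thread 0 ends with the full sum */
    for (int stride = 1; stride < NTHREADS; stride *= 2) {
        pthread_barrier_wait(&level_barrier);
        if (me % (2 * stride) == 0 && me + stride < NTHREADS)
            for (int j = 0; j < NDOUBLES; j++)
                partial[me][j] += partial[me + stride][j];
    }
    /* DOWN pass: every other thread copies the finished result from thread 0 */
    pthread_barrier_wait(&level_barrier);
    if (me != 0)
        for (int j = 0; j < NDOUBLES; j++)
            partial[me][j] = partial[0][j];
}
```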
Optimization Summary
• Relying on flat trees is not enough for most collectives
• Architecture-dependent tuning is a further and important optimization
Extending the Results to a Cluster
• Use one rack of BlueGene/P (1024 nodes, or 4096 cores)
• Reduce-To-All performed by having one representative thread per process make the call to the inter-node all-reduce (see the sketch below)
  • Reduces the number of messages in the network
• Vary the number of threads per process, but use all cores
• Relying purely on shared memory doesn't always yield the best performance
  • The number of active cores working on the computation drops
  • Can optimize so that the computation is partitioned across cores
    • Not suitable for a direct call to MPI_Allreduce()
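A minimal sketch of the hierarchical scheme under assumed helper names: shm_reduce_to_thread0() and shm_broadcast_from_thread0() are hypothetical stand-ins for the shared-memory tree collectives sketched earlier, and the further optimization of partitioning the inter-node reduction across cores is not shown.

```c
#include <mpi.h>

/* Hypothetical shared-memory helpers (tree-based, intra-node) */
extern void shm_reduce_to_thread0(double *buf, int count);
extern void shm_broadcast_from_thread0(double *buf, int count);

void hierarchical_allreduce(double *buf, int count, int my_thread_id,
                            MPI_Comm internode_comm /* one MPI rank per node */) {
    /* 1. Intra-node: combine all local threads' contributions into thread 0's buffer */
    shm_reduce_to_thread0(buf, count);

    /* 2. Inter-node: only the representative thread talks to the network,
          which cuts the number of messages injected into the network */
    if (my_thread_id == 0)
        MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, internode_comm);

    /* 3. Intra-node: hand the finished result back to every local thread */
    shm_broadcast_from_thread0(buf, count);
}
```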
Potential Synchronization Problem
1. Broadcast variable x from the root
2. Have proc 1 set a new value for x on proc 4

  broadcast x=1 from proc 0
  if (myid == 1) {
    put x=5 to proc 4
  } else {
    /* do nothing */
  }

• Proc 1 thinks the collective is done and observes a globally incomplete collective
• The put of x=5 by proc 1 has been lost (the still-in-flight broadcast later delivers x=1 to proc 4, overwriting it)
[Animation: per-process values of x as the broadcast and the put race]
Strict v. Loose Synchronization
• A fix to the problem
  • Add a barrier before/after the collective (see the sketch below)
  • Enforces a global ordering of the operations
• Is there a problem?
  • We want to decouple synchronization from data movement
  • Specify the synchronization requirements
    • Potential to aggregate synchronization
    • Done by the user or a smart compiler
• How can we realize these gains in applications?
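A minimal MPI rendering of the strict fix; the one-sided window exposing x at displacement 0 on every rank is an assumption made for this example, since the slides' scenario is written in a PGAS-style put notation rather than MPI:

```c
#include <mpi.h>

/* Assumes `win` exposes the integer x at displacement 0 on every rank. */
void strict_broadcast_then_put(int *x, MPI_Win win, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) *x = 1;
    MPI_Bcast(x, 1, MPI_INT, 0, comm);

    /* Strict synchronization: no rank touches x remotely until every rank has
       finished the broadcast. Without this barrier, proc 1's put below could be
       overwritten (and lost) when the broadcast later delivers x=1 to proc 4. */
    MPI_Barrier(comm);

    if (rank == 1) {
        int five = 5;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, /*target*/ 4, 0, win);
        MPI_Put(&five, 1, MPI_INT, /*target*/ 4, /*disp*/ 0, 1, MPI_INT, win);
        MPI_Win_unlock(4, win);
    }
}
```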
Conclusions
• Moving from processes to threads is a crucial optimization for single-node collective communication
• Can use tree-based collectives to realize better performance, even for collectives on one node
  • Picking the tree that best matches the architecture yields the best performance
• Multicore adds to the (auto)tuning space for collective communication
• Shared-memory semantics allow us to create new, loosely synchronized collectives
Questions?
Backup Slides
Threads and Processes
• Threads
  • A sequence of instructions and an execution stack
  • Communication between threads happens through a common, shared address space
    • No OS/network involvement needed
  • Reasoning about inter-thread communication can be tricky
• Processes
  • A set of threads and an associated memory space
  • All threads within a process share its address space
  • Communication between processes must be managed through the OS
    • Inter-process communication is explicit but may be slow
  • More expensive to switch between processes
Experimental Platforms
[Photos: Clovertown, Niagara2, BG/P]
Specs
[Table: platform specifications]
Details of Signaling
• For optimum performance, have many readers and one writer
  • Each thread sets a flag (a single word) that others will read
  • Every reader gets a copy of the cache line and spins on that copy
  • When the writer changes the value of the flag, the cache-coherency system handles broadcasting/updating the change
  • Avoid atomic primitives
• On the way up the tree, a child sets a flag indicating that its subtree has arrived
  • The parent spins on that flag for each child
• On the way down, each child spins on its parent's flag
  • When it is set, it indicates that the parent wants to broadcast the clear signal down
• Flags must be on different cache lines to avoid false sharing
• Need to switch back and forth between two sets of flags (see the layout sketch below)
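A minimal sketch of one possible flag layout matching this description, assuming 64-byte cache lines; the exact structure used in the tuned library is not shown in the slides:

```c
#include <stdatomic.h>

#define CACHE_LINE  64
#define MAX_THREADS 64

/* One writer per flag; each flag lives on its own cache line to avoid false
 * sharing, and flags come in two sets (indexed by a phase bit) so consecutive
 * collectives never reuse a flag before everyone is finished with it. */
typedef struct {
    _Alignas(CACHE_LINE) atomic_int arrived[2];  /* [phase], written only by this thread */
    _Alignas(CACHE_LINE) atomic_int release[2];  /* [phase], written only by its parent  */
} signal_flags_t;

static signal_flags_t flags[MAX_THREADS];

/* Readers spin on their locally cached copy of the line; the coherence protocol
 * propagates the writer's single store when the value finally changes. */
static inline void wait_until(atomic_int *flag, int expected) {
    while (atomic_load_explicit(flag, memory_order_acquire) != expected)
        /* spin */ ;
}
```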