Optimizing Collective Communication for Multicore By Rajesh Nishtala
What Are Collectives
• An operation called by all threads together to perform globally coordinated communication
• May involve a modest amount of computation, e.g. to combine values as they are communicated
• Can be extended to teams (or communicators), in which case they operate on a predefined subset of the threads
• Focus here is on collectives in Single Program Multiple Data (SPMD) programming models
Some Collectives
• Barrier (MPI_Barrier())
  • A thread cannot exit a call to a barrier until all other threads have called the barrier
• Broadcast (MPI_Bcast())
  • A root thread sends a copy of an array to all the other threads
• Reduce-To-All (MPI_Allreduce())
  • Each thread contributes an operand to an arithmetic operation across all the threads
  • The result is then broadcast to all the threads
• Exchange (MPI_Alltoall())
  • For all i, j < N, thread i copies the jth piece of its input array into the ith slot of the output array on thread j
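As a concrete illustration of these calls (not part of the original slides), a minimal MPI sketch; the counts, datatypes, and the sum operator are arbitrary choices for the example:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Barrier: no rank leaves until every rank has entered the call */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Broadcast: rank 0 sends one double to all other ranks */
    double value = (rank == 0) ? 3.14 : 0.0;
    MPI_Bcast(&value, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Reduce-To-All: every rank contributes an operand; the sum comes back to all */
    double contrib = (double)rank, sum = 0.0;
    MPI_Allreduce(&contrib, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Exchange: piece j of rank i's input lands in slot i of rank j's output */
    double *in  = malloc(nprocs * sizeof(double));
    double *out = malloc(nprocs * sizeof(double));
    for (int j = 0; j < nprocs; j++) in[j] = rank * nprocs + j;
    MPI_Alltoall(in, 1, MPI_DOUBLE, out, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    printf("rank %d: value=%g sum=%g\n", rank, value, sum);
    free(in); free(out);
    MPI_Finalize();
    return 0;
}
```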
Why Are They Important?
• Basic communication building blocks
• Found in many parallel programming languages and libraries
• Abstraction: if an application is written with collectives, it passes the responsibility of tuning to the runtime
[Chart: percentage of runtime spent in collectives]
Experimental Setup
• Platforms
  • Sun Niagara2
    • 1 socket of 8 multi-threaded cores
    • Each core supports 8 hardware thread contexts, for 64 threads total
  • Intel Clovertown
    • 2 "traditional" quad-core sockets
  • BlueGene/P
    • 1 quad-core socket
• MPI for inter-process communication
  • Shared-memory MPICH2 1.0.7
Threads v. Processes (Niagara2)
• Barrier performance
  • Perform a barrier across all 64 threads
  • Threads arranged into processes in different ways: one extreme has one thread per process, the other has 1 process with 64 threads
  • MPI_Barrier() called between processes; a flat barrier is used amongst the threads within a process (see the sketch below)
• 2 orders of magnitude difference in performance!
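A minimal sketch of the hybrid scheme being compared, assuming MPI is initialized with MPI_THREAD_FUNNELED and a POSIX barrier sized to the number of threads per process; this illustrates the idea, it is not the slides' implementation:

```c
#include <mpi.h>
#include <pthread.h>

/* One flat barrier shared by the threads of this process, initialized elsewhere
 * with pthread_barrier_init(&intra, NULL, threads_per_process). */
static pthread_barrier_t intra;

void hybrid_barrier(int my_thread_id) {
    /* Phase 1: wait until every thread in this process has arrived */
    pthread_barrier_wait(&intra);

    /* Phase 2: one representative thread per process synchronizes across processes */
    if (my_thread_id == 0)
        MPI_Barrier(MPI_COMM_WORLD);

    /* Phase 3: release the local threads once the inter-process barrier completes */
    pthread_barrier_wait(&intra);
}
```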
Threads v. Processes (Niagara2), cont.
• Other collectives see similar scaling issues when using processes
• MPI collectives are called between processes, while shared memory is leveraged within a process
Intel Clovertown and BlueGene/P
• Fewer threads per node
• Differences are not as drastic, but they are non-trivial
[Charts: Intel Clovertown, BlueGene/P]
Optimizing Barrier w/ Trees
• Leveraging shared memory is a critical optimization
• Flat trees don't scale; use trees to aid parallelism
• Requires two passes of a tree (a sketch follows below)
  • First (UP) pass indicates that all threads have arrived
    • Signal the parent when all of your children have arrived
    • Once the root gets the signal from all of its children, all threads have reported in
  • Second (DOWN) pass signals that all threads may exit the barrier
    • Wait for the parent to send me a clear signal
    • Propagate the clear signal down to my children
[Diagram: radix-2 tree over threads 0-15]
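A minimal sketch of such a two-pass tree barrier, assuming the parent/children arrays for the tree have been computed (e.g., as in the k-nomial construction on the next slide) and using a sense-reversal trick in place of the two alternating sets of flags mentioned in the backup slides; this is illustrative, not the tuned implementation:

```c
#include <stdatomic.h>

#define MAX_THREADS  64
#define MAX_CHILDREN 8

/* One flag pair per thread, padded so each thread's flags sit on their own cache line */
typedef struct {
    atomic_int arrived;                      /* written only by this thread (UP pass)  */
    atomic_int release;                      /* written only by its parent (DOWN pass) */
    char pad[64 - 2 * sizeof(atomic_int)];
} flags_t;

static flags_t flags[MAX_THREADS];
static int children[MAX_THREADS][MAX_CHILDREN];
static int num_children[MAX_THREADS];        /* tree shape filled in elsewhere */

void tree_barrier(int me) {
    static _Thread_local int sense = 1;      /* flips each barrier; replaces explicit flag resets */

    /* UP pass: wait for all my children, then tell my parent I have arrived */
    for (int c = 0; c < num_children[me]; c++)
        while (atomic_load(&flags[children[me][c]].arrived) != sense) /* spin */;
    atomic_store(&flags[me].arrived, sense);

    /* DOWN pass: wait for my parent's clear signal, then pass it on to my children */
    if (me != 0)
        while (atomic_load(&flags[me].release) != sense) /* spin */;
    for (int c = 0; c < num_children[me]; c++)
        atomic_store(&flags[children[me][c]].release, sense);

    sense = !sense;
}
```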
Example Tree Topologies
• Radix-2 k-nomial tree (binomial)
• Radix-4 k-nomial tree (quadnomial)
• Radix-8 k-nomial tree (octnomial)
[Diagrams: each topology drawn over 16 threads, rooted at thread 0]
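One common way to enumerate parents and children in a radix-r k-nomial tree rooted at thread 0 is sketched below; the labeling convention (clear the least significant nonzero base-r digit to find the parent) is an assumption that happens to reproduce the binomial tree pictured on the previous slide, but the slides may construct their trees differently:

```c
/* Children of thread `me` in a radix-`radix` k-nomial tree over `nthreads` threads.
 * Returns the number of children written into children[]. */
int knomial_children(int me, int nthreads, int radix, int *children) {
    int count = 0;
    for (long stride = 1; stride < nthreads; stride *= radix) {
        if (me % (stride * radix) != 0)
            break;                              /* higher digits belong to an ancestor */
        for (int d = 1; d < radix; d++) {
            long child = me + d * stride;
            if (child < nthreads)
                children[count++] = (int)child;
        }
    }
    return count;
}

/* Parent of thread `me`: zero out its least significant nonzero base-`radix` digit. */
int knomial_parent(int me, int radix) {
    if (me == 0) return -1;                     /* the root has no parent */
    long stride = 1;
    while (me % (stride * radix) == 0)
        stride *= radix;
    return me - (int)((me / stride) % radix) * (int)stride;
}
```

With radix 2 and 16 threads this gives thread 0 the children {1, 2, 4, 8} and thread 8 the children {9, 10, 12}, matching the binomial tree in the barrier diagram.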
Barrier Performance Results
• Time many back-to-back barriers
• Flat tree is just one level, with all threads reporting to thread 0
  • Leverages shared memory, but non-scalable
• Architecture-independent tree (radix = 2)
  • Pick a generic "good" radix that is suitable for many platforms
  • Mismatched to the architecture
• Architecture-dependent tree
  • Search over all radices to pick the tree that best matches the architecture
Broadcast Performance Results
• Time a latency-sensitive Broadcast (8 bytes)
• Time a Broadcast followed by a Barrier and subtract the time for the Barrier (see the measurement sketch below)
  • Yields an approximation of how long it takes for the last thread to get the data
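A minimal sketch of that measurement with plain MPI calls; the iteration count is an arbitrary choice, and a tuned harness would also discard warm-up iterations:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { ITERS = 10000, NBYTES = 8 };
    char buf[NBYTES];
    for (int i = 0; i < NBYTES; i++) buf[i] = (char)i;

    /* Broadcast followed by barrier, averaged over many back-to-back iterations */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        MPI_Bcast(buf, NBYTES, MPI_BYTE, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    double bcast_plus_barrier = (MPI_Wtime() - t0) / ITERS;

    /* Barrier alone, timed the same way */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_only = (MPI_Wtime() - t0) / ITERS;

    if (rank == 0)
        printf("approximate broadcast completion latency: %g us\n",
               (bcast_plus_barrier - barrier_only) * 1e6);
    MPI_Finalize();
    return 0;
}
```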
Reduce-To-All Performance Results
• 4 kBytes (512 doubles) Reduce-To-All
• In addition to the data movement, we also want to parallelize the computation (see the sketch below)
  • In the flat approach, the computation gets serialized at the root
  • Tree-based approaches allow us to parallelize the computation amongst all the floating-point units
  • On Niagara2, 8 threads share one FPU, thus radices 2, 4, and 8 serialize the computation in about the same way
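A minimal sketch, not the tuned code, of how a combining tree parallelizes the arithmetic: at each level, half of the remaining threads add in a partner's partial vector, so the floating-point work is spread across cores instead of being serialized at the root. A POSIX barrier separates levels here for simplicity; the tuned version would use the point-to-point flag signaling described in the backup slides.

```c
#include <pthread.h>

#define NTHREADS 64
#define NDOUBLES 512                         /* 4 kB of doubles, as in the experiment */

static double partial[NTHREADS][NDOUBLES];   /* each thread's local contribution */
static pthread_barrier_t level_barrier;      /* initialized elsewhere for NTHREADS threads */

void tree_allreduce(int me) {
    /* UP pass: pairwise combining up a radix-2 tree; thread 0 ends with the full sum */
    for (int stride = 1; stride < NTHREADS; stride *= 2) {
        pthread_barrier_wait(&level_barrier);
        if (me % (2 * stride) == 0 && me + stride < NTHREADS)
            for (int j = 0; j < NDOUBLES; j++)
                partial[me][j] += partial[me + stride][j];
    }
    /* DOWN pass: every other thread copies the finished result from thread 0 */
    pthread_barrier_wait(&level_barrier);
    if (me != 0)
        for (int j = 0; j < NDOUBLES; j++)
            partial[me][j] = partial[0][j];
}
```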
Optimization Summary
• Relying on flat trees is not enough for most collectives
• Architecture-dependent tuning is a further and important optimization
Extending the Results to a Cluster
• Use one rack of BlueGene/P (1024 nodes, or 4096 cores)
• Reduce-To-All performed by having one representative thread per process make the call to the inter-node all-reduce (see the sketch below)
  • Reduces the number of messages in the network
• Vary the number of threads per process, but use all cores
• Relying purely on shared memory doesn't always yield the best performance
  • The number of active cores working on the computation drops
  • Can optimize so that the computation is partitioned across cores
    • Not suitable for a direct call to MPI_Allreduce()
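A minimal sketch of the hierarchical scheme under assumed helper names: shm_reduce_to_thread0() and shm_broadcast_from_thread0() are hypothetical stand-ins for the shared-memory tree collectives sketched earlier, and the further optimization of partitioning the inter-node reduction across cores is not shown.

```c
#include <mpi.h>

/* Hypothetical shared-memory helpers (tree-based, intra-node) */
extern void shm_reduce_to_thread0(double *buf, int count);
extern void shm_broadcast_from_thread0(double *buf, int count);

void hierarchical_allreduce(double *buf, int count, int my_thread_id,
                            MPI_Comm internode_comm /* one MPI rank per node */) {
    /* 1. Intra-node: combine all local threads' contributions into thread 0's buffer */
    shm_reduce_to_thread0(buf, count);

    /* 2. Inter-node: only the representative thread talks to the network,
          which cuts the number of messages injected into the network */
    if (my_thread_id == 0)
        MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, internode_comm);

    /* 3. Intra-node: hand the finished result back to every local thread */
    shm_broadcast_from_thread0(buf, count);
}
```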
Potential Synchronization Problem
1. Broadcast variable x from the root
2. Have proc 1 set a new value for x on proc 4

  broadcast x=1 from proc 0
  if (myid == 1) {
    put x=5 to proc 4
  } else {
    /* do nothing */
  }

• Proc 1 thinks the collective is done and observes a globally incomplete collective
• The put of x=5 by proc 1 has been lost (the still-in-flight broadcast later delivers x=1 to proc 4, overwriting it)
[Animation: per-process values of x as the broadcast and the put race]
Strict v. Loose Synchronization
• A fix to the problem
  • Add a barrier before/after the collective (see the sketch below)
  • Enforces a global ordering of the operations
• Is there a problem?
  • We want to decouple synchronization from data movement
  • Specify the synchronization requirements
    • Potential to aggregate synchronization
    • Done by the user or a smart compiler
• How can we realize these gains in applications?
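A minimal MPI rendering of the strict fix; the one-sided window exposing x at displacement 0 on every rank is an assumption made for this example, since the slides' scenario is written in a PGAS-style put notation rather than MPI:

```c
#include <mpi.h>

/* Assumes `win` exposes the integer x at displacement 0 on every rank. */
void strict_broadcast_then_put(int *x, MPI_Win win, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) *x = 1;
    MPI_Bcast(x, 1, MPI_INT, 0, comm);

    /* Strict synchronization: no rank touches x remotely until every rank has
       finished the broadcast. Without this barrier, proc 1's put below could be
       overwritten (and lost) when the broadcast later delivers x=1 to proc 4. */
    MPI_Barrier(comm);

    if (rank == 1) {
        int five = 5;
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, /*target*/ 4, 0, win);
        MPI_Put(&five, 1, MPI_INT, /*target*/ 4, /*disp*/ 0, 1, MPI_INT, win);
        MPI_Win_unlock(4, win);
    }
}
```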
Conclusions
• Moving from processes to threads is a crucial optimization for single-node collective communication
• Can use tree-based collectives to realize better performance, even for collectives on one node
  • Picking the tree that best matches the architecture yields the best performance
• Multicore adds to the (auto)tuning space for collective communication
• Shared-memory semantics allow us to create new, loosely synchronized collectives
Questions?
Backup Slides
Threads and Processes
• Threads
  • A sequence of instructions and an execution stack
  • Communication between threads happens through a common, shared address space
    • No OS/network involvement needed
  • Reasoning about inter-thread communication can be tricky
• Processes
  • A set of threads and an associated memory space
  • All threads within a process share its address space
  • Communication between processes must be managed through the OS
    • Inter-process communication is explicit but may be slow
  • More expensive to switch between processes
Experimental Platforms
[Photos: Clovertown, Niagara2, BG/P]
Specs
[Table: platform specifications]
Details of Signaling
• For optimum performance, have many readers and one writer
  • Each thread sets a flag (a single word) that others will read
  • Every reader gets a copy of the cache line and spins on that copy
  • When the writer changes the value of the flag, the cache-coherency system handles broadcasting/updating the change
  • Avoid atomic primitives
• On the way up the tree, a child sets a flag indicating that its subtree has arrived
  • The parent spins on that flag for each child
• On the way down, each child spins on its parent's flag
  • When it is set, it indicates that the parent wants to broadcast the clear signal down
• Flags must be on different cache lines to avoid false sharing
• Need to switch back and forth between two sets of flags (see the layout sketch below)
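A minimal sketch of one possible flag layout matching this description, assuming 64-byte cache lines; the exact structure used in the tuned library is not shown in the slides:

```c
#include <stdatomic.h>

#define CACHE_LINE  64
#define MAX_THREADS 64

/* One writer per flag; each flag lives on its own cache line to avoid false
 * sharing, and flags come in two sets (indexed by a phase bit) so consecutive
 * collectives never reuse a flag before everyone is finished with it. */
typedef struct {
    _Alignas(CACHE_LINE) atomic_int arrived[2];  /* [phase], written only by this thread */
    _Alignas(CACHE_LINE) atomic_int release[2];  /* [phase], written only by its parent  */
} signal_flags_t;

static signal_flags_t flags[MAX_THREADS];

/* Readers spin on their locally cached copy of the line; the coherence protocol
 * propagates the writer's single store when the value finally changes. */
static inline void wait_until(atomic_int *flag, int expected) {
    while (atomic_load_explicit(flag, memory_order_acquire) != expected)
        /* spin */ ;
}
```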