Optimization of Collective Communication in Intra-Cell MPI
Ashok Srinivasan, Florida State University, asriniva@cs.fsu.edu
Goals • Efficient implementation of collectives for intra-Cell MPI • Evaluate the impact of different algorithms on performance
Collaborators: A. Kumar1, G. Senthilkumar1, M. Krishna1, N. Jayam1, P.K. Baruah1, R. Sarma1, S. Kapoor2 (1 Sri Sathya Sai University, Prashanthi Nilayam, India; 2 IBM, Austin)
Acknowledgment: IBM, for providing access to a Cell blade under the VLP program
Outline • Cell Architecture • Intra-Cell MPI Design Choices • Barrier • Broadcast • Reduce • Conclusions and Future Work
Cell Architecture • A PowerPC core (PPE) with 8 co-processors (SPEs), each with a 256 KB local store • Shared 512 MB - 2 GB main memory, which the SPEs access through DMA • Peak speeds of 204.8 Gflops in single precision and 14.64 Gflops in double precision for the SPEs • 204.8 GB/s EIB bandwidth, 25.6 GB/s to main memory • Two Cell processors can be combined to form a Cell blade with global shared memory
[Figures: DMA put times; memory-to-memory copy using the SPE local store vs. memcpy by the PPE]
Intra-Cell MPI Design Choices • Cell features • In order execution, but DMAs can be out of order • Over 100 simultaneous DMAs can be in flight • Constraints • Unconventional, heterogeneous architecture • SPEs have limited functionality, and can act directly only on local stores • SPEs access main memory through DMA • Use of PPE should be limited to get good performance • MPI design choices • Application data in: (i) local store or (ii) main memory • MPI data in: (i) local store or (ii) main memory • PPE involvement: (i) active or (ii) only during initialization and finalization • Collective calls can: (i) synchronize or (ii) not synchronize
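To make the "application data in main memory" choice concrete, here is a minimal sketch of an SPE pulling one chunk of a buffer from main memory into its local store with the Cell SDK MFC intrinsics (spu_mfcio.h); the chunk size and the effective address ea_src are illustrative assumptions, not values from the slides.

#include <spu_mfcio.h>

#define CHUNK 16384   /* 16 KB: the largest single DMA transfer, fits easily in the 256 KB local store */

static char ls_buf[CHUNK] __attribute__((aligned(128)));

/* Pull one chunk of application data from main memory (effective address
 * ea_src, an illustrative parameter) into the local store and wait for it. */
void fetch_chunk(unsigned long long ea_src)
{
    const unsigned int tag = 0;
    mfc_get(ls_buf, ea_src, CHUNK, tag, 0, 0);  /* asynchronous DMA get */
    mfc_write_tag_mask(1 << tag);               /* select this tag */
    mfc_read_tag_status_all();                  /* block until the DMA completes */
}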
Barrier (1) • OTA List: “Root” receives notification from all others, and then acknowledges through a DMA list • OTA: Like OTA List, but root notifies others through individual non-blocking DMAs • SIG: Like OTA, but others notify root through a signal register in OR mode • Degree-k TREE • In each step, a node has k-1 children • In the first phase, children notify parents • In the second phase, parents acknowledge children
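A minimal sketch of the two-phase degree-k TREE barrier described above, written against ordinary shared-memory flags so the gather/release structure is visible; on the Cell the notifications would instead be small DMA puts or signal-register writes, and the names (K, MAX_SPES, arrive, release) are illustrative, not taken from the implementation.

/* Shared flags; plain volatile ints stand in for the actual DMA or
 * signal-register notifications used on the Cell. */
#define K        4    /* tree degree: each node has K-1 children (illustrative) */
#define MAX_SPES 16

volatile int arrive[MAX_SPES];   /* child -> parent notification */
volatile int release[MAX_SPES];  /* parent -> child acknowledgement */

void tree_barrier(int rank, int nprocs)
{
    int first_child = rank * (K - 1) + 1;
    int last_child  = first_child + (K - 1);
    if (last_child > nprocs) last_child = nprocs;

    /* Phase 1: wait for every child, then notify the parent. */
    for (int c = first_child; c < last_child; c++) {
        while (!arrive[c]) ;     /* spin until child c (and its subtree) has arrived */
        arrive[c] = 0;           /* reset for the next barrier */
    }
    if (rank != 0) {
        arrive[rank] = 1;        /* notify the parent, rank (rank-1)/(K-1) */
        while (!release[rank]) ; /* wait for the parent's acknowledgement */
        release[rank] = 0;
    }

    /* Phase 2: acknowledge every child. */
    for (int c = first_child; c < last_child; c++)
        release[c] = 1;
}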
Barrier (2) • PE: Consider the SPUs to be a logical hypercube – in each step, each SPU exchanges messages with its neighbor along one dimension • DIS: In step i, SPU j sends to SPU j + 2^i and receives from SPU j - 2^i, with wrap-around (sketched below) • Alternatives • Atomic increments in main memory – several microseconds • PPE coordinates using the mailbox – tens of microseconds
[Figure: comparison of MPI_Barrier on different hardware]
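A corresponding sketch of the DIS (dissemination) barrier, again with plain shared-memory flags standing in for the actual DMA or signal notifications; the flag layout (one flag per step per SPU) is an assumption made for clarity.

#define MAX_SPES  16
#define MAX_STEPS 4   /* log2(MAX_SPES) */

volatile int flag[MAX_STEPS][MAX_SPES];   /* one flag per step per SPU */

void dissemination_barrier(int rank, int nprocs)
{
    int step = 0;
    for (int dist = 1; dist < nprocs; dist *= 2, step++) {
        int partner = (rank + dist) % nprocs;   /* SPU 2^step ahead, with wrap-around */
        flag[step][partner] = 1;                /* notify it */
        while (!flag[step][rank]) ;             /* wait for the SPU 2^step behind */
        flag[step][rank] = 0;                   /* reset; a real implementation would
                                                   alternate flag sets (sense reversal)
                                                   across consecutive barriers */
    }
}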
Broadcast (1) • OTA: Each SPE copies the data to its own location (sketched below) • Different shifts are used to avoid hotspots in memory • Different shifts on larger numbers of SPUs yield results that are close to each other • AG: Each SPE is responsible for a different portion of the data • Different minimum sizes are tried
[Figures: OTA on 4 SPUs; AG on 16 SPUs]
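A sketch of the OTA broadcast with shifts: every SPE streams the root's source buffer from main memory through its local store into its own destination buffer, each starting at a different chunk offset so the SPEs do not all read the same region at once. The buffer addresses, the chunk size, and the exact shift formula are assumptions for illustration; the DMA calls are the Cell SDK MFC intrinsics.

#include <spu_mfcio.h>

#define CHUNK 16384   /* illustrative chunk size (16 KB, the largest single DMA) */

static char ls_buf[CHUNK] __attribute__((aligned(128)));

/* Copy the root's buffer (src_ea) into this SPE's destination buffer (dst_ea),
 * staging each chunk through the local store.  Each SPE starts at a different
 * chunk so that the SPEs do not all hit the same part of memory at once. */
void ota_bcast(unsigned long long src_ea, unsigned long long dst_ea,
               unsigned int nbytes, int rank, int nspes)
{
    unsigned int nchunks = nbytes / CHUNK;            /* assume nbytes is a multiple of CHUNK */
    unsigned int shift   = (rank * nchunks) / nspes;  /* one possible choice of per-SPE shift */

    for (unsigned int i = 0; i < nchunks; i++) {
        unsigned int c = (shift + i) % nchunks;
        unsigned long long off = (unsigned long long)c * CHUNK;

        mfc_get(ls_buf, src_ea + off, CHUNK, 0, 0, 0);   /* main memory -> local store */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();

        mfc_put(ls_buf, dst_ea + off, CHUNK, 0, 0, 0);   /* local store -> this SPE's buffer */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();
    }
}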
Broadcast (2) • TREEMM: Tree-structured Send/Recv-type implementation • Data for degrees 2 and 4 are close • Degree 3 is best, or close to it, for all SPU counts • TREE: Pipelined tree-structured communication based on the local stores (sketched below) • Results for other SPU counts are similar
[Figures: TREEMM on 12 SPUs; TREE on 16 SPUs]
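A sketch of the forwarding loop of the pipelined TREE broadcast on an interior node of the tree: the message is split into chunks, and chunk i is forwarded from one local-store buffer to the children while the parent delivers chunk i+1 into the other buffer. The synchronization helpers wait_for_chunk and notify_children and the child local-store addresses are hypothetical placeholders; the real implementation's handshaking is not shown.

#include <spu_mfcio.h>

#define CHUNK 16384   /* illustrative chunk size */

static char ls_buf[2][CHUNK] __attribute__((aligned(128)));

/* Hypothetical synchronization helpers: waiting until chunk i from the
 * parent has landed in ls_buf[i & 1], and telling the children that their
 * copy of chunk i is ready. */
void wait_for_chunk(unsigned int i);
void notify_children(unsigned int i);

/* Forwarding loop on an interior node: child_ls_ea[] holds the effective
 * addresses of the children's local-store receive buffers. */
void tree_bcast_forward(unsigned long long child_ls_ea[], int nchildren,
                        unsigned int nchunks)
{
    for (unsigned int i = 0; i < nchunks; i++) {
        unsigned int cur = i & 1;               /* buffer holding chunk i */

        wait_for_chunk(i);                      /* parent's DMA of chunk i has completed */

        for (int c = 0; c < nchildren; c++)     /* put chunk i into each child's buffer */
            mfc_put(ls_buf[cur],
                    child_ls_ea[c] + (unsigned long long)cur * CHUNK,
                    CHUNK, cur, 0, 0);

        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();              /* chunk i delivered to all children */
        notify_children(i);
    }
}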
Broadcast (3) • Broadcast on 16 SPEs (2 processors), comparing: • TREE: Pipelined tree-structured communication based on the local stores • TREEMM: Tree-structured Send/Recv-type implementation • AG: Each SPE is responsible for a different portion of the data • OTA: Each SPE copies the data to its own location • G: The root copies all the data • Broadcast with a good choice of algorithm for each data size and SPE count • Maximum main memory bandwidth is also shown
Broadcast (4) • Each node of the SX-8 has 8 vector processors capable of 16 Gflop/s each, with 64 GB/s bandwidth to memory from each processor • The total bandwidth to memory for a node is 512 GB/s • Nodes are connected through a crossbar switch capable of 16 GB/s in each direction • The Altix is a CC-NUMA system with a global shared memory • Each node contains eight Itanium 2 processors • Nodes are connected using NUMALINK4 – the bandwidth between processors on a node is 3.2 GB/s, and between nodes 1.6 GB/s
[Figure: comparison of MPI_Bcast on different hardware]
Reduce • Reduce of MPI_INT with MPI_SUM on 16 SPUs (one way to organize the summation is sketched below) • Similar trends were observed for other SPU counts • Each node of the IBM SP was a 16-processor SMP
[Figure: comparison of MPI_Reduce on different hardware]
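One simple way to organize an element-wise MPI_SUM reduce of MPI_INT vectors: each SPE takes one slice of the vector, pulls that slice of every process's contribution into its local store, accumulates it, and writes the slice of the result to the root's buffer. This illustrates the kind of work involved, not necessarily the exact scheme measured above; src_ea[], dst_ea, and the slice size are assumed parameters.

#include <spu_mfcio.h>

#define SLICE_INTS 2048   /* 8 KB slice per SPE, an illustrative size */

static int acc[SLICE_INTS] __attribute__((aligned(128)));
static int tmp[SLICE_INTS] __attribute__((aligned(128)));

/* Sum one slice of every process's contribution (src_ea[p] is the effective
 * address of process p's source buffer) and write the result slice into the
 * root's destination buffer at dst_ea. */
void reduce_sum_slice(unsigned long long src_ea[], int nprocs,
                      unsigned long long dst_ea, int rank)
{
    unsigned long long off = (unsigned long long)rank * sizeof(acc);

    mfc_get(acc, src_ea[0] + off, sizeof(acc), 0, 0, 0);   /* start from process 0 */
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();

    for (int p = 1; p < nprocs; p++) {
        mfc_get(tmp, src_ea[p] + off, sizeof(tmp), 0, 0, 0);
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();
        for (int i = 0; i < SLICE_INTS; i++)
            acc[i] += tmp[i];                              /* element-wise MPI_SUM */
    }

    mfc_put(acc, dst_ea + off, sizeof(acc), 0, 0, 0);      /* write this slice of the result */
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();
}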
Conclusions and Future Work
Conclusions • The Cell processor has good potential for MPI implementations • The PPE should have a limited role • High bandwidth and low latency are obtained even with application data in main memory • But the local store should be used effectively, with double buffering to hide DMA latency (sketched below) • Main memory bandwidth then becomes the bottleneck
Current and future work • Implemented: collective communication operations optimized for contiguous data • Future work: optimize collectives for derived data types with non-contiguous data
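A minimal sketch of the double-buffering pattern mentioned in the conclusions: while the SPE works on the chunk in one local-store buffer, the next chunk is already being DMAed into the other buffer, hiding the DMA latency behind useful work. process_chunk is a hypothetical stand-in for whatever a collective does with the data.

#include <spu_mfcio.h>

#define CHUNK 16384   /* illustrative chunk size */

static char buf[2][CHUNK] __attribute__((aligned(128)));

void process_chunk(char *data, unsigned int n);   /* hypothetical consumer of the data */

/* Stream nchunks chunks from main memory (starting at src_ea), overlapping
 * the DMA of chunk i+1 with the processing of chunk i. */
void stream_from_memory(unsigned long long src_ea, unsigned int nchunks)
{
    unsigned int cur = 0;

    mfc_get(buf[cur], src_ea, CHUNK, cur, 0, 0);           /* prefetch the first chunk */

    for (unsigned int i = 0; i < nchunks; i++) {
        unsigned int next = cur ^ 1;

        if (i + 1 < nchunks)                               /* start fetching chunk i+1 */
            mfc_get(buf[next],
                    src_ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);                      /* wait only for chunk i */
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);                    /* work while chunk i+1 streams in */

        cur = next;
    }
}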