Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE™ Architecture

Kumar¹, G. Senthilkumar¹, M. Krishna¹, N. Jayam¹, P.K. Baruah¹, R. Sarma¹, S. Kapoor², A. Srinivasan³
¹ Sri Sathya Sai University, Prashanthi Nilayam, India
² IBM, Austin, skapoor@us.ibm.com
³ Florida State University, asriniva@cs.fsu.edu

Goals
• Determine the feasibility of Intra-Cell MPI
• Evaluate the impact of different design choices on performance

Cell Architecture
• A PowerPC core (PPE), with 8 co-processors (SPEs) with 256 KB local store each
• Shared main memory of 512 MB to 2 GB, which the SPEs can access through DMA
• Peak speeds of 204.8 Gflops in single precision and 14.64 Gflops in double precision for SPEs
• 204.8 GB/s EIB bandwidth, 25.6 GB/s for memory
• Two Cell processors can be combined to form a Cell blade with global shared memory
• Memory to memory copy using:
  • SPE local store
  • memcpy by PPE
[Figure: DMA put times]
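
The peak figures on this slide can be broken down per SPE with simple arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the even split of memory bandwidth across SPEs is an assumption made for the sake of the example, not a measured result.

```python
# Figures quoted on the slide, for one Cell processor with 8 SPEs.
NUM_SPES = 8
PEAK_SP_GFLOPS = 204.8  # single precision, all SPEs together
PEAK_DP_GFLOPS = 14.64  # double precision, all SPEs together
EIB_GBPS = 204.8        # EIB bandwidth
MEM_GBPS = 25.6         # main memory bandwidth


def per_spe(total, num_spes=NUM_SPES):
    """Divide a chip-wide figure evenly across the SPEs."""
    return total / num_spes


sp_per_spe = per_spe(PEAK_SP_GFLOPS)  # single precision Gflops per SPE
dp_per_spe = per_spe(PEAK_DP_GFLOPS)  # double precision Gflops per SPE
eib_to_mem = EIB_GBPS / MEM_GBPS      # how much faster the EIB is than memory

# Assumption: all 8 SPEs stream from main memory at once and share it evenly.
mem_share_per_spe = per_spe(MEM_GBPS)

print(f"single precision per SPE: {sp_per_spe:.2f} Gflops")
print(f"double precision per SPE: {dp_per_spe:.2f} Gflops")
print(f"EIB / memory bandwidth:   {eib_to_mem:.1f}x")
print(f"memory share per SPE:     {mem_share_per_spe:.2f} GB/s")
```

The gap between EIB and main memory bandwidth is one reason the later slides distinguish between application data held in local store and data held in main memory.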

Intra-Cell MPI Design Choices
• Cell features
  • In-order execution, but DMAs can be out of order
  • Over 100 simultaneous DMAs can be in flight
• Constraints
  • Unconventional, heterogeneous architecture
  • SPEs have limited functionality, and can act directly only on local stores
  • SPEs access main memory through DMA
  • Use of the PPE should be limited to get good performance
• MPI design choices
  • Application data in: (i) local store or (ii) main memory
  • MPI meta-data in: (i) local store or (ii) main memory
  • PPE involvement: (i) active or (ii) only during initialization and finalization
  • Point-to-point communication mode: (i) synchronous or (ii) buffered

Blocking Point-to-Point Communication Performance
• Results are from a 3.2 GHz Cell Blade at IBM Rochester
• The final version uses buffered mode for small messages and synchronous mode for long messages
• The threshold for switching to synchronous mode is set to 2 KB
• In these figures, the defaults are: application data in main memory, MPI data in local store, no congestion, and limited PPE involvement
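
The switch between buffered and synchronous mode amounts to a size check at send time. The following is a minimal sketch of that decision, not the authors' code: the function and constant names are hypothetical, and because the slide does not say which mode a message of exactly 2 KB uses, treating the boundary as buffered is an assumption.

```python
# Threshold quoted on the slide: switch to synchronous mode at 2 KB.
MODE_THRESHOLD_BYTES = 2 * 1024


def choose_mode(message_bytes, threshold=MODE_THRESHOLD_BYTES):
    """Pick the point-to-point mode for a message of the given size.

    Small messages are copied through a buffer so the sender can return
    early; long messages are sent synchronously to avoid the extra copy.
    """
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    # Assumption: a message exactly at the threshold stays in buffered mode.
    if message_bytes <= threshold:
        return "buffered"
    return "synchronous"


for size in (64, 2048, 4096, 65536):
    print(size, choose_mode(size))
```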

Collective Communication Example – Broadcast
• Broadcast on 16 SPEs (2 processors)
  • TREE: Pipelined tree structured communication based on LS
  • TREEMM: Tree structured Send/Recv type implementation
  • AG: Each SPE is responsible for a different portion of data
  • OTA: Each SPE copies data to its location
  • G: Root copies all data
• Broadcast with good choice of algorithms for each data size and SPE count
• Maximum main memory bandwidth is also shown
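
To illustrate what a tree structured broadcast looks like, the sketch below builds the send schedule for a binomial tree, in which the set of ranks holding the data doubles every round. The slide does not state the shape of the tree used by TREE or TREEMM, so the binomial shape is an assumption, and the pipelining of TREE is not modeled. The function name is hypothetical.

```python
def binomial_broadcast_schedule(num_ranks, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs."""
    if num_ranks < 1:
        raise ValueError("need at least one rank")
    if not 0 <= root < num_ranks:
        raise ValueError("root must be a valid rank")

    rounds = []
    holders = 1  # ranks (relative to the root) that already hold the data
    while holders < num_ranks:
        sends = []
        for rel_src in range(holders):
            rel_dst = rel_src + holders
            if rel_dst < num_ranks:
                # Map ranks relative to the root back to actual ranks.
                src = (rel_src + root) % num_ranks
                dst = (rel_dst + root) % num_ranks
                sends.append((src, dst))
        rounds.append(sends)
        holders *= 2
    return rounds


schedule = binomial_broadcast_schedule(16)
for step, sends in enumerate(schedule):
    print(f"round {step}: {sends}")
```

With 16 ranks the data reaches every rank in 4 rounds, compared with 15 sequential sends if the root contacted each rank in turn.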

Application Performance – Matrix-Vector Multiplication
• Used a 1-D decomposition (not very efficient)
• Achieved a peak double precision throughput of 7.8 Gflop/s for matrices of size 1024
• The collective used was from an older implementation on the Cell, built on top of Send/Recv using a tree structured communication
• The Opteron results used LAM MPI
[Figure or table: Performance of double precision matrix-vector multiplication]
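
A 1-D decomposition assigns each process a contiguous block of the matrix. The sketch below simulates this serially with row blocks; the slide does not say whether rows or columns were distributed, nor which collective was used, so the row-wise split and the gather of the partial results are assumptions. All function names are hypothetical.

```python
def row_blocks(n, parts):
    """Split range(n) into contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n, parts)
    blocks = []
    start = 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        blocks.append((start, start + size))
        start += size
    return blocks


def local_matvec(A, x, lo, hi):
    """Work done by one process: multiply its rows [lo, hi) by the full vector."""
    return [sum(a * b for a, b in zip(A[i], x)) for i in range(lo, hi)]


def matvec_1d(A, x, parts):
    """Simulate `parts` processes; concatenation stands in for a gather collective."""
    y = []
    for lo, hi in row_blocks(len(A), parts):
        y.extend(local_matvec(A, x, lo, hi))
    return y


A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
x = [1.0, 2.0, 3.0, 4.0]
print(matvec_1d(A, x, parts=3))
```

In this layout every process needs the whole input vector, which is one reason a 1-D decomposition is communication heavy.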

Conclusions and Future Work
Conclusions
• The Cell processor has good potential for MPI applications
• PPE should have a very limited role
• Very high bandwidths with application data in local store
• High bandwidth and low latency even with application data in main memory
  • But local store should be used effectively, with double buffering to hide latency
  • Main memory bandwidth is then the bottleneck
• Good performance for collectives even with two Cell processors
Current and future work
• Implemented
  • Collective communication operations optimized for contiguous data
  • Blocking and non-blocking communication
• Future work
  • Optimize collectives for derived data types with non-contiguous data
  • Optimize point-to-point communication on blade with two processors
  • More features, such as topologies, etc.
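
The double buffering mentioned in the conclusions overlaps the transfer of the next chunk with computation on the current one. The sketch below shows only the loop structure, under stated assumptions: a background thread stands in for an asynchronous DMA transfer, a Python list stands in for a local-store buffer, and `fetch` and `sum_double_buffered` are hypothetical names. Real SPE code would issue DMA requests instead.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(data, start, size):
    """Stand-in for a DMA get from main memory into a local-store buffer."""
    return list(data[start:start + size])


def sum_double_buffered(data, chunk_size):
    """Sum `data` chunk by chunk, fetching chunk i+1 while summing chunk i."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Prime the pipeline: start the transfer of the first chunk.
        pending = dma.submit(fetch, data, 0, chunk_size)
        for start in range(0, len(data), chunk_size):
            current = pending.result()  # wait until the buffer is filled
            next_start = start + chunk_size
            if next_start < len(data):
                # Start the next transfer before computing on the current buffer.
                pending = dma.submit(fetch, data, next_start, chunk_size)
            total += sum(current)  # compute while the next transfer is in flight
    return total


print(sum_double_buffered(list(range(100)), chunk_size=16))
```

Only two buffers are live at any time, the one being computed on and the one being filled, which matters when the local store is as small as 256 KB.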