This research paper explores the benefits of executing MPI nodes as threads on SMP clusters, with a focus on inter-machine communication optimization. The study includes a comparison of TMPI and MPICH implementations and an experimental analysis of performance on a cluster of SMPs.
Optimizing Threaded MPI Execution on SMP Clusters
Hong Tang and Tao Yang
Department of Computer Science, University of California, Santa Barbara
Parallel Computation on SMP Clusters
• Massively Parallel Machines → SMP Clusters
• Commodity Components: Off-the-shelf Processors + Fast Network (Myrinet, Fast/Gigabit Ethernet)
• Parallel Programming Model for SMP Clusters
  • MPI: Portability, Performance, Legacy Programs
  • MPI + Variations: MPI + Multithreading, MPI + OpenMP
Threaded MPI Execution
• MPI Paradigm: Separate Address Spaces for Different MPI Nodes
• Natural Solution: MPI Nodes → Processes
• What if we map MPI nodes to threads? (a minimal sketch follows this list)
  • Faster synchronization among MPI nodes running on the same machine.
  • Demonstrated in previous work [PPoPP ’99] for a single shared-memory machine (developed techniques to safely execute MPI programs using threads).
• Threaded MPI Execution on SMP Clusters
  • Intra-Machine Communication through Shared Memory
  • Inter-Machine Communication through Network
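A minimal sketch of the thread-per-MPI-node mapping, assuming POSIX threads; this is not TMPI's actual code, and mpi_node_main and NODES_PER_MACHINE are made-up names for illustration. Each MPI node's program body runs as one thread inside a single process, so nodes on the same machine share an address space and can synchronize through ordinary memory.

```c
#include <pthread.h>
#include <stdio.h>

#define NODES_PER_MACHINE 4   /* hypothetical: MPI nodes assigned to this machine */

/* Each "MPI node" is an ordinary function executed by one thread. */
static void *mpi_node_main(void *arg) {
    int rank = (int)(long)arg;    /* node rank within this machine */
    printf("MPI node %d running as a thread\n", rank);
    /* ... node's program body: intra-machine sends become shared-memory
       operations, inter-machine sends go out over the network ... */
    return NULL;
}

int main(void) {
    pthread_t nodes[NODES_PER_MACHINE];
    for (long r = 0; r < NODES_PER_MACHINE; r++)
        pthread_create(&nodes[r], NULL, mpi_node_main, (void *)r);
    for (int r = 0; r < NODES_PER_MACHINE; r++)
        pthread_join(nodes[r], NULL);
    return 0;
}
```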
Threaded MPI Execution Benefits Inter-Machine Communication
• Common Intuition: Inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
• Our Findings: Using threads can significantly reduce the buffering and orchestration overhead for inter-machine communication.
Related Work
• MPI on Network Clusters
  • MPICH – a portable MPI implementation.
  • LAM/MPI – communication through a standalone RPI server.
• Collective Communication Optimization
  • SUN-MPI and MPI-StarT – modify the MPICH ADI layer; target SMP clusters.
  • MagPIe – targets SMP clusters connected through a WAN.
• Lower Communication Layer Optimization
  • MPI-FM and MPI-AM.
• Threaded Execution of Message Passing Programs
  • MPI-Lite, LPVM, TPVM.
Background: MPICH Design
MPICH Communication Structure
(figures: MPICH without shared memory; MPICH with shared memory)
TMPI Communication Structure
Comparison of TMPI and MPICH
• Drawbacks of MPICH w/ Shared Memory
  • Intra-node communication limited by shared memory size.
  • Busy polling to check messages from either the daemon or a local peer.
  • Cannot do automatic resource clean-up.
• Drawbacks of MPICH w/o Shared Memory
  • Big overhead for intra-node communication.
  • Too many daemon processes and open connections.
• Drawbacks of Both MPICH Systems
  • Extra data copying for inter-machine communication.
TMPI Communication Design
Separation of Point-to-Point and Collective Communication Channels
• Observation: MPI point-to-point communication and collective communication have different semantics.
• Separated channels for point-to-point and collective communication (a structural sketch follows this list).
  • Eliminates daemon intervention for collective communication.
  • Less effective for MPICH – no sharing of ports among processes.
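A structural sketch of what separate channels could look like; the type and field names below are assumptions, not TMPI's real data structures. The point is that collective traffic has its own queue per communicator and never has to be demultiplexed from point-to-point traffic by a daemon.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative only: names and layout are assumptions. */
typedef struct msg {
    struct msg *next;
    int         src_rank;
    size_t      len;
    void       *data;
} msg_t;

typedef struct msg_queue {
    pthread_mutex_t lock;
    msg_t          *head, *tail;
} msg_queue_t;

/* Per-communicator channel pair: point-to-point and collective messages
   never share a queue, so a collective bypasses the point-to-point path. */
typedef struct channel_pair {
    msg_queue_t pt2pt;        /* MPI_Send / MPI_Recv traffic          */
    msg_queue_t collective;   /* MPI_Bcast / MPI_Reduce / ... traffic */
} channel_pair_t;
```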
Hierarchy-Aware Collective Communication
• Observation: Two-level communication hierarchy.
  • Inside an SMP node: shared memory (10⁻⁸ sec)
  • Between SMP nodes: network (10⁻⁶ sec)
• Idea: Build the communication spanning tree in two steps (a sketch follows this list).
  • First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
  • Second, all other MPI nodes connect to the local root node.
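The two-step idea can also be expressed with standard MPI calls; the sketch below is not TMPI's implementation (it assumes an MPI-3 library for MPI_Comm_split_type and that the broadcast data originates at global rank 0). It splits the world into an intra-machine communicator and a communicator of per-machine roots, broadcasts between machines first, then inside each machine.

```c
#include <mpi.h>

/* Two-step broadcast sketch (illustrative, not TMPI internals). */
void hierarchical_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, root_comm;
    int node_rank;

    /* Group the ranks that share a machine (shared-memory domain). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* The per-machine roots (node_rank == 0) form the inter-machine tree. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &root_comm);

    if (root_comm != MPI_COMM_NULL) {            /* step 1: between machines */
        MPI_Bcast(buf, count, type, 0, root_comm);
        MPI_Comm_free(&root_comm);
    }
    MPI_Bcast(buf, count, type, 0, node_comm);   /* step 2: inside each machine */
    MPI_Comm_free(&node_comm);
}
```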
Adaptive Buffer Management
• Question: How do we manage temporary buffering of message data when the remote receiver is not ready to accept it?
• Choices:
  • Send the data with the request – eager push.
  • Send the request only and send the data when the receiver is ready – three-phase protocol.
• TMPI – adapts between the two methods (a send-side sketch follows this list).
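A hedged send-side sketch of choosing between the two protocols. The threshold value and the helper functions (receiver_posted_recv, eager_push, three_phase_send) are hypothetical; TMPI's actual adaptation policy is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical helpers, declared only to keep the sketch self-contained. */
int  receiver_posted_recv(int dest);
void eager_push(int dest, const void *buf, size_t len);
void three_phase_send(int dest, const void *buf, size_t len);

enum { EAGER_LIMIT = 16 * 1024 };   /* assumed threshold in bytes */

void adaptive_send(int dest, const void *buf, size_t len)
{
    if (len <= EAGER_LIMIT || receiver_posted_recv(dest)) {
        /* Small message, or the receiver is already ready: ship the data
           together with the request (eager push). */
        eager_push(dest, buf, len);
    } else {
        /* Large message and receiver not ready: send the request now and
           transfer the data once the receiver signals readiness
           (three-phase protocol). */
        three_phase_send(dest, buf, len);
    }
}
```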
Experimental Study
• Goal: Illustrate the advantage of threaded MPI execution on SMP clusters.
• Hardware Setting
  • A cluster of 6 Quad-Xeon 500 MHz SMPs, each with 1 GB main memory and 2 Fast Ethernet cards.
• Software Setting
  • OS: Red Hat Linux 6.0, kernel version 2.2.15 with channel bonding enabled.
  • Process-based MPI system: MPICH 1.2
  • Thread-based MPI system: TMPI (45 functions in the MPI 1.1 standard)
Inter-Cluster-Node Point-to-Point
• Ping-pong, TMPI vs. MPICH w/ shared memory
(figures: (a) Ping-Pong Short Message – round trip time vs. message size in bytes; (b) Ping-Pong Long Message – transfer rate (MB) vs. message size in KB)
Intra-Cluster-Node Point-to-Point
• Ping-pong, TMPI vs. MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory)
(figures: (a) Ping-Pong Short Message – round trip time vs. message size in bytes; (b) Ping-Pong Long Message – transfer rate (MB) vs. message size in KB)
Collective Communication
• Reduce, Bcast, Allreduce.
• TMPI / MPICH_SHM / MPICH_NOSHM
• Three node distributions, three root node settings.
• Findings:
  1) MPICH w/o shared memory performs the worst.
  2) TMPI is 70+ times faster than MPICH w/ shared memory for MPI_Bcast and MPI_Reduce.
  3) For TMPI, the performance of the 4x4 cases is roughly the sum of that of the 4x1 cases and that of the 1x4 cases.
Macro-Benchmark Performance
Conclusions
• Great Advantage of Threaded MPI Execution on SMP Clusters
  • Micro-benchmarks: 70+ times faster than MPICH.
  • Macro-benchmarks: 100% faster than MPICH.
• Optimization Techniques
  • Separated Collective and Point-to-Point Communication Channels
  • Adaptive Buffer Management
  • Hierarchy-Aware Communications
http://www.cs.ucsb.edu/projects/tmpi/
Background: Safe Execution of MPI Programs Using Threads
• Program Transformation: Eliminate global and static variables (called permanent variables).
• Thread-Specific Data (TSD): Each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copies of the data variable.
• TSD-Based Transformation: Each permanent variable declaration is replaced with a key declaration. Each node associates its private copy of the permanent variable with the corresponding key. Where permanent variables are referenced, the global keys are used to retrieve the per-thread copies of the variables.
Program Transformation – An Example
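The slide's original code figure is not reproduced here; the following is a hedged reconstruction of the transformation described on the previous slide, using POSIX thread-specific data. The permanent variable counter and the helper names are illustrative, not taken from TMPI's transformer output.

```c
#include <pthread.h>
#include <stdlib.h>

/* Before the transformation (one copy shared by every thread):
 *     static int counter = 0;
 *     void bump(void) { counter++; }
 *
 * After the TSD-based transformation, the permanent variable becomes a key,
 * and each MPI node (thread) reaches its private copy through that key.
 */
static pthread_key_t  counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_key_create(void) {
    pthread_key_create(&counter_key, free);   /* free each thread's copy on exit */
}

static int *counter_ptr(void) {
    pthread_once(&counter_once, counter_key_create);
    int *p = pthread_getspecific(counter_key);
    if (p == NULL) {                          /* first reference in this thread */
        p = calloc(1, sizeof *p);
        pthread_setspecific(counter_key, p);
    }
    return p;
}

void bump(void) {
    (*counter_ptr())++;                       /* each MPI node bumps its own copy */
}
```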