This research paper explores the benefits of executing MPI nodes as threads on SMP clusters, with a focus on inter-machine communication optimization. The study includes a comparison of TMPI and MPICH implementations and an experimental analysis of performance on a cluster of SMPs.
Optimizing Threaded MPI Execution on SMP Clusters
Hong Tang and Tao Yang
Department of Computer Science, University of California, Santa Barbara
Parallel Computation on SMP Clusters
• Massively Parallel Machines → SMP Clusters
• Commodity Components: Off-the-shelf Processors + Fast Network (Myrinet, Fast/Gigabit Ethernet)
• Parallel Programming Model for SMP Clusters
  • MPI: Portability, Performance, Legacy Programs
  • MPI + Variations: MPI + Multithreading, MPI + OpenMP
Threaded MPI Execution
• MPI Paradigm: Separate Address Spaces for Different MPI Nodes
• Natural Solution: MPI Nodes → Processes
• What if we map MPI nodes to threads? (a minimal sketch follows this list)
  • Faster synchronization among MPI nodes running on the same machine.
  • Demonstrated in previous work [PPoPP ’99] for a single shared-memory machine (developed techniques to safely execute MPI programs using threads).
• Threaded MPI Execution on SMP Clusters
  • Intra-Machine Communication through Shared Memory
  • Inter-Machine Communication through Network
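A minimal sketch of the thread-per-MPI-node mapping, assuming POSIX threads; this is not TMPI's actual code, and mpi_node_main and NODES_PER_MACHINE are made-up names for illustration. Each MPI node's program body runs as one thread inside a single process, so nodes on the same machine share an address space and can synchronize through ordinary memory.

```c
#include <pthread.h>
#include <stdio.h>

#define NODES_PER_MACHINE 4   /* hypothetical: MPI nodes assigned to this machine */

/* Each "MPI node" is an ordinary function executed by one thread. */
static void *mpi_node_main(void *arg) {
    int rank = (int)(long)arg;    /* node rank within this machine */
    printf("MPI node %d running as a thread\n", rank);
    /* ... node's program body: intra-machine sends become shared-memory
       operations, inter-machine sends go out over the network ... */
    return NULL;
}

int main(void) {
    pthread_t nodes[NODES_PER_MACHINE];
    for (long r = 0; r < NODES_PER_MACHINE; r++)
        pthread_create(&nodes[r], NULL, mpi_node_main, (void *)r);
    for (int r = 0; r < NODES_PER_MACHINE; r++)
        pthread_join(nodes[r], NULL);
    return 0;
}
```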
Threaded MPI Execution Benefits Inter-Machine Communication
• Common Intuition: Inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
• Our Findings: Using threads can significantly reduce the buffering and orchestration overhead for inter-machine communication.
Related Work
• MPI on Network Clusters
  • MPICH – a portable MPI implementation.
  • LAM/MPI – communication through a standalone RPI server.
• Collective Communication Optimization
  • SUN-MPI and MPI-StarT – modify the MPICH ADI layer; target SMP clusters.
  • MagPIe – targets SMP clusters connected through a WAN.
• Lower Communication Layer Optimization
  • MPI-FM and MPI-AM.
• Threaded Execution of Message Passing Programs
  • MPI-Lite, LPVM, TPVM.
Background: MPICH Design
MPICH Communication Structure
(figures: MPICH without shared memory; MPICH with shared memory)
TMPI Communication Structure
Comparison of TMPI and MPICH
• Drawbacks of MPICH w/ Shared Memory
  • Intra-node communication limited by shared memory size.
  • Busy polling to check messages from either the daemon or a local peer.
  • Cannot do automatic resource clean-up.
• Drawbacks of MPICH w/o Shared Memory
  • Big overhead for intra-node communication.
  • Too many daemon processes and open connections.
• Drawbacks of Both MPICH Systems
  • Extra data copying for inter-machine communication.
TMPI Communication Design
Separation of Point-to-Point and Collective Communication Channels
• Observation: MPI point-to-point communication and collective communication have different semantics.
• Separated channels for point-to-point and collective communication (a structural sketch follows this list).
  • Eliminates daemon intervention for collective communication.
  • Less effective for MPICH – no sharing of ports among processes.
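A structural sketch of what separate channels could look like; the type and field names below are assumptions, not TMPI's real data structures. The point is that collective traffic has its own queue per communicator and never has to be demultiplexed from point-to-point traffic by a daemon.

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative only: names and layout are assumptions. */
typedef struct msg {
    struct msg *next;
    int         src_rank;
    size_t      len;
    void       *data;
} msg_t;

typedef struct msg_queue {
    pthread_mutex_t lock;
    msg_t          *head, *tail;
} msg_queue_t;

/* Per-communicator channel pair: point-to-point and collective messages
   never share a queue, so a collective bypasses the point-to-point path. */
typedef struct channel_pair {
    msg_queue_t pt2pt;        /* MPI_Send / MPI_Recv traffic          */
    msg_queue_t collective;   /* MPI_Bcast / MPI_Reduce / ... traffic */
} channel_pair_t;
```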
Hierarchy-Aware Collective Communication
• Observation: Two-level communication hierarchy.
  • Inside an SMP node: shared memory (10⁻⁸ sec)
  • Between SMP nodes: network (10⁻⁶ sec)
• Idea: Build the communication spanning tree in two steps (a sketch follows this list).
  • First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
  • Second, all other MPI nodes connect to the local root node.
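The two-step idea can also be expressed with standard MPI calls; the sketch below is not TMPI's implementation (it assumes an MPI-3 library for MPI_Comm_split_type and that the broadcast data originates at global rank 0). It splits the world into an intra-machine communicator and a communicator of per-machine roots, broadcasts between machines first, then inside each machine.

```c
#include <mpi.h>

/* Two-step broadcast sketch (illustrative, not TMPI internals). */
void hierarchical_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, root_comm;
    int node_rank;

    /* Group the ranks that share a machine (shared-memory domain). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* The per-machine roots (node_rank == 0) form the inter-machine tree. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &root_comm);

    if (root_comm != MPI_COMM_NULL) {            /* step 1: between machines */
        MPI_Bcast(buf, count, type, 0, root_comm);
        MPI_Comm_free(&root_comm);
    }
    MPI_Bcast(buf, count, type, 0, node_comm);   /* step 2: inside each machine */
    MPI_Comm_free(&node_comm);
}
```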
Adaptive Buffer Management
• Question: How do we manage temporary buffering of message data when the remote receiver is not ready to accept it?
• Choices:
  • Send the data with the request – eager push.
  • Send the request only and send the data when the receiver is ready – three-phase protocol.
• TMPI – adapts between the two methods (a send-side sketch follows this list).
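A hedged send-side sketch of choosing between the two protocols. The threshold value and the helper functions (receiver_posted_recv, eager_push, three_phase_send) are hypothetical; TMPI's actual adaptation policy is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical helpers, declared only to keep the sketch self-contained. */
int  receiver_posted_recv(int dest);
void eager_push(int dest, const void *buf, size_t len);
void three_phase_send(int dest, const void *buf, size_t len);

enum { EAGER_LIMIT = 16 * 1024 };   /* assumed threshold in bytes */

void adaptive_send(int dest, const void *buf, size_t len)
{
    if (len <= EAGER_LIMIT || receiver_posted_recv(dest)) {
        /* Small message, or the receiver is already ready: ship the data
           together with the request (eager push). */
        eager_push(dest, buf, len);
    } else {
        /* Large message and receiver not ready: send the request now and
           transfer the data once the receiver signals readiness
           (three-phase protocol). */
        three_phase_send(dest, buf, len);
    }
}
```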
Experimental Study
• Goal: Illustrate the advantage of threaded MPI execution on SMP clusters.
• Hardware Setting
  • A cluster of 6 Quad-Xeon 500 MHz SMPs, each with 1 GB main memory and 2 Fast Ethernet cards.
• Software Setting
  • OS: Red Hat Linux 6.0, kernel version 2.2.15 with channel bonding enabled.
  • Process-based MPI system: MPICH 1.2
  • Thread-based MPI system: TMPI (45 functions in the MPI 1.1 standard)
Inter-Cluster-Node Point-to-Point
• Ping-pong, TMPI vs. MPICH w/ shared memory
(figures: (a) Ping-Pong Short Message – round trip time vs. message size in bytes; (b) Ping-Pong Long Message – transfer rate (MB) vs. message size in KB)
Intra-Cluster-Node Point-to-Point
• Ping-pong, TMPI vs. MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory)
(figures: (a) Ping-Pong Short Message – round trip time vs. message size in bytes; (b) Ping-Pong Long Message – transfer rate (MB) vs. message size in KB)
Collective Communication
• Reduce, Bcast, Allreduce.
• TMPI / MPICH_SHM / MPICH_NOSHM
• Three node distributions, three root node settings.
• Findings:
  1) MPICH w/o shared memory performs the worst.
  2) TMPI is 70+ times faster than MPICH w/ shared memory for MPI_Bcast and MPI_Reduce.
  3) For TMPI, the performance of the 4x4 cases is roughly the sum of that of the 4x1 cases and that of the 1x4 cases.
Macro-Benchmark Performance
Conclusions
• Great Advantage of Threaded MPI Execution on SMP Clusters
  • Micro-benchmarks: 70+ times faster than MPICH.
  • Macro-benchmarks: 100% faster than MPICH.
• Optimization Techniques
  • Separated Collective and Point-to-Point Communication Channels
  • Adaptive Buffer Management
  • Hierarchy-Aware Communications
http://www.cs.ucsb.edu/projects/tmpi/
Background: Safe Execution of MPI Programs Using Threads
• Program Transformation: Eliminate global and static variables (called permanent variables).
• Thread-Specific Data (TSD): Each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copies of the data variable.
• TSD-Based Transformation: Each permanent variable declaration is replaced with a key declaration. Each node associates its private copy of the permanent variable with the corresponding key. Where permanent variables are referenced, the global keys are used to retrieve the per-thread copies of the variables.
Program Transformation – An Example
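The slide's original code figure is not reproduced here; the following is a hedged reconstruction of the transformation described on the previous slide, using POSIX thread-specific data. The permanent variable counter and the helper names are illustrative, not taken from TMPI's transformer output.

```c
#include <pthread.h>
#include <stdlib.h>

/* Before the transformation (one copy shared by every thread):
 *     static int counter = 0;
 *     void bump(void) { counter++; }
 *
 * After the TSD-based transformation, the permanent variable becomes a key,
 * and each MPI node (thread) reaches its private copy through that key.
 */
static pthread_key_t  counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_key_create(void) {
    pthread_key_create(&counter_key, free);   /* free each thread's copy on exit */
}

static int *counter_ptr(void) {
    pthread_once(&counter_once, counter_key_create);
    int *p = pthread_getspecific(counter_key);
    if (p == NULL) {                          /* first reference in this thread */
        p = calloc(1, sizeof *p);
        pthread_setspecific(counter_key, p);
    }
    return p;
}

void bump(void) {
    (*counter_ptr())++;                       /* each MPI node bumps its own copy */
}
```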