
Optimizing Threaded MPI Execution on SMP Clusters


Presentation Transcript


  1. Optimizing Threaded MPI Execution on SMP Clusters. Hong Tang and Tao Yang, Department of Computer Science, University of California, Santa Barbara.

  2. Parallel Computation on SMP Clusters
     • Massively Parallel Machines → SMP Clusters
     • Commodity components: off-the-shelf processors + fast network (Myrinet, Fast/Gigabit Ethernet)
     • Parallel programming models for SMP clusters
        • MPI: portability, performance, legacy programs
        • MPI + variations: MPI + multithreading, MPI + OpenMP

  3. Threaded MPI Execution
     • MPI Paradigm: separate address spaces for different MPI nodes.
     • Natural Solution: MPI Nodes → Processes
     • What if we map MPI nodes to threads? (A sketch of the idea follows below.)
        • Faster synchronization among MPI nodes running on the same machine.
        • Demonstrated in previous work [PPoPP ’99] for a single shared-memory machine (techniques were developed to safely execute MPI programs using threads).
     • Threaded MPI Execution on SMP Clusters
        • Intra-machine communication through shared memory.
        • Inter-machine communication through the network.
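
To make the idea concrete, here is a minimal sketch, assuming a pthreads runtime, of launching the MPI nodes of one machine as threads within a single process rather than as separate processes. The names tmpi_node_main and NUM_NODES are illustrative assumptions, not TMPI's actual interface.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_NODES 4              /* MPI "nodes" per machine (illustrative) */

    /* Per-thread argument: the rank this thread plays on this machine. */
    struct node_arg { int rank; };

    /* Stand-in for the user's MPI main(), renamed so that each thread can
       run it as one logical MPI node. */
    static void *tmpi_node_main(void *p) {
        int rank = ((struct node_arg *)p)->rank;
        printf("MPI node %d running as a thread\n", rank);
        /* ... user computation and message passing would go here ... */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NUM_NODES];
        struct node_arg args[NUM_NODES];
        for (int i = 0; i < NUM_NODES; i++) {
            args[i].rank = i;
            pthread_create(&tid[i], NULL, tmpi_node_main, &args[i]);
        }
        for (int i = 0; i < NUM_NODES; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Because all MPI nodes on one machine now share one address space, intra-machine messages can be passed by pointer instead of crossing process boundaries.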

  4. Threaded MPI Execution Benefits Inter-Machine Communication
     • Common Intuition: inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
     • Our Findings: using threads can significantly reduce the buffering and orchestration overhead of inter-machine communication.

  5. Related Work
     • MPI on network clusters
        • MPICH – a portable MPI implementation.
        • LAM/MPI – communication through a standalone RPI server.
     • Collective communication optimization
        • SUN-MPI and MPI-StarT – modify the MPICH ADI layer; target SMP clusters.
        • MagPIe – targets SMP clusters connected through a WAN.
     • Lower communication layer optimization
        • MPI-FM and MPI-AM.
     • Threaded execution of message-passing programs
        • MPI-Lite, LPVM, TPVM.

  6. Background: MPICH Design

  7. MPICH Communication Structure
     [Figure: two diagrams, MPICH without shared memory and MPICH with shared memory.]

  8. TMPI Communication Structure
     [Figure: TMPI communication structure diagram.]

  9. Comparison of TMPI and MPICH
     • Drawbacks of MPICH w/ shared memory
        • Intra-node communication limited by shared memory size.
        • Busy polling to check messages from either the daemon or a local peer.
        • Cannot do automatic resource clean-up.
     • Drawbacks of MPICH w/o shared memory
        • Big overhead for intra-node communication.
        • Too many daemon processes and open connections.
     • Drawbacks of both MPICH systems
        • Extra data copying for inter-machine communication.

  10. TMPI Communication Design

  11. Separation of Point-to-Point and Collective Communication Channels
     • Observation: MPI point-to-point and collective communication have different semantics.
     • Separate channels for point-to-point and collective communication (a sketch follows below).
        • Eliminates daemon intervention for collective communication.
        • Less effective for MPICH – no sharing of ports among processes.
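
The transcript does not show TMPI's internal data structures, so the following is a purely illustrative sketch of the idea under my own assumptions: each MPI node owns two independent, lock-protected mailboxes, so collective-tree traffic never queues behind point-to-point traffic and needs no daemon mediation within a machine.

    #include <pthread.h>

    struct msg { struct msg *next; /* payload omitted */ };

    struct channel {                 /* one independent mailbox */
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
        struct msg     *head, *tail;
    };

    struct mpi_node {                /* per-thread (per-MPI-node) state */
        struct channel pt2pt;        /* point-to-point sends land here   */
        struct channel coll;         /* collective traffic lands here    */
    };

    static void channel_init(struct channel *c) {
        pthread_mutex_init(&c->lock, NULL);
        pthread_cond_init(&c->nonempty, NULL);
        c->head = c->tail = NULL;
    }

    static void channel_push(struct channel *c, struct msg *m) {
        pthread_mutex_lock(&c->lock);
        m->next = NULL;
        if (c->tail) c->tail->next = m; else c->head = m;
        c->tail = m;
        pthread_cond_signal(&c->nonempty);
        pthread_mutex_unlock(&c->lock);
    }

    static struct msg *channel_pop(struct channel *c) {
        pthread_mutex_lock(&c->lock);
        while (c->head == NULL)
            pthread_cond_wait(&c->nonempty, &c->lock);
        struct msg *m = c->head;
        c->head = m->next;
        if (c->head == NULL) c->tail = NULL;
        pthread_mutex_unlock(&c->lock);
        return m;
    }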

  12. Hierarchy-Aware Collective Communication
     • Observation: two-level communication hierarchy.
        • Inside an SMP node: shared memory (~10^-8 sec)
        • Between SMP nodes: network (~10^-6 sec)
     • Idea: build the communication spanning tree in two steps (see the sketch below).
        • First, choose a root MPI node on each cluster node and build a spanning tree among the cluster nodes.
        • Second, all other MPI nodes connect to their local root node.
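
As an illustration only (the slides do not show TMPI's tree-building code), the sketch below computes a two-level broadcast tree: the per-machine root nodes form a binary tree over the network, and every other MPI node attaches to its local root over shared memory. The machine_of() mapping and the binary-tree shape are assumptions.

    #include <stdio.h>

    #define NODES_PER_MACHINE 4   /* assumed: 4 MPI nodes per SMP box */

    static int machine_of(int rank)  { return rank / NODES_PER_MACHINE; }
    static int local_root(int rank)  { return machine_of(rank) * NODES_PER_MACHINE; }

    /* Parent of `rank` in a two-level broadcast tree rooted at MPI node 0:
       - non-root nodes attach to the root node of their own machine;
       - machine roots form a binary tree over the network. */
    static int tree_parent(int rank) {
        if (rank != local_root(rank))
            return local_root(rank);                 /* intra-machine: shared memory */
        int m = machine_of(rank);
        if (m == 0)
            return -1;                               /* global root */
        return ((m - 1) / 2) * NODES_PER_MACHINE;    /* inter-machine: network */
    }

    int main(void) {
        for (int r = 0; r < 12; r++)                 /* 3 machines x 4 nodes */
            printf("node %2d -> parent %2d\n", r, tree_parent(r));
        return 0;
    }

The point of the two-step construction is that only one message per machine crosses the slow network; the rest of the fan-out happens over shared memory.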

  13. Adaptive Buffer Management
     • Question: how do we manage temporary buffering of message data when the remote receiver is not ready to accept it?
     • Choices:
        • Send the data with the request – eager push.
        • Send the request only and send the data when the receiver is ready – three-phase protocol.
     • TMPI – adapts between the two methods (a sketch of such a policy follows below).
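
The slides do not give TMPI's exact adaptation rule, so the following is only a hypothetical sender-side policy that illustrates the trade-off: push small messages eagerly, fall back to the three-phase (request/ready/data) protocol for large ones, and shrink the eager threshold when recent eager sends had to be buffered because the receiver was not ready. All thresholds are made-up values.

    #include <stddef.h>

    #define EAGER_MAX (64 * 1024)   /* assumed upper bound on eager pushes */
    #define EAGER_MIN (1 * 1024)

    /* Per-sender state; a real runtime would keep this per thread. */
    static size_t eager_threshold = EAGER_MAX;

    /* Feedback from the receiver: did the last eager message find a posted
       receive (ready), or did it have to be buffered temporarily? */
    void note_eager_outcome(int receiver_was_ready) {
        if (receiver_was_ready) {
            if (eager_threshold < EAGER_MAX) eager_threshold *= 2;  /* grow back */
        } else {
            if (eager_threshold > EAGER_MIN) eager_threshold /= 2;  /* back off  */
        }
    }

    /* Decide the protocol for one outgoing message. */
    int use_eager_push(size_t msg_bytes) {
        return msg_bytes <= eager_threshold;   /* otherwise: three-phase protocol */
    }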

  14. Experimental Study
     • Goal: illustrate the advantage of threaded MPI execution on SMP clusters.
     • Hardware setting
        • A cluster of 6 quad-Xeon 500 MHz SMPs, each with 1 GB main memory and 2 Fast Ethernet cards.
     • Software setting
        • OS: RedHat Linux 6.0, kernel 2.2.15 with channel bonding enabled.
        • Process-based MPI system: MPICH 1.2.
        • Thread-based MPI system: TMPI (45 functions of the MPI 1.1 standard).

  15. Inter-Cluster-Node Point-to-Point
     • Ping-pong, TMPI vs. MPICH w/ shared memory.
     [Figure: (a) ping-pong short messages – round-trip time vs. message size (bytes); (b) ping-pong long messages – transfer rate (MB/s) vs. message size (KB); curves for TMPI and MPICH.]

  16. Intra-Cluster-Node Point-to-Point
     • Ping-pong, TMPI vs. MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory).
     [Figure: (a) ping-pong short messages – round-trip time vs. message size (bytes); (b) ping-pong long messages – transfer rate (MB/s) vs. message size (KB); curves for TMPI, MPICH1 and MPICH2.]

  17. Collective Communication
     • Reduce, Bcast, Allreduce.
     • TMPI / MPICH_SHM / MPICH_NOSHM.
     • Three node distributions, three root node settings.
     • Findings:
        1) MPICH w/o shared memory performs the worst.
        2) TMPI is 70+ times faster than MPICH w/ shared memory for MPI_Bcast and MPI_Reduce.
        3) For TMPI, the performance of the 4x4 cases is roughly the sum of that of the 4x1 cases and that of the 1x4 cases.

  18. Macro-Benchmark Performance

  19. Conclusions
     • Great advantage of threaded MPI execution on SMP clusters
        • Micro-benchmarks: 70+ times faster than MPICH.
        • Macro-benchmarks: 100% faster than MPICH.
     • Optimization techniques
        • Separated collective and point-to-point communication channels.
        • Adaptive buffer management.
        • Hierarchy-aware communications.
     http://www.cs.ucsb.edu/projects/tmpi/

  20. Background: Safe Execution of MPI Programs Using Threads
     • Program transformation: eliminate global and static variables (called permanent variables).
     • Thread-Specific Data (TSD): each thread can associate a pointer-sized data variable with a commonly defined key value (an integer). With the same key, different threads can set/get the values of their own copies of the data variable.
     • TSD-based transformation: each permanent variable declaration is replaced with a KEY declaration. Each node associates its private copy of the permanent variable with the corresponding key. Where global variables are referenced, the global keys are used to retrieve the per-thread copies of the variables. (A sketch of such a transformation follows below.)
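
Slide 21's example figure is not reproduced in the transcript; as a stand-in, here is a minimal sketch (my own illustration, not necessarily the authors' exact transformation output) of turning one global variable into thread-specific data with POSIX TSD calls.

    #include <pthread.h>
    #include <stdlib.h>

    /* Original program:    int counter = 0;          (a permanent variable)
       Transformed program: the declaration becomes a key, and every access
       goes through the per-thread copy bound to that key. */

    static pthread_key_t  counter_key;                /* replaces `int counter` */
    static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

    static void counter_key_create(void) {
        pthread_key_create(&counter_key, free);       /* free the copy at thread exit */
    }

    /* Retrieve (and lazily create) this MPI node's private copy of `counter`. */
    static int *counter_ref(void) {
        pthread_once(&counter_once, counter_key_create);
        int *p = pthread_getspecific(counter_key);
        if (p == NULL) {
            p = calloc(1, sizeof *p);                 /* per-thread initial value 0 */
            pthread_setspecific(counter_key, p);
        }
        return p;
    }

    /* A reference such as `counter++` in the original program becomes: */
    void bump(void) { (*counter_ref())++; }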

  21. Program Transformation – An Example
     [Figure: example program before and after the transformation.]
