Locality-aware Connection Management and Rank Assignment for Wide-area MPI
Hideo Saito, Kenjiro Taura (University of Tokyo)
{h_saito, tau}@logos.ic.i.u-tokyo.ac.jp

Overview
• Profiling-based optimizations for wide-area message passing systems
  • Locality-aware connection management
  • Locality-aware rank assignment
• Multi-Cluster MPI (MC-MPI)
  • An adaptive wide-area message passing system that uses the proposed optimizations
  • C and Fortran bindings for most of MPI 1.1
• Performance evaluation using the NAS Parallel Benchmarks
  • 256 real nodes distributed across 4 clusters

Related work
• Wide-area message passing systems
  • MPICH-G2 [Karonis et al., '03], Grid MPI [Matsuda et al., '05], MPICH/MADIII [Aumage et al., '03]
• P2P overlay networks
  • Bamboo [Rhea et al., '01]

Profiling run
• Obtain a traffic matrix T and a latency matrix L from a profiling run
• Traffic matrix (T = {t_ij})
  • t_ij: traffic (number of messages) between ranks i and j
  • Execute the application for a short amount of time and make t_ij the number of messages transmitted during that time
• Latency matrix (L = {l_ij})
  • l_ij: latency (measured or estimated RTT) between processes i and j
  • Use the triangle inequality to estimate RTTs between faraway processes

Locality-aware connection establishment
• Establish connections between just a subset of all process pairs (n: number of processes, : parameter that controls connection density)
  • Select all of the processes with the shortest l_ij
  • Select of the (2^(k-1) + 1)-st to the (2^k)-th shortest l_ij, where the probability that process j will be selected is proportional to t_ij (k = 1, 2, ..., log2 n/)
• Satisfied properties (assume, for simplicity, that the n processes are distributed equally among c clusters)
  • Connections established by each process: O(log n)
  • Inter-cluster connections established: O(n log c)
• Build a routing layer using the selected connections

Lazy connection establishment
• Establish selected connections on demand
• Further reduces the number of connections that are established for applications in which each process only communicates with a few other processes (e.g., SOR)
[Figure legend: unestablished candidate connection; established connection]

Locality-aware rank assignment
• Find a rank-process mapping with low communication overhead
• Map the rank assignment problem to the Quadratic Assignment Problem
• Quadratic Assignment Problem (QAP)
  • Given two n × n cost matrices, T and L, find a permutation p of {0, 1, ..., n-1} that minimizes the sum over all i, j of t_ij · l_p(i)p(j)
• Solving QAPs
  • The QAP is NP-hard, but there are heuristics that find good suboptimal solutions
  • Library based on GRASP (Greedy Randomized Adaptive Search Procedure) [Resende et al., '96]
  • Tested against QAPLIB [Burkard et al., '97], a publicly available collection of QAPs
    • Instances of up to n = 256
    • n processors for problem size n
    • Found approximate solutions within one to two percent of the best known solution in under one second

Experimental environment
[Figure: map of Cluster A, Cluster B, Cluster C and Cluster D (64 nodes each). RTT labels on the arrows: 0.3 ms, 4.3 ms, 4.4 ms, 6.8 ms, 6.9 ms, 10.8 ms. One label reads "FW".]
• The arrows indicate the directions in which connections could be established (Cluster B had a firewall that allowed outgoing connections but prevented incoming connections)
• The times above the arrows indicate the inter-cluster RTT (the intra-cluster RTT was between 60 and 120 microseconds)

Experimental results
Performance of the NPB with varying numbers of connections
[Figure panels: (a) BT, (b) EP, (c) IS, (d) LU, (e) MG, (f) SP]
• Comparison of lazy connection establishment methods
  • MC-MPI: the connection-density parameter was selected so that the maximum percentage of connections allowed by each process was 30%
  • MPICH-like: established connections on demand without preselecting candidate connections (another way to think of this is that it preselected all connections)

Performance of the NPB with different rank assignments
• Rank assignments compared
  • QAP (MC-MPI): assigned ranks based on our locality-aware rank assignment scheme
  • Hostname: sorted the processes by host name and assigned ranks in that order
  • Random: assigned ranks randomly
• BT, LU and SP
  • MC-MPI performed just as well as Hostname and much better than Random
• MG
  • MC-MPI outperformed not only Random but also Hostname
  • Rank 0 communicated mostly with ranks 1, 3, 4, 28, 32 and 224
• EP and IS
  • All three rank assignments performed the same
  • EP involved little communication
  • IS had a uniform communication pattern

Future work
• An API to allow profiling to be performed within a single run

Full paper to appear in CCGRID'07 (Rio de Janeiro, May 14-17, 2007)
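
Illustrative sketch: estimating RTTs with the triangle inequality
The profiling step uses the triangle inequality to estimate RTTs between faraway processes, but the poster does not say how. The Python sketch below shows one possible reading: an unmeasured l_ij is bounded above by the smallest l_ik + l_kj over relay processes k whose RTTs to both i and j were measured. The function name `estimate_rtts` and the input layout are assumptions made for illustration; they are not MC-MPI's interface.

```python
import math

def estimate_rtts(measured, n):
    """Build an n x n RTT matrix from a partial set of measurements.

    measured: dict mapping (i, j) to a measured RTT (hypothetical input layout).
    Unmeasured pairs get the smallest one-relay upper bound
    l_ik + l_kj, or math.inf when no relay k has both legs measured.
    """
    known = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), rtt in measured.items():
        known[i][j] = known[j][i] = rtt  # RTT is treated as symmetric

    est = [row[:] for row in known]
    for i in range(n):
        for j in range(n):
            if i != j and math.isinf(known[i][j]):
                est[i][j] = min(
                    (known[i][k] + known[k][j]
                     for k in range(n) if k not in (i, j)),
                    default=math.inf,
                )
    return est

# Process 1 measured both neighbours; the 0-2 RTT is estimated via 1.
L = estimate_rtts({(0, 1): 0.5, (1, 2): 4.0}, 3)
```

The result is only an upper bound, so it can overestimate the RTT of pairs that have a shorter direct path than any measured relay.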
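
Illustrative sketch: choosing candidate connections
The selection rule under "Locality-aware connection establishment" can be sketched roughly as follows. The symbol of the density parameter does not survive in the text above, so the sketch calls it `alpha` and assumes that each process keeps its `alpha` nearest peers and then draws `alpha` peers from each successive band of peers, with every band twice as large as the one before and the draw weighted by traffic t_ij. Both the name and this band layout are assumptions, and `select_candidates` is a hypothetical helper, not part of MC-MPI.

```python
import random

def weighted_sample(pool, weights, k, rng):
    """Draw up to k distinct items, each draw proportional to its weight."""
    pool, weights = list(pool), list(weights)
    picked = []
    while pool and len(picked) < k:
        if sum(weights) > 0:
            idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        else:
            idx = rng.randrange(len(pool))  # no traffic left: fall back to uniform
        picked.append(pool.pop(idx))
        weights.pop(idx)
    return picked

def select_candidates(i, L, T, alpha, rng):
    """Return the set of peers that process i may connect to.

    L[i][j] is the RTT and T[i][j] the message count between i and j.
    alpha (assumed name) controls connection density and must be >= 1.
    """
    if alpha < 1:
        raise ValueError("alpha must be at least 1")
    peers = sorted((j for j in range(len(L)) if j != i), key=lambda j: L[i][j])
    chosen = set(peers[:alpha])           # all of the nearest peers
    lo = alpha
    while lo < len(peers):
        band = peers[lo:2 * lo]           # next band, twice as large
        chosen.update(
            weighted_sample(band, [T[i][j] for j in band], alpha, rng))
        lo *= 2
    return chosen

n = 16
L = [[abs(a - b) for b in range(n)] for a in range(n)]
T = [[1 if abs(a - b) == 8 else 0 for b in range(n)] for a in range(n)]
cands = select_candidates(0, L, T, alpha=2, rng=random.Random(0))
```

Because the number of bands grows with log n while each band contributes a fixed number of peers, the candidate set per process grows logarithmically, which matches the O(log n) property listed above.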
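
Illustrative sketch: rank assignment as a QAP
The rank assignment step searches for a permutation p with a low QAP cost, the sum over all rank pairs of t_ij · l_p(i)p(j). The Python sketch below evaluates that cost and improves a mapping with a plain pairwise-swap local search. This is a deliberately simple stand-in and not the GRASP-based library the poster refers to; `qap_cost` and `swap_local_search` are hypothetical names, and p[i] is taken to be the process that receives rank i.

```python
def qap_cost(T, L, p):
    """Sum of traffic between ranks times latency between their processes."""
    n = len(p)
    return sum(T[i][j] * L[p[i]][p[j]] for i in range(n) for j in range(n))

def swap_local_search(T, L, p=None):
    """Swap pairs of ranks while a swap strictly lowers the cost.

    Returns a locally optimal mapping and its cost; unlike GRASP, there is
    no randomized construction or restart, so the result may be far from
    the global optimum on hard instances.
    """
    n = len(T)
    p = list(range(n)) if p is None else list(p)
    best = qap_cost(T, L, p)
    improved = True
    while improved:
        improved = False
        for a in range(n):
            for b in range(a + 1, n):
                p[a], p[b] = p[b], p[a]
                cost = qap_cost(T, L, p)
                if cost < best:
                    best, improved = cost, True
                else:
                    p[a], p[b] = p[b], p[a]  # undo a swap that did not help
    return p, best

# Processes 0-1 and 2-3 share a cluster; ranks 0-2 and 1-3 exchange messages.
L = [[0, 1, 10, 10], [1, 0, 10, 10], [10, 10, 0, 1], [10, 10, 1, 0]]
T = [[0, 0, 100, 0], [0, 0, 0, 100], [100, 0, 0, 0], [0, 100, 0, 0]]
start_cost = qap_cost(T, L, [0, 1, 2, 3])
mapping, final_cost = swap_local_search(T, L)
```

In this toy instance the identity mapping places each communicating pair in different clusters; the search ends with both pairs co-located. Recomputing the full cost after every swap is O(n^2) per move, which is acceptable for a sketch but would be replaced by an incremental update at n = 256.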