
Locality-aware connection establishment

Locality-aware Connection Management and Rank Assignment for Wide-area MPI. Hideo Saito, Kenjiro Taura (University of Tokyo), {h_saito, tau}@logos.ic.i.u-tokyo.ac.jp.


Presentation Transcript


Locality-aware Connection Management and Rank Assignment for Wide-area MPI
Hideo Saito, Kenjiro Taura (University of Tokyo)
{h_saito, tau}@logos.ic.i.u-tokyo.ac.jp

Overview
• Profiling-based optimizations for wide-area message passing systems
  • Locality-aware connection management
  • Locality-aware rank assignment
• Multi-Cluster MPI (MC-MPI)
  • An adaptive wide-area message passing system that uses the proposed optimizations
  • C and Fortran bindings for most of MPI 1.1
• Performance evaluation using the NAS Parallel Benchmarks (NPB)
  • 256 real nodes distributed across 4 clusters

Experimental environment
• The arrows indicate the directions in which connections could be established (Cluster B had a firewall that allowed outgoing connections but prevented incoming connections)
• The times above the arrows indicate the inter-cluster RTT (the intra-cluster RTT was between 60 and 120 microseconds)
[Figure: four clusters, A, B, C and D, of 64 nodes each; Cluster B sits behind a firewall (FW). Inter-cluster RTT labels: 10.8 ms, 6.8 ms, 6.9 ms, 4.4 ms, 4.3 ms, 0.3 ms]

Related work
• Wide-area message passing systems
  • MPICH-G2 [Karonis et al., '03], Grid MPI [Matsuda et al., '05], MPICH/MADIII [Aumage et al., '03]
• P2P overlay networks
  • Bamboo [Rhea et al., '01]

Profiling run
• Obtain a traffic matrix T and a latency matrix L from a profiling run
• Traffic matrix (T = {t_ij})
  • t_ij: traffic (number of messages) between ranks i and j
  • Execute the application for a short amount of time and make t_ij the number of messages transmitted during that time
• Latency matrix (L = {l_ij})
  • l_ij: latency (measured or estimated RTT) between processes i and j
  • Use the triangle inequality to estimate RTTs between faraway processes

Locality-aware connection establishment
• Establish connections between just a subset of all process pairs (n: number of processes; [?]: parameter that controls connection density, symbol not preserved)
  • Select all [?] of the [?] processes with the shortest l_ij
  • Select [?] of the (2^(k-1) + 1)-st to the (2^k)-th shortest l_ij, where the probability that process j will be selected is proportional to t_ij (k = 1, 2, ..., log2 n/[?])
• Satisfied properties (assume, for simplicity, that the n processes are distributed equally among c clusters)
  • Connections established by each process: O(log n)
  • Inter-cluster connections established: O(n log c)
• Build a routing layer using the selected connections

Lazy connection establishment
• Establish selected connections on demand
• Further reduces the number of connections that are established for applications in which each process communicates with only a few other processes (e.g., SOR)
[Figure legend: unestablished candidate connection; established connection]

Locality-aware rank assignment
• Find a rank-process mapping with low communication overhead
• Map the rank assignment problem to the Quadratic Assignment Problem (QAP)
• Quadratic Assignment Problem
  • Given two n x n cost matrices, T and L, find a permutation p of {0, 1, ..., n-1} that minimizes the assignment cost [formula not preserved]
• Solving QAPs
  • The QAP is NP-hard, but there are heuristics to find good suboptimal solutions
  • Library based on GRASP (Greedy, Randomized, Adaptive Search Procedure) [Resende et al., '96]
  • Tested against QAPLIB [Burkard et al., '97], a publicly available collection of QAPs
    • Instances of up to n = 256
    • n processors for problem size n
    • Approximate solutions that were within one to two percent of the best known solution in under one second

Experimental results
• Performance of the NPB with varying numbers of connections
[Figure panels: (a) BT, (b) EP, (c) IS, (d) LU, (e) MG, (f) SP]
• Comparison of lazy connection establishment methods
  • MC-MPI: the connection-density parameter was selected so that the maximum percentage of connections allowed by each process was 30%
  • MPICH-like: established connections on demand without preselecting candidate connections (another way to think of this is that it preselected all connections)
• Performance of the NPB with different rank assignments
  • QAP (MC-MPI): assigned ranks based on our locality-aware rank assignment scheme
  • Hostname: sorted the processes by host name and assigned ranks in that order
  • Random: assigned ranks randomly
• BT, LU and SP
  • MC-MPI performed just as well as Hostname and much better than Random
• MG
  • MC-MPI outperformed not only Random but also Hostname
  • Rank 0 communicated mostly with ranks 1, 3, 4, 28, 32 and 224
• EP and IS
  • All three rank assignments performed the same
  • EP involved little communication
  • IS had a uniform communication pattern

Future work
• An API to allow profiling to be performed within a single run
• Full paper to appear in CCGRID'07 (Rio de Janeiro, May 14-17, 2007)
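
The profiling step estimates RTTs between faraway processes with the triangle inequality, but the poster does not say how. The following is a minimal sketch under one assumption: an unmeasured l_ij is bounded from above by the cheapest l_ik + l_kj over intermediate processes k. The function name `estimate_rtts` and the use of `None` for unmeasured pairs are illustrative choices, not part of MC-MPI.

```python
import math

def estimate_rtts(measured):
    """Fill unmeasured RTTs with a triangle-inequality upper bound.

    measured[i][j] is an RTT, or None if the pair was never probed
    (hypothetical input format). Measured entries are left unchanged.
    """
    n = len(measured)
    est = [[math.inf if measured[i][j] is None else measured[i][j]
            for j in range(n)] for i in range(n)]
    for i in range(n):
        est[i][i] = 0.0
    # Floyd-Warshall style relaxation: l_ij <= l_ik + l_kj.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = est[i][k] + est[k][j]
                if measured[i][j] is None and via < est[i][j]:
                    est[i][j] = via
    return est
```

For example, if processes 0 and 1 share a cluster (0.1 ms) and process 1 has measured 5.0 ms to a remote process 2, the unmeasured pair (0, 2) is estimated at about 5.1 ms.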
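
The candidate-selection rule in "Locality-aware connection establishment" can be illustrated roughly as follows. This sketch rests on an interpretation, not on the poster itself: the density parameter (called `alpha` here, a hypothetical name) is taken to be an integer, the `alpha` nearest processes are always kept, and the remaining processes are split into latency bands that double in size, with up to `alpha` picks per band weighted by traffic. The helper names are also illustrative.

```python
import random

def weighted_sample(items, weights, k, rng):
    """Draw up to k distinct items, each draw proportional to its weight."""
    items, weights = list(items), list(weights)
    picked = []
    while items and len(picked) < k:
        if sum(weights) > 0:
            idx = rng.choices(range(len(items)), weights=weights)[0]
        else:
            # No recorded traffic in this band: fall back to a uniform draw.
            idx = rng.randrange(len(items))
        picked.append(items.pop(idx))
        weights.pop(idx)
    return picked

def select_candidates(i, latency, traffic, alpha, rng=None):
    """Return the set of processes that process i may connect to."""
    if alpha < 1:
        raise ValueError("alpha must be at least 1")
    rng = rng or random.Random(0)
    n = len(latency)
    others = sorted((j for j in range(n) if j != i),
                    key=lambda j: latency[i][j])
    chosen = set(others[:alpha])          # nearest neighbours, always kept
    lo = alpha
    while lo < len(others):
        hi = min(2 * lo, len(others))     # band of the next-closest processes
        band = others[lo:hi]
        chosen.update(weighted_sample(
            band, [traffic[i][j] for j in band], alpha, rng))
        lo = hi
    return chosen
```

Because the bands double in size, a process ends up with roughly alpha * (1 + log2(n / alpha)) candidates, which is consistent with the O(log n) connections per process stated in the poster.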
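
To make the rank assignment problem concrete, the sketch below evaluates the standard QAP cost, the sum over all rank pairs (i, j) of t_ij * l_p(i)p(j), where p maps ranks to processes. The poster's own objective formula is not preserved in this transcript, so this form is an assumption. The search shown is a plain pairwise-swap local search, a much simpler stand-in for the GRASP-based library the authors used; `qap_cost` and `improve_by_swaps` are illustrative names.

```python
def qap_cost(T, L, p):
    """Total cost of mapping rank i to process p[i]."""
    n = len(p)
    return sum(T[i][j] * L[p[i]][p[j]] for i in range(n) for j in range(n))

def improve_by_swaps(T, L, p):
    """Swap pairs of assignments until no single swap lowers the cost."""
    p = list(p)
    n = len(p)
    best = qap_cost(T, L, p)
    improved = True
    while improved:
        improved = False
        for a in range(n):
            for b in range(a + 1, n):
                p[a], p[b] = p[b], p[a]
                cost = qap_cost(T, L, p)
                if cost < best:
                    best, improved = cost, True
                else:
                    p[a], p[b] = p[b], p[a]   # undo a non-improving swap
    return p, best

# Two clusters of two processes each: latency 1 inside a cluster, 50 across.
L = [[0, 1, 50, 50], [1, 0, 50, 50], [50, 50, 0, 1], [50, 50, 1, 0]]
# Ranks 0 and 2 exchange many messages, as do ranks 1 and 3.
T = [[0, 0, 100, 0], [0, 0, 0, 100], [100, 0, 0, 0], [0, 100, 0, 0]]
mapping, cost = improve_by_swaps(T, L, [0, 1, 2, 3])
```

Starting from the identity mapping, every heavy rank pair crosses clusters; the local search ends with each heavy pair placed inside a single cluster. A local search like this can stall in a local minimum on larger instances, which is one reason randomized restarts such as GRASP are used in practice.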
