Locality-aware Connection Management and Rank Assignment for Wide-area MPI Hideo Saito Kenjiro Taura The University of Tokyo May 16, 2007
Background • Increase in the bandwidth of WANs ➭ More opportunities to perform parallel computation using multiple clusters
Requirements for Wide-area MPI • Wide-area connectivity • Firewalls and private addresses • Only some nodes can connect to each other • Perform routing using the connections that happen to be possible
Reqs. for Wide-area MPI (2) • Scalability • The number of conns. must be limited in order to scale to thousands of nodes • Various allocation limits of the system (e.g., memory, file descriptors, router sessions) • Simplistic schemes that may potentially result in O(n²) connections won’t scale • Lazy connect strategies work for many apps, but not for those that involve all-to-all communication
Reqs. for Wide-area MPI (3) • Locality awareness • To achieve high performance with few conns., select conns. in a locality-aware manner • Many connections with nearby nodes, few connections with faraway nodes
Reqs. for Wide-area MPI (4) • Application awareness • Select connections according to the application’s communication pattern • Assign ranks* according to the application’s communication pattern • Adaptivity • Automatically, without tedious manual configuration * rank = process ID in MPI
Contributions of Our Work • Locality-aware connection management • Uses latency and traffic information obtained from a short profiling run • Locality-aware rank assignment • Uses the same info. to discover rank-process mappings with low comm. overhead ➭ Multi-Cluster MPI (MC-MPI) • Wide-area-enabled MPI library
Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion
Grid-enabled MPI Libraries • MPICH-G2 [Karonis et al. ‘03], MagPIe [Kielmann et al. ‘99] • Locality-aware communication optimizations • E.g., wide-area-aware collective operations (broadcast, reduction, ...) • Don’t work in the presence of firewalls
Grid-enabled MPI Libraries (cont’d) • MPICH/MADIII [Aumage et al. ‘03], StaMPI [Imamura et al. ‘00] • Forwarding mechanisms that allow nodes to communicate even in the presence of FWs • Manual configuration • Amount of necessary config. becomes overwhelming as more resources are used Forward Firewall
P2P Overlays • Pastry [Rowstron et al. ’00] • Each node maintains just O(log n) connections • Messages are routed using those connections • Highly scalable, but routing properties are unfavorable for high performance computing • Few connections between nearby nodes • Messages between nearby nodes need to be forwarded, causing large latency penalties
Adaptive MPI • Huang et al. ‘06 • Performs load balancing by migrating virtual processors between physical processors • Balance the exec. times of the physical processors • Minimize inter-processor communication • Adapts to apps. by tracking the amount of communication performed between procs. • Assumes that the communication cost of every processor pair is the same • MC-MPI takes differences in communication costs into account
Lazy Connect Strategies • MPICH [Gropp et al. ‘96], Scalable MPI over Infiniband [Yu et al. ‘06] • Establish connections only on demand • Reduces the number of conns. if each proc. only communicates with a few other procs. • Some apps. generate all-to-all comm. patterns, resulting in many connections • E.g., IS in the NAS Parallel Benchmarks • Doesn’t extend to wide-area environments where some communication may be blocked
Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion
Overview of Our Method • Short profiling run ➭ latency matrix (L) and traffic matrix (T) • Optimized real run • Locality-aware connection management • Locality-aware rank assignment
Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion
Latency Matrix • Latency matrix L = {lij} • lij: latency between processes i and j in the target environment • Each process autonomously measures the RTT between itself and other processes • Reduce the num. of measurements by using the triangle inequality to estimate RTTs • If rtt(p,r) > α·rtt(r,q), estimate rtt(p,q) = rtt(p,r) (α: constant)
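A minimal sketch of the estimation rule above, assuming RTTs are kept in a per-process dict of dicts; ALPHA, estimate_rtt, and relays are hypothetical names, not MC-MPI's actual API:

    ALPHA = 4.0  # hypothetical value for the constant alpha on this slide

    def estimate_rtt(p, q, rtt, relays):
        # Return the measured RTT if we already have it; otherwise reuse a
        # measurement through some relay r when the long leg dominates
        # (if rtt(p,r) > ALPHA * rtt(r,q), take rtt(p,q) = rtt(p,r)).
        if q in rtt[p]:
            return rtt[p][q]
        for r in relays:
            if r in rtt[p] and q in rtt[r] and rtt[p][r] > ALPHA * rtt[r][q]:
                return rtt[p][r]
        return None  # no estimate possible; fall back to a real measurement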
Traffic Matrix • Traffic matrix T = {tij} • tij: traffic between ranks i and j in the target application • Many applications repeat similar communication patterns ➭ Execute the application for a short amount of time and make tij the number of transmitted messages (e.g., one iteration of an iterative app.)
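One way to picture the profiling step is a hypothetical counter hooked into the send path (a sketch only; MC-MPI's internal instrumentation is not shown on this slide):

    class TrafficProfiler:
        # Builds T = {tij} by counting messages during a short profiling
        # run (e.g., one iteration of an iterative app).
        def __init__(self, n):
            self.t = [[0] * n for _ in range(n)]

        def on_send(self, src_rank, dst_rank):
            self.t[src_rank][dst_rank] += 1  # one more transmitted message

        def matrix(self):
            return self.t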
Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion
Connection Management • During MPI_Init: select candidate connections, establish the bounding graph, and build a spanning tree • During the application body: establish candidate connections lazily, on demand
Selection of Candidate Connections • Each process selects O(log n) neighbors based on L and T • Many nearby processes, few faraway processes • A parameter controls the connection density • n: number of processes • (One possible selection rule is sketched below)
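A sketch of one plausible selection rule, using only the latency part (the slide also weighs traffic from T); this is an illustration, not the exact MC-MPI algorithm:

    def select_candidates(me, peers, latency, density):
        # Take the 'density' nearest peers, then one peer per doubling of
        # distance rank: many near and few far neighbors, roughly
        # O(density + log n) candidates in total.
        ordered = sorted(peers, key=lambda p: latency[me][p])  # near -> far
        selected = ordered[:density]
        i = max(density, 1)
        while i < len(ordered):
            selected.append(ordered[i])
            i *= 2  # exponentially farther peers, one candidate each
        return selected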
Bounding Graph • Procs. try to establish temporary conns. to their selected neighbors • The collective set of successful connections ➭ Bounding graph • (Some conns. may fail due to FWs)
Routing Table Construction • Construct a routing table using just the bounding graph • Close the temporary connections • Conns. of the bounding graph are reestablished lazily as “real” conns. • Temporary conns. => small bufs. • Real conns. => large bufs.
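In outline, a routing table over the bounding graph could be built as below (a sketch assuming an adjacency-list graph; the slide does not specify MC-MPI's actual routing algorithm):

    from collections import deque

    def build_routing_table(me, bounding_graph):
        # Breadth-first search over the bounding graph from 'me', recording
        # which direct neighbor each destination is reached through.
        # (Hop counts only; weighting edges by latency is a natural refinement.)
        next_hop = {}
        frontier = deque()
        for nbr in bounding_graph[me]:
            next_hop[nbr] = nbr
            frontier.append(nbr)
        while frontier:
            node = frontier.popleft()
            for nbr in bounding_graph[node]:
                if nbr != me and nbr not in next_hop:
                    next_hop[nbr] = next_hop[node]  # inherit the first hop
                    frontier.append(nbr)
        return next_hop  # destination -> neighbor to forward through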
Lazy Connection Establishment • A lazy connect may fail due to a FW • In that case, send a connect request to the destination using the spanning tree • The destination then connects in the reverse direction • (A sketch follows below)
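A minimal sketch of this fallback; try_direct and route_via_tree are hypothetical helpers, not MC-MPI functions:

    def lazy_connect(me, dest, try_direct, route_via_tree):
        # Try the direct (lazy) connect first; if a firewall blocks it,
        # route a connect request to 'dest' over the spanning tree and wait
        # for 'dest' to connect back to us in the reverse direction.
        conn = try_direct(dest)
        if conn is not None:
            return conn
        route_via_tree(dest, ("CONNECT_REQUEST", me))
        return None  # the reverse connection arrives asynchronously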
Outline • Introduction • Related Work • Proposed Method • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion
Commonly-used Method • Sort the processes by host name (or IP address) and assign ranks in that order • Assumptions • Most communication takes place between processes with close ranks • The communication cost between processes with close host names is low • However, • Applications have various comm. patterns • Host names don’t necessarily have a correlation to communication costs
Our Rank Assignment Scheme • Find a rank-process mapping with low communication overhead • Map the rank assignment problem to the Quadratic Assignment Problem • QAP: given two n×n cost matrices, L and T, find a permutation p of {0, 1, ..., n-1} that minimizes Σi Σj tij · lp(i)p(j)
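The objective can be written directly as a few lines (L and T as nested lists; perm maps rank i to process perm[i]):

    def qap_cost(perm, L, T):
        # Objective on this slide: sum over all i, j of t_ij * l_{p(i)p(j)},
        # i.e. traffic between ranks weighted by the latency between the
        # processes those ranks are mapped to.
        n = len(perm)
        return sum(T[i][j] * L[perm[i]][perm[j]]
                   for i in range(n) for j in range(n))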
Solving QAPs • NP-Hard, but there are heuristics for finding good suboptimal solutions • Library based on GRASP [Resende et al. ’96] • Test against QAPLIB [Burkard et al. ’97] • Instances of up to n = 256 • n processors for problem size n • Approximate solutions that were within one to two percent of the best known solution in under one second
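For intuition only, a toy improvement loop over the objective above (reusing qap_cost from the previous sketch); this is a stand-in to show the search space, not the GRASP heuristic cited on this slide:

    import random

    def improve_by_swaps(perm, L, T, iters=1000):
        # Random pairwise swaps; keep a swap only if it lowers the QAP cost.
        best = qap_cost(perm, L, T)
        for _ in range(iters):
            i, j = random.sample(range(len(perm)), 2)
            perm[i], perm[j] = perm[j], perm[i]
            cost = qap_cost(perm, L, T)
            if cost < best:
                best = cost  # keep the improving swap
            else:
                perm[i], perm[j] = perm[j], perm[i]  # undo
        return perm, best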
Outline • Introduction • Related Work • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion
Experimental Environment • Four 64-node clusters: chibaXXX, sheepXX, hongoXXX, and istbsXXX (one cluster behind a FW) • Inter-cluster RTTs: 0.3–10.8 ms • Intra-cluster RTT: 60-120 microsecs • Xeon/Pentium M, Linux • TCP send/recv bufs: 256KB ea.
Experiment 1: Conn. Management • Measure the performance of the NPB with limited numbers of connections • MC-MPI • Limit the number of connections to 10%, 20%, ..., 100% by varying the connection-density parameter • Random • Establish a comparable number of connections randomly
BT, LU, MG and SP • [Charts: SOR (Successive Over-Relaxation), LU (Lower-Upper)]
BT, LU, MG and SP (2) • [Charts: MG (Multi-Grid), BT (Block Tridiagonal)]
BT, LU, MG and SP (3) • % of connections actually established was lower than that shown by the x-axis • B/c of lazy connection establishment • To be discussed in more detail later • [Chart: SP (Scalar Pentadiagonal)]
EP • EP involves very little communication • [Chart: EP (Embarrassingly Parallel)]
IS • Performance decrease due to congestion! • [Chart: IS (Integer Sort)]
Experiment 2: Lazy Conn. Establish. • Compare our lazy conn. establishment method with an MPICH-like method • MC-MPI • Select the connection-density parameter so that the maximum number of allowed connections is 30% • MPICH-like • Establish connections on demand without preselecting candidate connections (equivalently, all connections are preselected as candidates)
Experiment 2: Results • Connections established: comparable number of conns. except for IS • Relative performance: comparable performance except for IS
Experiment 3: Rank Assignment • Compare 3 assignment algorithms • Random • Hostname (24 patterns) • Real host names (1) • What if istbsXXX were named sheepXX, etc. (23) • MC-MPI (QAP)
LU and MG • [Charts: LU, MG; series: Hostname (Best), Hostname (Worst), Hostname, Random, MC-MPI (QAP)]
BT and SP • [Charts: BT, SP; series: Hostname (Best), Hostname (Worst), Hostname, Random, MC-MPI (QAP)]
BT and SP (cont’d) • [Figures: traffic matrix (source rank vs. destination rank) and the rank assignments produced by Hostname and by MC-MPI (QAP) across clusters A–D]
EP and IS • [Charts: EP, IS; series: Hostname (Best), Hostname (Worst), Hostname, Random, MC-MPI (QAP)]
Outline • Introduction • Related Work • Profiling Run • Connection Management • Rank Assignment • Experimental Results • Conclusion
Conclusion • MC-MPI • Connection management • High performance with connections between just 10% of all process pairs • Rank assignment • Up to 300% faster than locality-unaware assignments • Future Work • An API to perform profiling w/in a single run • Integration of adaptive collectives