Collective Operations for Wide-Area Message Passing Systems Using Dynamically Created Spanning Trees April 15, 2005 Chikayama-Taura Laboratory 46411 Hideo Saito
Background • Opportunities to perform message passing in WANs are increasing
Message Passing in WANs • WAN → more resources • However, systems designed for LANs do not perform well in WANs • In particular, collective operations (broadcast, reduction) designed for LANs perform very poorly in WANs
Collective Operations • Operations in which all processors participate (cf. point-to-point send/receive) • Ex. broadcast, reduction [Figure: reduction tree summing one value per processor toward the root]
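To make the reduction pictured above concrete, here is a minimal single-process sketch in C; the tree shape, the values, and the parent-array representation are illustrative assumptions, not taken from the presented system.

```c
#include <stdio.h>

/* Hedged single-process illustration of a tree-based sum reduction:
 * every processor holds one value, partial sums move up the tree, and
 * the root ends up with the global sum. Tree and values are made up. */
#define N 8

int main(void)
{
    /* parent[i] is the tree parent of processor i; processor 0 is the root. */
    int  parent[N] = { -1, 0, 0, 1, 1, 2, 2, 3 };
    long value[N]  = {  3, 1, 4, 1, 5, 9, 2, 6 };
    long partial[N];

    for (int i = 0; i < N; i++)
        partial[i] = value[i];

    /* Children have larger indices than their parents in this numbering,
     * so a reverse sweep forwards each partial sum toward the root. */
    for (int i = N - 1; i > 0; i--)
        partial[parent[i]] += partial[i];

    printf("sum at root = %ld\n", partial[0]);   /* 3+1+4+1+5+9+2+6 = 31 */
    return 0;
}
```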
Collective Operations in WANs • Topology must be considered for high performance • Manual configuration is undesirable • Processors should be able to join/leave
Objective • To design and implement collective operations • w/ high performance in WANs • w/o manual configuration • w/ support for joining/leaving processors
Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work
Collective Operations of MPICH • MPICH (Thakur et al. 2003) • Latency-aware binomial tree for short messages (sketched below) • Bandwidth-aware tree for long messages
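As a reminder of how the short-message binomial tree behaves, the sketch below simulates its communication schedule. It illustrates the standard algorithm rather than MPICH source, and the communicator size of 8 with ranks taken relative to the root is an assumption for the example.

```c
#include <stdio.h>

/* Simulation of a binomial-tree broadcast schedule (illustrative, not
 * MPICH source). In round k, every rank that already holds the data
 * sends it to the rank 2^k positions away, so all `size` ranks are
 * reached in ceil(log2(size)) rounds. Ranks are relative to the root. */
int main(void)
{
    const int size = 8;                      /* assumed communicator size */
    int round = 0;

    for (int mask = 1; mask < size; mask <<= 1, round++)
        for (int rank = 0; rank < size; rank++)
            if (rank < mask && rank + mask < size)   /* rank already has the data */
                printf("round %d: rank %d -> rank %d\n", round, rank, rank + mask);
    return 0;
}
```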
Collective Operations of MPICH [Figure: long-message broadcast performed as a scatter from the root followed by a ring all-gather]
Collective Operations of MPICH • MPICH assumes that latency and bandwidth are uniform • But latency and bandwidth differ by orders of magnitude between local-area and wide-area links • Collective operations designed for LANs do not perform well in WANs
High-Performance Collective Operations for WANs • MagPIe (Kielmann et al. 1999) • Bandwidth-Efficient Collective Operations (Kielmann et al. 2000) • MPICH-G2 (Karonis et al. 2003) • Manual configuration necessary • Processors cannot join/leave
Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work
Overview of Our Proposal • Dynamically create/maintain 2 spanning trees (latency-aware and bandwidth-aware) for each processor • Perform collective operations along those trees • Provide a mechanism to support joining/leaving processors • Implement as an extension to the Phoenix Message Passing Library
Phoenix (Taura et al. 2003) • Message passing library for Grids • Not an impl. of MPI, but has its own API • Messages are sent to virtual nodes, not processors [Figure: ph_send(3) is delivered to Processor B while it holds virtual nodes {3, 4}, and to Processor C after virtual node 3 migrates there; Processor A holds {0, 1, 2} throughout]
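The sketch below illustrates the virtual-node idea behind the ph_send(3) shown in the figure. The owner table, its contents, and the routing code are assumptions made for illustration; they are not the Phoenix API or implementation.

```c
#include <stdio.h>

/* Illustrative sketch of virtual-node addressing (not the Phoenix API).
 * A message is addressed to a virtual node; the runtime delivers it to
 * whichever processor currently holds that node, so nodes can migrate
 * between processors without changing application code. */
#define NUM_VNODES 5

int main(void)
{
    /* Hypothetical mapping after virtual node 3 migrates from B to C,
     * matching the right half of the figure: A={0,1,2}, B={4}, C={3}. */
    char owner[NUM_VNODES] = { 'A', 'A', 'A', 'C', 'B' };

    int vnode = 3;                     /* the ph_send(3) of the figure */
    printf("message to virtual node %d is delivered to processor %c\n",
           vnode, owner[vnode]);
    return 0;
}
```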
Latency-Aware Spanning Tree Algorithm • Each processor looks for a suitable parent for each spanning tree using RTT measured at runtime • Goal: a tree that is not too deep and does not have too large a fan-out within each LAN, with a minimal number of wide-area parent/child relationships
Parent Selection • Node n changes its parent from p to candidate c if both RTT(n, c) < RTT(n, p) and RTT(c, r) < RTT(n, r), where r is the root (sketched below) • Wide-area relationships are quickly replaced by local-area relationships [Figure: node n switching from a distant parent p to a nearby candidate c]
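A minimal sketch of this parent-change test, assuming the RTTs have already been measured at runtime; the function and variable names are illustrative and the example values in main are made up.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hedged sketch of the latency-aware parent-selection test: node n,
 * currently attached to parent p, switches to candidate c only if c is
 * closer to n than p is AND c is closer to the root r than n is. This
 * is what gradually replaces wide-area parent links with local ones. */
static bool should_switch_parent(double rtt_n_c,   /* RTT(n, c) */
                                 double rtt_n_p,   /* RTT(n, p) */
                                 double rtt_c_r,   /* RTT(c, r) */
                                 double rtt_n_r)   /* RTT(n, r) */
{
    return rtt_n_c < rtt_n_p && rtt_c_r < rtt_n_r;
}

int main(void)
{
    /* Candidate in the same LAN (1 ms) versus a current parent across a
     * WAN (50 ms); the candidate is also closer to the root -> switch. */
    printf("%d\n", should_switch_parent(1.0, 50.0, 40.0, 90.0));   /* prints 1 */
    return 0;
}
```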
Tree Creation within a LAN • Will nodes that are placed too deep move up? Will nodes that are placed too shallow move down? • A force that makes the tree shallower and a force that makes it deeper balance out, producing a tree that is not too deep and not too shallow
Bandwidth-Aware Spanning Tree Algorithm • Each processor looks for a suitable parent using bandwidth measured at runtime • Goal: place processors as far away as possible from the root without sacrificing bandwidth
Parent Selection • Find a parent with high bandwidth to the root • Estimate BW(n-c-r) as min(BW(n-c), BW(c-r)) • When the current parent's fan-out is too large, become a child of a sibling if doing so does not sacrifice bandwidth, creating long pipes (sketched below)
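A minimal sketch of the bandwidth-aware check, assuming per-link bandwidth estimates are already available; the names, the ">=" comparison, and the example figures are illustrative assumptions.

```c
#include <stdio.h>

static double min2(double a, double b) { return a < b ? a : b; }

/* Hedged sketch of the bandwidth-aware candidate evaluation: the path
 * n -> c -> root is rated by its bottleneck, min(BW(n-c), BW(c-r)).
 * Node n becomes a child of sibling c only if this estimate does not
 * fall below the bandwidth it already gets through its parent p, which
 * reduces the parent's fan-out and lengthens the pipeline for free. */
static int should_adopt_sibling(double bw_n_c, double bw_c_r, double bw_n_p_r)
{
    double bw_n_c_r = min2(bw_n_c, bw_c_r);   /* bottleneck estimate */
    return bw_n_c_r >= bw_n_p_r;              /* do not sacrifice bandwidth */
}

int main(void)
{
    /* 940 Mbit/s to the sibling, 920 Mbit/s from it to the root, versus
     * 910 Mbit/s through the current parent -> become the sibling's child. */
    printf("%d\n", should_adopt_sibling(940.0, 920.0, 910.0));    /* prints 1 */
    return 0;
}
```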
Broadcast [Figure: broadcast along the spanning tree in a stable topology, where processors hold virtual nodes {0} {1} {2} {3} {4} {5}, and in a changing topology, where virtual node 5 has been remapped and is reached by a point-to-point message]
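The sketch below illustrates one reading of the changing-topology case: virtual nodes still covered by a child subtree receive the broadcast along the tree, and any remaining target virtual node is reached by a point-to-point message. The sets, names, and single-process form are assumptions for illustration, not the presented implementation.

```c
#include <stdio.h>

/* Illustrative single-process sketch of broadcast delivery when the tree
 * lags behind a topology change: virtual nodes covered by a child subtree
 * are served by forwarding along the tree; leftover target nodes get a
 * direct point-to-point message instead. */
#define NUM_VNODES 6

int main(void)
{
    int target[NUM_VNODES]  = { 0, 0, 0, 1, 1, 1 };  /* responsible for 3, 4, 5 */
    int covered[NUM_VNODES] = { 0, 0, 0, 1, 1, 0 };  /* children still cover 3, 4 */

    for (int v = 0; v < NUM_VNODES; v++) {
        if (!target[v])
            continue;
        if (covered[v])
            printf("virtual node %d: forwarded along the spanning tree\n", v);
        else
            printf("virtual node %d: sent a point-to-point message\n", v);
    }
    return 0;
}
```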
Reduction [Figure: reduction along the spanning tree in a stable topology, and in a changing topology, where a processor times out while waiting for the contribution of virtual node 5]
Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work
Preliminary Experiments • Latency-aware Spanning Tree Creation (Java Applet) • Stable-state short-message broadcast • Stable-state short-message reduction • Transient-state short-message broadcast
Broadcast (Stable-State) • 1-byte broadcast over 201 processors in 3 clusters [Graph comparing a topology-unaware implementation, a topology-aware implementation, and our implementation]
Reduction (Stable-State) • Reduction using 128 processors in 3 clusters [Graph comparing a topology-unaware implementation, a topology-aware implementation, and our implementation]
Transient-State Behavior • 201 processors in 3 clusters (1 virtual node per processor) • Repeatedly perform broadcasts • 100 processors leave after 60 secs • virtual nodes are remapped to remaining processors • 100 processors re-join after 30 secs • virtual nodes are given back to original processors
Transient-State Behavior [Graph: broadcast performance over time, with the leave and join events marked]
Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work
Conclusion • Designed and implemented latency-aware broadcast and reduction for wide-area networks • Showed that they perform reasonably well in stable topologies • Showed that they support joining/leaving processors • Future Work • Implement bandwidth-aware spanning tree
Publications • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for Wide-Area Message Passing Using Dynamic Spanning Trees (in Japanese). In Symposium on Advanced Computing Systems and Infrastructures. May 2005 (poster paper, to appear). • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Expedite: An Operating System Extension to Support Low-Latency Communication in Non-Dedicated Clusters. In IPSJ Transactions on Advanced Computing Systems. October 2004. • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for the Phoenix Programming Model. In Summer United Workshops on Parallel, Distributed, and Cooperative Processing. July 2004.
Broadcast (Stable-State) • Broadcast over 251 processors in 3 clusters [Graph comparing a topology-aware implementation, our implementation, and a topology-unaware implementation]