Collective Operations for Wide-Area Message Passing Systems Using Dynamically Created Spanning Trees April 15, 2005 Chikayama-Taura Laboratory 46411 Hideo Saito
Background • Opportunities to perform message passing in WANs are increasing
Message Passing in WANs • WAN → more resources • However, systems designed for LANs do not perform well in WANs • In particular, collective operations (broadcast, reduction) designed for LANs perform very poorly in WANs
Collective Operations • Operations in which all processors participate (cf. point-to-point send/receive) • Ex. broadcast, reduction [Figure: reduction tree summing one value per processor toward the root]
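To make the reduction pictured above concrete, here is a minimal single-process sketch in C; the tree shape, the values, and the parent-array representation are illustrative assumptions, not taken from the presented system.

```c
#include <stdio.h>

/* Hedged single-process illustration of a tree-based sum reduction:
 * every processor holds one value, partial sums move up the tree, and
 * the root ends up with the global sum. Tree and values are made up. */
#define N 8

int main(void)
{
    /* parent[i] is the tree parent of processor i; processor 0 is the root. */
    int  parent[N] = { -1, 0, 0, 1, 1, 2, 2, 3 };
    long value[N]  = {  3, 1, 4, 1, 5, 9, 2, 6 };
    long partial[N];

    for (int i = 0; i < N; i++)
        partial[i] = value[i];

    /* Children have larger indices than their parents in this numbering,
     * so a reverse sweep forwards each partial sum toward the root. */
    for (int i = N - 1; i > 0; i--)
        partial[parent[i]] += partial[i];

    printf("sum at root = %ld\n", partial[0]);   /* 3+1+4+1+5+9+2+6 = 31 */
    return 0;
}
```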
Collective Operations in WANs • Topology must be considered for high performance • Manual configuration is undesirable • Processors should be able to join/leave
Objective • To design and implement collective operations • w/ high performance in WANs • w/o manual configuration • w/ support for joining/leaving processors
Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work
Collective Operations of MPICH • MPICH (Thakur et al. 2003) • Latency-aware binomial tree for short messages (sketched below) • Bandwidth-aware tree for long messages
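As a reminder of how the short-message binomial tree behaves, the sketch below simulates its communication schedule. It illustrates the standard algorithm rather than MPICH source, and the communicator size of 8 with ranks taken relative to the root is an assumption for the example.

```c
#include <stdio.h>

/* Simulation of a binomial-tree broadcast schedule (illustrative, not
 * MPICH source). In round k, every rank that already holds the data
 * sends it to the rank 2^k positions away, so all `size` ranks are
 * reached in ceil(log2(size)) rounds. Ranks are relative to the root. */
int main(void)
{
    const int size = 8;                      /* assumed communicator size */
    int round = 0;

    for (int mask = 1; mask < size; mask <<= 1, round++)
        for (int rank = 0; rank < size; rank++)
            if (rank < mask && rank + mask < size)   /* rank already has the data */
                printf("round %d: rank %d -> rank %d\n", round, rank, rank + mask);
    return 0;
}
```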
Collective Operations of MPICH [Figure: long-message broadcast performed as a scatter from the root followed by a ring all-gather]
Collective Operations of MPICH • MPICH assumes that latency and bandwidth are uniform • But latency and bandwidth differ by orders of magnitude between local-area and wide-area links • Collective operations designed for LANs do not perform well in WANs
High-Performance Collective Operations for WANs • MagPIe (Kielmann et al. 1999) • Bandwidth-Efficient Collective Operations (Kielmann et al. 2000) • MPICH-G2 (Karonis et al. 2003) • Manual configuration necessary • Processors cannot join/leave
Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work
Overview of Our Proposal • Dynamically create/maintain 2 spanning trees (latency-aware and bandwidth-aware) for each processor • Perform collective operations along those trees • Provide a mechanism to support joining/leaving processors • Implement as an extension to the Phoenix Message Passing Library
Phoenix (Taura et al. 2003) • Message passing library for Grids • Not an impl. of MPI, but has its own API • Messages are sent to virtual nodes, not processors [Figure: ph_send(3) is delivered to Processor B while it holds virtual nodes {3, 4}, and to Processor C after virtual node 3 migrates there; Processor A holds {0, 1, 2} throughout]
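The sketch below illustrates the virtual-node idea behind the ph_send(3) shown in the figure. The owner table, its contents, and the routing code are assumptions made for illustration; they are not the Phoenix API or implementation.

```c
#include <stdio.h>

/* Illustrative sketch of virtual-node addressing (not the Phoenix API).
 * A message is addressed to a virtual node; the runtime delivers it to
 * whichever processor currently holds that node, so nodes can migrate
 * between processors without changing application code. */
#define NUM_VNODES 5

int main(void)
{
    /* Hypothetical mapping after virtual node 3 migrates from B to C,
     * matching the right half of the figure: A={0,1,2}, B={4}, C={3}. */
    char owner[NUM_VNODES] = { 'A', 'A', 'A', 'C', 'B' };

    int vnode = 3;                     /* the ph_send(3) of the figure */
    printf("message to virtual node %d is delivered to processor %c\n",
           vnode, owner[vnode]);
    return 0;
}
```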
Latency-Aware Spanning Tree Algorithm • Each processor looks for a suitable parent for each spanning tree using RTT measured at runtime • Goal: a tree that is not too deep and does not have too large a fan-out within each LAN, with a minimal number of wide-area parent/child relationships
Parent Selection • Node n changes its parent from p to candidate c if both RTT(n, c) < RTT(n, p) and RTT(c, r) < RTT(n, r), where r is the root (sketched below) • Wide-area relationships are quickly replaced by local-area relationships [Figure: node n switching from a distant parent p to a nearby candidate c]
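A minimal sketch of this parent-change test, assuming the RTTs have already been measured at runtime; the function and variable names are illustrative and the example values in main are made up.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hedged sketch of the latency-aware parent-selection test: node n,
 * currently attached to parent p, switches to candidate c only if c is
 * closer to n than p is AND c is closer to the root r than n is. This
 * is what gradually replaces wide-area parent links with local ones. */
static bool should_switch_parent(double rtt_n_c,   /* RTT(n, c) */
                                 double rtt_n_p,   /* RTT(n, p) */
                                 double rtt_c_r,   /* RTT(c, r) */
                                 double rtt_n_r)   /* RTT(n, r) */
{
    return rtt_n_c < rtt_n_p && rtt_c_r < rtt_n_r;
}

int main(void)
{
    /* Candidate in the same LAN (1 ms) versus a current parent across a
     * WAN (50 ms); the candidate is also closer to the root -> switch. */
    printf("%d\n", should_switch_parent(1.0, 50.0, 40.0, 90.0));   /* prints 1 */
    return 0;
}
```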
Tree Creation within a LAN • Will nodes that are placed too deep move up? Will nodes that are placed too shallow move down? • A force that makes the tree shallower and a force that makes it deeper balance out, producing a tree that is not too deep and not too shallow
Bandwidth-Aware Spanning Tree Algorithm • Each processor looks for a suitable parent using bandwidth measured at runtime • Goal: place processors as far away as possible from the root without sacrificing bandwidth
Parent Selection • Find a parent with high bandwidth to the root • Estimate BW(n-c-r) as min(BW(n-c), BW(c-r)) • When the current parent's fan-out is too large, become a child of a sibling if doing so does not sacrifice bandwidth, creating long pipes (sketched below)
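A minimal sketch of the bandwidth-aware check, assuming per-link bandwidth estimates are already available; the names, the ">=" comparison, and the example figures are illustrative assumptions.

```c
#include <stdio.h>

static double min2(double a, double b) { return a < b ? a : b; }

/* Hedged sketch of the bandwidth-aware candidate evaluation: the path
 * n -> c -> root is rated by its bottleneck, min(BW(n-c), BW(c-r)).
 * Node n becomes a child of sibling c only if this estimate does not
 * fall below the bandwidth it already gets through its parent p, which
 * reduces the parent's fan-out and lengthens the pipeline for free. */
static int should_adopt_sibling(double bw_n_c, double bw_c_r, double bw_n_p_r)
{
    double bw_n_c_r = min2(bw_n_c, bw_c_r);   /* bottleneck estimate */
    return bw_n_c_r >= bw_n_p_r;              /* do not sacrifice bandwidth */
}

int main(void)
{
    /* 940 Mbit/s to the sibling, 920 Mbit/s from it to the root, versus
     * 910 Mbit/s through the current parent -> become the sibling's child. */
    printf("%d\n", should_adopt_sibling(940.0, 920.0, 910.0));    /* prints 1 */
    return 0;
}
```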
Broadcast [Figure: broadcast along the spanning tree in a stable topology, where processors hold virtual nodes {0} {1} {2} {3} {4} {5}, and in a changing topology, where virtual node 5 has been remapped and is reached by a point-to-point message]
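The sketch below illustrates one reading of the changing-topology case: virtual nodes still covered by a child subtree receive the broadcast along the tree, and any remaining target virtual node is reached by a point-to-point message. The sets, names, and single-process form are assumptions for illustration, not the presented implementation.

```c
#include <stdio.h>

/* Illustrative single-process sketch of broadcast delivery when the tree
 * lags behind a topology change: virtual nodes covered by a child subtree
 * are served by forwarding along the tree; leftover target nodes get a
 * direct point-to-point message instead. */
#define NUM_VNODES 6

int main(void)
{
    int target[NUM_VNODES]  = { 0, 0, 0, 1, 1, 1 };  /* responsible for 3, 4, 5 */
    int covered[NUM_VNODES] = { 0, 0, 0, 1, 1, 0 };  /* children still cover 3, 4 */

    for (int v = 0; v < NUM_VNODES; v++) {
        if (!target[v])
            continue;
        if (covered[v])
            printf("virtual node %d: forwarded along the spanning tree\n", v);
        else
            printf("virtual node %d: sent a point-to-point message\n", v);
    }
    return 0;
}
```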
Reduction [Figure: reduction along the spanning tree in a stable topology, and in a changing topology, where a processor times out while waiting for the contribution of virtual node 5]
Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work
Preliminary Experiments • Latency-aware Spanning Tree Creation (Java Applet) • Stable-state short-message broadcast • Stable-state short-message reduction • Transient-state short-message broadcast
Broadcast (Stable-State) • 1-byte broadcast over 201 processors in 3 clusters [Graph comparing a topology-unaware implementation, a topology-aware implementation, and our implementation]
Reduction (Stable-State) • Reduction using 128 processors in 3 clusters [Graph comparing a topology-unaware implementation, a topology-aware implementation, and our implementation]
Transient-State Behavior • 201 processors in 3 clusters (1 virtual node per processor) • Repeatedly perform broadcasts • 100 processors leave after 60 secs • virtual nodes are remapped to remaining processors • 100 processors re-join after 30 secs • virtual nodes are given back to original processors
Transient-State Behavior [Graph: broadcast performance over time, with the leave and join events marked]
Introduction • Related Work • Our Proposal • Preliminary Experiments • Conclusion and Future Work
Conclusion • Designed and implemented latency-aware broadcast and reduction for wide-area networks • Showed that they perform reasonably well in stable topologies • Showed that they support joining/leaving processors • Future Work • Implement bandwidth-aware spanning tree
Publications • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for Wide-Area Message Passing Using Dynamic Spanning Trees (in Japanese). In Symposium on Advanced Computing Systems and Infrastructures. May 2005 (poster paper, to appear). • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Expedite: An Operating System Extension to Support Low-Latency Communication in Non-Dedicated Clusters. In IPSJ Transactions on Advanced Computing Systems. October 2004. • Hideo Saito, Kenjiro Taura, and Takashi Chikayama. Collective Operations for the Phoenix Programming Model. In Summer United Workshops on Parallel, Distributed, and Cooperative Processing. July 2004.
Broadcast (Stable-State) • Broadcast over 251 processors in 3 clusters [Graph comparing a topology-aware implementation, our implementation, and a topology-unaware implementation]