
Collective Operations for Wide-Area Message Passing Systems Using Adaptive Spanning Trees







  1. Collective Operations for Wide-Area Message Passing Systems Using Adaptive Spanning Trees Hideo Saito, Kenjiro Taura and Takashi Chikayama (Univ. of Tokyo) Grid 2005 (Nov. 13, 2005)

  2. Message Passing in WANs • Increase in bandwidth of Wide-Area Networks • More opportunities to perform parallel computation using multiple clusters connected by a WAN • Demands to use message passing for parallel computation in WANs: existing programs are written using message passing, and it is a familiar programming model

  3. Collective Operations • Communication operations in which all processors participate • E.g., broadcast and reduction • Importance (Gorlatch et al. 2004) • Easier to program using coll. operations than with just point-to-point operations (i.e., send/receive) • Libraries can provide a faster implementation than with user-level send/recv • Libraries can take advantage of knowledge of the underlying hardware (Figure: a broadcast from a root process)
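To make the programmability point concrete, here is a minimal sketch (not from the talk) contrasting a broadcast hand-written with point-to-point operations against a single collective call; send, recv and broadcast are hypothetical placeholders for a message-passing library's primitives.

# Illustrative only: send/recv/broadcast are hypothetical library primitives.

def user_level_broadcast(rank, nprocs, root, value, send, recv):
    """Broadcast written with point-to-point operations:
    the root loops over every other process and sends directly."""
    if rank == root:
        for dest in range(nprocs):
            if dest != root:
                send(dest, value)
        return value
    return recv(root)

def library_broadcast(value, root, broadcast):
    """The same operation as one collective call; the library is free to
    use a tree and its knowledge of the underlying network."""
    return broadcast(value, root=root)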

  4. Coll. Ops. in LANs and WANs • Collective Operations for LANs • Optimized under the assumption that all links have the same latency/bandwidth • Collective Operations for WANs • Wide-area links are much slower than local-area links • Collective operations need to avoid links w/ high latency or low bandwidth • Existing methods use static Grid-aware trees constructed using manually-supplied information

  5. Problems of Existing Methods • Large-scale, long-lasting applications lead to more situations in which… • Different computational resources are used upon each invocation • Computational resources are automatically allocated by middleware • Processors are added/removed after app. startup • In such cases it is difficult to manually supply topology information • Static trees used in existing methods won’t work

  6. Contribution of Our Work • Method to perform collective operations • Efficient in clustered wide-area systems • Doesn’t require manual configuration • Adapts to new topologies when processes are added/removed • Implementation for the Phoenix Message Passing System • Implementation for MPI (work in progress)

  7. Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work

  8. MPICH • Thakur et al. 2005 • Assumes that the latency and bandwidth of all links are the same • Short messages: latency-aware algorithm (binomial tree rooted at the broadcast root) • Long messages: bandwidth-aware algorithm (scatter followed by a ring all-gather)
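As a concrete reference for the short-message case, the following sketch computes a process's parent and children in the binomial broadcast tree used by MPICH-style implementations; it is a standard formulation written for illustration, not code from the talk.

def binomial_tree(rank, root, size):
    """Parent and children of `rank` in a binomial broadcast tree rooted at
    `root`, as used by MPICH-style implementations for short messages."""
    rel = (rank - root) % size      # rank relative to the root
    parent = None
    mask = 1
    while mask < size:
        if rel & mask:              # lowest set bit: the round we receive in
            parent = (rank - mask) % size
            break
        mask <<= 1
    children = []
    mask >>= 1
    while mask > 0:                 # after receiving, forward to sub-trees
        if rel + mask < size:
            children.append((rank + mask) % size)
        mask >>= 1
    return parent, children

# Example: with 8 processes and root 0, process 0 forwards to 4, 2 and 1.
print(binomial_tree(0, 0, 8))       # (None, [4, 2, 1])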

  9. MagPIe • Kielmann et al. 1999 • Separate static trees for wide-area and local-area communication • Broadcast: the root sends to the “coordinator node” of each cluster, and coordinator nodes then perform an MPICH-like broadcast within their own LAN (Figure: the root, one coordinator node per cluster, and the LANs below them)
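The two-level structure can be sketched as follows (my reading of the slide, not MagPIe's actual code); send, recv and local_bcast are assumed primitives, where local_bcast stands for an MPICH-like broadcast restricted to one cluster.

# Illustrative sketch of a MagPIe-style two-level broadcast.
# `coordinators` maps a cluster id to the rank of that cluster's coordinator.

def two_level_broadcast(rank, root, my_cluster, coordinators, value,
                        send, recv, local_bcast):
    my_coord = coordinators[my_cluster]
    if rank == root:
        for coord in coordinators.values():
            if coord != root:
                send(coord, value)          # one wide-area message per cluster
    elif rank == my_coord:
        value = recv(root)                  # coordinator receives over the WAN
    # Every process then takes part in an intra-cluster broadcast rooted at
    # its coordinator, which uses only local-area links.
    return local_bcast(value, root=my_coord)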

  10. Other Works on Coll. Ops. for WANs • Other works that rely on manually-supplied info. • Network Performance-aware Collective Communication for Clustered Wide Area Systems (Kielmann et al. 2001) • MPICH-G2 (Karonis et al. 2003)

  11. Content Delivery Networks • Application-level multicast mechanisms using topology-aware overlay networks • Overcast (Jannotti et al. 2000) • SelectCast (Bozdog et al. 2003) • Banerjee et al. 2002 • Designed for content delivery; don’t work for message passing • Data loss • Single source • Only 1-to-N operations

  12. Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work

  13. Phoenix • Taura et al. (PPoPP2003) • Phoenix Programming Model • Message passing model • Programs are written using send/receive • Messages are addressed to virtual nodes • Strong support for addition/removal of processes during execution • Phoenix Message Passing Library • Message passing library based on the Phoenix Programming Model • Basis of our implementation

  14. Addition/Removal of Processes • Virtual node namespace: messages are addressed to virtual nodes instead of to processes • API to “migrate” virtual nodes supports addition/removal of processes during execution (Figure: virtual nodes 20-39 migrate to a joining process, and send(30) is then delivered to their new owner)
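A minimal sketch of the virtual-node idea (the names are mine, not the Phoenix API): a mapping from virtual nodes to processes decides where a message addressed to a virtual node is delivered, and "migrating" virtual nodes updates that mapping when processes join or leave.

class VirtualNodeMap:
    """Illustrative mapping from virtual nodes to the processes hosting them."""

    def __init__(self, nprocs, nvnodes):
        # Initially spread the virtual nodes evenly over the processes.
        self.owner = {v: v * nprocs // nvnodes for v in range(nvnodes)}

    def migrate(self, vnodes, new_process):
        """Hand a set of virtual nodes to another (possibly new) process."""
        for v in vnodes:
            self.owner[v] = new_process

    def route(self, vnode):
        """Process that currently hosts `vnode` (where send(vnode) goes)."""
        return self.owner[vnode]

vmap = VirtualNodeMap(nprocs=2, nvnodes=40)   # processes 0 and 1
vmap.migrate(range(20, 40), new_process=2)    # a joining process takes 20-39
assert vmap.route(30) == 2                    # send(30) now reaches process 2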

  15. Broadcast in Phoenix (Figure: a broadcast in MPI compared with a broadcast in Phoenix; messages addressed to virtual nodes hosted by the same process may be delivered together, and the broadcast must still reach virtual nodes that have migrated)

  16. Reduction in Phoenix (Figure: a reduction in MPI compared with a reduction in Phoenix; values are combined toward the root with the operation Op)

  17. Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work

  18. Overview of Our Proposal • Create topology-aware spanning trees at run-time • Latency-aware trees (for short messages) • Bandwidth-aware trees (for long messages) • Perform broadcasts/reductions along the generated trees • Update the trees when processes are added/removed

  19. Spanning Tree Creation • Create a spanning tree for each process w/ that process at the root • Each process autonomously measures the RTT (or bandwidth) between itself and randomly selected other processes, and searches for a suitable parent in each spanning tree (Figure: a process measuring the RTT to several other processes)
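The per-process behavior might look roughly like the sketch below (illustrative only; ping and try_adopt are assumed helpers: ping measures the RTT to a process, and try_adopt asks a candidate whether it can serve as the parent in the tree rooted at a given process).

import random

def refine_trees(me, peers, roots, state, ping, try_adopt):
    """One periodic refinement round, run autonomously by every process.

    `state[root]` holds this process's current parent and the RTT to it
    for the spanning tree rooted at `root`.
    """
    cand = random.choice([q for q in peers if q != me])
    rtt_cand = ping(cand)                        # probe a random process
    for root in roots:                           # one spanning tree per root
        if rtt_cand < state[root]["rtt_parent"]:
            # cand looks closer than the current parent; the remaining
            # suitability checks (e.g., that cand is closer to the root)
            # are left to try_adopt.
            if try_adopt(cand, root):
                state[root] = {"parent": cand, "rtt_parent": rtt_cand}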

  20. Latency-Aware Trees • Goal: few edges between clusters, moderate fan-out and depth within clusters • Parent selection: process p switches to candidate cand if RTT(p, cand) < RTT(p, parent) and dist(cand, root) < dist(p, root) (Figure: process p and its distance dist(p, root) to the root)

  21. Latency-Aware Trees • Goal: few edges between clusters, moderate fan-out and depth within clusters • Parent selection: RTT(p, cand) < RTT(p, parent) and dist(cand, root) < dist(p, root) (Figure: the resulting tree spanning several LANs, with few wide-area edges)

  22. Latency-Aware Trees • Goal: few edges between clusters, moderate fan-out and depth within clusters • Parent selection: RTT(p, cand) < RTT(p, parent) and dist(cand, root) < dist(p, root) (Figure: successive parent changes toward the root and toward a within-cluster root)

  23. Latency-Aware Trees • Goal: few edges between clusters, moderate fan-out and depth within clusters • Parent selection: RTT(p, cand) < RTT(p, parent) and dist(cand, root) < dist(p, root) (Figure: the quantities RTT(p, parent), RTT(p, cand), dist(p, root) and dist(cand, root) involved in the test)
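The parent-selection test itself is just the conjunction of the two inequalities on the slide; here is a direct sketch (the reading of dist as the distance to the root maintained by the protocol is mine).

def should_switch_parent_latency(rtt_p_cand, rtt_p_parent,
                                 dist_cand_root, dist_p_root):
    """p switches its parent to cand only if cand is closer to p than the
    current parent AND cand is closer to the root than p itself.  The first
    test favors nearby (typically same-cluster) parents; the second keeps
    every parent change a step toward the root, so no cycles can form."""
    return rtt_p_cand < rtt_p_parent and dist_cand_root < dist_p_root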

  24. Bandwidth-Aware Trees • Goal: efficient use of bandwidth during a broadcast/reduction • Bandwidth estimation: est(p, cand) = min(est(cand), bw_p2p / (n_children + 1)) • Parent selection: est(p, cand) > est(p, parent) (Figure: process p, a candidate parent cand with n_children children, cand's parent, the point-to-point bandwidth bw_p2p between p and cand, and cand's own estimate est(cand))
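Reading the figure labels, est(cand) appears to be the bandwidth cand itself expects to receive, bw_p2p the point-to-point bandwidth between p and cand, and n_children the number of children cand already has; under that reading, the estimate and the parent test can be sketched as follows.

def estimated_bandwidth(est_cand, bw_p2p, n_children):
    """Bandwidth p can expect as a child of cand: limited either by what
    cand itself receives, or by cand's point-to-point bandwidth shared
    among its n_children existing children plus p."""
    return min(est_cand, bw_p2p / (n_children + 1))

def should_switch_parent_bandwidth(est_p_cand, est_p_parent):
    # Switch only if the candidate offers a strictly better estimate.
    return est_p_cand > est_p_parent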

  25. Broadcast • Short messages (< 128 KB): forward along a latency-aware tree • Long messages (> 128 KB): pipeline along a bandwidth-aware tree • The header includes the set of virtual nodes to be forwarded via the receiving process (Figure: the root 0 forwards the sets {1, 3, 4} and {2, 5} to its children, which forward {3}, {4} and {5} further down; virtual node 5 is still reached after migration)
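A forwarding step at one process could be sketched as follows (the 128 KB threshold and the per-child virtual-node sets come from the slide; the data structures and the send function are hypothetical, and the piece-by-piece pipelining of long messages is omitted).

SHORT_MSG_LIMIT = 128 * 1024        # bytes (threshold from the slide)

def forward_broadcast(payload, targets, latency_tree, bandwidth_tree, send):
    """Forward one broadcast at an intermediate process.

    `latency_tree` / `bandwidth_tree` map each child process to the set of
    virtual nodes reachable through that child; `targets` is the set of
    virtual nodes this process was asked to cover (taken from the header).
    """
    tree = latency_tree if len(payload) < SHORT_MSG_LIMIT else bandwidth_tree
    for child, reachable in tree.items():
        subset = targets & reachable            # virtual nodes child must cover
        if subset:
            # The subset travels in the header, so each receiver knows which
            # virtual nodes to deliver locally and which to forward further,
            # even after virtual nodes have migrated.
            send(child, subset, payload)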

  26. Reduction • Each processor waits for a message from all of its children, performs the specified operation, and forwards the result to its parent • Timeout mechanism: avoids waiting forever for a child that has changed parents and already sent its message to another process (Figure: after a parent change, the old parent times out on the missing child)
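One way to sketch the per-process reduction step with the timeout (illustrative; recv_from, send and the timeout value are assumptions, not the paper's values).

import time

def reduce_step(local_value, children, parent, op, recv_from, send,
                timeout=5.0):
    """Combine the local value with values from all children, then forward
    the result to the parent (or return it at the root)."""
    result = local_value
    deadline = time.time() + timeout
    for child in children:
        # A child that changed parents may have sent its value elsewhere;
        # stop waiting for it once the deadline passes.
        remaining = max(0.0, deadline - time.time())
        value = recv_from(child, timeout=remaining)
        if value is not None:
            result = op(result, value)
    if parent is not None:
        send(parent, result)
    return result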

  27. Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work

  28. Broadcast (1-byte) • MPI-like broadcast • Mapped 1 virtual node to each process • 201 processes in 3 clusters (Figure: results for our implementation, an MPICH-like (Grid-unaware) implementation, and a MagPIe-like (static Grid-aware) implementation)

  29. Broadcast (Long) • MPI-like broadcast • Mapped 1 virtual node to each process • 137 processes in 4 clusters

  30. Reduction • MPI-like Reduction • Mapped 1 virtual node to each process • 128 processes in 3 clusters

  31. Addition/Removal of Processes • Repeated 4-MB broadcasts • 160 procs. in 4 clusters • Added/removed procs. while broadcasting: t = 0 [s], 1 virtual node/process; t = 60 [s], remove half of the processes; t = 90 [s], re-add the removed processes

  32. Outline • Introduction • Related Work • Phoenix • Our Proposal • Experimental Results • Conclusion and Future Work

  33. Conclusion • Presented a method to perform broadcasts and reductions in WANs w/out manual configuration • Experiments • Stable-state broadcast/reduction • 1-byte broadcast 3+ times faster than MPICH, w/in a factor of 2 of MagPIe • Addition/removal of processes • Effective execution resumed 8 seconds after adding/removing processes

  34. Future Work • Optimize broadcast/reduction • Reduce the gap between our method and static Grid-enabled methods • Other collective operations • All-to-all • Barrier

  35. Thank you!
