A Communication-Optimal N-Body Algorithm for Direct Interactions

Michael Driscoll, Evangelos Georganas, Penporn Koanantakool, Edgar Solomonik, Katherine Yelick*. UC Berkeley; *Lawrence Berkeley National Laboratory.

Presentation Transcript


  1. A Communication-Optimal N-Body Algorithm for Direct Interactions
     Michael Driscoll, Evangelos Georganas, Penporn Koanantakool, Edgar Solomonik, Katherine Yelick*
     UC Berkeley; *Lawrence Berkeley National Laboratory

  2. Overview
     • Intro to N-Body problem.
     • Communication bounds.
     • Communication-optimal algorithm.
     • Performance results.
     • Conclusion.

  3. Direct N-Body
     • n particles: molecules, galaxies, database tuples, etc.
     • O(n^2) interactions:
         for i = 1 to n:
           for j = 1 to n:
             force[i] += interact(particles[i], particles[j])
     • p processors
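The double loop on this slide can be written as a small runnable Python sketch. The `interact` function is a hypothetical 1D stand-in (a repulsive force that falls off with the square of distance, echoing the force described for the experiments); the slides do not define it.

```python
def interact(xi, xj):
    """Hypothetical 1D force on a particle at xi due to one at xj (assumed form)."""
    d = xi - xj
    if d == 0.0:
        return 0.0  # no self-interaction
    return (1.0 if d > 0 else -1.0) / (d * d)  # repulsive, magnitude 1/d^2


def direct_forces(particles):
    """All-pairs O(n^2) force evaluation, mirroring the slide's double loop."""
    n = len(particles)
    force = [0.0] * n
    for i in range(n):
        for j in range(n):
            force[i] += interact(particles[i], particles[j])
    return force


if __name__ == "__main__":
    print(direct_forces([0.0, 1.0, 3.0]))
```

Every particle reads every other particle, which is why the data movement between processors, rather than the arithmetic, becomes the concern once the particles are split across p processors.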

  4. Communication Model
     • Communication cost along critical path.
     • Alpha-beta model: T = α·S + β·W, where α = latency, S = # messages, β = 1/bandwidth, W = # words.
     • Can we find lower bounds on S or W?
     • Do current algorithms meet those bounds?
     • If not, can we find ones that do, or better bounds?
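As a minimal sketch of the alpha-beta model, the function below charges a latency cost per message and an inverse-bandwidth cost per word. The machine parameters `ALPHA` and `BETA` are illustrative assumptions, not values from the slides.

```python
def comm_time(num_messages, num_words, alpha, beta):
    """Alpha-beta cost: alpha = latency per message, beta = time per word (1/bandwidth)."""
    return alpha * num_messages + beta * num_words


# Assumed, illustrative machine parameters (not from the slides).
ALPHA = 1e-6  # seconds per message
BETA = 1e-9   # seconds per word

# Same data volume, sent as many small messages versus one large message.
t_many_small = comm_time(num_messages=1000, num_words=1000, alpha=ALPHA, beta=BETA)
t_one_large = comm_time(num_messages=1, num_words=1000, alpha=ALPHA, beta=BETA)
print(t_many_small, t_one_large)
```

With parameters like these, the message count S can dominate the cost even when the word count W is unchanged, which is why the talk bounds S and W separately.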

  5. Communication Lower Bounds
     From Minimizing Communication in Numerical Linear Algebra [Ballard et al. 2011]:
       S = Ω(F / H),  W = Ω(F·M / H)
     where
       F = # flops
       M = size of fast memory
       H = max flops per M words
       S = # messages
       W = # words
     Generalized in: Communication Lower Bounds and Optimal Algorithms for Programs That Reference Arrays [Christ et al. 2013].

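  6. Lower Bounds for N-Body
     • Flops: F = n^2/p
     • Memory: M words (particles) per processor
     • Max flops per M words: H = O(M^2)
     • Plug into latency and bandwidth lower bounds: S = Ω(n^2/(p·M^2)), W = Ω(n^2/(p·M))
     • Do current algorithms meet these bounds?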

  7. A Naïve N-Body Algorithm
     Diagram: Proc. 0, Proc. 1, Proc. 2, Proc. 3, Proc. 4, Proc. 5, …, Proc. P; each processor holds its own particles plus a circulating buffer of replicas.
     • For p steps, send n/p particles.
     • # messages: O(p); # words: O(n).
     • Recall the bounds with M = n/p: S = Ω(p) and W = Ω(n). ✔ ✔
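The naïve scheme can be simulated serially to check the message and word counts. This is a sketch under stated assumptions: the ring shift direction, the 1D `interact` force, and the function name `ring_nbody` are illustrative choices, and no real message passing takes place.

```python
def interact(xi, xj):
    """Hypothetical 1D repulsive force on xi due to xj (assumed form)."""
    d = xi - xj
    if d == 0.0:
        return 0.0
    return (1.0 if d > 0 else -1.0) / (d * d)


def ring_nbody(particles, p):
    """Simulate p processors that pass particle buffers around a ring.

    Returns (forces, messages, words), with messages and words counted
    per processor, i.e. along the critical path.
    """
    n = len(particles)
    assert n % p == 0, "sketch assumes p divides n"
    b = n // p                                   # particles per processor
    home = [particles[r * b:(r + 1) * b] for r in range(p)]
    buf = [list(block) for block in home]        # travelling replicas
    force = [[0.0] * b for _ in range(p)]
    messages = words = 0
    for _ in range(p):
        # Every processor interacts its home block with the visiting buffer.
        for r in range(p):
            for i, xi in enumerate(home[r]):
                for xj in buf[r]:
                    force[r][i] += interact(xi, xj)
        # Shift: every processor sends its buffer to the next one in the ring.
        buf = [buf[(r - 1) % p] for r in range(p)]
        messages += 1
        words += b
    return [f for block in force for f in block], messages, words


if __name__ == "__main__":
    xs = [0.5, 1.5, 2.0, 3.25, 4.0, 6.0, 7.5, 9.0]
    forces, messages, words = ring_nbody(xs, p=4)
    print(messages, words)
```

Each processor sends p messages of n/p particles, so the per-processor totals are p messages and n words, matching the counts on the slide.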

  8. The naïve algorithm is optimal…
     • Recall the lower bounds: S = Ω(n^2/(p·M^2)), W = Ω(n^2/(p·M)).
     • Notice M in the denominator.
     • Increase M => decrease communication.
     • Realize a "lower" lower bound.
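To see the effect of M numerically, the sketch below evaluates the lower bounds with the memory per processor set to M = c·n/p, where c is a replication factor. Constants hidden by the asymptotic notation are dropped, and the function name `lower_bounds` is illustrative.

```python
def lower_bounds(n, p, c=1):
    """Return (S, W) lower bounds, up to constants, for memory M = c*n/p."""
    M = c * n / p          # particles that fit on one processor
    F = n * n / p          # interactions computed per processor
    H = M * M              # most interactions possible among M particles
    S = F / H              # messages: n^2 / (p * M^2) = p / c^2
    W = F * M / H          # words:    n^2 / (p * M)   = n / c
    return S, W


if __name__ == "__main__":
    for c in (1, 2, 4, 8):
        print(c, lower_bounds(n=1024, p=64, c=c))
```

With c = 1 the bounds are p messages and n words, which the naïve algorithm attains; raising c lowers the message bound by c^2 and the word bound by c.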

  9. Communication-Optimal N-Body
     Diagram: Team 0, Team 1, Team 2, Team 3, Team 4, Team 5, …, Team p/c; processors arranged as c layers by p/c teams, with the particles divided among the teams.
     • Replication factor: c copies of each particle.
     • Communication cost, in messages and words, for each phase: broadcast, shifts, reduction, and total.
     • c = p^(1/2) => force decomposition [Plimpton 1995].
     • Reduce # messages by c^2; reduce # words by c.
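The replicated scheme can also be simulated serially. The sketch below is one possible reading of the slide, not the authors' code: it assumes each layer first skews its buffers by a layer-dependent offset and then shifts by one team per step, that c divides p/c, and that the broadcast and reduction are folded into plain copies and sums. Only shift-phase messages are counted.

```python
def interact(xi, xj):
    """Hypothetical 1D repulsive force on xi due to xj (assumed form)."""
    d = xi - xj
    if d == 0.0:
        return 0.0
    return (1.0 if d > 0 else -1.0) / (d * d)


def ca_nbody(particles, p, c):
    """Simulate c layers by p/c teams; return (forces, messages, words).

    messages and words are per-processor shift-phase counts on the
    critical path (broadcast and reduction are not counted here).
    """
    n = len(particles)
    q = p // c                                   # number of teams
    assert p % c == 0 and q % c == 0 and n % q == 0, "sketch assumes even splits"
    b = n // q                                   # particles per team = n*c/p
    steps = q // c                               # p/c^2 compute steps per layer
    blocks = [particles[t * b:(t + 1) * b] for t in range(q)]
    total = [[0.0] * b for _ in range(q)]        # stands in for the team reduction
    crit_messages = crit_words = 0
    for k in range(c):
        # After the broadcast, every member of team t holds a copy of block t.
        buf = list(range(q))                     # block id held by each team
        msgs = 0
        if k > 0:
            # Skew: layer k starts k*steps teams away, so layers cover disjoint offsets.
            buf = [buf[(t - k * steps) % q] for t in range(q)]
            msgs += 1
        for s in range(steps):
            for t in range(q):
                for i, xi in enumerate(blocks[t]):
                    for xj in blocks[buf[t]]:
                        total[t][i] += interact(xi, xj)
            if s < steps - 1:
                buf = [buf[(t - 1) % q] for t in range(q)]   # shift by one team
                msgs += 1
        crit_messages = max(crit_messages, msgs)
        crit_words = max(crit_words, msgs * b)
    return [f for block in total for f in block], crit_messages, crit_words


if __name__ == "__main__":
    xs = [0.5, 1.5, 2.0, 3.25, 4.0, 6.0, 7.5, 9.0]
    forces, messages, words = ca_nbody(xs, p=8, c=2)
    print(messages, words)
```

In this sketch each layer handles p/c^2 of the p/c block offsets, so a processor sends about p/c^2 messages of n·c/p particles each, or n/c words in total, in line with the reductions stated on the slide.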

  10. Experiments
      • Developed particle code
        • Flat MPI
        • 52-byte particles
        • Repulsive force drops off with square of distance
        • Reflective boundary conditions
      • Platforms
        • Hopper: Cray XE6 at NERSC, 24 cores/node
        • Intrepid: IBM BlueGene/P at ALCF, 4 cores/node
        • Both have 3D torus interconnect.

  11. Performance on Hopper: 24K particles, 6K cores. Plot: down is good; 95.6% reduction.

  12. Performance on Intrepid: 262K particles, 32K cores. Plot: down is good; 99.3% reduction.

  13. Strong Scaling on Intrepid: 262K particles. Plot: up is good; perfect strong scaling, 4.5x speedup.

  14. CA N-Body with Cutoff Distance
      • No interactions beyond cutoff radius r.
      • Assuming:
        • uniform particle distribution
        • spatial processor decomposition
      • Simple extension to support a cutoff:
        • still communication-optimal
        • works in space of any dimension
        • speedups from 1D and 2D experiments

  15. N-Body with Cutoff
      Diagram: Team 0, Team 1, Team 2, Team 3, Team 4, Team 5, …, Team p/c; processors arranged as c layers by p/c teams, with the cutoff diameter marked across the particles.
      • Shifts occur modulo the cutoff distance.
      • Optimality holds:
        • same counting argument
        • see paper for details

  16. 1D Simulation on Intrepid: 262K particles, 32K cores. Plot: down is good; 84.6% reduction.

  17. 2D Simulation on Hopper: 196K particles, 24K cores. Plot: down is good; 74.8% reduction.

  18. Strong Scaling on Hopper: 2D space, 24K cores, 196K particles. Plot: up is good; good strong scaling.

  19. Conclusions
      • By using c times more memory, we reduce:
        • Words sent along critical path by a factor of c.
        • Messages sent along critical path by a factor of c^2.
      • Theory: maximize c.
      • Practice: tune for best c.
      • Saw 99.5% reduction in communication (11.8x speedup).
      • Applications beyond direct n-body:
        • collision detection algorithms
        • database joins
        • bottom solvers in hierarchical n-body codes
