A Communication-Optimal N-Body Algorithm for Direct Interactions
Michael Driscoll, Evangelos Georganas, Penporn Koanantakool, Edgar Solomonik, Katherine Yelick*
UC Berkeley, *Lawrence Berkeley National Laboratory
Overview
• Intro to N-Body problem.
• Communication bounds.
• Communication-optimal algorithm.
• Performance results.
• Conclusion.
Direct N-Body
• n particles (molecules, galaxies, database tuples, etc.)
• p processors
• O(n^2) interactions:
    for i = 1 to n:
        for j = 1 to n:
            force[i] += interact(particles[i], particles[j])
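
The loop on this slide can be made runnable as a minimal serial sketch. The slide does not define `interact`; the version below is an assumption that borrows the force law mentioned later in the deck (repulsive, falling off with the square of distance) and restricts it to one dimension for brevity.

```python
import math


def interact(a, b):
    """Hypothetical 1D pairwise force on a particle at `a` due to one at `b`.

    Assumption: repulsive, magnitude 1/d^2, zero for a particle with itself.
    """
    d = a - b
    if d == 0.0:
        return 0.0
    # Push `a` away from `b`: the force has the same sign as (a - b).
    return math.copysign(1.0 / (d * d), d)


def direct_forces(particles):
    """All-pairs force computation: n^2 calls to interact()."""
    n = len(particles)
    force = [0.0] * n
    for i in range(n):
        for j in range(n):
            force[i] += interact(particles[i], particles[j])
    return force
```

For three equally spaced particles, the middle one feels equal and opposite pushes, so its net force is zero.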
Communication Model
• Communication cost along critical path.
• Alpha-beta model: $T = \alpha S + \beta W$, where $\alpha$ is the latency, $S$ the # messages, $\beta$ = 1/bandwidth, and $W$ the # words.
• Can we find lower bounds on S or W?
• Do current algorithms meet those bounds?
• If not, can we find ones that do, or better bounds?
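
As a small illustration of the alpha-beta model, the sketch below evaluates the standard form T = alpha * S + beta * W. The function name and the sample machine parameters are hypothetical, chosen only to show how latency and bandwidth terms combine.

```python
def comm_time(S, W, alpha, beta):
    """Alpha-beta communication cost along the critical path.

    S: number of messages, W: number of words,
    alpha: latency per message, beta: time per word (1/bandwidth).
    """
    return alpha * S + beta * W


# Hypothetical machine: 1 microsecond latency, 1 nanosecond per word.
t = comm_time(S=1000, W=1_000_000, alpha=1e-6, beta=1e-9)
```

Cutting S helps most when the latency term dominates, and cutting W helps most when the bandwidth term dominates.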
Communication Lower Bounds
• From "Minimizing Communication in Numerical Linear Algebra" [Ballard et al. 2011]: $S = \Omega(F/H)$, $W = \Omega(FM/H)$
  • F: # flops
  • M: size of fast memory
  • H: max flops per M words
  • S: # messages
  • W: # words
• Generalized in "Communication Lower Bounds and Optimal Algorithms for Programs That Reference Arrays" [Christ et al. 2013].
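Lower Bounds for N-Body
• Flops: $F = \Theta(n^2/p)$ per processor
• Memory: $M$ words per processor
• Max flops per M words: $H = O(M^2)$
• Plug into latency and bandwidth lower bounds: $S = \Omega\left(\frac{n^2}{pM^2}\right)$, $W = \Omega\left(\frac{n^2}{pM}\right)$
• Do current algorithms meet these bounds?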
A Naïve N-Body Algorithm
[Figure: processors Proc. 0, Proc. 1, …, Proc. P, with rows labeled "particles" and "replicas"]
• For p steps, send n/p particles.
  • # messages: $S = O(p)$
  • # words: $W = O(n)$
• Recall the bounds, with $M = n/p$: $S = \Omega(p)$ ✔, $W = \Omega(n)$ ✔
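
The "for p steps, send n/p particles" scheme can be sketched as a serial simulation of a ring of processors. This is an illustration under stated assumptions, not the authors' code: `ring_nbody` is a hypothetical name, n is assumed divisible by p, `interact` is supplied by the caller, and the counters record messages and words (in particles) sent by one processor.

```python
def ring_nbody(particles, p, interact):
    """Simulate the naive ring algorithm on p processors.

    Returns (forces, messages, words), where messages and words are the
    per-processor counts accumulated over the p shift steps.
    """
    n = len(particles)
    assert n % p == 0, "sketch assumes p divides n"
    b = n // p  # particles per processor
    local = [particles[r * b:(r + 1) * b] for r in range(p)]
    buf = [list(block) for block in local]  # travelling copies
    force = [[0.0] * b for _ in range(p)]
    messages = words = 0
    for _ in range(p):
        # Each processor interacts its own particles with the buffer it holds.
        for r in range(p):
            for i, x in enumerate(local[r]):
                for y in buf[r]:
                    force[r][i] += interact(x, y)
        # All processors pass their buffer to the next rank at once.
        buf = [buf[(r - 1) % p] for r in range(p)]
        messages += 1
        words += b
    return [f for block in force for f in block], messages, words


forces, messages, words = ring_nbody(
    list(range(8)), p=4, interact=lambda a, b: float(a - b))
```

With n = 8 and p = 4 the counters come out to p = 4 messages and n = 8 words per processor, matching the counts on the slide.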
The naïve algorithm is optimal…
• Recall the lower bounds: $S = \Omega\left(\frac{n^2}{pM^2}\right)$, $W = \Omega\left(\frac{n^2}{pM}\right)$
• Notice M in the denominator.
• Increase M => decrease communication.
• Realize a "lower" lower bound.
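
To see how much a larger memory lowers the bounds, here is a short worked substitution. It assumes the bounds have the form $W = \Omega(n^2/(pM))$ and $S = \Omega(n^2/(pM^2))$, and that each processor has $c$ times the minimum memory, $M = cn/p$ (the symbol $c$ is introduced here for illustration).

```latex
\begin{align*}
W &= \Omega\!\left(\frac{n^2}{pM}\right)
   = \Omega\!\left(\frac{n^2}{p \cdot cn/p}\right)
   = \Omega\!\left(\frac{n}{c}\right), \\
S &= \Omega\!\left(\frac{n^2}{pM^2}\right)
   = \Omega\!\left(\frac{n^2}{p \cdot c^2 n^2/p^2}\right)
   = \Omega\!\left(\frac{p}{c^2}\right).
\end{align*}
```

Setting $c = 1$ recovers the $\Omega(n)$ and $\Omega(p)$ bounds that the naïve algorithm meets.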
Communication-Optimal N-Body
[Figure: particles divided among Team 0, Team 1, …, Team p/c; processors arranged as c layers × p/c teams]
• Replication factor: c copies of each particle
• Communication cost, in messages and words, of each phase:
  • Broadcast
  • Shifts
  • Reduction
  • Total: $S = O(p/c^2)$ messages, $W = O(n/c)$ words
• Reduce # messages by $c^2$, reduce # words by $c$.
• $c = p^{1/2}$ => force decomposition [Plimpton 1995]
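
The broadcast, shift, and reduction phases can be sketched as a serial simulation. This is an illustration under assumptions rather than the authors' implementation: `replicated_nbody` is a hypothetical name, the sketch requires that c^2 divides p and that p/c divides n, and the counters track only the unit shifts (the broadcast, the initial skew, and the reduction are not counted).

```python
def replicated_nbody(particles, p, c, interact):
    """Simulate the replicated algorithm on a c x (p/c) processor grid.

    Returns (forces, shift_messages, shift_words) per processor.
    """
    n = len(particles)
    assert p % (c * c) == 0, "sketch assumes c^2 divides p"
    teams = p // c
    assert n % teams == 0, "sketch assumes p/c divides n"
    b = n // teams      # particles per team, c*n/p
    per = teams // c    # shift steps per layer, p/c^2
    block = [particles[t * b:(t + 1) * b] for t in range(teams)]
    # Broadcast: every layer k of team t holds a copy of block[t].
    # Skew: layer k starts with the block from k*per teams behind.
    buf = [[block[(t - k * per) % teams] for t in range(teams)]
           for k in range(c)]
    partial = [[[0.0] * b for _ in range(teams)] for _ in range(c)]
    messages = words = 0
    for _ in range(per):
        for k in range(c):
            for t in range(teams):
                for i, x in enumerate(block[t]):
                    for y in buf[k][t]:
                        partial[k][t][i] += interact(x, y)
        # Unit shift along every layer at once.
        buf = [[buf[k][(t - 1) % teams] for t in range(teams)]
               for k in range(c)]
        messages += 1
        words += b
    # Reduction: sum the partial forces over the c layers of each team.
    force = [sum(partial[k][t][i] for k in range(c))
             for t in range(teams) for i in range(b)]
    return force, messages, words


forces, messages, words = replicated_nbody(
    list(range(8)), p=8, c=2, interact=lambda a, b: float(a - b))
```

With n = 8, p = 8, and c = 2, each processor performs p/c^2 = 2 shifts carrying n/c = 4 words in total, compared with 8 shifts and 8 words for the ring scheme on the same p.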
Experiments
• Developed particle code
  • Flat MPI
  • 52-byte particles
  • Repulsive force drops off with square of distance
  • Reflective boundary conditions
• Platforms
  • Hopper: Cray XE6 at NERSC, 24 cores/node
  • Intrepid: IBM Blue Gene/P at ALCF, 4 cores/node
  • Both have a 3D torus interconnect.
Performance on Hopper: 24K particles, 6K cores
• Down is good.
• 95.6% reduction
Performance on Intrepid: 262K particles, 32K cores
• Down is good.
• 99.3% reduction
Strong Scaling on Intrepid: 262K particles
• Up is good.
• Perfect strong scaling
• 4.5x speedup
CA N-Body with Cutoff Distance
• No interactions beyond cutoff radius r
• Assuming:
  • uniform particle distribution
  • spatial processor decomposition
• Simple extension to support a cutoff:
  • still communication-optimal
  • works in space of any dimension
  • speedups from 1D and 2D experiments
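
The effect of a cutoff radius on the pairwise sum can be shown with a minimal serial sketch in one dimension. It illustrates only the "no interactions beyond r" rule, not the distributed algorithm; `direct_forces_cutoff` is a hypothetical name and `interact` is supplied by the caller.

```python
def direct_forces_cutoff(positions, r, interact):
    """All-pairs forces in 1D, skipping pairs farther apart than r."""
    n = len(positions)
    force = [0.0] * n
    for i in range(n):
        for j in range(n):
            # Skip self-interaction and any pair beyond the cutoff radius.
            if i != j and abs(positions[i] - positions[j]) <= r:
                force[i] += interact(positions[i], positions[j])
    return force


forces = direct_forces_cutoff(
    [0.0, 1.0, 5.0], r=2.0, interact=lambda a, b: a - b)
```

Here the particle at 5.0 lies outside the cutoff of both others, so it receives no force, while the first two interact only with each other.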
N-Body with Cutoff
[Figure: particles divided among Team 0, Team 1, …, Team p/c, with the cutoff diameter marked; processors arranged as c layers × p/c teams]
• Shifts occur modulo the cutoff distance.
• Optimality holds:
  • same counting argument
  • see paper for details
1D Simulation on Intrepid: 262K particles, 32K cores
• Down is good.
• 84.6% reduction
2D Simulation on Hopper: 196K particles, 24K cores
• Down is good.
• 74.8% reduction
Strong Scaling on Hopper: 2D space, 24K cores, 196K particles
• Up is good.
• Good strong scaling
Conclusions
• By using c times more memory, we reduce:
  • Words sent along critical path: by a factor of c.
  • Messages sent along critical path: by a factor of c^2.
• Theory: maximize c.
• Practice: tune for best c.
• Saw 99.5% reduction in communication (11.8x speedup).
• Applications beyond direct n-body:
  • collision detection algorithms
  • database joins
  • bottom solvers in hierarchical n-body codes
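
The "tune for best c" advice can be sketched as a simple search over candidate replication factors. Everything here is hypothetical: the function names, the restriction to values of c whose square divides p (an assumption of this sketch), and the `measure` callback, which stands in for timing a real run with a given c.

```python
import math


def valid_replication_factors(p):
    """Candidate c values, assuming c^2 must divide the processor count p."""
    return [c for c in range(1, math.isqrt(p) + 1) if p % (c * c) == 0]


def tune_c(p, measure):
    """Return the candidate c with the smallest measured cost.

    `measure` is a caller-supplied function mapping c to a run time.
    """
    return min(valid_replication_factors(p), key=measure)
```

For p = 16 the candidates are 1, 2, and 4. In practice the best value depends on the machine and on how much memory is available for replicas.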