Application-specific Topology-aware Mapping for Three Dimensional Topologies. Abhinav Bhatelé, Laxmikant V. Kalé
Outline • Motivation • The Mapping Problem • Static Mapping: 3D Stencil • Load Balancing: NAMD • Future Work
The network latency for wormhole routing is (L_f / B) × D + L / B, where L_f = length of each flit, B = bandwidth, D = number of hops, L = length of message. Source: Lionel M. Ni and Philip K. McKinley, “A Survey of Wormhole Routing Techniques in Direct Networks”, Computer, Volume 26, Issue 2, pages 62-76, 1993
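As a worked illustration of the latency formula above, the following minimal sketch evaluates it for two hop counts. The function name and the numeric values are assumptions chosen for readability, not measurements from any machine.

```python
def wormhole_latency(flit_len, msg_len, bandwidth, hops):
    """Latency model from the slide: (L_f / B) * D + L / B.

    flit_len  -- L_f, length of each flit
    msg_len   -- L, length of the message (same unit as flit_len)
    bandwidth -- B, units of length per unit of time
    hops      -- D, number of hops between source and destination
    """
    return (flit_len / bandwidth) * hops + msg_len / bandwidth


# Illustrative (made-up) values: 4-unit flits, 1000-unit message, B = 2.
near = wormhole_latency(flit_len=4, msg_len=1000, bandwidth=2, hops=1)
far = wormhole_latency(flit_len=4, msg_len=1000, bandwidth=2, hops=10)
print(near, far)
```

In this model the hop-dependent term grows linearly with D, while the L / B term is the same for both cases; the model does not account for contention on shared links.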
Message Latencies NN = Near Neighbor, RND = Random
Hardware Latencies • Blue Gene/L • Near neighbor: < 1 µs • Worst case: 7 µs • Blue Gene/P • Near neighbor: < 1 µs • Worst case: 5 µs • Corresponding differences for MPI messages
Topology-aware mapping • Problem: Given an object communication graph and a processor graph, find an optimal mapping that • Minimizes communication • Ensures load balance • Metric for communication traffic • Hop-bytes = number of links (hops) traversed × message size
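The hop-bytes metric can be sketched as follows. This example assumes a 3D torus (wraparound links in every dimension) and shortest-path routing; the function names, the message tuple layout, and the sample mapping are hypothetical.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus.

    Assumes wraparound links, so the distance along each dimension is
    the shorter of the direct and the wrapped path.
    """
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(messages, mapping, dims):
    """Total hop-bytes: hops traversed times message size, summed over messages.

    messages -- iterable of (src_object, dst_object, size_in_bytes)
    mapping  -- dict from object to processor coordinates (x, y, z)
    """
    return sum(torus_hops(mapping[s], mapping[d], dims) * n for s, d, n in messages)


# Made-up example on a 4 x 4 x 4 torus.
dims = (4, 4, 4)
mapping = {"a": (0, 0, 0), "b": (3, 0, 0), "c": (2, 2, 0)}
messages = [("a", "b", 100), ("a", "c", 10)]
print(hop_bytes(messages, mapping, dims))
```

A mapping that places heavily communicating objects on nearby processors lowers this total, which is what a topology-aware mapper tries to achieve.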
Machine Topology • Information required at runtime • No. of processors in the allocated partition • No. of processors along each dimension • Physical coordinates of each processor
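To illustrate the kind of information listed above, the sketch below converts between a linear processor rank and (x, y, z) coordinates. The x-fastest ordering is an assumption made for this example; on a real machine the coordinates come from the system or runtime rather than from arithmetic like this, and both function names are hypothetical.

```python
def rank_to_coords(rank, dims):
    """Map a linear rank to (x, y, z), assuming x varies fastest."""
    x_dim, y_dim, z_dim = dims
    if not 0 <= rank < x_dim * y_dim * z_dim:
        raise ValueError("rank outside the allocated partition")
    x = rank % x_dim
    y = (rank // x_dim) % y_dim
    z = rank // (x_dim * y_dim)
    return (x, y, z)


def coords_to_rank(coords, dims):
    """Inverse of rank_to_coords under the same ordering assumption."""
    x, y, z = coords
    x_dim, y_dim, _ = dims
    return x + x_dim * (y + y_dim * z)


dims = (4, 2, 2)  # processors along each dimension (made-up partition)
num_procs = dims[0] * dims[1] * dims[2]
print(num_procs, rank_to_coords(5, dims))
```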
Communication Graph • Static • 3D Stencil: regular communication graph • Dynamic • Molecular dynamics application • Changes as atoms migrate from one processor to another
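A regular communication graph of the static kind can be generated directly from the grid dimensions. The sketch below assumes a stencil in which each element exchanges data only with its face neighbors and the grid has no periodic boundaries; the slide does not specify either detail, and the function name is hypothetical.

```python
from itertools import product


def stencil_edges(nx, ny, nz):
    """Undirected edges of a 3D face-neighbor stencil on an nx x ny x nz grid.

    Each edge is listed once, from a grid point to its +x, +y, or +z
    neighbor, when that neighbor lies inside the grid.
    """
    edges = []
    for x, y, z in product(range(nx), range(ny), range(nz)):
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            nbr = (x + dx, y + dy, z + dz)
            if nbr[0] < nx and nbr[1] < ny and nbr[2] < nz:
                edges.append(((x, y, z), nbr))
    return edges


edges = stencil_edges(3, 3, 3)
center_degree = sum((1, 1, 1) in e for e in edges)
print(len(edges), center_degree)
```

Because the graph is known before the run starts, a mapping for it can be computed once; a dynamic graph such as the molecular dynamics case has to be re-measured as the simulation evolves.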
Dynamic Graph - NAMD • Molecular Dynamics (MD) application • Simulation box is a 3D cell full of atoms
Load Balancing in NAMD • Measurement-based (Charm++) • Principle of persistence • Patches are statically mapped • Orthogonal recursive bisection • Computes can be migrated • Load balancing framework gathers the communication information • Goal • Minimize communication • Maximize load balance
Old strategy • Greedy approach • Pick the heaviest compute • Place it on a processor with one of the patches OR • On a processor which already has a compute for this patch
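The greedy steps above can be sketched as follows. The slide does not say how to choose among several candidate processors, so picking the least-loaded candidate (ties broken by lowest processor index) is an assumption of this sketch, as are the function name and the data layout. It is not NAMD's actual implementation.

```python
def greedy_place(computes, patch_home, nprocs):
    """Place computes greedily, heaviest first.

    computes   -- list of (compute_id, load, patches_used)
    patch_home -- dict from patch to the processor it is statically mapped to
    Returns (placement, load): compute_id -> processor, and per-processor load.
    """
    load = [0.0] * nprocs
    placement = {}
    # Processors that hold a patch or already have a compute for it.
    procs_with_patch = {p: {home} for p, home in patch_home.items()}

    for cid, cload, patches in sorted(computes, key=lambda c: c[1], reverse=True):
        candidates = set()
        for p in patches:
            candidates |= procs_with_patch[p]
        # Assumption: least-loaded candidate wins, ties go to the lowest index.
        best = min(candidates, key=lambda proc: (load[proc], proc))
        placement[cid] = best
        load[best] += cload
        for p in patches:
            procs_with_patch[p].add(best)
    return placement, load


# Made-up example: two patches on processors 0 and 1, three computes.
patch_home = {"A": 0, "B": 1}
computes = [("c1", 5.0, ("A", "B")), ("c2", 3.0, ("A",)), ("c3", 2.0, ("B",))]
placement, load = greedy_place(computes, patch_home, nprocs=3)
print(placement, load)
```

Note that this sketch considers only where the patch data already lives, not how many hops separate the candidate processors.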
Hop-bytes ~17 %
Future Work • Reason for contention • Heavy communication exceeding bandwidth • Link contention (such as in deterministic routing) • Use UPC/PAPI on Blue Gene/L and P
Future Work • Automatic Mapping • Initial Static Mapping • Use case – meshing applications • Extend work on the Charm++ load balancers • Section-multicast aware load balancers • Useful in matrix multiplication
Future Work • Optimization on other topologies • SiCortex (Kautz Graph) • Infiniband clusters (Fat-tree)
Summary • Topology mapping helps! • Especially for heavily communication-bound applications • Static mapping • Dynamic mapping during load balancing • Automatic mapping to relieve the user