How Scalable is Your Load Balancer?
Ümit V. Çatalyürek, Doruk Bozdağ (The Ohio State University)
Karen D. Devine, Erik G. Boman (Sandia National Laboratories)
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Partitioning and Load Balancing
[Figure: matrix-vector system Ax = b]
• Goal: assign data to processors to
  • minimize application runtime
  • maximize utilization of computing resources
• Metrics:
  • minimize processor idle time (balance workloads)
  • keep inter-processor communication costs low
• Impacts performance of a wide range of simulations: linear solvers & preconditioners, adaptive mesh refinement, contact detection, particle simulations
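One common way to make these metrics precise (a standard formulation, not one stated on the slide) is the load-imbalance ratio over part weights W_p and the total weight of cut (hyper)edges:

```latex
% Load imbalance over P parts with part weights W_p, and total cut weight
% of a partition \Pi (standard definitions, assumed here):
\lambda \;=\; \frac{\max_{p} W_p}{\tfrac{1}{P}\sum_{p=1}^{P} W_p},
\qquad
\mathrm{cut}(\Pi) \;=\; \sum_{e \,\text{cut by}\, \Pi} w(e)
```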
Roadmap
• Background
  • Graph Partitioning and Partitioning Tools
  • Hypergraph Partitioning and Partitioning Tools
• State-of-the-art: Parallel Multilevel Partitioning
  • Data Layout/Decomposition
  • Recursive Bisection
  • Multilevel Partitioning: Coarsening, Coarse Partitioning, Refinement
• Results
Graph Partitioning
• Work-horse of the load-balancing community.
• Highly successful model for PDE problems.
• Model the problem as a graph:
  • vertices = work associated with data (computation)
  • edges = relationships between data/computation (communication)
• Goal: evenly distribute vertex weight while minimizing the weight of cut edges.
• Many successful algorithms: Kernighan, Lin, Simon, Hendrickson, Leland, Kumar, Karypis, et al.
• Excellent software available:
  • Serial: Chaco (SNL), Jostle (U. Greenwich), METIS (U. Minn.), Party (U. Paderborn), Scotch (U. Bordeaux)
  • Parallel: ParMETIS (U. Minn.), PJostle (U. Greenwich)
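A minimal sketch (plain Python, not Chaco/METIS code) of the objective above: given vertex weights, weighted edges, and a part assignment, report the part weights and the total weight of cut edges.

```python
# Illustrative only: evaluate a graph partition by the two metrics above,
# per-part vertex weight (balance) and weight of cut edges (communication proxy).
def evaluate_graph_partition(edges, vwgt, part, num_parts):
    """edges: list of (u, v, w); vwgt: vertex weights; part[v]: part id of v."""
    part_weight = [0.0] * num_parts
    for v, w in enumerate(vwgt):
        part_weight[part[v]] += w
    cut = sum(w for u, v, w in edges if part[u] != part[v])
    return part_weight, cut

# Toy 4-vertex path graph split into two parts.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
print(evaluate_graph_partition(edges, [1, 1, 1, 1], [0, 0, 1, 1], 2))
# -> ([2.0, 2.0], 1.0): balanced, one cut edge
```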
Limited Applicability of Graph Models
[Figures: hexahedral finite element matrix; linear programming matrix for sensor placement]
• Assume symmetric, square problems.
  • Symmetric = undirected graph.
  • Square = inputs and outputs of the operation are the same size.
• Do not naturally support:
  • Non-symmetric systems: require a directed or bipartite graph, or partition A + A^T.
  • Rectangular systems: require decompositions for differently sized inputs and outputs, or partition AA^T.
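A small sketch of the A + A^T workaround mentioned above, assuming the matrix is given as a list of nonzero (row, column) positions; a graph partitioner would then run on the resulting undirected adjacency. The input pattern here is made up.

```python
# Structural symmetrization: the union of the patterns of A and A^T
# defines the undirected graph handed to a graph partitioner.
from collections import defaultdict

def symmetrized_adjacency(nonzeros):
    adj = defaultdict(set)
    for i, j in nonzeros:
        if i != j:              # ignore the diagonal
            adj[i].add(j)       # edge from A(i, j)
            adj[j].add(i)       # and from A^T(j, i)
    return adj

print(dict(symmetrized_adjacency([(0, 1), (2, 0)])))
# -> {0: {1, 2}, 1: {0}, 2: {0}}
```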
Approximate Communication Metric in Graph Models
[Figure: vertex Vi with neighbors Vj, Vk, Vl, Vm, Vh across parts P1-P4; hexahedral finite element matrix; Xyce ASIC matrix]
• Graph models assume weight of edge cuts = communication volume.
• But edge cuts only approximate communication volume.
  • Good enough for many PDE applications.
  • Not good enough for irregular problems.
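A toy version of the figure's point (the vertex and part ids are made up): a vertex whose neighbors cluster on a few remote parts is charged once per cut edge by the graph metric, but its data is actually sent to each remote part only once.

```python
# Vertex 0 lives on part 1; its five neighbors live on parts 2, 2, 3, 3, 4.
neighbors_of_v0 = {1: 2, 2: 2, 3: 3, 4: 3, 5: 4}   # neighbor id -> its part
part_of_v0 = 1

cut_edges = sum(1 for p in neighbors_of_v0.values() if p != part_of_v0)
true_volume = len({p for p in neighbors_of_v0.values() if p != part_of_v0})
print(cut_edges, true_volume)   # 5 vs 3: edge cut overestimates the volume
```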
Hypergraph Partitioning
• Work-horse of the VLSI community; recently adopted by the scientific community.
• Model the problem as a hypergraph (Çatalyürek & Aykanat):
  • vertices = work associated with data (computation)
  • hyperedges (nets) = data elements, computation-to-data dependencies (communication)
• Goal: evenly distribute vertex weight while minimizing cut size.
• Many successful algorithms: Kernighan, Schweikert, Fiduccia, Mattheyses, Sanchis, Alpert, Kahng, Hauck, Borriello, Çatalyürek, Aykanat, Karypis, et al.
• Excellent serial software available: hMETIS (Karypis), PaToH (Çatalyürek), Mondriaan (Bisseling)
• Parallel partitioners needed for large, dynamic problems: Zoltan PHG (Sandia), Parkway (Trifunovic)
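A sketch of the column-net hypergraph model and its connectivity-1 cut metric (the standard Çatalyürek & Aykanat formulation, written here in plain Python): vertices are matrix columns, each row becomes a net containing the columns with a nonzero in that row, and a net spanning k parts costs k-1.

```python
def connectivity_minus_1_cut(nets, part):
    """nets: list of pin lists (vertex ids); part[v]: part id of vertex v."""
    cut = 0
    for pins in nets:
        parts_touched = {part[v] for v in pins}
        cut += len(parts_touched) - 1        # a net spanning k parts costs k-1
    return cut

# Three rows (nets) over four columns (vertices), split into two parts.
nets = [[0, 1, 2], [2, 3], [0, 3]]
print(connectivity_minus_1_cut(nets, {0: 0, 1: 0, 2: 1, 3: 1}))   # 1 + 0 + 1 = 2
```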
Impact of Hypergraph Models
[Figure: hypergraph with vertices Vh-Vm and nets nh-nm spanning parts P1, P3, P4]
• Greater expressiveness → greater applicability:
  • Structurally non-symmetric systems: circuits, biology
  • Rectangular systems: linear programming, least-squares methods
  • Non-homogeneous, highly connected topologies: circuits, nanotechnology, databases
• Multiple models for different partitioning granularity: owner-computes, fine-grain, checkerboard/Cartesian, Mondriaan
• Accurate communication model → lower application communication costs.
[Figure: Mondriaan partitioning, courtesy of Rob Bisseling]
Design Decisions
• Single-level vs. multi-level [serial/parallel]
• Direct k-way vs. recursive bisection [serial/parallel]
• Data layout [parallel]
• Chicken-and-egg problem: the data must already be distributed before it can be partitioned
Recursive Bisection
• Recursive bisection approach:
  • Partition the data into two sets.
  • Recursively subdivide each set into two sets.
  • Only minor modifications needed to allow P ≠ 2^n.
• How to parallelize:
  • Start on a single node; use additional nodes in each recursion.
    • Doesn't solve the memory problem.
  • Solve each recursion (hence each level of the multilevel scheme) in parallel.
• At each recursion, two split options (see the sketch below):
  • Split only the data into two sets; use all processors to compute each branch.
  • Split both the data and the processors into two sets; solve the branches in parallel.
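A minimal serial sketch of the second split option, splitting both data and processors at each level; `bisect` stands in for a real (multilevel) bisection routine, and the naive position-based bisector is illustrative only.

```python
def recursive_bisection(vertices, procs, bisect):
    if len(procs) == 1:
        return {v: procs[0] for v in vertices}
    # Allow P != 2^n: size the two halves proportionally to the processor split.
    left_procs, right_procs = procs[: len(procs) // 2], procs[len(procs) // 2 :]
    target = len(left_procs) / len(procs)
    left, right = bisect(vertices, target)       # split the data into two sets
    mapping = recursive_bisection(left, left_procs, bisect)       # in parallel:
    mapping.update(recursive_bisection(right, right_procs, bisect))  # separate proc groups
    return mapping

# Trivial stand-in bisector: split by position, honoring the target fraction.
naive_bisect = lambda vs, t: (vs[: round(len(vs) * t)], vs[round(len(vs) * t):])
print(recursive_bisection(list(range(10)), [0, 1, 2], naive_bisect))
```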
Parallel Data Layout: Hypergraph
• Zoltan uses a 2D data layout within its hypergraph partitioner.
• Matrix representation of hypergraphs (Çatalyürek & Aykanat):
  • vertices == columns
  • nets (hyperedges) == rows
• Vertex/hyperedge broadcasts go to only sqrt(P) processors.
• Maintains scalable memory usage:
  • No "ghosting" of off-processor neighbor info.
• Differs from parallel graph partitioners and Parkway.
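A sketch of one possible 2D layout of the kind described above, assuming a square sqrt(P) x sqrt(P) processor grid and simple cyclic row/column maps; the actual Zoltan distribution may differ.

```python
import math

def pin_owner(row, col, P):
    q = math.isqrt(P)                       # assume a q x q processor grid
    return (row % q) * q + (col % q)        # nonzero (row, col) -> processor rank

# Vertex (column) j is shared only by the q processors in grid column j % q;
# net (row) i is shared only by the q processors in grid row i % q.
print(pin_owner(5, 7, 16))                  # (5 % 4) * 4 + (7 % 4) = 7
```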
Coarsening via Parallel Matching
• Vertex-based greedy maximal-weight matching algorithms:
  • Graph: heavy-edge matching
  • Hypergraph: heavy-connectivity matching (Çatalyürek), inner-product matching (Bisseling)
    • Match columns (vertices) with the greatest inner product → greatest similarity in connectivity
• BSP style, with and without coloring/phases or rounds:
  • ParMETIS: 2C or 2S (S ≈ 4) communication steps
    • Each processor concurrently makes matching decisions that may conflict.
    • 2 communications to resolve conflicts.
  • Zoltan: 4R communication steps, each along a processor row or column:
    • Broadcast a subset of vertices ("candidates") along the processor row.
    • Compute (partial) inner products of received candidates with local vertices.
    • Accrue inner products in the processor column.
    • Identify the best local matches for received candidates.
    • Send the best matches to the candidates' owners.
    • Select the best global match for each owned candidate.
    • Send "match accepted" messages to processors owning matched vertices.
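A serial sketch of inner-product matching, to show the similarity measure only; the parallel candidate/row/column protocol is the one listed above. Columns are represented as sets of row indices, so the (unweighted) inner product of two columns is the number of rows they share.

```python
def greedy_inner_product_matching(columns):
    """columns: dict column_id -> set of row indices with a nonzero."""
    matched, match = set(), {}
    for u, rows_u in columns.items():
        if u in matched:
            continue
        best, best_ip = None, 0
        for v, rows_v in columns.items():
            if v == u or v in matched:
                continue
            ip = len(rows_u & rows_v)        # similarity in connectivity
            if ip > best_ip:
                best, best_ip = v, ip
        if best is not None:                 # greedily accept the heaviest match
            match[u], match[best] = best, u
            matched.update({u, best})
    return match

cols = {0: {0, 1}, 1: {1, 2}, 2: {0, 1, 2}, 3: {3}}
print(greedy_inner_product_matching(cols))   # {0: 2, 2: 0}; 1 and 3 stay unmatched
```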
Contraction
• Merge matched vertices.
• Graph/Hypergraph: communicate the match array.
• Graph:
  • Communicate adjacency lists of matched vertices owned by different processors.
• Hypergraph:
  • Merging vertices reduces only one dimension.
  • To reduce the other dimension:
    • Remove size-1 nets.
    • Remove identical nets:
      • Compute hashes via horizontal communication.
      • Compare nets via vertical communication.
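A serial sketch of the net-pruning step above (dropping size-1 nets and collapsing identical nets); the `frozenset` key stands in for the distributed hash-then-compare scheme along the two grid dimensions.

```python
def prune_nets(nets):
    """nets: list of pin lists after contraction."""
    seen, kept = set(), []
    for pins in nets:
        if len(pins) <= 1:                    # a size-1 net can never be cut
            continue
        key = frozenset(pins)                 # stand-in for the distributed hash
        if key in seen:                       # identical net: merge (weights would add)
            continue
        seen.add(key)
        kept.append(sorted(pins))
    return kept

print(prune_nets([[3], [0, 1], [1, 0], [0, 2]]))   # [[0, 1], [0, 2]]
```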
Coarse Partitioning
• Gather the coarsest graph/hypergraph onto each processor.
• Bisection/k-way (Zoltan):
  • Compute several different coarse partitions on each processor.
  • Select the best local partition.
  • Compute the best over all processors.
• k-way (ParMETIS):
  • Do recursive bisection; each processor follows a single path in the recursion.
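A sketch of the "compute several, keep the best" idea in the Zoltan bullet; the random bisection and the `cut_size` callback are placeholders for a real greedy coarse partitioner and cut evaluator, and in parallel the final selection would be a global reduction over processors.

```python
import random

def best_coarse_partition(vertices, cut_size, trials=32, tol=1.25):
    best, best_cut = None, float("inf")
    for _ in range(trials):
        part = {v: random.randint(0, 1) for v in vertices}        # candidate bisection
        sizes = [sum(1 for p in part.values() if p == s) for s in (0, 1)]
        if max(sizes) > tol * len(vertices) / 2:                  # reject imbalanced ones
            continue
        c = cut_size(part)
        if c < best_cut:                                          # keep the lowest cut
            best, best_cut = part, c
    return best, best_cut    # in parallel: an allreduce picks the global best

# Placeholder cut function; a real one would use the gathered coarse hypergraph.
print(best_coarse_partition(list(range(8)), lambda part: sum(part.values())))
```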
Uncoarsening/Refinement
• Project the coarse partition to the finer graph/hypergraph.
• Use local optimization (KL/FM) to improve balance and reduce cuts.
• Repeat until the finest graph/hypergraph is reached.
• Each FM round consists of a number of phases (colors):
  • Graph: C or S = 2 communication steps per round (max 2 rounds).
  • Hypergraph: 3 communication steps per round (max 10 rounds), but concurrency is limited:
    • Determine a "root" processor in each processor column: the processor with the most nonzeros.
    • Compute and communicate the pin distribution.
    • Compute and communicate gains.
    • The root processor computes moves for the vertices in its processor column.
    • Communicate the best moves and update the pin distribution.
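A sketch of the FM move gain for the graph case (KL/FM as named above): the gain of moving a vertex to the other side is the weight of cut edges it removes minus the weight of internal edges it turns into cut edges. Names and the toy graph are illustrative.

```python
def fm_gain(v, adj, part):
    """adj[v]: iterable of (neighbor, edge_weight); part[v]: part id of v."""
    external = sum(w for u, w in adj[v] if part[u] != part[v])   # cut edges removed
    internal = sum(w for u, w in adj[v] if part[u] == part[v])   # edges that become cut
    return external - internal

adj = {0: [(1, 1.0), (2, 1.0)], 1: [(0, 1.0)], 2: [(0, 1.0)]}
part = {0: 0, 1: 1, 2: 1}
print(fm_gain(0, adj, part))   # 2.0: moving vertex 0 removes both cut edges
```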
Experimental Results
[Figures: Xyce ASIC stripped matrix; cage electrophoresis matrix]
• Experiments on:
  • Thunderbird cluster: 4,480 compute nodes connected with InfiniBand; dual 3.6 GHz Intel EM64T processors with 6 GB RAM.
  • BMI-RI cluster: 64 compute nodes connected with InfiniBand; dual 2.4 GHz AMD Opteron processors with 8 GB RAM.
• Zoltan v2.1 hypergraph partitioner & ParMETIS v3.1 graph partitioner.
• Test problems:
  • Xyce ASIC stripped: 680K x 680K; 2.3M nonzeros
  • cage15 electrophoresis: 5.1M x 5.1M; 99M nonzeros
  • mrng4: 7.5M x 7.5M; ~67M nonzeros
  • 2DGrid: 16M x 16M; ~88M nonzeros
Results: Weak Scaling
• On the BMI-RI cluster.
• cage12 has ~130K vertices; cage13, cage14, and cage15 are 3.4x, 11.5x, and 39.5x the size of cage12, respectively.
Results: Weak Scaling
• On the BMI-RI cluster.
• ParMETIS on the Cray T3E uses SHMEM.
• mrng1 has ~500K vertices (p=2); mrng2, mrng3, and mrng4 are 4x, 16x, and 32x the size of mrng1, respectively.
Results: Weak Scaling
• On the BMI-RI cluster.
• On average, 15% smaller communication volume with the hypergraph partitioner.
Results: Strong Scaling
• On the Thunderbird cluster.
• On average, 17% smaller communication volume with the hypergraph partitioner.
Results: Strong Scaling
• On the Thunderbird cluster.
• Balance problems with ParMETIS.
Conclusion
• Do we need parallel partitioning?
  • It is a must, but not for speed: because of memory! Big problems need big machines!
• Do load balancers scale?
  • Yes, they scale, but they need improvement to reach petascale.
  • Memory usage becomes a problem for larger problems and processor counts.
• One size won't fit all: many things depend on the application.
  • The model depends on the application (trade-offs).
  • Techniques may depend on the application/data.
  • Performance depends on the application/data => bring us more applications/data :)
CSCAPES: Combinatorial Scientific Computing and Petascale Simulation
• A SciDAC Institute funded by DOE's Office of Science.
• Alex Pothen, Florin Dobrian, Assefaw Gebremedhin
• Erik Boman, Karen Devine, Bruce Hendrickson
• Paul Hovland, Boyana Norris, Jean Utke
• Michelle Strout
• Umit Catalyurek
• www.cscapes.org