How Scalable is Your Load Balancer?
Ümit V. Çatalyürek, Doruk Bozdağ (The Ohio State University)
Karen D. Devine, Erik G. Boman (Sandia National Laboratories)
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Partitioning and Load Balancing
• Goal: assign data to processors to
  • minimize application runtime
  • maximize utilization of computing resources
• Metrics:
  • minimize processor idle time (balance workloads)
  • keep inter-processor communication costs low
• Impacts performance of a wide range of simulations:
  • Linear solvers & preconditioners
  • Adaptive mesh refinement
  • Contact detection
  • Particle simulations
[Figure: partitioned sparse linear system Ax = b]
Roadmap
• Background
  • Graph Partitioning and Partitioning Tools
  • Hypergraph Partitioning and Partitioning Tools
• State-of-the-art: Parallel Multilevel Partitioning
  • Data Layout/Decomposition
  • Recursive Bisection
  • Multilevel Partitioning: Coarsening, Coarse Partitioning, Refinement
• Results
Graph Partitioning
• Work-horse of the load-balancing community.
• Highly successful model for PDE problems.
• Model the problem as a graph:
  • vertices = work associated with data (computation)
  • edges = relationships between data/computation (communication)
• Goal: evenly distribute vertex weight while minimizing the weight of cut edges.
• Many successful algorithms: Kernighan, Lin, Simon, Hendrickson, Leland, Kumar, Karypis, et al.
• Excellent software available:
  • Serial: Chaco (SNL), Jostle (U. Greenwich), METIS (U. Minn.), Party (U. Paderborn), Scotch (U. Bordeaux)
  • Parallel: ParMETIS (U. Minn.), PJostle (U. Greenwich)
Limited Applicability of Graph Models
• Graph models assume symmetric, square problems.
  • Symmetric = undirected graph.
  • Square = inputs and outputs of the operation are the same size.
• Do not naturally support:
  • Non-symmetric systems.
    • Require a directed or bipartite graph.
    • Or partition A + A^T.
  • Rectangular systems.
    • Require decompositions for differently sized inputs and outputs.
    • Or partition AA^T.
[Figures: hexahedral finite element matrix; linear programming matrix for sensor placement]
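The A + A^T workaround above is easy to picture with a toy sparsity pattern. The 3x3 pattern below is invented purely for illustration; real codes work on the sparse matrix structure directly rather than Python sets.

```python
# Made-up nonsymmetric 3x3 pattern; A + A^T is the standard symmetrization so
# that an undirected graph partitioner can still be applied.
A = {(0, 1), (2, 0), (2, 1)}                     # nonzero pattern of a nonsymmetric A
sym_pattern = A | {(j, i) for (i, j) in A}       # pattern of A + A^T
edges = sorted({(min(i, j), max(i, j)) for (i, j) in sym_pattern if i != j})
print(edges)   # [(0, 1), (0, 2), (1, 2)] -- undirected edges handed to the partitioner
```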
Approximate Communication Metric in Graph Models
• Graph models assume: weight of cut edges = communication volume.
• But edge cuts only approximate communication volume.
• Good enough for many PDE applications.
• Not good enough for irregular problems.
[Figures: boundary vertices Vi, Vj, Vk, Vl, Vm, Vh across parts P1-P4; hexahedral finite element matrix; Xyce ASIC matrix]
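To make the gap concrete, here is a minimal, hand-made example (not from the talk) for y = Ax with rows and x distributed conformally: vertex i lives on P1, and rows j, k, l on P2 all need x_i. The edge cut charges three units, while only one word of data actually moves.

```python
# Cut edges correspond to structural nonzeros a_ji, a_ki, a_li; names are illustrative.
cut_edges = [("i", "j"), ("i", "k"), ("i", "l")]
owner = {"i": "P1", "j": "P2", "k": "P2", "l": "P2"}

# Graph metric: every cut edge counts once.
edge_cut = sum(1 for u, v in cut_edges if owner[u] != owner[v])

# Actual volume: x_i is sent once to each *distinct* remote part that needs it.
needed_by = {owner[v] for u, v in cut_edges if owner[v] != owner["i"]}
volume = len(needed_by)

print(edge_cut)  # 3
print(volume)    # 1 -- the edge-cut metric overcounts by a factor of 3 here
```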
Hypergraph Partitioning
• Work-horse of the VLSI community.
• Recently adopted by the scientific computing community.
• Model the problem as a hypergraph (Çatalyürek & Aykanat):
  • vertices = work associated with data (computation)
  • hyperedges (nets) = data elements, computation-to-data dependencies (communication)
• Goal: evenly distribute vertex weight while minimizing cut size.
• Many successful algorithms: Kernighan, Schweikert, Fiduccia, Mattheyses, Sanchis, Alpert, Kahng, Hauck, Borriello, Çatalyürek, Aykanat, Karypis, et al.
• Excellent serial software available: hMETIS (Karypis), PaToH (Çatalyürek), Mondriaan (Bisseling)
• Parallel partitioners needed for large, dynamic problems: Zoltan PHG (Sandia), Parkway (Trifunovic)
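For contrast with the graph metric, here is a minimal sketch of the connectivity-minus-one cut metric commonly used by hypergraph partitioners. The net and vertex names are illustrative; the single net corresponds to the small example above, where it reproduces the true communication volume.

```python
def connectivity_minus_1(nets, part):
    """Sum over nets of (number of distinct parts the net touches - 1)."""
    total = 0
    for pins in nets.values():
        parts = {part[v] for v in pins}
        total += len(parts) - 1
    return total

# Column-net view of the previous example: net n_i collects row i plus the
# rows j, k, l that need x_i.
nets = {"n_i": ["i", "j", "k", "l"]}
part = {"i": 1, "j": 2, "k": 2, "l": 2}
print(connectivity_minus_1(nets, part))  # 1 == true communication volume
```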
Impact of Hypergraph Models
• Greater expressiveness → greater applicability:
  • Structurally non-symmetric systems (circuits, biology)
  • Rectangular systems (linear programming, least-squares methods)
  • Non-homogeneous, highly connected topologies (circuits, nanotechnology, databases)
• Multiple models for different partitioning granularity: owner-computes, fine-grain, checkerboard/cartesian, Mondriaan.
• Accurate communication model → lower application communication costs.
[Figures: hypergraph with vertices Vi-Vh and nets ni-nm across parts P1, P3, P4; Mondriaan partitioning, courtesy of Rob Bisseling]
Design Decisions
• Single-level vs. multi-level [serial/parallel]
• Direct k-way vs. recursive bisection [serial/parallel]
• Data layout [parallel]
• A chicken-and-egg problem
Recursive Bisection
• Recursive bisection approach:
  • Partition data into two sets.
  • Recursively subdivide each set into two sets.
  • Only minor modifications needed to allow P ≠ 2^n.
• How to parallelize:
  • Start on a single node; use additional nodes in each recursion.
    • Doesn't solve the memory problem.
  • Solve each recursion (hence each level of the multilevel scheme) in parallel.
• At each recursion, two split options (see the sketch after this list):
  • Split only the data into two sets; use all processors to compute each branch.
  • Split both the data and the processors into two sets; solve the branches in parallel.
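A schematic serial sketch of the second split option, assuming a stand-in 2-way partitioner `naive_bisect` (this is not Zoltan's code); it only illustrates how the data and the processor set shrink together, including the P ≠ 2^n case.

```python
def naive_bisect(items, weight_ratio):
    """Placeholder bisector: split items by position according to the requested ratio."""
    k = round(len(items) * weight_ratio)
    return items[:k], items[k:]

def recursive_bisection(items, procs, bisect, parts=None, offset=0):
    """Assign each item a part id in [offset, offset + len(procs))."""
    if parts is None:
        parts = {}
    if len(procs) == 1:
        for v in items:
            parts[v] = offset
        return parts
    half = len(procs) // 2                         # handles P != 2^n naturally
    left, right = bisect(items, weight_ratio=half / len(procs))
    recursive_bisection(left, procs[:half], bisect, parts, offset)
    recursive_bisection(right, procs[half:], bisect, parts, offset + half)
    return parts

print(recursive_bisection(list(range(10)), procs=[0, 1, 2], bisect=naive_bisect))
# e.g. {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 2, 9: 2}
```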
Parallel Data Layout: Hypergraph
• Zoltan uses a 2D data layout within the hypergraph partitioner.
• Matrix representation of hypergraphs (Çatalyürek & Aykanat):
  • vertices == columns
  • nets (hyperedges) == rows
• Vertex/hyperedge broadcasts reach only sqrt(P) processors.
• Maintains scalable memory usage.
  • No "ghosting" of off-processor neighbor info.
• Differs from parallel graph partitioners and Parkway.
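A toy illustration of the 2D (checkerboard) pin distribution idea: the P processors are viewed as a grid, so a vertex is shared only within one processor column and a net only within one processor row. The modulo mapping and the name `pin_owner` are assumptions for illustration, not Zoltan's actual layout routine.

```python
import math

def pin_owner(net_i, vtx_j, num_procs):
    px = int(math.sqrt(num_procs))      # assume a square processor grid for simplicity
    py = num_procs // px
    row = net_i % px                    # processor row holding net i
    col = vtx_j % py                    # processor column holding vertex j
    return row * py + col               # rank owning pin (i, j)

# With P = 16, any single vertex (column) touches only sqrt(P) = 4 ranks,
# so a vertex broadcast stays within one processor column:
owners_of_vertex_3 = {pin_owner(i, 3, 16) for i in range(100)}
print(sorted(owners_of_vertex_3))       # [3, 7, 11, 15]
```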
Coarsening via Parallel Matching
• Vertex-based greedy maximal-weight matching algorithms:
  • Graph: heavy-edge matching
  • Hypergraph: heavy-connectivity matching (Çatalyürek), inner-product matching (Bisseling)
    • Match columns (vertices) with the greatest inner product, i.e., the greatest similarity in connectivity (serial sketch below).
• BSP style, with or without colorings/phases (rounds):
  • ParMETIS: 2C or 2S (S ~= 4) communication steps
    • Each processor concurrently decides possibly conflicting matchings.
    • 2 communications to resolve conflicts.
  • Zoltan: 4R communication steps (each w/ max procs)
    • Broadcast a subset of vertices ("candidates") along the processor row.
    • Compute (partial) inner products of received candidates with local vertices.
    • Accrue inner products in the processor column.
    • Identify the best local matches for received candidates.
    • Send the best matches to the candidates' owners.
    • Select the best global match for each owned candidate.
    • Send "match accepted" messages to processors owning matched vertices.
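A serial sketch of the inner-product idea behind the rounds above, not the parallel protocol itself; vertex and net identifiers are made up.

```python
def inner_product_matching(vtx_nets):
    """vtx_nets: dict vertex -> set of nets containing it. Greedy maximal matching."""
    match = {}
    for v, nets_v in vtx_nets.items():
        if v in match:
            continue
        best, best_ip = None, 0
        for u, nets_u in vtx_nets.items():
            if u == v or u in match:
                continue
            ip = len(nets_v & nets_u)        # inner product = number of shared nets
            if ip > best_ip:
                best, best_ip = u, ip
        if best is not None:
            match[v], match[best] = best, v
    return match

vtx_nets = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {4, 5}, "d": {5, 6}}
print(inner_product_matching(vtx_nets))      # a-b matched (2 shared nets), c-d (1 shared net)
```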
Contraction
• Merge matched vertices.
• Graph/hypergraph: communicate the match array.
• Graph:
  • Communicate the adjacency lists of matched vertices owned by different processors.
• Hypergraph:
  • Merging vertices reduces only one dimension.
  • To reduce the other dimension (sketched below):
    • Remove size-1 nets.
    • Remove identical nets.
      • Compute hashes via horizontal communication.
      • Compare nets via vertical communication.
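A toy, serial illustration of identical-net removal: nets are hashed first (the cheap step), and only nets landing in the same bucket are compared pin by pin (the expensive step). Function names and data are illustrative; in practice a removed duplicate's weight would be added to the kept net, and in Zoltan the two steps map to the horizontal and vertical communications listed above.

```python
from collections import defaultdict

def remove_duplicate_nets(nets):
    """nets: dict net_id -> frozenset of coarse vertices (pins)."""
    buckets = defaultdict(list)
    for nid, pins in nets.items():
        if len(pins) <= 1:                  # size-1 nets can never be cut: drop them
            continue
        buckets[hash(pins)].append(nid)
    kept = {}
    for candidates in buckets.values():
        seen = {}
        for nid in candidates:              # exact comparison only within a hash bucket
            pins = nets[nid]
            if pins not in seen:
                seen[pins] = nid
                kept[nid] = pins
    return kept

nets = {"n1": frozenset({1, 2}), "n2": frozenset({1, 2}),
        "n3": frozenset({3}), "n4": frozenset({2, 4})}
print(remove_duplicate_nets(nets))          # n3 dropped (size 1), n2 folded into n1
```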
Coarse Partitioning
• Gather the coarsest graph/hypergraph onto each processor.
• Bisection/k-way (Zoltan):
  • Compute several different coarse partitions on each processor.
  • Select the best local partition.
  • Compute the best over all processors.
• k-way (ParMETIS):
  • Do recursive bisection; each processor follows a single path in the recursion.
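A rough serial sketch of the "several coarse partitions, keep the best" idea. `random_bisection` and `cut_nets` are illustrative stand-ins, the balance constraint is ignored here, and in the parallel code the per-processor winners would be compared with a global reduction.

```python
import random

def random_bisection(vertices, seed):
    """Cheap randomized starting point; a real coarse partitioner uses greedy growth."""
    rng = random.Random(seed)
    return {v: rng.randint(0, 1) for v in vertices}

def cut_nets(nets, part):
    """Number of nets spanning both sides of the bisection."""
    return sum(1 for pins in nets.values() if len({part[v] for v in pins}) > 1)

def best_coarse_partition(vertices, nets, tries=8):
    candidates = [random_bisection(vertices, s) for s in range(tries)]
    return min(candidates, key=lambda p: cut_nets(nets, p))

nets = {"n1": ["a", "b"], "n2": ["b", "c"], "n3": ["c", "d"]}
best = best_coarse_partition(["a", "b", "c", "d"], nets)
print(best, cut_nets(nets, best))
```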
Uncoarsening/Refinement
• Project the coarse partition to the finer graph/hypergraph.
• Use local optimization (KL/FM) to improve balance and reduce cuts.
• Repeat until the finest graph/hypergraph is reached.
• Each FM round consists of a number of phases (colors):
  • Graph: C or S = 2 communication steps per round (max 2 rounds).
  • Hypergraph: 3 communication steps per round (max 10 rounds), but concurrency is limited:
    • Determine a "root" processor in each processor column: the processor with the most nonzeros.
    • Compute and communicate the pin distribution.
    • Compute and communicate gains (see the gain sketch below).
    • The root processor computes moves for vertices in its processor column.
    • Communicate the best moves and update the pin distribution.
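A serial sketch of the FM-style move gain used during refinement, for a bisection under the connectivity-1 metric; this is not the parallel root/column protocol above, and the names are illustrative.

```python
def move_gain(v, vtx_nets, nets, part):
    """Cut reduction from moving v to the other side of a bisection."""
    src, dst = part[v], 1 - part[v]
    gain = 0
    for n in vtx_nets[v]:
        pins_src = sum(1 for u in nets[n] if part[u] == src)
        pins_dst = sum(1 for u in nets[n] if part[u] == dst)
        if pins_src == 1:          # v is the net's last pin on src: net becomes uncut
            gain += 1
        if pins_dst == 0:          # net had no pin on dst: the move cuts it
            gain -= 1
    return gain

nets = {"n1": ["a", "b"], "n2": ["a", "c"]}
vtx_nets = {"a": ["n1", "n2"], "b": ["n1"], "c": ["n2"]}
part = {"a": 0, "b": 1, "c": 0}
print(move_gain("a", vtx_nets, nets, part))   # n1 uncut (+1), n2 newly cut (-1) -> 0
```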
Experimental Results
• Experiments on:
  • Thunderbird cluster: 4,480 compute nodes connected with InfiniBand; dual 3.6 GHz Intel EM64T processors with 6 GB RAM
  • BMI-RI cluster: 64 compute nodes connected with InfiniBand; dual 2.4 GHz AMD Opteron processors with 8 GB RAM
• Zoltan v2.1 hypergraph partitioner & ParMETIS v3.1 graph partitioner
• Test problems:
  • Xyce ASIC Stripped: 680K x 680K; 2.3M nonzeros
  • Cage15 electrophoresis: 5.1M x 5.1M; 99M nonzeros
  • Mrng4: 7.5M x 7.5M; ~67M nonzeros
  • 2DGrid: 16M x 16M; ~88M nonzeros
Results: Weak Scaling
• On the BMI-RI cluster.
• cage12 has ~130K vertices; cage13 is 3.4x, cage14 is 11.5x, and cage15 is 39.5x the size of cage12.
Results: Weak Scaling
• On the BMI-RI cluster.
• ParMETIS on the Cray T3E uses SHMEM.
• mrng1 has ~500K vertices (p=2); mrng2 is 4x, mrng3 is 16x, and mrng4 is 32x the size of mrng1.
Results: Weak Scaling
• On the BMI-RI cluster.
• On average, 15% smaller communication volume with the hypergraph partitioner!
Results: Strong Scaling
• On the Thunderbird cluster.
• On average, 17% smaller communication volume with the hypergraph partitioner.
Results: Strong Scaling
• On the Thunderbird cluster.
• Balance problems with ParMETIS.
Conclusion
• Do we need parallel partitioning?
  • It is a must, but not for speed; because of memory! Big problems need big machines.
• Do load-balancers scale?
  • Yes, they scale, but they need to be improved to scale to petascale.
  • Memory usage becomes a problem for larger problems and processor counts.
• One size won't fit all: many things depend on the application.
  • The model depends on the application.
  • Trade-offs.
  • Techniques may depend on the application/data.
  • Performance depends on the application/data => bring us more applications/data :)
CSCAPES: Combinatorial Scientific Computing and Petascale Simulation
• A SciDAC Institute funded by DOE's Office of Science
• Alex Pothen, Florin Dobrian, Assefaw Gebremedhin
• Erik Boman, Karen Devine, Bruce Hendrickson
• Paul Hovland, Boyana Norris, Jean Utke
• Michelle Strout
• Umit Catalyurek
• www.cscapes.org