Domain decomposition in parallel computing
COT 5410 – Spring 2004
Ashok Srinivasan
www.cs.fsu.edu/~asriniva
Florida State University
Outline
• Background
• Geometric partitioning
• Graph partitioning
  • Static
  • Dynamic
• Important points
Background
• Tasks in a parallel computation need access to certain data
• The same datum may be needed by multiple tasks
  • Example: in matrix-vector multiplication c = A b, b_2 is needed for the computation of every c_i, 1 ≤ i ≤ n, since c_i = Σ_j a_ij b_j
• If a process does not “own” a datum needed by its task, it has to get it from a process that has it
  • This communication is expensive
• Aims of domain decomposition
  • Distribute the data so that the required communication is minimized
  • Ensure that the computational loads on processes are balanced
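As a concrete illustration, here is a minimal numpy sketch (the two-process split is hypothetical) of why a row-distributed matrix-vector product needs all of b on every process:

```python
import numpy as np

# Illustrative sketch: row-wise distribution of c = A b across 2 "processes".
# Each process owns a block of rows of A, but every c_i = sum_j a_ij * b_j
# still needs the entire vector b -- including b_2.
n = 4
A = np.arange(n * n, dtype=float).reshape(n, n)
b = np.arange(n, dtype=float)

c_proc0 = A[0:2, :] @ b   # process 0 owns rows 0..1, needs all of b
c_proc1 = A[2:4, :] @ b   # process 1 owns rows 2..3, needs all of b too

assert np.allclose(np.concatenate([c_proc0, c_proc1]), A @ b)
```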
Domain decomposition example
• Finite difference computation
  • The new value of a node depends on the old values of its neighbors
• We want to divide the nodes amongst the processes so that
  • Communication is minimized (the measure of partition quality)
  • Computational load is evenly balanced
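For concreteness, a sketch of the kind of update involved; this is a 5-point Jacobi-style stencil, and the averaging coefficients are illustrative:

```python
import numpy as np

# Each interior node's new value depends on the old values of its four
# neighbors, so a process owning a block of nodes must receive the boundary
# rows/columns of its neighbors' blocks every iteration.
def stencil_step(u):
    new = u.copy()
    new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
    return new

u = np.random.rand(8, 8)
u = stencil_step(u)
```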
Geometric partitioning
• Partition a set of points
  • Uses only coordinate information
  • Balances the load
  • The heuristic tries to ensure that communication costs are low
• Algorithms are typically fast, but the partition is not of high quality
• Examples
  • Orthogonal recursive bisection
  • Inertial
  • Space filling curves
Orthogonal recursive bisection
• Recursively bisect orthogonal to the longest dimension
  • Assumes communication is proportional to the surface area of the domain, and aligned with the coordinate axes
• Recursive bisection
  • Divide into two pieces, keeping the load balanced
  • Apply recursively until the desired number of partitions is obtained
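A minimal Python sketch of ORB, assuming the number of parts is a power of two and that splitting point counts in half balances the load (real implementations track point weights and handle odd splits):

```python
import numpy as np

def orb(points, n_parts):
    """Orthogonal recursive bisection of an (n, d) array of points.

    Returns an array of part labels, one per point.
    """
    labels = np.zeros(len(points), dtype=int)

    def bisect(idx, first, count):
        if count == 1:
            labels[idx] = first
            return
        pts = points[idx]
        # Cut orthogonal to the longest dimension of the bounding box.
        dim = np.argmax(pts.max(axis=0) - pts.min(axis=0))
        order = idx[np.argsort(pts[:, dim])]
        half = len(order) // 2          # equal halves -> balanced load
        bisect(order[:half], first, count // 2)
        bisect(order[half:], first + count // 2, count // 2)

    bisect(np.arange(len(points)), 0, n_parts)
    return labels

parts = orb(np.random.rand(100, 2), 4)   # 4 parts of 25 points each
```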
Inertial
• ORB may not be effective if cuts along the x, y, or z directions are not good ones
• Inertial
  • Recursively bisect orthogonal to the inertial axis
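A sketch of one inertial bisection step; it assumes the inertial axis can be taken as the principal eigenvector of the points' covariance matrix:

```python
import numpy as np

def inertial_bisect(points):
    """Bisect points orthogonal to their principal inertial axis (sketch)."""
    centered = points - points.mean(axis=0)
    # The inertial axis is the principal eigenvector of the covariance matrix.
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    axis = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    proj = centered @ axis              # project points onto the axis
    return proj <= np.median(proj)      # boolean mask: one side of the cut

mask = inertial_bisect(np.random.rand(100, 2))
```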
Space filling curves
• A space filling curve is a continuous curve that fills the space
• Order the points based on their relative position on the curve
• Choose a curve that preserves proximity
  • Points that are close in space should be close in the ordering too
• Example
  • Hilbert curve
Hilbert curve
• Defined as the limit of successively refined approximations: Hilbert curve = lim_{n→∞} H_n (the figure shows H_1, H_2, and the refinement step from H_i to H_{i+1})
• Sources
  • http://www.dcs.napier.ac.uk/~andrew/hilbert.html
  • http://www.fractalus.com/kerry/tutorials/hilbert/hilbert-tutorial.html
Domain decomposition with a space filling curve
• Order points based on their position on the curve
• Divide into P parts
  • P is the number of processes
• Space filling curves can be used in adaptive computations too
• They can be extended to higher dimensions too
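A sketch of SFC partitioning; for simplicity it uses a Morton (Z-order) curve, a weaker proximity-preserving stand-in for the Hilbert curve, and assumes 2D points in the unit square:

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of integer grid coordinates x, y (Z-order curve)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def sfc_partition(points, n_parts, bits=16):
    """Order 2D points in [0,1)^2 along the curve and cut into n_parts."""
    scale = (1 << bits) - 1
    keys = [morton_index(int(px * scale), int(py * scale), bits)
            for px, py in points]
    order = sorted(range(len(points)), key=lambda i: keys[i])
    labels = [0] * len(points)
    for rank, i in enumerate(order):
        labels[i] = rank * n_parts // len(points)   # contiguous chunks of the curve
    return labels
```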
Graph partitioning
• Model the computation as a graph G = (V, E)
  • Each task is represented by a vertex
    • A vertex weight can represent the computational effort
  • An edge exists between two tasks if one needs data owned by the other
    • Weights can be associated with edges too
• Goal
  • Partition the vertices into P parts such that each part has equal vertex weight
  • Minimize the total weight of the edges cut
  • This problem is NP hard
• Edge cut metric
  • Judge the quality of the partitioning by the number of edges cut
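The edge cut metric is straightforward to compute; a small sketch, assuming the partition is given as a map from vertex to part label:

```python
def edge_cut(edges, part):
    """Number (or total weight) of edges whose endpoints lie in different parts.

    edges: iterable of (u, v) pairs or (u, v, weight) triples.
    part:  maps vertex -> partition label.
    """
    cut = 0
    for e in edges:
        u, v, w = e if len(e) == 3 else (*e, 1)
        if part[u] != part[v]:
            cut += w
    return cut

# Example: a 4-cycle split into two adjacent pairs cuts 2 of its 4 edges.
assert edge_cut([(0, 1), (1, 2), (2, 3), (3, 0)], {0: 0, 1: 0, 2: 1, 3: 1}) == 2
```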
Static graph partitioning
• Combinatorial
  • Levelized nested dissection
  • Kernighan-Lin/Fiduccia-Mattheyses
• Spectral partitioning
• Multi-level methods
Combinatorial partitioning
• Uses only connectivity information
• Examples
  • Levelized nested dissection
  • Kernighan-Lin/Fiduccia-Mattheyses
Levelized nested dissection (LND)
• The idea is similar to the geometric methods
  • But coordinate information cannot be used
  • Instead of projecting vertices along the longest axis, order them by distance from a vertex that may lie at one extreme of the longest dimension of the graph
• Pseudo-peripheral vertex
  • Perform a breadth-first search, starting from an arbitrary vertex
  • The vertex encountered last may be a good approximation to a peripheral vertex
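A sketch of the pseudo-peripheral vertex heuristic, iterating BFS sweeps until the reached depth stops growing (the function name and adjacency-list format are illustrative):

```python
def pseudo_peripheral(adj, start=0):
    """Approximate a peripheral vertex of a graph (adjacency-list dict)."""
    def bfs(s):
        # Level-by-level BFS; return the last vertex reached and its depth.
        seen, frontier, depth, last = {s}, [s], 0, s
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        nxt.append(v)
            if nxt:
                depth, last = depth + 1, nxt[-1]
            frontier = nxt
        return last, depth

    # Restart the BFS from the last vertex reached until depth stops growing.
    v, depth = bfs(start)
    while True:
        u, d = bfs(v)
        if d <= depth:
            return v
        v, depth = u, d
```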
LND example – Finding a pseudo-peripheral vertex: BFS levels (1, 2, 3, ...) spread outward from the initial vertex; the last vertex reached, at level 4, is taken as the pseudo-peripheral vertex.
LND example – Partitioning: vertices are labeled with their BFS level from the pseudo-peripheral vertex; cutting at the median level yields the partition, and the subgraphs are then recursively bisected.
Kernighan-Lin/Fiduccia-Mattheyses
• Refines an existing partition
• Kernighan-Lin
  • Consider pairs of vertices from different partitions
  • Choose the pair whose swap gives the best improvement in partition quality
    • The best available improvement may actually be a worsening
  • Perform several passes
  • Choose the best partition among those encountered
• Fiduccia-Mattheyses
  • Similar, but more efficient
• Boundary Kernighan-Lin
  • Consider only boundary vertices as candidates to swap
• ... and many other variants
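A sketch of the core Kernighan-Lin step for an unweighted graph: compute swap gains and pick the best pair. A full pass would lock swapped pairs, repeat, and keep the best prefix of swaps seen:

```python
def kl_best_swap(adj, part_a, part_b):
    """Find the (a, b) swap with the largest gain between two vertex sets.

    gain(a, b) = D_a + D_b - 2*w(a, b), where D_v = external - internal degree.
    The best gain may be zero or negative, as the slide notes.
    """
    def d(v, own, other):
        return sum(1 for u in adj[v] if u in other) - \
               sum(1 for u in adj[v] if u in own)

    best, best_gain = None, float("-inf")
    for a in part_a:
        for b in part_b:
            gain = d(a, part_a, part_b) + d(b, part_b, part_a) \
                   - 2 * (1 if b in adj[a] else 0)
            if gain > best_gain:
                best, best_gain = (a, b), gain
    return best, best_gain

adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}   # a 4-cycle
pair, gain = kl_best_swap(adj, {0, 1}, {2, 3})
```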
Kernighan-Lin example: the existing partition has edge cut 4; swapping the indicated pair of vertices yields a better partition with edge cut 3.
Spectral method
• Based on the observation that a Fiedler vector of a graph contains connectivity information
• Laplacian of a graph: L
  • L_ii = d_i (degree of vertex i)
  • L_ij = -1 if edge {i, j} exists, otherwise 0
• The smallest eigenvalue of L is 0, with the all-ones eigenvector
  • All other eigenvalues are positive for a connected graph
• Fiedler vector
  • Eigenvector corresponding to the second smallest eigenvalue
Fiedler vector
• Consider a partitioning of V into A and B
  • Let y_i = 1 if v_i ∈ A, and y_i = -1 if v_i ∈ B
• For load balance, Σ_i y_i = 0
• Also, Σ_{{i,j} ∈ E} (y_i - y_j)^2 = 4 × (number of edges across partitions)
• Also, y^T L y = Σ_i d_i y_i^2 - 2 Σ_{{i,j} ∈ E} y_i y_j = Σ_{{i,j} ∈ E} (y_i - y_j)^2
Optimization problem
• The optimal partition is obtained by solving
  • Minimize y^T L y
  • Constraints:
    • y_i ∈ {-1, 1}
    • Σ_i y_i = 0
  • This is NP hard
• Relaxed problem
  • Minimize y^T L y
  • Constraints:
    • Σ_i y_i = 0
    • Add a constraint on a norm of y, for example ||y||_2 = n^{1/2}
• Note
  • (1, 1, ..., 1)^T is an eigenvector with eigenvalue 0
  • For a connected graph, all other eigenvalues are positive, and their eigenvectors are orthogonal to this one, which implies Σ_i y_i = 0
  • The objective function is therefore minimized by a Fiedler vector
Spectral algorithm
• Find a Fiedler vector of the Laplacian of the graph
  • Note that the Fiedler value (the second smallest eigenvalue) yields a lower bound on the communication cost when the load is balanced
• Bisect the graph using the Fiedler vector
  • Let all vertices whose components in the Fiedler vector are greater than the median be in one part, and the rest in the other
• Apply this recursively to each partition
• Note: finding the Fiedler vector of a large graph can be time consuming
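A minimal spectral bisection sketch using a dense eigensolver; this is only practical for small graphs (large graphs need an iterative method such as Lanczos), and the median split is the simple variant described above:

```python
import numpy as np

def spectral_bisect(edges, n):
    """Split an n-vertex graph at the median of its Fiedler vector."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1          # off-diagonal: -1 per edge
        L[j, i] -= 1
        L[i, i] += 1          # diagonal: vertex degree
        L[j, j] += 1
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    fiedler = vecs[:, 1]                 # second smallest eigenvalue
    return fiedler > np.median(fiedler)  # one side of the bisection

# A 6-cycle splits into two contiguous arcs of the cycle.
mask = spectral_bisect([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)], 6)
```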
Multilevel methods
• Idea
  • It takes time to partition a large graph
  • So partition a small graph instead!
• Three phases
  • Graph coarsening
    • Combine vertices to create a smaller graph (for example, contract the edges of a suitable matching)
    • Apply this recursively until a suitably small graph is obtained
  • Partitioning
    • Use spectral or any other partitioning algorithm to partition the small graph
  • Multilevel refinement
    • Uncoarsen the graph to get a partitioning of the original graph
    • At each level, perform some graph refinement
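A sketch of one coarsening level using a random matching; production multilevel codes typically use heavy-edge matching and accumulate vertex/edge weights, which this omits:

```python
import random

def coarsen(adj):
    """Collapse a random matching of the graph given as an adjacency-list dict.

    Each matched pair becomes one coarse vertex; unmatched vertices carry over.
    Returns the coarse adjacency and the fine-to-coarse vertex map.
    """
    coarse_of, matched, nxt = {}, set(), 0
    vertices = list(adj)
    random.shuffle(vertices)
    for v in vertices:
        if v in matched:
            continue
        partner = next((u for u in adj[v] if u not in matched), None)
        matched.add(v)
        coarse_of[v] = nxt
        if partner is not None:
            matched.add(partner)
            coarse_of[partner] = nxt
        nxt += 1
    # Build the coarse adjacency (self-loops from collapsed edges are dropped).
    coarse_adj = {c: set() for c in range(nxt)}
    for v, nbrs in adj.items():
        for u in nbrs:
            if coarse_of[u] != coarse_of[v]:
                coarse_adj[coarse_of[v]].add(coarse_of[u])
    return coarse_adj, coarse_of
```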
Multilevel example (without refinement): a 16-vertex graph is coarsened by repeatedly matching and merging pairs of vertices, accumulating vertex and edge weights; the small coarse graph is then bisected, and the partition is projected back onto the original 16 vertices.
Dynamic partitioning
• We have an initial partitioning
• Now the graph changes
  • Determine a good new partition, fast
  • Also minimize the number of vertices that need to be moved
• Examples
  • PLUM
  • JOSTLE
  • Diffusion
PLUM
• Partition based on the initial mesh
  • Only the vertex and edge weights change
• Map partitions to processors
  • Use more partitions than processors
    • Ensures finer granularity
  • Compute a similarity matrix based on the data already on each process
    • Measures the savings in data redistribution cost for each (process, partition) pair
  • Choose an assignment of partitions to processors
    • Example: maximum-weight matching, duplicating each processor (# of partitions)/P times
    • Alternative: greedy approximation algorithm – assign in order of maximum similarity value
• http://citeseer.nj.nec.com/oliker98plum.html
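A sketch of the greedy alternative, assuming the similarity matrix is given and the number of partitions is a multiple of P (the data layout and function name are illustrative, not PLUM's actual interface):

```python
def assign_partitions(similarity):
    """Greedily assign partitions to processors in order of similarity.

    similarity[p][k]: redistribution savings if processor p keeps partition k.
    Each processor receives the same number of partitions.
    """
    n_proc = len(similarity)
    n_part = len(similarity[0])
    per_proc = n_part // n_proc
    pairs = sorted(((similarity[p][k], p, k)
                    for p in range(n_proc) for k in range(n_part)),
                   reverse=True)
    owner, quota = {}, [per_proc] * n_proc
    for _, p, k in pairs:            # take pairs in order of maximum similarity
        if k not in owner and quota[p] > 0:
            owner[k] = p
            quota[p] -= 1
    return owner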
JOSTLE
• Uses Hu and Blake’s scheme for load balancing
  • Solve L x = b using Conjugate Gradient
    • L = Laplacian of the processor graph; b_i = (weight on process P_i) - (average weight)
  • Move max(x_i - x_j, 0) weight from P_i to its neighbor P_j
  • Leads to a balanced load
    • Equivalent to P_i sending x_i load to each neighbor, and each neighbor P_j sending x_j back to P_i
    • Net loss in load for P_i = d_i x_i - Σ_{j ∈ neighbors(i)} x_j = L_(i) x = b_i, where L_(i) is row i of L and d_i is the degree of i
    • New load for P_i = (weight on P_i) - b_i = average weight
  • Moving max(x_i - x_j, 0) also minimizes the L2 norm of the load moved
• Select the vertices to move based on relative gain
• http://citeseer.nj.nec.com/walshaw97parallel.html
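A sketch of the Hu-Blake computation on a small processor graph; it uses a least-squares solve in place of Conjugate Gradient (L is singular, but any solution gives the same flows x_i - x_j when b sums to zero):

```python
import numpy as np

def hu_blake_flows(L, weights):
    """Solve L x = b on the processor graph and return the load to move.

    b_i = weight on P_i minus the average; flow[i, j] is the load P_i
    sends to P_j, namely max(x_i - x_j, 0) along processor-graph edges.
    """
    b = weights - weights.mean()
    x, *_ = np.linalg.lstsq(L, b, rcond=None)
    flow = np.maximum(x[:, None] - x[None, :], 0.0)
    flow[L == 0] = 0.0        # only move load along actual edges
    return flow

L = np.array([[1., -1, 0], [-1, 2, -1], [0, -1, 1]])   # path P0 - P1 - P2
print(hu_blake_flows(L, np.array([6., 3., 0.])))       # moves 3 along each edge
```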
Diffusion
• Involves only communication with neighbors
• A simple scheme
  • Processor P_i repeatedly sends α w_i weight to each neighbor, where w_i is the weight on P_i
  • In matrix form: w^(k) = (I - α L) w^(k-1), where w^(k) is the weight vector at iteration k
• Simple criteria exist for choosing α to ensure convergence
  • Example: α = 0.5 / (max_i d_i)
• More sophisticated schemes exist
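A sketch of the diffusion iteration in matrix form, using the convergence-safe α from the slide:

```python
import numpy as np

def diffuse(L, w, alpha=None, iters=50):
    """First-order diffusion: w^(k) = (I - alpha*L) w^(k-1).

    With alpha = 0.5 / max_i d_i the iteration converges toward the
    average load on a connected processor graph.
    """
    if alpha is None:
        alpha = 0.5 / L.diagonal().max()     # d_i sits on the diagonal of L
    M = np.eye(len(w)) - alpha * L
    for _ in range(iters):
        w = M @ w
    return w

L = np.array([[1., -1, 0], [-1, 2, -1], [0, -1, 1]])   # path P0 - P1 - P2
print(diffuse(L, np.array([6., 3., 0.])))              # -> approx [3, 3, 3]
```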
Important points
• Goals of domain decomposition
  • Balance the load
  • Minimize communication
• Space filling curves
• Graph partitioning model
• Spectral method
  • Relax the NP-hard integer optimization to a continuous (floating point) problem, then discretize to get an approximate integer solution
• Multilevel methods
  • Three phases
• Dynamic partitioning – additional requirements
  • Use the old solution to find the new one fast
  • Minimize the number of vertices moved