Domain decomposition in parallel computing
COT 5410 – Spring 2004
Ashok Srinivasan
www.cs.fsu.edu/~asriniva
Florida State University
Outline
• Background
• Geometric partitioning
• Graph partitioning
  • Static
  • Dynamic
• Important points
Background
• Tasks in a parallel computation need access to certain data
• The same datum may be needed by multiple tasks
  • Example: in matrix-vector multiplication c = A b, b_2 is needed for the computation of every c_i, 1 ≤ i ≤ n, since c_i = Σ_j a_ij b_j
• If a process does not “own” a datum needed by its task, it has to get it from a process that has it
  • This communication is expensive
• Aims of domain decomposition
  • Distribute the data so that the required communication is minimized
  • Ensure that the computational loads on processes are balanced
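As a concrete illustration, here is a minimal numpy sketch (the two-process split is hypothetical) of why a row-distributed matrix-vector product needs all of b on every process:

```python
import numpy as np

# Illustrative sketch: row-wise distribution of c = A b across 2 "processes".
# Each process owns a block of rows of A, but every c_i = sum_j a_ij * b_j
# still needs the entire vector b -- including b_2.
n = 4
A = np.arange(n * n, dtype=float).reshape(n, n)
b = np.arange(n, dtype=float)

c_proc0 = A[0:2, :] @ b   # process 0 owns rows 0..1, needs all of b
c_proc1 = A[2:4, :] @ b   # process 1 owns rows 2..3, needs all of b too

assert np.allclose(np.concatenate([c_proc0, c_proc1]), A @ b)
```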
Domain decomposition example
• Finite difference computation
  • The new value of a node depends on the old values of its neighbors
• We want to divide the nodes amongst the processes so that
  • Communication is minimized (the measure of partition quality)
  • Computational load is evenly balanced
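For concreteness, a sketch of the kind of update involved; this is a 5-point Jacobi-style stencil, and the averaging coefficients are illustrative:

```python
import numpy as np

# Each interior node's new value depends on the old values of its four
# neighbors, so a process owning a block of nodes must receive the boundary
# rows/columns of its neighbors' blocks every iteration.
def stencil_step(u):
    new = u.copy()
    new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
    return new

u = np.random.rand(8, 8)
u = stencil_step(u)
```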
Geometric partitioning
• Partition a set of points
  • Uses only coordinate information
  • Balances the load
  • The heuristic tries to ensure that communication costs are low
• Algorithms are typically fast, but the partition is not of high quality
• Examples
  • Orthogonal recursive bisection
  • Inertial
  • Space filling curves
Orthogonal recursive bisection
• Recursively bisect orthogonal to the longest dimension
  • Assumes communication is proportional to the surface area of the domain, and aligned with the coordinate axes
• Recursive bisection
  • Divide into two pieces, keeping the load balanced
  • Apply recursively until the desired number of partitions is obtained
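A minimal Python sketch of ORB, assuming the number of parts is a power of two and that splitting point counts in half balances the load (real implementations track point weights and handle odd splits):

```python
import numpy as np

def orb(points, n_parts):
    """Orthogonal recursive bisection of an (n, d) array of points.

    Returns an array of part labels, one per point.
    """
    labels = np.zeros(len(points), dtype=int)

    def bisect(idx, first, count):
        if count == 1:
            labels[idx] = first
            return
        pts = points[idx]
        # Cut orthogonal to the longest dimension of the bounding box.
        dim = np.argmax(pts.max(axis=0) - pts.min(axis=0))
        order = idx[np.argsort(pts[:, dim])]
        half = len(order) // 2          # equal halves -> balanced load
        bisect(order[:half], first, count // 2)
        bisect(order[half:], first + count // 2, count // 2)

    bisect(np.arange(len(points)), 0, n_parts)
    return labels

parts = orb(np.random.rand(100, 2), 4)   # 4 parts of 25 points each
```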
Inertial
• ORB may not be effective if cuts along the x, y, or z directions are not good ones
• Inertial
  • Recursively bisect orthogonal to the inertial axis
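A sketch of one inertial bisection step; it assumes the inertial axis can be taken as the principal eigenvector of the points' covariance matrix:

```python
import numpy as np

def inertial_bisect(points):
    """Bisect points orthogonal to their principal inertial axis (sketch)."""
    centered = points - points.mean(axis=0)
    # The inertial axis is the principal eigenvector of the covariance matrix.
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    axis = vecs[:, -1]                  # eigenvector of the largest eigenvalue
    proj = centered @ axis              # project points onto the axis
    return proj <= np.median(proj)      # boolean mask: one side of the cut

mask = inertial_bisect(np.random.rand(100, 2))
```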
Space filling curves
• A space filling curve is a continuous curve that fills the space
• Order the points based on their relative position on the curve
• Choose a curve that preserves proximity
  • Points that are close in space should be close in the ordering too
• Example
  • Hilbert curve
Hilbert curve
• Defined as the limit of successively refined approximations: Hilbert curve = lim_{n→∞} H_n (the figure shows H_1, H_2, and the refinement step from H_i to H_{i+1})
• Sources
  • http://www.dcs.napier.ac.uk/~andrew/hilbert.html
  • http://www.fractalus.com/kerry/tutorials/hilbert/hilbert-tutorial.html
Domain decomposition with a space filling curve
• Order points based on their position on the curve
• Divide into P parts
  • P is the number of processes
• Space filling curves can be used in adaptive computations too
• They can be extended to higher dimensions too
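A sketch of SFC partitioning; for simplicity it uses a Morton (Z-order) curve, a weaker proximity-preserving stand-in for the Hilbert curve, and assumes 2D points in the unit square:

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of integer grid coordinates x, y (Z-order curve)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def sfc_partition(points, n_parts, bits=16):
    """Order 2D points in [0,1)^2 along the curve and cut into n_parts."""
    scale = (1 << bits) - 1
    keys = [morton_index(int(px * scale), int(py * scale), bits)
            for px, py in points]
    order = sorted(range(len(points)), key=lambda i: keys[i])
    labels = [0] * len(points)
    for rank, i in enumerate(order):
        labels[i] = rank * n_parts // len(points)   # contiguous chunks of the curve
    return labels
```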
Graph partitioning
• Model the computation as a graph G = (V, E)
  • Each task is represented by a vertex
    • A vertex weight can represent the computational effort
  • An edge exists between two tasks if one needs data owned by the other
    • Weights can be associated with edges too
• Goal
  • Partition the vertices into P parts such that each part has equal vertex weight
  • Minimize the total weight of the edges cut
  • This problem is NP hard
• Edge cut metric
  • Judge the quality of the partitioning by the number of edges cut
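The edge cut metric is straightforward to compute; a small sketch, assuming the partition is given as a map from vertex to part label:

```python
def edge_cut(edges, part):
    """Number (or total weight) of edges whose endpoints lie in different parts.

    edges: iterable of (u, v) pairs or (u, v, weight) triples.
    part:  maps vertex -> partition label.
    """
    cut = 0
    for e in edges:
        u, v, w = e if len(e) == 3 else (*e, 1)
        if part[u] != part[v]:
            cut += w
    return cut

# Example: a 4-cycle split into two adjacent pairs cuts 2 of its 4 edges.
assert edge_cut([(0, 1), (1, 2), (2, 3), (3, 0)], {0: 0, 1: 0, 2: 1, 3: 1}) == 2
```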
Static graph partitioning
• Combinatorial
  • Levelized nested dissection
  • Kernighan-Lin/Fiduccia-Mattheyses
• Spectral partitioning
• Multi-level methods
Combinatorial partitioning
• Uses only connectivity information
• Examples
  • Levelized nested dissection
  • Kernighan-Lin/Fiduccia-Mattheyses
Levelized nested dissection (LND)
• The idea is similar to the geometric methods
  • But coordinate information cannot be used
  • Instead of projecting vertices along the longest axis, order them by distance from a vertex that may lie at one extreme of the longest dimension of the graph
• Pseudo-peripheral vertex
  • Perform a breadth-first search, starting from an arbitrary vertex
  • The vertex encountered last may be a good approximation to a peripheral vertex
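A sketch of the pseudo-peripheral vertex heuristic, iterating BFS sweeps until the reached depth stops growing (the function name and adjacency-list format are illustrative):

```python
def pseudo_peripheral(adj, start=0):
    """Approximate a peripheral vertex of a graph (adjacency-list dict)."""
    def bfs(s):
        # Level-by-level BFS; return the last vertex reached and its depth.
        seen, frontier, depth, last = {s}, [s], 0, s
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        nxt.append(v)
            if nxt:
                depth, last = depth + 1, nxt[-1]
            frontier = nxt
        return last, depth

    # Restart the BFS from the last vertex reached until depth stops growing.
    v, depth = bfs(start)
    while True:
        u, d = bfs(v)
        if d <= depth:
            return v
        v, depth = u, d
```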
LND example – Finding a pseudo-peripheral vertex: BFS levels (1, 2, 3, ...) spread outward from the initial vertex; the last vertex reached, at level 4, is taken as the pseudo-peripheral vertex.
LND example – Partitioning: vertices are labeled with their BFS level from the pseudo-peripheral vertex; cutting at the median level yields the partition, and the subgraphs are then recursively bisected.
Kernighan-Lin/Fiduccia-Mattheyses
• Refines an existing partition
• Kernighan-Lin
  • Consider pairs of vertices from different partitions
  • Choose the pair whose swap gives the best improvement in partition quality
    • The best available improvement may actually be a worsening
  • Perform several passes
  • Choose the best partition among those encountered
• Fiduccia-Mattheyses
  • Similar, but more efficient
• Boundary Kernighan-Lin
  • Consider only boundary vertices as candidates to swap
• ... and many other variants
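A sketch of the core Kernighan-Lin step for an unweighted graph: compute swap gains and pick the best pair. A full pass would lock swapped pairs, repeat, and keep the best prefix of swaps seen:

```python
def kl_best_swap(adj, part_a, part_b):
    """Find the (a, b) swap with the largest gain between two vertex sets.

    gain(a, b) = D_a + D_b - 2*w(a, b), where D_v = external - internal degree.
    The best gain may be zero or negative, as the slide notes.
    """
    def d(v, own, other):
        return sum(1 for u in adj[v] if u in other) - \
               sum(1 for u in adj[v] if u in own)

    best, best_gain = None, float("-inf")
    for a in part_a:
        for b in part_b:
            gain = d(a, part_a, part_b) + d(b, part_b, part_a) \
                   - 2 * (1 if b in adj[a] else 0)
            if gain > best_gain:
                best, best_gain = (a, b), gain
    return best, best_gain

adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}   # a 4-cycle
pair, gain = kl_best_swap(adj, {0, 1}, {2, 3})
```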
Kernighan-Lin example: the existing partition has edge cut 4; swapping the indicated pair of vertices yields a better partition with edge cut 3.
Spectral method
• Based on the observation that a Fiedler vector of a graph contains connectivity information
• Laplacian of a graph: L
  • L_ii = d_i (degree of vertex i)
  • L_ij = -1 if edge {i, j} exists, otherwise 0
• The smallest eigenvalue of L is 0, with the all-ones eigenvector
  • All other eigenvalues are positive for a connected graph
• Fiedler vector
  • Eigenvector corresponding to the second smallest eigenvalue
Fiedler vector
• Consider a partitioning of V into A and B
  • Let y_i = 1 if v_i ∈ A, and y_i = -1 if v_i ∈ B
• For load balance, Σ_i y_i = 0
• Also, Σ_{{i,j} ∈ E} (y_i - y_j)^2 = 4 × (number of edges across partitions)
• Also, y^T L y = Σ_i d_i y_i^2 - 2 Σ_{{i,j} ∈ E} y_i y_j = Σ_{{i,j} ∈ E} (y_i - y_j)^2
Optimization problem
• The optimal partition is obtained by solving
  • Minimize y^T L y
  • Constraints:
    • y_i ∈ {-1, 1}
    • Σ_i y_i = 0
  • This is NP hard
• Relaxed problem
  • Minimize y^T L y
  • Constraints:
    • Σ_i y_i = 0
    • Add a constraint on a norm of y, for example ||y||_2 = n^{1/2}
• Note
  • (1, 1, ..., 1)^T is an eigenvector with eigenvalue 0
  • For a connected graph, all other eigenvalues are positive, and their eigenvectors are orthogonal to this one, which implies Σ_i y_i = 0
  • The objective function is therefore minimized by a Fiedler vector
Spectral algorithm
• Find a Fiedler vector of the Laplacian of the graph
  • Note that the Fiedler value (the second smallest eigenvalue) yields a lower bound on the communication cost when the load is balanced
• Bisect the graph using the Fiedler vector
  • Let all vertices whose components in the Fiedler vector are greater than the median be in one part, and the rest in the other
• Apply this recursively to each partition
• Note: finding the Fiedler vector of a large graph can be time consuming
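A minimal spectral bisection sketch using a dense eigensolver; this is only practical for small graphs (large graphs need an iterative method such as Lanczos), and the median split is the simple variant described above:

```python
import numpy as np

def spectral_bisect(edges, n):
    """Split an n-vertex graph at the median of its Fiedler vector."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1          # off-diagonal: -1 per edge
        L[j, i] -= 1
        L[i, i] += 1          # diagonal: vertex degree
        L[j, j] += 1
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    fiedler = vecs[:, 1]                 # second smallest eigenvalue
    return fiedler > np.median(fiedler)  # one side of the bisection

# A 6-cycle splits into two contiguous arcs of the cycle.
mask = spectral_bisect([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)], 6)
```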
Multilevel methods
• Idea
  • It takes time to partition a large graph
  • So partition a small graph instead!
• Three phases
  • Graph coarsening
    • Combine vertices to create a smaller graph (for example, contract the edges of a suitable matching)
    • Apply this recursively until a suitably small graph is obtained
  • Partitioning
    • Use spectral or any other partitioning algorithm to partition the small graph
  • Multilevel refinement
    • Uncoarsen the graph to get a partitioning of the original graph
    • At each level, perform some graph refinement
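A sketch of one coarsening level using a random matching; production multilevel codes typically use heavy-edge matching and accumulate vertex/edge weights, which this omits:

```python
import random

def coarsen(adj):
    """Collapse a random matching of the graph given as an adjacency-list dict.

    Each matched pair becomes one coarse vertex; unmatched vertices carry over.
    Returns the coarse adjacency and the fine-to-coarse vertex map.
    """
    coarse_of, matched, nxt = {}, set(), 0
    vertices = list(adj)
    random.shuffle(vertices)
    for v in vertices:
        if v in matched:
            continue
        partner = next((u for u in adj[v] if u not in matched), None)
        matched.add(v)
        coarse_of[v] = nxt
        if partner is not None:
            matched.add(partner)
            coarse_of[partner] = nxt
        nxt += 1
    # Build the coarse adjacency (self-loops from collapsed edges are dropped).
    coarse_adj = {c: set() for c in range(nxt)}
    for v, nbrs in adj.items():
        for u in nbrs:
            if coarse_of[u] != coarse_of[v]:
                coarse_adj[coarse_of[v]].add(coarse_of[u])
    return coarse_adj, coarse_of
```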
Multilevel example (without refinement): a 16-vertex graph is coarsened by repeatedly matching and merging pairs of vertices, accumulating vertex and edge weights; the small coarse graph is then bisected, and the partition is projected back onto the original 16 vertices.
Dynamic partitioning
• We have an initial partitioning
• Now the graph changes
  • Determine a good new partition, fast
  • Also minimize the number of vertices that need to be moved
• Examples
  • PLUM
  • JOSTLE
  • Diffusion
PLUM
• Partition based on the initial mesh
  • Only the vertex and edge weights change
• Map partitions to processors
  • Use more partitions than processors
    • Ensures finer granularity
  • Compute a similarity matrix based on the data already on each process
    • Measures the savings in data redistribution cost for each (process, partition) pair
  • Choose an assignment of partitions to processors
    • Example: maximum-weight matching, duplicating each processor (# of partitions)/P times
    • Alternative: greedy approximation algorithm – assign in order of maximum similarity value
• http://citeseer.nj.nec.com/oliker98plum.html
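A sketch of the greedy alternative, assuming the similarity matrix is given and the number of partitions is a multiple of P (the data layout and function name are illustrative, not PLUM's actual interface):

```python
def assign_partitions(similarity):
    """Greedily assign partitions to processors in order of similarity.

    similarity[p][k]: redistribution savings if processor p keeps partition k.
    Each processor receives the same number of partitions.
    """
    n_proc = len(similarity)
    n_part = len(similarity[0])
    per_proc = n_part // n_proc
    pairs = sorted(((similarity[p][k], p, k)
                    for p in range(n_proc) for k in range(n_part)),
                   reverse=True)
    owner, quota = {}, [per_proc] * n_proc
    for _, p, k in pairs:            # take pairs in order of maximum similarity
        if k not in owner and quota[p] > 0:
            owner[k] = p
            quota[p] -= 1
    return owner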
JOSTLE
• Uses Hu and Blake’s scheme for load balancing
  • Solve L x = b using Conjugate Gradient
    • L = Laplacian of the processor graph; b_i = (weight on process P_i) - (average weight)
  • Move max(x_i - x_j, 0) weight from P_i to its neighbor P_j
  • Leads to a balanced load
    • Equivalent to P_i sending x_i load to each neighbor, and each neighbor P_j sending x_j back to P_i
    • Net loss in load for P_i = d_i x_i - Σ_{j ∈ neighbors(i)} x_j = L_(i) x = b_i, where L_(i) is row i of L and d_i is the degree of i
    • New load for P_i = (weight on P_i) - b_i = average weight
  • Moving max(x_i - x_j, 0) also minimizes the L2 norm of the load moved
• Select the vertices to move based on relative gain
• http://citeseer.nj.nec.com/walshaw97parallel.html
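A sketch of the Hu-Blake computation on a small processor graph; it uses a least-squares solve in place of Conjugate Gradient (L is singular, but any solution gives the same flows x_i - x_j when b sums to zero):

```python
import numpy as np

def hu_blake_flows(L, weights):
    """Solve L x = b on the processor graph and return the load to move.

    b_i = weight on P_i minus the average; flow[i, j] is the load P_i
    sends to P_j, namely max(x_i - x_j, 0) along processor-graph edges.
    """
    b = weights - weights.mean()
    x, *_ = np.linalg.lstsq(L, b, rcond=None)
    flow = np.maximum(x[:, None] - x[None, :], 0.0)
    flow[L == 0] = 0.0        # only move load along actual edges
    return flow

L = np.array([[1., -1, 0], [-1, 2, -1], [0, -1, 1]])   # path P0 - P1 - P2
print(hu_blake_flows(L, np.array([6., 3., 0.])))       # moves 3 along each edge
```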
Diffusion
• Involves only communication with neighbors
• A simple scheme
  • Processor P_i repeatedly sends α w_i weight to each neighbor, where w_i is the weight on P_i
  • In matrix form: w^(k) = (I - α L) w^(k-1), where w^(k) is the weight vector at iteration k
• Simple criteria exist for choosing α to ensure convergence
  • Example: α = 0.5 / (max_i d_i)
• More sophisticated schemes exist
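A sketch of the diffusion iteration in matrix form, using the convergence-safe α from the slide:

```python
import numpy as np

def diffuse(L, w, alpha=None, iters=50):
    """First-order diffusion: w^(k) = (I - alpha*L) w^(k-1).

    With alpha = 0.5 / max_i d_i the iteration converges toward the
    average load on a connected processor graph.
    """
    if alpha is None:
        alpha = 0.5 / L.diagonal().max()     # d_i sits on the diagonal of L
    M = np.eye(len(w)) - alpha * L
    for _ in range(iters):
        w = M @ w
    return w

L = np.array([[1., -1, 0], [-1, 2, -1], [0, -1, 1]])   # path P0 - P1 - P2
print(diffuse(L, np.array([6., 3., 0.])))              # -> approx [3, 3, 3]
```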
Important points
• Goals of domain decomposition
  • Balance the load
  • Minimize communication
• Space filling curves
• Graph partitioning model
• Spectral method
  • Relax the NP-hard integer optimization to a continuous (floating point) problem, then discretize to get an approximate integer solution
• Multilevel methods
  • Three phases
• Dynamic partitioning – additional requirements
  • Use the old solution to find the new one fast
  • Minimize the number of vertices moved