This paper discusses the partitioning of matrices on a grid of heterogeneous processors, ensuring balanced computation load and optimal performance. It presents an algorithm based on 2D block cyclic ScaLAPACK and proposes a generalized block partitioning approach.
On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors Alexey Lastovetsky School of Computer Science and Informatics University College Dublin Alexey.Lastovetsky@ucd.ie ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
Heterogeneous parallel computing • Heterogeneity of processors • The processors run at different speeds • Even distribution of computations does not balance the processors’ load • The performance is determined by the slowest processor • Data must be distributed unevenly • So that each processor performs a volume of computation proportional to its speed
Constant performance models of heterogeneous processors • The simplest performance model of heterogeneous processors • p, the number of the processors, • S={s1, s2, ..., sp}, the speeds of the processors (positive constants). • The speed • Absolute: the number of computational units performed by the processor per one time unit • Relative: • Some use the execution time:
Data distribution problems with constant models of heterogeneous processors • Typical design of heterogeneous parallel algorithms • Problem of distribution of computations in proportion to the speed of processors • Problem of partitioning of some mathematical objects • Sets, matrices, graphs, geometric figures, etc.
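The simplest instance of this problem, splitting a set of n equal computation units in proportion to constant speeds, can be sketched as follows. This is an illustrative sketch, not the method of the talk: the function name `partition_set` and the greedy rule for handing out leftover units are assumptions.

```python
def partition_set(n, speeds):
    """Split n equal units among processors in proportion to their speeds.

    `speeds` holds positive constants (absolute or relative speeds).
    """
    total = sum(speeds)
    # Start from the rounded-down proportional share of each processor.
    counts = [int(n * s / total) for s in speeds]
    # Give each leftover unit to the processor that would finish it soonest,
    # i.e. the one minimizing (units + 1) / speed.
    for _ in range(n - sum(counts)):
        i = min(range(len(speeds)), key=lambda k: (counts[k] + 1) / speeds[k])
        counts[i] += 1
    return counts
```

With speeds 1, 2, 2 and 100 units, each processor receives a share whose execution time (units divided by speed) is the same.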
Partitioning matrices with constant models of heterogeneous processors • Matrices • The most widely used mathematical objects in scientific computing • Studied partitioning problems mainly deal with matrices • Matrix partitioning in one dimension over a 1D arrangement of processors • Often reduced to partitioning sets or well-ordered sets • Design of algorithms often results in matrix partitioning problems that do not impose the restriction of partitioning in one dimension • E.g., in parallel linear algebra for heterogeneous platforms • We will use matrix multiplication • A simple but very important linear algebra kernel
Partitioning matrices with constant models of heterogeneous processors (ctd) • A heterogeneous matrix multiplication algorithm • A modification of some homogeneous one • Most often, of the 2D block cyclic ScaLAPACK algorithm
Partitioning matrices with constant models of heterogeneous processors (ctd) • 2D block cyclic ScaLAPACK MM algorithm (ctd)
Partitioning matrices with constant models of heterogeneous processors (ctd) • 2D block cyclic ScaLAPACK MM algorithm (ctd) • The matrices are identically partitioned into rectangular generalized blocks of size (p×r)×(q×r) • Each generalized block forms a 2D p×q grid of r×r blocks • There is a one-to-one mapping between this grid of blocks and the p×q processor grid • At each step of the algorithm • Each processor not owning the pivot row and column receives horizontally (n/p)×r elements of matrix A and vertically (n/q)×r elements of matrix B • => in total, (n/p + n/q)×r elements, i.e., ~ the half-perimeter of the rectangular area allocated to the processor
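The one-to-one mapping between the p×q grid of r×r blocks in a generalized block and the p×q processor grid amounts to a modulo rule on block indices. A minimal sketch, assuming 0-based block and processor coordinates; the helper names `owner` and `blocks_owned` are hypothetical:

```python
def owner(i, j, p, q):
    """Grid coordinates of the processor owning r-by-r block (i, j).

    Block indices are 0-based; the pattern repeats every p block rows
    and every q block columns, i.e. once per generalized block.
    """
    return (i % p, j % q)


def blocks_owned(n_blocks, p, q, proc):
    """All blocks of an n_blocks x n_blocks block matrix owned by `proc`."""
    return [(i, j)
            for i in range(n_blocks)
            for j in range(n_blocks)
            if owner(i, j, p, q) == proc]
```

On a homogeneous grid every processor ends up with the same number of blocks, which is exactly what the heterogeneous modifications relax.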
Partitioning matrices with constant models of heterogeneous processors (ctd) • General design of heterogeneous modifications • Matrices A, B, and C are identically partitioned into equal rectangular generalized blocks • The generalized blocks are identically partitioned into rectangles so that • There is a one-to-one mapping between the rectangles and the processors • The area of each rectangle is (approximately) proportional to the speed of the processor that owns the rectangle • Then, the algorithm follows the steps of its homogeneous prototype
Partitioning matrices with constant models of heterogeneous processors (ctd)
Partitioning matrices with constant models of heterogeneous processors (ctd) • Why partition the GBs in proportion to the speed • At each step, updating one r×r block of matrix C needs the same amount of computation for all the blocks • => the load will be perfectly balanced if the number of blocks updated by each processor is proportional to its speed • The number = ni×NGB • ni = the area of the GB partition allocated to the i-th processor (measured in r×r blocks) • NGB = the number of generalized blocks • => if the area of each GB partition is proportional to the speed of the owning processor, the load will be perfectly balanced
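The balance argument can be checked numerically: if processor i updates ni×NGB blocks per step and runs at speed si, its time per step is proportional to ni×NGB/si. The following sketch uses made-up areas and speeds purely for illustration; `step_times` is a hypothetical helper.

```python
def step_times(areas, speeds, n_gb):
    """Time per step for each processor, in units of 'one block update at speed 1'.

    areas[i]  -- n_i, area of the GB partition of processor i (in r-by-r blocks)
    speeds[i] -- s_i, constant speed of processor i
    n_gb      -- N_GB, number of generalized blocks
    """
    return [a * n_gb / s for a, s in zip(areas, speeds)]


# Areas proportional to speeds: every processor takes the same time.
balanced = step_times([6, 3, 3], [2, 1, 1], n_gb=4)
# Equal areas on the same processors: the slow ones take longer.
unbalanced = step_times([4, 4, 4], [2, 1, 1], n_gb=4)
```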
Partitioning matrices with constant models of heterogeneous processors (ctd) • A generalized block from the partitioning point of view • An integer-valued rectangle • If we need an asymptotically optimal solution, the problem can be reduced to a geometrical problem of optimal partitioning of a real-valued rectangle • The asymptotically optimal integer-valued solution can be obtained by rounding off an optimal real-valued solution of the geometrical partitioning problem
Geometrical partitioning problem • The general geometrical partitioning problem • Given a set of p processors P1, P2, ..., Pp, the relative speed of each of which is characterized by a positive constant, si ($\sum_{i=1}^{p} s_i = 1$), • Partition a unit square into p rectangles so that • There is a one-to-one mapping between the rectangles and the processors • The area of the rectangle allocated to processor Pi is equal to si • The partitioning minimizes $\sum_{i=1}^{p} (w_i + h_i)$, where wi is the width and hi is the height of the rectangle allocated to processor Pi
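To make the objective concrete, the sketch below evaluates the sum of half-perimeters for two hand-built partitionings of the unit square for relative speeds 0.5, 0.25, 0.25. Both give each processor the right area, but they differ in communication cost. The example layouts and the function name are illustrative assumptions.

```python
def sum_half_perimeters(rects):
    """Sum of (width + height) over a list of (width, height) rectangles."""
    return sum(w + h for w, h in rects)


# Partitioning A: three full-height vertical strips.
strips = [(0.5, 1.0), (0.25, 1.0), (0.25, 1.0)]
# Partitioning B: left half for the fast processor, right half split in two.
mixed = [(0.5, 1.0), (0.5, 0.5), (0.5, 0.5)]

areas_a = [w * h for w, h in strips]
areas_b = [w * h for w, h in mixed]
cost_a = sum_half_perimeters(strips)
cost_b = sum_half_perimeters(mixed)
```

Both layouts balance the load, yet B has the smaller sum of half-perimeters, which is why the formulation needs the minimization criterion in addition to the area constraint.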
Geometrical partitioning problem (ctd) • Motivation behind the formulation • Proportionality of the areas to the speeds • Balancing the load of the processors • Minimization of the sum of half-perimeters • Multiple partitionings can balance the load • Minimizes the total volume of communications • At each step of MM, each receiving processor receives data ~ the half-perimeter of its rectangle • => In total, the communicated data ~ $\sum_{i=1}^{p} (w_i + h_i)$
Geometrical partitioning problem (ctd) • Motivation behind the formulation (ctd) • Option: minimizing the maximal half-perimeter • Parallel communications • The use of a unit square instead of a rectangle • No loss of generality • The optimal solution for an arbitrary rectangle is obtained by straightforward scaling of that for the unit square • Proposition. The general geometrical partitioning problem is NP-complete.
Restricted geometrical partitioning problems • Restricted problems having polynomial solutions • Column-based • Grid-based • Column-based partitioning • Rectangles make up columns • Has an optimal solution of complexity $O(p^3)$
Column-based partitioning problem
Column-based partitioning problem (ctd) • A more restricted form of the column-based partitioning problem • The processors are already arranged into a set of columns • Algorithm 1: Optimal partitioning of a unit square between p heterogeneous processors arranged into c columns, each of which is made of rj processors, j=1,…,c: • Let the relative speed of the i-th processor from the j-th column, Pij, be sij. • Then, we first partition the unit square into c vertical rectangular slices such that the width of the j-th slice is $w_j = \sum_{i=1}^{r_j} s_{ij}$
Column-based partitioning problem (ctd) • Algorithm 1 (ctd): • Second, each vertical slice is partitioned independently into rectangles in proportion to the speeds of the processors in the corresponding processor column. • Algorithm 1 is of linear complexity.
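The two steps of Algorithm 1 can be sketched in a few lines. The sketch assumes the relative speeds sum to 1 and takes the width of each slice to be the total speed of its column, which is what the area constraint forces for a full-height slice. The function name and the (x, y, width, height) output layout are illustrative choices.

```python
def partition_columns(columns):
    """Partition the unit square for processors already arranged in columns.

    columns -- list of columns, each a list of relative speeds s_ij;
               all speeds together are assumed to sum to 1.
    Returns one list per column of rectangles (x, y, width, height).
    """
    result = []
    x = 0.0
    for col in columns:
        width = sum(col)              # step 1: slice width = total column speed
        y = 0.0
        rects = []
        for s in col:
            height = s / width        # step 2: split the slice in proportion to speed
            rects.append((x, y, width, height))
            y += height
        result.append(rects)
        x += width
    return result


layout = partition_columns([[0.25, 0.25], [0.5]])
```

Each speed is visited once, which matches the linear complexity stated for the algorithm.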
Grid-based partitioning problem • Grid-based partitioning problem • The heterogeneous processors form a two-dimensional grid - There exist p and q such that any vertical line crossing the unit square will pass through exactly p rectangles and any horizontal line crossing the square will pass through exactly q rectangles
Grid-based partitioning problem (ctd) • Proposition. Let a grid-based partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p=r×c. Then, the sum of half-perimeters of the rectangles of the partitioning will be equal to (r+c). • The shape r×c of the processor grid formed by any optimal grid-based partitioning will minimize (r+c). • The sum of half-perimeters of the rectangles of the optimal grid-based partitioning does not depend on the mapping of the processors onto the nodes of the grid.
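The (r+c) value follows from a short count. The notation here is introduced for illustration and is not from the slides: $w_j$ is the width of column $j$ and $h_{ij}$ is the height of the $i$-th rectangle in column $j$, so that the widths sum to 1 across the square and the heights sum to 1 within each column.

```latex
\begin{align*}
\sum_{j=1}^{c}\sum_{i=1}^{r} \left( w_j + h_{ij} \right)
  &= r \sum_{j=1}^{c} w_j \;+\; \sum_{j=1}^{c} \sum_{i=1}^{r} h_{ij} \\
  &= r \cdot 1 \;+\; c \cdot 1 \;=\; r + c .
\end{align*}
```

Neither term involves the individual speeds, which is why the sum is independent of how the processors are mapped onto the grid nodes.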
Grid-based partitioning problem (ctd) • Algorithm 2: Optimal grid-based partitioning of a unit square between p heterogeneous processors: • Step 1: Find the optimal shape r×c of the processor grid such that p=r×c and (r+c) is minimal. • Step 2: Map the processors onto the nodes of the grid. • Step 3: Apply Algorithm 1 of the optimal partitioning of the unit square to this r×c arrangement of the p heterogeneous processors. • The correctness of Algorithm 2 is obvious. • Algorithm 2 returns a column-based partitioning.
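Steps 2 and 3 can be sketched for a grid shape that is assumed to be already known from Step 1. The column-by-column mapping used here is an arbitrary choice, and the function names are hypothetical. The sketch also lets one check numerically that the sum of half-perimeters equals r+c whatever the mapping.

```python
def grid_partition(speeds, r, c):
    """Steps 2-3 of the grid-based algorithm for a given shape r x c (p = r * c).

    Returns c columns, each a list of r (width, height) rectangles.
    """
    assert len(speeds) == r * c
    total = sum(speeds)
    rel = [s / total for s in speeds]          # relative speeds, summing to 1
    # Step 2 (assumed mapping): fill the grid column by column, in input order.
    columns = [rel[j * r:(j + 1) * r] for j in range(c)]
    # Step 3: slice width = total speed of the column; each slice is then split
    # vertically in proportion to the speeds inside it.
    partition = []
    for col in columns:
        width = sum(col)
        partition.append([(width, s / width) for s in col])
    return partition


def sum_half_perimeters(partition):
    """Sum of (width + height) over all rectangles of the partition."""
    return sum(w + h for col in partition for (w, h) in col)


first = grid_partition([1, 1, 2, 4], r=2, c=2)
second = grid_partition([4, 1, 2, 1], r=2, c=2)   # same speeds, different mapping
```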
Grid-based partitioning problem (ctd) • The optimal grid-based partitioning can be seen as a restricted form of column-based partitioning.
Grid-based partitioning problem (ctd) • Algorithm 3: Finding r and c such that p=r×c and (r+c) is minimal: r = ⌊√p⌋; while (r > 1) if ((p mod r) == 0) goto stop; else r--; stop: c = p / r;
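The search for the grid shape translates directly into a loop without `goto`. The sketch takes r = ⌊√p⌋ as the starting value; this is treated as an assumption, being the natural choice that makes the downward search return the factor pair closest to a square. The function name is hypothetical.

```python
import math


def optimal_grid_shape(p):
    """Return (r, c) with p == r * c, r <= c, and r + c minimal."""
    r = math.isqrt(p)        # assumed starting point: largest r with r * r <= p
    while r > 1:
        if p % r == 0:       # r divides p: the closest-to-square shape is found
            break
        r -= 1
    c = p // r
    return r, c
```

For a prime p the loop runs down to r = 1 and the grid degenerates into a single row of p processors.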
Grid-based partitioning problem (ctd) • Proposition. Algorithm 3 is correct. • Proposition. The complexity of Algorithm 2 can be bounded by $O(p^{3/2})$.
Experimental results • Specifications of sixteen Linux computers on which the matrix multiplication is executed
Experimental results (ctd)
Application to Cartesian partitioning • Cartesian partitioning: • A column-based partitioning, the rectangles of which make up rows.
Application to Cartesian partitioning (ctd) • Cartesian partitioning • Plays an important role in the design of heterogeneous parallel algorithms (e.g., in scalable algorithms) • The Cartesian partitioning problem • Very difficult • There may be no Cartesian partitioning that perfectly balances the load of the processors
Application to Cartesian partitioning (ctd) • Cartesian partitioning problem in general form • Given p processors, the speed of each of which is characterized by a given positive constant, • Find a Cartesian partitioning of a unit square such that • There is 1-to-1 mapping between the rectangles and the processors • The partitioning minimizes
Application to Cartesian partitioning (ctd) • The Cartesian partitioning problem • Not even studied in the general form. • If the shape r×c is given, it has been proved NP-complete. • It is unclear whether a polynomial algorithm exists when both the shape and the processors’ mapping are given • There exists an optimal Cartesian partitioning with processors arranged in a non-increasing order of speed
Application to Cartesian partitioning (ctd) • Approximate solutions of the Cartesian partitioning problem are based on the observation • Let the speed matrix {sij} of the given r×c processor arrangement be rank-one • Then there exists a Cartesian partitioning perfectly balancing the load of the processors
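The rank-one observation can be illustrated with a small sketch. For a rank-one speed matrix, taking row heights and column widths proportional to the row sums and column sums gives rectangles whose areas are exactly proportional to the speeds. The construction via normalized row and column sums is one way to exhibit such a partitioning and is an assumption of this sketch, as is the function name.

```python
def cartesian_from_rank_one(speed):
    """Heights and widths of a Cartesian partitioning of the unit square.

    speed -- r x c matrix (list of rows) of positive speeds, assumed rank-one,
             i.e. speed[i][j] == a[i] * b[j] for some vectors a and b.
    Returns (heights, widths); the processor at grid node (i, j) gets the
    rectangle heights[i] x widths[j].
    """
    total = sum(sum(row) for row in speed)
    heights = [sum(row) / total for row in speed]
    widths = [sum(row[j] for row in speed) / total
              for j in range(len(speed[0]))]
    return heights, widths


# Rank-one example: a = [2, 1], b = [1, 2].
s = [[2, 4],
     [1, 2]]
h, w = cartesian_from_rank_one(s)
```

If the matrix is not rank-one, the same formulas still produce a valid Cartesian partitioning, but the areas no longer match the speeds exactly, which is where the approximate algorithms come in.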
Application to Cartesian partitioning (ctd) • Algorithm 5: Finding an approximate solution of the simplified Cartesian problem (when only the shape r×c is given): • Step 1: Arrange the processors in a non-increasing order of speed • Step 2: For this arrangement, let and be the parameters of the partitioning • Step 3: Calculate the areas hi×wj of the rectangles of this partitioning
Application to Cartesian partitioning (ctd) • Algorithm 5: Finding an approximate solution of the simplified Cartesian problem when only the shape r×c is given (ctd): • Step 4: Re-arrange the processors so that • Step 5: If Step 4 does not change the arrangement of the processors, then return the current partitioning and stop the procedure; else go to Step 2
Application to Cartesian partitioning (ctd) • Proposition. Let a Cartesian partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p=r×c. Then, the sum of half-perimeters of the rectangles of the partitioning will be (r+c). • The proof is a trivial exercise • Minimization of the communication cost does not depend on the speeds of the processors but only on their number • => minimization of communication cost and minimization of computation cost are two independent problems • Any Cartesian partitioning minimizing (r+c) will optimize communication cost
Application to Cartesian partitioning (ctd) • Now we can extend Algorithm 5 • By adding the 0-th step, finding the optimal r×c • The modified algorithm returns an approximate solution of the extended Cartesian problem • Aimed at minimization of both computation and communication cost • The modified Algorithm 5 will return an optimal solution if the speed matrix for the arrangement is a rank-one matrix