
On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors

This paper discusses the partitioning of matrices over a grid of heterogeneous processors so that the computational load is balanced and the total volume of communication is minimized. It builds on the 2D block-cyclic ScaLAPACK algorithm and proposes a generalized block partitioning approach.


Presentation Transcript


  1. On Grid-based Matrix Partitioning for Networks of Heterogeneous Processors Alexey Lastovetsky School of Computer Science and Informatics University College Dublin Alexey.Lastovetsky@ucd.ie ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  2. Heterogeneous parallel computing • Heterogeneity of processors • The processors run at different speeds • An even distribution of computations does not balance the processors’ load • The performance is determined by the slowest processor • Data must be distributed unevenly • So that each processor performs a volume of computation proportional to its speed ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  3. Constant performance models of heterogeneous processors • The simplest performance model of heterogeneous processors • p, the number of processors • S={s1, s2, ..., sp}, the speeds of the processors (positive constants) • The speed can be • Absolute: the number of computational units performed by the processor per one time unit • Relative: the absolute speed normalized, e.g., so that s1+s2+...+sp=1 • Some use the (reciprocal of the) execution time of a serial benchmark instead ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  4. Data distribution problems with constant models of heterogeneous processors • Typical design of heterogeneous parallel algorithms • The problem of distributing computations in proportion to the speed of the processors • Becomes a problem of partitioning some mathematical object • Sets, matrices, graphs, geometric figures, etc. ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
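A minimal sketch (not from the presentation) of this basic distribution step: n independent computation units are shared among p processors in proportion to their speeds. The function name and the round-robin handling of leftover units are illustrative choices.

```c
#include <stdio.h>

/* Distribute n independent computation units over p processors in
 * proportion to their speeds: each processor first gets the floor of its
 * exact share, then the remaining units are handed out round-robin. */
void distribute(int n, int p, const double *speed, int *share) {
    double total = 0.0;
    for (int i = 0; i < p; i++) total += speed[i];
    int assigned = 0;
    for (int i = 0; i < p; i++) {
        share[i] = (int)(n * speed[i] / total);
        assigned += share[i];
    }
    for (int i = 0; assigned < n; i = (i + 1) % p, assigned++)
        share[i]++;
}

int main(void) {
    double speed[] = {8.0, 4.0, 2.0, 2.0};   /* relative or absolute speeds */
    int share[4];
    distribute(1000, 4, speed, share);
    for (int i = 0; i < 4; i++)
        printf("P%d gets %d units\n", i + 1, share[i]);
    return 0;
}
```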

  5. Partitioning matrices with constant models of heterogeneous processors • Matrices • The most widely used mathematical objects in scientific computing • The studied partitioning problems mainly deal with matrices • Matrix partitioning in one dimension over a 1D arrangement of processors • Often reduced to partitioning sets or well-ordered sets • Algorithm design often leads to matrix partitioning problems that do not impose the restriction of partitioning in one dimension • E.g., in parallel linear algebra for heterogeneous platforms • We will use matrix multiplication • A simple but very important linear algebra kernel ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  6. Partitioning matrices with constant models of heterogeneous processors (ctd) • A heterogeneous matrix multiplication algorithm • A modification of a homogeneous one • Most often, of the 2D block cyclic ScaLAPACK algorithm ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  7. Partitioning matrices with constant models of heterogeneous processors (ctd) • 2D block cyclic ScaLAPACK MM algorithm (ctd) ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  8. Partitioning matrices with constant models of heterogeneous processors (ctd) • 2D block cyclic ScaLAPACK MM algorithm (ctd) • The matrices are identically partitioned into rectangular generalized blocks of size (p×r)×(q×r) • Each generalized block forms a 2D p×q grid of r×r blocks • There is a 1-to-1 mapping between this grid of blocks and the p×q processor grid • At each step of the algorithm • Each processor not owning the pivot row and column receives horizontally (n/p)×r elements of matrix A and vertically (n/q)×r elements of matrix B • => in total, (n/p)×r + (n/q)×r = r×(n/p + n/q) elements, i.e., ~ the half-perimeter of the rectangle area allocated to the processor ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
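For reference, a small sketch of the homogeneous 2D block-cyclic mapping that the heterogeneous modifications start from: the r×r block with global block coordinates (I, J) is owned by processor (I mod p, J mod q) of the p×q grid. Function and variable names are illustrative.

```c
#include <stdio.h>

typedef struct { int row; int col; } Proc;

/* Owner of the r×r block with global block coordinates (I, J) under the
 * homogeneous 2D block-cyclic distribution over a p×q processor grid. */
Proc owner_of_block(int I, int J, int p, int q) {
    Proc who = { I % p, J % q };
    return who;
}

int main(void) {
    int p = 3, q = 4, r = 32;      /* 3×4 processor grid, 32×32 blocks */
    int i = 1000, j = 700;         /* global element coordinates       */
    Proc who = owner_of_block(i / r, j / r, p, q);
    printf("element (%d,%d) lives on processor (%d,%d)\n", i, j, who.row, who.col);
    return 0;
}
```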

  9. Partitioning matrices with constant models of heterogeneous processors (ctd) • General design of heterogeneous modifications • Matrices A, B, and C are identically partitioned into equal rectangular generalized blocks • The generalized blocks are identically partitioned into rectangles so that • There is one-to-one mapping between the rectangles and the processors • The area of each rectangle is (approximately) proportional to the speed of the processor that owns the rectangle • Then, the algorithm follows the steps of its homogeneous prototype ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  10. Partitioning matrices with constant models of heterogeneous processors (ctd) ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  11. Partitioning matrices with constant models of heterogeneous processors (ctd) • Why partition the GBs in proportion to the speed • At each step, updating one r×r block of matrix C needs the same amount of computation for all the blocks • => the load will be perfectly balanced if the number of blocks updated by each processor is proportional to its speed • The number = ni×NGB • ni = the area of the GB partition allocated to the i-th processor (measured in r×r blocks), NGB = the number of generalized blocks • => if the area of each GB partition is ~ to the speed of the owning processor, the load will be perfectly balanced ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  12. Partitioning matrices with constant models of heterogeneous processors (ctd) • A generalized block from the partitioning POV • An integer-valued rectangle • If we need an asymptotically optimal solution, the problem can be reduced to a geometrical problem of optimal partitioning of a real-valued rectangle • The asymptotically optimal integer-valued solution can be obtained by rounding off an optimal real-valued solution of the geometrical partitioning problem ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  13. Geometrical partitioning problem • The general geometrical partitioning problem • Given a set of p processors P1, P2, ..., Pp, the relative speed of each of which is characterized by a positive constant si (s1+s2+...+sp=1), • Partition a unit square into p rectangles so that • There is one-to-one mapping between the rectangles and the processors • The area of the rectangle allocated to processor Pi is equal to si • The partitioning minimizes the sum of half-perimeters Σ(wi+hi), where wi is the width and hi is the height of the rectangle allocated to processor Pi ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
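A small helper sketch (illustrative, not from the paper) that evaluates a candidate solution of this problem: it computes the objective, the sum of half-perimeters, and checks that the rectangle areas match the relative speeds.

```c
#include <math.h>
#include <stdio.h>

/* Objective of the geometrical partitioning problem: the sum of the
 * half-perimeters w[i] + h[i] of the p rectangles. */
double half_perimeter_sum(int p, const double *w, const double *h) {
    double sum = 0.0;
    for (int i = 0; i < p; i++) sum += w[i] + h[i];
    return sum;
}

/* Feasibility: the area of rectangle i must equal the relative speed s[i]
 * of processor Pi, up to a tolerance. */
int areas_match_speeds(int p, const double *w, const double *h,
                       const double *s, double tol) {
    for (int i = 0; i < p; i++)
        if (fabs(w[i] * h[i] - s[i]) > tol) return 0;
    return 1;
}

int main(void) {
    /* two rectangles: a 0.6-wide and a 0.4-wide column of full height */
    double w[] = {0.6, 0.4}, h[] = {1.0, 1.0}, s[] = {0.6, 0.4};
    printf("feasible: %d, objective: %.2f\n",
           areas_match_speeds(2, w, h, s, 1e-9), half_perimeter_sum(2, w, h));
    return 0;
}
```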

  14. Geometrical partitioning problem (ctd) • Motivation behind the formulation • Proportionality of the areas to the speeds • Balancing the load of the processors • Minimization of the sum of half-perimeters • Multiple partitionings can balance the load • Minimizes the total volume of communications • At each step of MM, each receiving processor receives data ~ the half-perimeter of its rectangle • => In total, the communicated data ~ Σ(wi+hi) ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  15. Geometrical partitioning problem (ctd) • Motivation behind the formulation (ctd) • An alternative objective: minimizing the maximal half-perimeter • Appropriate when the communications are performed in parallel • The use of a unit square instead of a rectangle • No loss of generality • The optimal solution for an arbitrary rectangle is obtained by straightforward scaling of that for the unit square • Proposition. The general geometrical partitioning problem is NP-complete. ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  16. Restricted geometrical partitioning problems • Restricted problems having polynomial solutions • Column-based • Grid-based • Column-based partitioning • The rectangles make up columns • An optimal column-based partitioning can be found in O(p^3) time ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  17. Column-based partitioning problem ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  18. Column-based partitioning problem (ctd) • A more restricted form of the column-based partitioning problem • The processors are already arranged into a set of columns • Algorithm 1: Optimal partitioning of a unit square between p heterogeneous processors arranged into c columns, each of which is made of rj processors, j=1,…,c: • Let the relative speed of the i-th processor from the j-th column, Pij, be sij. • Then, we first partition the unit square into c vertical rectangular slices such that the width of the j-th slice is the total relative speed of its column, wj = Σi sij ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  19. Column-based partitioning problem (ctd) • Algorithm 1 (ctd): • Second, each vertical slice is partitioned independently into rectangles in proportion to the speeds of the processors in the corresponding processor column. • Algorithm 1 is of linear complexity. ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
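A minimal sketch of Algorithm 1, assuming the relative speeds sum to 1 over all processors. The array layout (speeds[j][i] is the speed of the i-th processor of the j-th column), the fixed maximum sizes, and the function name are illustrative choices.

```c
#include <stdio.h>

#define MAX_COL 16
#define MAX_ROW 16

/* Algorithm 1: the j-th vertical slice gets a width equal to the total
 * speed of its column; inside the slice, the heights are proportional to
 * the processor speeds, so every rectangle has an area exactly equal to
 * the relative speed of its processor. */
void column_based_partition(int c, const int *rows,
                            double speeds[MAX_COL][MAX_ROW],
                            double width[MAX_COL],
                            double height[MAX_COL][MAX_ROW]) {
    for (int j = 0; j < c; j++) {
        width[j] = 0.0;
        for (int i = 0; i < rows[j]; i++) width[j] += speeds[j][i];
        for (int i = 0; i < rows[j]; i++) height[j][i] = speeds[j][i] / width[j];
    }
}

int main(void) {
    int rows[2] = {2, 3};                                   /* 2 + 3 = 5 processors */
    double speeds[MAX_COL][MAX_ROW] = {{0.30, 0.20}, {0.25, 0.15, 0.10}};
    double width[MAX_COL], height[MAX_COL][MAX_ROW];
    column_based_partition(2, rows, speeds, width, height);
    for (int j = 0; j < 2; j++)
        for (int i = 0; i < rows[j]; i++)
            printf("P%d%d: %.2f x %.2f\n", i + 1, j + 1, width[j], height[j][i]);
    return 0;
}
```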

  20. Grid-based partitioning problem • Grid-based partitioning problem • The heterogeneous processors form a two-dimensional grid • There exist r and c such that any vertical line crossing the unit square passes through exactly r rectangles and any horizontal line crossing the square passes through exactly c rectangles ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  21. Grid-based partitioning problem (ctd) • Proposition. Let a grid-based partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p=r×c. Then, the sum of half-perimeters of the rectangles of the partitioning will be equal to (r+c). • The shape r×c of the processor grid formed by any optimal grid-based partitioning will minimize (r+c). • The sum of half-perimeters of the rectangles of the optimal grid-based partitioning does not depend on the mapping of the processors onto the nodes of the grid. ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
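The first claim follows from a short calculation. In a grid-based partitioning the r row heights h1,…,hr and the c column widths w1,…,wc each sum to 1 over the unit square, so (a sketch in the obvious notation):

```latex
\sum_{i=1}^{r}\sum_{j=1}^{c} (h_i + w_j)
  \;=\; c\sum_{i=1}^{r} h_i \;+\; r\sum_{j=1}^{c} w_j
  \;=\; c\cdot 1 + r\cdot 1 \;=\; r + c .
```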

  22. Grid-based partitioning problem (ctd) • Algorithm 2: Optimal grid-based partitioning of a unit square between p heterogeneous processors: • Step 1: Find the optimal shape r×c of the processor grid such that p=r×c and (r+c) is minimal (Algorithm 3 below). • Step 2: Map the processors onto the nodes of the grid. • Step 3: Apply Algorithm 1 (the optimal column-based partitioning of the unit square) to this r×c arrangement of the p heterogeneous processors. • The correctness of Algorithm 2 is obvious. • Algorithm 2 returns a column-based partitioning. ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  23. Grid-based partitioning problem (ctd) • The optimal grid-based partitioning can be seen as a restricted form of column-based partitioning. ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  24. Grid-based partitioning problem (ctd) • Algorithm 3: Finding r and c such that p=r×c and (r+c) is minimal: r = floor(sqrt(p)); while (r > 1) { if ((p mod r) == 0) goto stop; else r--; } stop: c = p / r; ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
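A self-contained, runnable version of this search (a sketch; the function name is illustrative). Starting from ⌊√p⌋ and moving downwards, the first divisor found gives the most nearly square factorization, which minimizes r+c.

```c
#include <math.h>
#include <stdio.h>

/* Algorithm 3: find r and c with p = r*c and r + c minimal. */
void best_grid_shape(int p, int *r_out, int *c_out) {
    int r = (int)floor(sqrt((double)p));
    while (r > 1 && p % r != 0)
        r--;
    *r_out = r;
    *c_out = p / r;
}

int main(void) {
    int r, c;
    best_grid_shape(24, &r, &c);
    printf("24 processors -> %d x %d grid\n", r, c);   /* 4 x 6 */
    return 0;
}
```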

  25. Grid-based partitioning problem (ctd) • Proposition. Algorithm 3 is correct. • Proposition. The complexity of Algorithm 2 can be bounded by O(p^(3/2)). ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  26. Experimental results Specifications of sixteen Linux computers on which the matrix multiplication is executed ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  27. Experimental results (ctd) ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  28. Application to Cartesian partitioning • Cartesian partitioning: • A column-based partitioning, the rectangles of which also make up rows (so the rectangles form a grid). ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  29. Application to Cartesian partitioning (ctd) • Cartesian partitioning • Plays an important role in the design of heterogeneous parallel algorithms (e.g., in scalable algorithms) • The Cartesian partitioning problem • Very difficult • There may be no Cartesian partitioning that perfectly balances the load of the processors ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  30. Application to Cartesian partitioning (ctd) • Cartesian partitioning problem in general form • Given p processors, the speed of each of which is characterized by a given positive constant, • Find a Cartesian partitioning of a unit square such that • There is 1-to-1 mapping between the rectangles and the processors • The partitioning minimizes maxij (hi×wj)/sij, the relative execution time of the most loaded processor (hi and wj being the row heights and column widths of the partitioning) ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  31. Application to Cartesian partitioning (ctd) • The Cartesian partitioning problem • Not even studied in the general form • If the shape r×c is given, the problem is proved NP-complete • It is unclear whether there exists a polynomial algorithm when both the shape and the processors’ mapping are given • There exists an optimal Cartesian partitioning with the processors arranged in a non-increasing order of speed ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  32. Application to Cartesian partitioning (ctd) • Approximate solutions of the Cartesian partitioning problem are based on the observation • Let the speed matrix {sij} of the given r×c processor arrangement be rank-one • Then there exists a Cartesian partitioning perfectly balancing the load of the processors ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
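This observation can be checked directly. A sketch of the calculation, writing the rank-one speed matrix as sij = ui·vj and choosing row heights proportional to the ui and column widths proportional to the vj:

```latex
s_{ij} = u_i v_j, \qquad
h_i = \frac{u_i}{\sum_{k=1}^{r} u_k}, \qquad
w_j = \frac{v_j}{\sum_{l=1}^{c} v_l}
\;\Longrightarrow\;
h_i w_j \;=\; \frac{u_i v_j}{\bigl(\sum_k u_k\bigr)\bigl(\sum_l v_l\bigr)}
        \;=\; \frac{s_{ij}}{\sum_{k,l} s_{kl}} ,
```

so the area of each rectangle equals the relative speed of its processor and the load is perfectly balanced.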

  33. Application to Cartesian partitioning (ctd) • Algorithm 4: Finding an approximate solution of the simplified Cartesian problem (when only the shape r×c is given): • Step 1: Arrange the processors in a non-increasing order of speed • Step 2: For this arrangement, let hi = (Σj sij)/(Σi,j sij) and wj = (Σi sij)/(Σi,j sij) be the parameters of the partitioning • Step 3: Calculate the areas hi×wj of the rectangles of this partitioning ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  34. Application to Cartesian partitioning (ctd) • Algorithm 5: Finding an approximate solution of the simplified Cartesian problem when only the shape r×c is given (ctd): • Step 4: Re-arrange the processors so that the faster processors are assigned to the larger rectangles, i.e., sij ≥ skl whenever hi×wj ≥ hk×wl • Step 5: If Step 4 does not change the arrangement of the processors, then return the current partitioning and stop the procedure, else go to Step 2 ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
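A sketch of this iteration, under two assumptions that fill gaps in the transcript: the parameters of Step 2 are the normalized row and column sums of the speed matrix (exact when the matrix is rank-one, as in the derivation above), and Step 4 re-assigns processors so that the k-th fastest processor gets the k-th largest rectangle. Grid size, speeds, and names are illustrative.

```c
#include <stdio.h>

#define R 2
#define C 2
#define P (R * C)

/* Speeds in non-increasing order (Step 1). */
static const double speed[P] = {8.0, 5.0, 4.0, 2.0};

/* grid[i][j] = index of the processor currently placed at row i, column j. */
static int grid[R][C];

/* Steps 2-3 (assumed): row heights ~ row sums, column widths ~ column sums
 * of the speed matrix for the current arrangement. */
static void compute_parameters(double h[R], double w[C]) {
    double total = 0.0;
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            total += speed[grid[i][j]];
    for (int i = 0; i < R; i++) {
        h[i] = 0.0;
        for (int j = 0; j < C; j++) h[i] += speed[grid[i][j]];
        h[i] /= total;
    }
    for (int j = 0; j < C; j++) {
        w[j] = 0.0;
        for (int i = 0; i < R; i++) w[j] += speed[grid[i][j]];
        w[j] /= total;
    }
}

int main(void) {
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            grid[i][j] = i * C + j;          /* initial row-major arrangement */

    double h[R], w[C];
    int changed = 1;
    while (changed) {
        compute_parameters(h, w);

        /* Step 4 (assumed): the k-th fastest processor gets the k-th
         * largest rectangle h[i]*w[j], found by repeated selection. */
        int newgrid[R][C], claimed[R][C] = {{0}};
        for (int k = 0; k < P; k++) {
            int bi = 0, bj = 0;
            double best = -1.0;
            for (int i = 0; i < R; i++)
                for (int j = 0; j < C; j++)
                    if (!claimed[i][j] && h[i] * w[j] > best) {
                        best = h[i] * w[j];
                        bi = i; bj = j;
                    }
            claimed[bi][bj] = 1;
            newgrid[bi][bj] = k;             /* processor k is the k-th fastest */
        }

        /* Step 5: stop when the arrangement no longer changes. */
        changed = 0;
        for (int i = 0; i < R; i++)
            for (int j = 0; j < C; j++)
                if (newgrid[i][j] != grid[i][j]) { changed = 1; grid[i][j] = newgrid[i][j]; }
    }

    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            printf("P%d -> rectangle %.3f x %.3f\n", grid[i][j] + 1, w[j], h[i]);
    return 0;
}
```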

  35. Application to Cartesian partitioning (ctd) • Proposition. Let a Cartesian partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p=r×c. Then, the sum of half-perimeters of the rectangles of the partitioning will be (r+c). • Proof is a trivial exercise • Minimization of the communication cost does not depend on the speeds of the processors but only on their number • => minimization of communication cost and minimization of computation cost are two independent problems • Any Cartesian partitioning minimizing (r+c) will optimize communication cost ISPDC 2007, Hagenberg, Austria, 5-8 July 2007

  36. Application to Cartesian partitioning (ctd) • Now we can extend Algorithm 5 • By adding a 0-th step that finds the optimal shape r×c (e.g., with Algorithm 3) • The modified algorithm returns an approximate solution of the extended Cartesian problem • Aimed at minimizing both the computation and the communication cost • The modified Algorithm 5 will return an optimal solution if the speed matrix for the arrangement is a rank-one matrix ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
