
Lecture 9 Architecture Independent (MPI) Algorithm Design



  1. Lecture 9: Architecture Independent (MPI) Algorithm Design. Parallel Computing, Fall 2008

  2. Matrix Computations
• SPMD program design stipulates that each processor executes a single program on a different piece of data. For matrix computations it therefore makes sense to distribute a matrix evenly among the p processors of a parallel computer. Such a distribution should also take into account how the matrix is stored (by, say, the compiler), so that locality is exploited (filling cache lines efficiently to speed up computation). There are various ways to divide a matrix; some of the most common ones are described below.
• One way to distribute a matrix is to use block distributions. Split the array into blocks of size n/p1 × n/p2, where p = p1 × p2, and assign the i-th block to processor i. This distribution is suitable as long as the amount of work is the same for all elements of the matrix.
• The most common block distributions are:
• Column-wise (block) distribution. Split the matrix into p column stripes so that n/p consecutive columns form the i-th stripe, which is stored on processor i. This is the case p1 = 1 and p2 = p.
• Row-wise (block) distribution. Split the matrix into p row stripes so that n/p consecutive rows form the i-th stripe, which is stored on processor i. This is the case p1 = p and p2 = 1.
• Block or square distribution. This is the case p1 = p2 = √p, i.e. the blocks are of size n/√p × n/√p, and block i is stored on processor i.
• There are certain cases (e.g. LU decomposition, Cholesky factorization) where the amount of work differs for different elements of the matrix. For these cases block distributions are not suitable. (A small sketch of the block-to-processor mapping follows below.)
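
The element-to-processor mapping under a general block distribution can be written down directly. The C sketch below is illustrative only: the helper name, the row-major numbering of processors, and divisibility of the matrix dimensions by p1 and p2 are assumptions, not part of the slides.

/* Owner of element (i, j) of an n1 x n2 matrix under a p1 x p2 block
 * distribution, with blocks numbered row-major (p = p1 * p2 processors). */
int block_distribution_owner(int i, int j, int n1, int n2, int p1, int p2)
{
    int rows_per_block = n1 / p1;     /* assumes n1 divisible by p1 */
    int cols_per_block = n2 / p2;     /* assumes n2 divisible by p2 */
    int bi = i / rows_per_block;      /* block row index    */
    int bj = j / cols_per_block;      /* block column index */
    return bi * p2 + bj;
}
/* Column-wise striping is the special case p1 = 1, p2 = p; row-wise striping
 * is p1 = p, p2 = 1; square blocks correspond to p1 = p2 = sqrt(p). */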

  3. Matrix block distributions

  4. Matrix-Vector Multiplication
• Sequential algorithm: the running time is O(n²), using n² multiplications and additions.

MAT_VECT(A, x, y)
{
  for i = 0 to n-1 do {
    y[i] = 0;
    for j = 0 to n-1 do
      y[i] = y[i] + A[i][j] * x[j];
  }
}

  5. Matrix-Vector Multiplication: Rowwise 1-D Partitioning
• Assume p = n (p – number of processes).
• Steps:
• Step 1: Initial partition of matrix and vector:
• Matrix distribution: each process gets one complete row of the matrix.
• Vector distribution: the n × 1 vector is distributed so that each process owns one of its elements.
• Step 2: All-to-all broadcast
• Every process has one element of the vector, but every process needs the entire vector.
• Step 3: Computation
• Process Pi computes y[i] = Σ_{j=0..n-1} A[i][j] · x[j].
• Running time:
• All-to-all broadcast: Θ(n) on any architecture.
• Multiplying a single row of A with the vector x is Θ(n).
• Total running time is Θ(n).
• Total work is Θ(n²) – cost-optimal.

  6. Matrix-Vector Multiplication: Rowwise 1-D Partitioning

  7. Matrix-Vector Multiplication: Rowwise 1-D Partitioning
• Assume p < n (p – number of processes).
• Three steps:
• Initial partition of matrix and vector: each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.
• All-to-all broadcast: among p processes, with messages of size n/p.
• Computation: each process multiplies its n/p rows of the matrix with the vector x to produce n/p elements of the result vector.
• Running time:
• All-to-all broadcast:
  T = (ts + tw·n/p)(p − 1) on any architecture
  T = ts·log p + tw·(n/p)(p − 1) on a hypercube
• Computation: T = n·(n/p) = Θ(n²/p)
• Total running time: T = Θ(n²/p + ts·log p + tw·n)
• Total work: W = Θ(n² + ts·p·log p + tw·n·p) – cost-optimal. (An MPI sketch of this algorithm follows below.)
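
A minimal MPI sketch of the rowwise scheme, assuming n is divisible by p and row-major storage of the local rows; the function and variable names are illustrative, not from the slides. The all-to-all broadcast of the vector is realized with MPI_Allgather.

#include <mpi.h>
#include <stdlib.h>

/* A_local: n/p rows of A (row-major, each of length n);
 * x_local: the n/p vector elements owned by this process;
 * y_local: receives the n/p result elements owned by this process. */
void matvec_rowwise(const double *A_local, const double *x_local,
                    double *y_local, int n, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int local_n = n / p;                      /* rows and vector elements per process */

    /* Step 2: all-to-all broadcast – every process gathers the full vector. */
    double *x_full = malloc(n * sizeof(double));
    MPI_Allgather(x_local, local_n, MPI_DOUBLE,
                  x_full,  local_n, MPI_DOUBLE, comm);

    /* Step 3: multiply the local rows of A with the full vector. */
    for (int i = 0; i < local_n; i++) {
        y_local[i] = 0.0;
        for (int j = 0; j < n; j++)
            y_local[i] += A_local[i * n + j] * x_full[j];
    }
    free(x_full);
}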

  8. Matrix-Vector Multiplication: Columnwise 1-D Partitioning • Similar to rowwise 1-D Partitioning

  9. Matrix-Vector Multiplication: 2-D Partitioning
• Assume p = n².
• Steps:
• Step 1: Initial partitioning
• Each process gets one element of the matrix.
• The vector is distributed only among the processes on the diagonal, each of which owns one element.
• Step 2: Broadcast
• The i-th element of the vector must be available to the i-th element of each row of the matrix, i.e. to every process in the i-th column. So this step consists of n simultaneous one-to-all broadcast operations, one in each column of processes.
• Step 3: Computation
• Each process multiplies its matrix element with the corresponding element of x.
• Step 4: All-to-one reduction of partial results
• The products computed for each row must be added, leaving the sums in the last column of processes.
• Running time:
• One-to-all broadcast: Θ(log n)
• Computation in each process: Θ(1)
• All-to-one reduction: Θ(log n)
• Total running time: Θ(log n)
• Total work: Θ(n² log n) – not cost-optimal

  10. Matrix-Vector Multiplication: 2-D Partitioning

  11. Matrix-Vector Multiplication: 2-D Partitioning
• Assume p < n².
• Steps:
• Step 1: Initial partitioning
• Each process gets an (n/√p) × (n/√p) block of the matrix.
• The vector is distributed only among the processes on the diagonal, each of which owns n/√p elements.
• Step 2: Columnwise one-to-all broadcast
• The i-th group of vector elements must be available to every process in the i-th column of the mesh. So this step consists of √p simultaneous one-to-all broadcast operations, one in each column of processes.
• Step 3: Computation
• Each process multiplies its n²/p matrix elements with the corresponding elements of x.
• Step 4: All-to-one reduction of partial results
• The products computed for each row of processes must be added, leaving the sums in the last column of processes.
• Running time:
• Columnwise one-to-all broadcast: T = (ts + tw·n/√p) log √p on any architecture
• Computation in each process: T = (n/√p)·(n/√p) = n²/p
• All-to-one reduction: T = (ts + tw·n/√p) log √p on any architecture
• Total running time: T = n²/p + 2(ts + tw·n/√p) log √p on any architecture. (An MPI sketch of this scheme follows below.)
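
A minimal MPI sketch of the 2-D scheme, assuming p is a perfect square, n is divisible by √p, and a row-major numbering of the √p × √p process mesh; all names (matvec_2d, row_comm, col_comm, ...) are illustrative. Column and row communicators are created with MPI_Comm_split, the vector pieces are broadcast down each column from that column's diagonal process, and the partial results are reduced across each row into its last process.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* A_block: local (n/√p) x (n/√p) block of A (row-major);
 * x_piece: buffer of n/√p entries, holding the vector piece on diagonal
 *          processes and receiving it on all others;
 * y_piece: result piece, valid on the last column of processes. */
void matvec_2d(const double *A_block, double *x_piece, double *y_piece,
               int n, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);     /* mesh dimension √p */
    int b = n / q;                            /* block size n/√p   */
    int my_row = rank / q, my_col = rank % q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, my_row, my_col, &row_comm);  /* same mesh row    */
    MPI_Comm_split(comm, my_col, my_row, &col_comm);  /* same mesh column */

    /* Step 2: one-to-all broadcast down each column, rooted at the diagonal
     * process of that column (the process whose row index equals my_col). */
    MPI_Bcast(x_piece, b, MPI_DOUBLE, my_col, col_comm);

    /* Step 3: local block-vector product. */
    double *partial = malloc(b * sizeof(double));
    for (int i = 0; i < b; i++) {
        partial[i] = 0.0;
        for (int j = 0; j < b; j++)
            partial[i] += A_block[i * b + j] * x_piece[j];
    }

    /* Step 4: all-to-one reduction across each row; the sums end up on the
     * last column of processes. */
    MPI_Reduce(partial, y_piece, b, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

    free(partial);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}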

  12. Matrix-Vector Multiplication: 1-D Partitioning vs. 2-D Partitioning
• Matrix-vector multiplication is faster with block 2-D partitioning of the matrix than with block 1-D partitioning for the same number of processes.
• If the number of processes is greater than n, then 1-D partitioning cannot be used.
• If the number of processes is less than or equal to n, 2-D partitioning is still preferable.

  13. Matrix Distributions: Block cyclic
• In block-cyclic distributions the rows (similarly, the columns) are split into q groups of n/q consecutive rows per group, where potentially q > p, and the i-th group is assigned to a processor in a cyclic fashion.
• Column-cyclic distribution. This is a one-dimensional cyclic distribution. Split the matrix into q column stripes so that n/q consecutive columns form the i-th stripe, which is stored on processor i % p. The symbol % is the mod (remainder of division) operator. Usually q > p. Sometimes the term wrapped-around column distribution is used for the case n/q = 1, i.e. q = n.
• Row-cyclic distribution. This is a one-dimensional cyclic distribution. Split the matrix into q row stripes so that n/q consecutive rows form the i-th stripe, which is stored on processor i % p. Usually q > p. Sometimes the term wrapped-around row distribution is used for the case n/q = 1, i.e. q = n.
• Scattered distribution. Let the p = qi · qj processors be divided into qj groups, each group Pj consisting of qi processors. In particular, Pj = { j·qi + l | 0 ≤ l ≤ qi − 1 }. Processor j·qi + l is called the l-th processor of group Pj. This way matrix element (i, j), 0 ≤ i, j < n, is assigned to the (i mod qi)-th processor of group P(j mod qj). A scattered distribution refers to the special case qi = qj = √p. (A small owner-computation sketch follows below.)
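
The owner computations described above can be made concrete with a short C sketch; the helper names are hypothetical, and n is assumed divisible by q.

/* Column-cyclic: stripes of n/q consecutive columns, stripe i on processor i % p. */
int owner_column_cyclic(int col, int n, int q, int p)
{
    int stripe_width = n / q;           /* columns per stripe                    */
    int stripe = col / stripe_width;    /* index of the stripe containing 'col'  */
    return stripe % p;                  /* stripes are assigned cyclically       */
}

/* Scattered: element (i, j) goes to the (i mod qi)-th processor of group
 * P(j mod qj), i.e. to processor (j mod qj) * qi + (i mod qi). */
int owner_scattered(int i, int j, int qi, int qj)
{
    return (j % qj) * qi + (i % qi);
}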

  14. Block cyclic distributions

  15. Scattered Distribution

  16. Matrix Multiplication – Serial algorithm

  17. Matrix Multiplication
• The algorithm for matrix multiplication presented below goes back to the seminal work of Valiant. It works for p ≤ n².
• Three steps:
• Initial partitioning: Matrices A and B are partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size n/√p × n/√p each. These blocks are mapped onto a √p × √p logical mesh of processes. The processes are labeled from P0,0 to P√p−1,√p−1.
• All-to-all broadcasting: Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p. To acquire all the required blocks, an all-to-all broadcast of matrix A's blocks is performed in each row of processes, and an all-to-all broadcast of matrix B's blocks is performed in each column.
• Computation: After Pi,j acquires Ai,0, Ai,1, …, Ai,√p−1 and B0,j, B1,j, …, B√p−1,j, it performs the submatrix multiplication and addition step of lines 7 and 8 in Alg 8.3. (A sketch of this step follows below.)
• Running time:
• All-to-all broadcast:
  T = (ts + tw·n²/p)(√p − 1) on any architecture
  T = ts·log √p + tw·(n²/p)(√p − 1) on a hypercube
• Computation:
  T = √p·(n/√p)³ = n³/p
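
The computation step (the submatrix multiply-and-add of lines 7–8 in Alg 8.3) accumulates Ci,j = Σk Ai,k · Bk,j over the √p gathered block pairs. The C sketch below is illustrative; the buffer layout (consecutive row-major b × b blocks, with b = n/√p) and the function name are assumptions.

#include <stddef.h>

/* C_block must be zero-initialized; A_blocks holds Ai,0..Ai,√p-1 and
 * B_blocks holds B0,j..B√p-1,j, each stored as a row-major b x b block. */
void local_block_multiply(const double *A_blocks, const double *B_blocks,
                          double *C_block, int b, int sqrt_p)
{
    for (int k = 0; k < sqrt_p; k++) {
        const double *A = A_blocks + (size_t)k * b * b;   /* block Ai,k */
        const double *B = B_blocks + (size_t)k * b * b;   /* block Bk,j */
        for (int r = 0; r < b; r++)
            for (int c = 0; c < b; c++) {
                double sum = 0.0;
                for (int t = 0; t < b; t++)
                    sum += A[r * b + t] * B[t * b + c];
                C_block[r * b + c] += sum;                /* Ci,j += Ai,k * Bk,j */
            }
    }
}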

  18. Matrix Multiplication
• The input matrices A and B are divided into p block-submatrices, each one of dimension m × m, where m = n/√p. We call this distribution of the input among the processors block distribution. This way, element A(i, j), 0 ≤ i < n, 0 ≤ j < n, belongs to the ((j/m)·√p + (i/m))-th block, which is subsequently assigned to the memory of the same-numbered processor.
• Let Ai (respectively, Bi) denote the i-th block of A (respectively, B) stored on processor i. With these conventions the algorithm can be described as in Figure 1. The following Proposition describes the performance of the aforementioned algorithm. (A small sketch of the block-numbering computation follows below.)
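
As a quick illustration of this mapping, the block (and processor) number of an element can be computed as follows. The helper is hypothetical; integer division is intentional and n is assumed divisible by √p.

int block_owner(int i, int j, int n, int sqrt_p)
{
    int m = n / sqrt_p;                   /* block dimension m = n/√p         */
    return (j / m) * sqrt_p + (i / m);    /* column-major numbering of blocks */
}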

  19. Matrix Multiplication
