Presentation Transcript


  1. CSc 8530: Matrix Multiplication and Transpose. By Jaman Bhola

  2. Outline • Matrix multiplication – one processor • Matrix multiplication – parallel • Algorithm • Example • Analysis • Matrix transpose • Algorithm • Example • Analysis

  3. Matrix multiplication – 1 processor • Using one processor – O(n^3) time • Algorithm:
  for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
          t = 0;
          for (k = 0; k < n; k++) {
              t = t + a[i][k] * b[k][j];
          }
          c[i][j] = t;
      }
  }

  4. Matrix multiplication – parallel • Using a hypercube: The algorithm given in the book assumes the multiplication of two n x n matrices, where n is a power of 2. • This facilitates a hypercube network. • It needs N = n^3 = 2^(3q) processors, where n = 2^q is the size of the matrix.

  5. With N processors, each processor occupies a vertex of the hypercube. • Each processor P_r has a given position, where r = in^2 + jn + k for 0 <= i, j, k <= n-1. • If r is represented in binary as r_(3q-1) r_(3q-2) … r_(2q) r_(2q-1) … r_q r_(q-1) … r_0, then the binary representations of i, j and k are r_(3q-1) … r_(2q), r_(2q-1) … r_q and r_(q-1) … r_0 respectively. • This positions the processors so that connected processors differ in exactly one binary digit of their index.

  6. Also, all processors that agree in one or two of the coordinates i, j, k form a subcube. • Example: building a hypercube for q = 1, we get N = n^3 = 2^(3q) = 8 processors. • For P_r, where r = in^2 + jn + k, we get:

  7.
          r                 i  j  k
  P0   r = 0 + 0 + 0 = 0    0  0  0
  P1   r = 0 + 0 + 1 = 1    0  0  1
  P2   r = 0 + 2 + 0 = 2    0  1  0
  P3   r = 0 + 2 + 1 = 3    0  1  1
  P4   r = 4 + 0 + 0 = 4    1  0  0
  P5   r = 4 + 0 + 1 = 5    1  0  1
  P6   r = 4 + 2 + 0 = 6    1  1  0
  P7   r = 4 + 2 + 1 = 7    1  1  1
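
  The bit-field decomposition can be checked with a few lines of C (a sketch of my own, not part of the book's algorithm; the variable names are illustrative): extracting the three q-bit fields of r reproduces the table above.

      /* Decompose processor index r = i*n^2 + j*n + k into its coordinates
         by splitting the 3q bits of r into three q-bit fields. */
      #include <stdio.h>

      int main(void) {
          int q = 1;                 /* n = 2^q = 2, N = 2^(3q) = 8 */
          int n = 1 << q;
          for (int r = 0; r < n * n * n; r++) {
              int i = (r >> (2 * q)) & (n - 1);  /* bits 3q-1 .. 2q */
              int j = (r >> q) & (n - 1);        /* bits 2q-1 .. q  */
              int k = r & (n - 1);               /* bits q-1  .. 0  */
              printf("P%d: i=%d j=%d k=%d\n", r, i, j, k);
          }
          return 0;
      }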

  8. [Figure: the 8 processors drawn as a 3-dimensional hypercube, with vertices labeled 000, 001, 010, 011, 100, 101, 110 and 111.]

  9. Processor layout • Each processor P_r has 3 registers: A_r, B_r and C_r. [Figure: processor P0 drawn with its three registers A, B and C.] • The following is a step-by-step description of the algorithm.

  10. Step 1: The elements of A and B are distributed to the n^3 processors so that the processor in position (i,j,k) will contain a_ji and b_ik. • (1.1): Copies of the data initially in A(0,j,k) and B(0,j,k) are sent to the processors in positions (i,j,k), where 1 <= i <= n-1, resulting in A(i,j,k) = a_jk and B(i,j,k) = b_jk for 0 <= i <= n-1. • (1.2): Copies of the data in A(i,j,i) are sent to the processors in positions (i,j,k), where 0 <= k <= n-1, resulting in A(i,j,k) = a_ji for 0 <= k <= n-1. • (1.3): Copies of the data in B(i,i,k) are sent to the processors in positions (i,j,k), where 0 <= j <= n-1, resulting in B(i,j,k) = b_ik for 0 <= j <= n-1.

  11. Step 2: Each processor in position (i,j,k) computes the product C(i,j,k) = A(i,j,k) * B(i,j,k), so C(i,j,k) = a_ji * b_ik for 0 <= i,j,k <= n-1. • Step 3: The sum C(0,j,k) = Σ C(i,j,k) over 0 <= i <= n-1 is computed for 0 <= j,k <= n-1.

  12. The algorithm:
  Step 1:
  (1.1) for m = 3q-1 downto 2q do
          for all r ∈ N(r_m = 0) do in parallel
            (i)  A_r(m) = A_r
            (ii) B_r(m) = B_r
          end for
        end for
  (1.2) for m = q-1 downto 0 do
          for all r ∈ N(r_m = r_(2q+m)) do in parallel
            A_r(m) = A_r
          end for
        end for
  (Here r(m) denotes r with bit m complemented, i.e. the neighbor of P_r along dimension m.)

  13. (1.3) for m = 2q-1 downto q do
          for all r ∈ N(r_m = r_(q+m)) do in parallel
            B_r(m) = B_r
          end for
        end for
  Step 2:
        for r = 0 to N-1 do in parallel
          C_r = A_r * B_r
        end for

  14. Step 3:
        for m = 2q to 3q-1 do
          for all r ∈ N(r_m = 0) do in parallel
            C_r = C_r + C_r(m)
          end for
        end for
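
  Because the bit manipulations are easy to get wrong, here is a sequential C simulation of steps 1-3 (a sketch of my own, not code from the book; array cell r stands in for processor P_r, and r ^ (1 << m) computes the neighbor r(m); the macro and helper names are mine). It reproduces the 2 x 2 example on the next slides.

      #include <stdio.h>

      #define Q 1
      #define N_DIM  (1 << Q)        /* n = 2^q */
      #define N_PROC (1 << (3 * Q))  /* N = 2^(3q) */

      static int bit(int r, int m) { return (r >> m) & 1; }

      int main(void) {
          int a[2][2] = {{1, 2}, {3, 4}};
          int b[2][2] = {{-1, -2}, {-3, -4}};
          int A[N_PROC] = {0}, B[N_PROC] = {0}, C[N_PROC] = {0};
          int q = Q, n = N_DIM, m, r, j, k;

          /* Initially layer i = 0 holds A(0,j,k) = a[j][k], B(0,j,k) = b[j][k]. */
          for (j = 0; j < n; j++)
              for (k = 0; k < n; k++) {
                  A[j * n + k] = a[j][k];
                  B[j * n + k] = b[j][k];
              }
          /* Step 1.1: replicate layer 0 across the i dimension. */
          for (m = 3 * q - 1; m >= 2 * q; m--)
              for (r = 0; r < N_PROC; r++)
                  if (bit(r, m) == 0) {
                      A[r ^ (1 << m)] = A[r];
                      B[r ^ (1 << m)] = B[r];
                  }
          /* Step 1.2: spread A along the k dimension, giving A(i,j,k) = a[j][i]. */
          for (m = q - 1; m >= 0; m--)
              for (r = 0; r < N_PROC; r++)
                  if (bit(r, m) == bit(r, 2 * q + m))
                      A[r ^ (1 << m)] = A[r];
          /* Step 1.3: spread B along the j dimension, giving B(i,j,k) = b[i][k]. */
          for (m = 2 * q - 1; m >= q; m--)
              for (r = 0; r < N_PROC; r++)
                  if (bit(r, m) == bit(r, q + m))
                      B[r ^ (1 << m)] = B[r];
          /* Step 2: each processor multiplies its pair. */
          for (r = 0; r < N_PROC; r++)
              C[r] = A[r] * B[r];
          /* Step 3: sum across the i dimension into layer 0. */
          for (m = 2 * q; m <= 3 * q - 1; m++)
              for (r = 0; r < N_PROC; r++)
                  if (bit(r, m) == 0)
                      C[r] = C[r] + C[r ^ (1 << m)];
          /* C(0,j,k) now holds c[j][k]; expect -7 -10 / -15 -22. */
          for (j = 0; j < n; j++) {
              for (k = 0; k < n; k++)
                  printf("%4d", C[j * n + k]);
              printf("\n");
          }
          return 0;
      }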

  15. An example using a 2 x 2 matrix. • This example requires n^3 = 8 processors. The matrices are:
  A = | 1  2 |
      | 3  4 |
  B = | -1  -2 |
      | -3  -4 |

  16. [Figure: contents of the (A, B) register pairs of the eight processors as step 1 proceeds. When step 1 completes, processor P(i,j,k) holds (a_ji, b_ik): P0 = (1, -1), P1 = (1, -2), P2 = (3, -1), P3 = (3, -2), P4 = (2, -3), P5 = (2, -4), P6 = (4, -3), P7 = (4, -4).]

  17. [Figure: step 2 forms the products C(i,j,k) = a_ji * b_ik in each processor (P0 = -1, P1 = -2, P2 = -3, P3 = -6, P4 = -6, P5 = -8, P6 = -12, P7 = -16); step 3 sums them pairwise across i, leaving the result in layer 0:
  C = |  -7  -10 |
      | -15  -22 | ]

  18. Analysis of algorithm • If the layout of the processors is viewed as an n x n x n array, it consists of n layers, each an n x n array of processors. • Initially, layer 0 holds a distinct value of matrix A in each A register and a distinct value of matrix B in each B register; this is a constant-time operation. • Step 1.1: Copies are sent first to n/2 of the layers, then to n/4 of each half, etc., so copying the data from layer 0 to layers 1 through n-1 takes O(log n) time.

  19. Steps 1.2 and 1.3: Each processor in column i of layer i sends data across its row, and similarly each processor in row i sends data down its column. Each step takes q constant-time iterations, i.e. O(log n) time. • Step 2 requires constant time. • Step 3 requires q constant-time iterations. • Overall, the algorithm requires O(log n) time. • But the cost is O(n^3 log n) – not optimal.

  20. A faster algorithm – Quinn
  for all P_m, where 1 <= m <= p, do in parallel
    for i = m to n step p do
      for j = 1 to n do
        t = 0
        for k = 1 to n do
          t = t + a[i][k] * b[k][j]
        end for
        c[i][j] = t
      end for
    end for
  end for
  Time: O(n^3/p + p). Maximum number of processors: n^2.

  21. An actual implementation • Get the processor id. The if statement below ensures that the entire matrix is covered even when n is not an exact multiple of p:
  chunksize = (int) (n / p);
  if ((chunksize * p) < n) {
      /* n does not divide evenly among the p processors:
         let processor 0 absorb the remaining rows */
      int differ = n - (chunksize * p);
      if (id == 0) {
          lower = 0;
          upper = chunksize + differ;
      } else {
          lower = id * chunksize + differ;
          upper = (id + 1) * chunksize + differ;
      }
  }

  22. else {
      lower = id * chunksize;
      upper = (id + 1) * chunksize;
  }
  /* each processor multiplies its own strip of rows */
  for (i = lower; i < upper; i++) {
      for (j = 0; j < n; j++) {
          total = 0;
          for (k = 0; k < n; k++) {
              total = total + mat1[i][k] * mat2[k][j];
          }
          mat3[i][j] = total;
      }
  }

  23. Another faster algorithm – Gupta & Sadayappan • The 3-D diagonal algorithm is a 3-phase algorithm. • The concept: a hypercube of p processors is viewed as a 3-D mesh of size p^(1/3) x p^(1/3) x p^(1/3). • Matrices A and B are partitioned into p^(2/3) blocks, with p^(1/3) blocks along each dimension. • Initially, it is assumed that A and B are mapped onto the 2-D plane x = y, and the 2-D plane y = j is responsible for calculating the outer product of A_*,j (the set of columns of A stored at processors p_j,j,*) and B_j,* (the set of rows of B).

  24. Phase 1: Point-to-point communication of B_k,i by p_i,i,k to p_i,k,k. • Phase 2: One-to-all broadcasts of the blocks of A along the x-direction and of the newly acquired (phase 1) blocks of B along the z-direction, i.e. processor p_i,i,k broadcasts A_k,i to p_*,i,k, and each processor of the form p_i,k,k broadcasts B_k,i to p_i,k,*. • At the end of phase 2, every processor p_i,j,k has the blocks A_k,j and B_j,i. • Each processor now calculates the product of its pair of blocks of A and B.

  25. Phase 3: After the computation, a reduction by addition in the y-direction produces the final matrix C.

  26. Algorithm analysis • Phase 1: Passing messages of size n^2/p^(2/3) requires time log(p^(1/3)) (t_s + t_w n^2/p^(2/3)), where t_s is the start-up time for sending a message and t_w is the time it takes to send one word from a processor to its neighbor. • Phase 2 takes twice as much time as phase 1. • Phase 3 can be completed in the same amount of time as phase 1. • Overall, the algorithm takes ((4/3) log p, (n^2/p^(2/3)) (4/3) log p), where a pair (a, b) stands for a communication time of t_s a + t_w b.

  27. Some added conditions: 1. p <= n^3. 2. The overall space used is 2n^2 p^(1/3). • The above description is for a one-port hypercube architecture, whereby a processor can use at most one communication link at a time to send and receive data. • On a multi-port architecture, whereby a processor can use all of its communication ports simultaneously, the algorithm is faster, reducing the above time by a factor of log(p^(1/3)).

  28. The algorithm • Initial distribution: processor p_i,i,k contains A_k,i and B_k,i. • Program of processor p_i,j,k:
  if (i = j) then
    Send B_k,i to p_i,k,k                              { phase 1 }
    Broadcast A_k,i to all processors p_*,i,k          { phase 2, x-direction }
  endif
  if (j = k) then
    Broadcast the received B_j,i to all processors p_i,j,*   { phase 2, z-direction }
  endif
  Receive A_k,j from p_j,j,k and B_j,i from p_i,j,j
  Calculate I_k,i = A_k,j x B_j,i
  Send I_k,i to p_i,i,k                                { phase 3 }

  29. if (i = j) then
    for m = 0 to p^(1/3) - 1 do
      Receive I_k,i from p_i,m,k
      C_k,i = C_k,i + I_k,i
    endfor
  endif
  • I is an intermediate matrix holding the partial products.
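
  The three phases can be simulated sequentially in a few lines of C (my own sketch, not the paper's code: p = 8, so p^(1/3) = 2 and each "block" degenerates to a single element; array cell [x][y][z] models processor p_x,y,z, and the names Ah, Bh, Ch are mine). Using the same matrices as the earlier example, it reproduces C = AB.

      #include <stdio.h>

      #define P3 2   /* p^(1/3) */

      int main(void) {
          int a[P3][P3] = {{1, 2}, {3, 4}};
          int b[P3][P3] = {{-1, -2}, {-3, -4}};
          int Ah[P3][P3][P3] = {0}, Bh[P3][P3][P3] = {0}, Ch[P3][P3] = {0};
          int i, j, k, x;

          /* Phase 1: p_i,i,k (which initially holds B_k,i) sends it to p_i,k,k. */
          for (i = 0; i < P3; i++)
              for (k = 0; k < P3; k++)
                  Bh[i][k][k] = b[k][i];
          /* Phase 2a: p_i,i,k broadcasts A_k,i along x, to all p_*,i,k. */
          for (i = 0; i < P3; i++)
              for (k = 0; k < P3; k++)
                  for (x = 0; x < P3; x++)
                      Ah[x][i][k] = a[k][i];
          /* Phase 2b: p_i,j,j broadcasts its received B_j,i along z, to all p_i,j,*. */
          for (i = 0; i < P3; i++)
              for (j = 0; j < P3; j++)
                  for (x = 0; x < P3; x++)
                      Bh[i][j][x] = Bh[i][j][j];
          /* Local products plus phase 3: p_i,j,k computes A_k,j * B_j,i, and the
             reduction over j (the y-direction) accumulates C_k,i. */
          for (i = 0; i < P3; i++)
              for (k = 0; k < P3; k++)
                  for (j = 0; j < P3; j++)
                      Ch[k][i] += Ah[i][j][k] * Bh[i][j][k];

          /* Expect the same result as before: -7 -10 / -15 -22. */
          for (k = 0; k < P3; k++)
              printf("%4d%4d\n", Ch[k][0], Ch[k][1]);
          return 0;
      }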

  30. Matrix transposition • The same concept is used here as in matrix multiplication. • The number of processors used is N = n^2 = 2^(2q), and processor P_r occupies position (i,j), where r = in + j and 0 <= i,j <= n-1. • Initially, processor P_r holds element a_ij of matrix A, where r = in + j. • Upon termination, processor P_s holds element a_ij, where s = jn + i.

  31. If r is represented by r_(2q-1) r_(2q-2) … r_q r_(q-1) … r_1 r_0, then the binary representations of i and j are r_(2q-1) … r_q and r_(q-1) … r_0 respectively. • Likewise, s is represented by s_(2q-1) s_(2q-2) … s_q s_(q-1) … s_1 s_0, and the binary representations of j and i are s_(2q-1) … s_q and s_(q-1) … s_0 respectively. • Thus it can be seen that r_(2q-1) … r_q = s_(q-1) … s_0 and r_(q-1) … r_0 = s_(2q-1) … s_q.
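
  In code, the mapping from r to s is simply a swap of the two q-bit halves of the index. A small C sketch of my own (q = 2, matching the example below):

      #include <stdio.h>

      int main(void) {
          int q = 2, n = 1 << q;     /* n = 4, N = 16 processors */
          for (int r = 0; r < n * n; r++) {
              int i = r >> q;        /* high half: bits 2q-1 .. q */
              int j = r & (n - 1);   /* low half:  bits q-1  .. 0 */
              int s = (j << q) | i;  /* halves swapped: s = jn + i */
              printf("r=%2d (i=%d,j=%d) -> s=%2d\n", r, i, j, s);
          }
          return 0;
      }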

  32. The algorithm • First, the requirements for the algorithm: each processor P_u needs two registers, A_u and B_u. • The index of P_u is written u = u_(2q-1) u_(2q-2) … u_q u_(q-1) … u_1 u_0, matching the notation used for r.

  33. for m = 2q-1 downto q do
        for u = 0 to N-1 do in parallel
          (1) if u_m ≠ u_(m-q) then B_u(m) = A_u endif
          (2) if u_m = u_(m-q) then A_u(m-q) = B_u endif
        end for
      end for

  34. Explanation of algorithm • The algorithm achieves the transpose of A recursively. • Divide the matrix into 4 submatrices of size n/2 x n/2. • In iteration 1, when m = 2q-1, the elements of the top-right submatrix are swapped with those of the bottom-left submatrix; the other 2 submatrices are not touched. • This is then repeated recursively within each submatrix until all of the elements are swapped, as the sketch below illustrates.
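
  A sequential C sketch of the two steps (my own simulation, under the conventions above: array cell u stands in for the registers A_u and B_u of processor P_u, and u ^ (1 << m) computes the neighbor u(m); q = 2 as in the example that follows). It prints the transpose of the 4 x 4 letter matrix used below.

      #include <stdio.h>

      #define Q 2
      #define N_DIM  (1 << Q)          /* n = 4  */
      #define N_PROC (N_DIM * N_DIM)   /* N = 16 */

      static int bit(int u, int m) { return (u >> m) & 1; }

      int main(void) {
          char A[N_PROC], B[N_PROC];
          int m, u;
          for (u = 0; u < N_PROC; u++)
              A[u] = (char)('a' + u);  /* A[4i + j] holds a_ij */

          for (m = 2 * Q - 1; m >= Q; m--) {
              /* (1) if u_m != u_(m-q), send A_u to register B of neighbor u(m) */
              for (u = 0; u < N_PROC; u++)
                  if (bit(u, m) != bit(u, m - Q))
                      B[u ^ (1 << m)] = A[u];
              /* (2) if u_m == u_(m-q), send B_u to register A of neighbor u(m-q) */
              for (u = 0; u < N_PROC; u++)
                  if (bit(u, m) == bit(u, m - Q))
                      A[u ^ (1 << (m - Q))] = B[u];
          }

          /* Prints the transposed matrix: a e i m / b f j n / c g k o / d h l p */
          for (u = 0; u < N_PROC; u++)
              printf("%c%s", A[u], (u % N_DIM == N_DIM - 1) ? "\n" : " ");
          return 0;
      }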

  35. Example. • We want the transpose of the following matrix:
  A = | a b c d |
      | e f g h |
      | i j k l |
      | m n o p |

  36. We use 16 processors with the following indices:
  0000 0001 0010 0011
  0100 0101 0110 0111
  1000 1001 1010 1011
  1100 1101 1110 1111

  37. [Figure: the 16 processors drawn as a 4-dimensional hypercube, vertices labeled 0 through 15.]

  38. Processor 0 – binary 0000 – holds a_00, which is the value a. Processor 1 – binary 0001 – holds a_01, which is the value b. Processor 2 – binary 0010 – holds a_02, which is the value c. And so on. In the first iteration, m = 2q-1, where q = 2 in this example, so m = 3. Step 1: Each P_u with u_3 ≠ u_1 sends its element A_u to P_u(3), which stores the value in B_u(3), i.e. processors 2, 3, 6 & 7 send to processors 10, 11, 14 & 15 respectively, and processors 8, 9, 12 & 13 send to processors 0, 1, 4 & 5 respectively.

  39. Step 2: Each processor that received data in step 1 now sends the data from B_u to P_u(1), to be stored in A_u(1), i.e.: • Processors 0, 1, 4, 5 send to 2, 3, 6, 7 respectively • Processors 10, 11, 14, 15 send to 8, 9, 12, 13 respectively • By the end of the first iteration our matrix A looks like:
  A = | a b i j |
      | e f m n |
      | c d k l |
      | g h o p |

  40. In the second iteration, when m = q = 2: • Step 1: Each P_u (where u_2 ≠ u_0) sends A_u to P_u(2), storing it in B_u(2). This is a simultaneous transfer:
  From: processor 4 to processor 0
        processor 1 to processor 5
        processor 6 to processor 2
        processor 3 to processor 7
        processor 12 to processor 8
        processor 9 to processor 13
        processor 14 to processor 10
        processor 11 to processor 15

  41. Step 2: For u_2 = u_0, each P_u sends B_u to P_u(0), where it is stored in A_u(0), thus swapping the element in the top-right corner processor with that in the bottom-left corner for each of the 2 x 2 submatrices.
  From: processor 0 to processor 1
        processor 5 to processor 4
        processor 2 to processor 3
        processor 7 to processor 6
        processor 8 to processor 9
        processor 13 to processor 12
        processor 10 to processor 11
        processor 15 to processor 14

  42. After the second iteration, we have the following transposed matrix:
  A = | a e i m |
      | b f j n |
      | c g k o |
      | d h l p |

  43. Algorithm analysis • The algorithm takes q constant-time iterations, giving t(n) = O(log n). • But it takes n^2 processors. • Therefore the cost is O(n^2 log n), which is not optimal.

  44. Bibliography • Akl, S.G., Parallel Computation: Models and Methods, Prentice Hall, 1997. • Drake, J.B. and Luo, Q., A Scalable Parallel Strassen's Matrix Multiplication Algorithm for Distributed-Memory Computers, Proceedings of the 1995 ACM Symposium on Applied Computing, February 1995, 221-226. • Gupta, H. and Sadayappan, P., Communication Efficient Matrix Multiplication on Hypercubes, Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994, 320-329. • Quinn, M.J., Parallel Computing: Theory and Practice, McGraw-Hill, 1997.
