CSc 8530: Matrix Multiplication and Transpose
By Jaman Bhola
Outline
• Matrix multiplication – one processor
• Matrix multiplication – parallel
  • Algorithm
  • Example
  • Analysis
• Matrix transpose
  • Algorithm
  • Example
  • Analysis
Matrix multiplication – 1 processor
• Using one processor – O(n^3) time
• Algorithm:

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        t = 0;
        for (k = 0; k < n; k++) {
            t = t + a[i][k] * b[k][j];
        }
        c[i][j] = t;
    }
}
Matrix multiplication – parallel
• Using a hypercube: the algorithm given in the book assumes the multiplication of two n x n matrices where n is a power of 2.
• This facilitates a hypercube network.
• Need N = n^3 = 2^3q processors, where n = 2^q is the size of the matrix.
• N processors allow each processor to occupy a vertex of the hypercube.
• Each processor Pr has a given position, where r = in^2 + jn + k for 0 <= i, j, k <= n-1.
• If r is represented in binary as r_3q-1 r_3q-2 … r_0, then the binary representations of i, j and k are r_3q-1 … r_2q, r_2q-1 … r_q and r_q-1 … r_0 respectively.
• This positions the processors so that two processors are neighbors exactly when their indices differ in one binary digit.
• Also, all processors that agree in one or two of the coordinates i, j, k form a hypercube of lower dimension.
• Example: building a hypercube for q = 1, N = n^3 = 2^3q = 8 processors.
• And for Pr, where r = in^2 + jn + k, we get:
     r = in^2 + jn + k     i  j  k
P0   r = 0 + 0 + 0 = 0     0  0  0
P1   r = 0 + 0 + 1 = 1     0  0  1
P2   r = 0 + 2 + 0 = 2     0  1  0
P3   r = 0 + 2 + 1 = 3     0  1  1
P4   r = 4 + 0 + 0 = 4     1  0  0
P5   r = 4 + 0 + 1 = 5     1  0  1
P6   r = 4 + 2 + 0 = 6     1  1  0
P7   r = 4 + 2 + 1 = 7     1  1  1
[Figure: the 8 processors drawn as a 3-cube, vertices labeled 000 through 111]
Processor Layout
• Each processor Pr has three registers: Ar, Br and Cr.
• The following is a step-by-step description of the algorithm.
Step 1: The elements of A and B are distributed to the n^3 processors so that the processor in position (i,j,k) will contain aji and bik.
• (1.1) Copies of the data initially in A(0,j,k) and B(0,j,k) are sent to the processors in positions (i,j,k), where 1 <= i <= n-1. This results in A(i,j,k) = ajk and B(i,j,k) = bjk for 0 <= i <= n-1.
• (1.2) Copies of the data in A(i,j,i) are sent to the processors in positions (i,j,k), where 0 <= k <= n-1. This results in A(i,j,k) = aji for 0 <= k <= n-1.
• (1.3) Copies of the data in B(i,i,k) are sent to the processors in positions (i,j,k), where 0 <= j <= n-1. This results in B(i,j,k) = bik for 0 <= j <= n-1.
• Step 2: Each processor in position (i,j,k) computes the product C(i,j,k) = A(i,j,k) * B(i,j,k). Thus C(i,j,k) = aji * bik for 0 <= i, j, k <= n-1.
• Step 3: The sum C(0,j,k) = ∑ C(i,j,k) over 0 <= i <= n-1 is computed for 0 <= j, k <= n-1, leaving cjk in C(0,j,k).
The algorithm:
Step 1:
(1.1) for m = 3q-1 downto 2q do
          for all r ∈ N(rm = 0) do in parallel
              (i) Ar(m) = Ar
              (ii) Br(m) = Br
          end for
      end for
(1.2) for m = q-1 downto 0 do
          for all r ∈ N(rm = r2q+m) do in parallel
              Ar(m) = Ar
          end for
      end for
(1.3) for m = 2q-1 downto q do
          for all r ∈ N(rm = rq+m) do in parallel
              Br(m) = Br
          end for
      end for
Step 2:
for r = 0 to N-1 do in parallel
    Cr = Ar * Br
end for
Step 3:
for m = 2q to 3q-1 do
    for all r ∈ N(rm = 0) do in parallel
        Cr = Cr + Cr(m)
    end for
end for
An example using a 2 x 2 matrix.
• This example requires n^3 = 8 processors. The matrices are:

    A = | 1  2 |        B = | -1  -2 |
        | 3  4 |            | -3  -4 |
Register contents (Ar, Br) during Step 1:
• After (1.1), every layer i holds a copy of the initial data, A(i,j,k) = ajk and B(i,j,k) = bjk:
  P0..P3: (1, -1) (2, -2) (3, -3) (4, -4)    P4..P7: (1, -1) (2, -2) (3, -3) (4, -4)
• After (1.2) and (1.3), processor (i,j,k) holds aji and bik:
  P0: (1, -1)  P1: (1, -2)  P2: (3, -1)  P3: (3, -2)
  P4: (2, -3)  P5: (2, -4)  P6: (4, -3)  P7: (4, -4)
After Step 2, the local products Cr = Ar * Br are:
  C0 = -1   C1 = -2   C2 = -3   C3 = -6
  C4 = -6   C5 = -8   C6 = -12  C7 = -16
After Step 3, layer i = 0 holds the result: C(0,0,0) = -7, C(0,0,1) = -10, C(0,1,0) = -15, C(0,1,1) = -22, i.e.

    C = |  -7  -10 |
        | -15  -22 |
Analysis of algorithm
• If the layout of the processors is viewed as an n x n x n array, it consists of n layers, each of which is an n x n array of processors.
• Initially, the first layer (i = 0) holds a distinct value from matrix A in each A register and a distinct value from matrix B in each B register. This initial placement is a constant-time operation.
• Step 1.1: Each of the q iterations doubles the number of layers holding copies of the data, so O(log n) time suffices to copy the data from layer 0 to layers 1 through n-1.
• Steps 1.2 and 1.3: Each processor from column i in layer i sends data to the processors in its row, and similarly each processor from row i sends data to the processors in its column. Each step consists of q constant-time iterations.
• Step 2 requires constant time.
• Step 3 consists of q constant-time iterations.
• Overall, the algorithm requires O(log n) time.
• But its cost is O(n^3 log n) – not optimal.
A faster algorithm – Quinn

for all Pm, where 1 <= m <= p
    for i = m to n step p do
        for j = 1 to n do
            t = 0
            for k = 1 to n do
                t = t + a[i][k] * b[k][j]
            c[i][j] = t

Time: O(n^3/p + p); maximum number of processors: n^2.
An actual implementation
• Get the processor id.
• The following if statement makes sure that all n rows of the matrix are covered even when p does not divide n evenly (processor 0 absorbs the leftover rows):

chunksize = (int) (n / p);
if ((chunksize * p) < n) {
    int differ = n - (chunksize * p);
    if (id == 0) {
        lower = 0;
        upper = chunksize + differ;
    } else {
        lower = id * chunksize + differ;
        upper = (id + 1) * chunksize + differ;
    }
}
else {
    lower = id * chunksize;
    upper = (id + 1) * chunksize;
}
for (i = lower; i < upper; i++) {
    for (j = 0; j < n; j++) {
        total = 0;
        for (k = 0; k < n; k++) {
            total = total + mat1[i][k] * mat2[k][j];
        }
        mat3[i][j] = total;
    }
}
Another faster algorithm – Gupta & Sadayappan
• The 3-D diagonal algorithm is a 3-phase algorithm.
• The concept: a hypercube of p processors is viewed as a 3-D mesh of size p^1/3 x p^1/3 x p^1/3.
• Matrices A and B are partitioned into p^2/3 blocks, with p^1/3 blocks along each dimension.
• Initially, A and B are mapped onto the 2-D plane x = y, and the 2-D plane y = j is responsible for calculating the outer product of A*,j (the set of columns of A stored at processors pj,j,*) and Bj,* (the corresponding set of rows of B).
• Phase 1: Point-to-point communication of Bk,i by pi,i,k to pi,k,k.
• Phase 2: One-to-all broadcasts of the blocks of A along the x-direction and of the newly acquired (Phase 1) blocks of B along the z-direction, i.e. processor pi,i,k broadcasts Ak,i to p*,i,k, and processor pi,k,k broadcasts Bk,i to pi,k,*.
• At the end of Phase 2, every processor pi,j,k has blocks Ak,j and Bj,i.
• Each processor now calculates the product of its pair of blocks of A and B.
• Phase 3: After the products are computed, a reduction by addition along the y-direction yields the final matrix C.
Algorithm analysis
• Phase 1: Passing messages of size n^2/p^2/3 requires log(p^1/3) (ts + tw n^2/p^2/3) time, where ts is the start-up time for sending a message and tw is the time to send one word from a processor to its neighbor.
• Phase 2 takes twice as much time as Phase 1.
• Phase 3 can be completed in the same amount of time as Phase 1.
• Overall, the communication time is (4/3 log p, (n^2/p^2/3)(4/3 log p)), where a pair (a, b) denotes a communication cost of ts·a + tw·b.
Some added conditions are:
1. p <= n^3
2. The overall space used is 2n^2 p^1/3.
• The above description is for a one-port hypercube architecture, whereby a processor can use at most one communication link at a time to send and receive data.
• On a multi-port architecture, whereby a processor can use all of its communication ports simultaneously, the algorithm is faster, reducing the above time by a factor of log(p^1/3).
The algorithm
• Initial distribution: processor pi,i,k contains Ak,i and Bk,i.
• Program of processor pi,j,k:

if (i = j) then
    Send Bk,i to pi,k,k
    Broadcast Ak,i to all processors p*,i,k
endif
if (j = k) then
    Broadcast Bj,i to all processors pi,j,*
endif
Receive Ak,j from pj,j,k and Bj,i from pi,j,j
Calculate Ik,i = Ak,j x Bj,i
Send Ik,i to pi,i,k
if (i = j) then
    for l = 0 to p^1/3 - 1
        Receive Ik,i from pi,l,k
        Ck,i = Ck,i + Ik,i
    endfor
endif
• I is an intermediate matrix of partial products.
Matrix Transposition
• The same concept is used here as in matrix multiplication.
• The number of processors used is N = n^2 = 2^2q, and processor Pr occupies position (i,j), where r = in + j and 0 <= i, j <= n-1.
• Initially, processor Pr holds element aij of matrix A, where r = in + j.
• Upon termination, processor Ps holds element aij, where s = jn + i.
• If r is represented in binary as r2q-1 r2q-2 … rq rq-1 … r1 r0, then the binary representations of i and j are r2q-1 … rq and rq-1 … r0 respectively.
• Similarly, s is represented by s2q-1 s2q-2 … sq sq-1 … s1 s0, where the binary representations of j and i are s2q-1 … sq and sq-1 … s0 respectively.
• Thus it can be seen that r2q-1 … rq = sq-1 … s0 and rq-1 … r0 = s2q-1 … sq.
The algorithm
• Requirements: each processor Pu has two registers, Au and Bu.
• The index of Pu is written u = u2q-1 u2q-2 … uq uq-1 … u1 u0, matching that of r.
for m = 2q-1 downto q do
    for u = 0 to N-1 do in parallel
        (1) if um ≠ um-q then
                Bu(m) = Au
            endif
        (2) if um = um-q then
                Au(m-q) = Bu
            endif
    endfor
endfor
Explanation of algorithm
• The algorithm carries out the recursive transpose of A.
• Divide the matrix into four n/2 x n/2 submatrices.
• In iteration 1, when m = 2q-1, the elements of the top-right submatrix are swapped with those of the bottom-left submatrix; the other two submatrices are not touched.
• This is repeated recursively within each submatrix until all of the elements are swapped.
Example.
• We want the transpose of the following matrix:

    A = | a b c d |
        | e f g h |
        | i j k l |
        | m n o p |
We use 16 processors with the following indices:

    0000 0001 0010 0011
    0100 0101 0110 0111
    1000 1001 1010 1011
    1100 1101 1110 1111
[Figure: the 16 processors drawn as a 4-dimensional hypercube, vertices labeled 0 through 15]
• Processor 0 – binary 0000 – holds a00, which is the value a.
• Processor 1 – binary 0001 – holds a01, which is the value b.
• Processor 2 – binary 0010 – holds a02, which is the value c. And so on.
• In the first iteration, m = 2q-1; since q = 2 in this example, m = 3.
• Step 1: Each Pu with u3 ≠ u1 sends its element Au to Pu(3), which stores the value in Bu(3), i.e. processors 2, 3, 6 & 7 send to processors 10, 11, 14 & 15 respectively, and processors 8, 9, 12 & 13 send to processors 0, 1, 4 & 5 respectively.
• Step 2: Each processor that received data in Step 1 now sends the data in Bu to Pu(1), to be stored in Au(1), i.e.
• Processors 0, 1, 4, 5 send to 2, 3, 6, 7 respectively.
• Processors 10, 11, 14, 15 send to 8, 9, 12, 13 respectively.
• By the end of the first iteration, our matrix A looks like:

    A = | a b i j |
        | e f m n |
        | c d k l |
        | g h o p |
In the second iteration, when m = q = 2:
• Step 1: Each Pu with u2 ≠ u0 sends Au to Pu(2), which stores it in Bu(2). This is a simultaneous transfer:
  processor 4 to processor 0        processor 1 to processor 5
  processor 6 to processor 2        processor 3 to processor 7
  processor 12 to processor 8       processor 9 to processor 13
  processor 14 to processor 10      processor 11 to processor 15
• Step 2: Each Pu with u2 = u0 sends Bu to Pu(0), where it is stored in Au(0). This swaps the element in the top-right processor with that in the bottom-left processor within each of the 2 x 2 submatrices:
  processor 0 to processor 1        processor 5 to processor 4
  processor 2 to processor 3        processor 7 to processor 6
  processor 8 to processor 9        processor 13 to processor 12
  processor 10 to processor 11      processor 15 to processor 14
After the second iteration, we have the transposed matrix:

    A = | a e i m |
        | b f j n |
        | c g k o |
        | d h l p |
Algorithm Analysis
• The algorithm takes q constant-time iterations, giving t(n) = O(log n).
• But it uses n^2 processors.
• Therefore the cost is O(n^2 log n), which is not optimal (a single processor can transpose a matrix in O(n^2) time).
Bibliography
• Akl, S.G., Parallel Computation: Models and Methods, Prentice Hall, 1997.
• Drake, J.B. and Luo, Q., A Scalable Parallel Strassen's Matrix Multiplication Algorithm for Distributed-Memory Computers, Proceedings of the 1995 ACM Symposium on Applied Computing, February 1995, 221-226.
• Gupta, H. and Sadayappan, P., Communication Efficient Matrix Multiplication on Hypercubes, Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, August 1994, 320-329.
• Quinn, M.J., Parallel Computing: Theory and Practice, McGraw Hill, 1997.