Basic procedures on processor networks Presenter : Kuan-Hsin Lin
Outline • Data scattering • Matrix-vector multiplication • Parallel matrix multiplication • Sorting problems
Data scattering • How data is to be mapped • Division by points or blocks • Division by rows or columns
Data scattering • Divide the matrix into p square blocks (figure: matrix A cut into square blocks, one per processor of the network)
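The block division can be written as an index map from matrix entries to processors; a minimal sketch in Python (the name `block_owner` and the row-major numbering of the blocks are assumptions of this sketch, not from the slides):

```python
import math

def block_owner(i, j, n, p):
    """Processor owning entry (i, j) of an n x n matrix divided into
    p square blocks (assumes p is a perfect square and that sqrt(p)
    divides n)."""
    q = math.isqrt(p)          # q x q grid of blocks
    b = n // q                 # side length of one block
    return (i // b) * q + (j // b)   # row-major block numbering (assumed)

# With n = 4 and p = 4: entry (0, 0) lands on processor 0,
# entry (3, 3) on processor 3.
```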
Data scattering • Division into consecutive rows
Data scattering • Snake configuration division
Data scattering • Recursive division (figure: the matrix split recursively into quadrants assigned to processors 0–15)
Data scattering • Recursive division by proximity (figure: the blocks numbered 0–15 so that consecutively numbered blocks are neighbours)
Data scattering • Division by packets of consecutive rows (figure: rows 1–12 of matrix A split into consecutive packets over network processors 0 and 1)
Data scattering • Circular division by packets of rows (figure: packets of rows 1–12 of matrix A dealt out cyclically to network processors 0 and 1)
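Both row divisions reduce to index maps from rows to processors; a minimal sketch (0-based rows; the names `owner_block_rows` and `owner_cyclic_packets` are invented here):

```python
def owner_block_rows(i, n, p):
    """Processor owning row i when the n rows are dealt out as
    consecutive packets of n // p rows (division by packets of
    consecutive rows)."""
    return i // (n // p)

def owner_cyclic_packets(i, r, p):
    """Processor owning row i under the circular division: packets of
    r consecutive rows are dealt round-robin to the p processors."""
    return (i // r) % p

# 12 rows on 2 processors: the block scheme gives rows 0-5 to
# processor 0; with packets of 3, the circular scheme alternates
# packets between processors 0 and 1.
```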
Matrix-vector multiplication • Definition • A: an n×n matrix • x: a vector of Rn • Computing the product of the matrix by the vector, v = Ax
for i = 1 to n
  v(i) = 0
  for j = 1 to n
    v(i) = v(i) + A(i,j)*x(j)
Matrix-vector multiplication • Row-oriented allocation • Column-oriented allocation • Allocation by blocks • Pipelined allocation
Matrix-vector multiplication • Row-oriented allocation (figure: v(i) is the scalar product of row i of A with x)
Matrix-vector multiplication • Algorithm
ATA (all-to-all) on the n/p local components of vector x
for all processors q from 0 to p-1 in parallel
  for all k = 1 to n/p do
    v(q+(k-1)p+1) = scalar product of row q+(k-1)p+1 of A and vector x
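The row-oriented algorithm can be simulated sequentially; a sketch (the all-to-all is modelled simply by letting every "processor" read the full vector x, and rows are allocated cyclically, 0-based here):

```python
def matvec_row_cyclic(A, x, p):
    """Sequential simulation of the row-oriented algorithm: after the
    all-to-all (ATA) every processor holds the whole vector x, and
    processor q computes the scalar products for its cyclically
    allocated rows q, q+p, q+2p, ..."""
    n = len(A)
    v = [0.0] * n
    for q in range(p):               # "for all processors q in parallel"
        for i in range(q, n, p):     # rows owned by processor q
            v[i] += sum(A[i][j] * x[j] for j in range(n))
    return v

# matvec_row_cyclic([[1, 2], [3, 4]], [1, 1], 2) gives [3.0, 7.0]
```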
Matrix-vector multiplication • Column-oriented allocation (figure: each processor holds full columns of A and the matching components of x)
Matrix-vector multiplication • Algorithm
for all processors q = 0 to p-1 do in parallel
  for all k = 1 to n/p do
    v-temporary = multiplication of column q+(k-1)p+1 of A by component q+(k-1)p+1 of vector x
personalized ATA (all-to-all) with accumulation of the temporary vectors v
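The column-oriented variant can be simulated the same way; a sketch (columns allocated cyclically, 0-based, and the personalized ATA with accumulation modelled as a plain sum over the p temporary vectors):

```python
def matvec_col_cyclic(A, x, p):
    """Sequential simulation of the column-oriented algorithm:
    processor q multiplies each of its cyclically allocated columns
    q, q+p, ... by the matching component of x, producing a temporary
    vector; the accumulation step then sums the p temporaries."""
    n = len(A)
    temp = [[0.0] * n for _ in range(p)]   # one temporary vector per processor
    for q in range(p):
        for j in range(q, n, p):           # columns owned by processor q
            for i in range(n):
                temp[q][i] += A[i][j] * x[j]
    # personalized ATA with accumulation, modelled as a sum
    return [sum(temp[q][i] for q in range(p)) for i in range(n)]
```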
Matrix-vector multiplication • Allocation by blocks (figure: blocks of matrix A and segments of vector x distributed over the network)
Matrix-vector multiplication • Algorithm (partial)
ATA of vector x
for all processors q = 0 to p-1 do in parallel
  partial vector v = matrix-vector product of the local block by the partial vector x that has just been received
personalized ATA with (partial) accumulation of the partial vectors v
Matrix-vector multiplication • Pipelined allocation (first, second and third stages)
(figures: processors 0–3 each hold rows of A; the packets of x components (x1 x5, x2 x6, x3 x7, x4 x8) shift one processor around the ring at each stage)
Parallel matrix multiplication
for i = 1 to n
  for j = 1 to n
    c(i,j) = 0
    for k = 1 to n
      c(i,j) = c(i,j) + A(i,k)*B(k,j)
(figure: C(i,j) is the product of row i of A by column j of B)
Parallel matrix multiplication • Parallelization on a toric grid
C11 = A11*B11 + A12*B21 + A13*B31
C12 = A11*B12 + A12*B22 + A13*B32
C13 = A11*B13 + A12*B23 + A13*B33
C21 = A21*B11 + A22*B21 + A23*B31
C22 = A21*B12 + A22*B22 + A23*B32
C23 = A21*B13 + A22*B23 + A23*B33
Generalization – First stage
[1,1]: A11, B11   [1,2]: A12, B22   [1,3]: A13, B33
[2,1]: A22, B21   [2,2]: A23, B32   [2,3]: A21, B13
[3,1]: A33, B31   [3,2]: A31, B12   [3,3]: A32, B23
Generalization – Second stage
[1,1]: A12, B21   [1,2]: A13, B32   [1,3]: A11, B13
[2,1]: A23, B31   [2,2]: A21, B12   [2,3]: A22, B23
[3,1]: A31, B11   [3,2]: A32, B22   [3,3]: A33, B33
Generalization – Third stage
[1,1]: A13, B31   [1,2]: A11, B12   [1,3]: A12, B23
[2,1]: A21, B11   [2,2]: A22, B22   [2,3]: A23, B33
[3,1]: A32, B21   [3,2]: A33, B32   [3,3]: A31, B13
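The three stages above (a Cannon-style product on a toric grid) can be simulated sequentially; a sketch (it assumes n is divisible by the grid side q; `cannon_matmul` is a name chosen here, and the block shifts are done with index arithmetic rather than real messages):

```python
def cannon_matmul(A, B, q):
    """Simulate the toric-grid product of two n x n matrices on a
    q x q grid: the blocks are first skewed (row i of A shifted left
    by i, column j of B shifted up by j), then at each of the q
    stages every cell multiplies its current pair of blocks, and the
    A blocks shift left / B blocks shift up by one."""
    n = len(A)
    b = n // q
    def block(M, i, j):                       # extract b x b block (i, j)
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
    Ab = [[block(A, i, j) for j in range(q)] for i in range(q)]
    Bb = [[block(B, i, j) for j in range(q)] for i in range(q)]
    # initial skew (the "first stage" placement above)
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0.0] * n for _ in range(n)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                a, bl = Ab[i][j], Bb[i][j]
                for r in range(b):
                    for c in range(b):
                        C[i * b + r][j * b + c] += sum(
                            a[r][k] * bl[k][c] for k in range(b))
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C
```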
The link with systolic algorithms • The operations performed by each cell: • Read an operand on channel N:op1=N • Read an operand on channel W:op2=W • Execute the internal operation:R=R+op1*op2 • Transmit an operand on channel S:S=op1 • Transmit an operand on channel E:E=op2
Basic cell in the systolic network (figure: a cell with internal register R, input channels N and W, and output channels S and E)
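The five operations of the cell fit in one step function; a minimal sketch (`cell_step` is a name chosen here):

```python
def cell_step(R, N, W):
    """One clock tick of the basic systolic cell: read op1 from the
    north channel and op2 from the west channel, accumulate op1*op2
    into the register R, and forward op1 south and op2 east.
    Returns (new R, value sent on S, value sent on E)."""
    op1, op2 = N, W
    return R + op1 * op2, op1, op2

# cell_step(0, 3, 4) returns (12, 3, 4)
```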
Product of square matrices (figure: the columns of B enter the array from the north and the rows of A from the west, each stream delayed by one step with respect to the previous one)
Adaptation of the computation (figure: the same skewed input streams, with the idle slots filled by wrapping the operands of A and B around)
Fast parallel multiplication
C = A*B, with all matrices split into 2×2 blocks:
C11 = M0 + M1 + M2 - M3
C12 = M3 + M5
C21 = M2 + M4
C22 = M0 - M4 + M5 + M6
where
M0 = (A11+A22)(B11+B22)
M1 = (A12-A22)(B21+B22)
M2 = A22(B21-B11)
M3 = (A11+A12)B22
M4 = (A21+A22)B11
M5 = A11(B12-B22)
M6 = (A21-A11)(B11+B12)
Defined tasks
T1 = A11+A22   T2 = B11+B22   T3 = A12-A22   T4 = B21+B22   T5 = B21-B11
T6 = A11+A12   T7 = A21+A22   T8 = B12-B22   T9 = A21-A11   T10 = B11+B12
M0 = T1*T2   M1 = T3*T4   M2 = A22*T5   M3 = T6*B22   M4 = T7*B11   M5 = A11*T8   M6 = T9*T10
T11 = M0+M1   T12 = M2-M3   T13 = M5-M4   T14 = M0+M6
T15 = T11+T12 (= C11)   T16 = M2+M4 (= C21)   T17 = M3+M5 (= C12)   T18 = T13+T14 (= C22)
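One level of the fast scheme can be checked on 2×2 scalar matrices; a sketch (`strassen_2x2` is a name chosen here, and it takes T18 as T13 + T14, which is the combination that yields C22):

```python
def strassen_2x2(A, B):
    """One level of the fast (Strassen-style) scheme on 2 x 2
    matrices of scalars, using the seven products M0..M6."""
    (A11, A12), (A21, A22) = A
    (B11, B12), (B21, B22) = B
    M0 = (A11 + A22) * (B11 + B22)
    M1 = (A12 - A22) * (B21 + B22)
    M2 = A22 * (B21 - B11)
    M3 = (A11 + A12) * B22
    M4 = (A21 + A22) * B11
    M5 = A11 * (B12 - B22)
    M6 = (A21 - A11) * (B11 + B12)
    C11 = M0 + M1 + M2 - M3      # T15 = (M0 + M1) + (M2 - M3)
    C21 = M2 + M4                # T16
    C12 = M3 + M5                # T17
    C22 = M0 - M4 + M5 + M6      # T18 = (M5 - M4) + (M0 + M6)
    return [[C11, C12], [C21, C22]]
```

Only seven multiplications are used instead of the eight of the classical 2×2 block product, which is what makes the recursive scheme sub-cubic.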
The task graph (figure: precedence graph with the sums T1–T10 at the top, the seven products M0–M6 in the middle, and the combinations T11–T18 at the bottom)
Initial allocation of matrix blocks (figure: processors P0–P6 each receive the blocks of A and B needed by one of the seven products M0–M6)
Execution scheme
Stage 1: local computations (T1, T3, T6, T8, T9)
Stage 2: local computations (T2, T4, T5, T7, T10)
Execution scheme
Stage 3: computations (the products M0–M6) followed by local communications
Stage 4: computations (T11, T12, T13, T14) followed by local communications
Execution scheme
Stage 5: local computations (T15, T16, T17, T18)
Sorting problems • Odd-even sorting algorithm on a ring
{program for processor Pi}
for stage = 1 to n
  if even stage then compare-exchange(key[i-1], key[i])
  else compare-exchange(key[i], key[i+1])
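The algorithm can be simulated sequentially on a plain list; a sketch (one list slot per processor, and which pairs are active at odd versus even stages follows one common indexing convention; the name `odd_even_sort` is chosen here):

```python
def odd_even_sort(keys):
    """Sequential simulation of odd-even transposition sort: at
    odd-numbered stages the pairs starting at even indices
    compare-exchange, at even-numbered stages the pairs starting at
    odd indices do; after n stages the list is sorted."""
    a = list(keys)
    n = len(a)
    for stage in range(1, n + 1):
        start = 0 if stage % 2 == 1 else 1   # which pairs are active
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:              # compare-exchange
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

# odd_even_sort([1, 9, 2, 0, 8, 5, 3, 5]) returns [0, 1, 2, 3, 5, 5, 8, 9]
```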
Example of odd-even sort (figure: one key per processor P0–P7)
Example of odd-even sort
List to be sorted: 1 9 2 0 8 5 3 5
Stage 1: 1 9 0 2 5 8 3 5, then 1 0 9 2 5 3 8 5
Example of odd-even sort
Stage 2: 0 1 2 9 3 5 5 8, then 0 1 2 3 9 5 5 8
Stage 3: 0 1 2 3 5 9 5 8, then 0 1 2 3 5 5 9 8
Example of odd-even sort
Stage 4: 0 1 2 3 5 5 8 9, then 0 1 2 3 5 5 8 9 (sorted)