Basic procedures on processor networks Presenter : Kuan-Hsin Lin
Outline • Data scattering • Matrix-vector multiplication • Parallel matrix multiplication • Sorting problems
Data scattering • How data is to be mapped • Division by points or blocks • Division by rows or columns
Data scattering • Divide the matrix into p square blocks (figure: matrix A cut into square blocks, one per processor of the network)
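The block division can be written as an index map from matrix entries to processors; a minimal sketch in Python (the name `block_owner` and the row-major numbering of the blocks are assumptions of this sketch, not from the slides):

```python
import math

def block_owner(i, j, n, p):
    """Processor owning entry (i, j) of an n x n matrix divided into
    p square blocks (assumes p is a perfect square and that sqrt(p)
    divides n)."""
    q = math.isqrt(p)          # q x q grid of blocks
    b = n // q                 # side length of one block
    return (i // b) * q + (j // b)   # row-major block numbering (assumed)

# With n = 4 and p = 4: entry (0, 0) lands on processor 0,
# entry (3, 3) on processor 3.
```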
Data scattering • Division into consecutive rows
Data scattering • Snake configuration division
Data scattering • Recursive division (figure: the matrix split recursively into quadrants assigned to processors 0–15)
Data scattering • Recursive division by proximity (figure: the blocks numbered 0–15 so that consecutively numbered blocks are neighbours)
Data scattering • Division by packets of consecutive rows (figure: rows 1–12 of matrix A split into consecutive packets over network processors 0 and 1)
Data scattering • Circular division by packets of rows (figure: packets of rows 1–12 of matrix A dealt out cyclically to network processors 0 and 1)
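Both row divisions reduce to index maps from rows to processors; a minimal sketch (0-based rows; the names `owner_block_rows` and `owner_cyclic_packets` are invented here):

```python
def owner_block_rows(i, n, p):
    """Processor owning row i when the n rows are dealt out as
    consecutive packets of n // p rows (division by packets of
    consecutive rows)."""
    return i // (n // p)

def owner_cyclic_packets(i, r, p):
    """Processor owning row i under the circular division: packets of
    r consecutive rows are dealt round-robin to the p processors."""
    return (i // r) % p

# 12 rows on 2 processors: the block scheme gives rows 0-5 to
# processor 0; with packets of 3, the circular scheme alternates
# packets between processors 0 and 1.
```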
Matrix-vector multiplication • Definition • A: an n×n matrix • x: a vector of Rn • Computing the product of the matrix by the vector, v = Ax
for i = 1 to n
  v(i) = 0
  for j = 1 to n
    v(i) = v(i) + A(i,j)*x(j)
Matrix-vector multiplication • Row-oriented allocation • Column-oriented allocation • Allocation by blocks • Pipelined allocation
Matrix-vector multiplication • Row-oriented allocation (figure: v(i) is the scalar product of row i of A with x)
Matrix-vector multiplication • Algorithm
ATA (all-to-all) on the n/p local components of vector x
for all processors q from 0 to p-1 in parallel
  for all k = 1 to n/p do
    v(q+(k-1)p+1) = scalar product of row q+(k-1)p+1 of A and vector x
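The row-oriented algorithm can be simulated sequentially; a sketch (the all-to-all is modelled simply by letting every "processor" read the full vector x, and rows are allocated cyclically, 0-based here):

```python
def matvec_row_cyclic(A, x, p):
    """Sequential simulation of the row-oriented algorithm: after the
    all-to-all (ATA) every processor holds the whole vector x, and
    processor q computes the scalar products for its cyclically
    allocated rows q, q+p, q+2p, ..."""
    n = len(A)
    v = [0.0] * n
    for q in range(p):               # "for all processors q in parallel"
        for i in range(q, n, p):     # rows owned by processor q
            v[i] += sum(A[i][j] * x[j] for j in range(n))
    return v

# matvec_row_cyclic([[1, 2], [3, 4]], [1, 1], 2) gives [3.0, 7.0]
```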
Matrix-vector multiplication • Column-oriented allocation (figure: each processor holds full columns of A and the matching components of x)
Matrix-vector multiplication • Algorithm
for all processors q = 0 to p-1 do in parallel
  for all k = 1 to n/p do
    v-temporary = multiplication of column q+(k-1)p+1 of A by component q+(k-1)p+1 of vector x
personalized ATA (all-to-all) with accumulation of the temporary vectors v
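The column-oriented variant can be simulated the same way; a sketch (columns allocated cyclically, 0-based, and the personalized ATA with accumulation modelled as a plain sum over the p temporary vectors):

```python
def matvec_col_cyclic(A, x, p):
    """Sequential simulation of the column-oriented algorithm:
    processor q multiplies each of its cyclically allocated columns
    q, q+p, ... by the matching component of x, producing a temporary
    vector; the accumulation step then sums the p temporaries."""
    n = len(A)
    temp = [[0.0] * n for _ in range(p)]   # one temporary vector per processor
    for q in range(p):
        for j in range(q, n, p):           # columns owned by processor q
            for i in range(n):
                temp[q][i] += A[i][j] * x[j]
    # personalized ATA with accumulation, modelled as a sum
    return [sum(temp[q][i] for q in range(p)) for i in range(n)]
```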
Matrix-vector multiplication • Allocation by blocks (figure: blocks of matrix A and segments of vector x distributed over the network)
Matrix-vector multiplication • Algorithm (partial)
ATA of vector x
for all processors q = 0 to p-1 do in parallel
  partial vector v = matrix-vector product of the local block by the partial vector x that has just been received
personalized ATA with (partial) accumulation of the partial vectors v
Matrix-vector multiplication • Pipelined allocation (first, second and third stages)
(figures: processors 0–3 each hold rows of A; the packets of x components (x1 x5, x2 x6, x3 x7, x4 x8) shift one processor around the ring at each stage)
Parallel matrix multiplication
for i = 1 to n
  for j = 1 to n
    c(i,j) = 0
    for k = 1 to n
      c(i,j) = c(i,j) + A(i,k)*B(k,j)
(figure: C(i,j) is the product of row i of A by column j of B)
Parallel matrix multiplication • Parallelization on a toric grid
C11 = A11*B11 + A12*B21 + A13*B31
C12 = A11*B12 + A12*B22 + A13*B32
C13 = A11*B13 + A12*B23 + A13*B33
C21 = A21*B11 + A22*B21 + A23*B31
C22 = A21*B12 + A22*B22 + A23*B32
C23 = A21*B13 + A22*B23 + A23*B33
Generalization – First stage
[1,1]: A11, B11   [1,2]: A12, B22   [1,3]: A13, B33
[2,1]: A22, B21   [2,2]: A23, B32   [2,3]: A21, B13
[3,1]: A33, B31   [3,2]: A31, B12   [3,3]: A32, B23
Generalization – Second stage
[1,1]: A12, B21   [1,2]: A13, B32   [1,3]: A11, B13
[2,1]: A23, B31   [2,2]: A21, B12   [2,3]: A22, B23
[3,1]: A31, B11   [3,2]: A32, B22   [3,3]: A33, B33
Generalization – Third stage
[1,1]: A13, B31   [1,2]: A11, B12   [1,3]: A12, B23
[2,1]: A21, B11   [2,2]: A22, B22   [2,3]: A23, B33
[3,1]: A32, B21   [3,2]: A33, B32   [3,3]: A31, B13
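The three stages above (a Cannon-style product on a toric grid) can be simulated sequentially; a sketch (it assumes n is divisible by the grid side q; `cannon_matmul` is a name chosen here, and the block shifts are done with index arithmetic rather than real messages):

```python
def cannon_matmul(A, B, q):
    """Simulate the toric-grid product of two n x n matrices on a
    q x q grid: the blocks are first skewed (row i of A shifted left
    by i, column j of B shifted up by j), then at each of the q
    stages every cell multiplies its current pair of blocks, and the
    A blocks shift left / B blocks shift up by one."""
    n = len(A)
    b = n // q
    def block(M, i, j):                       # extract b x b block (i, j)
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
    Ab = [[block(A, i, j) for j in range(q)] for i in range(q)]
    Bb = [[block(B, i, j) for j in range(q)] for i in range(q)]
    # initial skew (the "first stage" placement above)
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0.0] * n for _ in range(n)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                a, bl = Ab[i][j], Bb[i][j]
                for r in range(b):
                    for c in range(b):
                        C[i * b + r][j * b + c] += sum(
                            a[r][k] * bl[k][c] for k in range(b))
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C
```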
The link with systolic algorithms • The operations performed by each cell: • Read an operand on channel N:op1=N • Read an operand on channel W:op2=W • Execute the internal operation:R=R+op1*op2 • Transmit an operand on channel S:S=op1 • Transmit an operand on channel E:E=op2
Basic cell in the systolic network (figure: a cell with internal register R, input channels N and W, and output channels S and E)
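The five operations of the cell fit in one step function; a minimal sketch (`cell_step` is a name chosen here):

```python
def cell_step(R, N, W):
    """One clock tick of the basic systolic cell: read op1 from the
    north channel and op2 from the west channel, accumulate op1*op2
    into the register R, and forward op1 south and op2 east.
    Returns (new R, value sent on S, value sent on E)."""
    op1, op2 = N, W
    return R + op1 * op2, op1, op2

# cell_step(0, 3, 4) returns (12, 3, 4)
```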
Product of square matrices (figure: the columns of B enter the array from the north and the rows of A from the west, each stream delayed by one step with respect to the previous one)
Adaptation of the computation (figure: the same skewed input streams, with the idle slots filled by wrapping the operands of A and B around)
Fast parallel multiplication
C = A*B, with all matrices split into 2×2 blocks:
C11 = M0 + M1 + M2 - M3
C12 = M3 + M5
C21 = M2 + M4
C22 = M0 - M4 + M5 + M6
where
M0 = (A11+A22)(B11+B22)
M1 = (A12-A22)(B21+B22)
M2 = A22(B21-B11)
M3 = (A11+A12)B22
M4 = (A21+A22)B11
M5 = A11(B12-B22)
M6 = (A21-A11)(B11+B12)
Defined tasks
T1 = A11+A22   T2 = B11+B22   T3 = A12-A22   T4 = B21+B22   T5 = B21-B11
T6 = A11+A12   T7 = A21+A22   T8 = B12-B22   T9 = A21-A11   T10 = B11+B12
M0 = T1*T2   M1 = T3*T4   M2 = A22*T5   M3 = T6*B22   M4 = T7*B11   M5 = A11*T8   M6 = T9*T10
T11 = M0+M1   T12 = M2-M3   T13 = M5-M4   T14 = M0+M6
T15 = T11+T12 (= C11)   T16 = M2+M4 (= C21)   T17 = M3+M5 (= C12)   T18 = T13+T14 (= C22)
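One level of the fast scheme can be checked on 2×2 scalar matrices; a sketch (`strassen_2x2` is a name chosen here, and it takes T18 as T13 + T14, which is the combination that yields C22):

```python
def strassen_2x2(A, B):
    """One level of the fast (Strassen-style) scheme on 2 x 2
    matrices of scalars, using the seven products M0..M6."""
    (A11, A12), (A21, A22) = A
    (B11, B12), (B21, B22) = B
    M0 = (A11 + A22) * (B11 + B22)
    M1 = (A12 - A22) * (B21 + B22)
    M2 = A22 * (B21 - B11)
    M3 = (A11 + A12) * B22
    M4 = (A21 + A22) * B11
    M5 = A11 * (B12 - B22)
    M6 = (A21 - A11) * (B11 + B12)
    C11 = M0 + M1 + M2 - M3      # T15 = (M0 + M1) + (M2 - M3)
    C21 = M2 + M4                # T16
    C12 = M3 + M5                # T17
    C22 = M0 - M4 + M5 + M6      # T18 = (M5 - M4) + (M0 + M6)
    return [[C11, C12], [C21, C22]]
```

Only seven multiplications are used instead of the eight of the classical 2×2 block product, which is what makes the recursive scheme sub-cubic.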
The task graph (figure: precedence graph with the sums T1–T10 at the top, the seven products M0–M6 in the middle, and the combinations T11–T18 at the bottom)
Initial allocation of matrix blocks (figure: processors P0–P6 each receive the blocks of A and B needed by one of the seven products M0–M6)
Execution scheme
Stage 1: local computations (T1, T3, T6, T8, T9)
Stage 2: local computations (T2, T4, T5, T7, T10)
Execution scheme
Stage 3: computations (the products M0–M6) followed by local communications
Stage 4: computations (T11, T12, T13, T14) followed by local communications
Execution scheme
Stage 5: local computations (T15, T16, T17, T18)
Sorting problems • Odd-even sorting algorithm on a ring
{program for processor Pi}
for stage = 1 to n
  if even stage then compare-exchange(key[i-1], key[i])
  else compare-exchange(key[i], key[i+1])
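The algorithm can be simulated sequentially on a plain list; a sketch (one list slot per processor, and which pairs are active at odd versus even stages follows one common indexing convention; the name `odd_even_sort` is chosen here):

```python
def odd_even_sort(keys):
    """Sequential simulation of odd-even transposition sort: at
    odd-numbered stages the pairs starting at even indices
    compare-exchange, at even-numbered stages the pairs starting at
    odd indices do; after n stages the list is sorted."""
    a = list(keys)
    n = len(a)
    for stage in range(1, n + 1):
        start = 0 if stage % 2 == 1 else 1   # which pairs are active
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:              # compare-exchange
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

# odd_even_sort([1, 9, 2, 0, 8, 5, 3, 5]) returns [0, 1, 2, 3, 5, 5, 8, 9]
```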
Example of odd-even sort (figure: one key per processor P0–P7)
Example of odd-even sort
List to be sorted: 1 9 2 0 8 5 3 5
Stage 1: 1 9 0 2 5 8 3 5, then 1 0 9 2 5 3 8 5
Example of odd-even sort
Stage 2: 0 1 2 9 3 5 5 8, then 0 1 2 3 9 5 5 8
Stage 3: 0 1 2 3 5 9 5 8, then 0 1 2 3 5 5 9 8
Example of odd-even sort
Stage 4: 0 1 2 3 5 5 8 9, then 0 1 2 3 5 5 8 9 (sorted)