

  1. Design of parallel algorithms Matrix operations J. Porras

  2. Matrix x vector • Sequential approach MAT_VECT(A,x,y) for(i=0;i<n;i++) { y[i] = 0; for(j=0;j<n;j++) { y[i] = y[i] + A[i,j] * x[j] } } • Work = n²
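A compilable C version of the same loop nest, assuming row-major storage of A in a flat array (the function name and layout are illustrative, not from the slides):

/* Sequential matrix-vector product y = A*x.
 * A is an n x n matrix stored row-major in a flat array,
 * so A[i,j] on the slide corresponds to A[i*n + j] here. */
void mat_vect(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++) {
            y[i] += A[i*n + j] * x[j];   /* n multiply-adds per row */
        }
    }
}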

  3. Parallelization of matrix operations: Matrix x vector • Three ways to implement • rowwise striping • columnwise striping • checkerboarding • DRAW each of these approaches !

  4. Rowwise striping • The N x N matrix is distributed onto n processors (one row each) • The N x 1 vector is distributed onto n processors (one element each) • All processors need the whole vector, so an all-to-all broadcast is required

  5. Rowwise striping • The all-to-all broadcast requires Θ(n) time • One row takes Θ(n) time for multiplications • Rows are calculated in parallel, thus the total time is Θ(n) and the work Θ(n²) • The algorithm is cost-optimal

  6. Block striping • Assume that p < n and the matrix is partitioned using block striping • Each processor contains n/p rows of the matrix and n/p elements of the vector • Every processor requires the whole vector, thus an all-to-all broadcast is required (message size n/p)
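A minimal sketch of this block-striped scheme, assuming MPI (the slides do not name a library), n divisible by p, and that each process already holds its n/p rows of A and n/p entries of x; the all-to-all broadcast becomes an MPI_Allgather:

#include <mpi.h>
#include <stdlib.h>

/* Block-striped matrix-vector product: each of the p processes owns
 * nloc = n/p consecutive rows of A and the matching n/p entries of x.
 * After MPI_Allgather every process holds the full vector and can
 * multiply its own rows locally. */
void striped_mat_vect(int n, const double *A_rows, const double *x_local,
                      double *y_local, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;                          /* rows (and x entries) per process */

    double *x_full = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nloc, MPI_DOUBLE,   /* all-to-all broadcast of x */
                  x_full,  nloc, MPI_DOUBLE, comm);

    for (int i = 0; i < nloc; i++) {           /* multiply local rows by full vector */
        y_local[i] = 0.0;
        for (int j = 0; j < n; j++)
            y_local[i] += A_rows[i*n + j] * x_full[j];
    }
    free(x_full);
}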

  7. Block striping in hypercube • An all-to-all broadcast in a hypercube with messages of size n/p takes ts log p + tw(n/p)(p-1) • If p is large enough this is approximately ts log p + tw n • Multiplication requires n²/p time (n/p rows to multiply with the vector)

  8. Block striping in hypercube • Parallel execution time TP = n²/p + ts log p + tw n • Cost pTP = n² + ts p log p + tw np • The algorithm is cost-optimal if p = O(n)
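The runtime expression can be evaluated directly; a small helper, assuming ts and tw are given in the same time unit as one multiply-add (illustrative only, not from the slides):

#include <math.h>

/* Parallel runtime model for block-striped matrix-vector multiply on a
 * hypercube: TP = n^2/p + ts*log2(p) + tw*n  (formula from the slide above). */
double tp_block_striped_hypercube(double n, double p, double ts, double tw)
{
    return (n * n) / p + ts * log2(p) + tw * n;
}

/* Example: n = 4096, p = 64, ts = 100, tw = 1 (arbitrary unit times):       */
/*   double t = tp_block_striped_hypercube(4096, 64, 100.0, 1.0);            */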

  9. Block striping in mesh • An all-to-all broadcast in a mesh with wraparound takes 2ts(√p - 1) + tw(n/p)(p-1) • Parallel execution requires TP = n²/p + 2ts(√p - 1) + tw n

  10. Scalability of block striping • Overhead (T0 = pTP - W): T0 = ts p log p + tw np • Isoefficiency (W = K T0) for the hypercube: W = K ts p log p and W = K tw np • Since W = n², the tw term gives n² = K tw np, i.e. n = K tw p, so W = K² tw² p²

  11. Scalability of block striping • Because p = O(n): n = Ω(p), n² = Ω(p²), W = Ω(p²) • This gives the highest asymptotic rate at which the problem size must increase with the number of processors to maintain fixed efficiency

  12. Scalability of block striping • The isoefficiency in a hypercube is Θ(p²) • A similar analysis for the mesh architecture gives the same value Θ(p²) • Thus with striped partitioning, scalability is no better on a hypercube than on a mesh

  13. Checkerboard • The N x N matrix is partitioned onto N² processors (one element per processor) • The N x 1 vector is located in the last column (or on the diagonal) • The vector is distributed to the corresponding processors • Multiplications are calculated in parallel and the results are collected with a single-node accumulation into the last processor

  14. Checkerboard • Three communication steps are required • One-to-one communication to send the vector onto the diagonal • One-to-all broadcast to distribute the elements of the vector • Single-node accumulation to sum the partial results

  15. Checkerboard • The mesh requires Θ(n) time for all the operations (SF routing) and the hypercube Θ(log n) • Multiplication happens in constant time • The parallel execution time is Θ(n) on the mesh and Θ(log n) on the hypercube • The cost is Θ(n³) for the mesh and Θ(n² log n) for the hypercube • The algorithms are not cost-optimal

  16. Checkerboard p < n² • Cost-optimality can be achieved if the granularity is increased • Consider a two-dimensional mesh of p processors in which each processor stores an (n/√p) x (n/√p) block of the matrix • Similarly each processor stores n/√p elements of the vector

  17. Checkerboard p < n² • The vector elements are sent to the diagonal processors • The vector elements are distributed to the other processors in each column • Each processor performs n²/p multiplications and n/p additions • Partial sums are collected with a single-node accumulation
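A sketch of these steps, assuming MPI, a perfect-square p with q = √p, n divisible by q, and hypothetical row/column communicators whose ranks follow the column and row indices; the initial send-to-diagonal step is assumed already done:

#include <mpi.h>
#include <stdlib.h>

/* Checkerboard (2-D block) matrix-vector product sketch for p < n^2.
 * Steps: (1) broadcast the vector segment down each column from the
 * diagonal process, (2) local block-times-vector multiply, (3) sum the
 * partial results along each row onto the diagonal process
 * (single-node accumulation). */
void checkerboard_mat_vect(int n, int q,          /* q = sqrt(p)                 */
                           const double *A_blk,   /* local (n/q) x (n/q) block   */
                           double *x_seg,         /* n/q entries, valid on the diagonal */
                           double *y_seg,         /* n/q entries of the result   */
                           int myrow, int mycol,
                           MPI_Comm row_comm, MPI_Comm col_comm)
{
    int b = n / q;                                /* block size */

    /* (1) one-to-all broadcast within each column; the root is the diagonal
     *     process of that column (rank mycol in col_comm, assuming col_comm
     *     ranks are ordered by row index). */
    MPI_Bcast(x_seg, b, MPI_DOUBLE, mycol, col_comm);

    /* (2) local multiplication of the block with the vector segment. */
    double *partial = malloc(b * sizeof(double));
    for (int i = 0; i < b; i++) {
        partial[i] = 0.0;
        for (int j = 0; j < b; j++)
            partial[i] += A_blk[i*b + j] * x_seg[j];
    }

    /* (3) single-node accumulation: sum partials along the row onto the
     *     diagonal process of that row (rank myrow in row_comm). */
    MPI_Reduce(partial, y_seg, b, MPI_DOUBLE, MPI_SUM, myrow, row_comm);
    free(partial);
}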

  18. Scalability of checkerboard p < n² • Assume that the processors are connected in a two-dimensional √p x √p mesh with cut-through routing (no wraparounds) • Sending the vector to the diagonal takes ts + tw n/√p + th √p • The one-to-all broadcast in the columns takes (ts + tw n/√p) log(√p) + th √p

  19. Scalability of checkerboard p < n² • The single-node accumulation takes (ts + tw n/√p) log(√p) + th √p • The multiplications in each processor take n²/p time • Thus TP = n²/p + ts log p + (tw n/√p) log p + 3 th √p • T0 = pTP - W gives the overhead: T0 = ts p log p + tw n √p log p + 3 th p^(3/2)

  20. Scalability of checkerboard p < n² • Isoefficiency for ts: W = K ts p log p • Isoefficiency for tw: W = n² = K tw n √p log p, so n = K tw √p log p, n² = K² tw² p log² p, and W = K² tw² p log² p • Isoefficiency for th: W = 3 K th p^(3/2)

  21. Scalability of checkerboard p < n² • If p = O(n²): n² = Ω(p), so W = Ω(p) • The tw and th terms dominate the ts term

  22. Scalability of checkerboard p < n² • Concentrate on the th term Θ(p^(3/2)) and the tw term Θ(p log² p) • Because p^(3/2) > p log² p only for p > 65536, either term could dominate • Assume that the term Θ(p log² p) dominates

  23. Scalability of checkerboard p < n² • The maximum number of processors that can be used cost-optimally for the problem size W is determined by p log² p = O(n²) • Taking logarithms: log p + 2 log log p = O(log n), hence log p = O(log n)

  24. Scalability of checkerboard p < n² • Substitute log n for log p: p log² n = O(n²), so p = O(n² / log² n) • This gives the upper limit on the number of processors that can be used cost-optimally

  25. SF and CT • Parallel execution takes n²/p + 2 ts √p + 3 tw n time on a p-processor mesh with SF routing (isoefficiency Θ(p²) due to tw) • CT routing performs much better • Note that this holds for cases with several elements per processor • HOW about the fine-grain case ?

  26. Striped and checkerboard • Comparison shows that the checkerboard approach is faster than the striped approach with the same number of processors • If p > n, the striped approach is not available • How about the effect of the architecture ? • Scalability ? • Isoefficiency ?

  27. Sequential matrix multiplication • Procedure MAT_MULT(A,B,C) for i := 0 to n-1 do for j := 0 to n-1 do C[i,j] := 0; for k := 0 to n-1 do C[i,j] := C[i,j] + A[i,k] B[k,j] • n³ work (Strassen's algorithm has better complexity)

  28. Block approach • The matrices are viewed as q x q grids of (n/q) x (n/q) submatrices • Procedure BLOCK_MAT_MULT(A,B,C) for i := 0 to q-1 do for j := 0 to q-1 do Initialize Ci,j to zero; for k := 0 to q-1 do Ci,j := Ci,j + Ai,k Bk,j • Same complexity, n³ work
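A compilable C sketch of the block scheme, assuming n divisible by q and row-major flat storage (names are illustrative):

/* Block matrix multiplication C = A*B with q x q blocks of size b = n/q.
 * Same n^3 work as the plain triple loop, but organised block by block. */
void block_mat_mult(int n, int q, const double *A, const double *B, double *C)
{
    int b = n / q;                                /* block size */
    for (int i = 0; i < n*n; i++) C[i] = 0.0;     /* initialize C to zero */

    for (int bi = 0; bi < q; bi++)
        for (int bj = 0; bj < q; bj++)
            for (int bk = 0; bk < q; bk++)
                /* C[bi,bj] += A[bi,bk] * B[bk,bj], one b x b block product */
                for (int i = 0; i < b; i++)
                    for (int j = 0; j < b; j++)
                        for (int k = 0; k < b; k++)
                            C[(bi*b + i)*n + (bj*b + j)] +=
                                A[(bi*b + i)*n + (bk*b + k)] *
                                B[(bk*b + k)*n + (bj*b + j)];
}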

  29. Simple parallel approach • Matrices A and B are partitioned into p blocks of size (n/√p) x (n/√p) • Map onto a √p x √p mesh • Processors P0,0 ... P√p-1,√p-1 • Pi,j stores Ai,j and Bi,j and computes Ci,j • Ci,j requires all Ai,k and Bk,j • Blocks of A need to be communicated within rows • Blocks of B are communicated within columns

  30. Performance on hypercube • Requires two all-to-all broadcasts (A within rows, B within columns) • Message size n²/p • tc = 2(ts log(√p) + tw(n²/p)(√p - 1)) • tm = √p (n/√p)³ = n³/p • TP = n³/p + ts log p + 2 tw n²/√p, when p » 1

  31. Performance on mesh • Store-and-forward routing • tc = 2(ts √p + tw n²/√p) • tm = √p (n/√p)³ = n³/p • TP = n³/p + 2 ts √p + 2 tw n²/√p

  32. Cannon's algorithm • Partition into blocks as usual • Processors P0,0 ... P√p-1,√p-1 • Pi,j contains Ai,j and Bi,j • Rotate the blocks !! • A blocks to the left • B blocks upwards
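A sketch of Cannon's rotation scheme, assuming MPI (not mentioned on the slides), a perfect-square p with q = √p, n divisible by q, and a q x q periodic Cartesian communicator cart whose dimension 0 is the row index; block, variable, and function names are illustrative:

#include <mpi.h>

/* Cannon's algorithm. A_blk/B_blk/C_blk are the local b x b blocks
 * (b = n/q) stored row-major; C_blk must be zeroed by the caller. */
static void local_mm(int b, const double *A, const double *B, double *C)
{
    for (int i = 0; i < b; i++)
        for (int k = 0; k < b; k++)
            for (int j = 0; j < b; j++)
                C[i*b + j] += A[i*b + k] * B[k*b + j];
}

void cannon(int b, int q, double *A_blk, double *B_blk, double *C_blk,
            MPI_Comm cart)
{
    int coords[2], rank, src, dst, left, right, up, down;
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);          /* coords[0]=row, coords[1]=col */

    /* Initial alignment: shift A left by the row index, B up by the column index. */
    MPI_Cart_shift(cart, 1, -coords[0], &src, &dst); /* dimension 1 = columns */
    MPI_Sendrecv_replace(A_blk, b*b, MPI_DOUBLE, dst, 0, src, 0, cart, MPI_STATUS_IGNORE);
    MPI_Cart_shift(cart, 0, -coords[1], &src, &dst); /* dimension 0 = rows */
    MPI_Sendrecv_replace(B_blk, b*b, MPI_DOUBLE, dst, 0, src, 0, cart, MPI_STATUS_IGNORE);

    /* q compute-and-rotate steps: multiply the local blocks, then rotate
     * A one position to the left and B one position upwards. */
    MPI_Cart_shift(cart, 1, -1, &right, &left);      /* left/right neighbours */
    MPI_Cart_shift(cart, 0, -1, &down, &up);         /* up/down neighbours */
    for (int step = 0; step < q; step++) {
        local_mm(b, A_blk, B_blk, C_blk);
        MPI_Sendrecv_replace(A_blk, b*b, MPI_DOUBLE, left, 0, right, 0, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B_blk, b*b, MPI_DOUBLE, up,   0, down,  0, cart, MPI_STATUS_IGNORE);
    }
}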

  33. Fox's algorithm • Partition into blocks as usual • Pi,j contains Ai,j and Bi,j • Uses one-to-all broadcasts • √p iterations • (1) broadcast the selected A block along the row • (2) multiply it by the local B block • (3) send B upwards • (4) select the next A block, Ai,(j+1) mod √p
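A corresponding sketch of Fox's algorithm under the same assumptions (MPI, q x q grid, b = n/q), with row_comm and col_comm as hypothetical row and column communicators whose ranks follow the column and row indices respectively:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Fox's algorithm: in each of the q steps the diagonal-offset A block is
 * broadcast along its row, multiplied into C, and B is rotated upwards.
 * C_blk must be zeroed by the caller; blocks are stored row-major. */
void fox(int b, int q, int myrow, int mycol,
         const double *A_blk, double *B_blk, double *C_blk,
         MPI_Comm row_comm, MPI_Comm col_comm)
{
    double *A_bcast = malloc((size_t)b * b * sizeof(double));
    int up   = (myrow - 1 + q) % q;      /* send B up, receive from below */
    int down = (myrow + 1) % q;

    for (int step = 0; step < q; step++) {
        /* (1) in row i, the process holding A[i, (i+step) mod q] broadcasts it */
        int root = (myrow + step) % q;
        if (mycol == root)
            memcpy(A_bcast, A_blk, (size_t)b * b * sizeof(double));
        MPI_Bcast(A_bcast, b*b, MPI_DOUBLE, root, row_comm);

        /* (2) multiply the broadcast A block by the current local B block */
        for (int i = 0; i < b; i++)
            for (int k = 0; k < b; k++)
                for (int j = 0; j < b; j++)
                    C_blk[i*b + j] += A_bcast[i*b + k] * B_blk[k*b + j];

        /* (3) rotate B one position upwards within the column */
        MPI_Sendrecv_replace(B_blk, b*b, MPI_DOUBLE, up, 0, down, 0, col_comm, MPI_STATUS_IGNORE);
    }
    free(A_bcast);
}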

  34. DNS • Dekel, Nassimi and Sahni • n³ processors available • Uses a 3D structure • Pi,j,k computes A[i,k] x B[k,j] • C[i,j] = Pi,j,0 + ... + Pi,j,n-1 (sum of the partial products) • Θ(log n) time
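A minimal sketch of the final accumulation, assuming MPI (not on the slide), one process per (i,j,k) triple, and that the partial product A[i,k]*B[k,j] is already in place; k_comm is a hypothetical communicator over the n processes that share (i,j):

#include <mpi.h>

/* DNS accumulation: process (i,j,k) holds the scalar partial product
 * A[i,k]*B[k,j] (the earlier alignment/replication steps are assumed done).
 * k_comm contains the n processes with the same (i,j), ordered by k, so
 * rank 0 in k_comm is P(i,j,0). The reduction sums the n partial products
 * onto P(i,j,0) in Theta(log n) steps. */
void dns_accumulate(double partial, double *c_ij, MPI_Comm k_comm)
{
    /* single-node accumulation along the k dimension */
    MPI_Reduce(&partial, c_ij, 1, MPI_DOUBLE, MPI_SUM, 0, k_comm);
}

/* The k_comm communicators could be built, for example, by splitting a 3-D
 * Cartesian communicator so that processes with equal (i,j) end up together:
 *   MPI_Comm_split(cart3d, i*n + j, k, &k_comm);                            */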

  35. DNS for hypercube • The 3D structure is mapped onto a hypercube with n³ = 2^(3d) processors • Processor Pi,j,0 contains A[i,j] and B[i,j] • 3 steps • (1) move A & B to the correct plane • (2) replicate on each plane • (3) single-node accumulation

  36. DNS with p < n³ processors • Use p = q³ processors, q < n • Partition the matrices into (n/q) x (n/q) blocks • Each matrix is then a q x q grid of submatrices • Since 1 <= q <= n, p ranges from 1 to n³
