

  1. Design of parallel algorithms Matrix operations J. Porras

  2. Matrix x vector • Sequential approach MAT_VECT(A,x,y) for(i=0;i<n;i++) { y[i] = 0; for(j=0;j<n;j++) { y[i] = y[i] + A[i,j] * x[j] } } • Work = n²
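A compilable C version of the same loop nest, assuming row-major storage of A in a flat array (the function name and layout are illustrative, not from the slides):

/* Sequential matrix-vector product y = A*x.
 * A is an n x n matrix stored row-major in a flat array,
 * so A[i,j] on the slide corresponds to A[i*n + j] here. */
void mat_vect(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++) {
            y[i] += A[i*n + j] * x[j];   /* n multiply-adds per row */
        }
    }
}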

  3. Parallelization of matrix operations: Matrix x vector • Three ways to implement • rowwise striping • columnwise striping • checkerboarding • DRAW each of these approaches !

  4. Rowwise striping • The N x N matrix is distributed onto n processors (one row each) • The N x 1 vector is distributed onto n processors (one element each) • All processors need the whole vector, so an all-to-all broadcast is required

  5. Rowwise striping • The all-to-all broadcast requires Θ(n) time • One row takes Θ(n) time for multiplications • Rows are calculated in parallel, thus the total time is Θ(n) and the work Θ(n²) • The algorithm is cost-optimal

  6. Block striping • Assume that p < n and the matrix is partitioned using block striping • Each processor contains n/p rows of the matrix and n/p elements of the vector • Every processor requires the whole vector, thus an all-to-all broadcast is required (message size n/p)
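A minimal sketch of this block-striped scheme, assuming MPI (the slides do not name a library), n divisible by p, and that each process already holds its n/p rows of A and n/p entries of x; the all-to-all broadcast becomes an MPI_Allgather:

#include <mpi.h>
#include <stdlib.h>

/* Block-striped matrix-vector product: each of the p processes owns
 * nloc = n/p consecutive rows of A and the matching n/p entries of x.
 * After MPI_Allgather every process holds the full vector and can
 * multiply its own rows locally. */
void striped_mat_vect(int n, const double *A_rows, const double *x_local,
                      double *y_local, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;                          /* rows (and x entries) per process */

    double *x_full = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nloc, MPI_DOUBLE,   /* all-to-all broadcast of x */
                  x_full,  nloc, MPI_DOUBLE, comm);

    for (int i = 0; i < nloc; i++) {           /* multiply local rows by full vector */
        y_local[i] = 0.0;
        for (int j = 0; j < n; j++)
            y_local[i] += A_rows[i*n + j] * x_full[j];
    }
    free(x_full);
}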

  7. Block striping in hypercube • An all-to-all broadcast in a hypercube with messages of size n/p takes ts log p + tw(n/p)(p-1) • If p is large enough this is approximately ts log p + tw n • Multiplication requires n²/p time (n/p rows to multiply with the vector)

  8. Block striping in hypercube • Parallel execution time TP = n²/p + ts log p + tw n • Cost pTP = n² + ts p log p + tw np • The algorithm is cost-optimal if p = O(n)
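The runtime expression can be evaluated directly; a small helper, assuming ts and tw are given in the same time unit as one multiply-add (illustrative only, not from the slides):

#include <math.h>

/* Parallel runtime model for block-striped matrix-vector multiply on a
 * hypercube: TP = n^2/p + ts*log2(p) + tw*n  (formula from the slide above). */
double tp_block_striped_hypercube(double n, double p, double ts, double tw)
{
    return (n * n) / p + ts * log2(p) + tw * n;
}

/* Example: n = 4096, p = 64, ts = 100, tw = 1 (arbitrary unit times):       */
/*   double t = tp_block_striped_hypercube(4096, 64, 100.0, 1.0);            */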

  9. Block striping in mesh • An all-to-all broadcast in a mesh with wraparound takes 2ts(√p - 1) + tw(n/p)(p-1) • Parallel execution requires TP = n²/p + 2ts(√p - 1) + tw n

  10. Scalability of block striping • Overhead (T0 = pTP - W): T0 = ts p log p + tw np • Isoefficiency (W = K T0) for the hypercube: W = K ts p log p and W = K tw np • Since W = n², the tw term gives n² = K tw np, i.e. n = K tw p, so W = K² tw² p²

  11. Scalability of block striping • Because p = O(n): n = Ω(p), n² = Ω(p²), W = Ω(p²) • This gives the highest asymptotic rate at which the problem size must increase with the number of processors to maintain fixed efficiency

  12. Scalability of block striping • The isoefficiency in a hypercube is Θ(p²) • A similar analysis for the mesh architecture gives the same value Θ(p²) • Thus with striped partitioning, scalability is no better on a hypercube than on a mesh

  13. Checkerboard • The N x N matrix is partitioned onto N² processors (one element per processor) • The N x 1 vector is located in the last column (or on the diagonal) • The vector is distributed to the corresponding processors • Multiplications are calculated in parallel and the results are collected with a single-node accumulation into the last processor

  14. Checkerboard • Three communication steps are required • One-to-one communication to send the vector onto the diagonal • One-to-all broadcast to distribute the elements of the vector • Single-node accumulation to sum the partial results

  15. Checkerboard • The mesh requires Θ(n) time for all the operations (SF routing) and the hypercube Θ(log n) • Multiplication happens in constant time • The parallel execution time is Θ(n) on the mesh and Θ(log n) on the hypercube • The cost is Θ(n³) for the mesh and Θ(n² log n) for the hypercube • The algorithms are not cost-optimal

  16. Checkerboard p < n² • Cost-optimality can be achieved if the granularity is increased • Consider a two-dimensional mesh of p processors in which each processor stores an (n/√p) x (n/√p) block of the matrix • Similarly each processor stores n/√p elements of the vector

  17. Checkerboard p < n² • The vector elements are sent to the diagonal processors • The vector elements are distributed to the other processors in each column • Each processor performs n²/p multiplications and n/p additions • Partial sums are collected with a single-node accumulation
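A sketch of these steps, assuming MPI, a perfect-square p with q = √p, n divisible by q, and hypothetical row/column communicators whose ranks follow the column and row indices; the initial send-to-diagonal step is assumed already done:

#include <mpi.h>
#include <stdlib.h>

/* Checkerboard (2-D block) matrix-vector product sketch for p < n^2.
 * Steps: (1) broadcast the vector segment down each column from the
 * diagonal process, (2) local block-times-vector multiply, (3) sum the
 * partial results along each row onto the diagonal process
 * (single-node accumulation). */
void checkerboard_mat_vect(int n, int q,          /* q = sqrt(p)                 */
                           const double *A_blk,   /* local (n/q) x (n/q) block   */
                           double *x_seg,         /* n/q entries, valid on the diagonal */
                           double *y_seg,         /* n/q entries of the result   */
                           int myrow, int mycol,
                           MPI_Comm row_comm, MPI_Comm col_comm)
{
    int b = n / q;                                /* block size */

    /* (1) one-to-all broadcast within each column; the root is the diagonal
     *     process of that column (rank mycol in col_comm, assuming col_comm
     *     ranks are ordered by row index). */
    MPI_Bcast(x_seg, b, MPI_DOUBLE, mycol, col_comm);

    /* (2) local multiplication of the block with the vector segment. */
    double *partial = malloc(b * sizeof(double));
    for (int i = 0; i < b; i++) {
        partial[i] = 0.0;
        for (int j = 0; j < b; j++)
            partial[i] += A_blk[i*b + j] * x_seg[j];
    }

    /* (3) single-node accumulation: sum partials along the row onto the
     *     diagonal process of that row (rank myrow in row_comm). */
    MPI_Reduce(partial, y_seg, b, MPI_DOUBLE, MPI_SUM, myrow, row_comm);
    free(partial);
}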

  18. Scalability of checkerboard p < n² • Assume that the processors are connected in a two-dimensional √p x √p mesh with cut-through routing (no wraparounds) • Sending the vector to the diagonal takes ts + tw n/√p + th √p • The one-to-all broadcast in the columns takes (ts + tw n/√p) log(√p) + th √p

  19. Scalability of checkerboard p < n² • The single-node accumulation takes (ts + tw n/√p) log(√p) + th √p • The multiplications in each processor take n²/p time • Thus TP = n²/p + ts log p + (tw n/√p) log p + 3 th √p • T0 = pTP - W gives the overhead: T0 = ts p log p + tw n √p log p + 3 th p^(3/2)

  20. Scalability of checkerboard p < n² • Isoefficiency for ts: W = K ts p log p • Isoefficiency for tw: W = n² = K tw n √p log p, so n = K tw √p log p, n² = K² tw² p log² p, and W = K² tw² p log² p • Isoefficiency for th: W = 3 K th p^(3/2)

  21. Scalability of checkerboard p < n² • If p = O(n²): n² = Ω(p), so W = Ω(p) • The tw and th terms dominate the ts term

  22. Scalability of checkerboard p < n² • Concentrate on the th term Θ(p^(3/2)) and the tw term Θ(p log² p) • Because p^(3/2) > p log² p only for p > 65536, either term could dominate • Assume that the term Θ(p log² p) dominates

  23. Scalability of checkerboard p < n² • The maximum number of processors that can be used cost-optimally for the problem size W is determined by p log² p = O(n²) • Taking logarithms: log p + 2 log log p = O(log n), hence log p = O(log n)

  24. Scalability of checkerboard p < n² • Substitute log n for log p: p log² n = O(n²), so p = O(n² / log² n) • This gives the upper limit on the number of processors that can be used cost-optimally

  25. SF and CT • Parallel execution takes n²/p + 2 ts √p + 3 tw n time on a p-processor mesh with SF routing (isoefficiency Θ(p²) due to tw) • CT routing performs much better • Note that this holds for cases with several elements per processor • HOW about the fine-grain case ?

  26. Striped and checkerboard • Comparison shows that the checkerboard approach is faster than the striped approach with the same number of processors • If p > n, the striped approach is not available • How about the effect of the architecture ? • Scalability ? • Isoefficiency ?

  27. Sequential matrix multiplication • Procedure MAT_MULT(A,B,C) for i := 0 to n-1 do for j := 0 to n-1 do C[i,j] := 0; for k := 0 to n-1 do C[i,j] := C[i,j] + A[i,k] B[k,j] • n³ work (Strassen's algorithm has better complexity)

  28. Block approach • The matrices are viewed as q x q grids of (n/q) x (n/q) submatrices • Procedure BLOCK_MAT_MULT(A,B,C) for i := 0 to q-1 do for j := 0 to q-1 do Initialize Ci,j to zero; for k := 0 to q-1 do Ci,j := Ci,j + Ai,k Bk,j • Same complexity, n³ work
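A compilable C sketch of the block scheme, assuming n divisible by q and row-major flat storage (names are illustrative):

/* Block matrix multiplication C = A*B with q x q blocks of size b = n/q.
 * Same n^3 work as the plain triple loop, but organised block by block. */
void block_mat_mult(int n, int q, const double *A, const double *B, double *C)
{
    int b = n / q;                                /* block size */
    for (int i = 0; i < n*n; i++) C[i] = 0.0;     /* initialize C to zero */

    for (int bi = 0; bi < q; bi++)
        for (int bj = 0; bj < q; bj++)
            for (int bk = 0; bk < q; bk++)
                /* C[bi,bj] += A[bi,bk] * B[bk,bj], one b x b block product */
                for (int i = 0; i < b; i++)
                    for (int j = 0; j < b; j++)
                        for (int k = 0; k < b; k++)
                            C[(bi*b + i)*n + (bj*b + j)] +=
                                A[(bi*b + i)*n + (bk*b + k)] *
                                B[(bk*b + k)*n + (bj*b + j)];
}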

  29. Simple parallel approach • Matrices A and B are partitioned into p blocks of size (n/√p) x (n/√p) • Map onto a √p x √p mesh • Processors P0,0 ... P√p-1,√p-1 • Pi,j stores Ai,j and Bi,j and computes Ci,j • Ci,j requires all Ai,k and Bk,j • Blocks of A need to be communicated within rows • Blocks of B are communicated within columns

  30. Performance on hypercube • Requires two all-to-all broadcasts (A within rows, B within columns) • Message size n²/p • tc = 2(ts log(√p) + tw(n²/p)(√p - 1)) • tm = √p (n/√p)³ = n³/p • TP = n³/p + ts log p + 2 tw n²/√p, when p » 1

  31. Performance on mesh • Store-and-forward routing • tc = 2(ts √p + tw n²/√p) • tm = √p (n/√p)³ = n³/p • TP = n³/p + 2 ts √p + 2 tw n²/√p

  32. Cannon's algorithm • Partition into blocks as usual • Processors P0,0 ... P√p-1,√p-1 • Pi,j contains Ai,j and Bi,j • Rotate the blocks !! • A blocks to the left • B blocks upwards
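A sketch of Cannon's rotation scheme, assuming MPI (not mentioned on the slides), a perfect-square p with q = √p, n divisible by q, and a q x q periodic Cartesian communicator cart whose dimension 0 is the row index; block, variable, and function names are illustrative:

#include <mpi.h>

/* Cannon's algorithm. A_blk/B_blk/C_blk are the local b x b blocks
 * (b = n/q) stored row-major; C_blk must be zeroed by the caller. */
static void local_mm(int b, const double *A, const double *B, double *C)
{
    for (int i = 0; i < b; i++)
        for (int k = 0; k < b; k++)
            for (int j = 0; j < b; j++)
                C[i*b + j] += A[i*b + k] * B[k*b + j];
}

void cannon(int b, int q, double *A_blk, double *B_blk, double *C_blk,
            MPI_Comm cart)
{
    int coords[2], rank, src, dst, left, right, up, down;
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);          /* coords[0]=row, coords[1]=col */

    /* Initial alignment: shift A left by the row index, B up by the column index. */
    MPI_Cart_shift(cart, 1, -coords[0], &src, &dst); /* dimension 1 = columns */
    MPI_Sendrecv_replace(A_blk, b*b, MPI_DOUBLE, dst, 0, src, 0, cart, MPI_STATUS_IGNORE);
    MPI_Cart_shift(cart, 0, -coords[1], &src, &dst); /* dimension 0 = rows */
    MPI_Sendrecv_replace(B_blk, b*b, MPI_DOUBLE, dst, 0, src, 0, cart, MPI_STATUS_IGNORE);

    /* q compute-and-rotate steps: multiply the local blocks, then rotate
     * A one position to the left and B one position upwards. */
    MPI_Cart_shift(cart, 1, -1, &right, &left);      /* left/right neighbours */
    MPI_Cart_shift(cart, 0, -1, &down, &up);         /* up/down neighbours */
    for (int step = 0; step < q; step++) {
        local_mm(b, A_blk, B_blk, C_blk);
        MPI_Sendrecv_replace(A_blk, b*b, MPI_DOUBLE, left, 0, right, 0, cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B_blk, b*b, MPI_DOUBLE, up,   0, down,  0, cart, MPI_STATUS_IGNORE);
    }
}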

  33. Fox's algorithm • Partition into blocks as usual • Pi,j contains Ai,j and Bi,j • Uses one-to-all broadcasts • √p iterations • (1) broadcast the selected A block along the row • (2) multiply it by the local B block • (3) send B upwards • (4) select the next A block, Ai,(j+1) mod √p
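A corresponding sketch of Fox's algorithm under the same assumptions (MPI, q x q grid, b = n/q), with row_comm and col_comm as hypothetical row and column communicators whose ranks follow the column and row indices respectively:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Fox's algorithm: in each of the q steps the diagonal-offset A block is
 * broadcast along its row, multiplied into C, and B is rotated upwards.
 * C_blk must be zeroed by the caller; blocks are stored row-major. */
void fox(int b, int q, int myrow, int mycol,
         const double *A_blk, double *B_blk, double *C_blk,
         MPI_Comm row_comm, MPI_Comm col_comm)
{
    double *A_bcast = malloc((size_t)b * b * sizeof(double));
    int up   = (myrow - 1 + q) % q;      /* send B up, receive from below */
    int down = (myrow + 1) % q;

    for (int step = 0; step < q; step++) {
        /* (1) in row i, the process holding A[i, (i+step) mod q] broadcasts it */
        int root = (myrow + step) % q;
        if (mycol == root)
            memcpy(A_bcast, A_blk, (size_t)b * b * sizeof(double));
        MPI_Bcast(A_bcast, b*b, MPI_DOUBLE, root, row_comm);

        /* (2) multiply the broadcast A block by the current local B block */
        for (int i = 0; i < b; i++)
            for (int k = 0; k < b; k++)
                for (int j = 0; j < b; j++)
                    C_blk[i*b + j] += A_bcast[i*b + k] * B_blk[k*b + j];

        /* (3) rotate B one position upwards within the column */
        MPI_Sendrecv_replace(B_blk, b*b, MPI_DOUBLE, up, 0, down, 0, col_comm, MPI_STATUS_IGNORE);
    }
    free(A_bcast);
}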

  34. DNS • Dekel, Nassimi and Sahni • n³ processors available • Uses a 3D structure • Pi,j,k computes A[i,k] x B[k,j] • C[i,j] = Pi,j,0 + ... + Pi,j,n-1 (sum of the partial products) • Θ(log n) time
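A minimal sketch of the final accumulation, assuming MPI (not on the slide), one process per (i,j,k) triple, and that the partial product A[i,k]*B[k,j] is already in place; k_comm is a hypothetical communicator over the n processes that share (i,j):

#include <mpi.h>

/* DNS accumulation: process (i,j,k) holds the scalar partial product
 * A[i,k]*B[k,j] (the earlier alignment/replication steps are assumed done).
 * k_comm contains the n processes with the same (i,j), ordered by k, so
 * rank 0 in k_comm is P(i,j,0). The reduction sums the n partial products
 * onto P(i,j,0) in Theta(log n) steps. */
void dns_accumulate(double partial, double *c_ij, MPI_Comm k_comm)
{
    /* single-node accumulation along the k dimension */
    MPI_Reduce(&partial, c_ij, 1, MPI_DOUBLE, MPI_SUM, 0, k_comm);
}

/* The k_comm communicators could be built, for example, by splitting a 3-D
 * Cartesian communicator so that processes with equal (i,j) end up together:
 *   MPI_Comm_split(cart3d, i*n + j, k, &k_comm);                            */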

  35. DNS for hypercube • The 3D structure is mapped onto a hypercube with n³ = 2^(3d) processors • Processor Pi,j,0 contains A[i,j] and B[i,j] • 3 steps • (1) move A & B to the correct plane • (2) replicate on each plane • (3) single-node accumulation

  36. DNS with p < n³ processors • Use p = q³ processors, q < n • Partition the matrices into (n/q) x (n/q) blocks • Each matrix is then a q x q grid of submatrices • Since 1 <= q <= n, p ranges from 1 to n³
