
Parallel and Distributed Processing CSE 8380

Parallel and Distributed Processing CSE 8380. February 8, 2005, Session 8. Contents: computing the sum on EREW PRAM; computing all partial sums on EREW PRAM; matrix multiplication on CREW; other algorithms.

arich




Presentation Transcript


  1. Parallel and Distributed Processing CSE 8380 February 8, 2005 Session 8

  2. Contents • Computing sum on EREW PRAM • Computing all partial sums on EREW PRAM • Matrix Multiplication on CREW • Other Algorithms

  3. Recall (PRAM Model) • Synchronized Read Compute Write Cycle • EREW • ERCW • CREW • CRCW • Complexity: T(n), P(n), C(n) [Figure labels: Control; processors P1, P2, …, Pp, each with its own Private Memory; Global Memory]

  4. Sum on EREW PRAM • Compute the sum of an array A[1..n] • We use n/2 processors • Summation will end up in location A[n] • For simplicity, we assume n is an integral power of 2 • Work is done in log n iterations. In the first iteration, all processors are active. In the second iteration, only half the processors will be active, and so on.

  5. Example of algorithm Sum_EREW when n=8 (sum of an array of numbers on the EREW model)
     Step          Active processors   A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]
     Initial       -                      5     2    10     1     8    12     7     3
     Iteration 1   P1, P2, P3, P4         5     7    10    11     8    20     7    10
     Iteration 2   P2, P4                 5     7    10    18     8    20     7    30
     Iteration 3   P4                     5     7    10    18     8    20     7    48

  6. Group Work 1- Discuss the algorithm with your neighbor 2- Design the main loops 3- Discuss the Complexity

  7. Algorithm sum_EREW
     for i = 1 to log n do
       forall Pj, where 1 ≤ j ≤ n/2 do in parallel
         if (2j mod 2^i) = 0 then
           A[2j] ← A[2j] + A[2j - 2^(i-1)]
         endif
       endfor
     endfor
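The loop structure of sum_EREW can be checked with a short sequential simulation. The Python sketch below is illustrative only: the name `sum_erew` and the returned `trace` are assumptions that do not appear in the slides, and it takes the update to be A[2j] ← A[2j] + A[2j - 2^(i-1)], which is the form that reproduces the n = 8 example.

```python
def sum_erew(values):
    """Sequentially simulate sum_EREW; len(values) must be a power of 2.

    Returns the total (the final A[n]) and the array after every iteration.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a positive power of 2")
    a = [None] + list(values)            # 1-based indexing, as on the slides
    trace = [a[1:]]                      # state before the first iteration
    for i in range(1, n.bit_length()):   # i = 1 .. log2(n)
        step = 2 ** (i - 1)
        # Only processors P_j with (2j mod 2^i) = 0 are active.
        active = [j for j in range(1, n // 2 + 1) if (2 * j) % (2 ** i) == 0]
        # Read and compute phase: each active P_j reads its two operands.
        results = {2 * j: a[2 * j] + a[2 * j - step] for j in active}
        # Write phase: each active P_j writes only its own cell A[2j].
        for index, value in results.items():
            a[index] = value
        trace.append(a[1:])
    return a[n], trace


total, trace = sum_erew([5, 2, 10, 1, 8, 12, 7, 3])
print(total)   # 48
```

Separating the read phase from the write phase mirrors the synchronized read, compute, write cycle of the PRAM model; no cell is touched by two processors in the same phase, which is what the EREW restriction requires.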

  8. Complexity • Run time: T(n) = O(log n) • Number of processors: P(n) = n/2 • Cost: C(n) = O(n log n) • Is it cost optimal?

  9. All partial sums - EREW PRAM • Compute all partial sums of an array A[1..n] • These are A[1], A[1]+A[2], A[1]+A[2]+A[3], …, A[1]+A[2]+… + A[n]. • At first glance you might think it is inherently sequential because one must add up the first k elements before adding in element k+1 • We’ll see that it can be parallelized • Let’s extend sum_EREW to do that

  10. All partial sums (cont.) • We noticed that in sum_EREW most processors are idle most of the time • By exploiting these idle processors, we should be able to compute all partial sums in the same amount of time it takes to compute the single sum

  11. All partial sums (cont.) • Compute all partial sums of A[1..n] • We use n-1 processors (P2, P3, …, Pn) • A[k] will be replaced by the sum of all elements preceding and including A[k] • In algorithm sum_EREW, at iteration i, only n/2^i processors were active, while in allsums_EREW, nearly all processors will be in use.

  12. Example of algorithm allsums_EREW when n=8 (all partial sums on EREW PRAM)
      Step          Active processors   A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]
      Initial       -                      5     2    10     1     8    12     7     3
      Iteration 1   P2, P3, …, P8          5     7    12    11     9    20    19    10
      Iteration 2   P3, P4, …, P8          5     7    17    18    21    31    28    30
      Iteration 3   P5, P6, P7, P8         5     7    17    18    26    38    45    48

  13. Group Work 1- Discuss the algorithm with your neighbor 2- Design the main loops 3- Discuss the Complexity

  14. Algorithm allsums_EREW
      for i = 1 to log n do
        forall Pj, where 2^(i-1) + 1 ≤ j ≤ n do in parallel
          A[j] ← A[j] + A[j - 2^(i-1)]
        endfor
      endfor
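A sequential simulation makes it easier to see why allsums_EREW yields every prefix sum. The sketch below is a hedged illustration, not the course's code: the name `allsums_erew` is an assumption, and the offset is taken to be 2^(i-1) with processors 2^(i-1) + 1 through n active, which matches the n = 8 example.

```python
def allsums_erew(values):
    """Sequentially simulate allsums_EREW; len(values) must be a power of 2."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a positive power of 2")
    a = [None] + list(values)            # 1-based indexing, as on the slides
    for i in range(1, n.bit_length()):   # i = 1 .. log2(n)
        step = 2 ** (i - 1)
        # Snapshot the array so that every read in this iteration sees the
        # values from before any write, as in a synchronous PRAM step.
        old = a[:]
        for j in range(step + 1, n + 1):  # active processors P_j
            a[j] = old[j] + old[j - step]
    return a[1:]


print(allsums_erew([5, 2, 10, 1, 8, 12, 7, 3]))
# [5, 7, 17, 18, 26, 38, 45, 48]
```

The snapshot matters here, unlike in sum_EREW: cell A[j] is written by Pj and also read by P(j + 2^(i-1)) in the same iteration. On a real EREW machine the two reads would presumably be split into two sub-steps (all processors read A[j - 2^(i-1)], then all read A[j]) so that no cell is read by two processors at once.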

  15. Complexity • Run time: T(n) = O(log n) • Number of processors: P(n) = n-1 • Cost: C(n) = O(n log n)

  16. Matrix Multiplication • Two n × n matrices • For clarity, we assume n is a power of 2 • We use CREW to allow concurrent reads • The two matrices are in the shared memory: A[1..n,1..n], B[1..n,1..n] • We will use n^3 processors • We will also show how to reduce the number of processors

  17. Matrix Multiplication (cont.) • The n^3 processors are arranged in a three-dimensional array. Processor Pi,j,k is the one with index (i,j,k) • We will use the 3-dimensional array C[1..n,1..n,1..n] in the shared memory as working space. • The resulting matrix will be stored in locations C[i,j,n], where 1 ≤ i, j ≤ n

  18. Two steps • All n^3 processors operate in parallel to compute n^3 multiplications. (For each of the n^2 cells in the output matrix, n products are computed) • The n products are summed to produce the final value of each cell

  19. Matrix multiplication using n^3 processors: the two steps of the algorithm • Each processor Pi,j,k computes the product A[i,k] * B[k,j] and stores it in C[i,j,k]. • The idea of Algorithm Sum_EREW is applied along the k dimension n^2 times in parallel to compute C[i,j,n], where 1 ≤ i, j ≤ n.

  20. Algorithm MatMult_CREW
      /* step 1 */
      forall Pi,j,k, where 1 ≤ i, j, k ≤ n do in parallel
        C[i,j,k] ← A[i,k] * B[k,j]
      endfor
      /* step 2 */
      for l = 1 to log n do
        forall Pi,j,k, where 1 ≤ i, j ≤ n & 1 ≤ k ≤ n/2 do in parallel
          if (2k mod 2^l) = 0 then
            C[i,j,2k] ← C[i,j,2k] + C[i,j,2k - 2^(l-1)]
          endif
        endfor
      endfor
      /* the output matrix is stored in locations C[i,j,n], where 1 ≤ i, j ≤ n */
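The two steps of MatMult_CREW can be traced with a sequential simulation. The sketch below is illustrative and rests on stated assumptions: the name `matmult_crew` is hypothetical, the step-2 loop counter is taken to be l (the variable used in the 2^l test), and the update is taken to be C[i,j,2k] ← C[i,j,2k] + C[i,j,2k - 2^(l-1)], mirroring sum_EREW along the k dimension.

```python
def matmult_crew(A, B):
    """Sequentially simulate MatMult_CREW for n x n lists of lists.

    n must be a power of 2. Every loop over (i, j, k) below stands for one
    parallel step executed by the processors P(i,j,k).
    """
    n = len(A)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a positive power of 2")
    idx = range(1, n + 1)
    # Working space C[1..n, 1..n, 1..n]; index 0 is unused padding.
    C = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]

    # Step 1: P(i,j,k) computes one product. A[i,k] is read by n processors
    # at once (one per j), which is why concurrent reads are needed.
    for i in idx:
        for j in idx:
            for k in idx:
                C[i][j][k] = A[i - 1][k - 1] * B[k - 1][j - 1]

    # Step 2: the sum_EREW pattern along k, for all n^2 (i, j) pairs at once.
    for l in range(1, n.bit_length()):   # l = 1 .. log2(n)
        step = 2 ** (l - 1)
        for i in idx:
            for j in idx:
                for k in range(1, n // 2 + 1):
                    if (2 * k) % (2 ** l) == 0:
                        C[i][j][2 * k] += C[i][j][2 * k - step]

    # The result is stored in C[i, j, n].
    return [[C[i][j][n] for j in idx] for i in idx]


print(matmult_crew([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

Updating C in place is safe in step 2 because the cells that are read, C[i,j,2k - 2^(l-1)], are never written during the same iteration.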

  21. Complexity • Run time: T(n) = O(log n) • Number of processors: P(n) = n^3 • Cost: C(n) = O(n^3 log n) • Is it cost optimal?

  22. Example: multiplying two 2 x 2 matrices using Algorithm MatMult_CREW (after step 1)
      k = 1:  P1,1,1: C[1,1,1] ← A[1,1]B[1,1]    P1,2,1: C[1,2,1] ← A[1,1]B[1,2]
              P2,1,1: C[2,1,1] ← A[2,1]B[1,1]    P2,2,1: C[2,2,1] ← A[2,1]B[1,2]
      k = 2:  P1,1,2: C[1,1,2] ← A[1,2]B[2,1]    P1,2,2: C[1,2,2] ← A[1,2]B[2,2]
              P2,1,2: C[2,1,2] ← A[2,2]B[2,1]    P2,2,2: C[2,2,2] ← A[2,2]B[2,2]

  23. Example (cont.): multiplying two 2 x 2 matrices using Algorithm MatMult_CREW (after step 2)
      k = 2:  P1,1,2: C[1,1,2] ← C[1,1,2]+C[1,1,1]    P1,2,2: C[1,2,2] ← C[1,2,2]+C[1,2,1]
              P2,1,2: C[2,1,2] ← C[2,1,2]+C[2,1,1]    P2,2,2: C[2,2,2] ← C[2,2,2]+C[2,2,1]

  24. Matrix multiplication: reducing the number of processors to n^3/log n • Processors are arranged in an n × n × n/(log n) 3-dimensional array • Step 1: each processor Pi,j,k, where 1 ≤ k ≤ n/log n, computes the sum of log n products. This step produces n^3/log n partial sums. • Step 2: the partial sums produced in step 1 are added to produce the resulting matrix, as discussed previously. • Complexity analysis • Run time: T(n) = O(log n) • Number of processors: P(n) = n^3/log n • Cost: C(n) = O(n^3)
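The reduced-processor scheme can be sketched as follows. This is a hedged illustration under several assumptions that are not in the slides: the name `matmult_reduced` is hypothetical, each block holds floor(log2 n) products with a possibly shorter last block (so n/log n need not be an integer), and step 2 uses a generic pairwise tree reduction in place of the exact sum_EREW indexing.

```python
import math


def matmult_reduced(A, B):
    """Simulate the n^3 / log n processor scheme for n x n lists of lists."""
    n = len(A)
    block = max(1, int(math.log2(n)))   # products handled by one processor
    groups = math.ceil(n / block)       # processors per output cell
    result = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Step 1: processor (i, j, g) sequentially sums its own block of
            # about log n products, which takes O(log n) time.
            partial = [
                sum(A[i][k] * B[k][j]
                    for k in range(g * block, min(n, (g + 1) * block)))
                for g in range(groups)
            ]
            # Step 2: pairwise tree reduction of the partial sums; each pass
            # of this loop stands for one parallel step.
            while len(partial) > 1:
                partial = [sum(partial[t:t + 2])
                           for t in range(0, len(partial), 2)]
            result[i][j] = partial[0]
    return result


print(matmult_reduced([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

Both steps take O(log n) time, so the run time is unchanged while the processor count drops by a factor of log n, which is where the O(n^3) cost on the slide comes from.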

  25. Searching • Given A = a1, a2, …, ai, …, an and x • Determine whether x = ai for some i • Sequential binary search → O(log n) • Simple idea • Divide the list among the processors and let each processor conduct its own binary search • EREW PRAM → O(log(n/p)) + O(log p) = O(log n) • CREW → O(log(n/p))
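The simple idea above can be sketched as a sequential simulation. The code is an illustration under assumptions: the name `partitioned_search` is hypothetical, A is assumed to be sorted in ascending order (as binary search requires), and the p processors are simulated one after another with the standard-library `bisect_left`.

```python
from bisect import bisect_left


def partitioned_search(a, x, p):
    """Return an index i with a[i] == x, or -1 if x is absent.

    Each of the p simulated processors binary-searches its own contiguous
    chunk of about n/p elements.
    """
    n = len(a)
    chunk = -(-n // p)                      # ceil(n / p)
    for proc in range(p):                   # conceptually: in parallel
        lo = proc * chunk
        hi = min(n, lo + chunk)
        i = bisect_left(a, x, lo, hi)       # O(log(n/p)) comparisons
        if i < hi and a[i] == x:
            return i
    return -1


print(partitioned_search([1, 3, 5, 7, 9, 11, 13, 15], 11, 4))   # 5
```

Every processor needs its own copy of x before it can start. With concurrent reads all processors fetch x in one step; on an EREW machine x presumably has to be broadcast first, which would account for the extra O(log p) term.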

  26. Parallel Binary Search • Split A into p+1 segments of almost equal length • Compare x with p elements at the boundary between successive segments • Either x = ai or search is restricted to only one of the p+1 segments • Repeat until x is found or length of the list is <= p
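The steps above can be sketched as a sequential simulation in which each pass of the outer loop stands for one parallel round of p comparisons. The name `parallel_search` and the exact placement of the boundaries are assumptions made for illustration; A is assumed to be sorted in ascending order and p ≥ 1.

```python
def parallel_search(a, x, p):
    """Return an index i with a[i] == x, or -1 if x is absent."""
    lo, hi = 0, len(a)                      # current search range a[lo:hi]
    while hi - lo > p:
        size = hi - lo
        # p boundary positions that split a[lo:hi] into p + 1 segments
        # of almost equal length.
        bounds = [lo + (t * size) // (p + 1) for t in range(1, p + 1)]
        new_lo, new_hi = lo, hi
        for b in bounds:                    # conceptually: one parallel step
            if a[b] == x:
                return b
            if a[b] < x:
                new_lo = b + 1              # x can only lie right of b
            else:
                new_hi = min(new_hi, b)     # x can only lie left of b
        lo, hi = new_lo, new_hi
    # At most p elements remain: one processor checks each of them.
    for i in range(lo, hi):
        if a[i] == x:
            return i
    return -1


a = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
print(parallel_search(a, 13, 3))   # 6
print(parallel_search(a, 4, 3))    # -1
```

Each round shrinks the range by a factor of roughly p + 1, so about log(n)/log(p + 1) rounds are needed. All processors read x in every round, so this version fits the CREW model.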
