Introduction to Parallel Processing with Multi-core, Part II—Algorithms
Jie Liu, Ph.D., Professor, Department of Computer Science, Western Oregon University, USA, liuj@wou.edu
Part II outline • More about PRAM • Activating PRAM processors • Finding Max in constant amount of time • Algorithms on PRAM • The fan-in algorithm • The list ranking algorithm • The parallel merge algorithm • The prefix sum algorithm • Brent’s theorem and the use of it • Speedup and its calculation • The cost of a parallel algorithm and the Cost Optimal concept • NC class and P Complete • Amdahl’s Law and Gustafson-Barsis’ Law
More About PRAM
• Remember, each PRAM processor can either
  • Perform the prescribed operation (the same for all processors),
  • Carry out an I/O operation,
  • Idle, or
  • Activate another processor
• So, once n processors are active, they can activate another n processors in one step, giving us 2n active processors
• Now, two questions:
  • What happens if two processors write to the same memory location?
  • How many steps does it take to activate n processors?
Handling Write Conflicts in PRAM
• EREW (Exclusive Read Exclusive Write)
• CREW (Concurrent Read Exclusive Write)
• CRCW (Concurrent Read Concurrent Write)
  • Common – the concurrent write is legal only if all the values written are the same
  • Arbitrary – pick one of the values and set it
  • Priority – the processor with the highest priority wins
• A multi-core computer is which one of the above?
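The three CRCW write policies can be sketched with a small sequential simulation. This is illustrative only: the function name `crcw_write` and the lowest-index-wins priority convention are assumptions, not from the slides.

```python
# Sketch: resolving simultaneous writes to one memory cell under the three
# CRCW policies. "writes" is a list of (processor_id, value) pairs.

def crcw_write(writes, policy):
    values = [v for _, v in writes]
    if policy == "common":
        # Legal only if every processor writes the same value
        assert all(v == values[0] for v in values), "Common-CRCW violation"
        return values[0]
    if policy == "arbitrary":
        return values[0]  # any single write may win; pick the first here
    if policy == "priority":
        # Convention assumed here: lowest processor id = highest priority
        return min(writes)[1]

print(crcw_write([(2, 7), (0, 9), (1, 5)], "priority"))  # processor 0 wins -> 9
```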
Activating n Processors
• Let Activate(Pj) represent the activation of processor Pj by an active processor
• Let if () {} else {}, for {}, while {}, and do {} have their standard meanings
• Let the symbol = denote the assignment operation
• Let for all <processor list> do {statement list} represent that the statement list is executed in parallel by all the processors in the processor list

Spawn(P0) // assuming P0 is already active
{
  for i = 0 to ⌈log n⌉ - 1 do
    for all Pj where 0 <= j < 2^i
      if (j + 2^i < n)
        Activate(P(j + 2^i))
}
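A minimal sequential sketch of the doubling activation: P0 starts alone, and in each round every already-active processor activates one more, so the active count doubles. The helper name `spawn` and its return value (the number of rounds) are illustrative.

```python
# Simulate the doubling Spawn: in round r, every active P_j with j < 2^r
# activates P_{j + 2^r}, so the active set doubles until n are running.

def spawn(n):
    active = {0}          # only P0 is active initially
    rounds = 0
    while len(active) < n:
        step = 1 << rounds
        active |= {j + step for j in active if j + step < n}
        rounds += 1
    return rounds

print(spawn(8))    # 3 rounds = ceil(log2 8)
print(spawn(100))  # 7 rounds = ceil(log2 100)
```

Because the active count doubles each round, activating n processors takes ⌈log n⌉ rounds.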
About the procedure Spawn
• What is its complexity?
• Its activation pattern forms a binomial tree
Finding Max in a constant time
• Input: an array of n integers arrA[0..n-1]
• Output: the largest number in arrA[0..n-1]
• Global variables: arrB[0..n-1], i, j
• Assume the computer is a CRCW/Common PRAM

FindingMax(arrA[0..n-1])
{
1.  for all Pi where 0 <= i <= n-1
2.    arrB[i] = 1
3.  for all Pi,j where 0 <= i, j <= n-1
4.    if (arrA[i] < arrA[j])
5.      arrB[i] = 0
6.  for all Pi where 0 <= i <= n-1
7.    if (arrB[i] = 1)
8.      print arrA[i]
}
Finding Max – how does it work
• After line 2, every arrB[i] is 1
• Lines 3 – 5:
  for all Pi,j where 0 <= i, j <= n-1
    if (arrA[i] < arrA[j])
      arrB[i] = 0
• A 0 is written to arrB[i] if arrA[i] is smaller than some element in arrA; the concurrent writes are legal because every writer writes the same value 0 (CRCW/Common)
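The logic can be checked with a sequential simulation, where the nested loops stand in for the n² processors that run the comparisons in one parallel step:

```python
# Sequential simulation of the constant-time CRCW/Common max: arrB[i] stays 1
# only if no element is larger than arrA[i]. All conflicting writes store the
# same value 0, which is what makes the Common write policy legal.

def find_max_crcw(arr):
    n = len(arr)
    b = [1] * n                       # step 1: n processors write 1
    for i in range(n):                # step 2: n*n processors compare in parallel
        for j in range(n):
            if arr[i] < arr[j]:
                b[i] = 0              # every writer writes 0 -> Common is safe
    return [arr[i] for i in range(n) if b[i] == 1][0]

print(find_max_crcw([3, 9, 4, 1]))  # 9
```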
Finding Max questions
• How do we do it sequentially, and what is the complexity then?
• How is it done in parallel, and what is the complexity?
• How many processors are needed?
• Will the algorithm work if the computer is CRCW/Arbitrary?
• On the PRAM, what is the minimum amount of time required to run the algorithm, assuming only P0 is activated initially? [Hint: remember Spawn()]
• Are there other approaches to finding the max?
Fan-in algorithm
• Also called reduction: it calculates x0 ⊕ x1 ⊕ … ⊕ xn-1, where ⊕ is an associative operator. When ⊕ is +, the calculation is a sum.
• The figure on the right shows the summing of n numbers using ⌊n/2⌋ processors
Fan-in algorithm (2)

FanInTotal(A[0..n-1], n) // n >= 1, the sum of array A ends up in A[0], machine is CREW
{
  Spawn(⌈n/2⌉ processors)
  for all Pi where 0 <= i <= ⌈n/2⌉ - 1
    for j from 0 to ⌈log n⌉ - 1
      if (i mod 2^j = 0 and 2i + 2^j < n)
        A[2i] = A[2i] + A[2i + 2^j]
}
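One way to read the fan-in rounds: in round j, each processor i with i mod 2^j = 0 adds A[2i + 2^j] into A[2i], so partial sums climb the tree toward A[0]. A sequential sketch of those rounds (each inner iteration stands in for one processor's step):

```python
import math

# Simulate the fan-in (reduction) rounds. After ceil(log2 n) rounds the
# total of the array is accumulated in A[0].

def fan_in_total(a):
    a = a[:]                                   # work on a copy
    n = len(a)
    for j in range(math.ceil(math.log2(n))):   # the parallel rounds
        for i in range(math.ceil(n / 2)):      # one "processor" per pair
            if i % (1 << j) == 0 and 2 * i + (1 << j) < n:
                a[2 * i] += a[2 * i + (1 << j)]
    return a[0]

print(fan_in_total([1, 2, 3, 4, 5]))  # 15
```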
Fan-in algorithm questions
• How do we do it sequentially, and what is the complexity then?
• How is it done in parallel, and what is the complexity?
• How many processors are needed?
• Will the algorithm work if the computer is CRCW or EREW?
Prefix Sum
Let x0, x1, …, xn-1 be n values and ⊕ be an associative operator; the prefix sums problem is to find the following n quantities:
x0, x0 ⊕ x1, x0 ⊕ x1 ⊕ x2, …, x0 ⊕ x1 ⊕ … ⊕ xn-1

PrefixSums(A[0..n-1], n) // n >= 1
{
  Spawn(n - 1 processors)
  for all Pi where 1 <= i <= n - 1
    for j from 0 to ⌈log n⌉ - 1
      if (i - 2^j >= 0)
        A[i] = A[i] + A[i - 2^j]
}
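One way to read the rounds: in round j, every processor i with i - 2^j >= 0 adds A[i - 2^j] into A[i], so each position accumulates a window that doubles in length. A sequential sketch, where a snapshot of the array models the PRAM convention that all reads in a step happen before all writes:

```python
import math

# Simulate the parallel prefix-sums rounds. "old" is a snapshot so that
# all reads in a round precede all writes, as on a synchronous CREW PRAM.

def prefix_sums(a):
    a = a[:]
    n = len(a)
    for j in range(math.ceil(math.log2(n))):
        old = a[:]                         # reads-before-writes snapshot
        for i in range(1, n):              # one "processor" per index i >= 1
            if i - (1 << j) >= 0:
                a[i] = old[i] + old[i - (1 << j)]
    return a

print(prefix_sums([1, 2, 3, 4]))  # [1, 3, 6, 10]
```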
Prefix Sum Questions
• How do we do it sequentially, and what is the complexity then?
• How is it done in parallel, and what is the complexity?
• How many processors are needed? Why isn't processor 0 used?
List Ranking
• We use an array to represent a linked list
• Determining, for each element, the number of elements in front of it in the list is called the list ranking problem
• How do we do list ranking sequentially?
• Can we perform list ranking in parallel?
Parallel List Ranking—Algorithm

ListRanking(next[0..n-1], n) // array next contains the pointers of the linked list
{
  pos[0..n-1] // local variable, array of int that holds the result – the ranking
  Spawn(n processors)
  for all Pi where 0 <= i <= n - 1
  {
    pos[i] = 1
    if (next[i] = i)
      pos[i] = 0
    for j from 1 to ⌈log n⌉
    {
      pos[i] = pos[i] + pos[next[i]]
      next[i] = next[next[i]]
    }
  }
}

Remember: on a PRAM, all processors must carry out the same operation
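The pointer-jumping idea can be simulated sequentially. In this sketch, next[i] == i marks the last element, and pos[i] ends up as the number of elements that follow i in the list (the distance to the end); snapshots model the synchronous PRAM step where all reads precede all writes.

```python
import math

# Pointer jumping: each element adds its successor's count and then jumps
# its pointer, so the covered distance doubles every round.

def list_rank(nxt):
    n = len(nxt)
    nxt = nxt[:]
    pos = [0 if nxt[i] == i else 1 for i in range(n)]
    for _ in range(math.ceil(math.log2(n))):
        old_pos, old_nxt = pos[:], nxt[:]     # synchronous PRAM step
        for i in range(n):                    # one "processor" per element
            pos[i] = old_pos[i] + old_pos[old_nxt[i]]
            nxt[i] = old_nxt[old_nxt[i]]
    return pos

# List order 2 -> 0 -> 3 -> 1, where element 1 is last (next[1] == 1)
print(list_rank([3, 1, 0, 1]))  # [2, 0, 3, 1]
```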
Parallel List Ranking—Algorithm Explained
• Key steps:
  for all Pi where 0 <= i <= n - 1
  {
    …
    for j from 1 to ⌈log n⌉
    {
      pos[i] = pos[i] + pos[next[i]]
      next[i] = next[next[i]]
    }
  }
(The accompanying figure traces the pointers and partial ranks after rounds j = 1, 2, 3, and 4.)
List Ranking Questions
• How do we do it sequentially, and what is the complexity then?
• How is it done in parallel, and what is the complexity?
• How many processors are needed?
• What is the key step that lets an apparently sequential problem be solved concurrently?
• Will the algorithm work if the computer is CRCW or EREW?
Merging Two Sorted Arrays
• The problem: n is an even number. An array of size n stores two sorted sequences of integers, each of size n/2. We need to merge the two sorted segments in O(log n) steps.
Merging Two Sorted Arrays (2)
• The sequential approach: two yardsticks
• The sequential approach has no concurrency to exploit
• This calls for a new algorithm
• Key idea: if we know there are k elements smaller than A[i], we can copy A[i] to its final position in one step
• If i <= n/2, then there are i - 1 elements in A[i]'s own half smaller than A[i] (assuming the array is 1-based). Now, how can we find the number of elements in the second half of A that are also smaller than A[i]? Binary search (an O(log n) algorithm)!
Merging Two Sorted Arrays In Parallel

// A[1] to A[n/2] and A[n/2 + 1] to A[n] are two sorted sections
MergeArray(A[1..n])
{
  int x, low, high, index
  for all Pi where 1 <= i <= n // the lower half searches the upper half; the upper half searches the lower half
  {
    if (i <= n/2)
    {
      low = n/2 + 1
      high = n
    }
    else
    {
      low = 1
      high = n/2
    }
    x = A[i]
    repeat // perform binary search
    {
      index = ⌊(low + high) / 2⌋
      if (x < A[index])
        high = index - 1
      else
        low = index + 1
    } until low > high
    A[high + i - n/2] = x
  }
}
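A sequential simulation of the parallel merge, assuming distinct elements and 1-based indexing as on the slide (index 0 of the Python list is unused). Each loop iteration stands in for one processor binary-searching the opposite half and writing its element to A[high + i - n/2]:

```python
# Each "processor" i binary-searches the other sorted half; "high" ends up
# encoding how many elements of that half are smaller than A[i], so
# high + i - n/2 is the element's final 1-based position.

def parallel_merge(a):               # a[1..n] used; a[0] is a placeholder
    n = len(a) - 1
    out = a[:]
    for i in range(1, n + 1):        # all processors run this in parallel
        if i <= n // 2:
            low, high = n // 2 + 1, n     # lower half searches upper half
        else:
            low, high = 1, n // 2         # upper half searches lower half
        x = a[i]
        while low <= high:           # binary search
            index = (low + high) // 2
            if x < a[index]:
                high = index - 1
            else:
                low = index + 1
        out[high + i - n // 2] = x
    return out

a = [None, 2, 5, 8, 1, 3, 9]         # two sorted halves of size 3
print(parallel_merge(a)[1:])         # [1, 2, 3, 5, 8, 9]
```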
Brent’s Theorem
• Given A, a parallel algorithm with computation time t: if parallel algorithm A performs m computational operations, then p processors can execute algorithm A in time t + (m - t)/p.
• Proof: Let s_i be the number of computational operations performed by A at step i, where 1 <= i <= t. By definition, s_1 + s_2 + … + s_t = m. Using p processors, we can simulate the s_i computational operations at step i in time ⌈s_i / p⌉.
• Therefore, the total time is at most ⌈s_1/p⌉ + ⌈s_2/p⌉ + … + ⌈s_t/p⌉ <= (s_1 + p - 1)/p + … + (s_t + p - 1)/p = (m + t(p - 1))/p = t + (m - t)/p.
Applying Brent’s Theorem
• For the Sum algorithm, the execution time is ⌈log n⌉; however, the total amount of computation is n - 1 operations.
• Notice that we use n/2 processors during the first step, n/4 processors during the second step, …, and 1 processor during the last step. However, we allocated n/2 processors initially, so after each step more and more processors are idling.
• If we only assign ⌈n / log n⌉ processors, then the execution time is, according to Brent’s theorem, ⌈log n⌉ + ((n - 1) - ⌈log n⌉) / ⌈n / log n⌉ = O(log n)
• That is, reducing the number of processors to ⌈n / log n⌉ does not change the complexity of the parallel algorithm.
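The bound can be checked with concrete numbers. For the fan-in sum with n = 1024 (a value chosen only for illustration), t = ⌈log n⌉ = 10 steps and m = n - 1 = 1023 operations, and cutting the processor count from 512 down to about n / log n still leaves the time within a constant factor of log n:

```python
import math

# Numeric check of Brent's bound t + (m - t)/p for the fan-in sum.
n = 1024
t = math.ceil(math.log2(n))     # 10 parallel steps
m = n - 1                       # 1023 additions in total
p = n // t                      # 102 processors instead of n/2 = 512
bound = t + (m - t) / p         # Brent's theorem
print(t, round(bound, 1))       # 10 vs. ~19.9: still O(log n) time
```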
Cost of a parallel algorithm
• The cost of a parallel algorithm is defined to be the product of the algorithm’s time complexity and the number of processors used.
• The original Sum algorithm has a cost of Θ(n log n): n/2 processors times Θ(log n) time.
• The version that uses ⌈n / log n⌉ processors has a cost of Θ((n / log n) × log n) = Θ(n)
• Note that Θ(n) is the same as the cost of the sequential algorithm.
• A cost optimal parallel algorithm is an algorithm for which the cost is in the same complexity class as an optimal sequential algorithm for the same problem.
• Sum using ⌈n / log n⌉ processors is an example of a cost optimal algorithm.
• Using n × n processors to find the max in constant time is not cost optimal.
NC class and P-Complete
• NC is the class of problems solvable on a PRAM in poly-logarithmic time using a number of processors that is a polynomial function of the problem size.
• All the algorithms we have discussed, except Finding Max, are in NC. This is the class of problems for which we are interested in finding parallel solutions.
• P is the class of problems solvable, sequentially, in polynomial time.
• A problem L ∈ P is P-complete if every other problem in P can be transformed to L in poly-logarithmic time using a PRAM with a polynomial number of processors. Note that the transformation is in NC.
• Examples of P-complete problems are depth-first search of an arbitrary graph and the circuit value problem.
• P-complete is a class of problems that appear to have no fast parallel solution; we just cannot prove it yet!
Speedup – Take II
• Speedup = (execution time on one CPU) / (execution time on p CPUs)
• For most parallel algorithms, the speedup is less than p, the number of processors
• If an algorithm enjoys a speedup close to p even when p is large, we consider the algorithm scalable
• Superlinear speedup is when the speedup of an algorithm is greater than p, the number of processors. This can happen if
  • The parallel algorithm introduces a new approach to solving the problem
  • The parallel algorithm utilizes the cache memory more efficiently
  • The parallel algorithm gets lucky, for example, when performing breadth-first search
Amdahl’s Law and Gustafson-Barsis’s Law
• Amdahl’s Law: Let s be the fraction of operations in a computation that must be performed sequentially, where 0 <= s <= 1. The maximum speedup achievable by a parallel computer with p processors performing the computation is
  Speedup <= 1 / (s + (1 - s)/p)
• Gustafson-Barsis’s Law: Given a parallel program solving a problem using p processors, let s denote the fraction of the total execution time spent performing sequential operations. The maximum speedup achievable by this program is
  Speedup <= p + (1 - p)s
• In a way, these two laws contradict each other. How can we explain this contradiction?
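The two bounds can be compared side by side. The values s = 0.05 and p = 100 below are only an illustration: Amdahl fixes the problem size (so the sequential fraction caps speedup), while Gustafson fixes the parallel execution time (so larger scaled problems keep speeding up), which is one way to resolve the apparent contradiction.

```python
# Amdahl: fixed problem size -> speedup saturates as p grows.
def amdahl(s, p):
    return 1 / (s + (1 - s) / p)

# Gustafson-Barsis: problem scaled with p -> speedup grows almost linearly.
def gustafson(s, p):
    return p + (1 - p) * s

s, p = 0.05, 100
print(round(amdahl(s, p), 1))   # ~16.8, no matter how many more processors
print(gustafson(s, p))          # ~95, close to linear in p
```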