Parallel Algorithms Lecture Notes
Motivation
• Programs face two perennial problems:
• Time: run faster in solving a problem
• Example: speed up the time needed to sort 10 million records
• Size: solve a "bigger" problem
• Example: multiply matrices of large dimensions (a PC with 512MB RAM can hold at most an 8192*8192 matrix of 8-byte doubles, since 8192*8192*8 bytes = 512MB)
• Possible solution: parallelism
• Split a problem into several tasks and perform these in parallel
• A parallel computer, broadly defined: a set of processors that are able to work cooperatively to solve a computational problem
• Includes: parallel supercomputers, clusters of workstations, multiple-processor workstations
Concepts
• parallel
• concurrent
• multiprocessing
• multiprogramming
• distributed
Logical vs physical parallelism
• A concurrent program consisting of 3 processes: P0, P1, P2
• Executed on a system with 3 processors: physical parallelism (multi-processing); the processes run simultaneously, one per processor
• Executed on a system with 1 processor: logical parallelism (multi-programming); the processes are interleaved in time on the single processor
[Figure: execution timelines of processes P0, P1, P2 under multi-processing (3 processors) and under multi-programming (1 processor)]
Parallelizing sequential code
• The enabling condition for doing two tasks in parallel: no dependences between them!
• Parallelizing compilers: compile sequential programs into parallel code
• A research goal since the 1970s
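As an illustration (not part of the original notes), a loop whose iterations have no dependences is exactly what a parallelizing compiler, or a programmer annotation such as an OpenMP pragma, can exploit. A minimal sketch, assuming a C compiler with OpenMP support (compile with -fopenmp):

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Each iteration writes only c[i] and reads only a[i] and b[i]:
           no dependences between iterations, so they may run in parallel. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }

        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }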
Example: Adding n numbers
Sequential solution:

    sum = 0;
    for (i = 0; i < n; i++) {
        sum += A[i];
    }

O(n)
The sequential algorithm cannot be straightforwardly parallelized, since every iteration depends on the value of sum produced by the previous one
Summing in sequence: always O(n)
Summing in pairs:
• P=1 processor: O(n)
• P=n/2 processors: O(log n)
Parallelizing = re-thinking the algorithm!
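A minimal sketch of summing in pairs (not the notes' original code): in each round, pairs of partial sums are combined; with one processor per pair, each round runs in parallel, and O(log n) rounds suffice. The rounds are written sequentially here to show the data flow, and the function name is illustrative:

    /* Pairwise (tree) summation: log2(n) rounds of pair combining.
       Assumes n is a power of 2. */
    double sum_in_pairs(double *A, int n) {
        for (int stride = 1; stride < n; stride *= 2) {
            /* The iterations of this inner loop are independent of each
               other; a parallel version could assign one per processor. */
            for (int i = 0; i + stride < n; i += 2 * stride) {
                A[i] += A[i + stride];
            }
        }
        return A[0];  /* the total accumulates at position 0 */
    }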
It's not likely a compiler will produce good parallel code from a sequential specification any time soon…
• Fact: for most computations, a "best" sequential solution (practically, not theoretically) and a "best" parallel solution are usually fundamentally different
• Different solution paradigms imply the computations are not "simply" related
• Compiler transformations generally preserve the solution paradigm
• Therefore the programmer must discover the parallel solution!
Sequential vs parallel programming
• Has different costs, different advantages
• Requires different, unfamiliar algorithms
• Must use different abstractions
• More complex to understand a program's behavior
• More difficult to control the interactions of the program's components
• Knowledge/tools/understanding more primitive
Example: Count number of 3's
Sequential solution:

    count = 0;
    for (i = 0; i < length; i++) {
        if (array[i] == 3)
            count++;
    }

O(n)
Example: Trial solution 1
• Divide the array into t=4 chunks
• Assign each chunk to a different concurrent task, identified by id = 0 ... t-1
• Code of each task:

    int length_per_thread = length / t;
    int start = id * length_per_thread;
    for (i = start; i < start + length_per_thread; i++) {
        if (array[i] == 3)
            count += 1;
    }

Problem: race condition! This is not a correct concurrent program: accesses to the same shared memory location (variable count) must be protected
Example: Trial solution 2
• Correct the previous trial solution by adding mutex locks to prevent concurrent accesses to the shared variable count
• Code of each task:

    mutex m;   /* shared by all tasks */
    for (i = start; i < start + length_per_thread; i++) {
        if (array[i] == 3) {
            mutex_lock(m);
            count++;
            mutex_unlock(m);
        }
    }

Problem: VERY slow! There is no real parallelism: the tasks spend their time waiting for each other at the lock
Example: Trial solution 3
• Each processor adds into its own private counter; the partial counts are combined at the end
• Code of each task:

    for (i = start; i < start + length_per_thread; i++) {
        if (array[i] == 3)
            private_count[id]++;
    }
    mutex_lock(m);
    count += private_count[id];
    mutex_unlock(m);

Problem: STILL no speedup measured when using more than 1 processor!
Reason: false sharing: the private counters are adjacent in memory and fall on the same cache line, so every update by one processor invalidates the line in the other processors' caches
Example: Solution 4
• Force each private counter onto a separate cache line by "padding" it with "unused" locations:

    struct padded_int {
        int value;
        char padding[128];
    } private_count[MaxThreads];

Finally a speedup is measured when using more than 1 processor!
Conclusion: producing correct and efficient parallel programs can be considerably more difficult than writing correct and efficient serial programs!
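For concreteness, here is one possible end-to-end version of solution 4 (a sketch, not the notes' original code), written with POSIX threads; the constants NUM_THREADS and LENGTH and the function name count3s_task are illustrative assumptions. Compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_THREADS 4
    #define LENGTH (1 << 24)

    /* Pad each private counter onto its own cache line (assumed to be
       at most 128 bytes) so the counters do not falsely share a line. */
    struct padded_int {
        int value;
        char padding[128];
    };

    static int *array;
    static struct padded_int private_count[NUM_THREADS];
    static int count = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void *count3s_task(void *arg) {
        int id = (int)(long)arg;
        int length_per_thread = LENGTH / NUM_THREADS;
        int start = id * length_per_thread;

        for (int i = start; i < start + length_per_thread; i++)
            if (array[i] == 3)
                private_count[id].value++;   /* private: no lock needed */

        pthread_mutex_lock(&m);              /* combine once, at the end */
        count += private_count[id].value;
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];

        array = malloc(LENGTH * sizeof(int));
        for (int i = 0; i < LENGTH; i++)
            array[i] = rand() % 10;

        for (long id = 0; id < NUM_THREADS; id++)
            pthread_create(&threads[id], NULL, count3s_task, (void *)id);
        for (int id = 0; id < NUM_THREADS; id++)
            pthread_join(threads[id], NULL);

        printf("number of 3s: %d\n", count);
        free(array);
        return 0;
    }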
Goals of Parallel Programming
• Performance: the parallel program runs faster than its sequential counterpart (a speedup is measured)
• Scalability: as the size of the problem grows, more processors can be "usefully" added to solve the problem faster
• Portability: the solutions run well on different parallel platforms