Parallel Processing: A Perspective on Hardware and Software
Introduction to PP • From “Efficient Linked List Ranking Algorithms and Parentheses Matching as a New Strategy for Parallel Algorithm Design”, R. Halverson • Chapter 1 – Introduction to Parallel Processing
Parallel Processing Research • 1980s – A great deal of research & publications • 1990s – Hardware not very successful, so the research area "dies" – Why??? • Early 2000s – Begins resurgence? Why??? Will it continue to be successful this time??? • 2010 – Multicore, graphics processing units
Goal of PP • Why bother with Parallel Processing? • Goal: Solve problems faster! • In reality, faster but also efficiently! • Work-Optimal: the parallel algorithm runs faster than the sequential algorithm in proportion to the number of processors used • Sometimes work-optimal is not possible
Performance Metrics • Running time • Speedup = T1 /Tp • Work = p * Tp • Work Optimal: Work = O(T1) • Linear speedup • Scalable: if algorithm performance increases linearly with the number of processors utilized
PP Issues • Processors: number, connectivity, communication • Memory: shared vs. local • Data structures • Data Distribution • Problem Solving Strategies
Parallel Problems • One Approach: Try to develop a parallel solution to a problem without consideration of the hardware. • Then apply the solution to the specific hardware and determine the extra cost, if any • If not acceptably efficient, try again!
Parallel Problems • Another Approach: Armed with the knowledge of strategies, data structures, etc. that work well for a particular hardware, develop a solution with a specific hardware in mind. • Third Approach: Modify a solution for one hardware configuration for another
Real World Problems • Inherently Parallel – nature or structure of the problem lends itself to parallelism • Examples • Mowing a lawn • Cleaning a house • Grading papers • Problems are easily divided into sub-problems; very little overhead
Real World Problems • Not Inherently Parallel – parallelism is possible but more complex to define or with (excessive) overhead cost • Examples • Balancing a checkbook • Giving a haircut • Wallpapering a room • Prefix sums of an array
Some Computer Problems • Are these "inherently parallel" or not? • Processing customers' monthly bills • Payroll checks • Building student grade reports from class grade sheets • Searching for an item in a linked list • A video game program • Searching a state driver's license database • Does the answer depend on hardware, software, or the data? What assumptions are being made?
General Observations • What characteristics make a problem inherently parallel? • What characteristics make a problem difficult to parallelize? • Consider hardware, software, and data structures.
Payroll Problem • Consider 10 PUs (processing units), with each employee's information stored in a row of an array A • Label the PUs P0, P1, …, P9 • A[100] – indices 0 to 99
For i = 0 to 9
    Pi processes A[i*10] to A[((i+1)*10)-1]
Code for Payroll
For i = 0 to 9
    Pi processes A[i*10] to A[((i+1)*10)-1]
Each PU runs a process in parallel:
For each Pi, i = 0 to 9 do    // separate process per PU
    For j = 0 to 9
        Process A[i*10 + j]
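A runnable sketch of the same data distribution using Python's multiprocessing module; process_record is a hypothetical stand-in for the per-employee payroll computation, and the index arithmetic mirrors the pseudocode above.

from multiprocessing import Pool

def process_record(record):
    # Placeholder payroll computation for one employee record (assumption)
    return record * 1.05

def process_block(args):
    A, i, block = args
    # Processor i handles A[i*block] .. A[(i+1)*block - 1]
    return [process_record(A[j]) for j in range(i * block, (i + 1) * block)]

if __name__ == "__main__":
    P = 10                       # number of processing units
    A = list(range(100))         # 100 employee records, indices 0..99
    block = len(A) // P
    with Pool(P) as pool:
        results = pool.map(process_block, [(A, i, block) for i in range(P)])
    # results[i] holds the processed records for PU i's block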
Time Complexity of Payroll Algorithm • Consider P processors • Consider N data items • Each PU has N/P data items • Assume the data is accessible & writeable by each PU • Time for each PU: O(N/P) • Work = P * O(N/P) = O(N) • Overhead??
Payroll Questions?? • Now that we have a solution, it must be applied to hardware. Which hardware? • Main question: Where is the array, and how is it accessed by each processor? • One shared memory or many local memories? • Where are the results placed?
What about I/O?? • Generally, in parallel algorithms, I/O is disregarded. • Assumption: Data is stored in the available memory. • Assumption: Results are written back to memory. • Data input and output are generally independent of the processing algorithm.
Balancing a Checkbook • Consider the same hardware & data array • Can still distribute and process in the same manner as the payroll problem • Each PU computes deposits as additions & checks as subtractions, and totals its block (10 subtotals) • BUT the 10 subtotals must then be combined into the final total • This is the overhead!
Complexity of Checkbook • Consider P processors • Consider N data items • Each PU has N/P data items • Assume the data is accessible & writeable • Time for each section: O(N/P) • Combination of the P subtotals • Time for combining: O(P) down to O(log P), depending on the strategy used • Total: O(N/P + P) to O(N/P + log P)
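A minimal Python sketch of the two phases, with illustrative data and names: the local phase mirrors the payroll distribution, and the combining phase adds the subtotals pairwise, which takes O(log P) rounds when the pairs are added in parallel.

def checkbook_total(transactions, P):
    # Local phase: each of P PUs totals its own block (deposits positive,
    # checks negative); O(N/P) time per PU.
    block = len(transactions) // P
    subtotals = [sum(transactions[i * block:(i + 1) * block]) for i in range(P)]
    # Combining phase: pairwise (tree) combination of the P subtotals;
    # O(log P) rounds if each round's additions run in parallel, vs O(P) sequentially.
    while len(subtotals) > 1:
        subtotals = [subtotals[i] + (subtotals[i + 1] if i + 1 < len(subtotals) else 0)
                     for i in range(0, len(subtotals), 2)]
    return subtotals[0]

# Example with illustrative numbers: 100 signed transactions on 10 PUs.
assert checkbook_total([5.0, -2.5] * 50, 10) == 125.0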
Performance Complexity • Perfect Parallel Algorithm: if the best sequential algorithm for a problem is O(f(x)), then the parallel algorithm would be O(f(x)/P) • This happens only if there is little or no overhead • Actual Run Time: typically it takes 4 processors to achieve ½ the actual run time
Performance Measures • Run Time: not a practical measurement by itself • Assume T1 & Tp are the run times using 1 & p processors, respectively • Speedup: S = T1/Tp • Work: W = p * Tp (aka Cost) • If W = O(T1) then it is Work (Cost) Optimal & achieves Linear Speedup
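As a worked illustration with made-up numbers: if T1 = 1000 time units and p = 8 processors give Tp = 150, then S = T1/Tp ≈ 6.7 and W = p * Tp = 1200, which is within a constant factor of T1, so the algorithm is work (cost) optimal on this input. If instead Tp = 500, then W = 4000 and most of the processors' time is wasted.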
Scalability • An algorithm is said to be Scalable if performance increases linearly with the number of processors • i.e. double the processors, cut time in half • Implication: Algorithm sustains good performance over a wide range of processors.
Scalability • What about continuing to add processors? • At what point does adding more processors stop improving the run time? • Does adding processors ever cause the algorithm to take more time? • What is the optimal number of processors? • Consider W = p * Tp = O(T1) • Solve for p • Want W to be optimal for different numbers of processors
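A worked instance of "solve for p" using the checkbook bounds above: with Tp = O(N/p + log p), the work is W = p * Tp = O(N + p log p). This stays O(T1) = O(N) only while p log p = O(N), i.e. roughly p = O(N / log N); adding processors beyond that point increases the work without a compensating drop in run time.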
Models of Computation • Two major categories • Shared memory • e.g. PRAM • Fixed connection • e.g. Hypercube • There are numerous versions of each • Not all are totally realizable in hardware
Sidenote: Models • Distributed Computing • Two or more separate computers used to solve a single problem • A version of a network • Clusters • Not really a topic for this course
Shared Memory Model • PRAM – parallel random access machine • A category with 4 variants • EREW-CREW-ERCW-CRCW • All communication is through a shared global memory • Each processor has a small local memory
Variants of PRAM • EREW-CREW-ERCW-CRCW • Concurrent read: 2 or more processors may read the same (or different) memory locations simultaneously • Exclusive read: 2 or more processors may access global memory only if each is accessing a unique address • Concurrent and exclusive write are defined similarly
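As a concrete illustration of what the read rule changes (not from the source): broadcasting a value x to all p processors takes one step on a CREW PRAM, because every processor may read the same cell at once, but needs recursive doubling on an EREW PRAM. A minimal Python simulation, with global memory modelled as a plain list and names chosen for illustration:

def erew_broadcast(p, value):
    # Simulate EREW broadcast by recursive doubling: in each step every
    # processor that already holds the value copies it into one new, distinct
    # cell, so no cell is read or written by two processors at once.
    mem = [None] * p          # one "global memory" cell per processor
    mem[0] = value
    have = 1                  # number of cells currently holding the value
    steps = 0
    while have < p:
        for i in range(have):             # each holder writes one new cell
            if have + i < p:
                mem[have + i] = mem[i]
        have = min(2 * have, p)
        steps += 1                        # O(log p) steps in total
    return mem, steps

# On a CREW PRAM the same broadcast is a single step: all p processors
# concurrently read mem[0].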
Shared Memory Model [Diagram: processors P0, P1, P2, P3, each with a local memory M, all connected to a shared global memory]
Shared Memory • What are some implications of the variants in memory access of the PRAM model? • What is the strongest model?
Fixed Connection Models • Each PU contains a Local Memory • Distributed memory • No shared memory • PUs are connected through some type of Interconnection Network • Interconnection network defines the model • Communication is via Message Passing • Can be synchronous or asynchronous
Interconnection Networks • Bus Network (Linear) • Ring • Mesh • Torus • Hypercube
Hypercube Model • Distributed memory, message passing, fixed connection parallel computer • N = 2^r nodes • E = r * 2^(r-1) edges • Nodes are numbered 0 to N-1 in binary such that any 2 nodes differing in exactly one bit are connected by an edge • Dimension is r
Hypercube Examples: N = 2, 4 [Diagrams: N = 2, dimension 1, nodes 0 and 1; N = 4, dimension 2, nodes 00, 01, 10, 11]
Hypercube Example: N = 8 [Diagram: N = 8, dimension 3, nodes 000 through 111]
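A small sketch (names are illustrative) that generates the hypercube connections by bit manipulation: node i and node i XOR 2^k are joined along dimension k, which reproduces the node and edge counts above.

def hypercube_neighbors(r):
    # Node i is connected to every node that differs from i in exactly one bit.
    n = 2 ** r
    return {i: [i ^ (1 << k) for k in range(r)] for i in range(n)}

# Example: r = 3 gives N = 8 nodes, each with r = 3 neighbors,
# and the edge count is N * r / 2 = r * 2**(r-1) = 12.
neighbors = hypercube_neighbors(3)
assert sum(len(v) for v in neighbors.values()) // 2 == 3 * 2 ** (3 - 1)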
Hypercube Considerations • Message Passing • Communication • Possible delays • Load Balancing • Each PU has the same work load • Data Distribution • Must follow the connections
Consider the Checkbook Problem • How should the data be distributed? • Often the initial distribution is disregarded • What about the combination of the subtotals? • Reduction proceeds dimension by dimension (see the sketch below) • O(log P) = r steps
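A minimal sketch of reduction by dimension, simulated sequentially in Python; the node numbering and bit-flip communication pattern follow the hypercube definition above, and the function name is an assumption.

def hypercube_reduce(subtotals):
    # subtotals[i] lives on hypercube node i (P assumed to be a power of two).
    # In step k, every node with bit k set sends its value across dimension k
    # to the neighbor with bit k cleared; after r = log2(P) steps node 0 holds
    # the grand total.
    vals = list(subtotals)
    P = len(vals)
    r = P.bit_length() - 1
    for k in range(r):
        for i in range(P):
            if i & (1 << k):
                vals[i ^ (1 << k)] += vals[i]
    return vals[0]

# Example: 8 subtotals combined in 3 steps.
assert hypercube_reduce([1, 2, 3, 4, 5, 6, 7, 8]) == 36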
Design Strategies • Paradigm: a general strategy used to aid in the development of the solution to a problem
Paradigms Extended from Sequential Use • Divide-and-Conquer • Always used in parallel algorithms • Divide data vs. divide code • Branch-and-Bound • Dynamic Programming
Paradigms Developed for Parallel Use • Deterministic coin tossing • Symmetry breaking • Accelerating cascades • Tree contraction • Euler tours • Linked list ranking • All Nearest Smaller Values (ANSV) • Parentheses matching
Divide-and-Conquer • Most basic parallel strategy • Used in virtually every parallel algorithm • Problem is divided into several sub-problems that can be solved independently; results of sub-problems are combined into the final solution • Example: Checkbook Problem
Dynamic Programming • A Divide & Conquer technique used when sub-problems are not independent; they share common sub-problems • Sub-problem solutions are stored in a table for use by other processes • Often used for optimization problems • Minimum or Maximum • Fibonacci Numbers
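A minimal illustration (not from the source) of the table idea, using Fibonacci numbers, one of the examples listed above:

def fib(n, table=None):
    # Dynamic programming: store each sub-problem's answer in a table so that
    # overlapping sub-problems (fib(n-1) and fib(n-2) share almost all of
    # their work) are each computed only once.
    if table is None:
        table = {0: 0, 1: 1}
    if n not in table:
        table[n] = fib(n - 1, table) + fib(n - 2, table)
    return table[n]

assert fib(10) == 55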
Branch-and-Bound • Breadth-first tree processing technique • Uses a bounding function that allows some branches of the tree to be pruned (i.e. eliminated) • Example: Game programming
Symmetry Breaking • Strategy that breaks a linked structure (e.g. a linked list) into disjoint pieces for processing • Deterministic Coin Tossing • Using a binary representation of the index, nonadjacent elements are selected for processing • Often used in Linked List Ranking algorithms
Accelerating Cascades • Applying 2 or more algorithms to a single problem • Change from one to the other based on the ratio of the problem size to the number of processors – the Threshold • This "fine tuning" sometimes allows for better performance
Tree Contraction (aka Contraction) • Nodes of a tree are removed; the removed information is combined with the remaining nodes • Multiple processors are assigned to independent nodes • The tree is reduced to a single node, which contains the solution • e.g. arithmetic expression evaluation; addition of a series of numbers
Euler Tour • Duplicate each edge of a tree or graph with an edge in the opposite direction to create a circuit • Allows the tree or graph to be processed as a linked list
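A sequential Python sketch of the standard Euler tour construction, assuming the tree is given as ordered adjacency lists; function and variable names are illustrative.

def euler_tour(adj, root=0):
    # adj[u] is the ordered list of neighbors of u in the tree.
    # Replace each tree edge {u, v} by two directed arcs (u, v) and (v, u);
    # the successor of arc (u, v) is (v, w), where w follows u in adj[v]
    # (cyclically). Following successors visits every arc once: an Euler circuit.
    pos = {(u, v): i for u in adj for i, v in enumerate(adj[u])}
    def succ(arc):
        u, v = arc
        i = pos[(v, u)]
        return (v, adj[v][(i + 1) % len(adj[v])])
    start = (root, adj[root][0])
    tour, arc = [start], succ(start)
    while arc != start:
        tour.append(arc)
        arc = succ(arc)
    return tour

# Example: the path 0 - 1 - 2 yields the circuit (0,1), (1,2), (2,1), (1,0).
assert euler_tour({0: [1], 1: [0, 2], 2: [1]}) == [(0, 1), (1, 2), (2, 1), (1, 0)]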
Linked List Ranking • Halverson's dissertation research area • Problem Definition: given a linked list of elements, number the elements in order (or in reverse order) • For lists of length 20+
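One classic strategy for this problem is pointer jumping; the dissertation studies several alternatives, so the sketch below (simulated sequentially in Python, with illustrative names) is only an illustration of the problem and of the O(log n)-round idea.

def list_rank(next_node):
    # next_node[i] is the index of i's successor, or None at the end of the list.
    # Pointer jumping: each element adds its successor's count to its own and
    # then jumps its pointer past that successor; O(log n) rounds, and each
    # round's updates can run in parallel across all elements on a PRAM.
    n = len(next_node)
    nxt = list(next_node)
    rank = [0 if nxt[i] is None else 1 for i in range(n)]   # distance to end
    changed = True
    while changed:
        changed = False
        new_nxt, new_rank = list(nxt), list(rank)
        for i in range(n):                 # this loop is the "parallel" step
            if nxt[i] is not None:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
                changed = True
        nxt, rank = new_nxt, new_rank
    return rank

# Example: the list 3 -> 1 -> 0 -> 2 stored by successor indices.
assert list_rank([2, 0, None, 1]) == [1, 2, 0, 3]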
Linked List Ranking • Applied to a wide range of problems: • Euler Tours • Tree Searches • List packing • Connectivity • Tree Traversals • Spanning Trees & Forests • Connected Components • Graph Decomposition
All Nearest Smaller Values • Given a sequence of values, for each value x find the nearest preceding value that is smaller than x • Successfully applied to • Depth first search of interval graphs • Parentheses matching • Line packing • Triangulating a monotone polygon
Parentheses Matching • In a properly formed string of parentheses, find the index of each parenthesis's mate • Applied to solve • Heights of all nodes in a tree • Extreme values in a tree • Lowest common ancestor • Balancing binary trees
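To make the problem statement concrete, here is a simple sequential stack-based matcher (an illustration only; the parallel algorithms referred to above solve the same problem without a global stack, and the names here are illustrative):

def match_parentheses(s):
    # For a properly formed string of parentheses, return mate[i] = index of
    # the parenthesis that matches position i.
    mate = [None] * len(s)
    stack = []
    for i, ch in enumerate(s):
        if ch == '(':
            stack.append(i)
        else:
            j = stack.pop()
            mate[i], mate[j] = j, i
    return mate

assert match_parentheses("(()())") == [5, 2, 1, 4, 3, 0]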