1 / 55

Parallel Processing

Parallel Processing. A Perspective Hardware and Software. Introduction to PP. From “Efficient Linked List Ranking Algorithms and Parentheses Matching as a New Strategy for Parallel Algorithm Design”, R. Halverson Chapter 1 – Introduction to Parallel Processing. Parallel Processing Research.

danyl
Download Presentation

Parallel Processing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Parallel Processing A Perspective Hardware and Software

  2. Introduction to PP • From “Efficient Linked List Ranking Algorithms and Parentheses Matching as a New Strategy for Parallel Algorithm Design”, R. Halverson • Chapter 1 – Introduction to Parallel Processing

  3. Parallel Processing Research • 1980’s – Great Deal of Research & Publications • 1990’s – Hardware not too successful so the research area “dies” – Why??? • Early 2000’s – Begins Resurgence? Why??? Will it continue to be successful this time ??? • 2010 – Multicore, Graphics processing units

  4. Goal of PP • Why bother with Parallel Processing? • Goal: Solve problems faster! • In reality, faster but efficient! • Work-Optimal: parallel algorithm runs faster than sequential algorithm in proportion to the number of processors used. • Sometimes work-optimal is not possible

  5. Performance Metrics • Running time • Speedup = T1 /Tp • Work = p * Tp • Work Optimal: Work = O(T1) • Linear speedup • Scalable: if algorithm performance increases linearly with the number of processors utilized

  6. PP Issues • Processors: number, connectivity, communication • Memory: shared vs. local • Data structures • Data Distribution • Problem Solving Strategies

  7. Parallel Problems • One Approach: Try to develop a parallel solution to a problem without consideration of the hardware. • Apply: Apply the solution to the specific hardware and determine the extra cost, if any • If not acceptably efficient, try again!

  8. Parallel Problems • Another Approach: Armed with the knowledge of strategies, data structures, etc. that work well for a particular hardware, develop a solution with a specific hardware in mind. • Third Approach: Modify a solution for one hardware configuration for another

  9. Real World Problems • Inherently Parallel – nature or structure of the problem lends itself to parallelism • Examples • Mowing a lawn • Cleaning a house • Grading papers • Problems are easily divided into sub-problems; very little overhead

  10. Real World Problems • Not Inherently Parallel – parallelism is possible but more complex to define or with (excessive) overhead cost • Examples • Balancing a checkbook • Giving a haircut • Wallpapering a room • Prefix sums of an array

  11. Some Computer Problems • Are these “inherently parallel” or not? • Processing customers’ monthly bills • Payroll checks • Building student grade reports from class grade sheets • Searching for an item in a linked list • A video game program • Searching a state driver’s license database • Is problem hw, sw, data? Assumptions?

  12. General Observations • What characteristics make a problem inherently parallel? • What characteristics make a problem difficult to parallelize? * Consider hw, sw, data structures.

  13. Payroll Problem • Consider 10 PUs (processing unit) with each employee’s information stored in a row of an array, A. • Label P0, P1,…P9 • A[100] – 0 to 99 For i = 0 to 9 Pi process A[i*10] to A[((i+1)*10)-1]

  14. Code for Payroll For i = 0 to 9 Pi process A[i*10] to A[((i+1)*10)-1] Each PU runs a process in parallel For each Pi , i = 0 to 9 do //separate process For j = 0 to 9 Process A[i*10 + j]

  15. Time Complexity of Payroll Algorithm • Consider P processors • Consider N data items • Each PC has N/P data items • Assume data is accessible & writeable to each PC • Time for each P: O(N/P) • Work = P * O(N/P) = O(N) • Overhead??

  16. Payroll Questions?? • Now we have a solution, must be applied to hardware. Which hardware? • Main question: Where is the array and how is it accessed by each processor? • One shared memory or many local memories? • Where are the results placed?

  17. What about I/O?? • Generally, in parallel algorithms, I/O is disregarded. • Assumption: Data is stored in the available memory. • Assumption: Results are written back to memory. • Data input and output are generallyindependent of the processing algorithm.

  18. Balancing a Checkbook • Consider same hardware & data array • Can still distribute and process in the same manner as the payroll • Each block computes deposits as addition & checks a subtraction; totals the page (10 totals) • BUT then must combine the 10 totals to the final total • This is the overhead!

  19. Complexity of Checkbook • Consider P processors • Consider N data items • Each PC has N/P data items • Assume data is accessible & writeable • Time for each section: O(N/P) • Combination of P subtotals • Time for combining: O(P) to O(log P) • Depends on strategy used • Total: O(N/P + P) to O(N/P + log P)

  20. Performance Complexity - Perfect Parallel Algorithm • If the best sequential algorithm for a problem is O(f(x)) then the parallel algorithm would be O(f(x)/P) • This happens if little or no overhead Actual Run Time • Typically, takes 4 processors to achieve ½ the actual run time

  21. Performance Measures • Run Time: not a practical measurement Assume T1 & Tp are run times using 1 & p processors, respectively • Speedup: S = T1/Tp • Work: W = p * Tp (aka Cost) • If W = O(T1) the it is Work (Cost) Optimal & achieves Linear Speedup

  22. Scalability • An algorithm is said to be Scalable if performance increases linearly with the number of processors • i.e. double the processors, cut time in half • Implication: Algorithm sustains good performance over a wide range of processors.

  23. Scalability • What about continuing to add processors? • At what point does adding more processors stop improving the run time? • Does adding processors ever cause the algorithm to take more time? • What is the optimal number of processors? • Consider W = p * Tp = O(T1) • Solve for p • Want W to be optimal for different numbers of processors

  24. Models of Computation • Two major categories • Shared memory • e.g. PRAM • Fixed connection • e.g. Hypercube • There are numerous versions of each • Not all are totally realizable in hw

  25. Sidenote: Models • Distributed Computing • Use of 2 or more separate computers used to solve a single problem • A version of a network • Clusters • Not really a topic for this course

  26. Shared Memory Model • PRAM – parallel random access machine • A category with 4 variants • EREW-CREW-ERCW-CRCW • All communication through a shared global memory • Each PC has a small local memory

  27. Variants of PRAM • EREW-CREW-ERCW-CRCW • Concurrent read: 2 or more processors may read the same (or different) memory location simultaneously • Exclusive read: 2 or more processors may access global memory location only if each is accessing a unique address • Similarly defined for write

  28. Shared Memory Model M M M M P0 P1 P2 P3 Shared Global Memory

  29. Shared Memory • What are some implications of the variants in memory access of the PRAM model? • What is the strongest model?

  30. Fixed Connection Models • Each PU contains a Local Memory • Distributed memory • No shared memory • PUs are connected through some type of Interconnection Network • Interconnection network defines the model • Communication is via Message Passing • Can be synchronous or asynchronous

  31. Interconnection Networks • Bus Network (Linear) • Ring • Mesh • Torus • Hypercube

  32. Hypercube Model • Distributed memory, message passing, fixed connection, parallel computer • N = 2r number of nodes • E = r 2r-1 number of edges • Nodes are numbered 0 – N in binary such that any 2 nodes differing in one bit are connected by an edge • Dimension is r

  33. Hypercube ExamplesN = 2, 4 10 11 0 1 00 01 N = 2 Dimension = 1 N = 4 Dimension = 2

  34. Hypercube ExampleN = 8 111 110 N = 8 Dimension = 3 010 011 100 101 001 000

  35. Hypercube Considerations • Message Passing • Communication • Possible Delays • Load Balancing • Each PC has same work load • Data Distribution • Must follow connections

  36. Consider Checkbook Problem • How about distribution of data? • Often initial distribution is disregarded • What about the combination of the subtotals? • Reduction is by dimension • O(log P) = r

  37. Design Strategies • Paradigm: a general strategy used to aid in the development of the solution to a problem

  38. ParadigmsExtended from Sequential Use • Extended from sequential use • Divide-and-Conquer • Always used in parallel algorithms • Divide data vs. Divide code • Branch-and-Bound • Dynamic Programming

  39. ParadigmsDeveloped for Parallel Use • Deterministic coin tossing • Symmetry breaking • Accelerating cascades • Tree contraction • Euler Tours • Linked List Ranking • All Nearest Smaller Values (ANSV) • Parentheses Matching

  40. Divide-and-Conquer • Most basic parallel strategy • Used in virtually every parallel algorithm • Problem is divided into several sub-problems that can be solved independently; results of sub-problems are combined into the final solution • Example: Checkbook Problem

  41. Dynamic Programming • Divide & Conquer technique used when sub-problems are not independent; share common sub-problems • Sub-problem solutions are stored in table for use by other processes • Often used for optimization problems • Minimum or Maximum • Fibonacci Numbers

  42. Branch-and-Bound • Breadth-first tree processing technique • Uses a bounding function that allows some branches of the tree to be pruned (i.e. eliminated) • Example: Game programming

  43. Symmetry Breaking • Strategy that breaks a linked structure (e.g. linked list) into disjoint pieces for processing • Deterministic Coin Tossing • Using a binary representation of index, nonadjacent elements are selected for processing • Often used in Linked List Ranking Algorithms

  44. Accelerated Cascades • Applying 2 or more algorithms to a single problem, • Change from one to another based on the ratio of the problem size to the number of processors – Threshold • This “fine tuning” sometimes allows for better performance

  45. Tree Contraction (aka Contraction) • Nodes of a tree are removed; information removed is combined with remaining nodes • Multiple processors are assigned to independent nodes • Tree is reduced to a single node which contains the solution • E.G. Arithmetic Expression computation; Addition of a series of numbers

  46. Euler Tour • Create duplicate nodes in a tree or graph with edge in opposite direction to create a circuit • Allows tree or graph to be processed as a linked list

  47. Linked List Ranking • Halverson’s area of dissertation research • ProblemDefinition: • Given a linked list of elements, number the elements in order (or in reverse order) • For list of length 20+

  48. Linked List Ranking • Euler Tours • Tree Searches • List packing • Connectivity • Tree Traversals • Spanning Trees & Forests • Connected Components • Graph Decomposition Applied to a wide range of problems

  49. All Nearest Smaller Values • Given a sequence of values, for each value x, which predecessor elements are smaller than x • Successfully applied to • Depth first search of interval graph • Parentheses matching • Line Packing • Triangulating a monotone polynomial

  50. Parentheses Matching • In a properly formed string of parentheses, find the index of each parentheses mate • Applied to solve • Heights of all nodes in a tree • Extreme values in a tree • Lowest common ancestor • Balancing binary trees

More Related