1 / 147

Day 2

Day 2. Agenda. Parallelism basics Parallel machines Parallelism again High Throughput Computing Finding the right grain size. One thing to remember. Easy. Hard. Seeking Concurrency. Data dependence graphs Data parallelism Functional parallelism Pipelining. Data Dependence Graph.

havyn
Download Presentation

Day 2

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Day 2

  2. Agenda • Parallelism basics • Parallel machines • Parallelism again • High Throughput Computing • Finding the right grain size

  3. One thing to remember Easy Hard

  4. Seeking Concurrency • Data dependence graphs • Data parallelism • Functional parallelism • Pipelining

  5. Data Dependence Graph • Directed graph • Vertices = tasks • Edges = dependences

  6. Data Parallelism • Independent tasks apply same operation to different elements of a data set • Okay to perform operations concurrently for i  0 to 99 do a[i]  b[i] + c[i] endfor

  7. Functional Parallelism • Independent tasks apply different operations to different data elements • First and second statements • Third and fourth statements a  2 b  3 m  (a + b) / 2 s  (a2 + b2) / 2 v  s - m2

  8. Pipelining • Divide a process into stages • Produce several items simultaneously

  9. Data Clustering • Data mining = looking for meaningful patterns in large data sets • Data clustering = organizing a data set into clusters of “similar” items • Data clustering can speed retrieval of related items

  10. Document Vectors Moon The Geology of Moon Rocks The Story of Apollo 11 A Biography of Jules Verne Alice in Wonderland Rocket

  11. Document Clustering

  12. Clustering Algorithm • Compute document vectors • Choose initial cluster centers • Repeat • Compute performance function • Adjust centers • Until function value converges or max iterations have elapsed • Output cluster centers

  13. Data Parallelism Opportunities • Operation being applied to a data set • Examples • Generating document vectors • Finding closest center to each vector • Picking initial values of cluster centers

  14. Functional Parallelism Opportunities • Draw data dependence diagram • Look for sets of nodes such that there are no paths from one node to another

  15. Data Dependence Diagram Build document vectors Choose cluster centers Compute function value Adjust cluster centers Output cluster centers

  16. Programming Parallel Computers • Extend compilers: translate sequential programs into parallel programs • Extend languages: add parallel operations • Add parallel language layer on top of sequential language • Define totally new parallel language and compiler system

  17. Strategy 1: Extend Compilers • Parallelizing compiler • Detect parallelism in sequential program • Produce parallel executable program • Focus on making Fortran programs parallel

  18. Extend Compilers (cont.) • Advantages • Can leverage millions of lines of existing serial programs • Saves time and labor • Requires no retraining of programmers • Sequential programming easier than parallel programming

  19. Extend Compilers (cont.) • Disadvantages • Parallelism may be irretrievably lost when programs written in sequential languages • Performance of parallelizing compilers on broad range of applications still up in air

  20. Extend Language • Add functions to a sequential language • Create and terminate processes • Synchronize processes • Allow processes to communicate

  21. Extend Language (cont.) • Advantages • Easiest, quickest, and least expensive • Allows existing compiler technology to be leveraged • New libraries can be ready soon after new parallel computers are available

  22. Extend Language (cont.) • Disadvantages • Lack of compiler support to catch errors • Easy to write programs that are difficult to debug

  23. Add a Parallel Programming Layer • Lower layer • Core of computation • Process manipulates its portion of data to produce its portion of result • Upper layer • Creation and synchronization of processes • Partitioning of data among processes • A few research prototypes have been built based on these principles

  24. Create a Parallel Language • Develop a parallel language “from scratch” • occam is an example • Add parallel constructs to an existing language • Fortran 90 • High Performance Fortran • C*

  25. New Parallel Languages (cont.) • Advantages • Allows programmer to communicate parallelism to compiler • Improves probability that executable will achieve high performance • Disadvantages • Requires development of new compilers • New languages may not become standards • Programmer resistance

  26. Current Status • Low-level approach is most popular • Augment existing language with low-level parallel constructs • MPI and OpenMP are examples • Advantages of low-level approach • Efficiency • Portability • Disadvantage: More difficult to program and debug

  27. Architectures • Interconnection networks • Processor arrays (SIMD/data parallel) • Multiprocessors (shared memory) • Multicomputers (distributed memory) • Flynn’s taxonomy

  28. Interconnection Networks • Uses of interconnection networks • Connect processors to shared memory • Connect processors to each other • Interconnection media types • Shared medium • Switched medium

  29. Shared versus Switched Media

  30. Shared Medium • Allows only message at a time • Messages are broadcast • Each processor “listens” to every message • Arbitration is decentralized • Collisions require resending of messages • Ethernet is an example

  31. Switched Medium • Supports point-to-point messages between pairs of processors • Each processor has its own path to switch • Advantages over shared media • Allows multiple messages to be sent simultaneously • Allows scaling of network to accommodate increase in processors

  32. Switch Network Topologies • View switched network as a graph • Vertices = processors or switches • Edges = communication paths • Two kinds of topologies • Direct • Indirect

  33. Direct Topology • Ratio of switch nodes to processor nodes is 1:1 • Every switch node is connected to • 1 processor node • At least 1 other switch node

  34. Indirect Topology • Ratio of switch nodes to processor nodes is greater than 1:1 • Some switches simply connect other switches

  35. Evaluating Switch Topologies • Diameter • Bisection width • Number of edges / node • Constant edge length? (yes/no)

  36. 2-D Mesh Network • Direct topology • Switches arranged into a 2-D lattice • Communication allowed only between neighboring switches • Variants allow wraparound connections between switches on edge of mesh

  37. 2-D Meshes

  38. Vector Computers • Vector computer: instruction set includes operations on vectors as well as scalars • Two ways to implement vector computers • Pipelined vector processor: streams data through pipelined arithmetic units • Processor array: many identical, synchronized arithmetic processing elements

  39. Why Processor Arrays? • Historically, high cost of a control unit • Scientific applications have data parallelism

  40. Processor Array

  41. Data/instruction Storage • Front end computer • Program • Data manipulated sequentially • Processor array • Data manipulated in parallel

  42. Processor Array Performance • Performance: work done per time unit • Performance of processor array • Speed of processing elements • Utilization of processing elements

  43. Performance Example 1 • 1024 processors • Each adds a pair of integers in 1 sec • What is performance when adding two 1024-element vectors (one per processor)?

  44. Performance Example 2 • 512 processors • Each adds two integers in 1 sec • Performance adding two vectors of length 600?

  45. 2-D Processor Interconnection Network Each VLSI chip has 16 processing elements

  46. if (COND) then A else B

  47. if (COND) then A else B

  48. if (COND) then A else B

  49. Processor Array Shortcomings • Not all problems are data-parallel • Speed drops for conditionally executed code • Don’t adapt to multiple users well • Do not scale down well to “starter” systems • Rely on custom VLSI for processors • Expense of control units has dropped

  50. Multicomputer, aka Distributed Memory Machines • Distributed memory multiple-CPU computer • Same address on different processors refers to different physical memory locations • Processors interact through message passing • Commercial multicomputers • Commodity clusters

More Related