Parallel Architectures & Performance Analysis Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Parallel Computers • Parallel computer: multiple-processor system supporting parallel programming. • Three principal types of architecture • Vector computers, in particular processor arrays • Shared memory multiprocessors • Specially designed and manufactured systems • Distributed memory multicomputers • Message passing systems readily formed from a cluster of workstations Parallel Architectures and Performance Analysis – Slide 2
Type 1: Vector Computers • Vector computer: instruction set includes operations on vectors as well as scalars • Two ways to implement vector computers • Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units • Processor array: many identical, synchronized arithmetic processing elements Parallel Architectures and Performance Analysis – Slide 3
Type 2: Shared Memory Multiprocessor Systems • Natural way to extend single processor model • Have multiple processors connected to multiple memory modules such that each processor can access any memory module • So-called shared memory configuration: Parallel Architectures and Performance Analysis – Slide 4
Ex: Quad Pentium Shared Memory Multiprocessor Parallel Architectures and Performance Analysis – Slide 5
Fundamental Types of Shared Memory Multiprocessor • Type 2: Distributed Multiprocessor • Distribute primary memory among processors • Increase aggregate memory bandwidth and lower average memory access time • Allow greater number of processors • Also called non-uniform memory access (NUMA) multiprocessor Parallel Architectures and Performance Analysis – Slide 6
Distributed Multiprocessor Parallel Architectures and Performance Analysis – Slide 7
Type 3: Message-Passing Multicomputers • Complete computers connected through an interconnection network Parallel Architectures and Performance Analysis – Slide 8
Multicomputers • Distributed memory multiple-CPU computer • Same address on different processors refers to different physical memory locations • Processors interact through message passing • Commercial multicomputers • Commodity clusters Parallel Architectures and Performance Analysis – Slide 9
Asymmetrical Multicomputer Parallel Architectures and Performance Analysis – Slide 10
Symmetrical Multicomputer Parallel Architectures and Performance Analysis – Slide 11
ParPar Cluster: A Mixed Model Parallel Architectures and Performance Analysis – Slide 12
Alternate System: Flynn’s Taxonomy • Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams. • Also important are number of processors, number of programs which can be executed, and the memory structure. Parallel Architectures and Performance Analysis – Slide 13
Flynn’s Taxonomy: SISD (cont.) [Diagram: a single control unit sends control signals to one arithmetic processor; a single instruction stream and a single data stream flow between memory and the processor, with results returned to memory.] Parallel Architectures and Performance Analysis – Slide 14
Flynn’s Taxonomy: SIMD (cont.) [Diagram: one control unit broadcasts a single control signal to processing elements PE 1 through PE n, each of which operates on its own data stream (Data Stream 1 through Data Stream n).] Parallel Architectures and Performance Analysis – Slide 15
Flynn’s Taxonomy: MISD (cont.) [Diagram: control units 1 through n each issue their own instruction stream to processing elements 1 through n, all of which operate on a single shared data stream.] Parallel Architectures and Performance Analysis – Slide 16
MISD Architectures (cont.) • Serial execution of two processes with 4 stages each: time to execute T = 8t, where t is the time to execute one stage. • Pipelined execution of the same two processes: T = 5t. Parallel Architectures and Performance Analysis – Slide 17
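The timing on this slide follows directly from counting stages. A minimal sketch (plain Python; the function names are ours, not from the slides) reproduces the 8t versus 5t figures:

```python
def serial_time(n_processes, n_stages, t=1.0):
    """Serial execution: each process finishes all its stages before the next starts."""
    return n_processes * n_stages * t

def pipelined_time(n_processes, n_stages, t=1.0):
    """Pipelined execution: once the pipeline fills, one stage completes per step."""
    return (n_stages + n_processes - 1) * t

# Two processes of 4 stages each, as on the slide.
print(serial_time(2, 4))     # 8.0  (= 8t)
print(pipelined_time(2, 4))  # 5.0  (= 5t)
```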
Flynn’s Taxonomy: MIMD (cont.) [Diagram: control units 1 through n each issue their own instruction stream to processing elements 1 through n, and each processing element operates on its own data stream.] Parallel Architectures and Performance Analysis – Slide 18
Two MIMD Structures: MPMD • Multiple Program Multiple Data (MPMD) structure • Within the MIMD classification, which is our main concern, each processor has its own program to execute. Parallel Architectures and Performance Analysis – Slide 19
Two MIMD Structures: SPMD • Single Program Multiple Data (SPMD) structure • A single source program is written, and each processor executes its own copy of that program, independently and not in lockstep. • The source program can be constructed so that parts of the program are executed by certain computers and not others, depending on the identity of the computer. • Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware. A minimal example of this rank-based branching follows. Parallel Architectures and Performance Analysis – Slide 20
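To make the SPMD idea concrete, here is a minimal sketch assuming mpi4py is available (the file name, messages, and work division are ours): every process runs the same source, and the branch taken depends only on the process's rank.

```python
# Run with, e.g.: mpiexec -n 4 python spmd_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # identity of this copy of the program
size = comm.Get_size()   # total number of processes

if rank == 0:
    # Only the designated root process executes this branch.
    print(f"Root coordinating {size} processes")
else:
    # All other processes execute this branch instead.
    print(f"Worker {rank} doing its share of the computation")
```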
Topic 1 Summary • Architectures • Vector computers • Shared memory multiprocessors: tightly coupled • Centralized/symmetrical multiprocessor (SMP): UMA • Distributed multiprocessor: NUMA • Distributed memory/message-passing multicomputers: loosely coupled • Asymmetrical vs. symmetrical • Flynn’s Taxonomy • SISD, SIMD, MISD, MIMD (MPMD, SPMD) Parallel Architectures and Performance Analysis – Slide 21
Topic 2: Performance Measures and Analysis • A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input. • The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements. Parallel Architectures and Performance Analysis – Slide 22
Speedup Factor • The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel. • The speedup factor of a parallel computation utilizing p processors is defined as the ratio S(p) = t_s / t_p, where t_s is the execution time using a single processor and t_p is the execution time using p processors. • In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time. Parallel Architectures and Performance Analysis – Slide 23
Speedup Factor (cont.) • Speedup factor can also be cast in terms of computational steps: S(p) = (number of computational steps using one processor) / (number of parallel computational steps using p processors). • Maximum speedup is (usually) p with p processors (linear speedup). Parallel Architectures and Performance Analysis – Slide 24
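As a quick illustrative sketch (the timings below are invented for the example), speedup is just this ratio applied to measured times or counted steps:

```python
def speedup(t_serial, t_parallel):
    """S(p) = sequential time / parallel time (or serial steps / parallel steps)."""
    return t_serial / t_parallel

# Hypothetical measurements: 120 s on one processor, 18 s on 8 processors.
print(speedup(120.0, 18.0))   # about 6.67, below the linear bound of 8
```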
Execution Time Components • Given a problem of size n on p processors let • σ(n) = time spent on inherently sequential computations • φ(n) = time spent on potentially parallel computations • κ(n,p) = time spent on communication operations • Then the speedup is bounded by S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)). Parallel Architectures and Performance Analysis – Slide 25
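A small sketch turns this bound into a function (the symbol names follow the slide; the workload values below are invented for illustration):

```python
def speedup_bound(sigma, phi, kappa, p):
    """Upper bound on speedup: S(p) <= (sigma + phi) / (sigma + phi/p + kappa).

    sigma: inherently sequential work, phi: parallelizable work,
    kappa: communication cost with p processors (all in the same time units).
    """
    return (sigma + phi) / (sigma + phi / p + kappa)

# Hypothetical workload: 5 units sequential, 95 units parallelizable,
# 2 units of communication on 10 processors.
print(speedup_bound(5.0, 95.0, 2.0, 10))   # about 6.0
```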
Speedup Plot [Figure: speedup versus number of processors; the curve rises as processors are added but then “elbows out”, leveling off.] Parallel Architectures and Performance Analysis – Slide 26
Efficiency • The efficiency of a parallel computation is defined as the ratio between the speedup factor and the number of processing elements in a parallel system: E = S(p)/p. • Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation. Parallel Architectures and Performance Analysis – Slide 27
Analysis of Efficiency • Since E = S(p)/p, by what we did earlier E ≤ (σ(n) + φ(n)) / (p·σ(n) + φ(n) + p·κ(n,p)). • Since all terms are positive, E > 0. • Furthermore, since the denominator is larger than the numerator, E < 1. Parallel Architectures and Performance Analysis – Slide 28
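Continuing the sketch above (same assumed symbol names and invented workload), efficiency is the speedup bound divided by p:

```python
def efficiency_bound(sigma, phi, kappa, p):
    """E = S(p)/p <= (sigma + phi) / (p*sigma + phi + p*kappa)."""
    return (sigma + phi) / (p * sigma + phi + p * kappa)

# Same hypothetical workload as before, on 10 processors.
print(efficiency_bound(5.0, 95.0, 2.0, 10))   # about 0.6, i.e. 60% utilization
```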
Maximum Speedup: Amdahl’s Law Parallel Architectures and Performance Analysis – Slide 29
Amdahl’s Law (cont.) • As before S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p), since the communication time must be non-trivial. • Let f = σ(n) / (σ(n) + φ(n)) represent the inherently sequential portion of the computation; then S(p) ≤ 1 / (f + (1 − f)/p). Parallel Architectures and Performance Analysis – Slide 30
Amdahl’s Law (cont.) • Limitations • Ignores communication time • Overestimates speedup achievable • Amdahl Effect • Typically κ(n,p) has lower complexity than φ(n)/p • So as n increases, φ(n)/p dominates κ(n,p) • Thus as n increases, speedup increases (illustrated numerically below) Parallel Architectures and Performance Analysis – Slide 31
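A short sketch (all numbers invented for illustration) shows both the classical Amdahl bound and the Amdahl effect, where growing the problem size n raises the achievable speedup for a fixed p:

```python
def amdahl(f, p):
    """Amdahl's Law: S(p) <= 1 / (f + (1 - f)/p) for sequential fraction f."""
    return 1.0 / (f + (1.0 - f) / p)

# Classical bound: with 10% sequential work, 16 processors give at most 6.4x.
print(amdahl(0.10, 16))

# Amdahl effect: suppose sigma(n) = n and phi(n) = n**2, so the sequential
# fraction f = n / (n + n**2) shrinks as n grows. Speedup on p = 16
# processors improves with problem size.
for n in (10, 100, 1000):
    f = n / (n + n**2)
    print(n, round(amdahl(f, 16), 2))   # roughly 6.8, 13.9, 15.8
```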
Gustafson-Barsis’ Law • As before S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p). • Let s = σ(n) / (σ(n) + φ(n)/p) represent the fraction of time spent in the parallel computation performing inherently sequential operations. Parallel Architectures and Performance Analysis – Slide 32
Gustafson-Barsis’ Law (cont.) • Then σ(n) = s·(σ(n) + φ(n)/p) and φ(n) = (1 − s)·p·(σ(n) + φ(n)/p), so S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p) = s + (1 − s)·p = p + (1 − p)·s. Parallel Architectures and Performance Analysis – Slide 33
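A corresponding sketch (values invented for the example): scaled speedup computed from the sequential fraction s observed on the parallel machine.

```python
def gustafson_barsis(s, p):
    """Scaled speedup: S(p) <= p + (1 - p) * s, where s is the fraction of
    the parallel execution time spent on inherently sequential work."""
    return p + (1 - p) * s

# Hypothetical: 5% of the parallel run is sequential, on 64 processors.
print(gustafson_barsis(0.05, 64))   # 60.85 scaled speedup
```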
Gustafson-Barsis’ Law (cont.) • Begin with parallel execution time instead of sequential time • Estimate sequential execution time to solve same problem • Problem size is an increasing function of p • Predicts scaled speedup Parallel Architectures and Performance Analysis – Slide 34
Limitations • Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time • Both overestimate the speedup or scaled speedup achievable • [Pictured: Gene Amdahl and John L. Gustafson] Parallel Architectures and Performance Analysis – Slide 35
Topic 2 Summary • Performance terms: speedup, efficiency • Model of speedup: serial, parallel and communication components • What prevents linear speedup? • Serial and communication operations • Process start-up • Imbalanced workloads • Architectural limitations • Analyzing parallel performance • Amdahl’s Law • Gustafson-Barsis’ Law Parallel Architectures and Performance Analysis – Slide 36
End Credits • Based on original material from • The University of Akron: Tim O’Neil, Kathy Liszka • Hiram College: Irena Lomonosov • The University of North Carolina at Charlotte • Barry Wilkinson, Michael Allen • Oregon State University: Michael Quinn • Revision history: last updated 7/28/2011. Parallel Architectures and Performance Analysis – Slide 37