Parallel Architectures & Performance Analysis Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Parallel Computers • Parallel computer: multiple-processor system supporting parallel programming. • Three principal types of architecture • Vector computers, in particular processor arrays • Shared memory multiprocessors • Specially designed and manufactured systems • Distributed memory multicomputers • Message passing systems readily formed from a cluster of workstations Parallel Architectures and Performance Analysis – Slide 2
Type 1: Vector Computers • Vector computer: instruction set includes operations on vectors as well as scalars • Two ways to implement vector computers • Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units • Processor array: many identical, synchronized arithmetic processing elements Parallel Architectures and Performance Analysis – Slide 3
Type 2: Shared Memory Multiprocessor Systems • Natural way to extend single processor model • Have multiple processors connected to multiple memory modules such that each processor can access any memory module • So-called shared memory configuration: Parallel Architectures and Performance Analysis – Slide 4
Ex: Quad Pentium Shared Memory Multiprocessor Parallel Architectures and Performance Analysis – Slide 5
Fundamental Types of Shared Memory Multiprocessor • Type 2: Distributed Multiprocessor • Distribute primary memory among processors • Increase aggregate memory bandwidth and lower average memory access time • Allow greater number of processors • Also called non-uniform memory access (NUMA) multiprocessor Parallel Architectures and Performance Analysis – Slide 6
Distributed Multiprocessor Parallel Architectures and Performance Analysis – Slide 7
Type 3: Message-Passing Multicomputers • Complete computers connected through an interconnection network Parallel Architectures and Performance Analysis – Slide 8
Multicomputers • Distributed memory multiple-CPU computer • Same address on different processors refers to different physical memory locations • Processors interact through message passing • Commercial multicomputers • Commodity clusters Parallel Architectures and Performance Analysis – Slide 9
Asymmetrical Multicomputer Parallel Architectures and Performance Analysis – Slide 10
Symmetrical Multicomputer Parallel Architectures and Performance Analysis – Slide 11
ParPar Cluster: A Mixed Model Parallel Architectures and Performance Analysis – Slide 12
Alternate System: Flynn’s Taxonomy • Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams. • Also important are number of processors, number of programs which can be executed, and the memory structure. Parallel Architectures and Performance Analysis – Slide 13
Flynn’s Taxonomy: SISD (cont.) [Diagram: a single control unit sends control signals to one arithmetic processor; a single instruction stream and a single data stream flow between memory and the processor, with results returned to memory.] Parallel Architectures and Performance Analysis – Slide 14
Flynn’s Taxonomy: SIMD (cont.) [Diagram: one control unit broadcasts a single control signal to processing elements PE 1 through PE n, each of which operates on its own data stream (Data Stream 1 through Data Stream n).] Parallel Architectures and Performance Analysis – Slide 15
Flynn’s Taxonomy: MISD (cont.) [Diagram: control units 1 through n each issue their own instruction stream to processing elements 1 through n, all of which operate on a single shared data stream.] Parallel Architectures and Performance Analysis – Slide 16
MISD Architectures (cont.) • Serial execution of two processes with 4 stages each: time to execute T = 8t, where t is the time to execute one stage. • Pipelined execution of the same two processes: T = 5t. Parallel Architectures and Performance Analysis – Slide 17
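The timing on this slide follows directly from counting stages. A minimal sketch (plain Python; the function names are ours, not from the slides) reproduces the 8t versus 5t figures:

```python
def serial_time(n_processes, n_stages, t=1.0):
    """Serial execution: each process finishes all its stages before the next starts."""
    return n_processes * n_stages * t

def pipelined_time(n_processes, n_stages, t=1.0):
    """Pipelined execution: once the pipeline fills, one stage completes per step."""
    return (n_stages + n_processes - 1) * t

# Two processes of 4 stages each, as on the slide.
print(serial_time(2, 4))     # 8.0  (= 8t)
print(pipelined_time(2, 4))  # 5.0  (= 5t)
```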
Flynn’s Taxonomy: MIMD (cont.) [Diagram: control units 1 through n each issue their own instruction stream to processing elements 1 through n, and each processing element operates on its own data stream.] Parallel Architectures and Performance Analysis – Slide 18
Two MIMD Structures: MPMD • Multiple Program Multiple Data (MPMD) structure • Within the MIMD classification, which is our main concern, each processor has its own program to execute. Parallel Architectures and Performance Analysis – Slide 19
Two MIMD Structures: SPMD • Single Program Multiple Data (SPMD) structure • A single source program is written, and each processor executes its own copy of that program, independently and not in lockstep. • The source program can be constructed so that parts of the program are executed by certain computers and not others, depending on the identity of the computer. • Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware. A minimal example of this rank-based branching follows. Parallel Architectures and Performance Analysis – Slide 20
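To make the SPMD idea concrete, here is a minimal sketch assuming mpi4py is available (the file name, messages, and work division are ours): every process runs the same source, and the branch taken depends only on the process's rank.

```python
# Run with, e.g.: mpiexec -n 4 python spmd_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # identity of this copy of the program
size = comm.Get_size()   # total number of processes

if rank == 0:
    # Only the designated root process executes this branch.
    print(f"Root coordinating {size} processes")
else:
    # All other processes execute this branch instead.
    print(f"Worker {rank} doing its share of the computation")
```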
Topic 1 Summary • Architectures • Vector computers • Shared memory multiprocessors: tightly coupled • Centralized/symmetrical multiprocessor (SMP): UMA • Distributed multiprocessor: NUMA • Distributed memory/message-passing multicomputers: loosely coupled • Asymmetrical vs. symmetrical • Flynn’s Taxonomy • SISD, SIMD, MISD, MIMD (MPMD, SPMD) Parallel Architectures and Performance Analysis – Slide 21
Topic 2: Performance Measures and Analysis • A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input. • The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements. Parallel Architectures and Performance Analysis – Slide 22
Speedup Factor • The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel. • The speedup factor of a parallel computation utilizing p processors is defined as the ratio S(p) = t_s / t_p, where t_s is the execution time using a single processor and t_p is the execution time using p processors. • In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time. Parallel Architectures and Performance Analysis – Slide 23
Speedup Factor (cont.) • Speedup factor can also be cast in terms of computational steps: S(p) = (number of computational steps using one processor) / (number of parallel computational steps using p processors). • Maximum speedup is (usually) p with p processors (linear speedup). Parallel Architectures and Performance Analysis – Slide 24
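As a quick illustrative sketch (the timings below are invented for the example), speedup is just this ratio applied to measured times or counted steps:

```python
def speedup(t_serial, t_parallel):
    """S(p) = sequential time / parallel time (or serial steps / parallel steps)."""
    return t_serial / t_parallel

# Hypothetical measurements: 120 s on one processor, 18 s on 8 processors.
print(speedup(120.0, 18.0))   # about 6.67, below the linear bound of 8
```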
Execution Time Components • Given a problem of size n on p processors let • σ(n) = time spent on inherently sequential computations • φ(n) = time spent on potentially parallel computations • κ(n,p) = time spent on communication operations • Then the speedup is bounded by S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)). Parallel Architectures and Performance Analysis – Slide 25
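A small sketch turns this bound into a function (the symbol names follow the slide; the workload values below are invented for illustration):

```python
def speedup_bound(sigma, phi, kappa, p):
    """Upper bound on speedup: S(p) <= (sigma + phi) / (sigma + phi/p + kappa).

    sigma: inherently sequential work, phi: parallelizable work,
    kappa: communication cost with p processors (all in the same time units).
    """
    return (sigma + phi) / (sigma + phi / p + kappa)

# Hypothetical workload: 5 units sequential, 95 units parallelizable,
# 2 units of communication on 10 processors.
print(speedup_bound(5.0, 95.0, 2.0, 10))   # about 6.0
```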
Speedup Plot [Figure: speedup versus number of processors; the curve rises as processors are added but then “elbows out”, leveling off.] Parallel Architectures and Performance Analysis – Slide 26
Efficiency • The efficiency of a parallel computation is defined as the ratio between the speedup factor and the number of processing elements in a parallel system: E = S(p)/p. • Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation. Parallel Architectures and Performance Analysis – Slide 27
Analysis of Efficiency • Since E = S(p)/p, by what we did earlier E ≤ (σ(n) + φ(n)) / (p·σ(n) + φ(n) + p·κ(n,p)). • Since all terms are positive, E > 0. • Furthermore, since the denominator is larger than the numerator, E < 1. Parallel Architectures and Performance Analysis – Slide 28
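Continuing the sketch above (same assumed symbol names and invented workload), efficiency is the speedup bound divided by p:

```python
def efficiency_bound(sigma, phi, kappa, p):
    """E = S(p)/p <= (sigma + phi) / (p*sigma + phi + p*kappa)."""
    return (sigma + phi) / (p * sigma + phi + p * kappa)

# Same hypothetical workload as before, on 10 processors.
print(efficiency_bound(5.0, 95.0, 2.0, 10))   # about 0.6, i.e. 60% utilization
```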
Maximum Speedup: Amdahl’s Law Parallel Architectures and Performance Analysis – Slide 29
Amdahl’s Law (cont.) • As before S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p), since the communication time must be non-trivial. • Let f = σ(n) / (σ(n) + φ(n)) represent the inherently sequential portion of the computation; then S(p) ≤ 1 / (f + (1 − f)/p). Parallel Architectures and Performance Analysis – Slide 30
Amdahl’s Law (cont.) • Limitations • Ignores communication time • Overestimates speedup achievable • Amdahl Effect • Typically κ(n,p) has lower complexity than φ(n)/p • So as n increases, φ(n)/p dominates κ(n,p) • Thus as n increases, speedup increases (illustrated numerically below) Parallel Architectures and Performance Analysis – Slide 31
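A short sketch (all numbers invented for illustration) shows both the classical Amdahl bound and the Amdahl effect, where growing the problem size n raises the achievable speedup for a fixed p:

```python
def amdahl(f, p):
    """Amdahl's Law: S(p) <= 1 / (f + (1 - f)/p) for sequential fraction f."""
    return 1.0 / (f + (1.0 - f) / p)

# Classical bound: with 10% sequential work, 16 processors give at most 6.4x.
print(amdahl(0.10, 16))

# Amdahl effect: suppose sigma(n) = n and phi(n) = n**2, so the sequential
# fraction f = n / (n + n**2) shrinks as n grows. Speedup on p = 16
# processors improves with problem size.
for n in (10, 100, 1000):
    f = n / (n + n**2)
    print(n, round(amdahl(f, 16), 2))   # roughly 6.8, 13.9, 15.8
```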
Gustafson-Barsis’ Law • As before S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p). • Let s = σ(n) / (σ(n) + φ(n)/p) represent the fraction of time spent in the parallel computation performing inherently sequential operations. Parallel Architectures and Performance Analysis – Slide 32
Gustafson-Barsis’ Law (cont.) • Then σ(n) = s·(σ(n) + φ(n)/p) and φ(n) = (1 − s)·p·(σ(n) + φ(n)/p), so S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p) = s + (1 − s)·p = p + (1 − p)·s. Parallel Architectures and Performance Analysis – Slide 33
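A corresponding sketch (values invented for the example): scaled speedup computed from the sequential fraction s observed on the parallel machine.

```python
def gustafson_barsis(s, p):
    """Scaled speedup: S(p) <= p + (1 - p) * s, where s is the fraction of
    the parallel execution time spent on inherently sequential work."""
    return p + (1 - p) * s

# Hypothetical: 5% of the parallel run is sequential, on 64 processors.
print(gustafson_barsis(0.05, 64))   # 60.85 scaled speedup
```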
Gustafson-Barsis’ Law (cont.) • Begin with parallel execution time instead of sequential time • Estimate sequential execution time to solve same problem • Problem size is an increasing function of p • Predicts scaled speedup Parallel Architectures and Performance Analysis – Slide 34
Limitations • Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time • Both overestimate the speedup or scaled speedup achievable • [Pictured: Gene Amdahl and John L. Gustafson] Parallel Architectures and Performance Analysis – Slide 35
Topic 2 Summary • Performance terms: speedup, efficiency • Model of speedup: serial, parallel and communication components • What prevents linear speedup? • Serial and communication operations • Process start-up • Imbalanced workloads • Architectural limitations • Analyzing parallel performance • Amdahl’s Law • Gustafson-Barsis’ Law Parallel Architectures and Performance Analysis – Slide 36
End Credits • Based on original material from • The University of Akron: Tim O’Neil, Kathy Liszka • Hiram College: Irena Lomonosov • The University of North Carolina at Charlotte • Barry Wilkinson, Michael Allen • Oregon State University: Michael Quinn • Revision history: last updated 7/28/2011. Parallel Architectures and Performance Analysis – Slide 37