CUDA Lecture 3: Parallel Architectures and Performance Analysis. Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Topic 1: Parallel Architectures • Conventional Von Neumann architecture consists of a processor executing a program stored in a (main) memory: • Each main memory location is identified by its address. Addresses start at zero and extend to 2^n – 1 when there are n bits (binary digits) in the address. Parallel Architectures and Performance Analysis – Slide 2
Parallel Computers • Parallel computer: a multiple-processor system supporting parallel programming. • Three principal types of architecture • Vector computers, in particular processor arrays • Shared memory multiprocessors • Specially designed and manufactured systems • Distributed memory multicomputers • Message passing systems readily formed from a cluster of workstations Parallel Architectures and Performance Analysis – Slide 3
Type 1: Vector Computers • Vector computer: instruction set includes operations on vectors as well as scalars • Two ways to implement vector computers • Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units • Processor array: many identical, synchronized arithmetic processing elements Parallel Architectures and Performance Analysis – Slide 4
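As a rough illustration (plain C, not the instruction set of any particular vector machine), the loop below is the kind of element-wise array operation that a scalar processor executes one addition at a time but that a vector computer can express as a single vector instruction over whole arrays:

```c
#include <stddef.h>

/* Element-wise vector add: on a scalar (SISD) machine this is n separate
   add instructions; a vector computer can issue it as one vector
   instruction that operates on all n elements. */
void vector_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```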
Why Processor Arrays? • Historically, high cost of a control unit • Scientific applications have data parallelism Parallel Architectures and Performance Analysis – Slide 5
Data/Instruction Storage • Front end computer (standard uniprocessor) • Program • Data manipulated sequentially • Processor array (individual processor/memory pairs) • Data manipulated in parallel • Performance • Speed of processing elements • Utilization of processing elements • Size of data structure Parallel Architectures and Performance Analysis – Slide 6
2-D Processor Interconnection Network • Each VLSI chip has 16 processing elements Parallel Architectures and Performance Analysis – Slide 7
Processor Array Shortcomings • Not all problems are data parallel • Speed drops for conditionally executed code • Do not adapt to multiple users well • Do not scale down well to “starter” systems • Rely on custom VLSI for processors • Expense of control units has dropped Parallel Architectures and Performance Analysis – Slide 8
Type 2: Shared Memory Multiprocessor Systems • Natural way to extend single processor model • Have multiple processors connected to multiple memory modules such that each processor can access any memory module • So-called shared memory configuration: Parallel Architectures and Performance Analysis – Slide 9
Ex: Quad Pentium Shared Memory Multiprocessor Parallel Architectures and Performance Analysis – Slide 10
Shared Memory Multiprocessor Systems • Any memory location can be accessed by any of the processors. • A single address space exists, meaning that each memory location is given a unique address within a single range of addresses. • Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.). Parallel Architectures and Performance Analysis – Slide 11
Shared Memory Multiprocessor Systems (cont.) • Alternately known as a tightly coupled architecture. • No local memory associated with processors. • Avoid three problems of processor arrays • Can be built from commodity CPUs • Naturally support multiple users • Maintain efficiency in conditional code Parallel Architectures and Performance Analysis – Slide 12
Shared Memory Multiprocessor Systems (cont.) • Several alternatives for programming shared memory multiprocessors • Using threads (pthreads, Java, …) in which the programmer decomposes the program into individual parallel sequences, each being a thread, and each being able to access variables declared outside the threads. • Using a sequential programming language with user-level libraries to declare and access shared variables. Parallel Architectures and Performance Analysis – Slide 13
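A minimal pthreads sketch of the first alternative (the thread function `worker` and the array `shared_data` are illustrative names, not taken from the lecture): each thread reads and writes a variable declared outside the threads.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static int shared_data[NTHREADS];   /* visible to every thread */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    shared_data[id] = id * id;      /* each thread writes its own slot */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        printf("shared_data[%d] = %d\n", i, shared_data[i]);
    return 0;
}
```

On typical toolchains this is compiled with the -pthread flag.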
Shared Memory Multiprocessor Systems (cont.) • Several alternatives for programming shared memory multiprocessors • Using a sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. • Ex: OpenMP – the industry standard • An API for shared-memory systems • Supports higher performance parallel programming of symmetrical multiprocessors Parallel Architectures and Performance Analysis – Slide 14
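A minimal OpenMP sketch, assuming a compiler with OpenMP support (e.g. -fopenmp on GCC-style compilers); the arrays and the reduction variable are illustrative. The compiler directive asks the runtime to split the loop iterations across the available processors:

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];
    double sum = 0.0;

    /* Iterations are divided among threads; 'sum' is combined with a
       reduction so the shared variable is updated safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
        sum += c[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}
```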
Shared Memory Multiprocessor Systems (cont.) • Several alternatives for programming shared memory multiprocessors • Using a parallel programming language with syntax for parallelism, in which the compiler creates the appropriate executable code for each processor. • Using a sequential programming language and asking a parallelizing compiler to convert it into parallel executable code. • Neither of these is now common. Parallel Architectures and Performance Analysis – Slide 15
Fundamental Types of Shared Memory Multiprocessor • Type 1: Centralized Multiprocessor • Straightforward extension of uniprocessor • Add CPUs to bus • All processors share same primary memory • Memory access time same for all CPUs • An example of a uniform memory access (UMA) multiprocessor • Symmetrical multiprocessor (SMP) Parallel Architectures and Performance Analysis – Slide 16
Centralized Multiprocessor Parallel Architectures and Performance Analysis – Slide 17
Private and Shared Data • Private data: items used only by a single processor • Shared data: values used by multiple processors • In a centralized multiprocessor, processors communicate via shared data values • Problems associated with shared data • Cache coherence • Replicating data across multiple caches reduces contention • How to ensure different processors have same value for same address? • Synchronization • Mutual exclusion • Barriers Parallel Architectures and Performance Analysis – Slide 18
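A hedged sketch of mutual exclusion on shared data using a pthreads mutex (the variable names are illustrative): without the critical section the two threads could interleave their updates and lose increments.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* enter critical section */
        counter++;                             /* mutually exclusive update */
        pthread_mutex_unlock(&lock);           /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* always 200000 */
    return 0;
}
```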
Distributed Shared Memory • Making the main memory of a cluster of computers look as though it is a single memory with a single address space (via hidden message passing). • Then can use shared memory programming techniques. Parallel Architectures and Performance Analysis – Slide 19
Fundamental Types of Shared Memory Multiprocessor • Type 2: Distributed Multiprocessor • Distribute primary memory among processors • Increase aggregate memory bandwidth and lower average memory access time • Allow greater number of processors • Also called non-uniform memory access (NUMA) multiprocessor Parallel Architectures and Performance Analysis – Slide 20
Distributed Multiprocessor Parallel Architectures and Performance Analysis – Slide 21
Cache Coherence • Some NUMA multiprocessors do not support it in hardware • Only instructions, private data in cache • Large memory access time variance • Implementations more difficult • No shared memory bus to “snoop” • Directory-based protocol needed Parallel Architectures and Performance Analysis – Slide 22
Directory-Based Protocol • Distributed directory contains information about cacheable memory blocks • One directory entry for each cache block • Each entry has • Sharing status • Uncached: block not in any processor’s cache • Shared: cached by one or more processors; read only • Exclusive: cached by exactly one processor which has written block, so copy in memory obsolete • Which processors have copies Parallel Architectures and Performance Analysis – Slide 23
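A rough sketch, in C, of what a single directory entry might contain; the field and type names are assumptions for illustration, not the layout of any real directory protocol:

```c
#include <stdint.h>

/* Sharing status of one memory block, as described above. */
enum block_state {
    UNCACHED,    /* block is in no processor's cache           */
    SHARED,      /* one or more read-only cached copies        */
    EXCLUSIVE    /* exactly one written copy; memory is stale  */
};

/* One directory entry per cacheable block. */
struct dir_entry {
    enum block_state state;
    uint64_t sharers;        /* bit i set => processor i holds a copy */
};
```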
Type 3: Message-Passing Multicomputers • Complete computers connected through an interconnection network Parallel Architectures and Performance Analysis – Slide 24
Multicomputers • Distributed memory multiple-CPU computer • Same address on different processors refers to different physical memory locations • Processors interact through message passing • Commercial multicomputers • Commodity clusters Parallel Architectures and Performance Analysis – Slide 25
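A minimal message-passing sketch using MPI (assuming an MPI implementation such as Open MPI or MPICH is installed). Note that `value` on process 0 and `value` on process 1 occupy different physical memories, so the data can only move via the explicit send and receive:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Same variable name, different physical memory on each process:
           data moves only through messages. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes, e.g. `mpirun -np 2 ./program` (assuming the usual MPI launcher).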
Loosely Coupled Architectures • Alternate name for message-passing multicomputer systems. • Each processor has its own memory, accessible only to that processor. • A message-passing interconnection network provides point-to-point connections among processors. • Memory access is non-uniform: a processor reaches its own memory directly but must exchange messages to obtain data held by other processors. Parallel Architectures and Performance Analysis – Slide 26
Asymmetrical Multicomputer Parallel Architectures and Performance Analysis – Slide 27
Asymmetrical Multicomputer • Advantages: • Back-end processors dedicated to parallel computations • Easier to understand, model, tune performance • Only a simple back-end operating system needed • Easy for a vendor to create • Disadvantages: • Front-end computer is a single point of failure • Single front-end computer limits scalability of system • Primitive operating system in back-end processors makes debugging difficult • Every application requires development of both front-end and back-end programs Parallel Architectures and Performance Analysis – Slide 28
Symmetrical Multicomputer Parallel Architectures and Performance Analysis – Slide 29
Symmetrical Multicomputer • Advantages: • Alleviate performance bottleneck caused by single front-end computer • Better support for debugging • Every processor executes same program • Disadvantages: • More difficult to maintain illusion of single “parallel computer” • No simple way to balance program development workload among processors • More difficult to achieve high performance when multiple processes on each processor Parallel Architectures and Performance Analysis – Slide 30
ParPar Cluster: A Mixed Model Parallel Architectures and Performance Analysis – Slide 31
Alternate System: Flynn’s Taxonomy • Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams. • Also important are number of processors, number of programs which can be executed, and the memory structure. Parallel Architectures and Performance Analysis – Slide 32
Flynn’s Taxonomy: SISD • Single instruction stream, single data stream (SISD) computer • In a single processor computer, a single stream of instructions is generated from the program. The instructions operate upon a single stream of data items. • The single CPU executes one instruction at a time and fetches or stores one item of data at a time. Parallel Architectures and Performance Analysis – Slide 33
Flynn’s Taxonomy: SISD (cont.) [Diagram: memory supplies an instruction stream to the control unit and a data stream to the arithmetic processor; the control unit issues control signals, and results return to memory.] Parallel Architectures and Performance Analysis – Slide 34
Flynn’s Taxonomy: SIMD • Single instruction stream, multiple data stream (SIMD) computer • A specially designed computer in which a single instruction stream is from a single program, but multiple data streams exist. • The instructions from the program are broadcast to more than one processor. • Each processor executes the same instruction in synchronism, but using different data. • Developed because there are a number of important applications that mostly operate upon arrays of data. Parallel Architectures and Performance Analysis – Slide 35
Flynn’s Taxonomy: SIMD (cont.) [Diagram: a single control unit broadcasts one control signal to PE 1, PE 2, …, PE n; each PE operates on its own data stream (Data Stream 1 … Data Stream n).] Parallel Architectures and Performance Analysis – Slide 36
SIMD Architectures • Processing distributed over a large amount of hardware. • Operates concurrently on many different data elements. • Performs the same computation on all data elements. • Processors operate synchronously. • Examples: pipelined vector processors (e.g. Cray-1) and processor arrays (e.g. Connection Machine) Parallel Architectures and Performance Analysis – Slide 37
SISD vs. SIMD Execution [Diagram: executing the conditional “a = 0?”. On the SISD machine the single processor follows only the branch that applies. On the SIMD machine all PEs evaluate the test; first the PEs satisfying a = 0 execute while the others are idle, then the PEs satisfying a ≠ 0 execute while the rest are idle.] Parallel Architectures and Performance Analysis – Slide 38
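A rough scalar emulation, in C, of how the SIMD machine handles the conditional above (function and array names are illustrative): both branches are executed in turn, with the non-matching elements idle in each pass.

```c
#include <stddef.h>

/* Emulates SIMD execution of "if (a[i] == 0) b[i] = x; else b[i] = y;".
   Pass 1: only elements with a[i] == 0 are active, the rest are idle.
   Pass 2: only elements with a[i] != 0 are active.
   The SIMD machine pays for both branches, whereas an SISD machine
   executes only the branch that applies. */
void simd_conditional(const int *a, int *b, int x, int y, size_t n)
{
    for (size_t i = 0; i < n; i++)      /* pass 1: a == 0 "lanes" */
        if (a[i] == 0) b[i] = x;

    for (size_t i = 0; i < n; i++)      /* pass 2: a != 0 "lanes" */
        if (a[i] != 0) b[i] = y;
}
```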
Flynn’s Taxonomy: MISD • Multiple instruction stream, single data stream (MISD) computer • MISD machines may execute several different programs on the same data item. • There are two categories • Distinct processing units perform distinct instructions on the same data. Currently there is no such machine. • Pipelined architectures, where data flows through a series of processing elements. Parallel Architectures and Performance Analysis – Slide 39
Flynn’s Taxonomy: MISD (cont.) [Diagram: Control Unit 1 … Control Unit n each issue their own instruction stream to Processing Element 1 … Processing Element n, all of which operate on the same single data stream.] Parallel Architectures and Performance Analysis – Slide 40
MISD Architectures • A pipeline processor works according to the principle of pipelining. • A process can be broken down into several stages (segments). • While one stage is executing, another stage is being loaded; the input of one stage is the output of the previous stage. • The processor carries out many different computations concurrently. • Example: systolic array Parallel Architectures and Performance Analysis – Slide 41
MISD Architectures (cont.) • Serial execution of two processes with 4 stages each takes time T = 8t, where t is the time to execute one stage. • Pipelined execution of the same two processes takes T = 5t; in general, a pipeline of s stages completes p processes in (s + p – 1)t. Parallel Architectures and Performance Analysis – Slide 42
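A small check of the timings quoted above, assuming every stage takes the same time t:

```c
#include <stdio.h>

/* Serial execution: every process runs all of its stages back to back. */
static double serial_time(int processes, int stages, double t)
{
    return (double)processes * stages * t;
}

/* Pipelined execution: after the pipeline fills (stages * t),
   one process completes every t. */
static double pipelined_time(int processes, int stages, double t)
{
    return (double)(stages + processes - 1) * t;
}

int main(void)
{
    double t = 1.0;
    printf("serial:    %.0f t\n", serial_time(2, 4, t));    /* 8 t */
    printf("pipelined: %.0f t\n", pipelined_time(2, 4, t)); /* 5 t */
    return 0;
}
```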
Flynn’s Taxonomy: MIMD • Multiple instruction stream, multiple data stream (MIMD) computer • General purpose multiprocessor system. • Multiple processors, each with a separate (different) program operating on its own data. • One instruction stream is generated from each program for each processor. • Each instruction operates upon different data. • Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification. Parallel Architectures and Performance Analysis – Slide 43
Flynn’s Taxonomy: MIMD (cont.) [Diagram: Control Unit 1 … Control Unit n each issue a separate instruction stream to Processing Element 1 … Processing Element n, each of which operates on its own data stream (Data Stream 1 … Data Stream n).] Parallel Architectures and Performance Analysis – Slide 44
MIMD Architectures • Processing distributed over a number of processors operating independently and concurrently. • Resources (memory) shared among processors. • Each processor runs its own program. • MIMD systems execute operations in a parallel asynchronous fashion. Parallel Architectures and Performance Analysis – Slide 45
MIMD Architectures (cont.) • Differ with regard to • Interconnection networks • Memory addressing techniques • Synchronization • Control structures • A high throughput can be achieved if the processing can be broken into parallel streams keeping all the processors active concurrently. Parallel Architectures and Performance Analysis – Slide 46
Two MIMD Structures: MPMD • Multiple Program Multiple Data (MPMD) Structure • Within the MIMD classification, with which we are concerned, each processor has its own program to execute. Parallel Architectures and Performance Analysis – Slide 47
Two MIMD Structures: SPMD • Single Program Multiple Data (SPMD) Structure • Single source program is written and each processor will execute its personal copy of this program, although independently and not in synchronism. • The source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer. • Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware. Parallel Architectures and Performance Analysis – Slide 48
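A minimal SPMD sketch using MPI (an assumption for illustration; any mechanism that gives each copy of the program an identity would do): every processor runs the same source program, and the branch on the rank decides which part each one executes.

```c
#include <mpi.h>
#include <stdio.h>

/* Every processor runs this same program (SPMD); the branch on the
   processor's identity (rank) selects the work it performs. */
int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("rank 0 of %d: doing coordination work\n", size);
    else
        printf("rank %d of %d: doing computation work\n", rank, size);

    MPI_Finalize();
    return 0;
}
```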
SIMD vs. MIMD • SIMD needs less hardware (only one control unit); in MIMD each processor has its own control unit. • SIMD needs less memory than MIMD, since SIMD needs only one copy of the instructions, whereas in MIMD the program and operating system need to be stored at each processor. • SIMD has implicit synchronization of PEs. In contrast, explicit synchronization may be required in MIMD. Parallel Architectures and Performance Analysis – Slide 49
SIMD vs. MIMD (cont.) • MIMD allows different operations to be performed on different processing elements simultaneously (functional parallelism); SIMD is limited to data parallelism. • MIMD can use a general-purpose microprocessor as the processing unit, which may be cheaper and more powerful. Parallel Architectures and Performance Analysis – Slide 50