
CUDA Lecture 3 Parallel Architectures and Performance Analysis

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


Presentation Transcript


  1. CUDA Lecture 3: Parallel Architectures and Performance Analysis. Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

  2. Topic 1: Parallel Architectures • Conventional Von Neumann architecture consists of a processor executing a program stored in a (main) memory. • Each main memory location is located by its address. Addresses start at zero and extend to 2^n – 1 when there are n bits (binary digits) in the address. Parallel Architectures and Performance Analysis – Slide 2
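
As a small illustration of the address range stated above, the following sketch computes the valid addresses for an n-bit address. The function name `address_range` is made up for this example and does not come from the lecture.

```python
def address_range(n_bits: int) -> range:
    """Return the valid addresses for an n-bit address: 0 .. 2**n - 1."""
    if n_bits < 0:
        raise ValueError("n_bits must be non-negative")
    return range(0, 2 ** n_bits)


addresses = address_range(16)
print(addresses[0], addresses[-1])  # 0 65535
```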

  3. Parallel Computers • Parallel computer: multiple-processor system supporting parallel programming. • Three principal types of architecture • Vector computers, in particular processor arrays • Shared memory multiprocessors: specially designed and manufactured systems • Distributed memory multicomputers: message passing systems readily formed from a cluster of workstations

  4. Type 1: Vector Computers • Vector computer: instruction set includes operations on vectors as well as scalars • Two ways to implement vector computers • Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units • Processor array: many identical, synchronized arithmetic processing elements

  5. Why Processor Arrays? • Historically, high cost of a control unit • Scientific applications have data parallelism

  6. Data/Instruction Storage • Front end computer (standard uniprocessor) • Program • Data manipulated sequentially • Processor array (individual processor/memory pairs) • Data manipulated in parallel • Performance • Speed of processing elements • Utilization of processing elements • Size of data structure
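
The interaction between the number of processing elements and the size of the data structure can be sketched as follows. This is a simplified model, not taken from the slides: it assumes every processing element (PE) handles one element per step and that elements are spread evenly. The function names are hypothetical.

```python
import math


def processor_array_steps(n_elements: int, n_pes: int) -> int:
    """Steps needed when each PE handles one element per step (assumed model)."""
    return math.ceil(n_elements / n_pes)


def utilization(n_elements: int, n_pes: int) -> float:
    """Fraction of PE-steps that do useful work under the same assumption."""
    steps = processor_array_steps(n_elements, n_pes)
    return n_elements / (steps * n_pes)


# A data structure that exactly fills the array keeps every PE busy.
print(processor_array_steps(1024, 1024), utilization(1024, 1024))    # 1 1.0
# Slightly more data than PEs forces a second, mostly idle step.
print(processor_array_steps(10000, 8192), utilization(10000, 8192))  # 2 0.6103515625
```

Under this model, utilization drops sharply when the data size is just above a multiple of the PE count.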

  7. 2-D Processor Interconnection Network • Each VLSI chip has 16 processing elements

  8. Processor Array Shortcomings • Not all problems are data parallel • Speed drops for conditionally executed code • Do not adapt to multiple users well • Do not scale down well to “starter” systems • Rely on custom VLSI for processors • Expense of control units has dropped

  9. Type 2: Shared Memory Multiprocessor Systems • Natural way to extend single processor model • Have multiple processors connected to multiple memory modules such that each processor can access any memory module • So-called shared memory configuration.

  10. Ex: Quad Pentium Shared Memory Multiprocessor

  11. Shared Memory Multiprocessor Systems • Any memory location is accessible by any of the processors. • A single address space exists, meaning that each memory location is given a unique address within a single range of addresses. • Generally, shared memory programming is more convenient, although it does require access to shared data to be controlled by the programmer (using critical sections, etc.).
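
A minimal sketch of programmer-controlled access to shared data, using Python threads as a stand-in for processors sharing one address space. The function name `run_counter` and its parameters are invented for this example.

```python
import threading


def run_counter(n_threads: int = 4, increments: int = 10_000) -> int:
    counter = 0                  # shared variable, visible to every thread
    lock = threading.Lock()      # guards the critical section

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            with lock:           # critical section: one thread at a time
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(run_counter(4, 1000))  # 4000
```

Without the lock, the read-modify-write on `counter` could interleave between threads and lose updates; the critical section makes each increment indivisible.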

  12. Shared Memory Multiprocessor Systems (cont.) • Alternately known as a tightly coupled architecture. • No local memory associated with processors. • Avoid three problems of processor arrays • Can be built from commodity CPUs • Naturally support multiple users • Maintain efficiency in conditional code

  13. Shared Memory Multiprocessor Systems (cont.) • Several alternatives for programming shared memory multiprocessors • Using threads (pthreads, Java, …) in which the programmer decomposes the program into individual parallel sequences, each being a thread, and each being able to access variables declared outside the threads. • Using a sequential programming language with user-level libraries to declare and access shared variables.

  14. Shared Memory Multiprocessor Systems (cont.) • Several alternatives for programming shared memory multiprocessors • Using a sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. • Ex: OpenMP – the industry standard • An API for shared-memory systems • Supports higher performance parallel programming of symmetrical multiprocessors

  15. Shared Memory Multiprocessor Systems (cont.) • Several alternatives for programming shared memory multiprocessors • Using a parallel programming language with syntax for parallelism, in which the compiler creates the appropriate executable code for each processor. • Using a sequential programming language and asking a parallelizing compiler to convert it into parallel executable code. • Neither of these is now common.

  16. Fundamental Types of Shared Memory Multiprocessor • Type 1: Centralized Multiprocessor • Straightforward extension of uniprocessor • Add CPUs to bus • All processors share same primary memory • Memory access time same for all CPUs • An example of a uniform memory access (UMA) multiprocessor • Symmetrical multiprocessor (SMP)

  17. Centralized Multiprocessor

  18. Private and Shared Data • Private data: items used only by a single processor • Shared data: values used by multiple processors • In a centralized multiprocessor, processors communicate via shared data values • Problems associated with shared data • Cache coherence • Replicating data across multiple caches reduces contention • How to ensure different processors have same value for same address? • Synchronization • Mutual exclusion • Barriers
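
Barrier synchronization, listed above, can be sketched with Python's `threading.Barrier`. Threads stand in for processors here, and the function name `barrier_demo` is made up for this example.

```python
import threading


def barrier_demo(n_threads: int = 4) -> list:
    partial = [0] * n_threads    # shared data: one slot per thread
    totals = [0] * n_threads
    barrier = threading.Barrier(n_threads)

    def worker(rank: int) -> None:
        partial[rank] = (rank + 1) ** 2   # phase 1: write own slot
        barrier.wait()                    # wait until every thread has written
        totals[rank] = sum(partial)       # phase 2: read everyone's slot

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


print(barrier_demo(4))  # [30, 30, 30, 30]
```

Without the barrier, a fast thread could sum `partial` before slower threads had written their values.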

  19. Distributed Shared Memory • Making the main memory of a cluster of computers look as though it is a single memory with a single address space (via hidden message passing). • Then can use shared memory programming techniques.

  20. Fundamental Types of Shared Memory Multiprocessor • Type 2: Distributed Multiprocessor • Distribute primary memory among processors • Increase aggregate memory bandwidth and lower average memory access time • Allow greater number of processors • Also called non-uniform memory access (NUMA) multiprocessor

  21. Distributed Multiprocessor

  22. Cache Coherence • Some NUMA multiprocessors do not support it in hardware • Only instructions, private data in cache • Large memory access time variance • Implementations more difficult • No shared memory bus to “snoop” • Directory-based protocol needed

  23. Directory-Based Protocol • Distributed directory contains information about cacheable memory blocks • One directory entry for each cache block • Each entry has • Sharing status • Uncached: block not in any processor’s cache • Shared: cached by one or more processors; read only • Exclusive: cached by exactly one processor which has written block, so copy in memory obsolete • Which processors have copies
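
The three sharing states above can be sketched as a small state machine for a single directory entry. This is a heavily simplified illustration: real protocols also exchange messages and write data back to memory, which is only noted in comments here. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UNCACHED = "uncached"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


@dataclass
class DirectoryEntry:
    state: State = State.UNCACHED
    sharers: set = field(default_factory=set)   # which processors have copies

    def read(self, cpu: int) -> None:
        """Processor `cpu` requests a read-only copy of the block."""
        if self.state is State.EXCLUSIVE and cpu in self.sharers:
            return  # the owner already holds the current copy
        # If another processor owned the block exclusively, it would write
        # the block back to memory here (not modelled).
        self.sharers.add(cpu)
        self.state = State.SHARED

    def write(self, cpu: int) -> set:
        """Processor `cpu` requests write access; returns invalidated copies."""
        invalidated = self.sharers - {cpu}
        self.sharers = {cpu}
        self.state = State.EXCLUSIVE
        return invalidated

    def evict(self, cpu: int) -> None:
        """Processor `cpu` drops its copy of the block."""
        self.sharers.discard(cpu)
        if not self.sharers:
            self.state = State.UNCACHED


entry = DirectoryEntry()
entry.read(0)
entry.read(1)
print(entry.state.value, sorted(entry.sharers))  # shared [0, 1]
print(sorted(entry.write(1)))                    # [0]
print(entry.state.value, sorted(entry.sharers))  # exclusive [1]
```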

  24. Type 3: Message-Passing Multicomputers • Complete computers connected through an interconnection network

  25. Multicomputers • Distributed memory multiple-CPU computer • Same address on different processors refers to different physical memory locations • Processors interact through message passing • Commercial multicomputers • Commodity clusters

  26. Loosely Coupled Architectures • Alternate name for message-passing multicomputer systems. • Each processor has its own memory accessible only to that processor. • A message passing interconnection network provides point-to-point connections among processors. • Memory access varies between processors.
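
A minimal simulation of the message-passing model: each node keeps private data and interacts only through point-to-point links. Python threads and `queue.Queue` objects stand in for separate computers and network links, which is an assumption of this sketch; a real multicomputer would use a message-passing library over a network. All names are hypothetical.

```python
import queue
import threading


def node(rank, inbox, outbox, done):
    # Private memory: every node has its own "x"; the same name refers to
    # different storage on each node.
    local = {"x": (rank + 1) * 10}
    outbox.put(local["x"])       # send: copy the value to the neighbour
    received = inbox.get()       # receive: block until a message arrives
    done.put((rank, local["x"] + received))


def exchange():
    link_0_to_1 = queue.Queue()  # point-to-point link, node 0 -> node 1
    link_1_to_0 = queue.Queue()  # point-to-point link, node 1 -> node 0
    done = queue.Queue()         # results are also reported as messages
    nodes = [
        threading.Thread(target=node, args=(0, link_1_to_0, link_0_to_1, done)),
        threading.Thread(target=node, args=(1, link_0_to_1, link_1_to_0, done)),
    ]
    for t in nodes:
        t.start()
    for t in nodes:
        t.join()
    return dict(done.get() for _ in nodes)


print(sorted(exchange().items()))  # [(0, 30), (1, 30)]
```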

  27. Asymmetrical Multicomputer

  28. Asymmetrical Multicomputer • Advantages: • Back-end processors dedicated to parallel computations • Easier to understand, model, tune performance • Only a simple back-end operating system needed • Easy for a vendor to create • Disadvantages: • Front-end computer is a single point of failure • Single front-end computer limits scalability of system • Primitive operating system in back-end processors makes debugging difficult • Every application requires development of both front-end and back-end programs

  29. Symmetrical Multicomputer

  30. Symmetrical Multicomputer • Advantages: • Alleviate performance bottleneck caused by single front-end computer • Better support for debugging • Every processor executes same program • Disadvantages: • More difficult to maintain illusion of single “parallel computer” • No simple way to balance program development workload among processors • More difficult to achieve high performance when multiple processes run on each processor

  31. ParPar Cluster: A Mixed Model

  32. Alternate System: Flynn’s Taxonomy • Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams. • Also important are number of processors, number of programs which can be executed, and the memory structure.

  33. Flynn’s Taxonomy: SISD • Single instruction stream, single data stream (SISD) computer • In a single processor computer, a single stream of instructions is generated from the program. The instructions operate upon a single stream of data items. • The single CPU executes one instruction at a time and fetches or stores one item of data at a time.

  34. Flynn’s Taxonomy: SISD (cont.) • [Figure: SISD block diagram; labels: Control Unit, Arithmetic Processor, Memory, Control Signals, Instruction, Data Stream, Results]

  35. Flynn’s Taxonomy: SIMD • Single instruction stream, multiple data stream (SIMD) computer • A specially designed computer in which a single instruction stream comes from a single program, but multiple data streams exist. • The instructions from the program are broadcast to more than one processor. • Each processor executes the same instruction in synchronism, but using different data. • Developed because there are a number of important applications that mostly operate upon arrays of data.

  36. Flynn’s Taxonomy: SIMD (cont.) • [Figure: SIMD block diagram; labels: Control Unit, Control Signal, PE 1, PE 2, …, PE n, Data Stream 1, Data Stream 2, …, Data Stream n]

  37. SIMD Architectures • Processing distributed over a large amount of hardware. • Operates concurrently on many different data elements. • Performs the same computation on all data elements. • Processors operate synchronously. • Examples: pipelined vector processors (e.g. Cray-1) and processor arrays (e.g. Connection Machine)

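  38. SISD vs. SIMD Execution • [Figure: flowchart of a conditional (a = 0?) on a SISD machine and on a SIMD machine] • SISD machine: executes X1, then either X2 (yes) or X3 (no), then X4 • SIMD machine: all PEs execute X1; X2 runs on the PEs that satisfy a = 0 while the others are idle; X3 runs on the PEs that satisfy a ≠ 0 while the others are idle; all PEs execute X4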
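
The conditional execution on a SIMD machine described in this slide can be simulated with a mask over the processing elements (PEs). The sketch assumes one PE per data element and models each broadcast instruction as one pass over a list; the function name `simd_branch` is invented for this example.

```python
def simd_branch(a, then_op, else_op):
    """Simulate `if a == 0: then_op else: else_op` with one PE per element."""
    mask = [x == 0 for x in a]   # evaluated once, by all PEs
    # Step 1: broadcast then_op; PEs with a != 0 are idle.
    step1 = [then_op(x) if m else x for x, m in zip(a, mask)]
    # Step 2: broadcast else_op; PEs with a == 0 are idle.
    step2 = [x if m else else_op(x) for x, m in zip(step1, mask)]
    active = [sum(mask), len(a) - sum(mask)]   # busy PEs in each step
    return step2, active


result, active = simd_branch([0, 3, 0, 5], lambda x: x + 1, lambda x: x * 2)
print(result)                    # [1, 6, 1, 10]
print(active)                    # [2, 2]
print(sum(active) / (2 * 4))     # 0.5
```

Both branches take a full step each, so in this example only half of the PE-steps do useful work during the conditional.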

  39. Flynn’s Taxonomy: MISD • Multiple instruction stream, single data stream (MISD) computer • MISD machines may execute several different programs on the same data item. • There are two categories • Distinct processing units perform distinct instructions on the same data. Currently there is no such machine. • Pipelined architectures, where data flows through a series of processing elements.

  40. Flynn’s Taxonomy: MISD (cont.) • [Figure: MISD block diagram; labels: Instruction Stream 1, 2, …, n; Control Unit 1, 2, …, n; Processing Element 1, 2, …, n; a single Data Stream]

  41. MISD Architectures • A pipeline processor works according to the principle of pipelining. • A process can be broken down into several stages (segments). • While one stage is executing, another stage is being loaded and the input of one stage is the output of the previous stage. • The processor carries out many different computations concurrently. • Example: systolic array

  42. MISD Architectures (cont.) • Serial execution of two processes with 4 stages each: time to execute T = 8t, where t is the time to execute one stage. • Pipelined execution of the same two processes: T = 5t.
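
The two timings on this slide follow from a standard pipeline model, sketched below under the assumption that every stage takes the same time t and the pipeline never stalls. The function names are hypothetical.

```python
def serial_time(n_tasks: int, n_stages: int, t: float = 1.0) -> float:
    """Each task runs all of its stages before the next task starts."""
    return n_tasks * n_stages * t


def pipelined_time(n_tasks: int, n_stages: int, t: float = 1.0) -> float:
    """The first task takes n_stages steps; each later task finishes one step after."""
    return (n_stages + n_tasks - 1) * t


print(serial_time(2, 4))     # 8.0
print(pipelined_time(2, 4))  # 5.0
```

With only two tasks the gain is modest (8t versus 5t); under the same assumptions the advantage grows as more tasks are streamed through the pipeline.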

  43. Flynn’s Taxonomy: MIMD • Multiple instruction stream, multiple data stream (MIMD) computer • General purpose multiprocessor system. • Multiple processors, each with a separate (different) program operating on its own data. • One instruction stream is generated from each program for each processor. • Each instruction operates upon different data. • Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification.

  44. Flynn’s Taxonomy: MIMD (cont.) • [Figure: MIMD block diagram; labels: Instruction Stream 1, 2, …, n; Data Stream 1, 2, …, n; Control Unit 1, 2, …, n; Processing Element 1, 2, …, n]

  45. MIMD Architectures • Processing distributed over a number of processors operating independently and concurrently. • Resources (memory) shared among processors. • Each processor runs its own program. • MIMD systems execute operations in a parallel asynchronous fashion.

  46. MIMD Architectures (cont.) • Differ with regard to • Interconnection networks • Memory addressing techniques • Synchronization • Control structures • A high throughput can be achieved if the processing can be broken into parallel streams keeping all the processors active concurrently.

  47. Two MIMD Structures: MPMD • Multiple Program Multiple Data (MPMD) Structure • Within the MIMD classification, which we are concerned with, each processor will have its own program to execute.

  48. Two MIMD Structures: SPMD • Single Program Multiple Data (SPMD) Structure • Single source program is written and each processor will execute its personal copy of this program, although independently and not in synchronism. • The source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer. • Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware.
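
A minimal SPMD sketch: every worker runs the same function, and a branch on the worker's identity (its rank) selects which parts it executes. Threads stand in for separate processors here, and the names `program` and `run_spmd` are made up for this example.

```python
import threading


def program(rank: int, size: int, data: list, out: list) -> None:
    """The single program; every worker executes its own copy of it."""
    partial = sum(data[rank::size])   # same code, different slice of the data
    if rank == 0:
        role = "root"                 # this branch is taken by rank 0 only
    else:
        role = "worker"
    out[rank] = (role, partial)


def run_spmd(size: int, data: list) -> list:
    out = [None] * size
    workers = [
        threading.Thread(target=program, args=(rank, size, data, out))
        for rank in range(size)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return out


result = run_spmd(3, list(range(1, 11)))
print(result)                         # [('root', 22), ('worker', 15), ('worker', 18)]
print(sum(p for _, p in result))      # 55
```

The workers are not synchronized instruction by instruction; they only meet when the caller joins them, which matches the "independently and not in synchronism" description.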

  49. SIMD vs. MIMD • SIMD needs less hardware (only one control unit). In MIMD each processor has its own control unit. • SIMD needs less memory than MIMD (SIMD needs only one copy of the instructions). In MIMD the program and operating system need to be stored at each processor. • SIMD has implicit synchronization of PEs. In contrast, explicit synchronization may be required in MIMD.

  50. SIMD vs. MIMD (cont.) • MIMD allows different operations to be performed on different processing elements simultaneously (functional parallelism). SIMD is limited to data parallelism. • For MIMD it is possible to use a general-purpose microprocessor as a processing unit. Such a processor may be cheaper and more powerful.
