Parallel Concept and Hardware Architecture
CUDA Programming Model Overview
Yukai Hung
Department of Mathematics
National Taiwan University
Parallel Concept Overview
making your program faster
Parallel Computing Goals
• Solve problems in less time
 - divide one problem into smaller pieces
 - solve the smaller problems concurrently
 - allows solving even bigger problems (see the sketch below)
• Prepare to parallelize a problem
 - represent the algorithm as a Directed Acyclic Graph (DAG)
 - identify dependencies in the problem
 - identify critical paths in the algorithm
 - modify dependencies to shorten the critical paths
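As a concrete illustration of dividing one problem into pieces solved concurrently, here is a minimal host-side sketch; the function names and the two-way split are illustrative assumptions, not from the slides:

```cpp
#include <cstdio>
#include <thread>

// Sum one piece of the array into a per-thread partial result.
void partialSum(const int *data, int begin, int end, long long *out)
{
    long long s = 0;
    for (int i = begin; i < end; ++i) s += data[i];
    *out = s;
}

int main()
{
    const int n = 1000000;
    static int data[n];
    for (int i = 0; i < n; ++i) data[i] = 1;

    long long left = 0, right = 0;
    std::thread a(partialSum, data, 0, n / 2, &left);   // piece 1
    std::thread b(partialSum, data, n / 2, n, &right);  // piece 2
    a.join();
    b.join();

    printf("total = %lld\n", left + right);  // combine the partial results
    return 0;
}
```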
Parallel Computing Goals
• What is parallel computing?
Amdahl’s Law
• Speedup of a parallel program is limited by the amount of serial work (formula below)
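The original slides showed only the speedup curves; the standard statement of the law, for a program with serial fraction s running on N processors, is:

```latex
S(N) = \frac{1}{\,s + \dfrac{1 - s}{N}\,},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{s}
```

Even with unlimited processors, a serial fraction of 5% caps the speedup at 20x.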
Race Condition
• Consider the following parallel program: two threads each execute R = R + 1
 - the threads almost never execute at exactly the same time;
   their instructions interleave in an unpredictable order
Race Condition
• Scenario 1
 - the final value of R is 2 when the initial value of R is 1
   (both threads read R before either writes, so one update is lost)
Race Condition
• Scenario 2
 - the final value of R is 2 when the initial value of R is 1
   (a different interleaving that also loses one update)
Race Condition
• Scenario 3
 - the final value of R is 3 when the initial value of R is 1
   (the two increments serialize, giving the intended result; a runnable sketch follows)
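A runnable sketch of the three scenarios; the explicit read-modify-write split below is an illustrative assumption that mirrors what the hardware actually does:

```cpp
#include <cstdio>
#include <thread>

int R = 1;  // shared value, initially 1 as in the scenarios above

void increment()            // each thread performs R = R + 1
{
    int tmp = R;            // read   -- may interleave with the other thread
    tmp = tmp + 1;          // modify
    R = tmp;                // write  -- may overwrite the other thread's update
}

int main()
{
    std::thread a(increment), b(increment);
    a.join();
    b.join();
    // Prints 3 when the increments serialize (scenario 3) and 2 when
    // both threads read before either writes (scenarios 1 and 2).
    printf("R = %d\n", R);
    return 0;
}
```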
Race Condition with Lock
• Solve the race condition by locking
 - manage the shared resource between threads
 - take care to avoid deadlock and load-imbalance problems
Race Condition with Lock
• Guarantees that the instructions execute in a correct order (see the sketch below)
 - the critical section degenerates back to a sequential procedure
 - the lock and release procedures have high overhead
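A minimal locked version of the same program, sketched with a host-side mutex (the slides do not prescribe a particular lock implementation):

```cpp
#include <cstdio>
#include <mutex>
#include <thread>

int R = 1;             // shared value, initially 1 as before
std::mutex m;          // lock protecting R

void increment()
{
    std::lock_guard<std::mutex> guard(m);  // acquire; released at scope exit
    R = R + 1;                             // critical section runs serially
}

int main()
{
    std::thread a(increment), b(increment);
    a.join();
    b.join();
    printf("R = %d\n", R);  // always 3: the lock restores the sequential order
    return 0;
}
```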
Race Condition with Semaphore
• Solve the race condition with a semaphore (sketch below)
 - a multi-valued locking method (an extension of binary locking)
 - the instructions in procedures P and V are atomic operations
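A short counting-semaphore sketch, assuming a C++20 toolchain; the slot count of 4 and the names are illustrative:

```cpp
#include <cstdio>
#include <semaphore>
#include <thread>

// A counting semaphore generalizes a binary lock: up to 4 threads may
// hold the resource at once.  acquire() is the atomic P operation
// (decrement, block at zero); release() is the atomic V operation.
std::counting_semaphore<4> slots(4);

void worker(int id)
{
    slots.acquire();                        // P
    printf("thread %d holds a slot\n", id);
    slots.release();                        // V
}

int main()
{
    std::thread t[8];
    for (int i = 0; i < 8; ++i) t[i] = std::thread(worker, i);
    for (int i = 0; i < 8; ++i) t[i].join();
    return 0;
}
```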
Instruction Level Parallelism
• Multiple instructions are executed simultaneously (example below)
 - reorder the instructions carefully to gain efficiency
 - the compiler reorders the assembly instructions automatically
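A small illustrative example of why instruction order matters; the reassociation shown is an assumption about what a compiler may do, not taken from the slides:

```cpp
// Dependent chain: each add needs the previous result, so the
// three adds must execute one after another (no ILP).
int chain(int a, int b, int c, int d)
{
    int s = a + b;
    int t = s + c;
    int u = t + d;
    return u;
}

// Reassociated form: the first two adds are independent, so the
// compiler or a superscalar core can issue them in the same cycle.
int reassociated(int a, int b, int c, int d)
{
    int s = a + b;
    int t = c + d;   // independent of s
    return s + t;
}
```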
Data Level Parallelism
• Multiple data operations are executed simultaneously
 - the computational data is separable and independent
 - a single operation is repeated over different input data
   (sequential versus parallel procedure; sketch below)
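A sketch of the sequential-versus-parallel contrast in CUDA terms; the scaling operation is an illustrative choice:

```cuda
// Sequential procedure: one loop iteration after another on the CPU.
void scale_seq(float *x, float a, int n)
{
    for (int i = 0; i < n; ++i) x[i] = a * x[i];
}

// Parallel procedure: the loop body becomes a kernel; every element
// gets its own thread, all applying the same operation to different data.
__global__ void scale_par(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}
```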
Flynn’s Taxonomy
• Classification of parallel computers and programs by the number of
  concurrent instruction streams and data streams
 - SISD: single instruction, single data (a classical serial processor)
 - SIMD: single instruction, multiple data (vector units, GPU shaders)
 - MISD: multiple instruction, single data (rarely built in practice)
 - MIMD: multiple instruction, multiple data (multi-core CPUs, clusters)
CPU versus GPU
• Intel Penryn quad-core: 255 mm² with 0.82B transistors
• NVIDIA GTX 280: >500 mm² with 1.4B transistors
CPU versus GPU
[figures: computing GFLOPS and memory bandwidth, CPU versus GPU]
CPU versus GPU
• comparison of control logic and cache size versus the number of ALUs
• comparison of clock rates, core counts, and execution latency
General Purpose GPU Computation
• algorithm conversion requires knowledge of graphics APIs (OpenGL and DirectX)
General Purpose GPU Computation
• converting pixel data into the required data restricts the general usage of algorithms
Simplified Graphic Pipeline
• sorting stage with z-buffer collection
Simplified Graphic Pipeline
• maximum-depth z-cull feedback
Simplified Graphic Pipeline
• scale up some of the units
Simplified Graphic Pipeline
• add framebuffer access; the bottleneck is the FBI (framebuffer interface) unit,
  which manages memory
Simplified Graphic Pipeline
• add programmability through ALU units: programmable geometry and pixel shaders
Simplified Graphic Pipeline
• consider two similar units and special cases on the pipeline
 (1) one triangle but lots of pixels: the pixel shader is busy, the geometry shader idles
 (2) lots of triangles but one pixel: the geometry shader is busy, the pixel shader idles
Iterative Graphic Pipeline
• combine the two units into a unified shader
 - scalable between geometry and pixel workloads
 - memory resource management becomes important
Graphic Pipeline Comparison
[figures: software pipeline versus hardware pipeline]
Unified Graphic Architecture
• Switch between two modes: graphic mode and CUDA mode
[figure: unified architecture in graphic mode - host, input assembler,
setup/raster/z-cull, vertex/geometry/pixel thread issue units, a thread
processor, arrays of streaming processors (SP) with texture fetch (TF)
units and L1 caches, L2 caches, and framebuffer (FB) partitions]
Unified Graphic Architecture
• Switch between two modes: graphic mode and CUDA mode
[figure: the same hardware in CUDA mode - host, input assembler, thread
execution manager, processor arrays with parallel data caches and texture
units, and load/store access to global memory]
Thread Streaming Processing
• no data communication between threads, which suits traditional graphics workloads
Shader Register File/Cache
• Separate register files
 - strict streaming-processing mode
 - no data sharing at the instruction level
 - dynamic allocation and renaming
 - no memory addressing order (registers are not addressable like memory)
 - registers overflow (spill) into local memory
Thread Streaming Processing
• communication between different threads (in the same shader) becomes an issue
Shader Register File/Cache
• Shared register files (see the sketch below)
 - an extra memory hierarchy level between the shader registers and global memory
 - share data between threads in the same shader, i.e. threads in the same block
 - synchronize all threads in the shader
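In CUDA terms this shared register file is the per-block shared memory. A minimal sketch of block-level sharing and synchronization; the reduction pattern and the fixed block size of 256 are illustrative assumptions:

```cuda
// Sum 256 elements per block using shared memory; one result per block.
// Assumes the kernel is launched with exactly 256 threads per block.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];              // per-block shared storage
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                        // all threads see each other's data

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                    // synchronize between reduction steps
    }
    if (tid == 0) out[blockIdx.x] = buf[0]; // thread 0 writes the block's sum
}
```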
Modern GPU Applications
• Traditional game rendering
 - clothing demo and Star Tales benchmark
 - real-time rendering of physical phenomena
 - mixed mode for physical phenomenon simulation
• Scientific computation
 - molecular dynamics simulation
 - protein folding and N-body simulation
 - medical imaging and computational fluid dynamics
• CUDA community showcase
Modern GPU Features
• GPUs are becoming more programmable than before
 - programmed with only a standard C extension on unified scalable shaders
• GPUs now support 32-bit and 64-bit floating-point operations
 - almost IEEE floating-point compliant, except for some special cases
 - no denormalized (subnormal) representation for very small floating-point numbers
• GPUs have much higher memory bandwidth than CPUs
 - multiple memory banks, driven by the needs of high-performance graphics
• Massively data-parallel architecture
 - hundreds of thread processors on the chip
 - thousands of concurrent threads on the shaders
 - lightweight thread switching hides long memory latency
General Purpose GPU Environment
• CUDA: Compute Unified Device Architecture
 - a realistic hardware and software GPGPU solution
 - a minimal set of standard C language extensions
 - the tool set includes a compiler and software development kits
• OpenCL: Open Computing Language
 - similar to CUDA from the GPGPU point of view
 - supports both CPU and GPU hardware architectures
 - executes across heterogeneous platform resources
CUDA Programming Model
• Integrated host and device application C program
 - serial or modestly parallel parts run in host C code
 - highly parallel parts run in device C-extension code
CUDA Programming Model
• What is the compute device?
 - a coprocessor to the host
 - has its own device memory space
 - runs many active threads in parallel
• What is the difference between CPU and GPU threads?
 - GPU threads are extremely lightweight
 - GPU threads have almost no creation overhead
 - the GPU needs more than 1000 threads for full occupancy
 - a multi-core CPU can execute or create only a few threads
(see the sketch below)
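A minimal integrated host-plus-device sketch of this model; the vector-add kernel and all names here are illustrative, not taken from the slides:

```cuda
#include <cstdio>

// Device code: the highly parallel part, one lightweight thread per element.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code: the serial part; it owns the data and launches the kernel.
int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;                       // separate device memory space
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);  // thousands of threads
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("hc[0] = %f\n", hc[0]);             // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```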