Parallel Concept and Hardware Architecture
CUDA Programming Model Overview
Yukai Hung
Department of Mathematics
National Taiwan University
Parallel Concept Overview
making your program faster
Parallel Computing Goals
• Solve problems in less time
 - divide one problem into smaller pieces
 - solve the smaller problems concurrently
 - allows solving even bigger problems (see the sketch below)
• Prepare to parallelize a problem
 - represent the algorithm as a Directed Acyclic Graph (DAG)
 - identify dependencies in the problem
 - identify critical paths in the algorithm
 - modify dependencies to shorten the critical paths
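As a concrete illustration of dividing one problem into pieces solved concurrently, here is a minimal host-side sketch; the function names and the two-way split are illustrative assumptions, not from the slides:

```cpp
#include <cstdio>
#include <thread>

// Sum one piece of the array into a per-thread partial result.
void partialSum(const int *data, int begin, int end, long long *out)
{
    long long s = 0;
    for (int i = begin; i < end; ++i) s += data[i];
    *out = s;
}

int main()
{
    const int n = 1000000;
    static int data[n];
    for (int i = 0; i < n; ++i) data[i] = 1;

    long long left = 0, right = 0;
    std::thread a(partialSum, data, 0, n / 2, &left);   // piece 1
    std::thread b(partialSum, data, n / 2, n, &right);  // piece 2
    a.join();
    b.join();

    printf("total = %lld\n", left + right);  // combine the partial results
    return 0;
}
```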
Parallel Computing Goals
• What is parallel computing?
Amdahl’s Law
• Speedup of a parallel program is limited by the amount of serial work (formula below)
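The original slides showed only the speedup curves; the standard statement of the law, for a program with serial fraction s running on N processors, is:

```latex
S(N) = \frac{1}{\,s + \dfrac{1 - s}{N}\,},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{s}
```

Even with unlimited processors, a serial fraction of 5% caps the speedup at 20x.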
Race Condition
• Consider the following parallel program: two threads each execute R = R + 1
 - the threads almost never execute at exactly the same time;
   their instructions interleave in an unpredictable order
Race Condition
• Scenario 1
 - the final value of R is 2 when the initial value of R is 1
   (both threads read R before either writes, so one update is lost)
Race Condition
• Scenario 2
 - the final value of R is 2 when the initial value of R is 1
   (a different interleaving that also loses one update)
Race Condition
• Scenario 3
 - the final value of R is 3 when the initial value of R is 1
   (the two increments serialize, giving the intended result; a runnable sketch follows)
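A runnable sketch of the three scenarios; the explicit read-modify-write split below is an illustrative assumption that mirrors what the hardware actually does:

```cpp
#include <cstdio>
#include <thread>

int R = 1;  // shared value, initially 1 as in the scenarios above

void increment()            // each thread performs R = R + 1
{
    int tmp = R;            // read   -- may interleave with the other thread
    tmp = tmp + 1;          // modify
    R = tmp;                // write  -- may overwrite the other thread's update
}

int main()
{
    std::thread a(increment), b(increment);
    a.join();
    b.join();
    // Prints 3 when the increments serialize (scenario 3) and 2 when
    // both threads read before either writes (scenarios 1 and 2).
    printf("R = %d\n", R);
    return 0;
}
```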
Race Condition with Lock
• Solve the race condition by locking
 - manage the shared resource between threads
 - take care to avoid deadlock and load-imbalance problems
Race Condition with Lock
• Guarantees that the instructions execute in a correct order (see the sketch below)
 - the critical section degenerates back to a sequential procedure
 - the lock and release procedures have high overhead
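A minimal locked version of the same program, sketched with a host-side mutex (the slides do not prescribe a particular lock implementation):

```cpp
#include <cstdio>
#include <mutex>
#include <thread>

int R = 1;             // shared value, initially 1 as before
std::mutex m;          // lock protecting R

void increment()
{
    std::lock_guard<std::mutex> guard(m);  // acquire; released at scope exit
    R = R + 1;                             // critical section runs serially
}

int main()
{
    std::thread a(increment), b(increment);
    a.join();
    b.join();
    printf("R = %d\n", R);  // always 3: the lock restores the sequential order
    return 0;
}
```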
Race Condition with Semaphore
• Solve the race condition with a semaphore (sketch below)
 - a multi-valued locking method (an extension of binary locking)
 - the instructions in procedures P and V are atomic operations
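A short counting-semaphore sketch, assuming a C++20 toolchain; the slot count of 4 and the names are illustrative:

```cpp
#include <cstdio>
#include <semaphore>
#include <thread>

// A counting semaphore generalizes a binary lock: up to 4 threads may
// hold the resource at once.  acquire() is the atomic P operation
// (decrement, block at zero); release() is the atomic V operation.
std::counting_semaphore<4> slots(4);

void worker(int id)
{
    slots.acquire();                        // P
    printf("thread %d holds a slot\n", id);
    slots.release();                        // V
}

int main()
{
    std::thread t[8];
    for (int i = 0; i < 8; ++i) t[i] = std::thread(worker, i);
    for (int i = 0; i < 8; ++i) t[i].join();
    return 0;
}
```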
Instruction Level Parallelism
• Multiple instructions are executed simultaneously (example below)
 - reorder the instructions carefully to gain efficiency
 - the compiler reorders the assembly instructions automatically
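A small illustrative example of why instruction order matters; the reassociation shown is an assumption about what a compiler may do, not taken from the slides:

```cpp
// Dependent chain: each add needs the previous result, so the
// three adds must execute one after another (no ILP).
int chain(int a, int b, int c, int d)
{
    int s = a + b;
    int t = s + c;
    int u = t + d;
    return u;
}

// Reassociated form: the first two adds are independent, so the
// compiler or a superscalar core can issue them in the same cycle.
int reassociated(int a, int b, int c, int d)
{
    int s = a + b;
    int t = c + d;   // independent of s
    return s + t;
}
```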
Data Level Parallelism
• Multiple data operations are executed simultaneously
 - the computational data is separable and independent
 - a single operation is repeated over different input data
   (sequential versus parallel procedure; sketch below)
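A sketch of the sequential-versus-parallel contrast in CUDA terms; the scaling operation is an illustrative choice:

```cuda
// Sequential procedure: one loop iteration after another on the CPU.
void scale_seq(float *x, float a, int n)
{
    for (int i = 0; i < n; ++i) x[i] = a * x[i];
}

// Parallel procedure: the loop body becomes a kernel; every element
// gets its own thread, all applying the same operation to different data.
__global__ void scale_par(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}
```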
Flynn’s Taxonomy
• Classification of parallel computers and programs by the number of
  concurrent instruction streams and data streams
 - SISD: single instruction, single data (a classical serial processor)
 - SIMD: single instruction, multiple data (vector units, GPU shaders)
 - MISD: multiple instruction, single data (rarely built in practice)
 - MIMD: multiple instruction, multiple data (multi-core CPUs, clusters)
CPU versus GPU
• Intel Penryn quad-core: 255 mm² with 0.82B transistors
• NVIDIA GTX 280: >500 mm² with 1.4B transistors
CPU versus GPU
[figures: computing GFLOPS and memory bandwidth, CPU versus GPU]
CPU versus GPU
• comparison of control logic and cache size versus the number of ALUs
• comparison of clock rates, core counts, and execution latency
General Purpose GPU Computation
• algorithm conversion requires knowledge of graphics APIs (OpenGL and DirectX)
General Purpose GPU Computation
• converting pixel data into the required data restricts the general usage of algorithms
Simplified Graphic Pipeline
• sorting stage with z-buffer collection
Simplified Graphic Pipeline
• maximum-depth z-cull feedback
Simplified Graphic Pipeline
• scale up some of the units
Simplified Graphic Pipeline
• add framebuffer access; the bottleneck is the FBI (framebuffer interface) unit,
  which manages memory
Simplified Graphic Pipeline
• add programmability through ALU units: programmable geometry and pixel shaders
Simplified Graphic Pipeline
• consider two similar units and special cases on the pipeline
 (1) one triangle but lots of pixels: the pixel shader is busy, the geometry shader idles
 (2) lots of triangles but one pixel: the geometry shader is busy, the pixel shader idles
Iterative Graphic Pipeline
• combine the two units into a unified shader
 - scalable between geometry and pixel workloads
 - memory resource management becomes important
Graphic Pipeline Comparison
[figures: software pipeline versus hardware pipeline]
Unified Graphic Architecture
• Switch between two modes: graphic mode and CUDA mode
[figure: unified architecture in graphic mode - host, input assembler,
setup/raster/z-cull, vertex/geometry/pixel thread issue units, a thread
processor, arrays of streaming processors (SP) with texture fetch (TF)
units and L1 caches, L2 caches, and framebuffer (FB) partitions]
Unified Graphic Architecture
• Switch between two modes: graphic mode and CUDA mode
[figure: the same hardware in CUDA mode - host, input assembler, thread
execution manager, processor arrays with parallel data caches and texture
units, and load/store access to global memory]
Thread Streaming Processing
• no data communication between threads, which suits traditional graphics workloads
Shader Register File/Cache
• Separate register files
 - strict streaming-processing mode
 - no data sharing at the instruction level
 - dynamic allocation and renaming
 - no memory addressing order (registers are not addressable like memory)
 - registers overflow (spill) into local memory
Thread Streaming Processing
• communication between different threads (in the same shader) becomes an issue
Shader Register File/Cache
• Shared register files (see the sketch below)
 - an extra memory hierarchy level between the shader registers and global memory
 - share data between threads in the same shader, i.e. threads in the same block
 - synchronize all threads in the shader
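In CUDA terms this shared register file is the per-block shared memory. A minimal sketch of block-level sharing and synchronization; the reduction pattern and the fixed block size of 256 are illustrative assumptions:

```cuda
// Sum 256 elements per block using shared memory; one result per block.
// Assumes the kernel is launched with exactly 256 threads per block.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];              // per-block shared storage
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                        // all threads see each other's data

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                    // synchronize between reduction steps
    }
    if (tid == 0) out[blockIdx.x] = buf[0]; // thread 0 writes the block's sum
}
```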
Modern GPU Applications
• Traditional game rendering
 - clothing demo and Star Tales benchmark
 - real-time rendering of physical phenomena
 - mixed mode for physical phenomenon simulation
• Scientific computation
 - molecular dynamics simulation
 - protein folding and N-body simulation
 - medical imaging and computational fluid dynamics
• CUDA community showcase
Modern GPU Features
• GPUs are becoming more programmable than before
 - programmed with only a standard C extension on unified scalable shaders
• GPUs now support 32-bit and 64-bit floating-point operations
 - almost IEEE floating-point compliant, except for some special cases
 - no denormalized (subnormal) representation for very small floating-point numbers
• GPUs have much higher memory bandwidth than CPUs
 - multiple memory banks, driven by the needs of high-performance graphics
• Massively data-parallel architecture
 - hundreds of thread processors on the chip
 - thousands of concurrent threads on the shaders
 - lightweight thread switching hides long memory latency
General Purpose GPU Environment
• CUDA: Compute Unified Device Architecture
 - a realistic hardware and software GPGPU solution
 - a minimal set of standard C language extensions
 - the tool set includes a compiler and software development kits
• OpenCL: Open Computing Language
 - similar to CUDA from the GPGPU point of view
 - supports both CPU and GPU hardware architectures
 - executes across heterogeneous platform resources
CUDA Programming Model
• Integrated host and device application C program
 - serial or modestly parallel parts run in host C code
 - highly parallel parts run in device C-extension code
CUDA Programming Model
• What is the compute device?
 - a coprocessor to the host
 - has its own device memory space
 - runs many active threads in parallel
• What is the difference between CPU and GPU threads?
 - GPU threads are extremely lightweight
 - GPU threads have almost no creation overhead
 - the GPU needs more than 1000 threads for full occupancy
 - a multi-core CPU can execute or create only a few threads
(see the sketch below)
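A minimal integrated host-plus-device sketch of this model; the vector-add kernel and all names here are illustrative, not taken from the slides:

```cuda
#include <cstdio>

// Device code: the highly parallel part, one lightweight thread per element.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code: the serial part; it owns the data and launches the kernel.
int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;                       // separate device memory space
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(da, db, dc, n);  // thousands of threads
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("hc[0] = %f\n", hc[0]);             // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```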