
Introduction


Presentation Transcript


  1. Introduction Manjusha Nair M, Amrita School of Biotechnology, Amrita University.

  2. Contents • What is Parallel Computing? • Concepts and Terminology • CPU vs GPU • CUDA Programming Model • CUDA Memory Model • CUDA Architecture • Threads, Blocks, Grids

  3. I. What is Parallel Computing? • Traditionally, software has been written for serial computation: • To be run on a single computer having a single Central Processing Unit (CPU) • A problem is broken into a discrete series of instructions • Instructions are executed one after another • Only one instruction may execute at any moment in time • In contrast, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: • To be run using multiple CPUs • A problem is broken into discrete parts that can be solved concurrently • Each part is further broken down into a series of instructions • Instructions from each part execute simultaneously on different CPUs

  4. The computational problem should be able to: • Be broken apart into discrete pieces of work that can be solved simultaneously • Execute multiple program instructions at any moment in time • Be solved in less time with multiple compute resources than with a single compute resource • The compute resources might be: • A single computer with multiple processors • An arbitrary number of computers connected by a network • A combination of both • The universe itself is parallel: the human brain is a natural example

  5. II. Concepts and Terminology • von Neumann Architecture: John von Neumann first authored the general requirements for an electronic computer in his 1945 paper. The basic, fundamental architecture remains the same even in parallel computers, just multiplied in units. • Flynn's Classical Taxonomy: • SISD: non-parallel (serial) machines • SIMD: GPUs employ this model • MISD: very few implementations, e.g., multiple cryptographic algorithms trying to crack a single key • MIMD: computing clusters

  6. CPU vs GPU • GPU: • More transistors are devoted to data processing rather than data caching and flow control • Multiprocessor cores in a GPU are SIMD cores; the cores execute the same instructions simultaneously • Higher memory bandwidth: GPUs have their own memory controllers • Can process several thousand threads simultaneously; GPUs can maintain up to 1024 threads per multiprocessor • GPUs can switch between threads every few cycles • CPU: • Frequency growth is now limited by physical constraints and high power consumption • Performance is instead raised by increasing the number of cores • Multi-core CPUs use SISD or MIMD; each core works independently of the others, executing various instructions for various processes • CPUs can execute 1-2 threads per core • Switching from one thread to another costs a CPU hundreds of cycles

  7. Details on GPUs • A GPU is typically an expansion card installed into a PCI Express 16x slot • GPGPU: General-Purpose computation on Graphics Processing Units • GPUs have led the race for floating-point performance since the start of the 21st century • Market leaders: NVIDIA, Intel, AMD • Example NVIDIA GPUs: GeForce GTX 480, Tesla 2070, Tesla D870

  8. GPGPU & CUDA • The GPU is designed as a numeric computing engine • It will not perform as well as a CPU on some tasks • Most applications will use both CPUs and GPUs • CUDA: • NVIDIA’s parallel computing architecture, aimed at increasing computing performance by harnessing the power of the GPU • Also a programming model

  9. CUDA Programming Model • CUDA is NVIDIA’s solution for accessing the GPU • Can be seen as an extension to C/C++ • To a CUDA programmer, the computing system consists of: • a host: a traditional CPU • one or more devices: massively parallel processors with a large number of ALUs • Host (CPU part): Single Program, Single Data • Device (GPU part): Single Program, Multiple Data • (Slide figure: the CUDA software stack)
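
A minimal sketch (not part of the original slides) of how this host/device split shows up in CUDA source code; the function names square and squareAll are invented for illustration:

    // CUDA extends C/C++ with function qualifiers:
    //   __global__  -- a kernel: runs on the device, launched from the host
    //   __device__  -- runs on the device, callable only from device code
    //   __host__    -- runs on the host (the default for plain C/C++ functions)

    __device__ float square(float v) {        // device-only helper
        return v * v;
    }

    __global__ void squareAll(float *data) {  // kernel: one copy of this code
        int i = threadIdx.x;                  // runs in every thread (SPMD)
        data[i] = square(data[i]);
    }

    int main(void) {                          // ordinary host (CPU) code
        // The host would allocate device memory here and launch
        // squareAll<<<blocks, threadsPerBlock>>>(d_data);
        return 0;
    }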

  10. Execution alternates between serial host code and parallel device kernels (see the sketch below): Serial Code (host) → Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); → Serial Code (host) → Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args); • Data-parallel portions of an application are expressed as device kernels which run on many threads • Differences between GPU and CPU threads: • GPU threads are extremely lightweight, with very little creation overhead • A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
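
A compilable sketch of the pattern above; the values of nBlk and nTid and the kernel bodies are placeholders invented for the example:

    #include <cuda_runtime.h>

    __global__ void KernelA(float *data) {               // data-parallel phase 1
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    __global__ void KernelB(float *data) {               // data-parallel phase 2
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] += 1.0f;
    }

    int main(void) {
        const int nBlk = 4, nTid = 256;
        float *d_data;
        cudaMalloc(&d_data, nBlk * nTid * sizeof(float));

        // ... serial code (host): setup, I/O ...
        KernelA<<<nBlk, nTid>>>(d_data);                  // parallel kernel (device)
        // ... serial code (host) ...
        KernelB<<<nBlk, nTid>>>(d_data);                  // parallel kernel (device)
        cudaDeviceSynchronize();                          // wait for the device

        cudaFree(d_data);
        return 0;
    }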

  11. CUDA Memory Model: Overview • Global memory • Main means of communicating R/W data between host and device • Contents visible to all threads • Long latency access • The Grid • A group of threads all running the same kernel • Multiple grids can run at once • The Block • Grids are composed of blocks • Each block is a logical unit containing a number of coordinating threads and some amount of shared memory • (Slide figure: a grid of blocks; each block holds threads with private registers and a per-block shared memory, and the host and all blocks access global memory)
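
A small sketch of host-device communication through global memory; the kernel addOne and the array size are illustrative:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void addOne(float *g_data) {               // reads/writes global memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        g_data[i] += 1.0f;
    }

    int main(void) {
        const int n = 256;
        float h_data[n];                                   // host (CPU) memory
        for (int i = 0; i < n; ++i) h_data[i] = (float)i;

        float *d_data;                                     // global (device) memory
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        addOne<<<1, n>>>(d_data);                          // every thread can see d_data

        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_data);

        printf("h_data[10] = %f\n", h_data[10]);           // expect 11.0
        return 0;
    }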

  12. Local Memory • Each thread has own local storage • Data lifetime = thread lifetime • Shared Memory • Each thread block has own shared memory • Accessible only by threads within that block • Data lifetime = block lifetime • Global (device) memory • Accessible by all threads as well as host (CPU) • Data lifetime = from allocation to deallocation. • Host (CPU) memory • Not directly accessible by CUDA threads
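
A kernel-only sketch touching each of these spaces from device code; the names memorySpaces, tile, g_in and g_out are made up, and the block size is assumed to be at most 256 threads:

    __global__ void memorySpaces(const float *g_in, float *g_out) {
        // Local (per-thread) storage: lifetime = this thread.
        float x = g_in[threadIdx.x];

        // Shared (per-block) memory: visible only inside this block,
        // lifetime = this block.
        __shared__ float tile[256];
        tile[threadIdx.x] = x;
        __syncthreads();                      // make all writes visible block-wide

        // Global (device) memory: visible to all threads and, via cudaMemcpy,
        // to the host; lifetime = from cudaMalloc to cudaFree.
        g_out[threadIdx.x] = tile[threadIdx.x];
    }

Host (CPU) memory does not appear here because it is not directly accessible from CUDA threads.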

  13. CUDA Architecture

  14. Arrays of Parallel Threads • A CUDA kernel is executed by an array of threads, threadID = 0 1 2 3 4 5 6 7 …, each running the body: float x = input[threadID]; float y = func(x); output[threadID] = y; • All threads run the same code (SPMD) • Each thread has an ID that it uses to compute memory addresses and make control decisions
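
One way the slide's fragment could look as a complete kernel; func, input and output are the slide's illustrative names, and threadID is taken from threadIdx.x on the assumption of a single block:

    __device__ float func(float x) {          // illustrative per-element operation
        return 2.0f * x + 1.0f;
    }

    __global__ void applyFunc(const float *input, float *output) {
        int threadID = threadIdx.x;           // each thread gets a unique ID ...
        float x = input[threadID];            // ... and uses it to pick its element
        float y = func(x);
        output[threadID] = y;                 // same code, different data (SPMD)
    }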

  15. Thread Blocks: Scalable Cooperation • Divide the monolithic thread array into multiple blocks (Thread Block 0, Thread Block 1, …, Thread Block N-1), each with its own threadIDs 0…7 running the same body: float x = input[threadID]; float y = func(x); output[threadID] = y; • Threads within a block cooperate via shared memory, atomic operations and barrier synchronization • Threads in different blocks cannot cooperate
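
A hedged sketch of block-level cooperation; the kernel name, the 256-thread block size and the "read a neighbour's element" operation are invented for the example:

    // Launch with 256 threads per block, e.g. shiftWithinBlock<<<N, 256>>>(in, out);
    __global__ void shiftWithinBlock(const float *input, float *output) {
        __shared__ float tile[256];                             // shared by this block only
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;   // global element index

        tile[threadIdx.x] = input[threadID];                    // stage data in shared memory
        __syncthreads();                                        // barrier: all writes visible

        int neighbour = (threadIdx.x + 1) % blockDim.x;         // read another thread's element
        output[threadID] = tile[neighbour];                     // only possible within a block
    }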

  16. Grids, Blocks and Threads • A grid of size 6 (3x2 blocks) • Each block has 12 threads (4x3)
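
A minimal launch configuration matching those numbers; the kernel whoAmI and its printf body are illustrative (device-side printf needs a reasonably recent GPU):

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void whoAmI(void) {
        // Each of the 3x2 = 6 blocks runs 4x3 = 12 copies of this code.
        printf("block (%d,%d) thread (%d,%d)\n",
               blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
    }

    int main(void) {
        dim3 grid(3, 2);      // a grid of 3 x 2 = 6 blocks
        dim3 block(4, 3);     // each block holds 4 x 3 = 12 threads
        whoAmI<<<grid, block>>>();
        cudaDeviceSynchronize();
        return 0;
    }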

  17. Block IDs and Thread IDs • Each thread uses IDs to decide what data to work on • Block ID: 1D or 2D • Thread ID: 1D, 2D, or 3D • This simplifies memory addressing when processing multidimensional data • Image processing • Solving PDEs on volumes • …
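
A hedged image-processing sketch of such multidimensional addressing; invertImage, the row-major layout and the "invert each pixel" operation are assumptions for the example:

    // Launched with 2-D blocks and a 2-D grid, e.g.
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   invertImage<<<grid, block>>>(d_image, width, height);
    __global__ void invertImage(unsigned char *image, int width, int height) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // 2-D block + thread IDs
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // map directly to a pixel

        if (col < width && row < height) {                 // guard the image border
            int idx = row * width + col;                   // row-major 1-D address
            image[idx] = 255 - image[idx];
        }
    }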

  18. Different Levels of parallelism • Thread parallelism • each thread is an independent thread of execution • Data parallelism • across threads in a block • across blocks in a kernel • Task parallelism • different blocks are independent • independent kernels
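
As one possible illustration of task parallelism, independent kernels can be launched into separate CUDA streams so that they may overlap; taskA, taskB and the sizes are invented for this sketch:

    #include <cuda_runtime.h>

    __global__ void taskA(float *a) { a[threadIdx.x] *= 2.0f; }   // independent kernel
    __global__ void taskB(float *b) { b[threadIdx.x] += 1.0f; }   // independent kernel

    int main(void) {
        float *d_a, *d_b;
        cudaMalloc(&d_a, 256 * sizeof(float));
        cudaMalloc(&d_b, 256 * sizeof(float));

        cudaStream_t s1, s2;                 // separate streams: no data dependence,
        cudaStreamCreate(&s1);               // so the two kernels may run concurrently
        cudaStreamCreate(&s2);
        taskA<<<1, 256, 0, s1>>>(d_a);
        taskB<<<1, 256, 0, s2>>>(d_b);
        cudaDeviceSynchronize();

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }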
