Introduction
Manjusha Nair M
Amrita School of Biotechnology, Amrita University
Contents
• What is Parallel Computing?
• Concepts and Terminology
• CPU vs GPU
• CUDA Programming Model
• CUDA Memory Model
• CUDA Architecture
• Threads, Blocks, Grids
I. What is Parallel Computing?
• Traditionally, software has been written for serial computation:
  • To be run on a single computer having a single Central Processing Unit (CPU)
  • A problem is broken into a discrete series of instructions
  • Instructions are executed one after another
  • Only one instruction may execute at any moment in time
• Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
  • To be run using multiple CPUs
  • A problem is broken into discrete parts that can be solved concurrently
  • Each part is further broken down into a series of instructions
  • Instructions from each part execute simultaneously on different CPUs
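As a small sketch of the decomposition described above, the C++ below sums an array twice: once serially, and once by breaking it into discrete parts whose partial sums are independent. The names `serial_sum`, `chunked_sum`, and `num_parts` are illustrative assumptions, not taken from the slides. The parts are evaluated one after another here to keep the example free of threading libraries; because no part reads another part's result, each one could be handed to a different CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Serial computation: one instruction stream walks the whole array.
long serial_sum(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0L);
}

// Decomposed computation: the array is broken into num_parts discrete pieces.
// Each partial sum depends only on its own piece, so the pieces could be
// solved concurrently; this sketch simply visits them in a loop.
long chunked_sum(const std::vector<int>& data, std::size_t num_parts) {
    assert(num_parts > 0);
    std::vector<long> partial(num_parts, 0L);
    const std::size_t chunk = (data.size() + num_parts - 1) / num_parts;  // ceiling division
    for (std::size_t p = 0; p < num_parts; ++p) {
        const std::size_t begin = std::min(p * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        for (std::size_t i = begin; i < end; ++i) partial[p] += data[i];
    }
    // Combination step: gather the independent partial results.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Both functions return the same value for any input; only the way the work is organized differs, which is the property a parallel version relies on.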
• The computational problem should be able to:
  • Be broken apart into discrete pieces of work that can be solved simultaneously
  • Execute multiple program instructions at any moment in time
  • Be solved in less time with multiple compute resources than with a single compute resource
• The compute resources might be:
  • A single computer with multiple processors
  • An arbitrary number of computers connected by a network
  • A combination of both
The Universe is Parallel: Human Brain
II. Concepts and Terminology
von Neumann Architecture
• John von Neumann first authored the general requirements for an electronic computer in his 1945 papers
• The basic, fundamental architecture remains the same even in parallel computers, just multiplied in units
Flynn's Classical Taxonomy
• SISD (Single Instruction, Single Data): non-parallel
• SIMD (Single Instruction, Multiple Data): GPUs employ this
• MISD (Multiple Instruction, Single Data): very few implementations, e.g., multiple cryptographic algorithms used to crack a key
• MIMD (Multiple Instruction, Multiple Data): computing clusters
CPU vs GPU
GPU:
• More transistors are devoted to data processing rather than data caching and flow control
• Multiprocessor cores in a GPU are SIMD cores
  • Cores execute the same instructions simultaneously
• Higher memory bandwidth: GPUs have memory controllers
• Can process several thousand threads simultaneously
  • GPUs can maintain up to 1024 threads per multiprocessor (the exact limit depends on the GPU generation)
  • GPUs switch several threads per cycle
CPU:
• Frequency growth is now limited by physical constraints and high power consumption
  • Performance is often raised by increasing the number of cores
• Multi-core CPUs use SISD or MIMD
  • Each core works independently of the others, executing various instructions for various processes
• CPUs can execute 1-2 threads per core
  • Switching from one thread to another costs a CPU hundreds of cycles
Details on GPUs
• A GPU is typically a computer card, installed into a PCI Express x16 slot
• GPGPU: General-Purpose computation on Graphics Processing Units
• GPUs have led the race for floating-point performance since the start of the 21st century
• Market leaders: NVIDIA, Intel, AMD
[Figure: NVIDIA GPUs: GeForce GTX 480, Tesla 2070, Tesla D870]
GPGPU & CUDA
• GPU designed as a numeric computing engine
  • Will not perform as well as CPUs on some tasks
  • Most applications will use both CPUs and GPUs
• CUDA
  • NVIDIA’s parallel computing architecture aimed at increasing computing performance by harnessing the power of the GPU
  • A programming model
CUDA Programming Model
• CUDA is NVIDIA’s solution for accessing the GPU
• Can be seen as an extension to C/C++
• To a CUDA programmer, the computing system consists of
  • a host, a traditional CPU
  • and one or more devices, which are massively parallel processors (with a large number of ALUs)
• Host (CPU part)
  • Single Program, Single Data
• Device (GPU part)
  • Single Program, Multiple Data
[Figure: CUDA Software Stack]
Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
• Data-parallel portions of an application are expressed as device kernels which run on many threads
• Differences between GPU and CPU threads
  • GPU threads are extremely lightweight
    • Very little creation overhead
  • GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few
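The alternation of serial host code and kernel launches can be imitated in plain C++, which may help when reasoning about how many threads a launch creates. This is only an emulation under stated assumptions: `launch` is a hypothetical helper standing in for `Kernel<<< nBlk, nTid >>>(args)`, the emulated threads run one after another rather than in parallel, and the two lambda "kernels" are made up for illustration.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for Kernel<<<nBlk, nTid>>>(args): calls the kernel body
// once per (block, thread) pair. Real CUDA runs these calls in parallel on the device.
void launch(int nBlk, int nTid, const std::function<void(int, int, int)>& kernel) {
    assert(nBlk > 0 && nTid > 0);
    for (int blk = 0; blk < nBlk; ++blk)
        for (int tid = 0; tid < nTid; ++tid)
            kernel(blk, nTid, tid);  // (block index, threads per block, thread index)
}

// Host program: serial code alternating with two data-parallel kernels.
std::vector<float> run_program(int nBlk, int nTid) {
    // Serial code (host): set up the data, one element per thread.
    std::vector<float> data(nBlk * nTid, 1.0f);

    // "KernelA": every thread doubles its own element.
    launch(nBlk, nTid, [&](int blk, int dim, int tid) { data[blk * dim + tid] *= 2.0f; });

    // Serial code (host) would run here, between the two launches.

    // "KernelB": every thread adds its own index to its own element.
    launch(nBlk, nTid, [&](int blk, int dim, int tid) {
        const int i = blk * dim + tid;
        data[i] += static_cast<float>(i);
    });
    return data;
}
```

Calling `run_program(3, 4)` runs each kernel body 12 times, once per thread, and leaves element `i` holding `2 + i`.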
CUDA Memory Model: Overview
• Global memory
  • Main means of communicating R/W data between host and device
  • Contents visible to all threads
  • Long latency access
• The Grid
  • A group of threads all running the same kernel
  • Can run multiple grids at once
• The Block
  • Grids are composed of blocks
  • Each block is a logical unit containing a number of coordinating threads and some amount of shared memory
[Figure: Host; Grid containing Block (0, 0) and Block (1, 0); each block with Shared Memory and Thread (0, 0), Thread (1, 0); each thread with Registers; Global Memory]
• Local Memory
  • Each thread has its own local storage
  • Data lifetime = thread lifetime
• Shared Memory
  • Each thread block has its own shared memory
  • Accessible only by threads within that block
  • Data lifetime = block lifetime
• Global (device) memory
  • Accessible by all threads as well as the host (CPU)
  • Data lifetime = from allocation to deallocation
• Host (CPU) memory
  • Not directly accessible by CUDA threads
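The scopes and lifetimes listed above can be mimicked with ordinary C++ variables. The sketch below is a host-side emulation, not CUDA code: `GlobalMemory` and `block_sum_kernel` are hypothetical names, the threads of a block run sequentially, and each block sums its own slice of the input as a made-up workload.

```cpp
#include <cassert>
#include <vector>

// "Global memory": allocated by the host, visible to every emulated thread,
// and alive until the host releases it.
struct GlobalMemory {
    std::vector<int> input;
    std::vector<int> block_sums;  // one result slot per block
};

// Emulates one kernel in which each block sums its slice of the input.
void block_sum_kernel(GlobalMemory& gmem, int numBlocks, int blockDim) {
    assert(numBlocks > 0 && blockDim > 0);
    assert(static_cast<int>(gmem.input.size()) == numBlocks * blockDim);
    gmem.block_sums.assign(numBlocks, 0);
    for (int blk = 0; blk < numBlocks; ++blk) {
        // "Shared memory": created when the block starts, gone when it ends,
        // and never seen by any other block.
        std::vector<int> shared(blockDim);
        for (int tid = 0; tid < blockDim; ++tid) {
            // "Local memory": a per-thread value that lives only as long as this thread.
            const int local = gmem.input[blk * blockDim + tid];
            shared[tid] = local;
        }
        // On a real device the block's threads run concurrently, so a barrier
        // would have to separate the writes above from the reads below.
        int sum = 0;
        for (int tid = 0; tid < blockDim; ++tid) sum += shared[tid];
        gmem.block_sums[blk] = sum;  // result published through global memory
    }
}
```

After the call, only `gmem` survives: the per-block `shared` buffers and per-thread `local` values have gone out of scope, mirroring the block and thread lifetimes in the list.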
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
  • All threads run the same code (SPMD)
  • Each thread has an ID that it uses to compute memory addresses and make control decisions
threadID: 0 1 2 3 4 5 6 7
    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …
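To make the SPMD idea concrete, the slide's three-line body can be wrapped in a function of `threadID` and called once per thread. This is a sequential host-side sketch, not a CUDA kernel; `kernel_body` and `run_thread_array` are hypothetical names, and `func` is assumed to square its argument because the slide does not define it.

```cpp
#include <cassert>
#include <vector>

// Assumed per-element function; the slide only calls it func.
float func(float x) { return x * x; }

// The kernel body from the slide, written as a function of threadID.
// Every thread runs this same code (SPMD) on a different element.
void kernel_body(int threadID, const std::vector<float>& input, std::vector<float>& output) {
    assert(threadID >= 0 && threadID < static_cast<int>(input.size()));
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

// Emulates an array of input.size() threads, one per element, run one after another.
std::vector<float> run_thread_array(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    for (int threadID = 0; threadID < static_cast<int>(input.size()); ++threadID)
        kernel_body(threadID, input, output);
    return output;
}
```

The loop in `run_thread_array` exists only in the emulation; on a GPU there is no loop, and the hardware supplies a different `threadID` to each concurrently running copy of the body.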
Thread Blocks: Scalable Cooperation
• Divide monolithic thread array into multiple blocks
  • Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  • Threads in different blocks cannot cooperate
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N - 1; each block has threadIDs 0 to 7 and runs the same code]
    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …
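Once the thread array is split into blocks, each thread needs an index that is unique across the whole launch. CUDA code conventionally computes it as `blockIdx.x * blockDim.x + threadIdx.x`; the sketch below emulates that arithmetic on the host. The function name `scale_in_blocks` and the scaling workload are assumptions made for illustration, and the blocks and threads run sequentially here.

```cpp
#include <cassert>
#include <vector>

// Emulated block-structured launch: numBlocks blocks of blockDim threads each.
// Each thread derives a global index from its block ID and thread ID.
std::vector<float> scale_in_blocks(const std::vector<float>& input, int blockDim, float factor) {
    assert(blockDim > 0);
    const int n = static_cast<int>(input.size());
    const int numBlocks = (n + blockDim - 1) / blockDim;  // enough blocks to cover n elements
    std::vector<float> output(input.size(), 0.0f);
    for (int blockIdx = 0; blockIdx < numBlocks; ++blockIdx) {
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            const int i = blockIdx * blockDim + threadIdx;
            if (i < n)  // threads past the end of the data do nothing
                output[i] = factor * input[i];
        }
    }
    return output;
}
```

With 20 elements and a block size of 8, three blocks (24 threads) are launched and the last four threads fail the `i < n` check, which is why the guard is needed whenever the data size is not a multiple of the block size.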
Grids, Blocks and Threads
• A grid of size 6 (3x2 blocks)
• Each block has 12 threads (4x3)
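The arithmetic behind this example can be checked with a short host-side sketch: 6 blocks of 12 threads give 72 threads in total, and each one can be given a distinct linear ID. `Dim2`, `total_threads`, and `visit_counts` are hypothetical names, and the row-major flattening used here is one common choice, assumed for illustration.

```cpp
#include <cassert>
#include <vector>

struct Dim2 { int x; int y; };

// Total threads in a launch = (blocks in the grid) * (threads per block).
int total_threads(Dim2 grid, Dim2 block) {
    assert(grid.x > 0 && grid.y > 0 && block.x > 0 && block.y > 0);
    return grid.x * grid.y * block.x * block.y;
}

// Visits every (block, thread) coordinate pair, flattens it to a linear ID,
// and records how many times each ID was produced.
std::vector<int> visit_counts(Dim2 grid, Dim2 block) {
    const int threadsPerBlock = block.x * block.y;
    std::vector<int> counts(total_threads(grid, block), 0);
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    const int blockId = by * grid.x + bx;         // 0 .. 5 for a 3x2 grid
                    const int threadInBlock = ty * block.x + tx;  // 0 .. 11 for a 4x3 block
                    ++counts[blockId * threadsPerBlock + threadInBlock];
                }
    return counts;
}
```

For `Dim2{3, 2}` and `Dim2{4, 3}`, every entry of the returned vector is 1, confirming that the flattening assigns each of the 72 threads its own ID.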
Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
  • Block ID: 1D or 2D (also 3D on devices of compute capability 2.0 and later)
  • Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  • Image processing
  • Solving PDEs on volumes
  • …
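For the image-processing case, 2D block and thread IDs map naturally onto pixel coordinates. The sketch below emulates that mapping on the host for a row-major grayscale image; `invert_image` and the inversion workload are assumptions made for illustration, and the emulated threads run sequentially.

```cpp
#include <cassert>
#include <vector>

// Emulated 2D launch over a row-major grayscale image with values 0..255.
// Each thread handles the pixel at (col, row) computed from its 2D block and thread IDs.
std::vector<int> invert_image(const std::vector<int>& pixels, int width, int height,
                              int blockW, int blockH) {
    assert(width > 0 && height > 0 && blockW > 0 && blockH > 0);
    assert(static_cast<int>(pixels.size()) == width * height);
    const int gridW = (width + blockW - 1) / blockW;    // blocks needed along x
    const int gridH = (height + blockH - 1) / blockH;   // blocks needed along y
    std::vector<int> out(pixels.size(), 0);
    for (int by = 0; by < gridH; ++by)
        for (int bx = 0; bx < gridW; ++bx)
            for (int ty = 0; ty < blockH; ++ty)
                for (int tx = 0; tx < blockW; ++tx) {
                    const int col = bx * blockW + tx;
                    const int row = by * blockH + ty;
                    if (col < width && row < height)  // skip threads outside the image
                        out[row * width + col] = 255 - pixels[row * width + col];
                }
    return out;
}
```

A 10x7 image with 4x3 blocks needs a 3x3 grid; the blocks on the right and bottom edges contain some threads that fall outside the image and are filtered out by the bounds check.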
Different Levels of Parallelism
• Thread parallelism
  • Each thread is an independent thread of execution
• Data parallelism
  • Across threads in a block
  • Across blocks in a kernel
• Task parallelism
  • Different blocks are independent
  • Independent kernels