Harnessing Massively Parallel Processors
http://www.ece.ubc.ca/~matei/
Introduction to GPU Architecture and Programming Model
Acknowledgement: some slides borrowed from presentations by Kayvon Fatahalian, Mark Harris, Samer Al-Kiswany
Plane        YVR to Paris   Speed      Passengers
Boeing 747   10.5 hours     610 mph    470
Concorde     5 hours        1350 mph   132

Which plane is better?
Same idea for GPUs
• Specialized for data-intensive, highly parallel computations (exactly what graphics hardware does well)
• More transistors allocated to processing data rather than to caching and control flow (compared to CPUs)
Outline
• Hardware: GPU Architecture Intuition
• Software: Programming Model
• Optimizations
NVIDIA (still idealized, but closer to reality), in NVIDIA terminology:
• 480 stream processors (“CUDA cores”), organized as 15 multiprocessors
• SIMT execution
NVIDIA GeForce GTX 480 (a single multiprocessor)
• A multiprocessor contains 32 CUDA “cores”
• Two groups of threads (warps) are selected each clock: the multiprocessor fetches, decodes, and executes two instruction streams in parallel
• Up to 48 warps are interleaved, totaling 1536 CUDA threads
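These per-multiprocessor figures can be confirmed at runtime with the CUDA device-query API. A minimal sketch (the API calls are real; the reported values depend on the installed device):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // On a GTX 480 this reports 15 multiprocessors, warp size 32,
    // and up to 1536 resident threads (48 warps) per multiprocessor.
    printf("Multiprocessors:    %d\n", prop.multiProcessorCount);
    printf("Warp size:          %d\n", prop.warpSize);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}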
Summary so far
Three major ideas (employed, to varying degrees, by all modern processors):
• Employ multiple processing cores, and make them simpler (embrace thread-level parallelism over ILP)
• Amortize instruction-stream processing across cores (SIMD), increasing compute capability at little extra cost
• Use multi-threading to make more efficient use of processing resources (hide latencies, fill all available resources)
Because of the high arithmetic capability of modern chips, many parallel applications (on both CPUs and GPUs) are bandwidth bound.
GPUs push these throughput-computing concepts to extreme scales, with notable differences in memory system design.
GPU Architecture
[Diagram: the host machine connects to a GPU containing N multiprocessors; each multiprocessor holds M processors with per-processor registers, a shared memory, and an instruction unit; all multiprocessors access the device’s constant, texture, and global memories.]
SIMD Architecture
Four memories:
• Device (a.k.a. global): slow (400–600 cycles access latency), large (256 MB – 1 GB)
• Shared: fast (4 cycles access latency), small (128 KB)
• Texture: read only
• Constant: read only
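How the shared and constant spaces appear in CUDA C — a minimal sketch (the kernel name, array sizes, and the 256-thread block assumption are illustrative, not from the slides):

#include <cuda_runtime.h>

__constant__ float coeff[16];   // constant memory: read-only in kernels,
                                // written by the host via cudaMemcpyToSymbol

__global__ void scale(const float *in, float *out, int n) {
    __shared__ float tile[256]; // shared memory: fast per-block scratchpad
                                // (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];              // read from slow global memory
        __syncthreads();
        out[i] = tile[threadIdx.x] * coeff[0];  // write result back to global
    }
}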
GPU Architecture – Program Flow
1. Preprocessing (on the host)
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing (on the host)

T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
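A minimal host-side sketch of steps 2–4, assuming a hypothetical kernel named process (error checking omitted for brevity):

#include <cuda_runtime.h>

__global__ void process(float *data, int n);      // hypothetical kernel (step 3)

void run(float *host_in, float *host_out, int n) {
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host_in, n * sizeof(float),
               cudaMemcpyHostToDevice);           // step 2: T_DataHtoG
    process<<<(n + 255) / 256, 256>>>(dev, n);    // step 3: T_Processing
    cudaMemcpy(host_out, dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);           // step 4: T_DataGtoH
    cudaFree(dev);
}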
Outline
• Hardware
• Software: Programming Model
• Optimizations
GPU Programming Model
Programming model: a software representation of the hardware.
GPU Programming Model
Kernel: a function executed on the grid of thread blocks.
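A minimal kernel and launch, as a sketch of the model (the vector-add example and the 256-thread block size are illustrative):

#include <cuda_runtime.h>

// A kernel: one function body executed by every thread in the grid.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last block
        c[i] = a[i] + b[i];
}

// Launch: a grid of (n + 255) / 256 blocks, each with 256 threads.
// add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);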