Harnessing Massively Parallel Processors
http://www.ece.ubc.ca/~matei/
Introduction to GPU Architecture and Programming Model

Acknowledgement: some slides borrowed from presentations by Kayvon Fatahalian, Mark Harris, and Samer Al-Kiswany
Which plane is better?

Plane        YVR to Paris   Speed      Passengers
Boeing 747   10.5 hours     610 mph    470
Concorde     5 hours        1350 mph   132
Same idea for GPUs
• Specialized for data-intensive, highly parallel computations (exactly what graphics hardware does well)
• More transistors allocated to processing data rather than to caching and control flow (compared to CPUs)
Outline
• Hardware: GPU architecture intuition
• Software: programming model, optimizations
NVIDIA (still idealized, but closer to reality)
In NVIDIA terminology:
• 480 stream processors ("CUDA cores"), organized into 15 multiprocessors
• SIMT execution
NVIDIA GeForce GTX 480 (a multiprocessor)
• A multiprocessor contains 32 CUDA cores
• Two groups of threads (warps) are selected each clock: the multiprocessor fetches, decodes, and executes two instruction streams in parallel
• Up to 48 warps are interleaved, totaling 1536 CUDA threads
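Because the 32 threads of a warp share one instruction stream (SIMT), branches that diverge within a warp serialize. A minimal sketch of this effect follows; the kernel name and the even/odd split are illustrative, not from the slides:

```cuda
__global__ void branchy(int *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // All 32 threads of a warp share one instruction stream (SIMT).
    // When lanes take different branches, the warp executes both
    // paths one after the other, masking off the inactive lanes.
    if (tid % 2 == 0)
        out[tid] = 2 * tid;   // even lanes active, odd lanes masked
    else
        out[tid] = tid + 1;   // odd lanes active, even lanes masked
}
```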
Summary so far
Three major ideas (employed by all modern processors, to varying degrees):
• Employ multiple processing cores
  • Simpler cores (embrace thread-level parallelism over ILP)
• Amortize instruction stream processing over many cores (SIMD)
  • Increases compute capability at little extra cost
• Use multi-threading to make more efficient use of processing resources (hide latencies, fill all available resources)

Because of the high arithmetic capability of modern chips, many parallel applications (on both CPUs and GPUs) are bandwidth bound.
GPUs push these throughput-computing concepts to extreme scales.
• Notable differences in memory system design
GPU Architecture
[Diagram: the host machine connects to the GPU, which contains N multiprocessors. Each multiprocessor has an instruction unit, shared memory, and M processors with per-processor registers. All multiprocessors access the device's constant, texture, and global memories.]
SIMD Architecture
Four memories:
• Device (a.k.a. global): slow (400-600 cycles access latency), large (256 MB - 1 GB)
• Shared: fast (4 cycles access latency), small (128 KB)
• Texture: read-only
• Constant: read-only
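To make the latency gap concrete, the common pattern is to stage data from slow global memory into fast shared memory, synchronize the block, and then work out of shared memory. A hedged sketch, assuming blocks of TILE threads and an input length that is a multiple of TILE (the kernel name and tile size are illustrative):

```cuda
#define TILE 256

__global__ void reverse_tile(float *d_out, const float *d_in) {
    __shared__ float tile[TILE];          // fast on-chip shared memory
    int g = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = d_in[g];          // one slow global-memory read
    __syncthreads();                      // wait until the whole block has loaded

    // Subsequent accesses hit shared memory instead of global memory:
    // each thread writes its tile reversed, reading a shared-memory slot.
    d_out[g] = tile[TILE - 1 - threadIdx.x];
}
```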
GPU Architecture – Program Flow
1. Preprocessing
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing

T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProc
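A minimal host-side sketch of these five steps, assuming a simple square() kernel over N floats (names are illustrative; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void square(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main(void) {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h = (float *)malloc(bytes);               // 1. Preprocessing: prepare input on the host
    for (int i = 0; i < N; i++) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // 2. T_DataHtoG: copy input to the device

    square<<<(N + 255) / 256, 256>>>(d, N);          // 3. T_Processing: run the kernel

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // 4. T_DataGtoH: copy results back

    printf("h[2] = %f\n", h[2]);                     // 5. Postprocessing: use the results
    cudaFree(d);
    free(h);
    return 0;
}
```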
Outline
• Hardware
• Software: programming model, optimizations
GPU Programming Model
Programming model: the software's representation of the hardware.
GPU Programming Model
Kernel: a function executed on the grid. The grid is divided into blocks of threads.
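A minimal sketch of the model: the kernel below runs once per thread, and the grid/block launch configuration maps each thread to one array element (vecAdd and the block size of 256 are assumptions for illustration, not from the slides):

```cuda
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // Each thread computes one element, identified by its position
    // within the block and the block's position within the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Launch: a grid of ceil(n / 256) blocks, 256 threads per block.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```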