Understanding GPU Architecture for Graphics Processing: A Comprehensive Overview

Graphics processors Norm Rubin – compiler architect – normanr@ati.com

Size of market • Many millions of GPUs shipped per month • The 3D market is entertainment (games) • Each new generation of GPU adds enough performance to support a new version of a game • Each time a game is released, players have to replace hardware to run the game • The game industry is larger than Hollywood.

CPU vs. GPU • Architecture: proprietary, commodity • Interfaces: mutable, locked down • Technology view (performance / function): not enough, OK, too good

How much headroom • Pixar uses 100,000 minutes of compute per minute of image • GPUs are real time, so closing the 100,000x gap takes roughly 20 doublings • Most optimistic marketing version of Moore’s law: performance doubles every 6 months • So there are about 10 years to go.
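The headroom estimate can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the slide's figures (a 100,000x gap and one doubling every 6 months); the variable names are illustrative. The exact count comes out a little under the slide's round figure of 20 doublings.

```python
import math

GAP = 100_000              # compute minutes per minute of image (from the slide)
MONTHS_PER_DOUBLING = 6    # the optimistic doubling rate quoted on the slide

# Number of doublings needed to close the gap exactly.
exact_doublings = math.log2(GAP)
years_exact = exact_doublings * MONTHS_PER_DOUBLING / 12

# The slide rounds up to 20 doublings.
years_slide = 20 * MONTHS_PER_DOUBLING / 12

print(f"doublings needed: {exact_doublings:.1f}")
print(f"years at the exact count: {years_exact:.1f}")
print(f"years at the slide's 20 doublings: {years_slide:.0f}")
```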

Application space • Problems are embarrassingly parallel • Problems are big: the screen is 1000 x 1000 and the program runs per pixel, including some pixels that are behind others, so 10 * 1000 * 1000 calls per frame * 20-60 frames per second • The same program runs over and over, so GPUs are SIMD machines
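To make the problem size concrete, the slide's numbers can be multiplied out. The sketch below assumes the factor of 10 is the average number of times each pixel is shaded per frame; the function name is illustrative.

```python
WIDTH, HEIGHT = 1000, 1000   # screen size from the slide
OVERDRAW = 10                # slide's factor for pixels drawn behind others

def shader_calls_per_second(frames_per_second):
    """Pixel program invocations per second at a given frame rate."""
    return OVERDRAW * WIDTH * HEIGHT * frames_per_second

low = shader_calls_per_second(20)
high = shader_calls_per_second(60)
print(f"{low:,} to {high:,} pixel program calls per second")
```

At these rates the pixel program runs hundreds of millions of times per second, which is why running it on many lanes at once matters.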

SIMD • There are many units executing in parallel • These are in lock-step, executing the same instruction on different pixels/vertices at the same time • Dynamic flow control can cause inefficiencies in such an architecture since different pixels/vertices can take different code paths • Dynamic branching is not always a performance win • For an if…then…else, the hardware needs to execute both sides, turning processors on and off.
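The cost of a divergent branch can be sketched in software. This is an illustrative model, not a description of any particular chip: a list stands in for the lock-step lanes, a boolean mask stands in for turning processors on and off, and `simd_if_else` is a hypothetical helper name.

```python
def simd_if_else(lanes, cond, then_op, else_op):
    """Run an if/else over lock-step lanes; return results and passes issued."""
    mask = [cond(x) for x in lanes]
    out = list(lanes)
    passes = 0
    if any(mask):
        # The "then" side is issued once; lanes with mask off sit idle.
        passes += 1
        out = [then_op(x) if m else o for x, m, o in zip(lanes, mask, out)]
    if not all(mask):
        # The "else" side is issued once; lanes with mask on sit idle.
        passes += 1
        out = [else_op(x) if not m else o for x, m, o in zip(lanes, mask, out)]
    return out, passes

is_even = lambda x: x % 2 == 0
divergent, divergent_passes = simd_if_else([1, 2, 3, 4], is_even, lambda x: x * 10, lambda x: -x)
coherent, coherent_passes = simd_if_else([2, 4, 6, 8], is_even, lambda x: x * 10, lambda x: -x)
print(divergent, divergent_passes)
print(coherent, coherent_passes)
```

When all lanes agree, only one side is issued; when they disagree, both sides are issued and part of the machine idles during each pass.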

Application space • Many values are coherent: values in neighboring pixels are close • Compute coherent variables at selected points and use interpolation to find the intermediate values • Today the programmer specifies which variables are coherent by splitting programs in two.
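As a sketch of the interpolation step, the example below blends a value computed at the three corners of a triangle. It assumes barycentric weights, which the slide does not specify; `interpolate` is an illustrative name.

```python
def interpolate(weights, vertex_values):
    """Blend three per-vertex values with barycentric weights that sum to 1."""
    return sum(w * v for w, v in zip(weights, vertex_values))

# A quantity computed once per vertex (for example, a brightness).
vertex_values = (0.0, 1.0, 0.5)

at_first_vertex = interpolate((1.0, 0.0, 0.0), vertex_values)
inside_triangle = interpolate((0.25, 0.25, 0.5), vertex_values)
print(at_first_vertex, inside_triangle)
```

The expensive computation happens only at the selected points; every pixel in between gets its value from the cheap weighted sum.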

Application space • Common subproblem is texture filtering • Evaluate some array of memory around a stencil and combine • Provide a small fixed set of stencil patterns in hardware • You could think of this as slightly smart memory • Hardware support for 1D to 3D arrays and several filtering functions • Exact stencil patterns and combining operations are proprietary (some look better than others)
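One widely known filtering function is bilinear filtering, which combines a 2x2 stencil of texels. The sketch below is a plain software version for illustration only; the actual hardware stencils are proprietary, and the clamp-at-edge behavior and the name `bilinear_sample` are assumptions.

```python
import math

def bilinear_sample(texture, u, v):
    """Sample a 2D texture (list of rows) at texel-space coordinates (u, v).

    Assumes 0 <= u <= width - 1 and 0 <= v <= height - 1.
    """
    height, width = len(texture), len(texture[0])
    x0, y0 = math.floor(u), math.floor(v)
    # Clamp the neighbor texels at the texture edge.
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # Blend horizontally on both rows, then blend the rows vertically.
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

texture = [[0.0, 1.0],
           [2.0, 3.0]]
center = bilinear_sample(texture, 0.5, 0.5)
corner = bilinear_sample(texture, 1.0, 1.0)
print(center, corner)
```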

Application space • Little communication between processing elements • Approximate spatial derivative by 2x2 difference operator • Forces all machine designs to work on multiples of four pixels
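The 2x2 difference operator can be sketched as follows. Each group of four neighboring pixels shares its values to estimate horizontal and vertical rates of change. The exact differencing used by a given chip is not stated in the slides, so this is one plausible form, and `quad_derivatives` is an illustrative name.

```python
def quad_derivatives(quad):
    """Approximate derivatives inside one 2x2 pixel quad.

    quad is [[top_left, top_right], [bottom_left, bottom_right]].
    """
    (tl, tr), (bl, br) = quad
    ddx = (tr - tl, br - bl)   # horizontal difference on the top and bottom rows
    ddy = (bl - tl, br - tr)   # vertical difference on the left and right columns
    return ddx, ddy

ddx, ddy = quad_derivatives([[1.0, 3.0],
                             [2.0, 6.0]])
print(ddx, ddy)
```

Because every pixel needs its three neighbors to form these differences, work is naturally scheduled in groups of four pixels.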

Application space • Throughput is important • Use threading to cover latency • The chips can support hundreds of threads, and can switch from thread to thread every cycle • No thread switch overhead • Hardware scheduler and thread system • Compiler knows about threads and splits resources over threads • Caches are very different – can only cover spatial locality

Programming model • Performance is much less than users want • Min of 100,000 times less • Most developers write each program at least four times: Xbox, PlayStation, ATI top machine, NVIDIA top machine • Programs are in two parts: vertex and pixel shaders.

Programming model 2 • Programs could be written in a high-level (C-like) language, HLSL/OGL2 • Or in virtual assembly language (DirectX, …) • Almost one dialect per chip • The languages are virtual, but the resources are physical • Developers review virtual machine listings for performance • Developers ship virtual assembly language.

Programming model 3 • At game startup, virtual assembly language is JIT compiled to real machine language • Drastic change in resource requirements • Somewhat hard to debug • Hard to identify performance bottlenecks • Even though applications could build code on the fly, developers pretest everything: they want the most performance to get the best-looking image • They can only approximate what they really want.

Programmable Pipeline (diagram) • Vertex Data (model space) -> Fixed Function Transform and Lighting, or Vertex Shader -> Clipping and Viewport Mapping (Geometry Stage) -> Texture Stages, or Pixel Shader -> Fog, Alpha, Stencil, Depth Testing (Rasterizer Stage)

Vertex Processing Flow (diagram) • Per-vertex data from the triangle mesh: position, normal, texture coordinates, etc. • Constants: view matrix, projection matrix, skin/bone matrices, light positions, etc. • Vertex shader engine: vertex shader instructions, temporary registers • Outputs: position, “texture” coordinates, color(s)

Vertex Shader • Input: • Program specifies vertex data • Position • Normal • Vertex color • Texture coordinate(s) • … • Data is sent to the graphics card and processed by the vertex shader • Output • Vertex shader computes output quantities • Position • Vertex color: diffuse and specular • Texture coordinate(s) • Sent to rasterizer via interpolators

Pixel Processing Flow (diagram) • Interpolated values: “texture” coordinates, color(s) • Constants: light colors, ambient lighting colors, etc. • Textures • Pixel shader engine: pixel shader instructions, temporary registers • Outputs: color, multi-render target

Program sizes • Most programs are very small • 100 virtual instructions would be a large program • Basic data type is a four element vector of floats • Integer data types are not yet available • Dynamic branching is new • Small amount of nesting allowed

Polygons • Polygon Budget • Ruby: 75,000 • Optico: 50,000 • Ninja: 25,000 • Environment: 100,000 • Props: 50,000 • Lighting Limits • 3 Dynamic lights per shot (1 shadow casting) • Lightmaps used for set • Animation Limits • 35 total blend shapes • 5 simultaneous blend shapes • 4 weighted bones per vertex • Number of on-screen characters limited to 4 at once

Shader Breakdown • Depth of Field • Hair • Skin

Depth Of Field

Shader Breakdown • Glows • Motion Blur • Reflections

Glows

Motion Blur

Reflections

Hardware view • X1900 • Xbox 360 • Both machines are current

X1900

Quad Pixel Shader Core (diagram) • Pixel shader processors • Texture address units 1 to 4: 1 texture address instruction per unit per clock cycle • Pixel shader processor, per clock cycle: 1 vec3 ADD + input modifier, 1 scalar ADD + input modifier, 1 vec3 ADD/MUL/MADD, 1 scalar ADD/MUL/MADD, 1 flow control instruction

Vertex Engine • Upgraded to support SM3.0 • Dynamic flow control • 1,024 instructions (practically unlimited with flow control) • More temporary registers • 8 vertex shader processors • Each can handle 2 shader instructions per clock • 10 billion instructions per second
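The throughput figure on this slide can be cross-checked against the processor count. The slide does not give a clock speed, so the sketch below derives the clock that the quoted numbers imply rather than assuming one; the variable names are illustrative.

```python
PROCESSORS = 8            # vertex shader processors (from the slide)
INSTR_PER_CLOCK = 2       # shader instructions per processor per clock (from the slide)
QUOTED_RATE = 10e9        # instructions per second (from the slide)

# Clock frequency implied by the three figures above.
implied_clock_hz = QUOTED_RATE / (PROCESSORS * INSTR_PER_CLOCK)
print(f"implied clock: {implied_clock_hz / 1e6:.0f} MHz")
```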

Ring Bus Memory Controller • Supports today’s fastest graphics memory devices • GDDR3, 48+ GB/sec • GDDR4, The future • 512-bit Ring Bus • Simplifies layout and enables extreme memory clock scaling • New Cache Design • Fully Associative for more optimal performance • Improved Hyper Z • Better compression and hidden surface removal • Programmable Arbitration Logic • Maximizes memory efficiency • Can be upgraded via software

Memory Channels - 4x Improvement in Random Access over X850 • Radeon X1900: 8 x 32-bit channels, 8 banks per DRAM • Radeon X850: 4 x 64-bit channels, 4 banks per DRAM

Cache Design • Fully Associative Caches • Cache lines can map to any location in external memory • Earlier designs used Direct Mapped & N-Way Associative Caches • Could only access limited blocks of external memory • Texture, Color, Z & Stencil caches are all now fully associative • Reduces memory bandwidth requirements • Minimizes cache contention stalls • Optimized game performance • Gains up to 25% clock for clock in fill/bandwidth bound cases • (Diagrams: graphics memory and cache, shown for a direct mapped cache and for a fully associative cache)
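The contention that a direct mapped cache suffers can be shown with a toy simulation. This is a sketch under stated assumptions: addresses are whole cache-line numbers, the direct mapped slot is the address modulo the number of lines, and the fully associative cache uses least-recently-used replacement, which the slides do not specify. Both function names are illustrative.

```python
from collections import OrderedDict

def direct_mapped_misses(addresses, num_lines):
    """Count misses when each address can live in exactly one slot."""
    lines = [None] * num_lines
    misses = 0
    for addr in addresses:
        slot = addr % num_lines
        if lines[slot] != addr:
            misses += 1
            lines[slot] = addr
    return misses

def fully_associative_misses(addresses, num_lines):
    """Count misses when any address can live in any slot (LRU eviction)."""
    cache = OrderedDict()
    misses = 0
    for addr in addresses:
        if addr in cache:
            cache.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[addr] = True
    return misses

# Two lines that collide in a 4-line direct mapped cache.
pattern = [0, 4, 0, 4, 0, 4]
dm = direct_mapped_misses(pattern, 4)
fa = fully_associative_misses(pattern, 4)
print(dm, fa)
```

In this access pattern the two addresses keep evicting each other in the direct mapped cache, while the fully associative cache misses only on first touch.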

Xbox 360 • 3.2GHz Custom IBM Central Processor • Three CPU Cores • Two Threads Per core • VMX Unit Per Core • 128 VMX Registers Per Thread • 1MB L2 Cache (Lockable by Graphics Processor) • 500MHz Custom ATI Graphics Processor • Unified Shader Core • 48 ALUs for Vertex or Pixel Shader processing • 16 Filtered & 16 Unfiltered Texture samples per clock • 10MB eDRAM Framebuffer • 512MB System RAM • Unified Memory Architecture (UMA) • 128-bit interface • 700MHz GDDR3 RAM
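The memory figures at the end of this slide determine a peak bandwidth for system RAM. The sketch below assumes two transfers per clock, as is standard for double-data-rate memory such as GDDR3; the function name is illustrative, and the result is a theoretical peak, not a measured rate.

```python
def peak_bandwidth_gb_per_s(bus_bits, clock_mhz, transfers_per_clock=2):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9

# 128-bit interface and 700MHz GDDR3, both from the slide.
system_ram_bw = peak_bandwidth_gb_per_s(bus_bits=128, clock_mhz=700)
print(f"peak system RAM bandwidth: {system_ram_bw:.1f} GB/s")
```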

Architecture (block diagram) • Memory Hub • Command Processor • Pipe Comm • Sequencer • Vertex Grouper, Vertex Cache • Primitive Assembly, Scan Converter • 2 x Shader Interp • 3 x Shader Pipe (x16) • 4 x Texture Pipe, Texture Cache • Z/Alpha/Stencil Processors with 10MB DRAM • 256 GB/sec (bandwidth label)

Adaptive Shader Array • Unified shader architecture • One processor type • Dynamic load balancing • Pixel and vertex processing where and when they’re needed • 48 shaders • 120 billion operations per second
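The quoted operation rate can be related to the shader count and the 500MHz graphics clock given on the Xbox slide. The slides do not say what counts as one operation, so the sketch below only derives the per-shader, per-clock figure that the numbers imply; the variable names are illustrative.

```python
SHADERS = 48            # from the slide
CLOCK_HZ = 500e6        # 500MHz graphics processor clock from the Xbox slide
QUOTED_OPS = 120e9      # operations per second (from the slide)

# Operations each shader must complete per clock to reach the quoted rate.
ops_per_shader_per_clock = QUOTED_OPS / (SHADERS * CLOCK_HZ)
print(f"implied operations per shader per clock: {ops_per_shader_per_clock:.0f}")
```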

Some interesting problems • Coherence (branch prediction?) • What are the right instructions? • Can you do non-graphics applications? • Programming language • Threading by compiler • Offline compile?

Implications for programming languages • GPU: can convince people to use a new language if you can prove it is faster, even if it means lots of changes • Desktop CPU: have to prove it can meet some other (non-performance/function) need • Top-of-the-line GPU prices are going up while top-of-the-line desktop CPU prices are going down, so there is lots of chance to do cool design • Less need to be backward compatible.

More info • http://www.ati.com/developer/index.html
