Background, Outline

Many-Core Programming with GRAMPS & “Real Time REYES”
Jeremy Sugerman, Kayvon Fatahalian
Stanford University
June 12, 2008

Background, Outline
• Stanford Graphics / Architecture Research
• CPU, GPU trends
• And collision?
• Two research areas:
• HW/SW Interface, Programming Model
• Future Graphics API

Problem Statement
• Drive efficient development and execution in many-/multi-core systems.
• Support homogeneous, heterogeneous cores.
• Inform future hardware
Status Quo:
• GPU Pipeline (Good for GL, otherwise hard)
• CPU (No guidance, fast is hard)

GRAMPS
• Software defined graphs
• Producer-consumer, data-parallelism
• Initial focus on rendering
[Figure: two example GRAMPS graphs. Legend: Thread Stage, Shader Stage, Fixed-func Stage, Queue, Stage Output.]
Rasterization Pipeline: stages Rasterize, Shade, FB Blend; queues Input Fragment Queue, Output Fragment Queue; output Frame Buffer.
Ray Tracing Pipeline: stages Camera, Intersect, Shade, FB Blend; queues Ray Queue, Ray Hit Queue, Fragment Queue; output Frame Buffer.

As a GPU Evolution
• Not (too) radical for ‘graphics’
  • Like fixed → programmable shading
  • Pipeline undergoing massive shake up
  • Diversity of new parameters and use cases
• Bigger picture than ‘graphics’
  • Rendering is more than GL/D3D
  • Compute is more than rendering
  • Larrabee has no innate pipeline

As a Compute Evolution
• Sounds like streaming: Execution graphs, kernels, data-parallelism
• Streaming: “squeeze out every FLOP”
  • Goals: bulk transfer, arithmetic intensity
  • Intensive static analysis, custom chips (mostly)
  • Bounded space, data access, execution time
• GRAMPS: “interesting apps are irregular”
  • Goals: Dynamic, data-dependent code
  • Aggregate work at run-time
  • Heterogeneous commodity platforms
  • Naturally supports streaming when applicable

GRAMPS’ Role
• A ‘graphics pipeline’ is now an app!
• GRAMPS models parallel state machines.
• Compared to status quo:
  • More flexible than a GPU pipeline
  • More guidance than bare metal
  • Portability in between
  • Not domain specific

GRAMPS Interfaces
• Host/Setup: Create execution graph
• Thread: Stateful, singleton
• Shader: Data-parallel, auto-instanced
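The three interfaces above can be pictured with a toy, single-threaded sketch. This is not the GRAMPS API: the names `Queue`, `ThreadStage`, `ShaderStage`, and `run_graph` are hypothetical, and the sequential loop only stands in for what a real runtime would schedule across cores.

```python
from collections import deque


class Queue:
    """A named FIFO connecting two stages (hypothetical helper, not a GRAMPS type)."""
    def __init__(self, name):
        self.name = name
        self.items = deque()


class ThreadStage:
    """Stateful singleton: one instance that carries state across all of its inputs."""
    def __init__(self, name, fn, state):
        self.name, self.fn, self.state = name, fn, state

    def run(self, in_q, out_q):
        while in_q.items:
            result = self.fn(self.state, in_q.items.popleft())
            if out_q is not None and result is not None:
                out_q.items.append(result)


class ShaderStage:
    """Data-parallel: a stateless kernel applied independently to each element."""
    def __init__(self, name, kernel):
        self.name, self.kernel = name, kernel

    def run(self, in_q, out_q):
        # A real runtime would auto-instance the kernel; here we simply loop.
        while in_q.items:
            out_q.items.append(self.kernel(in_q.items.popleft()))


def run_graph(stages):
    """Host/setup side: run each (stage, input queue, output queue) in order."""
    for stage, in_q, out_q in stages:
        stage.run(in_q, out_q)


def shade(fragment):
    x, y, value = fragment
    return (x, y, value * 2)  # placeholder "shading": double the value


def blend(frame_buffer, fragment):
    x, y, color = fragment
    frame_buffer[(x, y)] = frame_buffer.get((x, y), 0) + color  # additive blend


def demo():
    in_q = Queue("Input Fragment Queue")
    out_q = Queue("Output Fragment Queue")
    frame_buffer = {}
    in_q.items.extend([(0, 0, 1), (0, 0, 2), (1, 0, 3)])
    run_graph([
        (ShaderStage("Shade", shade), in_q, out_q),
        (ThreadStage("FB Blend", blend, frame_buffer), out_q, None),
    ])
    return frame_buffer
```

Calling `demo()` pushes three fragments through a Shade shader stage and an FB Blend thread stage. The frame buffer is the thread stage's private state, which is why that stage is a singleton while the shader kernel can be instanced freely.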

What We’ve Built (System)

GRAMPS Scheduler
• Tiered Scheduler
• ‘Fat’ cores: per-thread, per-core
• ‘Micro’ cores: shared hw scheduler
• Top level: tier N

What We’ve Built (Apps)
[Figure: two GRAMPS graphs. Legend: Thread Stage, Shader Stage, Fixed-func, Queue, Stage Output, Push Output.]
Direct3D Pipeline (with Ray-tracing Extension): stages IA 1 … IA N, VS 1 … VS N, Rast, PS, OM, RO; queues Input Vertex Queue 1 … N, Primitive Queue 1 … N, Primitive Queue, Fragment Queue, Sample Queue Set; buffers Vertex Buffers, Frame Buffer.
Ray-tracing Extension: stages Trace, PS2; queues Ray Queue, Ray Hit Queue.
Ray-tracing Pipeline: stages Tiler, Sampler, Camera, Intersect, Shade, FB Blend; queues Tile Queue, Sample Queue, Ray Queue, Ray Hit Queue, Fragment Queue; output Frame Buffer.

Initial Results
• Queues are small, utilization is good

GRAMPS Visualization

GRAMPS Portability
• Portability really means performance.
• Less portable than GL/D3D
  • GRAMPS graph is hardware sensitive
• More portable than bare metal
  • Enforces modularity
  • Best case, just works
  • Worst case, saves boilerplate

High-level Challenges
• Is GRAMPS a suitable GPU evolution?
  • Enable pipeline competitive with bare metal?
  • Enable innovation: advanced / alternative methods?
• Is GRAMPS a good parallel compute model?
  • Map well to hardware, hardware trends?
  • Support important apps?
  • Concepts influence developers?

What’s Next for GRAMPS?
• Implementation: scheduling, simulation details
• Model:
  • Graph modification (state change)
  • Blocking calls (join)
  • Intra/inter-stage synchronization primitives
  • Data sharing / ref-counting
• Workloads: REYES, physics, others?
• Develop new graphics pipelines…

“Real-Time REYES”

Just Build It
Build a real-time REYES pipeline...
… that is tightly integrated with ray tracing for global effects.

What does real-time REYES mean? (to us)
• Smooth surfaces via adaptive tessellation
• Everything is a displaced subdivision surface
• Shade on surface, prior to rasterization
• Stochastic rasterization for motion blur and DOF
• Order-independent transparency

OpenGL/Direct3D: Tessellate (xbox) → Vertex Shade → Rasterize → Early Z → Frag Shade → Z Test → Blend/Resolve
REYES: Split → Dice → Displace → Early Z → Shade → Rasterize → Z Test → Blend/Resolve

REYES Tessellation
Split primitive into smaller primitives until a “GOOD” grid can be created.

Grids
Regular parametric sampling of primitive surface (like Xbox 360).
Compact representation for many adjacent polygons.
Grids provide SIMD efficiency and bulk processing benefits.
GOOD GRID =
- Max polygon area < 1 pixel
- All polys about the same size
- Bounded # polys per grid
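The split-until-good-grid loop can be sketched as follows, under strong simplifying assumptions: each primitive is a flat, axis-aligned rectangle measured in screen pixels, so uniform dicing automatically yields equal-size quads. The bound `MAX_POLYS_PER_GRID = 256` and the function names are illustrative assumptions, not values or names from the slides.

```python
import math

MAX_POLYS_PER_GRID = 256  # assumed bound; the slides do not give a number


def try_dice(width, height, max_polys=MAX_POLYS_PER_GRID):
    """Return a grid resolution (nu, nv) whose quads are under 1 pixel in area,
    or None if that would exceed the per-grid polygon bound."""
    # floor(w) + 1 > w, so each quad is less than 1 pixel wide and tall.
    nu = math.floor(width) + 1
    nv = math.floor(height) + 1
    if nu * nv > max_polys:
        return None  # not a GOOD grid: the primitive must be split first
    return nu, nv


def split_and_dice(width, height, max_polys=MAX_POLYS_PER_GRID):
    """Split a primitive until every piece can be diced into a GOOD grid.

    Returns a list of (piece_width, piece_height, (nu, nv)) tuples.
    """
    grids = []
    stack = [(width, height)]
    while stack:
        w, h = stack.pop()
        grid = try_dice(w, h, max_polys)
        if grid is not None:
            grids.append((w, h, grid))
        elif w >= h:
            stack.extend([(w / 2, h), (w / 2, h)])  # split the longer side
        else:
            stack.extend([(w, h / 2), (w, h / 2)])
    return grids
```

For a 40 × 10 pixel rectangle, one split along the long side is enough under these assumptions. A real implementation works on curved, displaced surfaces, where the "all polys about the same size" test needs an explicit check and where the data-dependent split count is what makes this stage irregular.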

REYES: Split → Dice → Displace → Early Z → Shade → Rast/Crack Fix → Z Test → Blend/Resolve
OpenGL/Direct3D: Tessellate (xbox) → Vertex Shade → Rast → Early Z → Frag Shade → Z Test → Blend/Resolve

What does real-time REYES mean? (to us)
• Smooth surfaces via adaptive tessellation
  • Splitting is irregular (and serial)
  • Crack fixing
• Shade on surface, prior to rasterization
  • We feel confident about this
  • But most “work” done before moving to raster space… hmm
• Stochastic rasterization for motion blur and DOF
  • Many tiny polygons → parallel rasterization
  • SIMD tricky
• Order-independent transparency
  • Not unique to REYES
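As a rough illustration of stochastic rasterization for motion blur only (depth of field is not covered), the sketch below gives every sample a random position inside the pixel and a random time in the shutter interval, then tests it against the triangle at that time. Linear vertex motion, the fixed seed, and all function names are assumptions made for this example; this is not the pipeline described in the talk.

```python
import random


def edge(a, b, p):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def inside(tri, p):
    """Point-in-triangle test that accepts either winding order."""
    a, b, c = tri
    e0, e1, e2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
    return (e0 >= 0 and e1 >= 0 and e2 >= 0) or (e0 <= 0 and e1 <= 0 and e2 <= 0)


def triangle_at(tri_open, tri_close, t):
    """Linearly interpolate vertex positions between shutter open and close."""
    return [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(tri_open, tri_close)]


def pixel_coverage(px, py, tri_open, tri_close, samples=256, rng=None):
    """Fraction of (position, time) samples in pixel (px, py) that hit the triangle."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(samples):
        t = rng.random()                             # random time in [0, 1)
        p = (px + rng.random(), py + rng.random())   # random point in the pixel
        if inside(triangle_at(tri_open, tri_close, t), p):
            hits += 1
    return hits / samples
```

A static triangle that covers the pixel yields coverage 1.0, while one that sweeps across the pixel during the shutter interval yields a fractional value, which is the blur. Each sample sees a different triangle position, so neighboring samples diverge, which is one way to read the "SIMD tricky" remark.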

Shading in a Hybrid System
• Evaluate displacement (due to REYES or on demand for ray tracing)
• Shade grids
• Shade ray hits
• Looking forward… shade quads too?
One shading system or two or three?

This Project is Really About
• Re-architecting REYES pipeline for real-time performance (for throughput architectures like LRB)
• Hybrid rendering: study interoperability of advanced techniques (REYES + ray tracing + maybe Direct3D)
• Hybrid shading system
• Understand workload balance
• Hybrid pipeline interface: real-time, retained mode
• Pursuit of more flexible, advanced graphics pipelines

Questions?
